# Exercise: Complexity and Style

In this exercise we will first replicate a simplified form of Mosteller & Wallace's famous stylometric analysis of the disputed Federalist Papers. We will then study complexity and style in speeches by US presidents.

In [None]:
%pip install py-readability-metrics
%pip install lexical-diversity



In [1]:
import nltk
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
nltk.download('punkt')
from nltk import word_tokenize
from tqdm import tqdm
import statsmodels.formula.api as smf
from lexical_diversity import lex_div as ld
from readability import Readability


[nltk_data] Downloading package punkt to /Users/kzc744/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1: Load Mosteller & Wallace data

Load the text of all Federalist Papers, available as 'federalist.csv' on Absalon.

Paper no. 58 is attributed to Madison in the data, but Mosteller and Wallace consider this paper to have disputed authorship. Set the author of no. 58 to "HAMILTON OR MADISON".

In [2]:
# loading data
df = pd.read_csv("federalist.csv")

In [3]:
# row label 57 holds paper_id 58; .loc sets the value without chained assignment
df.loc[57, 'author'] = "HAMILTON OR MADISON"


In [4]:
df[57:58]

Unnamed: 0,author,text,date,title,paper_id,venue
57,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,,Objection That The Number of Members Will Not ...,58,


## 2: Stylometric feature engineering

To help ourselves a bit, we will lean on Mosteller & Wallace's finding that Madison tended to use the word "whilst", while Hamilton tended to use the word "while" in similar contexts. Call the number of uses of "whilst" in a text $wh_1$ and the number of uses of "while" $wh_2$. For each text, calculate the metric

$$
whfrac = \log\left( \frac{wh_1 + 1}{wh_2 + 1} \right)
$$

This is a so-called regularized log odds ratio. The +1s in the numerator and denominator are regularization terms, arbitrarily set to 1 here. You can optionally verify that $whfrac$ follows a nice symmetrical distribution.
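As a small worked example of the formula above, the sketch below evaluates $whfrac$ for a few hand-picked count pairs. The helper name `whfrac` and its `reg` parameter are illustrative and not part of the exercise code. The counts are made up to show the behaviour: equal counts give zero, and swapping the two counts flips the sign.

```python
import math

def whfrac(wh1: int, wh2: int, reg: float = 1.0) -> float:
    """Regularized log odds ratio of 'whilst' (wh1) vs. 'while' (wh2) counts.

    `reg` is the regularization term (1 in the exercise text).
    """
    return math.log((wh1 + reg) / (wh2 + reg))

# Hand-picked (wh1, wh2) pairs, for illustration only
examples = [(0, 0), (1, 0), (2, 0), (0, 1)]
for wh1, wh2 in examples:
    print(f"wh1={wh1}, wh2={wh2} -> whfrac={whfrac(wh1, wh2):.6f}")
```

Without the regularization term, a text with no "while" would lead to a division by zero, and a text with no "whilst" to the logarithm of zero.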

In [6]:
# count occurrences of whilst/while in each text
df['wh1'] = df.text.apply(lambda x: str(x).split().count("whilst"))
df['wh2'] = df.text.apply(lambda x: str(x).split().count("while"))
df['whfrac'] = np.log((df['wh1']+1)/(df['wh2']+1))


In [7]:
# inspect wh1, wh2 and whfrac
df.groupby('author')[['wh1','wh2','whfrac']].mean()

Unnamed: 0_level_0,wh1,wh2,whfrac
author,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
HAMILTON,0.019608,0.647059,-0.330341
HAMILTON AND MADISON,0.333333,0.0,0.231049
HAMILTON OR MADISON,0.5,0.0,0.3226
JAY,0.0,0.4,-0.277259
MADISON,0.714286,0.0,0.454008


It seems to have worked! On average, Madison indeed writes 'whilst' a lot, whilst/while Hamilton prefers 'while'.
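One caveat about the counts: splitting on whitespace only matches the exact lowercase token, so "While" at the start of a sentence or "whilst," followed by a comma is not counted. A possible alternative, sketched below with a hypothetical helper `count_word`, is a case-insensitive whole-word regular expression. The sample sentence is made up for illustration, and switching to this counter would change the numbers reported in this notebook.

```python
import re

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# Made-up sentence, for illustration only
sample = "While he spoke, others waited; whilst, meanwhile, a third left a while ago."

# Whitespace splitting misses "While" (capitalized) and "whilst," (punctuation)
print(sample.split().count("while"), sample.split().count("whilst"))

# The regex counts both, and does not count "meanwhile" as "while"
print(count_word(sample, "while"), count_word(sample, "whilst"))
```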

## 3: Testing predictiveness

The disputed papers are papers 49 to 58, 62, and 63. Create separate data frames for the disputed and undisputed papers.

Among the undisputed papers, create a dummy variable indicating whether the text was written by Madison vs. anyone else.

Fit a logistic regression model with this new indicator as the dependent variable and $whfrac$ as the independent variable. What does the model tell you about the predictiveness of the use of 'whilst' vs. 'while'?

In [8]:
# disputed papers (.copy() so that adding columns later does not raise SettingWithCopyWarning)
df_disp = df[df['author']=="HAMILTON OR MADISON"].copy()
# undisputed papers
df_no_disp = df[df['author']!="HAMILTON OR MADISON"].copy()

In [9]:
# dummy: 1 if Madison is the author, 0 otherwise
df_no_disp['mad_dummy'] = (df_no_disp['author']=='MADISON').astype(int)


In [10]:
X = df_no_disp.whfrac.values.reshape(-1,1)
y = df_no_disp.mad_dummy.values

clf = LogisticRegression(random_state=0).fit(X, y)

print("The coefficient of whfrac is: ", clf.coef_[0][0])
print("The accuracy of the model is: ", clf.score(X, y))

The coefficient of whfrac is:  2.435144121959235
The accuracy of the model is:  0.9041095890410958


As we saw previously, using "*whilst*" more often than "*while*" gives a *whfrac* above zero. The model finds a strong positive relation between *whfrac* and the indicator for Madison as author (a coefficient of about 2.44), and it classifies about 90% of the undisputed papers correctly. Note that this accuracy is measured on the same papers the model was fitted on, so it may overstate how well the model predicts new texts. Since we know that Madison indeed tends to use "*whilst*" a lot, this tells us that the stylometric feature we have created is useful for predicting authorship, and thus for shedding light on the otherwise latent variable of disputed authorship!
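To make the size of the coefficient more concrete, it can be converted to an odds ratio. The sketch below uses only the coefficient printed above. Because *whfrac* is a log ratio, doubling $(wh_1 + 1)/(wh_2 + 1)$ adds $\log 2$ to *whfrac*. Keep in mind that scikit-learn's `LogisticRegression` applies L2 regularization by default, so the coefficient is shrunk somewhat compared with an unpenalized fit.

```python
import math

coef = 2.435144121959235  # coefficient of whfrac, copied from the output above

# A one-unit increase in whfrac multiplies the odds of Madison authorship by exp(coef)
odds_ratio_per_unit = math.exp(coef)

# Doubling (wh1 + 1) / (wh2 + 1) adds log(2) to whfrac
odds_ratio_per_doubling = math.exp(coef * math.log(2))

print(f"odds ratio per unit of whfrac: {odds_ratio_per_unit:.2f}")
print(f"odds ratio per doubling of the count ratio: {odds_ratio_per_doubling:.2f}")
```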

## 4: Predicting authorship

Using the logistic regression model, calculate the predicted probability of Madison authorship among the disputed papers using the $whfrac$ indicator. What do the predictions indicate about the author of the disputed Federalist Papers? How could this stylometric analysis be improved?

In [11]:
X = df_disp.whfrac.values.reshape(-1,1)

# assign() returns a new DataFrame, which avoids SettingWithCopyWarning
df_disp = df_disp.assign(prob_mad=clf.predict_proba(X)[:,1])
df_disp


Unnamed: 0,author,text,date,title,paper_id,venue,wh1,wh2,whfrac,prob_mad
48,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 5, 1788",Method of Guarding Against the Encroachments o...,49,From the New York Packet,1,0,0.693147,0.544638
49,HAMILTON OR MADISON,To the People of the State of New York:\n\nIT ...,"Tuesday, February 5, 1788",Periodical Appeals to the People Considered,50,From the New York Packet,0,0,0.0,0.181104
50,HAMILTON OR MADISON,To the People of the State of New York:\n\nTO ...,"Friday, February 8, 1788",The Structure of the Government Must Furnish t...,51,From the New York Packet,1,0,0.693147,0.544638
51,HAMILTON OR MADISON,To the People of the State of New York:\n\nFRO...,"Friday, February 8, 1788",The House of Representatives,52,From the New York Packet,0,0,0.0,0.181104
52,HAMILTON OR MADISON,To the People of the State of New York:\n\nI S...,"Tuesday, February 12, 1788",The Same Subject Continued (The House of Repre...,53,From the New York Packet,1,0,0.693147,0.544638
53,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 12, 1788",The Apportionment of Members Among the States,54,From the New York Packet,0,0,0.0,0.181104
54,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Friday, February 15, 1788",The Total Number of the House of Representatives,55,From the New York Packet,0,0,0.0,0.181104
55,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Same Subject Continued (The Total Number o...,56,From the New York Packet,0,0,0.0,0.181104
56,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Alleged Tendency of the New Plan to Elevat...,57,From the New York Packet,2,0,1.098612,0.762493
57,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,,Objection That The Number of Members Will Not ...,58,,0,0,0.0,0.181104


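From the predictions, four of the ten disputed papers (nos. 49, 51, 53, and 57) get a predicted probability of Madison authorship above 0.5. These are exactly the papers that contain "*whilst*" at least once. The remaining six papers contain neither word, so their $whfrac$ is zero and the model falls back on its baseline probability of about 0.18, which says little about who wrote them. The predictions thus rest solely on the use of "*whilst*", since "*while*" is not used in any of the disputed texts, as is evident from the wh2 column. To improve the stylometric analysis, one could identify more idiosyncratic traits of Madison's and Hamilton's styles and add them as features in the regression.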
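The predicted probabilities in the table can be reproduced by hand from the logistic function, which shows that only three distinct values are possible here (for 0, 1, or 2 uses of "whilst" and no "while"). The sketch below is a plain-Python check, not the notebook's own method. The intercept is not printed anywhere in the notebook, so it is backed out from the predicted probability at $whfrac = 0$, which is an assumption of this example.

```python
import math

coef = 2.435144121959235  # coefficient of whfrac, from the fitted model's output
p_at_zero = 0.181104      # prob_mad for papers with whfrac = 0, from the table

# Assumption: recover the (unprinted) intercept as the logit of p_at_zero
intercept = math.log(p_at_zero / (1 - p_at_zero))

def prob_madison(wh1: int, wh2: int) -> float:
    """Predicted probability of Madison authorship given the two word counts."""
    whfrac = math.log((wh1 + 1) / (wh2 + 1))
    return 1 / (1 + math.exp(-(intercept + coef * whfrac)))

for wh1 in (0, 1, 2):
    print(f"wh1={wh1}, wh2=0 -> prob_mad={prob_madison(wh1, 0):.6f}")
```

The values for one and two uses of "whilst" should agree with the 0.544638 and 0.762493 shown in the table, up to rounding of the inputs.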

We now turn to a different topic, analyzing complexity in rhetorical style among US presidents.

## 5: Retrieve US presidential speech data

The Miller Center of Public Affairs at the University of Virginia hosts a collection of speeches by US presidents. Follow the directions here to retrieve the data in JSON format: https://data.millercenter.org/

## 6: Formatting data

The speeches are downloaded from the Miller Center in JSON format. Load the speeches as a data frame.

In [12]:
import pandas as pd

spdf = pd.read_json('speeches.json')


For simplicity we want to compare only speeches by Obama and subsequent presidents, so subset to speeches given in 2009 or later.

In [13]:
presidents = ['Joe Biden','Donald Trump','Barack Obama']
# .copy() so that later column assignments act on an independent DataFrame
spdf = spdf[spdf['president'].isin(presidents)].copy()


In [14]:
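# parse dates; unparseable values become NaT and are dropped
spdf['date'] = pd.to_datetime(spdf['date'], errors='coerce')
spdf = spdf.dropna(subset=['date'])
# keep speeches given in 2009 or later
spdf = spdf[spdf['date'] >= "2009-01-01"]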

## 7: Lexical diversity

Lowercase and tokenize the text of each speech, then calculate the type-token ratio (TTR) for each speech. Use the ld.ttr() function from the imported lexical-diversity package.


In [15]:
# lowercase
spdf['transcript_lowercase'] = spdf['transcript'].str.lower()
# tokenizing
spdf['transcript_tokenized'] = spdf['transcript_lowercase'].apply(word_tokenize)

In [16]:
spdf['ttr'] = spdf['transcript_tokenized'].apply(ld.ttr)

In [17]:
# inspecting calculated values
print(min(spdf.ttr))
print(max(spdf.ttr))
print((spdf.ttr.mean()))

0.0933911556315661
0.48484848484848486
0.2522111943788095


Regress TTR on president, so you can compare TTR across Obama, Trump, and Biden. How do the presidents differ in terms of lexical diversity?

In [18]:
model = smf.ols(formula='ttr ~ C(president)', data=spdf).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    ttr   R-squared:                       0.010
Model:                            OLS   Adj. R-squared:                 -0.008
Method:                 Least Squares   F-statistic:                    0.5562
Date:                Fri, 26 Apr 2024   Prob (F-statistic):              0.575
Time:                        15:46:17   Log-Likelihood:                 126.44
No. Observations:                 112   AIC:                            -246.9
Df Residuals:                     109   BIC:                            -238.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       

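TTR is higher when a speaker uses a more varied vocabulary. We see that Trump has a higher coefficient than Obama (who is the reference category), which is somewhat surprising given a preconceived notion of Obama as the more eloquent speaker. However, the differences between the presidents are not statistically significant (the F-test has p = 0.575 and R-squared is only 0.010), so the regression gives little evidence that the three presidents differ in lexical diversity.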
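A well-known caveat is that TTR tends to fall as a text gets longer, because common words are repeated while few new words are added. This may matter when comparing speeches of different lengths. The sketch below uses the textbook definition of TTR (distinct tokens divided by total tokens) in a hypothetical helper `type_token_ratio`, applied to a made-up toy sentence. It is not a copy of the package's implementation.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct tokens (types) divided by the total number of tokens."""
    return len(set(tokens)) / len(tokens)

# Toy example: same vocabulary, but the second text is twice as long
short_text = "the cat sat on the mat".split()
long_text = short_text * 2

print(type_token_ratio(short_text))  # 5 types / 6 tokens
print(type_token_ratio(long_text))   # 5 types / 12 tokens
```

Repeating the text doubles the token count without adding any new types, so the ratio is halved even though the vocabulary is unchanged.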

## 8: Readability

We now turn to readability measures instead. Using the presidential speeches data, calculate Flesch Reading Ease (FRE) for each speech. Use the Readability class from the imported readability package.

In [19]:
# running the FRE calculation
tqdm.pandas()

spdf['fre'] = spdf['transcript_lowercase'].progress_apply(lambda x: Readability(str(x)).flesch().score)

  0%|          | 0/112 [00:00<?, ?it/s]

100%|██████████| 112/112 [00:07<00:00, 15.90it/s]


In [20]:
# inspecting calculated values
print(min(spdf.fre))
print(max(spdf.fre))
print((spdf.fre.mean()))

46.1831041025425
93.49627694610783
64.62083023051426


Regress FRE on president. Are the results similar to the regression results for TTR? Which result do you believe in the most?

In [21]:
model = smf.ols(formula='fre ~ C(president)', data=spdf).fit()

print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    fre   R-squared:                       0.023
Model:                            OLS   Adj. R-squared:                  0.005
Method:                 Least Squares   F-statistic:                     1.297
Date:                Fri, 26 Apr 2024   Prob (F-statistic):              0.277
Time:                        15:46:47   Log-Likelihood:                -407.77
No. Observations:                 112   AIC:                             821.5
Df Residuals:                     109   BIC:                             829.7
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       

FRE is high for texts that are easy to read. Trump and Biden have higher coefficients than Obama (the reference category), which matches a preconceived notion that Obama gives speeches that are more complex and thus harder to read, so one would tend to believe this metric more than TTR. As in the TTR regression (p = 0.575), though, the differences are not statistically significant here either (the F-test has p = 0.277 and R-squared is 0.023).
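For reference, the standard Flesch Reading Ease formula depends only on average sentence length and average syllables per word. The sketch below applies it to made-up counts in a hypothetical helper `flesch_reading_ease`. It does not reproduce how the readability package splits sentences or counts syllables, so it will not exactly match the scores computed above.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Standard Flesch Reading Ease score from raw counts."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Made-up counts: short sentences with short words vs. long sentences with long words
simple_score = flesch_reading_ease(n_words=100, n_sentences=10, n_syllables=130)
complex_score = flesch_reading_ease(n_words=100, n_sentences=4, n_syllables=170)

print(f"simple text:  {simple_score:.3f}")
print(f"complex text: {complex_score:.3f}")
```

Longer sentences and longer words both lower the score, which is why a higher FRE is read as simpler language.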