# Exercise: Complexity and Style

In this exercise we will first replicate a simplified form of Mosteller & Wallace's famous stylometric analysis of the disputed Federalist Papers. We will then study complexity and style in speeches by US presidents.

In [27]:
%pip install py-readability-metrics
%pip install lexical-diversity



In [28]:
import nltk
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
nltk.download('punkt')
from nltk import word_tokenize
from tqdm import tqdm
import statsmodels.formula.api as smf
from lexical_diversity import lex_div as ld
from readability import Readability

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1: Load Mosteller & Wallace data

Load the text of all the Federalist Papers, available as 'federalist.csv' on Absalon.

Paper no. 58 is attributed to Madison in the data, but Mosteller and Wallace consider this paper to have disputed authorship. Fix no. 58 to "HAMILTON OR MADISON" as the author.

In [29]:
!wget https://github.com/nglage/asds2/releases/download/v1.0.0/federalist.csv


In [30]:
df = pd.read_csv('federalist.csv')
df.head()

Unnamed: 0,author,text,date,title,paper_id,venue
0,HAMILTON,To the People of the State of New York:\n\nAFT...,,General Introduction,1,For the Independent Journal
1,JAY,To the People of the State of New York:\n\nWHE...,,Concerning Dangers from Foreign Force and Infl...,2,For the Independent Journal
2,JAY,To the People of the State of New York:\n\nIT ...,,The Same Subject Continued (Concerning Dangers...,3,For the Independent Journal
3,JAY,To the People of the State of New York:\n\nMY ...,,The Same Subject Continued (Concerning Dangers...,4,For the Independent Journal
4,JAY,To the People of the State of New York:\n\nQUE...,,The Same Subject Continued (Concerning Dangers...,5,For the Independent Journal


In [31]:
# change authorship of paper no. 58 to 'HAMILTON OR MADISON'
df.loc[df['paper_id'] == 58, 'author'] = 'HAMILTON OR MADISON'
df[50:60]

Unnamed: 0,author,text,date,title,paper_id,venue
50,HAMILTON OR MADISON,To the People of the State of New York:\n\nTO ...,"Friday, February 8, 1788",The Structure of the Government Must Furnish t...,51,From the New York Packet
51,HAMILTON OR MADISON,To the People of the State of New York:\n\nFRO...,"Friday, February 8, 1788",The House of Representatives,52,From the New York Packet
52,HAMILTON OR MADISON,To the People of the State of New York:\n\nI S...,"Tuesday, February 12, 1788",The Same Subject Continued (The House of Repre...,53,From the New York Packet
53,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 12, 1788",The Apportionment of Members Among the States,54,From the New York Packet
54,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Friday, February 15, 1788",The Total Number of the House of Representatives,55,From the New York Packet
55,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Same Subject Continued (The Total Number o...,56,From the New York Packet
56,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Alleged Tendency of the New Plan to Elevat...,57,From the New York Packet
57,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,,Objection That The Number of Members Will Not ...,58,
58,HAMILTON,To the People of the State of New York:\n\nTHE...,"Friday, February 22, 1788",Concerning the Power of Congress to Regulate t...,59,From the New York Packet
59,HAMILTON,To the People of the State of New York:\n\nWE ...,"Tuesday, February 26, 1788",The Same Subject Continued (Concerning the Pow...,60,From the New York Packet


## 2: Stylometric feature engineering

To help ourselves a bit, we will lean on Mosteller & Wallace's finding that Madison tended to use the word "whilst", while Hamilton tended to use the word "while" in similar contexts. Call the number of uses of "whilst" in a text $wh_1$ and the number of uses of "while" $wh_2$. For each text, calculate the metric

$$
whfrac = \log\left( \frac{wh_1 + 1}{wh_2 + 1} \right)
$$

This is a so-called regularized log odds ratio. The +1s in the numerator and denominator are regularization terms, arbitrarily set to 1 here; they keep the ratio defined when either count is zero. You can optionally verify that $whfrac$ follows a nicely symmetrical distribution.
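As a quick check of the formula, the following minimal sketch evaluates $whfrac$ for a few made-up count pairs (the counts are hypothetical, not taken from the papers). It illustrates that equal counts give zero and that swapping the two counts flips the sign.

```python
import math

def whfrac(wh1: int, wh2: int) -> float:
    """Regularized log odds ratio of 'whilst' (wh1) vs. 'while' (wh2) counts."""
    return math.log((wh1 + 1) / (wh2 + 1))

# hypothetical (whilst, while) count pairs, for illustration only
examples = [(0, 0), (3, 0), (0, 3), (1, 1)]
for wh1, wh2 in examples:
    print(wh1, wh2, round(whfrac(wh1, wh2), 4))
```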

In [32]:
# count occurrences of whilst/while in each text
df['wh1'] = df['text'].apply(lambda x: str(x).split().count("whilst"))
df['wh2'] = df['text'].apply(lambda x: str(x).split().count("while"))
df['whfrac'] = np.log((df['wh1'] + 1) / (df['wh2'] + 1))
df.head()


Unnamed: 0,author,text,date,title,paper_id,venue,wh1,wh2,whfrac
0,HAMILTON,To the People of the State of New York:\n\nAFT...,,General Introduction,1,For the Independent Journal,0,0,0.0
1,JAY,To the People of the State of New York:\n\nWHE...,,Concerning Dangers from Foreign Force and Infl...,2,For the Independent Journal,0,1,-0.693147
2,JAY,To the People of the State of New York:\n\nIT ...,,The Same Subject Continued (Concerning Dangers...,3,For the Independent Journal,0,0,0.0
3,JAY,To the People of the State of New York:\n\nMY ...,,The Same Subject Continued (Concerning Dangers...,4,For the Independent Journal,0,0,0.0
4,JAY,To the People of the State of New York:\n\nQUE...,,The Same Subject Continued (Concerning Dangers...,5,For the Independent Journal,0,0,0.0
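
Note that splitting on whitespace and counting exact matches only finds lowercase tokens with no attached punctuation, so it skips forms such as "While" at the start of a sentence or "whilst," followed by a comma. One possible alternative, sketched here with a hypothetical helper `count_word` and a made-up sentence, counts whole-word matches case-insensitively. Switching to it would change the counts and therefore the numbers reported in the rest of this exercise.

```python
import re

def count_word(text: str, word: str) -> int:
    """Count whole-word, case-insensitive occurrences of `word` in `text`."""
    pattern = r"\b" + re.escape(word) + r"\b"
    return len(re.findall(pattern, text, flags=re.IGNORECASE))

# made-up sentence, for illustration only
sample = "While the states deliberate, whilst, as some say, others act; while none agree."

naive_while = sample.split().count("while")    # misses the capitalized "While"
naive_whilst = sample.split().count("whilst")  # misses "whilst," with a comma
regex_while = count_word(sample, "while")
regex_whilst = count_word(sample, "whilst")

print(naive_while, naive_whilst, regex_while, regex_whilst)
```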


## 3: Testing predictiveness

The disputed papers are papers 49 to 58, 62, and 63. Create separate data frames for the disputed and undisputed papers.

Among the undisputed papers, create a dummy variable indicating whether the text was written by Madison vs. anyone else.

Fit a logistic regression model with this new indicator as the dependent variable and $whfrac$ as the independent variable. What does the model tell you about the predictiveness of the use of 'whilst' vs. 'while'?

In [38]:
disputed_df = df[df['paper_id'].isin(list(range(49,59)) + list(range(62,64)))].copy()
undisputed_df = df[~df['paper_id'].isin(disputed_df['paper_id'])].copy()

In [34]:
# dummy variable for Madison vs. anyone else
undisputed_df.loc[:, 'madison_dummy'] = undisputed_df['author'].apply(lambda x: 1 if x == 'MADISON' else 0)



In [35]:
X = undisputed_df['whfrac'].values.reshape(-1, 1)
y = undisputed_df['madison_dummy'].values

clf = LogisticRegression(random_state=0).fit(X, y)

print("The coefficient of whfrac is:", clf.coef_[0][0])
print("The training accuracy of the model is:", clf.score(X, y))

The coefficient of whfrac is: 2.4341734222801774
The training accuracy of the model is: 0.9041095890410958


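The model results suggest a strong, positive relationship between $whfrac$ and the Madison indicator: papers that use "whilst" relatively more often than "while" are more likely to be by Madison. The model classifies about 90% of the undisputed papers correctly, although this accuracy is measured on the same papers the model was fit on. This stylometric feature therefore appears useful for predicting authorship.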
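
One way to read the coefficient is as an odds ratio. The sketch below uses only the coefficient printed above and shows how the odds of Madison authorship change when $whfrac$ rises by one unit, and when the ratio $(wh_1 + 1)/(wh_2 + 1)$ doubles (that is, when $whfrac$ rises by $\log 2$). Bear in mind that scikit-learn's LogisticRegression applies L2 regularization by default, so the coefficient is somewhat shrunk relative to an unpenalized fit.

```python
import math

coef_whfrac = 2.4341734222801774  # coefficient printed by the notebook

# multiplicative change in the odds of Madison authorship
odds_ratio_per_unit = math.exp(coef_whfrac)
odds_ratio_per_doubling = math.exp(coef_whfrac * math.log(2))

print(f"odds ratio per unit increase in whfrac: {odds_ratio_per_unit:.2f}")
print(f"odds ratio per doubling of the count ratio: {odds_ratio_per_doubling:.2f}")
```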

## 4: Predicting authorship

Using the logistic regression model, calculate the predicted probability of Madison authorship among the disputed papers using the $whfrac$ indicator. What do the predictions indicate about the author of the disputed Federalist Papers? How could this stylometric analysis be improved?

In [41]:
X = disputed_df['whfrac'].values.reshape(-1, 1)

disputed_df['prob_madison'] = clf.predict_proba(X)[:,1]
disputed_df.head()

Unnamed: 0,author,text,date,title,paper_id,venue,wh1,wh2,whfrac,prob_madison
48,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 5, 1788",Method of Guarding Against the Encroachments o...,49,From the New York Packet,1,0,0.693147,0.544387
49,HAMILTON OR MADISON,To the People of the State of New York:\n\nIT ...,"Tuesday, February 5, 1788",Periodical Appeals to the People Considered,50,From the New York Packet,0,0,0.0,0.181054
50,HAMILTON OR MADISON,To the People of the State of New York:\n\nTO ...,"Friday, February 8, 1788",The Structure of the Government Must Furnish t...,51,From the New York Packet,1,0,0.693147,0.544387
51,HAMILTON OR MADISON,To the People of the State of New York:\n\nFRO...,"Friday, February 8, 1788",The House of Representatives,52,From the New York Packet,0,0,0.0,0.181054
52,HAMILTON OR MADISON,To the People of the State of New York:\n\nI S...,"Tuesday, February 12, 1788",The Same Subject Continued (The House of Repre...,53,From the New York Packet,1,0,0.693147,0.544387


In [46]:
# count disputed papers with a predicted probability of Madison authorship > 0.5
count = disputed_df[disputed_df['prob_madison'] > 0.5]['prob_madison'].count()
print(count)
print(f'{round(count/len(disputed_df)*100, 2)}%')

5
41.67%
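
The predicted probabilities in the table follow from the logistic function applied to intercept + coefficient × $whfrac$. The notebook does not print the intercept, so the sketch below back-solves it from the printed probability at $whfrac = 0$ (0.181054); treat that value as an assumption rather than the fitted clf.intercept_. It then checks that the same formula gives roughly the printed probability for a paper with one "whilst" and no "while".

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function mapping a linear score to a probability."""
    return 1.0 / (1.0 + math.exp(-z))

coef_whfrac = 2.4341734222801774  # coefficient printed by the notebook
p_at_zero = 0.181054              # printed probability for whfrac = 0

# assumption: intercept recovered from the printed probability, not read from the model
intercept = math.log(p_at_zero / (1 - p_at_zero))

# a paper with one "whilst" and no "while": whfrac = log(2 / 1)
whfrac_one_whilst = math.log((1 + 1) / (0 + 1))
p_one_whilst = sigmoid(intercept + coef_whfrac * whfrac_one_whilst)

print(round(p_one_whilst, 4))
```

In the rows shown, the predictions take only two distinct values, because the feature itself takes only two values there. A single word pair therefore gives fairly coarse evidence about authorship.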


We now turn to a different topic: analyzing complexity in rhetorical style among US presidents.

## 5: Retrieve US presidential speech data

The Miller Center of Public Affairs at University of Virginia hosts a collection of speeches by US presidents. Follow the directions here to retrieve the data in JSON format: https://data.millercenter.org/

In [60]:
import json
import requests

endpoint = "https://api.millercenter.org/speeches"
out_file = "speeches.json"

# the first request returns the first page of speeches
r = requests.post(url=endpoint)
data = r.json()
items = data['Items']

# keep requesting pages for as long as the API returns a continuation key
while 'LastEvaluatedKey' in data:
    parameters = {"LastEvaluatedKey": data['LastEvaluatedKey']['doc_name']}
    r = requests.post(url=endpoint, params=parameters)
    data = r.json()
    items += data['Items']
    print(f'{len(items)} speeches')

# save all collected speeches to a local JSON file
with open(out_file, "w") as out:
    out.write(json.dumps(items))
    print(f'wrote results to file: {out_file}')

94 speeches
137 speeches
175 speeches
233 speeches
275 speeches
328 speeches
369 speeches
414 speeches
460 speeches
506 speeches
550 speeches
594 speeches
625 speeches
670 speeches
716 speeches
770 speeches
807 speeches
852 speeches
902 speeches
931 speeches
984 speeches
1040 speeches
1061 speeches
wrote results to file: speeches.json


## 6: Formatting data

The speeches are downloaded from the Miller Center in JSON format. Load the speeches as a data frame.

In [64]:
spdf = pd.read_json("speeches.json")
spdf.head()

Unnamed: 0,doc_name,date,transcript,president,title
0,january-22-1807-special-message-congress-burr-...,1807-01-22,TO THE SENATE AND HOUSE OF REPRESENTATIVES OF ...,Thomas Jefferson,"January 22, 1807: Special Message to Congress ..."
1,may-25-1813-message-special-congressional-sess...,1813-05-25,Fellow-Citizens of the Senate and of the House...,James Madison,"May 25, 1813: Message on the Special Congressi..."
2,april-2-1917-address-congress-requesting-decla...,1917-04-02,I have called the Congress into extraordinary ...,Woodrow Wilson,"April 2, 1917: Address to Congress Requesting ..."
3,april-10-1975-address-us-foreign-policy,1975-04-10,"Mr. Speaker, Mr. President, distinguished gues...",Gerald Ford,"April 10, 1975: Address on U.S. Foreign Policy"
4,july-6-1848-message-regarding-treaty-guadalupe...,1848-07-06,To the House of Representatives of the United ...,James K. Polk,"July 6, 1848: Message Regarding the Treaty of ..."


For simplicity we want to compare only speeches by Obama and subsequent presidents, so subset to speeches given in 2009 or later.

In [69]:
import datetime
spdf['date'] = pd.to_datetime(spdf['date'], errors='coerce', utc=True)

recent_speeches = spdf[spdf['date'].dt.year >= 2009]

# check president names
print(recent_speeches['president'].unique())

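# filter out Bush, whose last speeches fall in early 2009;
# .copy() lets us add columns later without a SettingWithCopyWarning
recent_speeches = recent_speeches[recent_speeches['president'] != 'George W. Bush'].copy()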

# check president names again
print(recent_speeches['president'].unique())

['Donald Trump' 'Joe Biden' 'Barack Obama' 'George W. Bush']
['Donald Trump' 'Joe Biden' 'Barack Obama']


## 7: Lexical diversity

Lowercase and tokenize the text of each speech, then calculate the type-token ratio (TTR) for each speech. Use the ld.ttr() function from the imported lexical-diversity package.
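
For reference, the type-token ratio is the number of distinct tokens (types) divided by the total number of tokens. The following minimal sketch computes it by hand on a made-up sentence, using a plain whitespace split instead of word_tokenize to stay dependency-free. It also shows that repeating a text lowers its TTR, which is why TTR comparisons across speeches of different lengths should be read with care.

```python
def type_token_ratio(tokens):
    """Number of distinct tokens divided by the total number of tokens."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# made-up sentence, for illustration only
text = "The people of the state and the people of the union"
tokens = text.lower().split()

ttr_single = type_token_ratio(tokens)
# the same text twice: no new types, but twice as many tokens
ttr_doubled = type_token_ratio(tokens + tokens)

print(round(ttr_single, 4), round(ttr_doubled, 4))
```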


In [75]:
import nltk
nltk.download('punkt_tab')
from nltk import word_tokenize

# lowercase
recent_speeches['transcript_lowercase'] = recent_speeches['transcript'].str.lower()

# tokenizing
recent_speeches['transcript_tokenized'] = recent_speeches['transcript_lowercase'].apply(lambda x: word_tokenize(x))

[nltk_data] Downloading package punkt_tab to /root/nltk_data...
[nltk_data]   Package punkt_tab is already up-to-date!


In [76]:
# calculate type-token ratio (TTR) for each speech
recent_speeches['ttr'] = recent_speeches['transcript_tokenized'].apply(lambda x: ld.ttr(x))

In [77]:
# inspect calculated values
print(min(recent_speeches['ttr']))
print(max(recent_speeches['ttr']))
print(recent_speeches['ttr'].mean())

0.0933911556315661
0.48484848484848486
0.2505238292071765


Regress TTR on president, so you can compare TTR across Obama, Trump, and Biden. How do the presidents differ in terms of lexical diversity?

In [80]:
model = smf.ols(formula='ttr ~ C(president)', data = recent_speeches).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    ttr   R-squared:                       0.012
Model:                            OLS   Adj. R-squared:                 -0.005
Method:                 Least Squares   F-statistic:                    0.7128
Date:                Wed, 14 May 2025   Prob (F-statistic):              0.492
Time:                        10:05:01   Log-Likelihood:                 141.49
No. Observations:                 123   AIC:                            -277.0
Df Residuals:                     120   BIC:                            -268.5
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       

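TTR is higher when a speaker tends to use many different words. Donald Trump has a higher coefficient than Obama (the reference category). However, the overall F-test (p = 0.492) and the R-squared of 0.012 in the output above indicate that the differences between the presidents are not statistically significant. TTR also depends on text length, so differences in speech length may account for part of any gap.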
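
When the only predictor is a categorical variable with treatment (dummy) coding, as in ttr ~ C(president), the OLS intercept equals the mean of the reference group and each coefficient equals the difference between that group's mean and the reference mean. The sketch below illustrates this arithmetic on made-up TTR values (hypothetical numbers, not the speech data), computing the group means directly rather than fitting a model.

```python
from statistics import mean

# hypothetical TTR values per president, for illustration only
ttr_by_president = {
    "Barack Obama": [0.20, 0.24, 0.22],
    "Donald Trump": [0.25, 0.27],
    "Joe Biden": [0.21, 0.25],
}

# with string categories, the alphabetically first level is the default reference
reference = sorted(ttr_by_president)[0]
intercept = mean(ttr_by_president[reference])

# each coefficient is a group's mean minus the reference group's mean
coefficients = {
    president: mean(values) - intercept
    for president, values in ttr_by_president.items()
    if president != reference
}

print(reference, round(intercept, 3))
for president, coef in coefficients.items():
    print(president, round(coef, 3))
```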

## 8: Readability

We now turn to readability measures. Using the presidential speeches data, calculate Flesch Reading Ease (FRE) for each speech. Use the Readability() class from the imported package.
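
Flesch Reading Ease combines average sentence length and average syllables per word: $FRE = 206.835 - 1.015 \cdot (\text{words}/\text{sentences}) - 84.6 \cdot (\text{syllables}/\text{words})$. The sketch below evaluates that formula from raw counts. The function name and the example counts are hypothetical, and the Readability package does its own tokenization and syllable counting, so its scores need not match a hand count exactly.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease from raw counts; higher scores mean easier text."""
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# hypothetical counts: 100 words, 5 sentences, 150 syllables
score = flesch_reading_ease(n_words=100, n_sentences=5, n_syllables=150)
print(round(score, 3))

# longer sentences with the same words and syllables lower the score
harder = flesch_reading_ease(n_words=100, n_sentences=2, n_syllables=150)
print(round(harder, 3))
```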

In [85]:
# calculating FRE for each speech
tqdm.pandas()

recent_speeches['fre'] = recent_speeches['transcript_lowercase'].progress_apply(lambda x: Readability(str(x)).flesch().score)

100%|██████████| 123/123 [00:32<00:00,  3.77it/s]


In [87]:
# inspecting calculated values
print(min(recent_speeches['fre']))
print(max(recent_speeches['fre']))
print(recent_speeches['fre'].mean())

46.1831041025425
93.49627694610783
64.1683777227306


Regress FRE on president. Are the results similar to the regression results for TTR? Which result do you believe in the most?

In [88]:
model = smf.ols(formula='fre ~ C(president)', data=recent_speeches).fit()
print(model.summary())

                            OLS Regression Results                            
Dep. Variable:                    fre   R-squared:                       0.019
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     1.177
Date:                Wed, 14 May 2025   Prob (F-statistic):              0.312
Time:                        10:10:20   Log-Likelihood:                -446.28
No. Observations:                 123   AIC:                             898.6
Df Residuals:                     120   BIC:                             907.0
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
                                   coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------------------
Intercept       

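FRE is higher for texts that are easier to read. One might therefore be inclined to believe this model more, since Trump and Biden have higher coefficients than Obama (the reference category). Note, however, that the overall F-test (p = 0.312) and the R-squared of 0.019 in the output above indicate that these differences are not statistically significant.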