# Exercise: Complexity and Style

In this exercise we will first replicate a simplified form of Mosteller & Wallace's famous stylometric analysis of the disputed Federalist Papers. We will then study complexity and style in speeches by US presidents.

In [1]:
%pip install py-readability-metrics
%pip install lexical-diversity

Collecting py-readability-metrics
  Downloading py_readability_metrics-1.4.5-py3-none-any.whl.metadata (8.8 kB)
Downloading py_readability_metrics-1.4.5-py3-none-any.whl (26 kB)
Installing collected packages: py-readability-metrics
Successfully installed py-readability-metrics-1.4.5
Collecting lexical-diversity
  Downloading lexical_diversity-0.1.1-py3-none-any.whl.metadata (4.1 kB)
Downloading lexical_diversity-0.1.1-py3-none-any.whl (117 kB)
Installing collected packages: lexical-diversity
Successfully installed lexical-diversity-0.1.1


In [4]:
import nltk
import pandas as pd
import numpy as np
from sklearn.linear_model import LogisticRegression
nltk.download('punkt')
from nltk import word_tokenize
from tqdm import tqdm
import statsmodels.formula.api as smf
from lexical_diversity import lex_div as ld
from readability import Readability

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


## 1: Load Mosteller & Wallace data

Load the text of all the Federalist Papers, available as 'federalist.csv' on Absalon.

Paper no. 58 is attributed to Madison in the data, but Mosteller and Wallace treat its authorship as disputed. Set the author of no. 58 to "HAMILTON OR MADISON".

In [6]:
!wget https://github.com/nglage/asds2/releases/download/v1.0.0/federalist.csv

--2025-05-13 15:00:23--  https://github.com/nglage/asds2/releases/download/v1.0.0/federalist.csv
Resolving github.com (github.com)... 140.82.112.4
Connecting to github.com (github.com)|140.82.112.4|:443... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://objects.githubusercontent.com/github-production-release-asset-2e65be/982071095/6ea77020-ae3e-4ff2-8f5e-8c73dd2145b4?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=releaseassetproduction%2F20250513%2Fus-east-1%2Fs3%2Faws4_request&X-Amz-Date=20250513T150023Z&X-Amz-Expires=300&X-Amz-Signature=8a5a5bc212a4b47e12b41c419ad0b3eb7e728e78364620f45aa46c0e75fdc016&X-Amz-SignedHeaders=host&response-content-disposition=attachment%3B%20filename%3Dfederalist.csv&response-content-type=application%2Foctet-stream [following]
--2025-05-13 15:00:23--  https://objects.githubusercontent.com/github-production-release-asset-2e65be/982071095/6ea77020-ae3e-4ff2-8f5e-8c73dd2145b4?X-Amz-Algorithm=AWS4-HMAC-SHA256&X-Amz-Credential=

In [7]:
df = pd.read_csv('federalist.csv')
df.head()

Unnamed: 0,author,text,date,title,paper_id,venue
0,HAMILTON,To the People of the State of New York:\n\nAFT...,,General Introduction,1,For the Independent Journal
1,JAY,To the People of the State of New York:\n\nWHE...,,Concerning Dangers from Foreign Force and Infl...,2,For the Independent Journal
2,JAY,To the People of the State of New York:\n\nIT ...,,The Same Subject Continued (Concerning Dangers...,3,For the Independent Journal
3,JAY,To the People of the State of New York:\n\nMY ...,,The Same Subject Continued (Concerning Dangers...,4,For the Independent Journal
4,JAY,To the People of the State of New York:\n\nQUE...,,The Same Subject Continued (Concerning Dangers...,5,For the Independent Journal


In [12]:
# change authorship of paper no. 58 to 'HAMILTON OR MADISON'
df.loc[df['paper_id'] == 58, 'author'] = 'HAMILTON OR MADISON'
df[50:60]

Unnamed: 0,author,text,date,title,paper_id,venue
50,HAMILTON OR MADISON,To the People of the State of New York:\n\nTO ...,"Friday, February 8, 1788",The Structure of the Government Must Furnish t...,51,From the New York Packet
51,HAMILTON OR MADISON,To the People of the State of New York:\n\nFRO...,"Friday, February 8, 1788",The House of Representatives,52,From the New York Packet
52,HAMILTON OR MADISON,To the People of the State of New York:\n\nI S...,"Tuesday, February 12, 1788",The Same Subject Continued (The House of Repre...,53,From the New York Packet
53,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 12, 1788",The Apportionment of Members Among the States,54,From the New York Packet
54,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Friday, February 15, 1788",The Total Number of the House of Representatives,55,From the New York Packet
55,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Same Subject Continued (The Total Number o...,56,From the New York Packet
56,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,"Tuesday, February 19, 1788",The Alleged Tendency of the New Plan to Elevat...,57,From the New York Packet
57,HAMILTON OR MADISON,To the People of the State of New York:\n\nTHE...,,Objection That The Number of Members Will Not ...,58,
58,HAMILTON,To the People of the State of New York:\n\nTHE...,"Friday, February 22, 1788",Concerning the Power of Congress to Regulate t...,59,From the New York Packet
59,HAMILTON,To the People of the State of New York:\n\nWE ...,"Tuesday, February 26, 1788",The Same Subject Continued (Concerning the Pow...,60,From the New York Packet


## 2: Stylometric feature engineering

To help ourselves a bit, we will lean on Mosteller & Wallace's finding that Madison tended to use the word "whilst", whereas Hamilton tended to use "while" in similar contexts. Call the number of occurrences of "whilst" in a text $wh_1$ and the number of occurrences of "while" $wh_2$. For each text, calculate the metric

$$
whfrac = \log\left( \frac{wh_1 + 1}{wh_2 + 1} \right)
$$

This is a so-called regularized log odds ratio. The +1s in the numerator and denominator are regularization terms, arbitrarily set to 1 here; they keep the ratio defined when either word does not occur. You can optionally verify that $whfrac$ follows a roughly symmetrical distribution.
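
As a quick check of the formula, here is a minimal sketch that computes the metric for a single made-up sentence. The helper name `whfrac` and the toy text are illustrative assumptions; the whitespace split mirrors the simple counting used in the cell that follows.

```python
import math

def whfrac(text: str) -> float:
    """Regularized log odds of 'whilst' vs. 'while' in one text."""
    tokens = text.split()          # naive whitespace tokenization
    wh1 = tokens.count("whilst")   # Madison's preferred form
    wh2 = tokens.count("while")    # Hamilton's preferred form
    return math.log((wh1 + 1) / (wh2 + 1))

# Made-up toy sentence: two uses of "whilst", one of "while".
toy = "whilst the states deliberate and whilst the people wait we may while away our strength"
print(whfrac(toy))  # log(3 / 2)
```

Swapping the two words flips the sign of the result, which is what makes the measure symmetric around zero. Note that a whitespace split will not count forms such as "while," or "While".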

In [14]:
# count occurrences of whilst/while in each text
df['wh1'] = df['text'].apply(lambda x: str(x).split().count("whilst"))
df['wh2'] = df['text'].apply(lambda x: str(x).split().count("while"))
df['whfrac'] = np.log((df['wh1'] + 1) / (df['wh2'] + 1))
df.head()


Unnamed: 0,author,text,date,title,paper_id,venue,wh1,wh2,whfrac
0,HAMILTON,To the People of the State of New York:\n\nAFT...,,General Introduction,1,For the Independent Journal,0,0,0.0
1,JAY,To the People of the State of New York:\n\nWHE...,,Concerning Dangers from Foreign Force and Infl...,2,For the Independent Journal,0,1,-0.693147
2,JAY,To the People of the State of New York:\n\nIT ...,,The Same Subject Continued (Concerning Dangers...,3,For the Independent Journal,0,0,0.0
3,JAY,To the People of the State of New York:\n\nMY ...,,The Same Subject Continued (Concerning Dangers...,4,For the Independent Journal,0,0,0.0
4,JAY,To the People of the State of New York:\n\nQUE...,,The Same Subject Continued (Concerning Dangers...,5,For the Independent Journal,0,0,0.0


## 3: Testing predictiveness

The disputed papers are papers 49 to 58, 62, and 63. Create separate data frames for the disputed and undisputed papers.

Among the undisputed papers, create a dummy variable indicating whether the text was written by Madison or by anyone else.

Fit a logistic regression model with this new indicator as the dependent variable and $whfrac$ as the independent variable. What does the model tell you about the predictiveness of the use of 'whilst' vs. 'while'?
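
To illustrate the mechanics only, the sketch below fits a logit model with the formula interface of statsmodels on made-up toy values. The data frame `toy` and its numbers are assumptions for illustration and are not taken from the Federalist data.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up toy values, NOT taken from the Federalist data.
toy = pd.DataFrame({
    "whfrac":  [-1.4, -1.1, -0.7, -0.7, 0.0, 0.0, 0.0, 0.0, 0.7, 0.7, 1.1, 1.4],
    "madison": [0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1],
})

# Logit of the Madison dummy on the log odds ratio.
# disp=False suppresses the optimizer's convergence message.
result = smf.logit("madison ~ whfrac", data=toy).fit(disp=False)

print(result.params)   # intercept and slope on the log-odds scale
print(result.pvalues)  # p-values for each coefficient
```

A positive slope on `whfrac` means that a higher ratio of "whilst" to "while" raises the estimated odds of Madison authorship. If the real data separate the authors perfectly, the maximum-likelihood fit may fail to converge, in which case a regularized model is one alternative.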

In [None]:
#your code here

## 4: Predicting authorship

Using the logistic regression model, calculate the predicted probability of Madison authorship for each of the disputed papers from its $whfrac$ value. What do the predictions indicate about the author of the disputed Federalist Papers? How could this stylometric analysis be improved?
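
The prediction step could look roughly like the sketch below, which uses scikit-learn's `LogisticRegression` on made-up toy values. The arrays `X_train`, `y_train`, and `X_disputed` are illustrative assumptions, not the exercise data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up toy training data standing in for the undisputed papers.
X_train = np.array(
    [-1.4, -1.1, -0.7, -0.7, 0.0, 0.0, 0.0, 0.0, 0.7, 0.7, 1.1, 1.4]
).reshape(-1, 1)                       # scikit-learn expects a 2-D feature matrix
y_train = np.array([0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1])  # 1 = Madison

model = LogisticRegression().fit(X_train, y_train)

# Made-up whfrac values standing in for the disputed papers.
X_disputed = np.array([-0.7, 0.0, 0.7, 1.4]).reshape(-1, 1)

# Column 1 of predict_proba holds the probability of class 1 (Madison).
p_madison = model.predict_proba(X_disputed)[:, 1]
print(p_madison)
```

Keep in mind that `LogisticRegression` applies L2 regularization by default, so its coefficients are shrunk relative to an unpenalized fit and the predicted probabilities will differ somewhat from those of a statsmodels logit.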

In [None]:
#your code here

We now turn to a different topic: complexity in the rhetorical style of US presidents.

## 5: Retrieve US presidential speech data

The Miller Center of Public Affairs at the University of Virginia hosts a collection of speeches by US presidents. Follow the directions here to retrieve the data in JSON format: https://data.millercenter.org/

## 6: Formatting data

The speeches are downloaded from the Miller Center in JSON format. Load the speeches as a data frame.
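
A minimal sketch of the loading step is shown below on an inline JSON string, so that it runs without the download. The field names (`president`, `date`, `title`, `transcript`) and the placeholder records are assumptions; check them against the file you actually retrieved.

```python
import json
import pandas as pd

# Placeholder records; the field names are assumed, not confirmed.
raw = """
[
  {"president": "President A", "date": "2008-01-28",
   "title": "Speech 1", "transcript": "text of speech 1"},
  {"president": "President B", "date": "2009-01-20",
   "title": "Speech 2", "transcript": "text of speech 2"}
]
"""

records = json.loads(raw)        # list of dicts, one per speech
speeches = pd.DataFrame(records) # one row per speech, one column per field
print(speeches.head())
```

With a downloaded file, `json.load` on the opened file object would replace `json.loads(raw)`.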

In [None]:
#your code here

For simplicity, we want to compare only speeches by Obama and subsequent presidents, so subset to speeches given in 2009 or later.
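
One way to do the subsetting is to parse the date column and filter on the year, as in the sketch below. The column name `date`, the ISO date strings, and the placeholder rows are assumptions for illustration.

```python
import pandas as pd

# Placeholder rows; the column name "date" is an assumption.
speeches = pd.DataFrame({
    "president": ["President A", "President B", "President C"],
    "date": ["2008-01-28", "2009-01-20", "2017-01-20"],
})

# Parse the strings into timestamps so the year can be extracted.
speeches["date"] = pd.to_datetime(speeches["date"])

# Keep speeches given in 2009 or later.
recent = speeches[speeches["date"].dt.year >= 2009].reset_index(drop=True)
print(recent)
```

If the real date strings carry time-zone offsets, `pd.to_datetime` may need `utc=True` to parse them into a single datetime column.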

In [None]:
#your code here


## 7: Lexical diversity

Lowercase and tokenize the text of each speech, then calculate the type-token ratio (TTR) for each speech. Use the `ld.ttr()` function from the imported lexical-diversity package.
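
The TTR is the number of distinct tokens (types) divided by the total number of tokens. The sketch below computes it by hand on a made-up sentence, using a plain whitespace split instead of `word_tokenize` so that it runs without extra downloads; the helper name `ttr` is an assumption. Applied to the same token list, `ld.ttr()` should give the same ratio.

```python
def ttr(tokens: list[str]) -> float:
    """Type-token ratio: distinct tokens divided by total tokens."""
    if not tokens:
        return 0.0  # avoid division by zero for empty texts
    return len(set(tokens)) / len(tokens)

# Made-up toy sentence, lowercased before tokenizing.
toy = "We will rebuild and we will recover and we will endure"
tokens = toy.lower().split()

print(len(tokens), len(set(tokens)))  # 11 tokens, 6 types
print(ttr(tokens))                    # 6 / 11
```

Because new types become rarer as a text grows, TTR tends to fall with text length, which matters when comparing speeches of very different lengths.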


In [None]:
#your code here

Regress TTR on president, so you can compare TTR across Obama, Trump, and Biden. How do the presidents differ in terms of lexical diversity?
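
With a categorical regressor, the OLS coefficients are differences in group means relative to a reference category. The sketch below shows this on made-up values with placeholder labels "A", "B", and "C" in place of real presidents; the column names `president` and `ttr` are assumptions.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Made-up TTR values for three placeholder speakers.
toy = pd.DataFrame({
    "president": ["A", "A", "B", "B", "C", "C"],
    "ttr":       [0.30, 0.32, 0.40, 0.42, 0.25, 0.27],
})

# C(...) marks the variable as categorical; the first level ("A")
# becomes the reference category absorbed by the intercept.
fit = smf.ols("ttr ~ C(president)", data=toy).fit()

print(fit.params)
# Intercept: mean TTR of "A"; other terms: difference from "A".
```

Calling `fit.summary()` additionally reports standard errors and p-values for each difference.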

In [None]:
#your code here

## 8: Readability

We now turn to readability measures. Using the presidential speeches data, calculate the Flesch Reading Ease (FRE) score for each speech. Use the `Readability` class from the imported py-readability-metrics package.
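
For intuition about what the score measures, the sketch below evaluates the standard Flesch Reading Ease formula from raw counts. The function name and the example counts are made up for illustration. In py-readability-metrics, the corresponding call is expected to be `Readability(text).flesch().score`, which, assuming the package's documented behaviour, needs texts of at least 100 words.

```python
def flesch_reading_ease(n_words: int, n_sentences: int, n_syllables: int) -> float:
    """Flesch Reading Ease from raw counts.

    Longer sentences and more syllables per word both lower the score,
    so higher values indicate easier text.
    """
    words_per_sentence = n_words / n_sentences
    syllables_per_word = n_syllables / n_words
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word

# Made-up counts: 100 words, 5 sentences, 150 syllables.
score = flesch_reading_ease(100, 5, 150)
print(round(score, 3))  # 206.835 - 20.3 - 126.9
```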

In [None]:
#your code here

Regress FRE on president. Are the results similar to the regression results for TTR? Which result do you trust more, and why?

In [None]:
#your code here