# Exercise: Complexity and Style

In this exercise we will first replicate a simplified form of Mosteller & Wallace's famous stylometric analysis of the disputed Federalist Papers. We will then study complexity and style in speeches by US presidents.

In [None]:
%pip install py-readability-metrics
%pip install lexical-diversity

In [None]:
# Data handling, modelling, and tokenization
import nltk
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from nltk import word_tokenize
from sklearn.linear_model import LogisticRegression
from tqdm import tqdm

# Text-complexity packages installed in the cell above
from lexical_diversity import lex_div as ld
from readability import Readability

# Tokenizer models used by word_tokenize
nltk.download('punkt')


## 1: Load Mosteller & Wallace data

Load the text of all the Federalist Papers, available as 'federalist.csv' on Absalon.

Paper no. 58 is attributed to Madison in the data, but Mosteller and Wallace treat its authorship as disputed. Change the author of no. 58 to "HAMILTON OR MADISON".
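The relabelling step could look roughly like the sketch below. It runs on a toy data frame rather than the real file, and the column names `paper_id`, `author`, and `text` are assumptions; check the actual column names in 'federalist.csv' after loading it with `pd.read_csv`.

```python
import pandas as pd

# Toy stand-in for pd.read_csv("federalist.csv"); the column names are assumed.
papers = pd.DataFrame({
    "paper_id": [57, 58, 59],
    "author": ["HAMILTON OR MADISON", "MADISON", "HAMILTON"],
    "text": ["...", "...", "..."],
})

# Overwrite the author only on the row for paper no. 58.
papers.loc[papers["paper_id"] == 58, "author"] = "HAMILTON OR MADISON"
print(papers[["paper_id", "author"]])
```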

In [38]:
#your code here

## 2: Stylometric feature engineering

To help ourselves a bit, we will lean on Mosteller & Wallace's finding that Madison tended to use the word "whilst", while Hamilton tended to use the word "while" in similar contexts. Call the number of occurrences of "whilst" in a text $wh_1$ and the number of occurrences of "while" $wh_2$. For each text, calculate the metric

$$
whfrac = \log\left( \frac{wh_1 + 1}{wh_2 + 1} \right)
$$

This is a so-called regularized log odds ratio. The +1s in the numerator and denominator are regularization terms, arbitrarily set to 1 here; they keep the ratio defined when one of the words does not occur in a text. You can optionally verify that $whfrac$ follows a roughly symmetrical distribution.
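As a minimal sketch of the metric above, the function below counts the two words in a single text and returns the regularized log ratio. The function name and the simple regex tokenizer are illustrative choices, not part of the exercise; `word_tokenize` from the imports could be used instead.

```python
import math
import re

def whfrac(text: str) -> float:
    """Regularized log ratio of 'whilst' counts to 'while' counts."""
    # Lowercase and keep alphabetic tokens only (a simple stand-in tokenizer).
    tokens = re.findall(r"[a-z]+", text.lower())
    wh1 = tokens.count("whilst")
    wh2 = tokens.count("while")
    return math.log((wh1 + 1) / (wh2 + 1))

sample = "Whilst the people deliberate, and whilst they vote, the states wait a while."
print(whfrac(sample))  # two "whilst" and one "while", so this is log(3/2)
```

A text that contains neither word gets a value of 0, which is the neutral point of the scale.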

In [39]:
#your code here


## 3: Testing predictiveness

The disputed papers are papers 49 to 58, 62, and 63. Create separate data frames for the disputed and undisputed papers.

Among the undisputed papers, create a dummy variable indicating whether the text was written by Madison or by anyone else.
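One way to do the split and build the indicator is sketched below on a toy data frame. The column names `paper_id` and `author` and the new column `is_madison` are assumptions made for illustration.

```python
import pandas as pd

# Papers 49 to 58, plus 62 and 63, as listed in the exercise text.
disputed_ids = list(range(49, 59)) + [62, 63]

# Toy stand-in for the full data; column names are assumed.
papers = pd.DataFrame({
    "paper_id": [10, 49, 58, 59, 62, 64],
    "author": ["MADISON", "HAMILTON OR MADISON", "HAMILTON OR MADISON",
               "HAMILTON", "HAMILTON OR MADISON", "JAY"],
})

is_disputed = papers["paper_id"].isin(disputed_ids)
disputed = papers[is_disputed].copy()
undisputed = papers[~is_disputed].copy()

# 1 if Madison wrote the paper, 0 for any other author.
undisputed["is_madison"] = (undisputed["author"] == "MADISON").astype(int)
print(undisputed)
```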

Fit a logistic regression model with this new indicator as the dependent variable and $whfrac$ as the independent variable. What does the model tell you about the predictiveness of the use of 'whilst' vs. 'while'?
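A minimal sketch of the fitting step, using the `LogisticRegression` class imported at the top of the notebook. The eight data points are made up for illustration and are not taken from the Federalist data. Note that scikit-learn applies L2 regularization by default, so its coefficients are shrunk towards zero; `smf.logit` from statsmodels would be an alternative that also reports standard errors.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up whfrac values and Madison indicators, for illustration only.
whfrac_values = np.array([-1.1, -0.7, -0.7, 0.0, 0.0, 0.7, 1.1, 1.4])
is_madison = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# scikit-learn expects a 2D feature matrix, hence the reshape.
X = whfrac_values.reshape(-1, 1)
model = LogisticRegression()
model.fit(X, is_madison)

print("intercept:", model.intercept_[0])
print("slope on whfrac:", model.coef_[0, 0])
```

A positive slope means that higher values of $whfrac$ (relatively more "whilst") go together with a higher estimated probability of Madison authorship.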

In [None]:
#your code here

## 4: Predicting authorship

Using the logistic regression model, calculate the predicted probability of Madison authorship for each of the disputed papers based on the $whfrac$ indicator. What do the predictions indicate about the author of the disputed Federalist Papers? How could this stylometric analysis be improved?
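To make explicit what a predicted probability is here, the sketch below applies the logistic function to a linear predictor. The function name and the coefficient values are hypothetical; in practice the fitted model's own prediction method (for example `predict_proba` in scikit-learn) performs this calculation.

```python
import math

def madison_probability(whfrac: float, intercept: float, slope: float) -> float:
    """Logistic-regression probability of Madison authorship for one whfrac value."""
    return 1.0 / (1.0 + math.exp(-(intercept + slope * whfrac)))

# Hypothetical coefficients, for illustration only (not fitted to the real data).
intercept, slope = -0.5, 2.0

for value in (-0.69, 0.0, 1.10):
    print(value, round(madison_probability(value, intercept, slope), 3))
```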

In [None]:
#your code here

We now turn to a different topic: analyzing complexity in the rhetorical style of US presidents.

## 5: Retrieve US presidential speech data

The Miller Center of Public Affairs at the University of Virginia hosts a collection of speeches by US presidents. Follow the directions here to retrieve the data in JSON format: https://data.millercenter.org/

## 6: Formatting data

The speeches are downloaded from the Miller Center in JSON format. Load the speeches as a data frame.

In [None]:
#your code here

For simplicity we want to compare only speeches by Obama and subsequent presidents, so subset to speeches given in 2009 or later.
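The loading and subsetting steps could be sketched as follows. The records are invented, and the field names `president`, `date`, and `transcript` are assumptions about the Miller Center format; inspect the downloaded file to confirm them. With the real file, the JSON would be read from disk instead of from a string.

```python
import json
import pandas as pd

# Invented records standing in for the downloaded JSON; field names are assumed.
raw = json.dumps([
    {"president": "George W. Bush", "date": "2008-01-28", "transcript": "..."},
    {"president": "Barack Obama", "date": "2009-01-20", "transcript": "..."},
    {"president": "Donald Trump", "date": "2017-01-20", "transcript": "..."},
])

# A list of JSON objects maps directly onto the rows of a data frame.
speeches = pd.DataFrame(json.loads(raw))

# Parse the dates, then keep speeches given in 2009 or later.
speeches["date"] = pd.to_datetime(speeches["date"])
recent = speeches[speeches["date"].dt.year >= 2009].copy()
print(recent[["president", "date"]])
```

If the real timestamps carry time zone offsets, passing `utc=True` to `pd.to_datetime` may be needed before extracting the year.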

In [None]:
#your code here


## 7: Lexical diversity

Lowercase and tokenize the text of each speech, then calculate the type-token ratio (TTR) for each speech. Use the `ld.ttr()` function from the imported lexical diversity package.
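For intuition, TTR is the number of distinct tokens (types) divided by the total number of tokens. The sketch below computes it by hand on a toy sentence; the helper name is illustrative, and `ld.ttr()` is expected to give the same kind of ratio when passed a list of tokens.

```python
def type_token_ratio(tokens: list[str]) -> float:
    """Share of distinct tokens among all tokens; 0.0 for an empty list."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Lowercase first so that "The" and "the" count as the same type.
tokens = "The cat saw the dog".lower().split()
print(type_token_ratio(tokens))  # 4 types over 5 tokens
```

Because longer texts repeat more words, TTR tends to fall as text length grows, which is worth keeping in mind when comparing speeches of different lengths.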


In [None]:
#your code here

Regress TTR on president, so you can compare TTR across Obama, Trump, and Biden. How do the presidents differ in terms of lexical diversity?
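With a single categorical predictor, the regression amounts to a comparison of group means: the intercept is the mean TTR of the reference president, and each coefficient is another president's difference from that mean. The sketch below shows this with plain least squares on made-up values. In the notebook, `smf.ols` with a formula such as `ttr ~ C(president)` (column names assumed) would also provide standard errors.

```python
import numpy as np

# Made-up TTR values, for illustration only.
presidents = np.array(["Obama", "Obama", "Trump", "Trump", "Biden", "Biden"])
ttr = np.array([0.30, 0.34, 0.26, 0.28, 0.31, 0.29])

reference = "Obama"
others = [p for p in np.unique(presidents) if p != reference]  # sorted: Biden, Trump

# Design matrix: an intercept column plus one dummy per non-reference president.
columns = [np.ones(len(ttr))] + [(presidents == p).astype(float) for p in others]
X = np.column_stack(columns)

coef, *_ = np.linalg.lstsq(X, ttr, rcond=None)
for name, value in zip(["Intercept"] + others, coef):
    print(name, round(float(value), 3))
```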

In [None]:
#your code here

## 8: Readability

We now turn to readability measures. Using the presidential speeches data, calculate the Flesch Reading Ease (FRE) score for each speech. Use the `Readability` class from the imported package.
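For reference, the standard Flesch Reading Ease formula combines average sentence length and average syllables per word:

$$
FRE = 206.835 - 1.015 \cdot \frac{\text{total words}}{\text{total sentences}} - 84.6 \cdot \frac{\text{total syllables}}{\text{total words}}
$$

Higher scores indicate text that is easier to read, so longer sentences and longer words both lower the score.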

In [None]:
#your code here

Regress FRE on president. Are the results similar to the regression results for TTR? Which result do you trust more, and why?

In [None]:
#your code here