# Classifying Political Rhetoric

## Table of Contents

1. Background
2. Data Understanding
3. Data Preparation
    - 3.a. Webscraping & Cleaning Presidential Debates
    - 3.b. Webscraping & Cleaning RNC/DNC Conventions
    - 3.c. Webscraping & Cleaning Inaugural Speeches
    - 3.d. NLP Preprocessing all 3 Datasets
        - 3.d.i. NLP Preprocessing the Presidential Debates
        - 3.d.ii. NLP Preprocessing the RNC/DNC Conventions
        - 3.d.iii. NLP Preprocessing the Inaugural Speeches
    - 3.e. Combining all 3 Datasets
    - 3.f. Basic Visualization
    - 3.g. Additional Stopwords Removal
4. Modelling
    - 4.a. Multinomial Naive Bayes Model
        - 4.a.i. MNB Grid Searches
    - 4.b. Gaussian Bayes Model
        - 4.b.i. GB Grid Searches
    - 4.c. Random Forest Model
        - 4.c.i. RF Grid Searches
5. Final Results: MNB 
6. Next Steps

In [2]:
# Imports
import pandas as pd
import matplotlib.pyplot as plt
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

import re 

from nltk.probability import FreqDist
from nltk.corpus import stopwords
from nltk.tokenize import regexp_tokenize, RegexpTokenizer
from nltk.stem import WordNetLemmatizer

from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

## 1. Background

What words/phrases do politicians say most often?

Can we identify a politician’s party using their speech?

How has speech variance between parties changed over time?



## 2. Data Understanding

*talk about how much data we have


## 3. Data Preparation

### 3.a. Webscraping & Cleaning Presidential Debates

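Below we scrape the debate transcripts from debates.org: we collect every transcript link on the index page, then download each page's paragraph text and parse the date out of its URL.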

In [59]:
df = pd.DataFrame(columns=["date", "text", "link"])
debatelink = []

base_url = "https://www.debates.org"
page = requests.get(urljoin(base_url, "/voter-education/debate-transcripts/"))
soup = BeautifulSoup(page.content, 'html.parser')

links = soup.find_all("a")
for link in links:
    href = link.get("href") or ""  # some <a> tags have no href attribute
    # Keep transcript pages only; skip media pages and URLs containing "vice"
    if "debate-transcript" in href and "Media" not in href and "vice" not in href:
        debatelink.append(urljoin(base_url, href))
        df.loc[len(df)] = [None, None, urljoin(base_url, href)]

for link in df["link"]:
    try:
        page = requests.get(link)
        soup = BeautifulSoup(page.content, 'html.parser')
        text = soup.find_all("p")
        text = [t.get_text() for t in text]
        text = " ".join(text)
        df.loc[df["link"] == link, "text"] = text

        date_link = link.split("/")[5:6] 
        date_link = "-".join(date_link[0].split("-")[:3])
        df.loc[df["link"] == link, "date"] = date_link
    except requests.exceptions.RequestException as e:
        print(f"Error accessing {link}: {e}")

df

Unnamed: 0,date,text,link
0,,Unofficial transcripts of most presidential an...,https://www.debates.org/voter-education/debate...
1,october-22-2020,Presidential Debate at Belmont University in N...,https://www.debates.org/voter-education/debate...
2,september-29-2020,Presidential Debate at Case Western Reserve Un...,https://www.debates.org/voter-education/debate...
3,october-19-2016,Presidential Debate at the University of Nevad...,https://www.debates.org/voter-education/debate...
4,october-9-2016,Presidential Debate at Washington University i...,https://www.debates.org/voter-education/debate...
...,...,...,...
88,october-22-1976,"\nOctober 22, 1976\n The Third Carter-Ford Pre...",https://www.debates.org/voter-education/debate...
89,september-26-1960,"\nSeptember 26, 1960\n The First Kennedy-Nixon...",https://www.debates.org/voter-education/debate...
90,october-7-1960,"\nOctober 7, 1960\n The Second Kennedy-Nixon P...",https://www.debates.org/voter-education/debate...
91,october-13-1960,"\nOctober 13, 1960\n The Third Kennedy-Nixon P...",https://www.debates.org/voter-education/debate...


We want to take the last four characters of each string in the 'date' column to get a column with just the year of each speech. Below we check every row to make sure this method would produce no inconsistencies.

In [60]:
df['date'][:50]

0                            
1             october-22-2020
2           september-29-2020
3             october-19-2016
4              october-9-2016
5              october-4-2016
6           september-26-2016
7             october-22-2012
8             october-16-2012
9              october-3-2012
10     2008-debate-transcript
11     2008-debate-transcript
12             october-7-2008
13            october-15-2008
14            october-13-2004
15             october-8-2004
16             october-5-2004
17          september-30-2004
18            october-17-2000
19            october-11-2000
20             october-5-2000
21             october-3-2000
22            october-16-1996
23             october-9-1996
24             october-6-1996
25            october-19-1992
26            october-15-1992
27            october-15-1992
28            october-13-1992
29            october-11-1992
30            october-11-1992
31            october-13-1988
32             october-5-1988
33        

In [61]:
df['date'][50:100]

50             october-4-2016
51             october-9-2016
52            october-19-2016
53             october-3-2012
54            october-16-2012
55            october-22-2012
56     2008-debate-transcript
57     2008-debate-transcript
58             october-7-2008
59            october-15-2008
60            october-13-2004
61             october-8-2004
62             october-5-2004
63          september-30-2004
64             october-3-2000
65             october-5-2000
66            october-11-2000
67            october-17-2000
68    2000-debate-transcripts
69             october-6-1996
70             october-9-1996
71            october-16-1996
72            october-11-1992
73            october-11-1992
74            october-13-1992
75            october-15-1992
76            october-15-1992
77            october-19-1992
78          september-25-1988
79             october-5-1988
80            october-13-1988
81             october-7-1984
82            october-11-1984
83        

This exploration shows six rows whose date does not end in the year, so taking the last four characters would not work for them. Instead, we use .loc to set the year in those cells by hand.

In [62]:
# Chained indexing (df['date'][10] = ...) may assign to a copy; .loc writes to df itself
df.loc[[10, 11, 56, 57], 'date'] = '2008'
df.loc[[46, 68], 'date'] = '2000'
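
As a possible alternative to entering the years by hand, a regular expression can pull the four-digit year out of a slug wherever it sits. This is only a sketch: `extract_year` is a hypothetical helper that is not part of the notebook, and it assumes each slug is a string containing at most one standalone year from the 1900s or 2000s.

```python
import re

def extract_year(slug):
    """Return the first standalone 4-digit year (19xx or 20xx) in slug, or None."""
    match = re.search(r"(?<!\d)(?:19|20)\d{2}(?!\d)", slug)
    return match.group(0) if match else None

# Slugs in the shapes seen in the 'date' column above, plus an empty one
years = [extract_year(s) for s in ["october-22-2020", "2008-debate-transcript", ""]]
```

Both slug shapes yield the year, and an empty slug yields `None` rather than a wrong value, which makes unparsed rows easy to find afterwards.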

Now we'll take a look at one row in the 'text' column to see how it is structured. This will inform how we isolate the candidates' speech and remove all other speakers' speech.

In [63]:
df['text'][5]

'Vice Presidential Debate at Longwood University in Farmville, Virginia October 4, 2016 PARTICIPANTS: Senator Tim Kaine (D-VA) and Governor Mike Pence (R-IN) MODERATOR: Elaine Quijano (CBS News) QUIJANO: Good evening. From Longwood University in Farmville, Virginia, and welcome to the first, and only, vice presidential debate of 2016, sponsored by the Commission on Presidential Debates. I’m Elaine Quijano, anchor at CBSN, and correspondent for CBS News. It’s an honor to moderate this debate between Senator Tim Kaine and Governor Mike Pence. Both are longtime public servants who are also proud fathers of sons serving in the U.S. Marines. The campaigns have agreed to the rules of this 90-minute debate. There will be nine different segments covering domestic and foreign policy issues. Each segment will begin with a question to both candidates who will each have two minutes to answer. Then I’ll ask follow-up questions to facilitate a discussion between the candidates. By coin toss, it’s be

We can see that each speaker's turn is introduced by their name in all caps, followed by a colon. In the transcript above, for example, the moderator's turns begin with "QUIJANO:"; in the same way, Hillary Clinton's turns in her debates begin with "CLINTON:".

Thus, to extract just what Hillary Clinton said, we need the text that follows each instance of "CLINTON:", up to the next speaker label.

To accomplish that, we create a function that takes a single transcript and a name (in this case, 'CLINTON'). It uses a regex pattern to find every instance of 'CLINTON:' and extracts the text starting at 'CLINTON:' and ending at the next all-caps word followed by a colon (or at the end of the transcript). It returns the extracted segments as a list of strings. Row 5 above is the 2016 vice presidential debate, which has no "CLINTON:" turns, so the call below returns an empty list.

In [64]:
# Function to extract a speaker's turns: text from "NAME:" up to the next all-caps word followed by a colon
def extract_text_segments(text, name):
    # Raw f-string so \s reaches the regex engine unchanged; re.escape guards the name
    pattern = rf"{re.escape(name)}:\s*([\s\S]*?)(?=(?:[A-Z]+\s*:\s*|$))"

    extracted_texts = []
    # Keep every match, including the speaker's final turn
    for match in re.finditer(pattern, text):
        extracted_texts.append(match.group(0).strip())

    return extracted_texts

extract_text_segments(df['text'][5], 'CLINTON')

[]
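
To make the pattern's behavior easier to see, here is a minimal sketch on a made-up, one-line transcript. `speaker_turns` is a hypothetical variant of the function above; it returns only the captured text after the label, whereas the notebook's function keeps the "NAME:" label in each segment.

```python
import re

def speaker_turns(text, name):
    """Return the text after each "NAME:" label, up to the next all-caps label."""
    pattern = rf"{re.escape(name)}:\s*([\s\S]*?)(?=(?:[A-Z]+\s*:\s*|$))"
    return [m.group(1).strip() for m in re.finditer(pattern, text)]

# Made-up transcript, for illustration only
toy = "HOLT: Welcome. CLINTON: Thank you. TRUMP: Thanks. CLINTON: Jobs matter."
clinton_turns = speaker_turns(toy, "CLINTON")  # two turns
kaine_turns = speaker_turns(toy, "KAINE")      # speaker absent, so empty list
```

Note that the lookahead stops at any all-caps word followed by a colon, so a heading such as "PARTICIPANTS:" would also end a turn.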

Below we define another function that creates a new dataframe containing only 3 columns: the year a speech was given, the name of the speaker we are interested in, and the text of what that speaker said in that speech.

We get the 'Year' column data by taking the last four characters of the date strings in df['date'].

We get the 'Name' column data from a list of candidates initialized below.

We get the 'Text' column data by calling our previous function 'extract_text_segments' on each row in df['text']. To reiterate, that function will grab only the words said by candidates from the names list below. 

In [65]:
names = ['CLINTON', 'TRUMP', 'OBAMA', 'ROMNEY', 'MCCAIN', 'BUSH', 'KERRY', 'GORE', 'CLINTON', 'DOLE', 'DUKAKIS', 'REAGAN', 'MONDALE', 'CARTER', 'FORD', 'NIXON', 'KENNEDY', 'ANDERSON']

dict2 = {'Year': [], 'Name': [], 'Text': []}
df2 = pd.DataFrame(dict2)

def text_extractor(data):
    # Iterate over the frame passed in rather than the global df
    for name in names:
        for index, row in data.iterrows():
            if name in row['text']:
                extracted_text = extract_text_segments(row['text'], name)
                year = row['date'][-4:]
                df2.loc[len(df2.index)] = [year, name, extracted_text]
    return df2

result_df = text_extractor(df)
print(result_df)




     Year      Name                                               Text
0    2016   CLINTON  [CLINTON: Thank you very much, Chris. And than...
1    2016   CLINTON  [CLINTON: Well, thank you. Are you a teacher? ...
2    2016   CLINTON  [CLINTON: How are you, Donald? [applause], CLI...
3    1996   CLINTON  [CLINTON: I was going to applaud, too. Well, t...
4    1996   CLINTON  [CLINTON: Thank you, Jim. And thank you to the...
..    ...       ...                                                ...
163  1960   KENNEDY  [KENNEDY: In the first place I’ve never sugges...
164  1960   KENNEDY  [KENNEDY: Good evening, Mr. Shadel. MR., KENNE...
165  1960   KENNEDY  [KENNEDY: Good evening, Mr. Howe. MR., KENNEDY...
166  1980  ANDERSON  [ANDERSON: Miss Loomis, I think it’s very appr...
167  1980  ANDERSON  [ANDERSON: Miss Loomis, I think it’s very appr...

[168 rows x 3 columns]


In [66]:
result_df['Text'][1]

['CLINTON: Well, thank you. Are you a teacher? Yes, I think that that’s a very good question, because I’ve heard from lots of teachers and parents about some of their concerns about some of the things that are being said and done in this campaign. And I think it is very important for us to make clear to our children that our country really is great because we’re good. And we are going to respect one another, lift each other up. We are going to be looking for ways to celebrate our diversity, and we are going to try to reach out to every boy and girl, as well as every adult, to bring them in to working on behalf of our country. I have a very positive and optimistic view about what we can do together. That’s why the slogan of my campaign is “Stronger Together,” because I think if we work together, if we overcome the divisiveness that sometimes sets Americans against one another, and instead we make some big goals—and I’ve set forth some big goals, getting the economy to work for everyone,

In the resulting dataframe, each entry in the 'Text' column is a list of strings, one per turn in which the candidate spoke. We join each list into a single string below.

In [67]:
result_df['Text']= [" ".join(x) for x in result_df['Text']]

In [68]:
# Checking that the 'Text' is now one string
result_df['Text'][1]

'CLINTON: Well, thank you. Are you a teacher? Yes, I think that that’s a very good question, because I’ve heard from lots of teachers and parents about some of their concerns about some of the things that are being said and done in this campaign. And I think it is very important for us to make clear to our children that our country really is great because we’re good. And we are going to respect one another, lift each other up. We are going to be looking for ways to celebrate our diversity, and we are going to try to reach out to every boy and girl, as well as every adult, to bring them in to working on behalf of our country. I have a very positive and optimistic view about what we can do together. That’s why the slogan of my campaign is “Stronger Together,” because I think if we work together, if we overcome the divisiveness that sometimes sets Americans against one another, and instead we make some big goals—and I’ve set forth some big goals, getting the economy to work for everyone, 

In [69]:
type(result_df['Text'][1])

str

In [70]:
# Checking our text column
result_df['Text']

0      CLINTON: Thank you very much, Chris. And thank...
1      CLINTON: Well, thank you. Are you a teacher? Y...
2      CLINTON: How are you, Donald? [applause] CLINT...
3      CLINTON: I was going to applaud, too. Well, th...
4      CLINTON: Thank you, Jim. And thank you to the ...
                             ...                        
163    KENNEDY: In the first place I’ve never suggest...
164    KENNEDY: Good evening, Mr. Shadel. MR. KENNEDY...
165    KENNEDY: Good evening, Mr. Howe. MR. KENNEDY: ...
166    ANDERSON: Miss Loomis, I think it’s very appro...
167    ANDERSON: Miss Loomis, I think it’s very appro...
Name: Text, Length: 168, dtype: object

Above it seems like rows 166 and 167 are exact duplicates.

In [71]:
# Checking to see if rows 166 and 167 are identical
result_df.iloc[166] == result_df.iloc[167]

Year    True
Name    True
Text    True
dtype: bool

In [72]:
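# Keeping only the first instance of each duplicated text: assigning the de-duplicated
# Series back leaves NaN in 'Text' for the later copies, which we drop further down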

result_df['Text'] = result_df['Text'].drop_duplicates(keep='first')
result_df

Unnamed: 0,Year,Name,Text
0,2016,CLINTON,"CLINTON: Thank you very much, Chris. And thank..."
1,2016,CLINTON,"CLINTON: Well, thank you. Are you a teacher? Y..."
2,2016,CLINTON,"CLINTON: How are you, Donald? [applause] CLINT..."
3,1996,CLINTON,"CLINTON: I was going to applaud, too. Well, th..."
4,1996,CLINTON,"CLINTON: Thank you, Jim. And thank you to the ..."
...,...,...,...
163,1960,KENNEDY,
164,1960,KENNEDY,
165,1960,KENNEDY,
166,1980,ANDERSON,"ANDERSON: Miss Loomis, I think it’s very appro..."


In [73]:
result_df.iloc[166] == result_df.iloc[167]

Year     True
Name     True
Text    False
dtype: bool

In [74]:
result_df[result_df['Text'].isna() == True]

Unnamed: 0,Year,Name,Text
10,2016,CLINTON,
11,2016,CLINTON,
12,2016,CLINTON,
13,1996,CLINTON,
14,1996,CLINTON,
...,...,...,...
162,1960,KENNEDY,
163,1960,KENNEDY,
164,1960,KENNEDY,
165,1960,KENNEDY,


The duplicate rows now have a null 'Text', so we drop them and then confirm that no nulls remain.

In [75]:
result_df.dropna(inplace=True)

In [76]:
result_df[result_df['Text'].isna() == True]

Unnamed: 0,Year,Name,Text
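
The two steps above (blanking repeated text, then dropping the nulls) could also be done in a single call. The sketch below uses made-up rows and the `subset` argument of `DataFrame.drop_duplicates`; it is an alternative to, not a description of, what the notebook ran.

```python
import pandas as pd

toy = pd.DataFrame({
    "Year": ["1980", "1980", "1960"],
    "Name": ["ANDERSON", "ANDERSON", "KENNEDY"],
    "Text": ["same text", "same text", "other text"],
})

# Drop whole rows whose 'Text' repeats an earlier row, keeping the first copy
deduped = toy.drop_duplicates(subset="Text", keep="first").reset_index(drop=True)
```

Because entire rows are removed, no null 'Text' values are created and no separate `dropna` step is needed.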


In [77]:
result_df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 74 entries, 0 to 166
Data columns (total 3 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   Year    74 non-null     object
 1   Name    74 non-null     object
 2   Text    74 non-null     object
dtypes: object(3)
memory usage: 2.3+ KB


In [78]:
# Function for re-naming both Clintons and Bushes
def rename(currentname, newname, year):
    mask = (result_df['Name'] == currentname) & (result_df['Year'] == year)
    result_df.loc[mask, 'Name'] = newname

In [79]:
rename("CLINTON", "H_CLINTON", "2016")
rename("CLINTON", "B_CLINTON", "1992")
rename("CLINTON", "B_CLINTON", "1996")
rename("BUSH", "BUSH_2", "2008")
rename("BUSH", "BUSH_2", "2004")
rename("BUSH", "BUSH_2", "2000")
rename("BUSH", "BUSH_1", "1988")
rename("BUSH", "BUSH_1", "1992")
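
The same renaming can be expressed as one lookup table keyed on (name, year), which keeps all the rules in one place. This is a sketch on a made-up frame; `DISAMBIGUATION` and `disambiguate` are hypothetical names, and years are assumed to be strings as in `result_df`.

```python
import pandas as pd

# (label in transcript, election year) -> unambiguous name
DISAMBIGUATION = {
    ("CLINTON", "2016"): "H_CLINTON",
    ("CLINTON", "1996"): "B_CLINTON",
    ("CLINTON", "1992"): "B_CLINTON",
    ("BUSH", "2004"): "BUSH_2",
    ("BUSH", "2000"): "BUSH_2",
    ("BUSH", "1992"): "BUSH_1",
    ("BUSH", "1988"): "BUSH_1",
}

def disambiguate(frame):
    """Return a copy of frame with shared surnames replaced; other rows are unchanged."""
    out = frame.copy()
    out["Name"] = [
        DISAMBIGUATION.get((name, year), name)
        for name, year in zip(out["Name"], out["Year"])
    ]
    return out

toy = pd.DataFrame({
    "Name": ["CLINTON", "CLINTON", "BUSH", "OBAMA"],
    "Year": ["2016", "1996", "1984", "2012"],
})
renamed = disambiguate(toy)
```

Any (name, year) pair missing from the table, such as the 1984 'BUSH' row here, is left as-is, so unhandled cases stay visible in `value_counts()`.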

In [80]:
result_df['Name'].value_counts()

Name
B_CLINTON    7
BUSH_1       7
OBAMA        6
BUSH_2       6
GORE         5
TRUMP        5
KENNEDY      4
NIXON        4
FORD         4
CARTER       4
H_CLINTON    3
KERRY        3
MCCAIN       3
ROMNEY       3
DOLE         2
DUKAKIS      2
REAGAN       2
MONDALE      2
BUSH         1
ANDERSON     1
Name: count, dtype: int64

In [81]:
# Looking at the un-renamed 'BUSH' rows
result_df[result_df['Name'] == 'BUSH']

Unnamed: 0,Year,Name,Text
67,1984,BUSH,"BUSH: Well, I don’t think there’s a great diff..."


One 'BUSH' row was left un-renamed. It is from 1984, when George H.W. Bush ran as Reagan's V.P. Since we are interested in presidential candidates, we drop this row below.

In [82]:
result_df = result_df[result_df['Name'] != "BUSH"]

Now we need to create our target column. The code below assigns 0 in a 'Target' column if the person in that row is a Democrat, and 1 otherwise.

We also drop our one independent candidate, as we do not have enough independent-candidate speech to model a third class.

In [83]:
dems = ['B_CLINTON', 'OBAMA', 'GORE', 'CARTER', 'KENNEDY', 'H_CLINTON', 'KERRY', 'DUKAKIS', 'MONDALE']
ind = ['ANDERSON']

# Drop the independent candidate, then take an explicit copy so the
# assignment below does not trigger a SettingWithCopyWarning
result_df = result_df[~result_df['Name'].isin(ind)].copy()

# 0 for Democrats, 1 for everyone else (Republicans)
result_df['Target'] = (~result_df['Name'].isin(dems)).astype(int)
result_df


Unnamed: 0,Year,Name,Text,Target
0,2016,H_CLINTON,"CLINTON: Thank you very much, Chris. And thank...",0
1,2016,H_CLINTON,"CLINTON: Well, thank you. Are you a teacher? Y...",0
2,2016,H_CLINTON,"CLINTON: How are you, Donald? [applause] CLINT...",0
3,1996,B_CLINTON,"CLINTON: I was going to applaud, too. Well, th...",0
4,1996,B_CLINTON,"CLINTON: Thank you, Jim. And thank you to the ...",0
...,...,...,...,...
153,1960,NIXON,"NIXON: Mr. Smith, Senator Kennedy. The things ...",1
158,1960,KENNEDY,"KENNEDY: Good evening, Mr. Howe. MR. KENNEDY: ...",0
159,1960,KENNEDY,"KENNEDY: Good evening, Mr. Shadel. MR. KENNEDY...",0
160,1960,KENNEDY,KENNEDY: In the first place I’ve never suggest...,0


In [84]:
result_df[result_df['Name'] == 'ANDERSON']

Unnamed: 0,Year,Name,Text,Target


In [85]:
result_df[result_df['Name'] == "REAGAN"]

Unnamed: 0,Year,Name,Text,Target
126,1980,REAGAN,REAGAN: I don’t know what the differences migh...,1
127,1980,REAGAN,REAGAN: I believe that the only unpopular meas...,1


In [86]:
result_df['Target'].value_counts()

Target
0    36
1    36
Name: count, dtype: int64

Next we remove the speaker labels (for example 'CLINTON:') from the transcripts.

In [87]:
candidates = ['CLINTON:', 'OBAMA:', 'GORE:', 'CARTER:', 'KENNEDY:', 'KERRY:', 'DUKAKIS:', 'MONDALE:', 'DOLE:', 'ANDERSON:', 'BUSH:', "TRUMP:", "NIXON:", "MCCAIN:", 'FORD:', 'ROMNEY:', 'REAGAN:']

# Iterate through each row in the "Text" column
for index, row in result_df.iterrows():
    text = row['Text']
    
    # Iterate through each candidate in the list
    for candidate in candidates:
        # Check if candidate is in the text
        if candidate in text:
            # Replace candidate with an empty string
            text = text.replace(candidate, "")
    
    # Update the "Text" column with the modified text
    result_df.at[index, 'Text'] = text

result_df['Text'][1]

' Well, thank you. Are you a teacher? Yes, I think that that’s a very good question, because I’ve heard from lots of teachers and parents about some of their concerns about some of the things that are being said and done in this campaign. And I think it is very important for us to make clear to our children that our country really is great because we’re good. And we are going to respect one another, lift each other up. We are going to be looking for ways to celebrate our diversity, and we are going to try to reach out to every boy and girl, as well as every adult, to bring them in to working on behalf of our country. I have a very positive and optimistic view about what we can do together. That’s why the slogan of my campaign is “Stronger Together,” because I think if we work together, if we overcome the divisiveness that sometimes sets Americans against one another, and instead we make some big goals—and I’ve set forth some big goals, getting the economy to work for everyone, not just
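
The label removal can also be written as a single regular-expression substitution. The sketch below runs on a made-up sentence with an illustrative subset of the labels; unlike `str.replace`, the pattern also consumes the whitespace after each label.

```python
import re

labels = ["CLINTON", "KENNEDY"]  # illustrative subset of the candidates list
label_pattern = re.compile(r"\b(?:" + "|".join(map(re.escape, labels)) + r"):\s*")

# Made-up text in the style of the 1960 transcripts ("MR. KENNEDY:")
toy = "KENNEDY: Good evening. MR. KENNEDY: Thank you."
cleaned = label_pattern.sub("", toy)
```

The stray "MR." left in `cleaned` is the same kind of residue visible in the Kennedy rows of the notebook: only the "KENNEDY:" part of "MR. KENNEDY:" matches the label list.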

In [88]:
# Lower-casing the text and assigning it to a new column
result_df['text_lower'] = result_df['Text'].str.lower()

In [90]:
type(result_df['text_lower'][1])

str

### 3.b. Webscraping & Cleaning RNC/DNC Conventions

In [None]:
conventions_df['Name'].value_counts()

Renaming Clintons and Bushes in the conventions_df

In [None]:
mask = (conventions_df['Name'] == "CLINTON") & (conventions_df['Year'] < 2010)
conventions_df.loc[mask, 'Name'] = "B_CLINTON"
mask = (conventions_df['Name'] == "CLINTON") & (conventions_df['Year'] > 2010)
conventions_df.loc[mask, 'Name'] = "H_CLINTON"
# George H.W. Bush was the nominee in 1988 and 1992; George W. Bush in 2000 and 2004
mask = (conventions_df['Name'] == "BUSH") & (conventions_df['Year'] >= 2000)
conventions_df.loc[mask, 'Name'] = "BUSH_2"
mask = (conventions_df['Name'] == "BUSH") & (conventions_df['Year'] < 2000)
conventions_df.loc[mask, 'Name'] = "BUSH_1"

In [None]:
# Checking that the Clintons and Bushes were renamed properly
conventions_df['Name'].value_counts()

### 3.c. Webscraping & Cleaning Inaugural Speeches

Creating a dataframe for scraped text from inaugural speeches website. 

In [4]:
inaugural_df = pd.DataFrame(columns=["date", "name", "text", "link"])
inauglinks = []
base_url = "https://www.presidency.ucsb.edu/documents/"
# URL of the page to scrape
url = "https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/inaugural-addresses"

# Send a GET request to the URL
response = requests.get(url)

# Create a BeautifulSoup object to parse the HTML content
soup = BeautifulSoup(response.content, 'html.parser')

# Find all the links on the page
links = soup.find_all("a")


for link in links:
    if "inaugural-address-" in str(link) or "the-presidents-inaugural-address" in str(link) or "https://www.presidency.ucsb.edu/documents/inaugural-address" in str(link):
        url = link.get("href")
        date = link.text
        page = requests.get(url)
        speech_soup = BeautifulSoup(page.content, 'html.parser')
        text = speech_soup.find("div", class_="field-docs-content").find_all("p")
        text = [t.text for t in text]
        text = " ".join(text)
        # The speaker's surname is the last hyphen-separated part of the "people" link
        person = None
        for person_link in speech_soup.find_all("a"):
            if "people" in str(person_link):
                person = str(person_link.get("href")).split("-")[-1]
        inaugural_df.loc[len(inaugural_df.index)] = [date, person, text, url]

# Drop the first 41 rows, keeping the addresses from 1961 onward
inaugural_df = inaugural_df.drop(index=range(41))
inaugural_df.head()

Unnamed: 0,date,name,text,link
41,"January 20, 1961",kennedy,"Vice President Johnson, Mr. Speaker, Mr. Chief...",https://www.presidency.ucsb.edu/documents/inau...
42,"January 20, 1965",johnson,My fellow countrymen: On this occasion the oat...,https://www.presidency.ucsb.edu/documents/the-...
43,"January 20, 1969",nixon,"Senator Dirksen, Mr. Chief Justice, Mr. Vice p...",https://www.presidency.ucsb.edu/documents/inau...
44,"January 20, 1977",carter,"For myself and for our Nation, I want to thank...",https://www.presidency.ucsb.edu/documents/inau...
45,"January 20, 1981",reagan,"Senator Hatfield, Mr. Chief Justice, Mr. Presi...",https://www.presidency.ucsb.edu/documents/inau...


Cleaning the text column and date column for analysis.

In [5]:
inaugural_df['text_lower'] = inaugural_df['text'].str.lower()
inaugural_df['date'] = [year[-4:] for year in inaugural_df['date']]

In [6]:
inaugural_df.reset_index(drop=True, inplace=True)

Creating the binary target column for the inaugural speeches dataset:

In [8]:
#inaugural_df.drop(columns='Unnamed: 0', inplace=True)
inaugural_df.rename(columns = {'date':'Year', 'name':'Name','text':'Text', 'tokens':'list_tokens'}, inplace = True)

dems = ['kennedy', 'johnson', 'carter', 'clinton', 'obama', 'biden']

# 0 for Democrats, 1 for everyone else (Republicans)
inaugural_df['Target'] = (~inaugural_df['Name'].isin(dems)).astype(int)
        
inaugural_df['Name'] = [name.upper() for name in inaugural_df['Name']]

inaugural_df.head()

Unnamed: 0,Year,Name,Text,link,text_lower,Target
0,1961,KENNEDY,"Vice President Johnson, Mr. Speaker, Mr. Chief...",https://www.presidency.ucsb.edu/documents/inau...,"vice president johnson, mr. speaker, mr. chief...",0
1,1965,JOHNSON,My fellow countrymen: On this occasion the oat...,https://www.presidency.ucsb.edu/documents/the-...,my fellow countrymen: on this occasion the oat...,0
2,1969,NIXON,"Senator Dirksen, Mr. Chief Justice, Mr. Vice p...",https://www.presidency.ucsb.edu/documents/inau...,"senator dirksen, mr. chief justice, mr. vice p...",1
3,1977,CARTER,"For myself and for our Nation, I want to thank...",https://www.presidency.ucsb.edu/documents/inau...,"for myself and for our nation, i want to thank...",0
4,1981,REAGAN,"Senator Hatfield, Mr. Chief Justice, Mr. Presi...",https://www.presidency.ucsb.edu/documents/inau...,"senator hatfield, mr. chief justice, mr. presi...",1


In [None]:
inaugural_df['Name'].value_counts()

Renaming the Clintons and Bushes

In [None]:
inaugural_df['Name'] = inaugural_df['Name'].replace(to_replace={'CLINTON':'B_CLINTON'})
# 'Year' holds strings here, so cast to int before comparing
mask = (inaugural_df['Name'] == "BUSH") & (inaugural_df['Year'].astype(int) > 1990)
inaugural_df.loc[mask, 'Name'] = "BUSH_2"
mask = (inaugural_df['Name'] == "BUSH") & (inaugural_df['Year'].astype(int) < 1990)
inaugural_df.loc[mask, 'Name'] = "BUSH_1"

In [None]:
# Checking that the Clintons and Bushes were renamed properly
inaugural_df['Name'].value_counts()

### 3.d. NLP Preprocessing all 3 Datasets

In [11]:
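tokenizer = RegexpTokenizer(r"(?u)\w{3,}") # This pattern keeps runs of at least 3 word characters (letters, digits, underscore)
# Note: this rebinds `stopwords` from the nltk.corpus module to a plain list,
# so re-running this cell without re-importing raises an AttributeError
stopwords = stopwords.words("english")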
lemmatizer = WordNetLemmatizer()

def preprocessing(text, tokenizer, stopwords, lemmatizer):
    # Expects text that has already been lower-cased (the NLTK stopword list is lower-case)

    # Tokenize
    tokens = tokenizer.tokenize(text)

    # Remove stopwords
    tokens = [token for token in tokens if token not in stopwords]

    # Lemmatize
    tokens = [lemmatizer.lemmatize(token) for token in tokens]

    return tokens
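
To illustrate the first two steps without downloading any NLTK data, here is a rough stand-in. It uses `re.findall` with the same pattern as the tokenizer, a tiny hand-written stop-word set in place of NLTK's English list, and no lemmatization; `TOY_STOPWORDS` and `toy_preprocess` are hypothetical names.

```python
import re

# Tiny hand-written stop-word set, for illustration only
TOY_STOPWORDS = {"you", "are", "that", "very"}

def toy_preprocess(text):
    # Same pattern as the RegexpTokenizer above: runs of 3+ word characters
    tokens = re.findall(r"(?u)\w{3,}", text)
    # Drop stop words; shorter words like "a", "i", "is" were already skipped by the pattern
    return [t for t in tokens if t not in TOY_STOPWORDS]

tokens = toy_preprocess(
    "well, thank you. are you a teacher? yes, i think that is a very good question"
)
```

Here `tokens` holds the seven content words of the sentence, with punctuation, short words, and the listed stop words removed.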

#### 3.d.i. NLP Preprocessing the Presidential Debates

In [95]:
# Apply the preprocessing function to the 'text_lower' column
result_df['list_tokens'] = result_df['text_lower'].apply(lambda x: preprocessing(x, tokenizer, stopwords, lemmatizer))
result_df

Unnamed: 0,Year,Name,Text,Target,text_lower,list_tokens
0,2016,H_CLINTON,"Thank you very much, Chris. And thanks to UNL...",0,"thank you very much, chris. and thanks to unl...","[thank, much, chris, thanks, unlv, hosting, kn..."
1,2016,H_CLINTON,"Well, thank you. Are you a teacher? Yes, I th...",0,"well, thank you. are you a teacher? yes, i th...","[well, thank, teacher, yes, think, good, quest..."
2,2016,H_CLINTON,"How are you, Donald? [applause] Well, thank ...",0,"how are you, donald? [applause] well, thank ...","[donald, applause, well, thank, lester, thanks..."
3,1996,B_CLINTON,"I was going to applaud, too. Well, thank you,...",0,"i was going to applaud, too. well, thank you,...","[going, applaud, well, thank, jim, thanks, peo..."
4,1996,B_CLINTON,"Thank you, Jim. And thank you to the people o...",0,"thank you, jim. and thank you to the people o...","[thank, jim, thank, people, hartford, host, wa..."
...,...,...,...,...,...,...
153,1960,NIXON,"Mr. Smith, Senator Kennedy. The things that S...",1,"mr. smith, senator kennedy. the things that s...","[smith, senator, kennedy, thing, senator, kenn..."
158,1960,KENNEDY,"Good evening, Mr. Howe. MR. Mr. Howe, Mr. Vi...",0,"good evening, mr. howe. mr. mr. howe, mr. vi...","[good, evening, howe, howe, vice, president, f..."
159,1960,KENNEDY,"Good evening, Mr. Shadel. MR. Mr. McGee, we ...",0,"good evening, mr. shadel. mr. mr. mcgee, we ...","[good, evening, shadel, mcgee, contractual, ri..."
160,1960,KENNEDY,In the first place I’ve never suggested that ...,0,in the first place i’ve never suggested that ...,"[first, place, never, suggested, cuba, lost, e..."


In [96]:
type(result_df['list_tokens'][0])

list

In [None]:
# making the tokens list of strings into one big string 
result_df['string_tokens'] = result_df['list_tokens'].apply(lambda x: ' '.join(x))
result_df['string_tokens'][0]

In [None]:
# Saving df as a csv
result_df.to_csv('../data/result_df.csv')

#### 3.d.ii. NLP Preprocessing the RNC/DNC Conventions

#### 3.d.iii. NLP Preprocessing the Inaugural Speeches

We run the preprocessing function defined earlier on the inaugural speeches dataframe.

In [12]:
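# Store the tokens under 'list_tokens', the column name shared with the other two datasets
inaugural_df['list_tokens'] = inaugural_df['text_lower'].apply(lambda x: preprocessing(x, tokenizer, stopwords, lemmatizer))

In [15]:
inaugural_df['list_tokens'][0][0:10]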

['vice',
 'president',
 'johnson',
 'speaker',
 'chief',
 'justice',
 'president',
 'eisenhower',
 'vice',
 'president']

### 3.e. Combining all 3 Datasets

We need to make sure all three datasets' columns of interest have the same names so that we can join them together properly

In [None]:
result_df.info()

In [None]:
result_df.rename(columns = {'tokens':'string_tokens'}, inplace = True)
result_df.head()

In [None]:
conventions_df.info()

In [None]:
conventions_df = pd.read_csv('../data/conventions.csv')
conventions_df.rename(columns = {'date':'Year', 'speaker':'Name','text':'Text', 'convention':'Target','tokens':'list_tokens'}, inplace = True)
conventions_df['Target'] = conventions_df['Target'].replace(to_replace={'DNC':0, 'RNC':1})
conventions_df.loc[(conventions_df['Name'] == 'johnson') & (conventions_df['Year'] == 1964), 'Target'] = 0
conventions_df['Name'] = [name.upper() for name in conventions_df['Name']]
# conventions_df.iloc[14]['Target'] = 0
# conventions_df[conventions_df['Target'].isna()]
conventions_df.head()

Now that the columns have consistent names, we can concatenate them all by those shared columns

In [None]:
final_df = pd.concat([result_df, conventions_df, inaugural_df], axis=0)
final_df

In [None]:
final_df.isna().sum()

It is okay that we have nulls in the 'text_lower' and 'link' columns, as only our presidential debates (result_df) dataset had the former, and only our conventions and inaugural speeches datasets had the latter. They are not necessary for our analysis.

However, we do want every row to have its tokens in the form of a single string: i.e., we want no nulls in the 'string_tokens' column. So we first parse any 'list_tokens' values that were read in as strings back into lists, then rebuild 'string_tokens' by joining each token list with spaces.

In [None]:
import ast

# Function to convert a stringified list (e.g. "['a', 'b']") back into a list
def convert_string_to_list(string):
    try:
        return ast.literal_eval(string)
    except (ValueError, SyntaxError):
        # Not a parsable literal (e.g. the value is already a list): return it unchanged
        return string

# Apply this function to the desired column
final_df['list_tokens'] = final_df['list_tokens'].apply(convert_string_to_list)

In [None]:
final_df['string_tokens']= [" ".join(x) for x in final_df['list_tokens']]
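
The conversion is needed because a list written to CSV is stored as its text representation. The sketch below reproduces that round trip with the standard library only; the token values are made up for illustration, and the `csv` module stands in for the pandas read/write that the notebook presumably used.

```python
import ast
import csv
import io

tokens = ["vice", "president", "johnson"]

# Writing a list to CSV stores str(tokens), so reading it back yields a string.
buffer = io.StringIO()
csv.writer(buffer).writerow([tokens])
buffer.seek(0)
raw = next(csv.reader(buffer))[0]

# literal_eval safely parses the text back into a real list.
restored = ast.literal_eval(raw)

# Joining the list gives the space-separated form a vectorizer expects.
joined = " ".join(restored)
print(type(raw).__name__, restored, joined)
```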

In [None]:
final_df.isna().sum()

We will not save this final dataset until we inspect the most common words and confirm that our stopwords list is as complete as possible. 

### 3.f. EDA on preprocessed debate dataset

Now that the data is prepared for vectorization, we define plotting functions to perform EDA.

In [None]:
# Create a frequency distribution for each speaker
speaker_freq_dist = {}
for speaker in result_df['Name'].unique():
    tokens = [token for sublist in result_df[result_df['Name'] == speaker]['tokens'] for token in sublist]

    freq_dist = FreqDist(tokens)
    
    # Check if the frequency distribution is not empty
    if freq_dist and freq_dist.N():
        # Get the top 10 tokens
        top_tokens = freq_dist.most_common(10)
        
        # Create a frequency distribution for the top 10 tokens
        top_freq_dist = FreqDist(dict(top_tokens))
        speaker_freq_dist[speaker] = top_freq_dist

# Plot the frequency distribution for each speaker using a line graph
for speaker, freq_dist in speaker_freq_dist.items():
    plt.figure(figsize=(10, 6))
    
    # Extract words and frequencies
    words, frequencies = zip(*freq_dist.items())
    tickvals = range(0,len(words))
    
    # Use Pandas Series plot function with kind='line'
    pd.Series(frequencies, index=words).plot(kind='line', marker='o', linestyle='-', color='b')
    
    plt.title(f"Top 10 Word Usage Frequency Distribution for {speaker}")
    plt.xlabel("words")
    plt.ylabel("Frequency")
    plt.xticks(ticks=tickvals, labels= words, rotation=45)
    plt.show()
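
NLTK's `FreqDist` is a subclass of `collections.Counter`, so the counting step above can be sketched without plotting or NLTK. The rows and the helper name `top_tokens_by_speaker` below are hypothetical and only illustrate the per-speaker aggregation.

```python
from collections import Counter

# Hypothetical rows: (speaker, list of tokens from one speech segment)
rows = [
    ("SPEAKER_A", ["tax", "job", "tax"]),
    ("SPEAKER_A", ["job", "tax", "family"]),
    ("SPEAKER_B", ["border", "job"]),
]

def top_tokens_by_speaker(rows, n=10):
    """Pool each speaker's tokens and return their n most frequent ones."""
    counts = {}
    for speaker, tokens in rows:
        counts.setdefault(speaker, Counter()).update(tokens)
    return {speaker: c.most_common(n) for speaker, c in counts.items()}

top = top_tokens_by_speaker(rows, n=2)
print(top["SPEAKER_A"])
```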

In [None]:
def plot_speaker_comparison(speaker1, speaker2, df, num_words=15):
    """
    Plot the top 'num_words' used by two speakers side by side.

    :param speaker1: Name of the first speaker.
    :param speaker2: Name of the second speaker.
    :param df: DataFrame containing the speakers and tokens.
    :param num_words: Number of top words to plot (default is 15).
    """
    
    # Initialize subplots
    fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(15, 6))

    # Loop over speakers
    for idx, speaker in enumerate([speaker1, speaker2]):
        # Extract tokens for the speaker
        tokens = [token for token_list in df[df['Name'] == speaker]['tokens'] for token in token_list]

        # Create frequency distribution and get top tokens
        freq_dist = FreqDist(tokens)
        top_tokens = freq_dist.most_common(num_words)

        # Extract words and frequencies
        words, frequencies = zip(*top_tokens)

        # Plot
        axes[idx].bar(words, frequencies)
        axes[idx].set_title(f"Top {num_words} Words for {speaker}")
        axes[idx].tick_params(axis='x', rotation=45)
        axes[idx].set_ylabel("Frequency")

    plt.tight_layout()
    plt.show()

# Example usage
plot_speaker_comparison('H_CLINTON', 'TRUMP', result_df)

### 3.g. Additional Stopwords Removal

In [None]:
# Nate's code here

In [None]:
add_stopwords = ['mccain','bush','donald','romney','ford','nixon','george','john','dole','dan','richard','reagan','trump','quayle','jim','obama','hillary','joe','clinton','carter','khrushchev','kennedy','biden','crosstalk','bernie','sander']

In [None]:
# Function to remove stopwords from a string
def remove_stopwords(text):
    words = text.split()
    filtered_words = [word for word in words if word not in add_stopwords]
    return ' '.join(filtered_words)

# Apply the remove_stopwords function to the 'string_tokens' column
final_df['string_tokens'] = final_df['string_tokens'].apply(remove_stopwords)

# Display the updated DataFrame
final_df.head()
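
As a variant, the same filter can use a set for constant-time membership checks. The sketch below uses an abbreviated, illustrative stopword set and a hypothetical helper name; it also shows that only whole tokens are removed, never substrings.

```python
# Abbreviated, illustrative subset of the additional stopwords.
EXTRA_STOPWORDS = frozenset(["trump", "clinton", "crosstalk"])

def strip_extra_stopwords(text, stopwords=EXTRA_STOPWORDS):
    """Drop whole tokens that appear in the stopword set."""
    return " ".join(word for word in text.split() if word not in stopwords)

cleaned = strip_extra_stopwords("clinton said tax crosstalk trump job")
print(cleaned)

# Matching is per token, so a longer word containing a stopword is kept.
print(strip_extra_stopwords("trumpet trump"))
```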


Finally, we can save our final dataset.

In [None]:
final_df.to_csv('../data/final_df.csv', index=False)

## 4. Modelling

In [None]:
final_df.isna().sum()

In [None]:
final_df['string_tokens'][1]

First, we define some helper functions that make it easier to evaluate the grid searches for each model type.

In [None]:
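def evaluate_grid(gs):
    '''
    Prints the best parameters, best cross-validation accuracy, and train accuracy
    of a fitted grid search. Relies on the global X_train and y_train.
    '''
    print("Best Params: " + str(gs.best_params_))
    print("Best CV Accuracy: " + str(gs.best_score_))
    print("Train Accuracy: " + str(gs.score(X_train, y_train)))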
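
One possible variant passes the training data explicitly and returns the numbers instead of printing them, which makes the helper independent of notebook globals. `summarize_grid` and `_FakeSearch` are hypothetical names; the stand-in class only mimics the three members of a fitted `GridSearchCV` that the helper reads, so the sketch runs without scikit-learn.

```python
def summarize_grid(gs, X_train, y_train):
    """Collect the key numbers from a fitted search into a dict."""
    return {
        "best_params": gs.best_params_,
        "best_cv_accuracy": gs.best_score_,
        "train_accuracy": gs.score(X_train, y_train),
    }

class _FakeSearch:
    """Stand-in for a fitted search object; not a scikit-learn class."""
    best_params_ = {"tfidf__norm": None}
    best_score_ = 0.5

    def score(self, X, y):
        # Pretend the model always predicts class 0.
        return sum(1 for label in y if label == 0) / len(y)

summary = summarize_grid(_FakeSearch(), ["speech one", "speech two"], [0, 1])
print(summary)
```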

In [None]:
def plot_cm(y_test, gs):
    '''
    Takes in true values and a fitted grid search, predicts on the global X_test,
    and plots a confusion matrix
    '''
    y_pred = gs.predict(X_test)    
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['dem', 'ind', 'rep'])
    disp.plot();

In [None]:
# Train Test split
X_train, X_test, y_train, y_test = train_test_split(final_df["string_tokens"], final_df['Target'], random_state=42)

### 4.a. Multinomial Naive Bayes Model

In [None]:
pipe = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()), 
    ('mnb', MultinomialNB())])
pipe.fit(X_train, y_train)

In [None]:
cross_val_score(pipe, X_train, y_train)

In [None]:
pipe.score(X_test, y_test)

#### 4.a.i. MNB Grid Searches

In [None]:
gs = GridSearchCV(pipe, param_grid= {
    'tfidf__max_df': [0.75, 0.9, 1.0], # default 1.0
    'tfidf__min_df': [0.0, 0.05, 0.1], # default 1
    'tfidf__ngram_range': [(1, 1), (1, 2)], # default (1,1)
    'tfidf__max_features': [None, 2, 20], # default (None)
    'tfidf__norm': ['l1', 'l2', None]
})
gs.fit(X_train, y_train)
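
Float values of `max_df` and `min_df` are interpreted as proportions of documents, while the integer default `min_df=1` is an absolute count. The pure-Python sketch below imitates that filtering on a made-up corpus; `filter_vocabulary` is a hypothetical helper for illustration and is not how `TfidfVectorizer` is implemented internally.

```python
from collections import Counter

def filter_vocabulary(docs, min_df=0.0, max_df=1.0):
    """Keep terms whose document frequency, as a fraction, lies in [min_df, max_df]."""
    n_docs = len(docs)
    doc_freq = Counter()
    for doc in docs:
        doc_freq.update(set(doc.split()))  # count each term once per document
    low, high = min_df * n_docs, max_df * n_docs
    return sorted(term for term, df in doc_freq.items() if low <= df <= high)

# Made-up corpus: "america" is in every document, "border" and "family" in one each.
docs = ["america job tax", "america job border", "america family", "america tax"]

kept = filter_vocabulary(docs, min_df=0.5, max_df=0.75)
print(kept)
```

Here `max_df=0.75` drops the term that appears in every document and `min_df=0.5` drops the terms that appear in only one, which is the trade-off the grid explores.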

In [None]:
evaluate_grid(gs)

In [None]:
plot_cm(y_test, gs)

In [None]:
gs2 = GridSearchCV(pipe, param_grid= {
    'tfidf__max_df': [0.95, 1.0], # default 1.0 was best
    'tfidf__min_df': [0.0, 0.05], # default 1, 0 was best
    'tfidf__ngram_range': [(2, 2), (1, 2)], # default (1,1)
    'tfidf__max_features': [None, 1], # default (None) was best
    'tfidf__norm': ['l1', 'l2', None] # best was None
})
gs2.fit(X_train, y_train)

In [None]:
evaluate_grid(gs2)

In [None]:
plot_cm(y_test, gs2)

In [None]:
gs3 = GridSearchCV(pipe, param_grid= {
    'tfidf__max_df': [0.95, 1.0], # earlier 0.95 was best
    'tfidf__min_df': [0.05, 0.0], # 0.05 was best
    'tfidf__ngram_range': [(2,2), (1,2), (1,3)], # default (1,1)
    'tfidf__max_features': [None, 1], # default (None) was best
    'tfidf__norm': [None] # None was best
})
gs3.fit(X_train, y_train)

In [None]:
evaluate_grid(gs3)

In [None]:
plot_cm(y_test, gs3)

In [None]:
gs4 = GridSearchCV(pipe, param_grid= {
    'tfidf__max_df': [0.3, 0.35, 0.4, 0.45], # earlier 0.4 was best
    'tfidf__min_df': [0.0], # 0.0 was best
    'tfidf__ngram_range': [(2,2), (3,3)], # default (1,1)
    'tfidf__max_features': [None, 2, 20], # default (None) was best
    'tfidf__norm': [None] # None was best
})
gs4.fit(X_train, y_train)


In [None]:
evaluate_grid(gs4)

In [None]:
plot_cm(y_test, gs4)

Scores did not improve past grid search 2, so we select it as our best model and score it on the test data.

In [None]:
gs2.score(X_test, y_test)

### 4.b. Gaussian Bayes Model

#### 4.b.i. GB Grid Searches

### 4.c. Random Forest Model

In [None]:
from sklearn.ensemble import RandomForestClassifier

pipe3 = Pipeline(steps=[
    ('tfidf', TfidfVectorizer()), 
    ('rf', RandomForestClassifier(random_state=42))])
pipe3.fit(X_train, y_train)

In [None]:
pipe3.score(X_train, y_train)

In [None]:
cross_val_score(pipe3, X_train, y_train)

#### 4.c.i. RF Grid Searches

In [None]:
gs_rf = GridSearchCV(pipe3, param_grid = {
        'rf__max_depth': [None, 5, 10],
        'rf__min_samples_split': [2, 5, 10],
        'rf__min_samples_leaf': [1, 5, 10],
        'rf__n_estimators': [100, 200, 300],
})
gs_rf.fit(X_train, y_train)

In [None]:
evaluate_grid(gs_rf)

In [None]:
gs2_rf = GridSearchCV(pipe3, param_grid = {
        'rf__max_depth': [None, 1],
        'rf__min_samples_split': [2, 3], # 1 removed: an integer value must be at least 2
        'rf__min_samples_leaf': [4, 5, 6],
        'rf__n_estimators': [190, 200, 210],
})
gs2_rf.fit(X_train, y_train)

In [None]:
evaluate_grid(gs2_rf)

In [None]:
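final_rf = gs2_rf.best_estimator_
# best_estimator_ is the whole pipeline, so read the importances from its 'rf' step
final_rf.named_steps['rf'].feature_importances_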
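
The importances array is positional, so it is only readable once each value is paired with its term. A minimal sketch of that pairing, using made-up names and values; `top_features` is a hypothetical helper, not part of scikit-learn.

```python
def top_features(names, importances, k=10):
    """Pair each feature name with its importance and return the k largest."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    ranked = sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

# Made-up vocabulary and importances for illustration.
names = ["america", "job", "tax", "border"]
importances = [0.1, 0.4, 0.2, 0.3]

best = top_features(names, importances, k=2)
print(best)
```

With the fitted pipeline, the names would presumably come from the 'tfidf' step via `get_feature_names_out()` (scikit-learn 1.0 or later) and the values from the 'rf' step's `feature_importances_`.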

## 5. Final Results: Multinomial Naive Bayes Model

## 6. Next Steps

- Model again using more data and stricter processing
- Track changes in candidate rhetoric
- Deployment

Nate

So what about going forward?

First, the model would benefit from a larger training dataset. In particular, it would be helpful to pull in campaign stops and other less formal speaking occasions. Including candidates who sought their party's nomination but did not win it would also be worthwhile. It is worth considering other political rhetoric as well, not merely that of those seeking presidential office, although that may go beyond the scope of this particular dataset and model.

With our trained model, there are other analyses worth pursuing. To name a few: how much does rhetoric change before and after a politician becomes his or her party's nominee? What about once they win the election? And how much does context affect rhetoric: a town hall, versus cable news, versus a formal press conference, and so on?

Finally, we would want to allow others to make use of this model as they see fit.
