# Final Dataframe Cleanup

The purpose of this notebook is to add speaker-type columns to the final_speaker_df.pickle file created in secondary_text_clean_speakers.ipynb. It also contains the text preprocessing that prepares this data for NLP.

Importing packages:

In [1]:
import nltk
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Loading the pickled data:

In [2]:
with open('Data/final_speaker_df.pickle','rb') as read_file:
    speaker_df = pickle.load(read_file)

Cleaning the Speaker column to make the speakers easier to identify:
- make everything lowercase
- remove common titles like sen., mr., ms., mrs.

In [3]:
for i, speaker in enumerate(speaker_df.Speaker):
    lower_speaker = speaker.lower().strip(".,")
    # Remove titles, then write back to column 3 (Speaker).
    for title in ("mr. ", "ms. ", "mrs. ", "sen. ", "gov. ", "mayor ", "rep. "):
        lower_speaker = lower_speaker.replace(title, "")
    speaker_df.iloc[i, 3] = lower_speaker
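
The same cleanup can also be written with pandas string methods, which avoids writing back row by row. This is a minimal sketch on a made-up `sample` frame; `clean_speaker` and `TITLES` are hypothetical names, not part of the original notebook.

```python
import pandas as pd

# Hypothetical sample; the notebook itself works on speaker_df.Speaker.
sample = pd.DataFrame(
    {"Speaker": ["MR. BLITZER", "Sen. Obama.", "LEHRER", "Mayor Giuliani"]}
)

# The same titles the cleanup cell removes, in the same order.
TITLES = ["mr. ", "ms. ", "mrs. ", "sen. ", "gov. ", "mayor ", "rep. "]


def clean_speaker(names: pd.Series) -> pd.Series:
    """Lowercase, trim leading/trailing '.' and ',', then remove titles."""
    cleaned = names.str.lower().str.strip(".,")
    for title in TITLES:
        # regex=False so the '.' in each title is matched literally.
        cleaned = cleaned.str.replace(title, "", regex=False)
    return cleaned


sample["Speaker"] = clean_speaker(sample["Speaker"])
print(sample["Speaker"].tolist())
```

Because the column is addressed by name, this version does not depend on Speaker sitting at a particular column position.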

In [4]:
speaker_df.head()

Unnamed: 0,Date,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type
0,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Good evening, and welcome to the first...",lehrer,Commission for Presidential Debates,General-President
1,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,PEROT: I think the principal that separates me...,perot,Commission for Presidential Debates,General-President
2,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Governor Clinton, a one minute response.",lehrer,Commission for Presidential Debates,General-President
3,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,CLINTON: The most important distinction in thi...,clinton,Commission for Presidential Debates,General-President
4,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: President Bush, one minute response, sir.",lehrer,Commission for Presidential Debates,General-President


Adding a year column, since that will help with filtering:

In [5]:
speaker_df['Date_Time_Date'] = pd.to_datetime(speaker_df['Date'])
speaker_df.drop(columns=['Date'], inplace=True)

In [6]:
speaker_df['Year'] = speaker_df.Date_Time_Date.dt.year
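
As a small illustration of the date step, the sketch below parses dates written in the same "Month day, year" style shown in the Date column and extracts the year. The explicit `format` string is my assumption about how every date is written; the notebook itself lets pandas infer the format.

```python
import pandas as pd

# Hypothetical sample in the same style as the Date column.
dates = pd.Series(["October 11, 1992", "June 3, 2007"])

# With an explicit format, a date that does not match raises an error
# instead of being parsed by inference.
parsed = pd.to_datetime(dates, format="%B %d, %Y")
years = parsed.dt.year
print(years.tolist())
```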

In [7]:
speaker_df.head()

Unnamed: 0,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type,Date_Time_Date,Year
0,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Good evening, and welcome to the first...",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992
1,The First Clinton-Bush-Perot Presidential Deb...,PEROT: I think the principal that separates me...,perot,Commission for Presidential Debates,General-President,1992-10-11,1992
2,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Governor Clinton, a one minute response.",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992
3,The First Clinton-Bush-Perot Presidential Deb...,CLINTON: The most important distinction in thi...,clinton,Commission for Presidential Debates,General-President,1992-10-11,1992
4,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: President Bush, one minute response, sir.",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992


In [8]:
speaker_df.shape

(76700, 7)

Removing Transcript rows with "(Applause)" or "(laughter)" as the only text:

In [9]:
del_rows = []
for i, transcript in enumerate(speaker_df.Transcript):
    text = transcript.lower()
    if text in ('(laughter)', '(applause)', '(applause.)'):
        # drop() takes index labels, so convert the position to its label.
        del_rows.append(speaker_df.index[i])
speaker_df.drop(del_rows, inplace=True)
speaker_df.shape

(76657, 7)
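
The same filter can also be expressed as a boolean mask, which avoids building a list of rows to drop and does not depend on how the index is numbered. A minimal sketch on a made-up `sample` frame with a deliberately non-default index; `NOISE` is a hypothetical name for the three markers checked in the cell above.

```python
import pandas as pd

# Hypothetical sample; the index starts at 10 to show that the mask
# works regardless of index labels.
sample = pd.DataFrame(
    {
        "Transcript": [
            "(APPLAUSE)",
            "Good evening.",
            "(Laughter)",
            "Thank you. (Applause)",
        ]
    },
    index=[10, 11, 12, 13],
)

# The three exact-match markers checked in the notebook cell.
NOISE = {"(laughter)", "(applause)", "(applause.)"}

# True only where the whole transcript is one of the markers.
is_noise = sample["Transcript"].str.lower().isin(NOISE)
cleaned = sample[~is_noise]
print(cleaned.index.tolist())
```

Rows that merely contain "(Applause)" alongside other text are kept, matching the exact-match comparison in the cell above.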

## Updating Initial Data:

To help see the effects of NLP, I am going to add a column for speaker type (candidate, moderator, etc.).

In [10]:
speaker_df['Speaker_Type'] = 'none_listed'
speaker_df.head()

Unnamed: 0,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type,Date_Time_Date,Year,Speaker_Type
0,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Good evening, and welcome to the first...",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992,none_listed
1,The First Clinton-Bush-Perot Presidential Deb...,PEROT: I think the principal that separates me...,perot,Commission for Presidential Debates,General-President,1992-10-11,1992,none_listed
2,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: Governor Clinton, a one minute response.",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992,none_listed
3,The First Clinton-Bush-Perot Presidential Deb...,CLINTON: The most important distinction in thi...,clinton,Commission for Presidential Debates,General-President,1992-10-11,1992,none_listed
4,The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: President Bush, one minute response, sir.",lehrer,Commission for Presidential Debates,General-President,1992-10-11,1992,none_listed


In [11]:
speaker_df.Speaker.value_counts().head(20)

clinton          3337
romney           2798
trump            2432
biden            2407
sanders          2256
mccain           2127
blitzer          2091
obama            2071
paul             1652
bush             1590
cooper           1412
edwards          1321
gore             1296
santorum         1287
moderator        1275
huckabee         1040
king             1029
the president    1006
wallace          1000
gingrich          971
Name: Speaker, dtype: int64

### Note: the cells below are run iteratively. I add the names that come up to the right category, then re-run to see which new names appear.

This list contains a mix of candidates and moderators. I'll create lists of common Republicans, Democrats, moderators, and independents from these, to at least get those taken care of:

In [42]:
republicans = ['steve forbes','john mccain','kemp','orrin hatch','gary bauer','alan keyes','j. king','reagan','romney. ','nixon','ford','amb. keyes','walker','sen thompson','pawlenty','gilmore','rep. tancredo','palin','tancredo','brownback','hunter','romney.','rep. hunter','mayor giuliani','hatch','quayle','rep. paul','pataki','cheney','senator dole','huntsman','bauer','thompson','jindal','trump', 'romney', 'paul', 'santorum', 'mccain', 'bush', 'gingrich', 'rubio', 'huckabee', 'cruz', 'giuliani', 'kasich', 'bachmann', 'keyes', 'christie', 'forbes', 'cain', 'fiorina', 'pence', 'carson', 'graham', 'perry', 'president bush']
democrats = ['bentsen','ferraro','swalwell','clark','bennet','chafee','hickenlooper','kennedy','webb','dean','williamson','carter','de blasio','inslee','delaney','kerry','gillibrand','gabbard','sharpton','lieberman','dukakis','senator obama','senator clinton','mondale','castro','bloomberg',"o'rourke",'gravel','yang','rep. kucinich','steyer','booker','richardson','bradley','kaine','senator kerry','clinton', 'sanders', 'obama', 'gore', 'edwards', 'biden', 'warren', 'klobuchar', 'buttigieg', ' obama', ' clinton', "o'malley", 'dodd', 'kucinich']
moderators = ['epperson',"o'brien",'frederick','moyers','ferrechio','q. ','member of audience','calderon','mcelveen','greenfield','goldman','parker','brennan','george','vanocur','(unknown)','hauc','s. king','alcindor','york','davis','distaso','shadel','mcgee','demint','malveaux','arrarás','jackson','nawaz','bullock','baker','hook', 'novak','salinas','lacey','spradling','tumulty','pfannenstiel', 'phillip','newman','alberta','burnett','bachman','announcer','schieffer. ','cameron','unidentified male','cokie roberts','q','question','mitchell','stanton','gregory','crowley','vaughn','koppel',"o'donnell",'sawyer','ramos','bruno','page','matthews','seib','quijano','pelley','garrett','walters','regan','smith','goler','unknown','brown','maccallum','hemmer','cavuto','quick','quintanilla','griffith','rose','welker','schieffer','jennings','guthrie','harwood','washburn','hewitt','ryerson','dickerson','cuomo','lemon','olbermann','ifill','smiley','maddow','bartiromo','williams','diaz-balart','shaw','gibson','woodruff','bash','cooper', 'blitzer', 'moderator', 'king', 'wallace', 'tapper', 'baier', 'mr. blitzer', 'russert', 'stephanopoulos', 'lehrer', 'holt', 'hume', 'todd', 'brokaw', 'muir', 'raddatz', 'kelly']
independents = ['stockdale','perot']

Loop specifically for the speaker label "the president", which is resolved by debate year:

In [43]:
# Column positions used below: 2 = Speaker, 6 = Year, 7 = Speaker_Type.
for i, speaker in enumerate(speaker_df.Speaker):
    if speaker == 'the president':
        if speaker_df.iloc[i, 6] == 1984:
            speaker_df.iloc[i, 7] = 'Republican'
            speaker_df.iloc[i, 2] = 'reagan'
        elif speaker_df.iloc[i, 6] == 1996:
            speaker_df.iloc[i, 7] = 'Democrat'
            speaker_df.iloc[i, 2] = 'clinton'
        elif speaker_df.iloc[i, 6] == 2012:
            speaker_df.iloc[i, 7] = 'Democrat'
            speaker_df.iloc[i, 2] = 'obama'
        else:
            speaker_df.iloc[i, 7] = 'Republican'
            speaker_df.iloc[i, 2] = 'trump'
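
The year-based branches can also be written as lookup tables, which keeps the year-to-name pairs in one place. This is a sketch on a made-up `sample` frame; `NAME_BY_YEAR` and `PARTY_BY_YEAR` are hypothetical names, and the fallback mirrors the loop's `else` branch, which treats every other year as Trump.

```python
import pandas as pd

# Hypothetical sample rows; the notebook itself works on speaker_df.
sample = pd.DataFrame(
    {
        "Speaker": ["the president", "the president", "lehrer", "the president"],
        "Year": [1984, 2012, 1996, 2020],
        "Speaker_Type": ["none_listed"] * 4,
    }
)

# Year -> name and year -> party, copied from the branches in the loop.
NAME_BY_YEAR = {1984: "reagan", 1996: "clinton", 2012: "obama"}
PARTY_BY_YEAR = {1984: "Republican", 1996: "Democrat", 2012: "Democrat"}

# Compute the mask once, before the Speaker column is overwritten.
mask = sample["Speaker"] == "the president"
years = sample.loc[mask, "Year"]

# Years missing from the dicts map to NaN, then fall back like the else branch.
sample.loc[mask, "Speaker"] = years.map(NAME_BY_YEAR).fillna("trump")
sample.loc[mask, "Speaker_Type"] = years.map(PARTY_BY_YEAR).fillna("Republican")
print(sample)
```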

Looping through above lists:

In [44]:
for i, speaker in enumerate(speaker_df.Speaker):
    #Kamala Harris and a moderator:
    if speaker == 'harris':
        if speaker_df.iloc[i, 6] == 2011:
            speaker_df.iloc[i, 7] = 'Moderator/Other'
        else:
            speaker_df.iloc[i, 7] = 'Democrat'
    #Paul Ryan in 2012, Tim Ryan in 2019:
    if speaker == 'ryan':
        if speaker_df.iloc[i, 6] == 2012:
            speaker_df.iloc[i, 7] = 'Republican'
        else:
            speaker_df.iloc[i, 7] = 'Democrat'
    #Looping through above lists:
    if speaker in republicans:
        speaker_df.iloc[i,7] = 'Republican'
    elif speaker in democrats:
        speaker_df.iloc[i,7] = 'Democrat'
    elif speaker in moderators:
        speaker_df.iloc[i,7] = 'Moderator/Other'
    elif speaker in independents:
        speaker_df.iloc[i, 7] = 'Independent'
    #else:
    #    speaker_df.iloc[i,7] = 'Moderator/Other'
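
For the list-based part of this step, one alternative is to fold the lists into a single dictionary and use `Series.map`. The sketch below uses shortened, made-up versions of the lists and a made-up `sample` frame, and it leaves out the year-dependent `harris` and `ryan` cases, which still need their own handling.

```python
import pandas as pd

# Shortened, hypothetical versions of the lists defined above.
republicans = ["trump", "romney"]
democrats = ["clinton", "obama"]
moderators = ["lehrer", "blitzer"]
independents = ["perot"]

# Lowest priority first: if a name appeared in two lists, the later group
# overwrites the earlier one, matching the if/elif order of the loop.
groups = [
    ("Independent", independents),
    ("Moderator/Other", moderators),
    ("Democrat", democrats),
    ("Republican", republicans),
]
speaker_type = {name: label for label, names in groups for name in names}

sample = pd.DataFrame(
    {"Speaker": ["clinton", "lehrer", "perot", "kilminster", "trump"]}
)

# Names not found in any list keep the 'none_listed' placeholder.
sample["Speaker_Type"] = sample["Speaker"].map(speaker_type).fillna("none_listed")
print(sample)
```

A dictionary lookup also avoids scanning the long lists once per row.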

In [45]:
speaker_df['Speaker_Type'].value_counts()

Republican         26877
Moderator/Other    24992
Democrat           22427
none_listed         2274
Independent           87
Name: Speaker_Type, dtype: int64

Filtering to the rows that are still unlabeled to see which names remain:

In [46]:
filter_df = speaker_df[speaker_df['Speaker_Type'] == 'none_listed']

In [48]:
filter_df.Speaker.value_counts().tail(25)

kilminster                                             1
mandy garber                                           1
greenberg                                              1
tax rates/job creation                                 1
hillary clinton                                        1
role of the federal government                         1
gergen                                                 1
2012 presidential election/perception of candidates    1
singiser                                               1
small-business promotion/education                     1
number two                                             1
israel/the president's foreign policy agenda           1
alonso                                                 1
gen. chuck yeager (usaf, ret.)                         1
nikki washington                                       1
gloria borger                                          1
health care for senior citizens/medicare               1
sander vanocur                 

**Below is a general template: plug in a name and filter to determine which bucket it belongs in:**

In [None]:
filter_df[filter_df.Speaker == "kemp"].Date_Time_Date.value_counts()

In [None]:
filter_df[filter_df.Speaker == "kemp"].Debate_Name.value_counts()

In [None]:
filter_df[filter_df.Speaker == '(applause)'].Data_Source.value_counts()
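
The lookups above can be combined into one helper that summarizes where a given speaker label appears. This is a sketch only; `describe_speaker` is a hypothetical helper and the sample values are made up.

```python
import pandas as pd


def describe_speaker(df: pd.DataFrame, name: str) -> pd.DataFrame:
    """Count rows per (Year, Debate_Name, Data_Source) for one speaker label."""
    rows = df[df["Speaker"] == name]
    return (
        rows.groupby(["Year", "Debate_Name", "Data_Source"])
        .size()
        .reset_index(name="rows")
    )


# Hypothetical sample with the same column names as filter_df.
sample = pd.DataFrame(
    {
        "Speaker": ["kemp", "kemp", "gore", "kemp"],
        "Year": [1996, 1996, 1996, 1988],
        "Debate_Name": ["Debate A", "Debate A", "Debate A", "Debate B"],
        "Data_Source": ["Source X"] * 4,
    }
)

summary = describe_speaker(sample, "kemp")
print(summary)
```

Seeing the year, debate, and source together is usually enough to tell whether a name belongs to a candidate or a moderator.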

In [41]:
filter_df[filter_df.Speaker == ""]

Unnamed: 0,Debate_Name,Transcript,Speaker,Data_Source,Debate_Type,Date_Time_Date,Year,Speaker_Type
27417,"Republican Candidates Debate in Cleveland, Ohio",To treat the mentally ill. Ten thousand of the...,,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27418,"Republican Candidates Debate in Cleveland, Ohio","Secondly, we are rehabbing the drug-addicted. ...",,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27419,"Republican Candidates Debate in Cleveland, Ohio",So we're treating them and getting them on the...,,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27420,"Republican Candidates Debate in Cleveland, Ohio",And do you know what?,,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27421,"Republican Candidates Debate in Cleveland, Ohio",Everybody has a right to their God-given purpose.,,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27422,"Republican Candidates Debate in Cleveland, Ohio","And finally, our Medicaid is growing at one of...",,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
27430,"Republican Candidates Debate in Cleveland, Ohio",And there should be a path to earned legal sta...,,American Presidency Project,Primary-Republican,2015-08-06,2015,none_listed
53003,"Democratic Candidates Debate in Manchester, Ne...",MR. : Can you put your mike up?,,American Presidency Project,Primary-Democrat,2007-06-03,2007,none_listed
53005,"Democratic Candidates Debate in Manchester, Ne...",MR. : Hard to --,,American Presidency Project,Primary-Democrat,2007-06-03,2007,none_listed
53006,"Democratic Candidates Debate in Manchester, Ne...",MR. : Can't hear you.,,American Presidency Project,Primary-Democrat,2007-06-03,2007,none_listed
