# Text Cleaning
The purpose of this notebook is to clean and pre-process text data scraped in the debate_scraping notebook.

Ideally, I'd have a Dataframe with the following columns: 
- debate date
- debate type (general election or primary (republican/democrat))
- speaker  
- speaker party (republican or democrat) - can add based on the speaker/debate type  
- speaker type (candidate, moderator, etc.)
- line (i.e. the text data)

Importing packages:

In [1]:
import nltk
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

# American Presidency Project

### Pickling in Data

Starting with the American Presidency Project, let's see what I can do:

In [2]:
pwd

'/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis'

In [3]:
cd Data

/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis/Data


In [4]:
with open('app_transcripts.pickle','rb') as read_file:
    app_transcripts = pickle.load(read_file)

In [5]:
len(app_transcripts)

169

In [6]:
counter = 0
for transcript in app_transcripts:
    counter += len(transcript)
print(counter)

79607


This matches the lengths from debate_scraping, so all the data is there.

## Cleaning Data
Each transcript will have a few things to clean:
- remove participants, moderators, other headers to only keep response data 
- make sure each "row" (i.e. line) of data has a speaker attached to it.  Currently, a new paragraph from the same speaker as the preceding one does not begin with the "NAME:" prefix.

### Introductions - starting each transcript at the right spot:

First, I'll check out the introductions:

In [7]:
#Checking out the second item of each transcript, to see if that will be the correct starting point:
for i, transcript in enumerate(app_transcripts):
    print('Transcript {}:'.format(i))
    print(transcript[1]+ '\n')

Transcript 0:
MODERATOR:Kristen Welker (NBC News)

Transcript 1:
MODERATOR:Chris Wallace (Fox News)

Transcript 2:
MODERATOR:Susan Page (USA Today)

Transcript 3:
MODERATORS:Dana Bash (CNN);Ilia Calderón (Univision); andJake Tapper (CNN)

Transcript 4:
MODERATORS:Margaret Brennan (CBS News);Major Garrett (CBS News);Gayle King (CBS News);Norah O'Donnell (CBS News); andBill Whitaker (CBS News)

Transcript 5:
MODERATORS:Vanessa Hauc (Telemundo);Lester Holt (NBC News);Hallie Jackson (NBC News);Jon Ralston (Nevada Independent); andChuck Todd (NBC News)

Transcript 6:
MODERATORS:Linsey Davis (ABC News);Monica Hernandez (WMUR-TV News);David Muir (ABC News);Adam Sexton (WMUR-TV News); andGeorge Stephanopoulos (ABC News)

Transcript 7:
MODERATORS:Wolf Blitzer (CNN);Brianne Pfannenstiel (The Des Moines Register); andAbby Phillip (CNN)

Transcript 8:
MODERATORS:Tim Alberta (Politico);Yamiche Alcindor (PBS);Amna Nawaz (PBS); andJudy Woodruff (PBS)

Transcript 9:
MODERATORS:Rachel Maddow (MSNBC);An

Based on the above, the following transcripts have started the debate speech by index 1:

In [8]:
early = [49, 50, 51, 52, 73, 74, 75, 76, 94, 102, 109, 110, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 139, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168]

Do those all start at 0?

In [9]:
for i, transcript in enumerate(app_transcripts):
    if i in early:
        print('Transcript {}:'.format(i))
        print(transcript[0]+ '\n')

Transcript 49:
Moderator Bob Schieffer. Good evening from the campus of Lynn University here in Boca Raton, Florida. This is the fourth and last debate of the 2012 campaign, brought to you by the Commission on Presidential Debates. This one is on foreign policy. I'm Bob Schieffer of CBS News.


Transcript 50:
Moderator Candy Crowley. Good evening from Hofstra University in Hempstead, New York. I'm Candy Crowley from CNN's "State of the Union."


Transcript 51:
Moderator Jim Lehrer. Good evening from the Magness Arena at the University of Denver, in Denver, Colorado. I'm Jim Lehrer of the PBS NewsHour, and I welcome you to the first of the 2012 Presidential Debates between President Barack Obama, the Democratic nominee, and former Massachusetts Governor Mitt Romney, the Republican nominee.


Transcript 52:
MARTHA RADDATZ, MODERATOR: Good evening, and welcome to the first and only vice presidential debate of 2012, sponsored by the Commission on Presidential Debates. I'm Martha Raddatz of

Based on the above, the following transcripts start at index 0:

In [10]:
starts_at_zero = [49, 50, 51, 52, 94, 112, 113, 114, 116, 117, 118, 119, 120, 121, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168]

The remaining transcripts in the early list start at index 1:

In [11]:
starts_at_one = [73, 74, 75, 76, 102, 109, 110, 115, 139]

Taking a closer look at those that start at 2 or later:

In [12]:
for i, transcript in enumerate(app_transcripts):
    if i not in early:
        print('Transcript {}:'.format(i))
        print(transcript[2]+ '\n')

Transcript 0:
WELKER: A very good evening to both of you. This debate will cover six major topics. At the beginning of each section, each candidate will have two minutes, uninterrupted, to answer my first question. The debate commission will then turn on their microphone only when it is their turn to answer, and the commission will turn it off exactly when the two minutes have expired. After that, both microphones will remain on, but on behalf of the voters, I'm going to ask you to please speak one at a time. The goal is for you to hear each other and for the American people to hear every word of what you both have to say. And so with that, if you're ready, let's start.

Transcript 1:
WALLACE: Good evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I'm Chris Wallace of Fox News and I welcome you to the first of the 2020 presidential debates between President Donald J. Trump and former Vice President Joe Biden. This debate is sponsored 

The following transcripts start at index 3:

In [13]:
starts_at_three = [20, 21, 32, 33, 43, 93, 103, 104, 106, 107, 122]

In [14]:
for i, transcript in enumerate(app_transcripts):
    if i in starts_at_three:
        print('Transcript {}:'.format(i))
        print(transcript[3]+ '\n')

Transcript 20:
BLITZER: Secretary Clinton and Senator Sanders, you can now move to your lecterns while I explain a few ground rules. As moderator, I'll guide the discussion, asking questions and follow-ups. You'll also get questions from Dana Bash and Errol Louis. You'll each have one minute and 15 seconds to answer questions, 30 seconds for follow- ups. Timing lights will signal when your time is up. Both candidates have agreed to these rules now. Opening statements, you'll each have two minutes.

Transcript 21:
SALINAS [through translator]: This will be the first and only debate the candidates will do, taking into account the millions of [inaudible] voters [inaudible] Univision News, together with the Washington Post.

Transcript 32:
BLITZER: We're live here at the University of Houston for the 10th Republican presidential debate. [applause]

Transcript 33:
DICKERSON: Good evening. I'm John Dickerson. This holiday weekend, as America honors our first president, we're about to hear fr

Great, these are all set now.  I can now slice the transcripts to begin at the right point.  The lists below start at indexes other than 2, with the remainder starting at 2:

In [15]:
starts_at_zero = [49, 50, 51, 52, 94, 112, 113, 114, 116, 117, 118, 119, 120, 121, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168]
starts_at_one = [73, 74, 75, 76, 102, 109, 110, 115, 139]
starts_at_three = [20, 21, 32, 33, 43, 93, 103, 104, 106, 107, 122]

Creating a new list, with the correct starting points:

In [16]:
new_app_transcripts = []
for i, transcript in enumerate(app_transcripts):
    if i in starts_at_zero:
        new_app_transcripts.append(transcript)
    elif i in starts_at_one:
        new_app_transcripts.append(transcript[1:])
    elif i in starts_at_three:
        new_app_transcripts.append(transcript[3:])
    else:
        new_app_transcripts.append(transcript[2:])

The list new_app_transcripts now houses the right text for each.
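
The same slicing can be written with a single lookup table instead of a chain of `elif` branches. The sketch below is only an illustration on made-up data: `slice_transcripts`, `toy_transcripts`, and `toy_starts` are hypothetical names, not part of this notebook.

```python
def slice_transcripts(transcripts, start_by_index, default_start=2):
    """Drop header paragraphs so each transcript begins at its first spoken line."""
    return [
        transcript[start_by_index.get(i, default_start):]
        for i, transcript in enumerate(transcripts)
    ]

# Toy data: three tiny "transcripts" with different numbers of header lines.
toy_transcripts = [
    ["PARTICIPANTS: ...", "MODERATOR: ...", "A: hello", "B: hi"],  # starts at 2 (default)
    ["A: first line", "B: reply"],                                  # starts at 0
    ["header", "A: begins here"],                                   # starts at 1
]

# Map transcript position -> index of its first real line.
# Positions that are not listed fall back to default_start.
toy_starts = {1: 0, 2: 1}

toy_sliced = slice_transcripts(toy_transcripts, toy_starts)
print(toy_sliced)
```

With the real lists, the lookup table could be assembled from `starts_at_zero`, `starts_at_one`, and `starts_at_three`, for example with `dict.fromkeys(starts_at_zero, 0)` and so on, which keeps each start offset in one place.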

In [17]:
len(new_app_transcripts)

169

### Adding Date / Debate Name:

In order to add these, I need to scrape the debate dates from the American Presidency Project Page, along with the debate name.  

In [18]:
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML

In [19]:
app_link = 'https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0'

Pulling Dates:

In [122]:
response = requests.get(app_link)
page = response.text
app_soup_object = BeautifulSoup(page, 'lxml')

In [123]:
dates = app_soup_object.find_all('td', class_='xl71')
date_list = [date.get_text() for date in dates]

In [124]:
len(date_list)

209

In [125]:
date_list

['October 22, 2020',
 'October 15, 2020',
 'September 29, 2020',
 'October 7, 2020',
 'March 15, 2020',
 'February 25, 2020',
 'February 19, 2020',
 'February 7, 2020',
 'January 14, 2020',
 'December 19, 2020',
 'November 20, 2019',
 'October 15, 2019',
 'September 12, 2019',
 'July 31, 2019',
 'July 30, 2019',
 'June 27, 2019',
 'June 26, 2019',
 'October 19, 2016',
 'October 9, 2016',
 'Presidential Debate at Washington University in St. Louis, Missouri',
 'September 26, 2016',
 'October 4, 2016',
 'Vice Presidential Debate at Longwood University in Farmville, Virginia',
 'April 14, 2016',
 'March 9, 2016',
 'March 6, 2016',
 'February 11, 2016',
 'February 4, 2016',
 'January 25, 2016',
 'January 17, 2016',
 'December 19, 2015',
 'November 14, 2015',
 'October 13, 2015',
 'March 10, 2016',
 'March 3, 2016',
 'February 25, 2016',
 'February 13, 2016',
 'February 6, 2016',
 'January 28, 2016',
 'January 14, 2016',
 'December 15, 2015',
 'November 10, 2015',
 'October 28, 2015',
 'Sep

Need to remove the non-dates.

In [129]:
#Keep only entries that look like full dates. Building a new list avoids
#removing items from date_list while iterating over it, which skips elements.
non_date_markers = ("\xa0", "Presidential", "Republican", "Democratic")
date_list = [date for date in date_list
             if len(date) >= 5 and not any(marker in date for marker in non_date_markers)]

In [127]:
#Adding Biden/Trump townhalls from 2020
date_list.append('October 15, 2020')
date_list.append('October 15, 2020')

In [130]:
len(date_list)

170

Length is still one longer than the transcript list - which is it?

In [131]:
date_list

['October 22, 2020',
 'October 15, 2020',
 'September 29, 2020',
 'October 7, 2020',
 'March 15, 2020',
 'February 25, 2020',
 'February 19, 2020',
 'February 7, 2020',
 'January 14, 2020',
 'December 19, 2020',
 'November 20, 2019',
 'October 15, 2019',
 'September 12, 2019',
 'July 31, 2019',
 'July 30, 2019',
 'June 27, 2019',
 'June 26, 2019',
 'October 19, 2016',
 'October 9, 2016',
 'September 26, 2016',
 'October 4, 2016',
 'April 14, 2016',
 'March 9, 2016',
 'March 6, 2016',
 'February 11, 2016',
 'February 4, 2016',
 'January 25, 2016',
 'January 17, 2016',
 'December 19, 2015',
 'November 14, 2015',
 'October 13, 2015',
 'March 10, 2016',
 'March 3, 2016',
 'February 25, 2016',
 'February 13, 2016',
 'February 6, 2016',
 'January 28, 2016',
 'January 14, 2016',
 'December 15, 2015',
 'November 10, 2015',
 'October 28, 2015',
 'September 16, 2015',
 'August 6, 2015',
 'January 28, 2016',
 'January 14, 2016',
 'December 15, 2015',
 'November 10, 2015',
 'October 28, 2015',
 'S

It is October 15, 2020, the first one (index 1).  This was the cancelled debate between Trump and Biden.  The solo town-hall transcripts are the last two indices in the list.

In [132]:
date_list.pop(1)

'October 15, 2020'

In [133]:
len(date_list)

169

Dates are all set.
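
Another way to separate real dates from debate titles, bare years, and blank cells is to try parsing each entry. This is a minimal sketch on invented sample cells; `is_full_date` and `toy_cells` are hypothetical names, and the format string assumes dates written like 'October 22, 2020' with English month names.

```python
from datetime import datetime

def is_full_date(text, fmt="%B %d, %Y"):
    """Return True when text parses as a date such as 'October 22, 2020'."""
    try:
        # strip() also removes non-breaking spaces such as '\xa0'
        datetime.strptime(text.strip(), fmt)
        return True
    except ValueError:
        return False

# Invented sample of scraped table cells: dates mixed with titles, years, blanks.
toy_cells = [
    "October 22, 2020",
    "Presidential Debate at Washington University in St. Louis, Missouri",
    "2016",
    "\xa0",
    "September 26, 2016",
]

toy_dates = [cell for cell in toy_cells if is_full_date(cell)]
print(toy_dates)
```

Parsing has the side benefit of flagging malformed entries, although it would not catch a date that is well formed but wrong.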

Adding the debate names (there should be 169):

In [30]:
names = app_soup_object.find('div', class_='col-sm-12').find_all('a')
name_list = [name.get_text() for name in names]

In [31]:
name_list

['Presidential Debate at Belmont University in Nashville, Tennessee',
 'Presidential Debate in Cleveland, Ohio',
 'Vice Presidential Debate the University of Utah in Salt Lake City',
 'Democratic Candidates Debate in Washington, DC',
 'Democratic Candidates Debate in Charleston, South Carolina',
 'Democratic Candidates Debate in Las Vegas, Nevada',
 'Democratic Candidates Debate in Manchester, New Hampshire',
 'Democratic Candidates Debate in Des Moines, Iowa',
 'Democratic Candidates Debate in Los Angeles, California',
 'Democratic Candidates Debate in Atlanta, Georgia',
 'Democratic Candidates Debate in Westerville, Ohio',
 'Democratic Candidates Debate in Houston, Texas',
 'Democratic Candidates Debate in Detroit, Michigan: Group 2',
 'Democratic Candidates Debate in Detroit, Michigan: Group 1',
 'Democratic Candidates Debate in Miami, Florida: Group 2',
 'Democratic Candidates Debate in Miami, Florida: Group 1',
 'Presidential Debate at the University of Nevada in Las Vegas',
 'Pre

In [32]:
len(name_list)

172

There are three extra entries. Looking at the list, they are the last three indices (links from the bottom of the page).

In [33]:
name_list.pop()
name_list.pop()
name_list.pop()

'‹ 2016 Presidential Election Documents'

In [34]:
len(name_list)

169

OK, now I have all the names and dates, along with the debates.

### Debate Transcript Data Frame:

Using the names, dates, and the text, I am going to build a basic dataframe using these lists:

In [35]:
first_df = pd.DataFrame(list(zip(date_list, name_list,new_app_transcripts)), columns=['Date', 'Debate_Name', 'Transcript'])

A few other things that could be added: debate type (general-president, general-vice president, republican primary, democrat primary).  These should be helpful in classifying data points later.

In [36]:
first_df['Debate_Type'] = 0
for i, name in enumerate(first_df.Debate_Name):
    if 'Republican Candidates' in name:
        first_df.iloc[i, 3] = 'Primary-Republican'
    elif 'Democratic Candidates' in name:
        first_df.iloc[i, 3] = 'Primary-Democrat'
    elif 'Vice Presidential' in name:
        first_df.iloc[i, 3] = 'General-VP'
    else:
        first_df.iloc[i, 3] = 'General-President'

In [37]:
first_df.Debate_Type.value_counts()

Primary-Republican    67
Primary-Democrat      53
General-President     38
General-VP            11
Name: Debate_Type, dtype: int64
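
The labelling rule in the loop above can also be captured in a small function, which makes it easy to test on its own. This is a sketch only; `classify_debate` and `toy_names` are hypothetical names introduced here for illustration.

```python
def classify_debate(name):
    """Map a debate title to one of the four type labels used in this notebook."""
    if "Republican Candidates" in name:
        return "Primary-Republican"
    if "Democratic Candidates" in name:
        return "Primary-Democrat"
    if "Vice Presidential" in name:
        return "General-VP"
    # Anything else is treated as a general-election presidential debate.
    return "General-President"

toy_names = [
    "Presidential Debate in Cleveland, Ohio",
    "Vice Presidential Debate the University of Utah in Salt Lake City",
    "Democratic Candidates Debate in Washington, DC",
    "Republican Candidates Debate in Las Vegas, Nevada",
]

toy_types = [classify_debate(name) for name in toy_names]
print(toy_types)
```

On a DataFrame, something like `first_df['Debate_Name'].map(classify_debate)` would produce the whole column in one step, without the placeholder `0` or positional `iloc` writes.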

From there, df.explode can be used to separate the individual debate "responses" into individual rows.
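
As a toy illustration of what `explode` does before it is applied to the real data, the sketch below uses a made-up two-row frame (`toy_df` is a hypothetical name). Each list element becomes its own row, and the other columns are repeated.

```python
import pandas as pd

# Two invented "debates", each holding a list of paragraphs.
toy_df = pd.DataFrame({
    "Debate_Name": ["Debate A", "Debate B"],
    "Transcript": [["X: one", "Y: two"], ["Z: three"]],
})

# One row per paragraph; reset_index gives a clean 0..n-1 index afterwards,
# since explode repeats the original row label for every element.
toy_long = toy_df.explode("Transcript").reset_index(drop=True)
print(toy_long)
```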

In [38]:
transcript_df = first_df.explode('Transcript')

In [39]:
transcript_df.reset_index(drop=True, inplace=True)
transcript_df.head()

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type
0,"October 22, 2020",Presidential Debate at Belmont University in N...,WELKER: A very good evening to both of you. Th...,General-President
1,"October 22, 2020",Presidential Debate at Belmont University in N...,And we will begin with the fight against the c...,General-President
2,"October 22, 2020",Presidential Debate at Belmont University in N...,"TRUMP: So, as you know, more 2.2 million peopl...",General-President
3,"October 22, 2020",Presidential Debate at Belmont University in N...,"WELKER: OK, former Vice President Biden, to yo...",General-President
4,"October 22, 2020",Presidential Debate at Belmont University in N...,"BIDEN: 220,000 Americans dead. If you hear not...",General-President


In [40]:
transcript_df.shape

(79347, 4)

From here, I'll want to add the speaker of each row in the transcript column.  This will be helpful for knowing who actually said each line.  Once that is in place, I can add speaker type (candidate or moderator), as well as party tags (Democrat/Republican).

Test on a small subset:

In [41]:
my_list = list(transcript_df.Transcript[0:10])

In [42]:
for i, line in enumerate(my_list):
    #If the speaker is listed, it will be in the NAME: format.  This checks for the :
    if ":" not in line[0:35]:
        #If no speaker change, combines into one answer, tagged with the speaker at the start.
        my_list[i-1] = my_list[i-1] + ' ' + line
        my_list.remove(line)
my_list

["WELKER: A very good evening to both of you. This debate will cover six major topics. At the beginning of each section, each candidate will have two minutes, uninterrupted, to answer my first question. The debate commission will then turn on their microphone only when it is their turn to answer, and the commission will turn it off exactly when the two minutes have expired. After that, both microphones will remain on, but on behalf of the voters, I'm going to ask you to please speak one at a time. The goal is for you to hear each other and for the American people to hear every word of what you both have to say. And so with that, if you're ready, let's start. And we will begin with the fight against the coronavirus. President Trump, the first question is for you. The country is heading into a dangerous new phase. More than 40,000 Americans are in the hospital tonight with COVID, including record numbers here in Tennessee. And since the two of you last shared a stage, 16,000 Americans ha

In [43]:
len(my_list)

9

So that worked - will it work on the whole data frame?

No, there's too much inconsistency in the text structure across debates (example [here](https://www.presidency.ucsb.edu/documents/presidential-debate-tempe-arizona)).  Will need to determine how to best do this later.
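
For reference, the merge idea from the test above can be written without removing items from the list while looping over it. The sketch below is an illustration under a loud assumption: `SPEAKER_RE` is a guessed pattern for a "NAME:" prefix near the start of a line, and it will not fit every transcript format. `merge_continuations` and `toy_paragraphs` are hypothetical names.

```python
import re

# Assumed speaker-label pattern: a capital letter, then up to 33 name-like
# characters, then a colon. This is a guess, not a rule taken from the data.
SPEAKER_RE = re.compile(r"^[A-Z][A-Za-z.,'\[\]\- ]{0,33}:")

def merge_continuations(paragraphs):
    """Append unlabeled paragraphs to the preceding labeled paragraph."""
    merged = []
    for paragraph in paragraphs:
        if SPEAKER_RE.match(paragraph) or not merged:
            # New speaker turn (or nothing to attach to yet): start a new entry.
            merged.append(paragraph)
        else:
            # Continuation of the previous speaker: fold it into the last entry.
            merged[-1] = merged[-1] + " " + paragraph
    return merged

toy_paragraphs = [
    "WELKER: Good evening.",
    "And we will begin.",
    "TRUMP: So, as you know.",
    "It will go away.",
    "BIDEN: 220,000 dead.",
]

toy_merged = merge_continuations(toy_paragraphs)
print(toy_merged)
```

Building a new list sidesteps two hazards of the in-place version: skipped elements after a removal, and `remove` deleting the first matching string rather than the current one. A continuation paragraph that happens to contain an early colon would still be mistaken for a new speaker.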

In [44]:
transcript_df['Data_Source'] = 'American Presidency Project'

Next step: doing similar work for the Commission on Presidential Debates (CPD) site, since that may be easier to parse for speaker info.

## Making a Dataframe of the Commission on Presidential Debates List
This data does seem to be more consistent in how speakers are labeled.

In [45]:
with open('cpd_transcript_list.pickle','rb') as read_file:
    cpd_transcripts = pickle.load(read_file)

In [46]:
len(cpd_transcripts)

48

In [47]:
#The element at index 23 is just a link to translations, not an actual transcript:
cpd_transcripts.pop(23)

['Debate',
 'Transcript Translation',
 'The First Gore-Bush Presidential Debate',
 'University of Massachusetts Boston, Massachusetts October 3, 2000',
 'French German Italian Japanese Portuguese Spanish',
 'The Second Gore-Bush Presidential Debate',
 'Wake Forest University Winston-Salem, North Carolina October 11, 2000',
 'French German Italian Japanese Portuguese Spanish',
 'The Third Gore-Bush Presidential Debate',
 'Washington University St. Louis, Missouri October 17, 2000',
 'French German Italian Japanese Portuguese Spanish',
 'The Lieberman-Cheney Vice Presidential Debate',
 'Centre College Danville, Kentucky October 5, 2000',
 'French German Italian Japanese Portuguese Spanish',
 'Transcripts provided by Speche Communications, Inc.',
 '']

In [48]:
len(cpd_transcripts)

47

In [49]:
counter = 0
for transcript in cpd_transcripts:
    counter += len(transcript)
print(counter)

15561


So after removing the translations entry there are 47 transcripts, with 15,561 total paragraphs across them.

### Adding Date / Debate Name:

In order to add these, I need to scrape the debate dates, along with the debate names, from the Commission on Presidential Debates transcripts page.

In [50]:
cpd_link = 'https://www.debates.org/voter-education/debate-transcripts/'

Pulling Dates:

In [51]:
response = requests.get(cpd_link)
page = response.text
cpd_soup_object = BeautifulSoup(page, 'lxml')

In [52]:
info = cpd_soup_object.find('div', id='content-sm').find_all('a')
info_list = [stuff.get_text() for stuff in info]

In [53]:
len(info_list)

48

The length matches, so it is set up well.  Some debates are broken in half, so adding the debate date/title info:

In [54]:
info_list[30] = 'October 15, 1992: The Second Clinton-Bush-Perot Presidential Debate ' + info_list[30]

In [55]:
info_list[31] = 'October 15, 1992: The Second Clinton-Bush-Perot Presidential Debate ' + info_list[31]

In [56]:
info_list[27] = 'October 11, 1992: The First Clinton-Bush-Perot Presidential Debate ' + info_list[27]
info_list[28] = 'October 11, 1992: The First Clinton-Bush-Perot Presidential Debate ' + info_list[28]

Popping off the item at index 23, since that is just a link to translations of the 2000 debates:

In [57]:
info_list.pop(23)

'The 2000 Debate Transcripts: Transcripts of the debates translated into six languages'

In [58]:
len(info_list)

47

Now, pulling debate dates and names into separate lists:

In [59]:
date_list = [info.split(':')[0] for info in info_list]
name_list = [info.split(':')[1] for info in info_list]
name_list[0] = 'First Trump-Biden Presidential Debate'
name_list[1] = 'Pence-Harris Vice Presidential Debate'
name_list[2] = 'Second Trump-Biden Presidential Debate'
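
Splitting on every colon drops anything after a second colon and leaves the leading space on the name. A sketch of an alternative using `str.partition`, which splits only at the first colon, is shown below; `split_date_and_name` and the sample entries are hypothetical and assume a "date: title" layout.

```python
def split_date_and_name(entry):
    """Split 'date: title' at the first colon and trim surrounding whitespace."""
    date_part, _, name_part = entry.partition(":")
    return date_part.strip(), name_part.strip()

# Invented entries in the assumed "date: title" layout.
toy_entries = [
    "October 3, 2000: The First Gore-Bush Presidential Debate",
    "October 5, 2000: Debate Title: With A Colon",
]

toy_pairs = [split_date_and_name(entry) for entry in toy_entries]
toy_date_list = [date for date, _ in toy_pairs]
toy_name_list = [name for _, name in toy_pairs]
print(toy_pairs)
```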

Now, slicing the transcripts so they all start right:

In [60]:
starts_at_2 = [19, 20, 21, 22, 23, 24, 25, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]

In [61]:
starts_at_3 = [27, 29, 30]

In [62]:
starts_at_4 = [0, 1, 2, 8, 26, 28, 31]

In [63]:
starts_at_5 = [7,9,10]
starts_at_6 = [17,18]
starts_at_7 = [3, 4, 6, 11, 12, 13, 14, 15, 16]

In [64]:
for i, transcript in enumerate(cpd_transcripts):
    if i not in starts_at_7 and i not in starts_at_6 and i not in starts_at_2 and i not in starts_at_3 and i not in starts_at_4 and i not in starts_at_5:
        print('Transcript {}:'.format(i))
        print(transcript[8]+ '\n')

Transcript 5:
RADDATZ: Good evening. I’m Martha Raddatz from ABC News.
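
The check for transcripts not covered by any start list can also be expressed as a set difference, which additionally makes it easy to spot an index listed twice. This is a sketch on made-up groups; `toy_start_groups` and `toy_total` are hypothetical.

```python
# Invented mapping of start offset -> transcript positions that use it.
toy_start_groups = {
    2: [3, 4],
    4: [0, 1],
    7: [5],
}
toy_total = 7  # pretend there are seven transcripts, positions 0..6

# Every position that appears in some group.
covered = {i for group in toy_start_groups.values() for i in group}

# Positions that appear in no group; these would need the fallback offset.
uncovered = sorted(set(range(toy_total)) - covered)

# If the groups overlap, the listed count exceeds the number of distinct positions.
total_listed = sum(len(group) for group in toy_start_groups.values())
has_duplicates = total_listed != len(covered)

print(uncovered, has_duplicates)
```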



In [65]:
new_cpd_transcripts = []
for i, transcript in enumerate(cpd_transcripts):
    if i in starts_at_2:
        new_cpd_transcripts.append(transcript[2:])
    elif i in starts_at_3:
        new_cpd_transcripts.append(transcript[3:])
    elif i in starts_at_4:
        new_cpd_transcripts.append(transcript[4:])
    elif i in starts_at_5:
        new_cpd_transcripts.append(transcript[5:])
    elif i in starts_at_6:
        new_cpd_transcripts.append(transcript[6:])
    elif i in starts_at_7:
        new_cpd_transcripts.append(transcript[7:])
    else:
        new_cpd_transcripts.append(transcript[8:])

Making a dataframe of the new data:

In [66]:
cpd_df = pd.DataFrame(list(zip(date_list, name_list,new_cpd_transcripts)), columns=['Date', 'Debate_Name', 'Transcript'])

In [67]:
cpd_df['Debate_Type'] = 0
for i, name in enumerate(cpd_df.Debate_Name):
    if 'Vice Presidential' in name:
        cpd_df.iloc[i, 3] = 'General-VP'
    else:
        cpd_df.iloc[i, 3] = 'General-President'

In [68]:
cpd_df.shape

(47, 4)

From there, df.explode can be used to separate the individual debate "responses" into individual rows.

In [69]:
cpd_transcript_df = cpd_df.explode('Transcript')

In [70]:
cpd_transcript_df.reset_index(drop=True, inplace=True)
cpd_transcript_df.shape

(15382, 4)

In [71]:
cpd_transcript_df['Data_Source'] = 'Commission For Presidential Debates'

## Concatenating the Dataframes

So that I have all the data in one place to work with, I'll concatenate the CPD and APP dataframes:

In [72]:
transcript_df.shape

(79347, 5)

In [73]:
cpd_transcript_df.shape

(15382, 5)

In [74]:
combined_transcript_df = pd.concat([transcript_df, cpd_transcript_df])

In [75]:
combined_transcript_df.head()

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Data_Source
0,"October 22, 2020",Presidential Debate at Belmont University in N...,WELKER: A very good evening to both of you. Th...,General-President,American Presidency Project
1,"October 22, 2020",Presidential Debate at Belmont University in N...,And we will begin with the fight against the c...,General-President,American Presidency Project
2,"October 22, 2020",Presidential Debate at Belmont University in N...,"TRUMP: So, as you know, more 2.2 million peopl...",General-President,American Presidency Project
3,"October 22, 2020",Presidential Debate at Belmont University in N...,"WELKER: OK, former Vice President Biden, to yo...",General-President,American Presidency Project
4,"October 22, 2020",Presidential Debate at Belmont University in N...,"BIDEN: 220,000 Americans dead. If you hear not...",General-President,American Presidency Project


In [76]:
combined_transcript_df.shape

(94729, 5)

In [77]:
combined_transcript_df.Data_Source.value_counts()

American Presidency Project            79347
Commission For Presidential Debates    15382
Name: Data_Source, dtype: int64
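
One detail worth keeping in mind: by default `pd.concat` keeps each frame's own index, so row labels repeat in the combined frame. The sketch below, on made-up frames (`toy_app` and `toy_cpd` are hypothetical), shows the difference `ignore_index=True` makes and checks that no rows were lost.

```python
import pandas as pd

toy_app = pd.DataFrame({"Transcript": ["a", "b", "c"], "Data_Source": "APP"})
toy_cpd = pd.DataFrame({"Transcript": ["d", "e"], "Data_Source": "CPD"})

# Default behaviour: the second frame's labels restart at 0.
toy_default = pd.concat([toy_app, toy_cpd])

# ignore_index=True renumbers the rows as one continuous range.
toy_renumbered = pd.concat([toy_app, toy_cpd], ignore_index=True)

# Sanity check: the combined row count equals the sum of the parts.
assert len(toy_renumbered) == len(toy_app) + len(toy_cpd)

print(toy_default.index.tolist())
print(toy_renumbered.index.tolist())
```

Duplicate labels are harmless for counting, but label-based selection such as `.loc[0]` would return one row from each source.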

That matches up, so I will be pickling these dataframes to work with further.

### Pickling Combined Dataframe:

In [78]:
#with open('combined_transcript_df.pickle', 'wb') as to_write:
#    pickle.dump(combined_transcript_df, to_write)
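
Before relying on a pickle in the next notebook, a quick round trip can confirm that the object survives writing and reading. This is a sketch with an invented object and a temporary directory, so nothing in the Data folder is touched; `toy_object` and `toy_path` are hypothetical names.

```python
import os
import pickle
import tempfile

# Invented stand-in for the data being saved.
toy_object = {
    "Date": ["October 22, 2020"],
    "Transcript": [["WELKER: Good evening."]],
}

with tempfile.TemporaryDirectory() as tmp_dir:
    toy_path = os.path.join(tmp_dir, "toy_transcripts.pickle")

    # Write the object out, then read it straight back in.
    with open(toy_path, "wb") as to_write:
        pickle.dump(toy_object, to_write)
    with open(toy_path, "rb") as read_file:
        restored = pickle.load(read_file)

print(restored == toy_object)
```

For DataFrames specifically, pandas also offers `DataFrame.to_pickle` and `pd.read_pickle`, which wrap the same mechanism.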

### Next steps: secondary_text_clean_eda.ipynb
- Add speaker, speaker type, party tags to the rows
- Perform initial EDA of the data

# Adding in Speaker:

Using speaker lists from model_tagging.ipynb, I will be adding in speakers to this dataframe.

In [79]:
with open('new_app_speakers.pickle','rb') as read_file:
    app_speakers = pickle.load(read_file)

In [80]:
with open('new_cpd_speakers.pickle','rb') as read_file:
    cpd_speakers = pickle.load(read_file)

### First, CPD:

In [135]:
speaker_cpd_df = pd.DataFrame(list(zip(date_list, name_list,new_cpd_transcripts)), columns=['Date', 'Debate_Name', 'Transcript'])

In [136]:
speaker_cpd_df.shape

(47, 3)

In [137]:
speaker_cpd_df['Debate_Type'] = 0
for i, name in enumerate(speaker_cpd_df.Debate_Name):
    if 'Vice Presidential' in name:
        speaker_cpd_df.iloc[i, 3] = 'General-VP'
    else:
        speaker_cpd_df.iloc[i, 3] = 'General-President'

In [138]:
speaker_cpd_df['Data_Source'] = 'Commission For Presidential Debates'

In [139]:
speaker_cpd_df.tail()

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Data_Source
42,"January 28, 2016",The Third Carter-Ford Presidential Debate,"[MS. WALTERS: Good evening, I’m Barbara Walter...",General-President,Commission For Presidential Debates
43,"January 14, 2016",The First Kennedy-Nixon Presidential Debate,"[HOWARD K. SMITH, MODERATOR: Good evening. The...",General-President,Commission For Presidential Debates
44,"December 15, 2015",The Second Kennedy-Nixon Presidential Debate,"[FRANK McGEE, MODERATOR: Good evening. This is...",General-President,Commission For Presidential Debates
45,"November 10, 2015",The Third Kennedy-Nixon Presidential Debate,"[BILL SHADEL, MODERATOR: Good evening. I’m Bil...",General-President,Commission For Presidential Debates
46,"October 28, 2015",The Fourth Kennedy-Nixon Presidential Debate,"[QUINCY HOWE, MODERATOR: I am Quincy Howe of C...",General-President,Commission For Presidential Debates


In [86]:
speaker_df = pd.Series(cpd_speakers)

Before exploding the rows, I want to keep only the debates I need:

- 1992 Presidential Debates
- 1984 Presidential Debates
- 1980 All Debates
- 1976 All Debates

The rest will come from the American Presidency Project Data. 

Will need to only save indices: 26, 27, 35, 37, 38, 39, 40,41,42

In [87]:
save_indexes = [26, 27, 35, 37, 38, 39, 40,41,42]

In [88]:
right_cpd_df = speaker_cpd_df.iloc[save_indexes, :]

In [89]:
right_speaker_df = speaker_df.iloc[save_indexes]

Pickling to work with in secondary_text_clean_speakers:

In [120]:
with open('cpd_transcript_list_filtered.pickle', 'wb') as to_write:
    pickle.dump(list(right_cpd_df.Transcript), to_write)

In [121]:
with open('cpd_speaker_list_filtered.pickle', 'wb') as to_write:
    pickle.dump(list(right_speaker_df), to_write)

In [90]:
right_cpd_df

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Data_Source
26,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"[LEHRER: Good evening, and welcome to the firs...",General-President,Commission For Presidential Debates
27,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"[LEHRER: All right, moving on now to divisions...",General-President,Commission For Presidential Debates
35,"October 7, 1984",The First Reagan-Mondale Presidential Debate,[MS. RIDINGS: Good evening from the Kentucky C...,General-President,Commission For Presidential Debates
37,"October 21, 1984",The Second Reagan-Mondale Presidential Debate,[MS. RIDINGS: Good evening from the Municipal ...,General-President,Commission For Presidential Debates
38,"September 21, 1980",The Anderson-Reagan Presidential Debate,"[RUTH J. HINERFELD, CHAIR, LEAGUE OF WOMEN VOT...",General-President,Commission For Presidential Debates
39,"October 28, 1980",The Carter-Reagan Presidential Debate,"[RUTH HINERFELD, LEAGUE OF WOMEN VOTERS, EDUCA...",General-President,Commission For Presidential Debates
40,"September 23, 1976",The First Carter-Ford Presidential Debate,"[EDWIN NEWMAN, MODERATOR: Good evening. I’m Ed...",General-President,Commission For Presidential Debates
41,"October 6, 1976",The Second Carter-Ford Presidential Debate,[MS. FREDERICK: Good evening. I’m Pauline Fred...,General-President,Commission For Presidential Debates
42,"October 22, 1976",The Third Carter-Ford Presidential Debate,"[MS. WALTERS: Good evening, I’m Barbara Walter...",General-President,Commission For Presidential Debates


In [91]:
right_cpd_df_full = right_cpd_df.explode('Transcript')

In [92]:
right_cpd_df_full.reset_index(drop=True, inplace=True)
right_cpd_df_full.shape

(1305, 5)

In [93]:
right_speaker_df_exp = right_speaker_df.explode()

In [94]:
right_speaker_df_exp.reset_index(drop=True, inplace=True)
right_speaker_df_exp.shape

(1305,)

In [95]:
right_cpd_df_full['Speaker'] = right_speaker_df_exp
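
Attaching the speaker column this way depends on both exploded objects having the same length and the same 0..n-1 index, because pandas aligns on index labels during assignment. A minimal sketch with a guard, on made-up data (`toy_lines` and `toy_speakers` are hypothetical):

```python
import pandas as pd

# Invented transcripts and a parallel list of speaker labels per paragraph.
toy_lines = pd.DataFrame({
    "Debate_Name": ["Debate A", "Debate B"],
    "Transcript": [["X: one", "more from X"], ["Y: two"]],
})
toy_speakers = pd.Series([["X", "no speaker!"], ["Y"]])

# Explode both and give them matching 0..n-1 indexes.
toy_lines_long = toy_lines.explode("Transcript").reset_index(drop=True)
toy_speakers_long = toy_speakers.explode().reset_index(drop=True)

# Guard: a length mismatch would otherwise show up only as misaligned or NaN speakers.
if len(toy_lines_long) != len(toy_speakers_long):
    raise ValueError("speaker list and transcript rows are out of step")

toy_lines_long["Speaker"] = toy_speakers_long
print(toy_lines_long)
```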

In [96]:
right_cpd_df_full.sample(10)

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Data_Source,Speaker
1285,"October 22, 1976",The Third Carter-Ford Presidential Debate,"MR. CARTER: Okay, I thought I answered it by s...",General-President,Commission For Presidential Debates,MR. CARTER
536,"October 7, 1984",The First Reagan-Mondale Presidential Debate,THE PRESIDENT: I don’t think so. I’m all confu...,General-President,Commission For Presidential Debates,THE PRESIDENT
226,"October 11, 1992",The First Clinton-Bush-Perot Presidential Deb...,"LEHRER: President Bush, one minute.",General-President,Commission For Presidential Debates,LEHRER
723,"October 21, 1984",The Second Reagan-Mondale Presidential Debate,Armageddon,General-President,Commission For Presidential Debates,no speaker!
997,"October 28, 1980",The Carter-Reagan Presidential Debate,"MR. SMITH: Governor Reagan, you have the last ...",General-President,Commission For Presidential Debates,MR. SMITH
863,"September 21, 1980",The Anderson-Reagan Presidential Debate,REAGAN: I’ll catch up with it later.,General-President,Commission For Presidential Debates,REAGAN
976,"October 28, 1980",The Carter-Reagan Presidential Debate,"MR. STONE: Yes. President Carter, both of you ...",General-President,Commission For Presidential Debates,MR. STONE
1001,"October 28, 1980",The Carter-Reagan Presidential Debate,MR. REAGAN: The Social Security system was bas...,General-President,Commission For Presidential Debates,MR. REAGAN
345,"October 7, 1984",The First Reagan-Mondale Presidential Debate,"MR. BARNES: Mr. Mondale, would you describe yo...",General-President,Commission For Presidential Debates,MR. BARNES
1185,"October 6, 1976",The Second Carter-Ford Presidential Debate,"MR. FRANKEL: Mr. President, just clarify one p...",General-President,Commission For Presidential Debates,MR. FRANKEL


## American Presidency Project

In [97]:
first_df.head()

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type
0,"October 22, 2020",Presidential Debate at Belmont University in N...,[WELKER: A very good evening to both of you. T...,General-President
1,"September 29, 2020","Presidential Debate in Cleveland, Ohio",[WALLACE: Good evening from the Health Educati...,General-President
2,"October 7, 2020",Vice Presidential Debate the University of Uta...,[PAGE: Good evening. From the University of Ut...,General-VP
3,"March 15, 2020","Democratic Candidates Debate in Washington, DC","[TAPPER: Good evening from Washington, D.C. An...",Primary-Democrat
4,"February 25, 2020","Democratic Candidates Debate in Charleston, So...","[O'DONNELL: Tonight, the battle for the 2020 D...",Primary-Democrat


Similar to above, I need to drop the indices for the following debates from the American Presidency Project data, since those debates' data will come from the Commission for Presidential Debates:

- 1992 Presidential Debates
- 1984 Presidential Debates
- 1980 All Debates
- 1976 All Debates

These are indices: 147, 148, 149, 154, 155, 157, 158, 159, 160, 161, 162

In [98]:
right_app_df = first_df.drop(index=[147, 148, 149, 154, 155, 157, 158, 159, 160, 161, 162])
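The hard-coded indices above could also be derived from the `Date` and `Debate_Type` columns, which makes the rule easier to audit. This is only a sketch on a toy frame: `toy_df`, `presidential_only`, `all_debates`, and `drop_idx` are made-up names, and the year/type rule is my reading of the list above, so the result should be compared against the hand-picked indices before replacing them.

```python
import pandas as pd

# Toy stand-in for first_df (hypothetical rows; the real frame has 169).
toy_df = pd.DataFrame({
    "Date": ["October 22, 2020", "October 11, 1992", "October 13, 1992",
             "October 28, 1980", "1960"],
    "Debate_Type": ["General-President", "General-President", "General-VP",
                    "General-President", "General-President"],
})

# Pull the four-digit year out of the free-text date ("1960" has no month or day).
year = toy_df["Date"].str.extract(r"(\d{4})", expand=False).astype(int)

# Assumed rule, read off the list above: presidential debates only for
# 1992 and 1984, every debate for 1980 and 1976.
presidential_only = year.isin([1992, 1984]) & (toy_df["Debate_Type"] == "General-President")
all_debates = year.isin([1980, 1976])

# Index labels of the rows that match either condition.
drop_idx = toy_df[presidential_only | all_debates].index.tolist()
print(drop_idx)
```

On the real data, `first_df.drop(index=drop_idx)` would then take the place of the literal list, provided the two lists agree.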

Pickling the list to work with in secondary_text_clean.ipynb:

In [119]:
with open('app_transcript_list_filtered.pickle', 'wb') as to_write:
    pickle.dump(list(right_app_df.Transcript), to_write)

In [99]:
right_app_df_full = right_app_df.explode('Transcript')

In [100]:
right_app_df_full.reset_index(drop=True, inplace=True)
right_app_df_full.shape

(76676, 4)

Making the speakers Series for APP, dropping the same debate indices as above:

In [101]:
app_speaker_df = pd.Series(app_speakers).drop(index=[147, 148, 149, 154, 155, 157, 158, 159, 160, 161, 162])

Pickling the speaker list to work with in secondary_text_clean_speakers.ipynb:

In [118]:
with open('app_speaker_list_filtered.pickle', 'wb') as to_write:
    pickle.dump(list(app_speaker_df), to_write)

In [102]:
right_speaker_explode = app_speaker_df.explode()
right_speaker_explode.reset_index(drop=True, inplace=True)

In [103]:
right_speaker_explode.shape

(76676,)

Adding speaker on as a column:

In [104]:
right_app_df_full['Speaker'] = right_speaker_explode
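Assigning a Series as a new column aligns on the index, so this step only works because both the exploded transcripts and the exploded speakers were reset to the same 0..n-1 index and have the same length. A minimal sketch of that check on toy data (all names here are made up for illustration):

```python
import pandas as pd

# Toy stand-ins: one list of lines per debate, one list of speakers per debate.
toy_df = pd.DataFrame({"Transcript": [["A: hi", "more"], ["B: yo"]]})
toy_speakers = pd.Series([["A", "no speaker!"], ["B"]])

# Same explode-and-reset pattern as in the cells above.
lines = toy_df.explode("Transcript").reset_index(drop=True)
speakers = toy_speakers.explode().reset_index(drop=True)

# Column assignment aligns on index labels, so check both before assigning.
assert len(lines) == len(speakers)
assert lines.index.equals(speakers.index)

lines["Speaker"] = speakers
print(lines)
```

If the lengths ever differ, the assignment silently fills the unmatched rows with NaN rather than raising, which is why the explicit check is worth having.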

In [105]:
right_app_df_full['Data_Source'] = 'American Presidency Project'

In [106]:
right_app_df_full.sample(7)

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Speaker,Data_Source
36893,"October 18, 2011","Republican Candidates Debate in Las Vegas, Nevada",COOPER: We have another question. This one is ...,Primary-Republican,COOPER,American Presidency Project
1906,"March 15, 2020","Democratic Candidates Debate in Washington, DC","In addition to that, we also have to -- I woul...",Primary-Democrat,no speaker!,American Presidency Project
61093,"June 5, 2007","Republican Candidates Debate in Manchester, Ne...",GOV. HUCKABEE: I think the people of America a...,Primary-Republican,GOV. HUCKABEE,American Presidency Project
68003,"December 17, 1999","Democratic Candidates Town Hall in Nashua, New...",It means we'll be able to have better disease ...,Primary-Democrat,no speaker!,American Presidency Project
30762,"October 11, 2012",Vice Presidential Debate at Centre College in ...,RYAN: A nuclear-armed Iran which triggers a nu...,General-VP,RYAN,American Presidency Project
75802,1960,"Presidential Debate in Washington, DC","MR. NIVEN: Mr. Vice President, you said that w...",General-President,MR. NIVEN,American Presidency Project
37740,"September 22, 2011","Republican Candidates Debate in Orlando, Florida",Former Speaker of the House Newt Gingrich. [ap...,Primary-Republican,no speaker!,American Presidency Project


## Concatenating and Saving the Dataframes:

In [107]:
speaker_combined_transcript_df = pd.concat([right_app_df_full, right_cpd_df_full])
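One thing to keep in mind: `pd.concat` keeps each frame's original index by default, so the combined frame has repeated labels (both sources start at 0). If unique row labels matter downstream, passing `ignore_index=True` renumbers the rows. A small sketch with toy frames (`a`, `b`, `stacked`, and `renumbered` are placeholder names):

```python
import pandas as pd

# Two toy frames, each with its own 0-based index.
a = pd.DataFrame({"Transcript": ["x", "y"]})
b = pd.DataFrame({"Transcript": ["z"]})

# Default behaviour: index labels are carried over, so 0 appears twice.
stacked = pd.concat([a, b])
print(stacked.index.is_unique)

# ignore_index=True assigns a fresh 0..n-1 index to the result.
renumbered = pd.concat([a, b], ignore_index=True)
print(renumbered.index.tolist())
```

Calling `reset_index(drop=True)` on the concatenated frame gives the same result if the concat call itself is left unchanged.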

In [108]:
speaker_combined_transcript_df.head()

Unnamed: 0,Date,Debate_Name,Transcript,Debate_Type,Speaker,Data_Source
0,"October 22, 2020",Presidential Debate at Belmont University in N...,WELKER: A very good evening to both of you. Th...,General-President,WELKER,American Presidency Project
1,"October 22, 2020",Presidential Debate at Belmont University in N...,And we will begin with the fight against the c...,General-President,no speaker!,American Presidency Project
2,"October 22, 2020",Presidential Debate at Belmont University in N...,"TRUMP: So, as you know, more 2.2 million peopl...",General-President,TRUMP,American Presidency Project
3,"October 22, 2020",Presidential Debate at Belmont University in N...,"WELKER: OK, former Vice President Biden, to yo...",General-President,WELKER,American Presidency Project
4,"October 22, 2020",Presidential Debate at Belmont University in N...,"BIDEN: 220,000 Americans dead. If you hear not...",General-President,BIDEN,American Presidency Project


In [109]:
speaker_combined_transcript_df.shape

(77981, 6)

In [110]:
speaker_combined_transcript_df.Data_Source.value_counts()

American Presidency Project            76676
Commission For Presidential Debates     1305
Name: Data_Source, dtype: int64

Pickling this new dataframe:

In [111]:
pwd

'/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis/Data'

In [None]:
with open('speaker_combined_transcript_df.pickle', 'wb') as to_write:
    pickle.dump(speaker_combined_transcript_df, to_write)

## NEXT: secondary_text_clean_speakers.ipynb