# Secondary Text Clean - w/ Speaker Data
The purpose of this notebook is to continue cleaning and building out the dataframe in which each row is tagged with its speaker. That dataframe was generated in text_cleaning.ipynb and saved as speaker_combined_transcript_df.pickle.

Importing packages:

In [1]:
import nltk
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10

Pickling in the lists to work with:

### APP:

In [3]:
#Speaker list of lists:
with open('Data/app_speaker_list_filtered.pickle','rb') as read_file:
    new_app_speakers = pickle.load(read_file)

In [4]:
#Transcript list of lists:
with open('Data/app_transcript_list_filtered.pickle','rb') as read_file:
    app_transcripts = pickle.load(read_file)

Checking the lengths to make sure they match:

In [5]:
len(new_app_speakers), len(app_transcripts)

(158, 158)

Checking that the first debate's transcript and speaker lists are the same length:

In [6]:
len(app_transcripts[0]), len(new_app_speakers[0])

(354, 354)

In [7]:
for i, speaker in enumerate(new_app_speakers):
    if len(speaker) != len(app_transcripts[i]):
        print('Index {} doesnt match!'.format(i))

Nothing printed, so they all match.  
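Since this length check is repeated several times in the notebook, it could be wrapped in a small helper. This is only a sketch: `mismatched_indices` and the toy lists are hypothetical names that are not part of the original notebook.

```python
def mismatched_indices(speaker_lists, transcript_lists):
    """Return the indices whose speaker and transcript lists differ in length."""
    return [
        i
        for i, (speakers, lines) in enumerate(zip(speaker_lists, transcript_lists))
        if len(speakers) != len(lines)
    ]

# Toy data (hypothetical): the second debate has one speaker but two lines
toy_speakers = [['A', 'B'], ['A']]
toy_transcripts = [['A: hi', 'B: hello'], ['A: hi', 'more text']]
print(mismatched_indices(toy_speakers, toy_transcripts))
```

An empty list would mean every debate lines up, which is the same signal as the loop above printing nothing.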

The next goal is to merge the two lists. If "no speaker!" is listed for a row, that row is a new paragraph in the current speaker's turn, so the previous speaker should be carried down.

The one exception will be for the first row of each debate - if that is "no speaker", I'll have to take a closer look.

Now, checking which debates have "no speaker!" in the first row:

In [8]:
for i, speaker in enumerate(new_app_speakers):
    if speaker[0] == 'no speaker!':
        print('Index {} starts with no speaker!'.format(i))

Index 91 starts with no speaker!
Index 139 starts with no speaker!
Index 154 starts with no speaker!


**Index 91:**

In [9]:
new_app_speakers[91][0:3]

['no speaker!', 'no speaker!', 'SEN. DODD']

In [10]:
app_transcripts[91][0:3]

['These formalities out of the way, the lucky recipient of our first question has been determined by lottery. Senator Dodd, that would be you.',
 'Obviously, in the light of what happened in Minnesota last week, maintaining infrastructure requires spending, and how tax dollars are spent is a matter of priorities. What should we not build, what should we not be funding to see to it that our highways and our bridges and our tunnels and our mines are all properly maintained?',
 "SEN. DODD: Well, thank you, first of all. And thank you for the warm welcome this evening. I'm a union guy -- (cheers) -- proudly a union man, and thank you for inviting us to be here tonight."]

This should be 'Moderator'

In [11]:
new_app_speakers[91][0] = 'Moderator'

**Index 139:**

In [12]:
new_app_speakers[139][0:3]

['no speaker!', 'JUDY WOODRUFF, CNN ANCHOR', 'no speaker!']

In [13]:
app_transcripts[139][0:3]

['From the Orpheum Theater in downtown Phoenix, here now is the moderator: Judy Woodruff.',
 'JUDY WOODRUFF, CNN ANCHOR: Thank you for joining us. Tonight we hope to give voters here in Arizona and all across the nation their best opportunity yet to compare the six Republican candidates for president and their specific views on the issues.',
 'We are in the spectacular 70-year-old Orpheum Theater. Built for vaudeville performances and movies, it is now a performing arts center. More than 1,300 people are in the audience here; they were invited by the Arizona Republican Party.']

This should be 'Moderator'.

In [14]:
new_app_speakers[139][0] = 'Moderator'

**Index 154:**

In [15]:
new_app_speakers[154][0:3]

['no speaker!', 'MR. NIVEN', 'MR. NIXON']

In [16]:
app_transcripts[154][0:3]

["FRANK McGEE, MODERATOR: Good evening. This is Frank McGee, NBC News in Washington. This is the second in a series of programs unmatched in history. Never have so many people seen the major candidates for president of the United States at the same time; and never until this series have Americans seen the candidates in face-to-face exchange. Tonight the candidates have agreed to devote the full hour to answering questions on any issue of the campaign. And here tonight are: the Republican candidate, Vice President Richard M. Nixon; and the Democratic candidate, Senator John F. Kennedy. Now representatives of the candidates and of all the radio and television networks have agreed on these rules: neither candidate will make an opening statement or a closing summation; each will be questioned in turn; each will have an opportunity to comment upon the answer of the other; each reporter will ask only one question in turn. He is free to ask any question he chooses. Neither candidate knows wha

Should be 'FRANK McGEE, MODERATOR'

In [17]:
new_app_speakers[154][0] = 'FRANK McGEE, MODERATOR'

Removing transcript rows that contain only crowd reactions, such as (laughter) or (applause), along with their speaker entries:

In [18]:
for i,transcript in enumerate(app_transcripts):
    for j, line in enumerate(transcript):
        text = line.strip('().,').lower()
        if text == 'laughter' or text == 'applause':
            app_transcripts[i].pop(j)
            new_app_speakers[i].pop(j)
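One caveat with the loop above: popping from a list while enumerating it shifts the remaining items left, so the row right after each removed row is never inspected. Two crowd-reaction rows in a row would leave the second one in place. The sketch below shows that effect on toy data and one possible alternative that builds new lists instead. `drop_crowd_noise` and the toy lists are hypothetical, and the row counts recorded later in this notebook come from the original loop, so swapping this in could change them.

```python
def drop_crowd_noise(speakers, lines, noise=('laughter', 'applause')):
    """Return new speaker/line lists without rows that are only crowd noise."""
    kept = [
        (speaker, line)
        for speaker, line in zip(speakers, lines)
        if line.strip('().,').lower() not in noise
    ]
    new_speakers = [speaker for speaker, _ in kept]
    new_lines = [line for _, line in kept]
    return new_speakers, new_lines

# Toy data (hypothetical) with two noise rows back to back
toy_lines = ['A: hello', '(Laughter)', '(Applause)', 'B: thanks']
toy_speakers = ['A', 'no speaker!', 'no speaker!', 'B']

# In-place version, mirroring the loop above: '(Applause)' is skipped
in_place = list(toy_lines)
for j, line in enumerate(in_place):
    if line.strip('().,').lower() in ('laughter', 'applause'):
        in_place.pop(j)
print(in_place)

# Building new lists checks every row exactly once
print(drop_crowd_noise(toy_speakers, toy_lines))
```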

With the first rows corrected, the loop below assigns the previous row's speaker to every row that has "no speaker!" listed:

In [19]:
for i, speaker_list in enumerate(new_app_speakers):
    for j, speaker in enumerate(speaker_list):
        if speaker == 'no speaker!':
            new_app_speakers[i][j] = new_app_speakers[i][j-1]
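The reason the first rows had to be fixed by hand is visible in the index `j-1`: at `j = 0` it becomes `-1`, which in Python silently reads the last row of the debate rather than failing. A minimal sketch of the same carry-forward step that raises an error in that case instead is shown here. `carry_speakers_forward` is a hypothetical helper, not something defined in this notebook.

```python
def carry_speakers_forward(speakers, placeholder='no speaker!'):
    """Replace each placeholder with the most recent real speaker."""
    filled = []
    for speaker in speakers:
        if speaker == placeholder:
            if not filled:
                # Nothing to carry forward yet, so fail loudly
                raise ValueError('first row has no speaker to carry forward')
            filled.append(filled[-1])
        else:
            filled.append(speaker)
    return filled

# Toy data (hypothetical)
print(carry_speakers_forward(['A', 'no speaker!', 'no speaker!', 'B', 'no speaker!']))
```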

In [20]:
for i, speaker in enumerate(new_app_speakers):
    if len(speaker) != len(app_transcripts[i]):
        print('Index {} doesnt match!'.format(i))

Again, everything matches, so it seems to have worked.

### Commission on Presidential Debates:

In [21]:
#Transcript list of lists:
with open('Data/cpd_transcript_list_filtered.pickle','rb') as read_file:
    cpd_transcripts = pickle.load(read_file)

In [22]:
#Speaker list of lists:
with open('Data/cpd_speaker_list_filtered.pickle','rb') as read_file:
    cpd_speakers = pickle.load(read_file)

In [23]:
len(cpd_transcripts), len(cpd_speakers)

(9, 9)

In [24]:
for i, speaker in enumerate(cpd_speakers):
    if len(speaker) != len(cpd_transcripts[i]):
        print('Index {} doesnt match!'.format(i))

Lengths all match! Checking for debates that start with "no speaker!":

In [25]:
for i, speaker in enumerate(cpd_speakers):
    if speaker[0] == 'no speaker!':
        print('Index {} starts with no speaker!'.format(i))

Removing laughter lines:

In [26]:
for i,transcript in enumerate(cpd_transcripts):
    for j, line in enumerate(transcript):
        text = line.strip('().,').lower()
        if text == 'laughter' or text == 'applause':
            cpd_transcripts[i].pop(j)
            cpd_speakers[i].pop(j)

None of the debates start with "no speaker!" (the check in In [25] printed nothing), so all set.

In [27]:
for i, speaker_list in enumerate(cpd_speakers):
    for j, speaker in enumerate(speaker_list):
        if speaker == 'no speaker!':
            cpd_speakers[i][j] = cpd_speakers[i][j-1]

In [28]:
for i, speaker in enumerate(cpd_speakers):
    if len(speaker) != len(cpd_transcripts[i]):
        print('Index {} doesnt match!'.format(i))

Great, all indices match.

Checking whether any "no speaker!" entries are left:

In [29]:
for i, speaker_list in enumerate(cpd_speakers):
    for j,speaker in enumerate(speaker_list):
        if speaker == 'no speaker!':
            print('no speaker is at cpd_speakers[{}][{}]'.format(i, j))
        

In [30]:
for i, speaker_list in enumerate(new_app_speakers):
    for j,speaker in enumerate(speaker_list):
        if speaker == 'no speaker!':
            print('no speaker is at new_app_speakers[{}][{}]'.format(i, j))

Great, no "no speaker!" entries are left.

The next step is to build a dataframe from the cleaned lists:

## Building a new dataframe:

### APP:

Pulling in dates:

In [31]:
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML

In [32]:
app_link = 'https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0'

In [33]:
response = requests.get(app_link)
page = response.text
app_soup_object = BeautifulSoup(page, 'lxml')

In [34]:
dates = app_soup_object.find_all('td', class_='xl71')
app_date_list = [date.get_text() for date in dates]

In [35]:
for date in app_date_list:
    if "\xa0" in date or "Presidential" in date or "Republican" in date or "Democratic" in date:
        app_date_list.remove(date) #Removes the ones with text
    elif len(date) < 5:
        app_date_list.remove(date) #Removes just years

In [36]:
app_date_list.append('October 15, 2020')
app_date_list.append('October 15, 2020')

In [37]:
len(app_date_list)

174

In [38]:
app_date_list.pop(1)

'October 15, 2020'

In [39]:
for date in app_date_list:
    if "\xa0" in date or "Presidential" in date or "Republican" in date or "Democratic" in date:
        app_date_list.remove(date) #Removes the ones with text
    elif len(date) < 5:
        app_date_list.remove(date) #Removes just years

In [40]:
len(app_date_list)

169
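Calling `remove` on a list while looping over that same list can skip the item that follows each removal, which may be why the filter had to be run a second time in In [39]. A single-pass alternative is to build a new list with a predicate. This is a sketch only: `is_real_date` and the toy cells are hypothetical, and because the counts recorded above come from the original loop, a single-pass filter may not reproduce them exactly.

```python
def is_real_date(cell):
    """True for cells that look like full dates, using the same rules as the loop above."""
    has_label = any(
        marker in cell
        for marker in ('\xa0', 'Presidential', 'Republican', 'Democratic')
    )
    # Cells shorter than 5 characters are treated as bare years
    return not has_label and len(cell) >= 5

# Toy cells (hypothetical), mixing years, labels, and full dates
toy_cells = ['2020', '2016', 'October 22, 2020', 'Presidential', '\xa0', 'March 15, 2020']
clean_dates = [cell for cell in toy_cells if is_real_date(cell)]
print(clean_dates)
```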

Pulling in debate names:

In [41]:
names = app_soup_object.find('div', class_='col-sm-12').find_all('a')
app_name_list = [name.get_text() for name in names]

In [42]:
app_name_list.pop()
app_name_list.pop()
app_name_list.pop()

'‹ 2016 Presidential Election Documents'

In [43]:
len(app_name_list)

169

Making an initial DF:

In [44]:
first_app_df = pd.DataFrame(list(zip(app_date_list, app_name_list)), columns=['Date', 'Debate_Name'])

Dropping out the unneeded rows:

In [45]:
first_app_df.drop(index=[147, 148, 149, 154, 155, 157, 158, 159, 160, 161, 162], inplace=True)

Adding in speaker, transcript, debate_type, and data_source:

In [46]:
first_app_df['Transcript'] = app_transcripts
first_app_df['Speaker'] = new_app_speakers
first_app_df['Data_Source'] = 'American Presidency Project'
first_app_df['Debate_Type'] = 0
for i, name in enumerate(first_app_df.Debate_Name):
    if 'Republican Candidates' in name:
        first_app_df.iloc[i, 5] = 'Primary-Republican'
    elif 'Democratic Candidates' in name:
        first_app_df.iloc[i, 5] = 'Primary-Democrat'
    elif 'Vice Presidential' in name:
        first_app_df.iloc[i, 5] = 'General-VP'
    else:
        first_app_df.iloc[i, 5] = 'General-President'
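The same labeling rules can also be written as a plain function, which avoids relying on `Debate_Type` being at column position 5. This is a minimal sketch. `classify_debate` is a hypothetical helper, and the example names are taken from the dataframe preview in this notebook.

```python
def classify_debate(name):
    """Map a debate name to a debate type, using the same rules as the loop above."""
    if 'Republican Candidates' in name:
        return 'Primary-Republican'
    if 'Democratic Candidates' in name:
        return 'Primary-Democrat'
    if 'Vice Presidential' in name:
        return 'General-VP'
    return 'General-President'

for example in ['Presidential Debate in Cleveland, Ohio',
                'Democratic Candidates Debate in Washington, DC']:
    print(example, '->', classify_debate(example))
```

With a function like this, the column could be filled in one step with `first_app_df['Debate_Name'].apply(classify_debate)`.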

In [47]:
first_app_df.head()

| | Date | Debate_Name | Transcript | Speaker | Data_Source | Debate_Type |
|---|---|---|---|---|---|---|
| 0 | October 22, 2020 | Presidential Debate at Belmont University in N... | [WELKER: A very good evening to both of you. T... | [WELKER, WELKER, TRUMP, WELKER, BIDEN, WELKER,... | American Presidency Project | General-President |
| 1 | September 29, 2020 | Presidential Debate in Cleveland, Ohio | [WALLACE: Good evening from the Health Educati... | [WALLACE, WALLACE, BIDEN, TRUMP, BIDEN, WALLAC... | American Presidency Project | General-President |
| 2 | October 7, 2020 | Vice Presidential Debate the University of Uta... | [PAGE: Good evening. From the University of Ut... | [PAGE, PENCE, PAGE, HARRIS, PAGE, PENCE, PAGE,... | American Presidency Project | General-VP |
| 3 | March 15, 2020 | Democratic Candidates Debate in Washington, DC | [TAPPER: Good evening from Washington, D.C. An... | [TAPPER, TAPPER, BASH, CALDERON, TAPPER, TAPPE... | American Presidency Project | Primary-Democrat |
| 4 | February 25, 2020 | Democratic Candidates Debate in Charleston, So... | [O'DONNELL: Tonight, the battle for the 2020 D... | [O'DONNELL, KING, KING, O'DONNELL, O'DONNELL, ... | American Presidency Project | Primary-Democrat |


### Commission on Presidential Debates:

Pulling in dates/names:

In [48]:
cpd_link = 'https://www.debates.org/voter-education/debate-transcripts/'

In [49]:
response = requests.get(cpd_link)
page = response.text
cpd_soup_object = BeautifulSoup(page, 'lxml')
info = cpd_soup_object.find('div', id='content-sm').find_all('a')
info_list = [stuff.get_text() for stuff in info]
len(info_list)

48

Adding in ones that are broken into halves:

In [50]:
info_list[30] = 'October 15, 1992: The Second Clinton-Bush-Perot Presidential Debate ' + info_list[30]

In [51]:
info_list[31] = 'October 15, 1992: The Second Clinton-Bush-Perot Presidential Debate ' + info_list[31]

In [52]:
info_list[27] = 'October 11, 1992: The First Clinton-Bush-Perot Presidential Debate ' + info_list[27]
info_list[28] = 'October 11, 1992: The First Clinton-Bush-Perot Presidential Debate ' + info_list[28]

Popping off the item at index 23, since it is just a link to translations of the 2000 debates:

In [53]:
info_list.pop(23)

'The 2000 Debate Transcripts: Transcripts of the debates translated into six languages'

Now, pulling debate dates and names into separate list:

In [54]:
cpd_date_list = [info.split(':')[0] for info in info_list]
cpd_name_list = [info.split(':')[1] for info in info_list]
cpd_name_list[0] = 'First Trump-Biden Presidential Debate'
cpd_name_list[1] = 'Pence-Harris Vice Presidential Debate'
cpd_name_list[2] = 'Second Trump-Biden Presidential Debate'
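Splitting on every colon and keeping piece `[1]` works as long as the debate name itself contains no colon. A slightly more defensive sketch splits only at the first colon and trims the surrounding whitespace. `split_date_and_name` is a hypothetical helper, the first entry is taken from the strings built above, and the second entry is made up purely to show the two-colon case.

```python
def split_date_and_name(entry):
    """Split 'date: name' at the first colon only and trim whitespace."""
    date, _, name = entry.partition(':')
    return date.strip(), name.strip()

# Entry in the format used above
print(split_date_and_name('October 11, 1992: The First Clinton-Bush-Perot Presidential Debate'))

# Made-up entry with a second colon in the name (hypothetical)
print(split_date_and_name('January 1, 2000: Example Debate: Part Two'))
```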

In [55]:
first_cpd_df = pd.DataFrame(list(zip(cpd_date_list, cpd_name_list)), columns=['Date', 'Debate_Name'])

Keeping only the needed indices:

In [56]:
right_first_cpd_df = first_cpd_df.iloc[[26, 27, 35, 37, 38, 39, 40,41,42], :]

In [57]:
right_first_cpd_df.shape

(9, 2)

Adding in other columns:

In [58]:
right_first_cpd_df['Transcript'] = cpd_transcripts
right_first_cpd_df['Speaker'] = cpd_speakers
right_first_cpd_df['Data_Source'] = 'Commission for Presidential Debates'
right_first_cpd_df['Debate_Type'] = 0
for i, name in enumerate(right_first_cpd_df.Debate_Name):
    if 'Republican Candidates' in name:
        right_first_cpd_df.iloc[i, 5] = 'Primary-Republican'
    elif 'Democratic Candidates' in name:
        right_first_cpd_df.iloc[i, 5] = 'Primary-Democrat'
    elif 'Vice Presidential' in name:
        right_first_cpd_df.iloc[i, 5] = 'General-VP'
    else:
        right_first_cpd_df.iloc[i, 5] = 'General-President'

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy
  right_first_cpd_df['Transcript'] = cpd_transcripts
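The warning above appears because `right_first_cpd_df` was created by selecting rows from `first_cpd_df`, so pandas cannot tell whether the new columns are meant for the selection or for the original frame. One common way to make the intent explicit is to take a copy of the selection before adding columns. The sketch below uses a hypothetical toy frame, and the exact warning behavior may differ between pandas versions.

```python
import pandas as pd

# Toy frame (hypothetical) standing in for first_cpd_df
toy_df = pd.DataFrame({
    'Date': ['d0', 'd1', 'd2'],
    'Debate_Name': ['n0', 'n1', 'n2'],
})

# .copy() makes the selection an independent frame
subset = toy_df.iloc[[0, 2], :].copy()
subset['Transcript'] = [['line a'], ['line b']]
subset['Data_Source'] = 'toy source'

print(subset.shape)
print(list(toy_df.columns))  # the original frame keeps only its own columns
```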

## Concatenating the Dataframes:

Combining the two current ones:

In [60]:
final_combined_transcript_df = pd.concat([right_first_cpd_df, first_app_df])

In [61]:
final_combined_transcript_df.reset_index(drop=True, inplace=True)

In [62]:
final_combined_transcript_df.head()

| | Date | Debate_Name | Transcript | Speaker | Data_Source | Debate_Type |
|---|---|---|---|---|---|---|
| 0 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | [LEHRER: Good evening, and welcome to the firs... | [LEHRER, PEROT, LEHRER, CLINTON, LEHRER, PRESI... | Commission for Presidential Debates | General-President |
| 1 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | [LEHRER: All right, moving on now to divisions... | [LEHRER, COMPTON, CLINTON, CLINTON, CLINTON, L... | Commission for Presidential Debates | General-President |
| 2 | October 7, 1984 | The First Reagan-Mondale Presidential Debate | [MS. RIDINGS: Good evening from the Kentucky C... | [MS. RIDINGS, MS. RIDINGS, MS. RIDINGS, MS. RI... | Commission for Presidential Debates | General-President |
| 3 | October 21, 1984 | The Second Reagan-Mondale Presidential Debate | [MS. RIDINGS: Good evening from the Municipal ... | [MS. RIDINGS, MS. RIDINGS, MS. RIDINGS, MR. NE... | Commission for Presidential Debates | General-President |
| 4 | September 21, 1980 | The Anderson-Reagan Presidential Debate | [RUTH J. HINERFELD, CHAIR, LEAGUE OF WOMEN VOT... | [RUTH J. HINERFELD, CHAIR, LEAGUE OF WOMEN VOT... | Commission for Presidential Debates | General-President |


Exploding Transcripts, to make each row one paragraph:

In [63]:
t_exp_final_combined_transcript_df = final_combined_transcript_df.explode('Transcript')

In [64]:
t_exp_final_combined_transcript_df.reset_index(drop=True, inplace=True)

In [65]:
t_exp_final_combined_transcript_df.shape

(76700, 6)

In [66]:
speakers = final_combined_transcript_df.Speaker

In [67]:
exploded_speakers = speakers.explode()
exploded_speakers.reset_index(drop=True, inplace=True)
exploded_speakers.shape

(76700,)

Reassigning the speaker column to the exploded version:

In [68]:
t_exp_final_combined_transcript_df['Speaker'] = exploded_speakers
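The two-step approach above depends on both exploded results ending up in the same row order after their indices are reset. As an alternative sketch, and assuming pandas 1.3 or newer, `explode` accepts a list of columns and expands them together, provided the lists in each row have matching lengths. The toy frame here is hypothetical.

```python
import pandas as pd

# Toy frame (hypothetical): each row holds parallel lists of lines and speakers
toy_df = pd.DataFrame({
    'Debate_Name': ['Debate one', 'Debate two'],
    'Transcript': [['A: hi', 'B: hello'], ['C: welcome']],
    'Speaker': [['A', 'B'], ['C']],
})

# Explode both list columns in lockstep and renumber the rows
exploded = toy_df.explode(['Transcript', 'Speaker'], ignore_index=True)
print(exploded)
```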

In [69]:
t_exp_final_combined_transcript_df.head()

| | Date | Debate_Name | Transcript | Speaker | Data_Source | Debate_Type |
|---|---|---|---|---|---|---|
| 0 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Good evening, and welcome to the first... | LEHRER | Commission for Presidential Debates | General-President |
| 1 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | PEROT: I think the principal that separates me... | PEROT | Commission for Presidential Debates | General-President |
| 2 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: Governor Clinton, a one minute response. | LEHRER | Commission for Presidential Debates | General-President |
| 3 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | CLINTON: The most important distinction in thi... | CLINTON | Commission for Presidential Debates | General-President |
| 4 | October 11, 1992 | The First Clinton-Bush-Perot Presidential Deb... | LEHRER: President Bush, one minute response, sir. | LEHRER | Commission for Presidential Debates | General-President |


In [70]:
t_exp_final_combined_transcript_df.shape

(76700, 6)

Pickling this dataframe to work on in a new notebook:

In [71]:
with open('Data/final_speaker_df.pickle', 'wb') as to_write:
    pickle.dump(t_exp_final_combined_transcript_df, to_write)

## MOVE TO: final_dataframe_cleanup.ipynb