# Model Tagging
The purpose of this notebook is to work with the original data to see if I can properly tag the speaker of each line.  Additional scraping may be required to make sure each line is classified properly.

Importing packages:

In [1]:
import nltk
import pandas as pd
import numpy as np
import pickle
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML    # make sure Jupyter knows to display it as HTML
import time, os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

In the American Presidency Project (APP), the transcripts list the speaker in different ways: sometimes in bold, sometimes not.  If I can scrape these consistently, I should be able to pull the speaker out of each line.

**American Presidency Project:**  
Group 1: Bold, w/ colons: 
- Everything in 2019/2020 (including Trump/Biden townhalls)
- Everything in 2015-16
- 2012: VP, all Primaries
- 2008 General/Primaries: All others, besides below
- 2004: Democratic Primaries
- 2000: All Democratic Primaries, All Republican Primaries

Group 2: Italics, w/Period:
- 2012: Presidential
- 2004: All Three Presidential
- 1996: All Three Presidential

Group 3: Colon, no italics or bold:
- 2008: Democratic Candidates Debate in Miami, Florida; Republican Candidates Debate in Miami, Florida
- 2004: VP
- 2000: All General Election Debates
- 1996: VP
- 1992: All
- 1988: All
- 1984: All
- 1981: All
- 1980: All
- 1960: All

Goal: check whether the Commission on Presidential Debates (CPD) transcripts are easier to work with, particularly for the debates in Group 3.

**CPD:**  
- Those aren't better, so let me see if I can write scraping code for those first two groups.

### Group 1: Bold, w/ colons:

In [2]:
test_link = 'https://www.presidency.ucsb.edu/documents/presidential-debate-belmont-university-nashville-tennessee-0'

In [3]:
response = requests.get(test_link)
page = response.text
soup_object = BeautifulSoup(page, 'lxml')

In [4]:
transcript = soup_object.find('div', class_='field-docs-content').find_all('p')

The loop below goes through each paragraph. If a speaker is present (in bold), it prints the bold text (i.e., the speaker) with the colon stripped; if not, it prints "no speaker!":

In [5]:
for paragraph in transcript:
    if paragraph.find('b'):
        print(paragraph.find('b').get_text().strip(':'))
    else:
        print('no speaker!')

PARTICIPANTS
MODERATOR
WELKER
no speaker!
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
BIDEN
TRUMP
BIDEN
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
BIDEN
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
BIDEN
TRUMP
BIDEN
WELKER
TRUMP
WELKER
TRUMP
BIDEN
TRUMP
BIDEN
TRUMP
BIDEN
TRUMP
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
TRUMP
WELKER
BIDEN
WELKER
BIDEN
WELK

The idea: once each paragraph has a speaker label next to it, any row labeled "no speaker!" can inherit the preceding speaker, since it is a continuation of that speaker's turn.
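A minimal sketch of that "pull down" step, working on a plain list of labels like the output above. The helper name `fill_speakers` is hypothetical and is not part of debate_scraping_functions.py:

```python
def fill_speakers(labels, placeholder="no speaker!"):
    """Replace placeholder labels with the most recent real speaker."""
    filled = []
    current = None  # stays None until the first real speaker appears
    for label in labels:
        if label != placeholder:
            current = label
        filled.append(current)
    return filled


labels = ["WELKER", "no speaker!", "TRUMP", "WELKER", "BIDEN", "no speaker!"]
print(fill_speakers(labels))
```

Header labels such as PARTICIPANTS and MODERATOR at the top of the transcript would be treated as speakers by this sketch, so they need to be sliced off before or after filling.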

### Group 2: Italics, w/Period:

Testing on one debate:

In [6]:
test_link = 'https://www.presidency.ucsb.edu/documents/presidential-debate-boca-raton-florida'

In [7]:
response = requests.get(test_link)
page = response.text
soup_object = BeautifulSoup(page, 'lxml')

In [8]:
transcript = soup_object.find('div', class_='field-docs-content').find_all('p')

The loop below goes through each paragraph. If a speaker is present (in italics), it prints the italic text (i.e., the speaker); if not, it prints "no speaker!":

In [9]:
for paragraph in transcript:
    if paragraph.find('i'):
        print(paragraph.find('i').get_text().strip('.'))
    else:
        print('no speaker!')

Moderator Bob Schieffer
no speaker!
no speaker!
no speaker!
no speaker!
Situation in the Middle East and North Africa/Al Qaida Terrorist Organization
no speaker!
no speaker!
no speaker!
Republican Presidential Nominee W. Mitt Romney
no speaker!
no speaker!
no speaker!
no speaker!
Mr. Schieffer. 
Counterterrorism Efforts/Libya
The President
no speaker!
no speaker!
no speaker!
no speaker!
Counterterrorism Efforts/Situation in the Middle East and North Africa
Gov. Romney. 
no speaker!
no speaker!
no speaker!
Mr. Schieffer
Gov. Romney
Mr. Schieffer. 
Governor Romney's Foreign Policy Agenda
The President
no speaker!
no speaker!
no speaker!
no speaker!
Mr. Schieffer. 
Situation in the Middle East/Russia/Iraq
Gov. Romney. 
no speaker!
no speaker!
The President
Gov. Romney. 
no speaker!
The President
Gov. Romney. 
The President
Gov. Romney. 
The President
Gov. Romney. 
The President
Gov. Romney. 
The President
Gov. Romney. 
The President
Gov. Romney. 
The President
Gov. Romney. 
The President


With groups 1 and 2, I can collapse the text in the "no speaker!" rows to the preceding speaker, and thus get each individual speaking line.
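One way the collapsing could look, as a rough sketch: it assumes a list of labels and a list of paragraph texts of equal length, and the name `merge_turns` is hypothetical. It also trims the trailing space and period that some italic labels carry in the output above (e.g. "Gov. Romney. "):

```python
def merge_turns(labels, paragraphs, placeholder="no speaker!"):
    """Group consecutive paragraphs into (speaker, text) turns."""
    turns = []
    for label, text in zip(labels, paragraphs):
        if label == placeholder and turns:
            # Continuation paragraph: append to the previous turn's text
            speaker, so_far = turns[-1]
            turns[-1] = (speaker, so_far + " " + text)
        elif label == placeholder:
            # Unlabelled text before any speaker has appeared
            turns.append((None, text))
        else:
            # Labelled paragraph starts a new turn
            turns.append((label.strip().rstrip("."), text))
    return turns


labels = ["The President", "no speaker!", "Gov. Romney. ", "no speaker!"]
paragraphs = ["First point.", "Second point.", "A reply.", "More reply."]
turns = merge_turns(labels, paragraphs)
for speaker, text in turns:
    print(speaker, "->", text)
```

Note that in the Group 2 output the italic check also picks up topic headings (e.g. "Counterterrorism Efforts/Libya"), so those would need to be filtered out before merging; this sketch does not attempt that.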

### Group 3: Colon, no italics or bold:

For Group 3, is there a way to identify the speaker lines, e.g., by checking for upper case?  
**American Presidency Project:**
- 2008: Democratic Candidates Debate in Miami, Florida; Republican Candidates Debate in Miami, Florida - split by :, speaker is in all caps (i.e. use .isupper())
- 2004: VP - split by :, speaker is in all caps (i.e. use .isupper())
- 2000: All General Election Debates - split by :, speaker is in all caps (i.e. use .isupper())
- 1996: VP - split by :, speaker is in all caps (i.e. use .isupper())
- 1992: All - VP can be split by :, isupper.  Presidential use Name. as the intro --> check CPD site
- 1988: All - split by :, speaker is in all caps (i.e. use .isupper())
- 1984: All - VP can be split by :, .isupper.  Presidential use Name. as the intro --> check CPD site
- 1980: All --> one split by :, one split by ., speaker is in all caps (i.e. use .isupper())
- 1976: All -> one split by :, rest split by ., speaker is in all caps (i.e. use .isupper())
- 1960: All --> split by :, speaker is in all caps (i.e. use .isupper())

Looking at CPD for potential help:
- 1992 Presidential: use CPD site, since it has : with speaker in all caps
- 1984 Presidential: use CPD site, has NAME: as the format (split by :, index 0 .isupper())
- 1980 All: Use CPD site, has NAME: as the format (split by :, index 0 .isupper())
- 1976 All: Use CPD site, has NAME: as the format (split by :, index 0 .isupper()) 

Therefore, for Group 3 I'll use APP for everything except the following, which will come from CPD:
- 1992 Presidential Debates
- 1984 Presidential Debates
- 1980 All Debates
- 1976 All Debates

In [10]:
test_link = 'https://www.debates.org/voter-education/debate-transcripts/october-13-1960-debate-transcript/'

In [11]:
response = requests.get(test_link)
page = response.text
soup_object = BeautifulSoup(page, 'lxml')

In [12]:
transcript = soup_object.find('div', id='content-sm').find_all('p')

The loop below finds the speaker info on the CPD page (the same split-on-colon, all-caps check applies to the APP Group 3 transcripts):

In [13]:
#CPD
for paragraph in transcript:
    # Text before the first colon is the candidate speaker label
    candidate = paragraph.get_text().split(':')[0]
    # 'MR. McGEE' (1960 debates) fails .isupper() because of the lowercase 'c'
    if candidate.isupper() or candidate == 'MR. McGEE':
        print(candidate)
    else:
        print('no speaker!')

no speaker!
no speaker!
BILL SHADEL, MODERATOR
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. McGEE
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. VON FREMD
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. CATER
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. DRUMMOND
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. VON FREMD
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. CATER
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. DRUMMOND
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. McGEE
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. CATER
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. DRUMMOND
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. McGEE
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
MR. VON FREMD
MR. NIXON
MR. SHADEL
MR. KENNEDY
MR. SHADEL
MR. DRUMMOND
MR. KENNEDY
MR. SHADEL
MR. NIXON
MR. SHADEL
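The McGEE special case could be generalized a little. The sketch below is an assumption-laden variant, not the code used in debate_scraping_functions.py: it treats any "Mc" prefix as upper case for the check, requires a colon to be present, and caps the label length (the `max_len` value of 40 is arbitrary). Unlike the loop above, it will not flag an all-caps paragraph that has no colon:

```python
import re

# Assumption: the only mixed-case surnames follow the "Mc" + capital pattern
MC_PREFIX = re.compile(r"\bMc(?=[A-Z])")


def extract_speaker(text, max_len=40):
    """Return the all-caps label before the first colon, or None."""
    if ":" not in text:
        return None
    candidate = text.split(":", 1)[0].strip()
    if not candidate or len(candidate) > max_len:
        return None
    # Treat "McGEE" as "MCGEE" for the upper-case test only
    if MC_PREFIX.sub("MC", candidate).isupper():
        return candidate
    return None


for line in ["MR. McGEE: Senator Kennedy, a question.",
             "MR. KENNEDY: Good evening.",
             "He said: no."]:
    print(extract_speaker(line))
```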


# Combining the Above

## American Presidency Project:

To combine the above into a dataframe with a speaker column, I'll need to build lists in which each speaker label lines up with its line, debate name, and date.  I'll work these into my functions in debate_scraping_functions.py.

For the American Presidency Project, if I can create a list of tuples with the link and group number, I can write a function to pull the speakers out in order.
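As an illustration of how a group number could select the extraction rule, here is a small sketch. It is not the actual `app_group_speaker_puller`; the name `pick_label` and its arguments are hypothetical, and it assumes the bold and italic text have already been pulled out of each paragraph with BeautifulSoup as in the cells above:

```python
def pick_label(group, bold=None, italic=None, text=""):
    """Choose a speaker label according to the transcript's formatting group.

    group 1: bold label ending in a colon
    group 2: italic label ending in a period
    group 3: plain all-caps label before the first colon
    """
    if group == 1 and bold:
        return bold.strip().rstrip(":")
    if group == 2 and italic:
        return italic.strip().rstrip(".")
    if group == 3:
        candidate = text.split(":")[0].strip()
        if candidate.isupper():
            return candidate
    return "no speaker!"


print(pick_label(1, bold="TRUMP:"))
print(pick_label(2, italic="Gov. Romney. "))
print(pick_label(3, text="MR. NIXON: Thank you."))
```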

In [14]:
from debate_scraping_functions import app_url_puller

In [15]:
link_list = app_url_puller('https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0')

In [16]:
len(link_list)

169

Group numbers for each link, taken from the markdown lists above:

In [17]:
group_list = [1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,2,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,1,1,1,3,1,1,1,1,1,1,1,1,1,2,2,2,3,1,1,3,3,3,3,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,1,2,2,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,3,1,1]
len(group_list)

169

They match in length, so I'll combine them into a list of (group, link) tuples:

In [18]:
group_tuple_list = list(zip(group_list, link_list))
group_tuple_list

[(1,
  'https://www.presidency.ucsb.edu/documents/presidential-debate-belmont-university-nashville-tennessee-0'),
 (1,
  'https://www.presidency.ucsb.edu/documents/presidential-debate-case-western-reserve-university-cleveland-ohio'),
 (1,
  'https://www.presidency.ucsb.edu/documents/vice-presidential-debate-the-university-utah-salt-lake-city'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-washington-dc'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-charleston-south-carolina-0'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-las-vegas-nevada-0'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-manchester-new-hampshire-0'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-des-moines-iowa-0'),
 (1,
  'https://www.presidency.ucsb.edu/documents/democratic-candidates-debate-los-angeles-california'),
 (1,
  'https://www.presidency.u

Using function app_group_speaker_puller to get the speakers:

In [19]:
from debate_scraping_functions import app_group_speaker_puller

In [20]:
speaker_list = app_group_speaker_puller(group_tuple_list)

Checking out the end to see if this worked well:

In [21]:
speaker_list[-1][-15:]

['BIDEN',
 'no speaker!',
 'no speaker!',
 'STEPHANOPOULOS',
 'no speaker!',
 'STEPHANOPOULOS',
 'no speaker!',
 'no speaker!',
 'BIDEN',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'STEPHANOPOULOS',
 'BIDEN',
 'STEPHANOPOULOS']

This matches with the last 15 paragraphs of the Biden Town hall [transcript](https://www.presidency.ucsb.edu/documents/remarks-town-hall-meeting-with-george-stephanopoulos-abc-news-the-national-constitution).  It appears that everything is working well!  "no speaker!" means that specific paragraph is not tagged with a speaker.

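Additionally, the total number of speaker labels across all debates should match the total paragraph count from the text_cleaning.ipynb notebook: 79,607.  `len(speaker_list)` only counts debates (169), so the per-debate lengths are summed below.  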

In [22]:
len(speaker_list)

169

In [23]:
counter = 0
for speaker in speaker_list:
    counter += len(speaker)
print(counter)

79607
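Matching totals do not guarantee that every debate lines up individually. A per-debate check could look like the sketch below; `mismatched_debates` and `paragraph_lists` are hypothetical names, with `paragraph_lists` standing in for the per-debate paragraph lists from text_cleaning.ipynb:

```python
def mismatched_debates(speaker_lists, paragraph_lists):
    """Return indices of debates whose label and paragraph counts differ."""
    if len(speaker_lists) != len(paragraph_lists):
        raise ValueError("different number of debates in the two lists")
    return [
        i
        for i, (speakers, paragraphs) in enumerate(zip(speaker_lists, paragraph_lists))
        if len(speakers) != len(paragraphs)
    ]


# Toy data: the second debate has one label but two paragraphs
toy_speakers = [["BIDEN", "no speaker!"], ["TRUMP"]]
toy_paragraphs = [["p1", "p2"], ["p1", "p2"]]
print(mismatched_debates(toy_speakers, toy_paragraphs))
```

An empty result would mean every debate has exactly one label per paragraph.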


Next step: slice each debate's list using the same rules as in text_cleaning.ipynb, which determine the index at which the debate actually starts:

In [24]:
starts_at_zero = [49, 50, 51, 52, 94, 112, 113, 114, 116, 117, 118, 119, 120, 121, 143, 144, 145, 146, 147, 148, 149, 150, 151, 152, 153, 154, 155, 156, 157, 158, 159, 160, 161, 162, 163, 164, 165, 166, 167, 168]
starts_at_one = [73, 74, 75, 76, 102, 109, 110, 115, 139]
starts_at_three = [20, 21, 32, 33, 43, 93, 103, 104, 106, 107, 122]

In [25]:
new_speakers = []
for i, speaker in enumerate(speaker_list):
    if i in starts_at_zero:
        new_speakers.append(speaker)
    elif i in starts_at_one:
        new_speakers.append(speaker[1:])
    elif i in starts_at_three:
        new_speakers.append(speaker[3:])
    else:
        new_speakers.append(speaker[2:])

In [26]:
len(new_speakers)

169
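The if/elif chain above could also be written as a lookup from debate index to start offset. This is only an alternative sketch with hypothetical helper names (`build_offsets`, `slice_by_offset`), shown on toy data; the default offset of 2 mirrors the `else` branch above:

```python
def build_offsets(groups):
    """Turn {start_index: [debate indices]} into {debate index: start_index}."""
    return {i: start for start, indices in groups.items() for i in indices}


def slice_by_offset(speaker_lists, offsets, default=2):
    """Drop leading header labels; debates not in `offsets` use `default`."""
    return [speakers[offsets.get(i, default):]
            for i, speakers in enumerate(speaker_lists)]


# Toy data: debate 0 starts at index 0, debate 2 at index 3, the rest at 2
toy = [["A", "B"], ["h1", "h2", "C"], ["h1", "h2", "h3", "D"]]
offsets = build_offsets({0: [0], 3: [2]})
print(slice_by_offset(toy, offsets))
```

With the real data, the dictionary passed to `build_offsets` would be `{0: starts_at_zero, 1: starts_at_one, 3: starts_at_three}`.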

Moving on to the Commission on Presidential Debates data.

## Commission on Presidential Debates:

Per the above, I need to use CPD for the following Group 3 debates:
- 1992 Presidential Debates
- 1984 Presidential Debates
- 1980 All Debates
- 1976 All Debates  

Since only the Group 3 logic is needed for CPD, I can run it on every CPD transcript.

In [27]:
cpd_link = 'https://www.debates.org/voter-education/debate-transcripts/'

In [28]:
from debate_scraping_functions import cpd_url_puller, cpd_group_speaker_puller

In [29]:
cpd_link_list = cpd_url_puller(cpd_link)

In [30]:
len(cpd_link_list)

48

In [31]:
cpd_speaker_list = cpd_group_speaker_puller(cpd_link_list)

In [32]:
len(cpd_speaker_list)

48

Now, popping off the element at index 23 (the 24th entry) of both lists, since that is just a link to translations:

In [33]:
cpd_link_list.pop(23)

'https://www.debates.org/voter-education/debate-transcripts/2000-debate-transcripts-translations/'

In [34]:
cpd_speaker_list.pop(23)

['no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!',
 'no speaker!']
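Popping by position depends on the link order staying the same on the CPD site. A sketch of a filter keyed on the URL instead is shown below; `drop_links` is a hypothetical helper, the toy URLs are placeholders, and matching on the substring "translations" is an assumption based on the link popped above:

```python
def drop_links(links, speaker_lists, unwanted="translations"):
    """Remove every link containing `unwanted`, along with its speaker list."""
    kept = [(link, speakers)
            for link, speakers in zip(links, speaker_lists)
            if unwanted not in link]
    kept_links = [link for link, _ in kept]
    kept_speakers = [speakers for _, speakers in kept]
    return kept_links, kept_speakers


# Toy data standing in for cpd_link_list and cpd_speaker_list
toy_links = ["site/debate-a/", "site/2000-debate-transcripts-translations/", "site/debate-b/"]
toy_speakers = [["MR. NIXON"], ["no speaker!"], ["MR. KENNEDY"]]
clean_links, clean_speakers = drop_links(toy_links, toy_speakers)
print(clean_links)
print(clean_speakers)
```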

Slicing to match up with the other columns:

In [35]:
starts_at_2 = [19, 20, 21, 22, 23, 24, 25, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46]

In [36]:
starts_at_3 = [27, 29, 30]

In [37]:
starts_at_4 = [0, 1, 2, 8, 26, 28, 31]

In [38]:
starts_at_5 = [7,9,10]
starts_at_6 = [17,18]
starts_at_7 = [3, 4, 6, 11, 12, 13, 14, 15, 16]

In [39]:
new_cpd_speakers = []
for i, speaker in enumerate(cpd_speaker_list):
    if i in starts_at_2:
        new_cpd_speakers.append(speaker[2:])
    elif i in starts_at_3:
        new_cpd_speakers.append(speaker[3:])
    elif i in starts_at_4:
        new_cpd_speakers.append(speaker[4:])
    elif i in starts_at_5:
        new_cpd_speakers.append(speaker[5:])
    elif i in starts_at_6:
        new_cpd_speakers.append(speaker[6:])
    elif i in starts_at_7:
        new_cpd_speakers.append(speaker[7:])
    else:
        new_cpd_speakers.append(speaker[8:])

Great, now I have speakers for both.  Time to pickle and bring into the text_cleaning.ipynb notebook.

In [2]:
cd Data

/Users/patrickbovard/Documents/GitHub/presidential_debate_analysis/Data


In [41]:
with open('new_cpd_speakers.pickle', 'wb') as to_write:
    pickle.dump(new_cpd_speakers, to_write)

In [42]:
with open('new_app_speakers.pickle', 'wb') as to_write:
    pickle.dump(new_speakers, to_write)
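Before switching notebooks, it can be worth confirming that nested lists like these survive pickling unchanged. The sketch below writes to a temporary directory rather than the Data folder; `roundtrip_ok` is a hypothetical helper:

```python
import os
import pickle
import tempfile


def roundtrip_ok(obj):
    """Pickle `obj` to a temporary file, load it back, and compare."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "check.pickle")
        with open(path, "wb") as to_write:
            pickle.dump(obj, to_write)
        with open(path, "rb") as to_read:
            return pickle.load(to_read) == obj


sample = [["BIDEN", "no speaker!"], ["MR. NIXON"]]
print(roundtrip_ok(sample))
```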

## Next Step: Continue in text_cleaning.ipynb (near the bottom)