# Debate Scraping

The purpose of this notebook is to explore scraping debate transcripts from the [Commission on Presidential Debates website](https://www.debates.org/voter-education/debate-transcripts/) and the American Presidency Project.

In [1]:
import pandas as pd
import matplotlib.pyplot as plt
from pylab import rcParams
%matplotlib inline
rcParams['figure.figsize'] = 20,10
import numpy as np
import glob
from scipy import stats
from bs4 import BeautifulSoup
import requests
import re
from IPython.display import display, HTML    # make sure Jupyter knows to display it as HTML

In [2]:
import time, os
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
chromedriver = "/Applications/chromedriver" # path to the chromedriver executable
os.environ["webdriver.chrome.driver"] = chromedriver

## Initial Soup Testing:

First, define the URL for the debate held on September 29, 2020:

In [3]:
url_1 = 'https://www.debates.org/voter-education/debate-transcripts/september-29-2020-debate-transcript/'

In [4]:
response_1 = requests.get(url_1)

In [5]:
page = response_1.text

In [6]:
soup_object = BeautifulSoup(page, 'lxml')

In [7]:
soup_object

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
<head>
<meta content="en-us" http-equiv="Content-Language"/>
<meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
<title>CPD: September 29, 2020 Debate Transcript</title>
<link href="/wp-content/themes/debates2019/css/reset.css" rel="stylesheet" type="text/css"/>
<link href="/wp-content/themes/debates2019/css/jc-main.css" media="screen,projection" rel="stylesheet" type="text/css"/>
<link href="/wp-content/themes/debates2019/css/fonts.css" media="screen,projection" rel="stylesheet" type="text/css"/>
<!--[if gte IE 5]>
        <link href="/wp-content/themes/debates2019/css/jc-iemain.css" rel="stylesheet" type="text/css" media="screen,projection"  />
        <![endif]-->
<link href="/wp-content/themes/debates2019/css/styles.css" media="screen" rel="stylesheet" type="text/css"/>
<style>
.page-item-44 .children {
    display: none;
}
</style>
</head>
<body>
<div id="wrapper">
<div id="header">
</div>
<div 

With the BeautifulSoup object created, the next step is to pull out the text data:

In [8]:
debate_speech = soup_object.find('div', id='content-sm').find_all('p')

In [9]:
debate_speech

[<p><b>Presidential Debate at Case Western Reserve University and Cleveland Clinic in Cleveland, Ohio</b></p>,
 <p>September 29, 2020</p>,
 <p><b>PARTICIPANTS:<br/>
 </b>Former Vice President Joe Biden (D) and<br/>
 President Donald Trump (R)</p>,
 <p><b>MODERATOR:<br/>
 </b>Chris Wallace (Fox News)</p>,
 <p><b>WALLACE:</b> Good evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I’m Chris Wallace of Fox News and I welcome you to the first of the 2020 presidential debates between President Donald J. Trump and former Vice President Joe Biden. This debate is sponsored by the Commission on Presidential Debates. The Commission has designed the format, six roughly 15-minute segments with two-minute answers from each candidate to the first question, then open discussion for the rest of each segment. Both campaigns have agreed to these rules. For the record, I decided the topics and the questions in each topic. I can assure you none of the que

In [10]:
type(debate_speech)

bs4.element.ResultSet

In [11]:
debate_text = []
for debate in debate_speech:
    debate_item = [item for item in debate]
    debate_text.append(debate_item)

In [12]:
debate_text[8]

[<b>BIDEN:</b>, '\xa0I’m well.']

In [13]:
type(debate_text[8][0])

bs4.element.Tag

In [14]:
type(debate_text[8][1])

bs4.element.NavigableString

In [15]:
debate_speech_2 = soup_object.find('div', id='content-sm').get_text()

In [16]:
debate_speech_2



In [17]:
print(soup_object.prettify())

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
 <head>
  <meta content="en-us" http-equiv="Content-Language"/>
  <meta content="text/html;charset=utf-8" http-equiv="Content-Type"/>
  <title>
   CPD: September 29, 2020 Debate Transcript
  </title>
  <link href="/wp-content/themes/debates2019/css/reset.css" rel="stylesheet" type="text/css"/>
  <link href="/wp-content/themes/debates2019/css/jc-main.css" media="screen,projection" rel="stylesheet" type="text/css"/>
  <link href="/wp-content/themes/debates2019/css/fonts.css" media="screen,projection" rel="stylesheet" type="text/css"/>
  <!--[if gte IE 5]>
        <link href="/wp-content/themes/debates2019/css/jc-iemain.css" rel="stylesheet" type="text/css" media="screen,projection"  />
        <![endif]-->
  <link href="/wp-content/themes/debates2019/css/styles.css" media="screen" rel="stylesheet" type="text/css"/>
  <style>
   .page-item-44 .children {
    display: none;
}
  </style>
 </head>
 <body>
  <div id="wrapp

How do I get each speaker and what they said?

In [18]:
debate_speech_2 = soup_object.find('div', id='content-sm').find_all('p')[-1]

Speaker name:

In [19]:
debate_speech_2.find('b').get_text().strip(':')

'WALLACE'

Text:

In [20]:
debate_speech_2.get_text()

'WALLACE:\xa0… to be continued in more debates as we go on. President Trump, Vice President Biden, it’s been an interesting hour and a half. I want to thank you both for participating in the first of three debates that you have agreed to engage in. We want to thank Case Western Reserve University and the Cleveland Clinic for hosting this event. The next debate, sponsored by the Commission on Presidential Debates, will be one week from tomorrow, October 7th, at the University of Utah in Salt Lake City. The two Vice-Presidential nominees, Vice President Mike Pence and Senator Kamala Harris will debate at 9:00 PM Eastern that night. We hope you watch. Until then, thank you, and good night.'
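The two lookups above can be folded into one helper. This is a minimal sketch, assuming each spoken paragraph has the shape `<p><b>NAME:</b> text</p>`. The name `split_paragraph` and the inline HTML sample are illustrative only, and `html.parser` is used here just to avoid depending on `lxml`.

```python
from bs4 import BeautifulSoup

# Hypothetical sample mimicking the CPD markup; not fetched from the site.
sample_html = """
<div id="content-sm">
<p><b>BIDEN:</b>\xa0I’m well.</p>
<p>A paragraph with no speaker label.</p>
</div>
"""

def split_paragraph(p_tag):
    """Return (speaker, text) for one <p> tag; speaker is None if unlabeled."""
    text = p_tag.get_text().replace('\xa0', ' ').strip()
    bold = p_tag.find('b')
    if bold is None:
        return None, text
    label = bold.get_text().strip()
    speaker = label.rstrip(':').strip()
    # Drop the label from the front of the paragraph text.
    if text.startswith(label):
        text = text[len(label):].strip()
    return speaker, text

soup = BeautifulSoup(sample_html, 'html.parser')
paragraphs = soup.find('div', id='content-sm').find_all('p')
pairs = [split_paragraph(p) for p in paragraphs]
print(pairs)
```

Header paragraphs on the real page (title, participants, moderator) also use `<b>`, so they would still need to be sliced off before applying a helper like this.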

In [21]:
len(soup_object.find('div', id='content-sm').find_all('p'))

877

In [22]:
# Skip the first four <p> tags (title, date, participants, moderator).
transcript = soup_object.find('div', id='content-sm').find_all('p')[4:]
full_transcript = []
for paragraph in transcript:
    # Replace non-breaking spaces so speaker labels read 'NAME: text'.
    text = paragraph.get_text().strip().replace('\xa0', ' ')
    full_transcript.append(text)

Below is the full transcript, with each paragraph as a string in a list:

In [23]:
full_transcript

['WALLACE: Good evening from the Health Education Campus of Case Western Reserve University and the Cleveland Clinic. I’m Chris Wallace of Fox News and I welcome you to the first of the 2020 presidential debates between President Donald J. Trump and former Vice President Joe Biden. This debate is sponsored by the Commission on Presidential Debates. The Commission has designed the format, six roughly 15-minute segments with two-minute answers from each candidate to the first question, then open discussion for the rest of each segment. Both campaigns have agreed to these rules. For the record, I decided the topics and the questions in each topic. I can assure you none of the questions has been shared with the Commission or the two candidates.',
 'This debate is being conducted under health and safety protocols designed by the Cleveland Clinic, which is serving as the health security advisor to the Commission for all four debates. As a precaution, both campaigns have agreed the candidates

Next step: prefix every paragraph with its speaker, so that each list item starts with 'NAME:'. That makes it easy to categorize who is talking and whether it's a candidate or a moderator.

In [24]:
len(full_transcript)

873

In [25]:
full_transcript[4][0:6]

'BIDEN:'

In [26]:
full_transcript[0][0:6]

'WALLAC'

In [27]:
full_transcript[0].split(':')[0]

'WALLACE'

In [28]:
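new_transcript = []
for line in full_transcript:
    if line[0:6] in ('BIDEN:', 'TRUMP:', 'WALLAC'):
        new_transcript.append(line)
    else:
        # Continuation paragraph: reuse the speaker of the previous row.
        # Reading from new_transcript (not full_transcript) keeps the speaker
        # correct when several unlabeled paragraphs follow one another.
        previous_speaker = new_transcript[-1].split(':')[0]
        new_transcript.append(previous_speaker + ': ' + line)
# Split each row on ':' into speaker and text (colons inside the text also split).
final_list = []
for item in new_transcript:
    final_list.append(item.split(':'))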
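
The fixed six-character slice works here only because this debate has three known speakers. A more general sketch follows, assuming a speaker label is a run of capital letters followed by a colon at the start of a paragraph. `SPEAKER_RE` and `label_paragraphs` are hypothetical names, not part of `debate_scraping_functions.py`.

```python
import re

# Assumption: a speaker label is an all-caps name (spaces, periods,
# apostrophes and hyphens allowed) followed by a colon at the start.
SPEAKER_RE = re.compile(r"^([A-Z][A-Z .'\-]*):\s*(.*)$", re.DOTALL)

def label_paragraphs(paragraphs):
    """Return (speaker, text) pairs, carrying the last speaker forward."""
    labeled = []
    current_speaker = None
    for paragraph in paragraphs:
        match = SPEAKER_RE.match(paragraph)
        if match:
            current_speaker, text = match.group(1), match.group(2)
        else:
            # Unlabeled paragraph: it belongs to whoever spoke last.
            text = paragraph
        labeled.append((current_speaker, text))
    return labeled

sample = [
    'WALLACE: Good evening.',
    'This debate is being conducted under health and safety protocols.',
    'BIDEN: How you doing, man?',
]
rows = label_paragraphs(sample)
print(rows)
```

Because the pattern only consumes the first label, colons later in the paragraph stay inside the text.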

In [29]:
from pandas import DataFrame

In [30]:
df = DataFrame(final_list, columns=['speaker', 'line', 'other'])

In [31]:
df.head()

|   | speaker | line | other |
|---|---|---|---|
| 0 | WALLACE | Good evening from the Health Education Campus... | |
| 1 | WALLACE | This debate is being conducted under health a... | |
| 2 | BIDEN | How you doing, man? | |
| 3 | TRUMP | How are you doing? | |
| 4 | BIDEN | I’m well. | |


Most rows have nothing in the `other` column. Why do some rows have a value there?

In [32]:
df[df.other.notnull()]

|     | speaker | line | other |
|---|---|---|---|
| 626 | TRUMP | Proud Boys, stand back and stand by. But I’ll... | somebody’s got to do something about Antifa a... |
| 794 | WALLACE | All right, gentlemen, final segment | Election integrity. As we meet tonight, milli... |
| 872 | WALLACE | … to be continued in more debates as we go on... | 00 PM Eastern that night. We hope you watch. U... |


In [33]:
df.iloc[626]['line']

' Proud Boys, stand back and stand by. But I’ll tell you what, I’ll tell you what'

In [34]:
full_transcript[626]

'TRUMP: Proud Boys, stand back and stand by. But I’ll tell you what, I’ll tell you what: somebody’s got to do something about Antifa and the left because this is not a right wing problem this is a left-wing. This is a left-wing problem. . .'

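The colon inside the sentence ("I’ll tell you what: somebody’s...") causes an extra split, which pushes the rest of the line into the `other` column.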
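
One way around this is to split on the first colon only. Here is a short sketch using `str.split` with `maxsplit=1`; the sample line is abridged from the transcript above.

```python
line = ('TRUMP: Proud Boys, stand back and stand by. But I’ll tell you what, '
        'I’ll tell you what: somebody’s got to do something about Antifa')

# maxsplit=1 stops after the first colon, so later colons stay in the text.
speaker, text = line.split(':', 1)
text = text.strip()

print(speaker)
print(text)
```

With this approach every row has exactly two fields, so a DataFrame built from it would need only the `speaker` and `line` columns, and times such as "9:00 PM" stay intact.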

## Pulling the Transcripts from Each Debate:

### Commission on Presidential Debates [site](https://www.debates.org/voter-education/debate-transcripts/)

To scrape each debate available on the Commission on Presidential Debates site, I wrote functions in `debate_scraping_functions.py`. This is the data I'll use for the general election (presidential and vice presidential) debates, 1960-2020.

In [35]:
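# Import only the two scraping functions this notebook uses.
from debate_scraping_functions import cpd_transcript_puller, app_transcript_puller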

In [36]:
transcript_home_url = 'https://www.debates.org/voter-education/debate-transcripts/'

In [37]:
cpd_transcript_list = cpd_transcript_puller(transcript_home_url)

In [38]:
cpd_transcript_list[1]

['Vice Presidential Debate at the University of Utah in Salt Lake City, Utah',
 'October 07, 2020',
 'PARTICIPANTS:\nSenator Kamala Harris (D-CA) and\nVice President Mike Pence (R)',
 'MODERATOR:\nSusan Page (USA Today)',
 'PAGE: Good evening. From the University of Utah in Salt Lake City, welcome to the first, and only, vice presidential debate of 2020, sponsored by the nonpartisan Commission on Presidential Debates. I’m Susan Page of USA TODAY. It is my honor to moderate this debate, an important part of our democracy. In Kingsbury Hall tonight we have a small and socially distant audience and we’ve taken extra precautions during this pandemic. Among other things, everyone in the audience is required to wear a face mask and the candidates will be seated 12 feet apart. The audience is enthusiastic about their candidates, but they’ve agreed to express that enthusiasm, only twice. At the end of the debate and now when I introduce the candidates. Please welcome California Senator Kamala 

In [39]:
len(cpd_transcript_list)

48

I now have a list of lines from each debate on this site in `cpd_transcript_list`.

## The American Presidency Project
Using the data UC Santa Barbara hosts on the American Presidency Project [site](https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0), I'll scrape the transcripts from the primary election debates (Republican and Democratic) from 2000 to the present.

In [40]:
app_link = 'https://www.presidency.ucsb.edu/documents/presidential-documents-archive-guidebook/presidential-campaigns-debates-and-endorsements-0'

The function below is defined in `debate_scraping_functions.py`:

In [41]:
app_transcript_list = app_transcript_puller(app_link)

In [42]:
len(app_transcript_list)

169

All of the transcripts are pulled now - 169 in total from the American Presidency Project.

How many different "rows" (i.e. paragraphs in each transcript) do I have in total?

In [43]:
len(app_transcript_list[0]) #Finding how many "rows" are in the first transcript, for comparison

356

OK, 356 rows in the first transcript. How many are in the entire set?

In [45]:
# Total number of rows (paragraphs) across all transcripts.
counter = sum(len(transcript) for transcript in app_transcript_list)
print(counter)

79607


In total, I have 79,607 rows of data. Some of these are introductory parts of the transcripts, so some cleaning will be needed, but that's a good starting point.
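
As a rough idea of what that cleaning could look like, the sketch below drops rows that start with a known intro prefix. It assumes intro rows can be recognized by prefixes such as `PARTICIPANTS:` or `MODERATOR:`, as in the CPD sample shown earlier; the American Presidency Project transcripts may need different rules. `INTRO_PREFIXES` and `drop_intro_rows` are hypothetical names.

```python
# Hypothetical stand-in for a list of transcripts (each a list of paragraphs).
transcripts = [
    ['PARTICIPANTS:\nSenator Kamala Harris (D-CA) and\nVice President Mike Pence (R)',
     'MODERATOR:\nSusan Page (USA Today)',
     'PAGE: Good evening.'],
    ['WALLACE: Good evening.', 'BIDEN: How you doing, man?'],
]

# Assumption: these prefixes mark introductory rows rather than speech.
INTRO_PREFIXES = ('PARTICIPANTS:', 'MODERATORS:', 'MODERATOR:')

def drop_intro_rows(transcript):
    """Keep only rows that do not start with a known intro prefix."""
    return [row for row in transcript if not row.startswith(INTRO_PREFIXES)]

cleaned = [drop_intro_rows(t) for t in transcripts]
total_before = sum(len(t) for t in transcripts)
total_after = sum(len(t) for t in cleaned)
print(total_before, total_after)
```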

## Pickling the two transcript lists:

In [47]:
import pickle

American Presidency Project:

In [50]:
with open('app_transcripts.pickle', 'wb') as to_write:
    pickle.dump(app_transcript_list, to_write)

CPD:

In [54]:
with open('cpd_transcript_list.pickle', 'wb') as to_write_2:
    pickle.dump(cpd_transcript_list, to_write_2)

Both files were then moved to the Data folder in the repo for organization.
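
Reading a pickle back uses `pickle.load` with the file opened in `'rb'` mode. Below is a minimal round-trip sketch with a small stand-in list and a temporary directory, so it does not touch the real files.

```python
import os
import pickle
import tempfile

# Stand-in data; the real lists hold one list of paragraphs per debate.
sample_transcripts = [['PAGE: Good evening.'], ['WALLACE: Good evening.']]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'sample_transcripts.pickle')
    # Write the list to disk in binary mode.
    with open(path, 'wb') as to_write:
        pickle.dump(sample_transcripts, to_write)
    # Read it back and check that nothing changed.
    with open(path, 'rb') as to_read:
        loaded = pickle.load(to_read)

print(loaded == sample_transcripts)
```

Pickle files should only be loaded from sources you trust, since unpickling can execute arbitrary code.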