# Scraping Presidential Speeches

The Miller Center maintains an archive of presidential speeches at https://millercenter.org/the-presidency/presidential-speeches. It covers presidents from George Washington to Donald Trump and is organized by president and then by speech. Each speech has a title, date, and transcript, and some also have a video. The archive is a great resource for anyone interested in the history of the United States.

In [9]:
from bs4 import BeautifulSoup as bs
import pandas as pd
import time
from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from webdriver_manager.firefox import GeckoDriverManager
from selenium.webdriver.common.by import By
import re
import dateparser

## Scrape Links to Speeches from the Miller Center

The first thing we need to do is scrape links to all the presidential speeches. The main page uses JavaScript and only reveals more links as you scroll down the page, so a plain HTTP request would return just the first batch. We therefore use Selenium, which drives a real browser and lets the page's JavaScript run while we scroll.


In [10]:
# the following works on macOS when geckodriver is in the same folder as the script
driver = webdriver.Firefox()

# load page with Selenium
# we need to use selenium because the page loads additional records as you scroll down
# if we used requests, we would only get the first page of speeches
url = "https://millercenter.org/the-presidency/presidential-speeches"
driver.get(url)
driver.implicitly_wait(10)

# keep scrolling down until the page stops loading additional records
pause_scroll = 4
initialcoord = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause_scroll)
    newcoord = driver.execute_script("return document.body.scrollHeight")
    if newcoord == initialcoord:
        break
    initialcoord = newcoord
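
The loop above stops the first time the page height is unchanged, so one slow response could end it early, and it has no upper bound on how long it runs. A minimal sketch of the same idea as a reusable helper follows. The names `scroll_until_stable`, `get_height`, `scroll_to_bottom`, `patience`, and `max_rounds` are hypothetical and not part of Selenium; the two callables are assumed to wrap the `driver.execute_script` calls used above.

```python
import time


def scroll_until_stable(get_height, scroll_to_bottom, pause=4.0, patience=2,
                        max_rounds=500, sleep=time.sleep):
    """Scroll until the height is unchanged `patience` checks in a row.

    Returns the last observed height. All names here are illustrative.
    """
    last_height = get_height()
    unchanged = 0
    for _ in range(max_rounds):
        scroll_to_bottom()
        sleep(pause)  # give the page time to load more records
        new_height = get_height()
        if new_height == last_height:
            unchanged += 1
            if unchanged >= patience:
                break
        else:
            unchanged = 0
            last_height = new_height
    return last_height


# Possible use with the driver from the cell above (assumed, not run here):
# scroll_until_stable(
#     lambda: driver.execute_script("return document.body.scrollHeight"),
#     lambda: driver.execute_script("window.scrollTo(0, document.body.scrollHeight);"),
# )
```

Passing the two browser calls in as functions keeps the stopping rule separate from Selenium, which makes it easy to try out with fake height values before running it against the real page.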

In [13]:
pip install lxml

Note: you may need to restart the kernel to use updated packages.





In [17]:
#retrieve urls to all speeches
page_source = driver.page_source
bsobject_linkpage = bs(page_source, 'html.parser')

links = bsobject_linkpage.find_all("a", href= re.compile('presidential-speeches/'))
link_list = list()
for link in links:
    link_specific = link['href']
    link_list.append(link_specific)
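
The cell above keeps every matching `href` as it appears in the page. Whether those values are absolute or site-relative, and whether the same speech is linked more than once, is not shown here. As a hedged sketch, the helper below resolves each link against the archive URL and drops duplicates while keeping the original order. `normalize_links` and `BASE_URL` are hypothetical names, and the paths in the usage comment are made up for illustration.

```python
from urllib.parse import urljoin

# Assumption: links are resolved relative to the archive page used above.
BASE_URL = "https://millercenter.org/the-presidency/presidential-speeches"


def normalize_links(hrefs, base_url=BASE_URL):
    """Return absolute URLs in first-seen order with duplicates removed."""
    seen = set()
    result = []
    for href in hrefs:
        absolute = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Possible use with the list built above:
# link_list = normalize_links(link_list)
```

Absolute URLs matter later because `driver.get` needs a full address, and removing duplicates avoids loading the same speech page twice.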

## For each URL, scrape the speech

For each of the links that we extracted from the main page, we will load the page and use Beautiful Soup to extract the speech text, title, date, and president name.

NOTE: This will take a long time to run, possibly an hour, because the Miller Center has a lot of speeches. You may also notice that we pause between pages; this keeps us from hitting the server too hard. Some sites detect rapid requests and start rejecting them.
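
One way to express the pause between pages is a small generator that sleeps between consecutive items. This is only a sketch: `polite_pauses`, `base_delay`, and `jitter` are hypothetical names, and the random jitter is an addition that the scraping cell in this notebook does not use.

```python
import random
import time


def polite_pauses(items, base_delay=3.0, jitter=1.0, sleep=time.sleep):
    """Yield each item, sleeping between consecutive items (not before the first)."""
    for position, item in enumerate(items):
        if position > 0:
            # wait base_delay plus a small random extra so requests are not evenly spaced
            sleep(base_delay + random.uniform(0, jitter))
        yield item


# Possible use (assumed, not run here):
# for link in polite_pauses(link_list):
#     driver.get(link)
```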

In [20]:
# Looking at the html content, the relevant classes are:
# president-name, episode-date, speed-loc, about-sidebar--intro,
# presidential-speeches--title, view-transcript
#
# view-transcript content may have multiple "Transcript" headers (h3);
# it will also include a title ending in a colon
#
# scrape the speech

page_wait = 3

title, speech, name, date, about = ([] for _ in range(5))
for link in link_list:
    # access the speech page with Selenium and load the html source into BeautifulSoup
    driver.get(link)
    # the result is discarded; with the implicit wait set earlier, this lookup
    # gives the transcript up to 10 seconds to appear before we read the page
    driver.find_elements(By.CSS_SELECTOR, 'div[class="transcript-inner"]')
    page_source = driver.page_source
    bsobject_speechpage = bs(page_source, 'html.parser')

    #scrape speech and other properties#
    try:
        title.append((bsobject_speechpage.find('h2', class_="presidential-speeches--title").text).rstrip())
    except AttributeError:
        title.append("No title available")
        
    try:
        speech_raw = bsobject_speechpage.find('div', class_="transcript-inner").text
    except AttributeError:
        # no transcript-inner div; fall back to the view-transcript container
        try:
            speech_raw = (bsobject_speechpage.find('div', class_="view-transcript").text).rstrip()
        except AttributeError:
            speech_raw = "No speech available"
    speech.append(re.sub(r"Transcript|\n", " ", speech_raw))
    
    try:
        name.append((bsobject_speechpage.find('p', class_="president-name").text).rstrip())
    except AttributeError:
        name.append("No name available")
    
    try:
        date.append((dateparser.parse(bsobject_speechpage.find('p', class_="episode-date").text)))
    except AttributeError:
        date.append("No date available")
        
    try:
        about.append(bsobject_speechpage.find('div', class_="about-sidebar--intro").text.rstrip())
    except AttributeError:
        about.append("No info available")
    
    time.sleep(page_wait)
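
The cell above repeats the same pattern several times: look up an element, take its text, and fall back to a placeholder when the element is missing. A minimal sketch of that pattern as one helper is shown below. `text_or_default` is a hypothetical name; it relies only on the `find(tag, class_=...)` call and the `.text` attribute already used above, and it checks for `None` instead of catching `AttributeError`.

```python
def text_or_default(soup, tag, css_class, default):
    """Return the right-stripped text of the first match, or `default` if none."""
    element = soup.find(tag, class_=css_class)
    if element is None:
        return default
    return element.text.rstrip()


# Possible use inside the loop above (assumed, not run here):
# name.append(text_or_default(bsobject_speechpage, 'p', 'president-name',
#                             "No name available"))
# about.append(text_or_default(bsobject_speechpage, 'div', 'about-sidebar--intro',
#                              "No info available"))
```

Because every field goes through the same function, each list gets exactly one entry per page, which is what the length check in the final cell depends on.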


## Create dataframe and save to csv

Finally, we assemble all our data into a dataframe and save it to a csv file. We will use this csv file in our next notebook.

In [None]:
#save this to a dataframe and save to a csv file#
if len(title) == len(speech) == len(name) == len(date) == len(about):
    speeches_presidents = pd.DataFrame({'name':name,'title':title,'date':date,'info':about,'speech':speech}, columns=['name','title','date','info','speech'])
    speeches_presidents['speech'] = speeches_presidents['speech'].apply(lambda x: x.replace(".",". "))
    speeches_presidents.to_csv("presidential_speeches.csv", encoding="utf-8",quotechar="'",index=False)
else:
    print("Something went wrong with scraping the speeches. Please check the code.")
    
    # dump each list to its own csv file for debugging
    debug_dumps = [
        ("names.csv", "name", name),
        ("titles.csv", "title", title),
        ("dates.csv", "date", date),
        ("infos.csv", "info", about),
        ("speeches.csv", "speech", speech),
    ]
    for filename, column, values in debug_dumps:
        pd.DataFrame({column: values}).to_csv(filename, encoding="utf-8", quotechar="'", index=False)
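
The files above are written with `quotechar="'"`, and speech text is full of apostrophes and commas, so whatever reads the file back needs the same setting. The sketch below illustrates this with the standard-library `csv` module rather than pandas, on the assumption that the quoting rules carry over; the row contents are invented for the example.

```python
import csv
import io

# Made-up rows: the second field contains both commas and apostrophes.
rows = [["name", "speech"], ["Example President", "We can't, and won't, stop."]]

# Write with an apostrophe as the quote character, as in the cell above.
buffer = io.StringIO()
csv.writer(buffer, quotechar="'").writerows(rows)
written = buffer.getvalue()

# Reading back with the same quotechar restores the original fields.
restored = list(csv.reader(io.StringIO(written), quotechar="'"))

# Reading with the default double-quote character splits the speech on its commas.
misread = list(csv.reader(io.StringIO(written)))

print(restored == rows)   # fields survive the round trip
print(len(misread[1]))    # more than two fields in the data row
```

In the next notebook, `pd.read_csv("presidential_speeches.csv", quotechar="'")` would presumably be the matching call, since `read_csv` also accepts a `quotechar` argument.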