## This script scrapes the top 100 speeches from American Rhetoric using the BeautifulSoup package.

Importing necessary packages

In [1]:
import requests
import re
import pandas as pd
from bs4 import BeautifulSoup

Basic setup and site access.

In [1]:
MainURL='https://www.americanrhetoric.com/top100speechesall.html'
MainPage= requests.get(MainURL, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage.content, 'html.parser')


Obtaining name of the speakers and titles of the speeches from the main page.

In [2]:
names = soup.find_all(attrs={"width": "203"})
nameList = [n.get_text(strip=True) for n in names]

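def findTitles(x):
    # Title links on the index page are rendered in this font color.
    titles = x.find_all(attrs={"color": "#BA1D01"})
    titleList1 = [n.get_text(strip=True) for n in titles]
    # Remove carriage returns, tabs, newlines, and runs of two or more spaces.
    pattern = re.compile(r"\r|\t|\n| {2,}")
    titleListCleaned = [pattern.sub('', title) for title in titleList1]
    # Drop the first match, which is not treated as a speech title.
    titleListCleaned.pop(0)
    return titleListCleaned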
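
The whitespace cleanup inside `findTitles` can be checked in isolation. The following is a minimal sketch on a made-up string (`raw_title` is hypothetical and not taken from the site). It shows that the pattern deletes its matches rather than replacing them with a space, so words separated only by a line break end up fused; splitting and re-joining on whitespace is one possible alternative.

```python
import re

# Same character classes as the pattern in findTitles: carriage returns,
# tabs, newlines, and runs of two or more spaces are removed outright.
pattern = re.compile(r"\r|\t|\n| {2,}")

# Hypothetical scraped title with line-break debris in the middle.
raw_title = "Pearl Harbor\r\n    Address to the Nation"

# Deleting the matches fuses the two words around the line break.
cleaned = pattern.sub("", raw_title)
print(cleaned)

# Alternative: collapse every whitespace run to a single space instead.
normalized = " ".join(raw_title.split())
print(normalized)
```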

Obtaining the URL of each speech from the main page and storing the URLs in a list.

In [3]:
# Keep links to speech pages (excluding PDF copies) plus the off-site speeches.
mainlinks = [a['href'] for a in soup.find_all('a', href=True)
    if 'off site' in a.text
    or (a['href'].startswith('speeches') and 'PDFFiles' not in a['href'])
    or 'Belief and Public Morality' in a.text]
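
The filter can be tried on a few sample links without touching the network. This is a sketch only: the `(text, href)` pairs and the `keep` helper are hypothetical stand-ins for the `<a>` tags on the index page, and the PDF path is invented for illustration.

```python
# Hypothetical (link text, href) pairs standing in for <a> tags on the index page.
anchors = [
    ("I Have A Dream", "speeches/mlkihaveadream.htm"),
    ("PDF", "speeches/PDFFiles/example.pdf"),
    ("The Issue (off site)", "https://www.marxists.org/archive/debs/works/1908/issue.htm"),
    ("Home", "index.htm"),
]

def keep(text, href):
    # `and` binds tighter than `or`, so the PDF exclusion applies only to
    # hrefs that start with "speeches"; the parentheses make that explicit.
    return (
        "off site" in text
        or (href.startswith("speeches") and "PDFFiles" not in href)
        or "Belief and Public Morality" in text
    )

kept = [href for text, href in anchors if keep(text, href)]
print(kept)
```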

Obtaining the text of each speech, and where and when it was delivered, by iterating through the list of URLs.

In [4]:
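deliveredLists = []
speeches = []
for link in mainlinks:
    # Off-site entries in mainlinks are absolute URLs, so this concatenation
    # does not point at the real page for them; those speeches are scraped
    # individually later.
    URL = 'https://www.americanrhetoric.com/' + link
    Page = requests.get(URL, headers={'user-agent': 'Mozilla/5.0'})
    speechsoup = BeautifulSoup(Page.content, 'html.parser')
    # The "delivered ..." line is rendered in this font color.
    findWhere = speechsoup.find_all(attrs={"color": "#CE0A04"})
    deliveredLists.append([n.get_text(strip=True) for n in findWhere])
    # The speech body is set in Verdana on most pages.
    findspeeches = speechsoup.find_all("font", {'face': 'Verdana'})
    speeches.append([n.get_text(strip=True) for n in findspeeches])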
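
Because the link list mixes site-relative paths with absolute off-site URLs, plain string concatenation and URL resolution behave differently. The sketch below compares the two using the standard library's `urljoin`; it is an illustration of the difference, not the approach the notebook takes. Note that resolving the off-site links correctly would still leave pages whose markup differs from the American Rhetoric layout.

```python
from urllib.parse import urljoin

BASE = "https://www.americanrhetoric.com/"

links = [
    "speeches/mlkihaveadream.htm",                                   # site-relative
    "https://www.marxists.org/archive/debs/works/1908/issue.htm",   # absolute, off site
]

# Concatenation prepends the base even to links that are already absolute.
concatenated = [BASE + link for link in links]

# urljoin resolves relative paths against the base and leaves absolute URLs alone.
resolved = [urljoin(BASE, link) for link in links]

for c, r in zip(concatenated, resolved):
    print(c)
    print(r)
```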


Some speeches cannot be scraped with the common tags and attributes above, and the scrape sometimes picks up extra text that we do not need, so the lists should be cleaned. The cleaned lists also show which data is missing and needs extra scraping.

In [5]:
deliveredAtCleaned = []
# Prefixes that mark the line describing where and when a speech was given.
prefixes = ('deliver', 'Deliver', 'Radio', 'Broadcast', 'broadcast',
            'presented', 'Air', 'original', 'Paper')

for place in deliveredLists:
    # Keep the first matching line, or None when nothing matches, so the
    # list stays aligned with mainlinks (exactly one entry per speech).
    deliveredAtCleaned.append(next((p for p in place if p.startswith(prefixes)), None))

In [6]:
speechesCleaned = []

for speech in speeches:
    # Pages that yield four or fewer text fragments did not match the common layout.
    if len(speech) <= 4:
        speechesCleaned.append(None)
    else:
        # Join the fragments, then collapse all whitespace runs to single spaces.
        script = ' '.join(speech)
        script = ' '.join(script.split())
        speechesCleaned.append(script)


Writing the information into a pandas dataframe. 

In [7]:
# Creating DF
df = pd.DataFrame(list(zip(nameList, findTitles(soup), mainlinks, deliveredAtCleaned, speechesCleaned)),
               columns =['Speaker', 'Title', 'Links', "Delivered", 'Speeches'])

df.head(10)

|   | Speaker | Title | Links | Delivered | Speeches |
|---|---|---|---|---|---|
| 0 | Martin Luther King, Jr. | I Have A Dream | speeches/mlkihaveadream.htm | delivered \r\n 28 August 1963, at the Lin... | I am happy to join with you today in what will... |
| 1 | John Fitzgerald Kennedy | Inaugural Address | speeches/jfkinaugural.htm | delivered 20 January 1961, \r\nWashington, D.C. | Vice President Johnson, Mr. Speaker, Mr. Chief... |
| 2 | Franklin Delano Roosevelt | First Inaugural Address | speeches/fdrfirstinaugural.html | Delivered 4 March 1933 | President Hoover, Mr. Chief Justice, my friend... |
| 3 | Franklin Delano Roosevelt | Pearl Harbor Address to the Nation | speeches/fdrpearlharbor.htm | delivered 8 \r\nDecember 1941, Washington, D.C. | Mr. Vice President, Mr. Speaker, Members of th... |
| 4 | Barbara Charline Jordan | 1976 DNC Keynote Address | speeches/barbarajordan1976dnc.html | delivered 12 July 1976, New York, NY | Thank you ladies and gentlemen for a very warm... |
| 5 | Richard Milhous Nixon | Checkers | speeches/richardnixoncheckers.html | delivered and broadcast live on television 23 ... | My Fellow Americans, I come before you tonight... |
| 6 | Malcolm X | The Ballot or the B | http://americanradioworks.publicradio.org/feat... |  |  |
| 7 | Ronald Wilson Reagan | Shuttle 'Challenger' Disaster Address | speeches/ronaldreaganchallenger.htm | delivered 28 January 1986 | Ladies and Gentlemen, I'd planned to speakto y... |
| 8 | John Fitzgerald Kennedy | Houston Ministerial Association | speeches/jfkhoustonministers.html | delivered 12 September 1960 at the Rice Hotel ... | Reverend Meza, Reverend Reck, I'm grateful for... |
| 9 | Lyndon Baines Johnson | We Shall Overcome | speeches/lbjweshallovercome.htm | delivered 15 March 1965, \r\nWashington, D.C. | Mr. Speaker, Mr. President, Members of the Con... |
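
`zip` pairs items by position and stops at the shortest input, so a dataframe built this way is only correct when every list holds exactly one entry per speech, in the same order. The sketch below uses made-up values to show how a length mismatch goes unnoticed, and how a check before building the frame could catch it. `check_same_length` is a hypothetical helper, not part of the notebook.

```python
def check_same_length(**columns):
    """Raise ValueError if the named lists differ in length."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"column lengths differ: {lengths}")

# Made-up data: the second speech has no "delivered" entry at all.
speakers = ["Speaker A", "Speaker B", "Speaker C"]
delivered = ["delivered 1 January 1900", "delivered 3 March 1900"]

# zip silently drops the third speaker and gives the second one the
# date that belongs to the third.
rows = list(zip(speakers, delivered))
print(rows)

try:
    check_same_length(Speaker=speakers, Delivered=delivered)
except ValueError as err:
    print(err)
```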


Checking which data is missing. 

In [8]:
df[df['Speeches'].isna()]

|   | Speaker | Title | Links | Delivered | Speeches |
|---|---|---|---|---|---|
| 6 | Malcolm X | The Ballot or the B | http://americanradioworks.publicradio.org/feat... |  |  |
| 13 | (Gen) Douglas MacArthur | Farewell Address to Congress | speeches/douglasmacarthurfarewelladdress.htm | delivered 19 April 1951, \r\nWashington, D.C. |  |
| 41 | Franklin Delano Roosevelt | The Four Freedoms | speeches/fdrthefourfreedoms.htm | delivered 6 January, 1941 |  |
| 43 | William Jennings Bryan | Against Imperialism | speeches/wjbryanimperialism.htm | delivered 8 August 1900, Indianapolis, IN |  |
| 58 | Mario Matthew Cuomo | Religious Belief and Public Morality | http://archives.nd.edu/research/texts/cuomo.ht... |  |  |
| 79 | Eugene Victor Debs | The Issue (off site) | https://www.marxists.org/archive/debs/works/19... |  |  |
| 82 | Crystal Eastman | Now We Can Begi | https://womenshistory.info/now-can-begin-whats... |  |  |
| 88 | Malcolm X | Message to the Grassroot | http://teachingamericanhistory.org/library/ind... |  |  |
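
The index labels of the missing rows can also be collected programmatically, which is handy when they are needed for the individual fixes. A minimal sketch on a small made-up frame (`df_demo` and its values are hypothetical):

```python
import pandas as pd

# Made-up frame: two of the three rows have no speech text.
df_demo = pd.DataFrame({
    "Speaker": ["Speaker A", "Speaker B", "Speaker C"],
    "Speeches": ["some text", None, None],
})

# Boolean mask of missing speeches, then the matching index labels.
missing_mask = df_demo["Speeches"].isna()
missing_idx = df_demo.index[missing_mask].tolist()
missing_count = int(missing_mask.sum())

print(missing_idx, missing_count)
```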


Eight speeches need to be scraped individually because they come from different sources or use a different HTML structure. The extracted text is then added to the dataframe.

In [9]:
speech6 ='http://americanradioworks.publicradio.org/features/blackspeech/mx.html'
MainPage6= requests.get(speech6, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage6.content, 'html.parser')    
soup6 = soup.select('blockquote p')
speech6 = [n.get_text(strip=True) for n in soup6]
listToStr6 = ' '.join([str(elem) for elem in speech6])
# Assign with .loc so the write goes to df itself, not to a temporary copy.
df.loc[6, 'Speeches'] = listToStr6
df.loc[6, 'Delivered'] = 'King Solomon Baptist Church, Detroit, Michigan - April 12, 1964'
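
Writing a single cell by row label is usually done with `DataFrame.loc`, which assigns on the frame itself rather than on an intermediate Series returned by chained indexing. A minimal sketch on a made-up frame (`df_demo` and the values are hypothetical):

```python
import pandas as pd

# Made-up frame with two rows that still lack text and delivery details.
df_demo = pd.DataFrame({
    "Speeches": [None, None],
    "Delivered": [None, None],
})

# Label-based assignment: row label first, column name second.
df_demo.loc[1, "Speeches"] = "speech text"
df_demo.loc[1, "Delivered"] = "some date"

print(df_demo)
```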

In [10]:
speech13 ='https://www.americanrhetoric.com/speeches/douglasmacarthurfarewelladdress.htm'
MainPage13= requests.get(speech13, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage13.content, 'html.parser') 
soup13 = soup.find_all(attrs={"style": "font-family:Verdana"})
speech13 = [n.get_text(strip=True) for n in soup13]
listToStr13 = ' '.join([str(elem) for elem in speech13])
df.loc[13, 'Speeches'] = listToStr13

In [11]:
speech41 ='https://www.americanrhetoric.com/speeches/fdrthefourfreedoms.htm'
MainPage41= requests.get(speech41, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage41.content, 'html.parser') 
soup41 = soup.select('.MsoNormal')
speech41 = [n.get_text(strip=True) for n in soup41]
listToStr41 = ' '.join([str(elem) for elem in speech41])
df.loc[41, 'Speeches'] = listToStr41

In [12]:
speech43 ='https://www.americanrhetoric.com/speeches/wjbryanimperialism.htm'
MainPage43= requests.get(speech43, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage43.content, 'html.parser') 
soup43 = soup.select('.MsoNormal')
speech43 = [n.get_text(strip=True) for n in soup43]
del speech43[115:]
listToStr43 = ' '.join([str(elem) for elem in speech43])
df.loc[43, 'Speeches'] = listToStr43

In [13]:
speech58 ='http://archives.nd.edu/research/texts/cuomo.htm?DocID=14'
MainPage58= requests.get(speech58, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage58.content, 'html.parser')    
soup58 = soup.select(".mainbody")
speech58 = [n.get_text(strip=True) for n in soup58]
del speech58[0:2]
listToStr58 = ' '.join([str(elem) for elem in speech58])
df.loc[58, 'Speeches'] = listToStr58
df.loc[58, 'Delivered'] = "delivered September 13, 1984, as a John A. O'Brien Lecture in the University of Notre Dame's Department of Theology"

In [14]:
speech79 = 'https://www.marxists.org/archive/debs/works/1908/issue.htm'
MainPage79= requests.get(speech79, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage79.content, 'html.parser')    
soup79 = soup.select("p")
soup79speech = [n.get_text(strip=True) for n in soup79]
# Keep only the paragraphs that belong to the speech itself.
soup79sliced = soup79speech[2:59]
listToStr79 = ' '.join([str(elem) for elem in soup79sliced])
df.loc[79, 'Speeches'] = listToStr79
df.loc[79, 'Delivered'] = "May 23, 1908"

In [15]:
speech82 ='https://womenshistory.info/now-can-begin-whats-next-beyond-woman-suffrage/'
MainPage82= requests.get(speech82, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage82.content, 'html.parser')    
soup82 = soup.select("blockquote")
soup82speech = [n.get_text(strip=True) for n in soup82]
listToStr82 = ' '.join([str(elem) for elem in soup82speech])
df.loc[82, 'Speeches'] = listToStr82
df.loc[82, 'Delivered'] = 'published in 1920'

In [16]:
speech88 ='https://teachingamericanhistory.org/document/message-to-grassroots/'
MainPage88= requests.get(speech88, headers={'user-agent': 'Mozilla/5.0'})
soup= BeautifulSoup(MainPage88.content, 'html.parser')  
soup88 = soup.select('div > p')
soup88speech = [n.get_text(strip=True) for n in soup88]
# Keep only the paragraphs that belong to the speech itself.
soup88sliced = soup88speech[2:41]
listToStr88 = ' '.join([str(elem) for elem in soup88sliced])
df.loc[88, 'Speeches'] = listToStr88
df.loc[88, 'Delivered'] = 'November 10, 1963'
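
The eight cells above repeat the same steps: parse the page, select elements, optionally slice, and join the text. One way to factor that out is sketched below. `extract_text` and `sample_html` are hypothetical, and the helper takes an HTML string so it can be tried without a network request; in the notebook the string would come from `requests.get(...).content`.

```python
from bs4 import BeautifulSoup

def extract_text(html, selector, start=None, stop=None):
    """Join the text of all elements matching a CSS selector.

    start/stop optionally slice the list of matches before joining.
    """
    soup = BeautifulSoup(html, "html.parser")
    parts = [node.get_text(strip=True) for node in soup.select(selector)]
    return " ".join(parts[start:stop])

# Made-up page: the speech sits in a blockquote between unrelated paragraphs.
sample_html = """
<html><body>
<p>Site header</p>
<blockquote><p>First paragraph.</p><p>Second paragraph.</p></blockquote>
<p>Footer</p>
</body></html>
"""

# Select by structure, as in the blockquote-based cells.
text = extract_text(sample_html, "blockquote p")

# Select every <p> and slice away the leading and trailing extras.
sliced = extract_text(sample_html, "p", start=1, stop=3)

print(text)
print(sliced)
```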

Write the dataframe to a CSV file.

In [18]:
df.to_csv('Data/speeches.csv', encoding='utf-8')
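
By default `to_csv` also writes the row index as an unnamed first column. Reading the file back with `pd.read_csv` then shows that column as `Unnamed: 0` unless it is declared as the index. A minimal sketch of the round trip, using an in-memory buffer and a made-up frame in place of the real file:

```python
import io
import pandas as pd

# Made-up frame standing in for the speeches dataframe.
df_demo = pd.DataFrame({"Speaker": ["Speaker A", "Speaker B"], "Title": ["t1", "t2"]})

# Write to an in-memory buffer instead of a file on disk.
buf = io.StringIO()
df_demo.to_csv(buf)

# Plain read: the saved index comes back as an extra column.
buf.seek(0)
reloaded = pd.read_csv(buf)

# Declaring the first column as the index restores the original layout.
buf.seek(0)
reloaded_idx = pd.read_csv(buf, index_col=0)

print(list(reloaded.columns))
print(list(reloaded_idx.columns))
```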