# Scraping the Castro Speech Database


## 1. Scraping the content of a single webpage

In order to process HTML we pull down from the web, we'll be using a Python library called Beautiful Soup. The name comes from "Alice in Wonderland," which is a fun fact you can throw around at parties.

In [1]:
from bs4 import BeautifulSoup
from urllib import request

We'll now need to identify a base link to scrape from.

In [3]:
url = 'http://lanic.utexas.edu/project/castro/1959/'

In [4]:
html = request.urlopen(url).read()
print(html[0:4000])


b'<?xml version="1.0" encoding="utf-8"?>\r\n<!DOCTYPE html\r\n     PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"\r\n     "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">\r\n<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">\r\n<head>\r\n<meta http-equiv="Content-type" content="text/html; charset=utf-8" />\r\n\r\n<link rel="stylesheet" type="text/css"\r\nhref="/css/lanic09.css" />\r\n\r\n<link rel="stylesheet" type="text/css"\r\nhref="/css/lanic08.css" />\r\n\r\n<link rel="stylesheet" type="text/css"\r\nhref="../castro09.css" />\r\n\r\n<link rel="stylesheet" type="text/css"\r\nhref="/css/print/print08.css" media="print" />\r\n\r\n<!--[if lt IE 7]>\r\n    <link rel="stylesheet" type="text/css" href="/css/ie6-08.css" />\r\n    <![endif]-->\r\n\r\n<!--[if IE 7]>\r\n    <link rel="stylesheet" type="text/css" href="/css/ie7-08.css" />\r\n    <![endif]-->\r\n\r\n<title>Castro Speech Data Base - LANIC - Browse Speeches from 1959</title>\r\n</head>\r\n\r\n<body id="castromont

In [5]:
soup = BeautifulSoup(html, 'lxml')

from pprint import pprint
pprint(soup.body)

<body class="initiativepage" id="castromonth">
<div id="container">
<div id="logo">
<h4><a href="/">LANIC</a></h4>
<h1>Latin American Network Information Center</h1>
</div> <!-- end logo box -->
<a class="skip" href="#maincontent">Skip to main content</a>
<a class="skip" id="maincontent"></a>
<div id="main">
<h2>Browse Speeches from 1959</h2>
<div id="nav">
<ul>
<li><a href="/la/cb/cuba/castro.html">CSDB Home</a></li>
</ul>
</div><!-- end nav -->
<a href="/la/cb/cuba/castro.html">
<img alt="" id="castrologo" src="../image/fidel3.jpg" title="Castro Speech Data Base"/></a>
<div id="basic">
<a name="basicsearch"></a>
<div id="basicform">
<!-- Google CSE Search Box Begins  -->
<form action="http://lanic.utexas.edu/world/search/results/castro/" id="searchbox_009303113233185091933:y0x1isqxdmm">
<input name="cx" type="hidden" value="009303113233185091933:y0x1isqxdmm"/>
<input name="cof" type="hidden" value="FORID:11"/>
<input name="q" size="40" type="text"/>
<input name="sa" type="submit" val

The line `soup = BeautifulSoup(html, 'lxml')` says, "take the HTML that you've pulled down and get ready to do Beautiful Soup things to it." In programming speak, we're saying "turn that HTML into a Beautiful Soup object." Saying something is an object is a way of saying "I expect this data to have certain characteristics and be able to do certain things." In this case, Beautiful Soup gives us a series of ways to navigate and search the HTML using its tags, attributes, and CSS selectors.

We can do things like:

* Get all the links

In [8]:
pprint(soup.find_all('a')[0:20])

[<a href="/">LANIC</a>,
 <a class="skip" href="#maincontent">Skip to main content</a>,
 <a class="skip" id="maincontent"></a>,
 <a href="/la/cb/cuba/castro.html">CSDB Home</a>,
 <a href="/la/cb/cuba/castro.html">
<img alt="" id="castrologo" src="../image/fidel3.jpg" title="Castro Speech Data Base"/></a>,
 <a name="basicsearch"></a>,
 <a href="/project/castro/db/1959/19590103.html">Castro
speaks to citizens of Santiago: 01/03/1959</a>,
 <a href="/project/castro/db/1959/19590109.html">Castro
warns against complacency: 01/09/1959</a>,
 <a href="/project/castro/db/1959/19590109-1.html">Castro
speech delivered in ciudad libertad: 01/09/1959</a>,
 <a href="/project/castro/db/1959/19590109-2.html">Castro
speaks at presidential palace: 01/09/1959</a>,
 <a href="/project/castro/db/1959/19590117.html">Castro
speaks at chibas tomb: 01/17/1959</a>,
 <a href="/project/castro/db/1959/19590117-1.html">Castro
delivers speech at presidential palace: 01/17/1959</a>,
 <a href="/project/castro/db/19

* Get all the text

In [10]:
print(soup.text)










Castro Speech Data Base - LANIC - Browse Speeches from 1959




LANIC
Latin American Network Information Center
 
Skip to main content


Browse Speeches from 1959


CSDB Home

















 
January 1959

Castro
speaks to citizens of Santiago: 01/03/1959

Castro
warns against complacency: 01/09/1959

Castro
speech delivered in ciudad libertad: 01/09/1959

Castro
speaks at presidential palace: 01/09/1959 
Castro
speaks at chibas tomb: 01/17/1959

Castro
delivers speech at presidential palace: 01/17/1959

Castro
speaks before havana rally: 01/21/1959

Castro
broadcasts to venezuelan people: 01/24/1959 
Public
opinion campaign: 01/ 25/1959

Castro
continues to assail dictators: 01/25/1959

Cubans
discontented over delay in trials: 01/25/1959

OAS
assailed by Castro as useless: 01/25/1959 
Castro-betancourt
meeting: 01/25/1959

Means
for ibero-american unity suggested: 01/26/1959

Mexican
interview by Guillermo Vela: 01/27/1959


February 1959

Speech
to 15,000 

It might not be very clear, but that's just the text of the webpage as one long string with all the HTML stripped out.
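
If the long runs of blank lines get in the way, one option is to keep only the non-empty lines. The sketch below uses a short hard-coded sample in place of `soup.text`, and the helper name `tidy_lines` is hypothetical (it is not part of Beautiful Soup).

```python
def tidy_lines(text):
    """Return the non-empty lines of text, stripped of surrounding whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]

# A small stand-in for soup.text, copied from the output above.
sample = (
    "\n\n\nLANIC\nLatin American Network Information Center\n \n"
    "Skip to main content\n\n\nBrowse Speeches from 1959\n"
)

for line in tidy_lines(sample):
    print(line)
```

In the notebook you would call `tidy_lines(soup.text)` instead of passing `sample`.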

How many links are there on this page anyway? We can find out by checking out the length of this ResultSet:

In [8]:
print(len(soup.find_all('a')))

92


## 2. Scraping the text of a single speech

Now that we've scraped the webpage containing an index of all the Castro speeches from 1959, let's focus on scraping a single speech from this index.

In [9]:
url = 'http://lanic.utexas.edu/project/castro/db/1959/19591128-1.html'
html = request.urlopen(url).read()
soup = BeautifulSoup(html, 'lxml')
raw_text = soup.select('pre')
clean_text = raw_text[0].text
print(clean_text)


-DATE-
19591128
-YEAR-
1959
-DOCUMENT_TYPE-
SPEECH
-AUTHOR-
F. CASTRO
-HEADLINE-
SPEAKS AT UNIVERSITY OF HAVANA
-PLACE-
HAVANA
-SOURCE-
BOLETIN DE PRENSA
-REPORT_NBR-
FBIS
-REPORT_DATE-
19591127
-TEXT-
CASTRO SPEAKS AT UNIVERSITY OF HAVANA ON 27 NOVEMBER 1959

Source:  Boletin [de Prensa] No. 90, Ministerio de Estado, Republica de
Cuba, 28 November 1959, Havana, pp 1-18

University comrades:

I would like to make it clear first of all why I am wearing the
uniform of the university militia.  First, it is because I am still a
student, and second, because the comrades of the University Battalion
honored me by making me a gift of it.  And also, because we are certain
that we can honor it.

This 27 November has been ... just a moment, if there is a
counterrevolutionary here, let his stay, do not disturb him.  There are
many more out there in the street, and they are not bothering anyone
much.  Also, I believe that any counterrevolutionary here will simpl

The `'pre'` in the code above is a CSS selector: it tells Beautiful Soup which elements of the HTML document to pick out. I know to use that particular selector because I examined the HTML for the page to see how it is organized. You can do the same by going to the webpage, right-clicking the element you want, and selecting "Inspect Element". In plain terms, the code says, "find all the `pre` tags, take the first one, and print out its text."

Alright, so now that we've been able to extract the text from a single speech in the Castro speech database, let's expand the scope of our scraping to extract the text of hundreds of Castro speeches spanning several years!

## 3. Scraping Hundreds of Speeches from 1959-1970

First, create a new folder within your shared folder on the desktop to hold the indices of speeches for the years 1959-1970.

`os` is Python's 'operating system' library, `chdir` is shorthand for 'change directory', and `mkdir` is shorthand for 'make directory'. The `!` in front of `mkdir` tells the notebook to run it as a shell command rather than as Python.

In [None]:
import os
!mkdir -p castro_speeches/indices-of-speeches_1959to1970
os.chdir('castro_speeches/indices-of-speeches_1959to1970')
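
The cell above mixes a shell command with Python. If `mkdir` is not available on your system, roughly the same thing can be done in Python alone with `os.makedirs`. This is a minimal sketch: the helper name `make_index_dir` is hypothetical, and the demonstration runs in a temporary directory so that it leaves nothing behind.

```python
import os
import tempfile

def make_index_dir(root):
    """Create the nested index folder under root and return its path."""
    path = os.path.join(root, 'castro_speeches', 'indices-of-speeches_1959to1970')
    os.makedirs(path, exist_ok=True)  # like `mkdir -p`: no error if it already exists
    return path

# Try it out in a throwaway directory.
with tempfile.TemporaryDirectory() as scratch:
    index_dir = make_index_dir(scratch)
    created = os.path.isdir(index_dir)
    make_index_dir(scratch)  # calling it a second time is harmless

print(created)
```

In the notebook itself, `os.chdir(make_index_dir('.'))` would create the folder and move into it in one step.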

Open your shared folder on the desktop to take a look and see if the folder was created.

Now we're going to identify the base URL we'll be scraping from. All the URLs in the Castro database reflect a highly structured, uniform naming system, with each file assigned a unique identifier and nested within a directory structure by year.
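
To see that structure concretely, here is a small sketch that pulls the year and the file identifier out of one of the speech URLs listed earlier. The helper name `describe_speech_url` is hypothetical, and the sketch assumes every speech URL follows the same `/db/<year>/<identifier>.html` pattern as the example.

```python
from pathlib import PurePosixPath
from urllib.parse import urlparse

def describe_speech_url(url):
    """Split a speech URL into its year directory and file identifier."""
    path = PurePosixPath(urlparse(url).path)
    return {'year': path.parent.name, 'identifier': path.stem}

info = describe_speech_url(
    'http://lanic.utexas.edu/project/castro/db/1959/19590109-1.html'
)
print(info)
```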

Using a command-line tool called `wget` (the name combines 'World Wide Web' and 'get'), we will download the index page of speeches for each year from 1959 to 1970.

These files will be written into the folder `indices-of-speeches_1959to1970`.

(Note that the year range in the code ends at 1971, not 1970. This is because Python's `range` stops just before its end value, so `range(1959, 1971)` covers 1959 through 1970.)
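
A quick way to check which years a range covers is to turn it into a list. The sketch below also builds the index URLs by the same concatenation used in the next cell, without downloading anything.

```python
years = list(range(1959, 1971))
print(years[0], years[-1], len(years))

base_url = 'http://lanic.utexas.edu/project/castro/'
index_urls = [base_url + str(year) for year in years]
print(index_urls[-1])
```

The last year in the list is 1970, because `range` never includes its end value.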


In [None]:
import subprocess

base_url = 'http://lanic.utexas.edu/project/castro/'

for year in range(1959, 1971):  # 1959 through 1970
    url = base_url + str(year)
    # --adjust-extension saves each downloaded page with an .html suffix
    subprocess.call(['wget', url, '--adjust-extension'])

Check the folder `indices-of-speeches_1959to1970` you made within the shared folder on the desktop. Hopefully you see one HTML file for each year from 1959 to 1970! You've successfully scraped HTML files from the web and can use these files to extract the specific text you're interested in!


In [None]:
html_filenames = [item for item in os.listdir('./') if '.html' in item]

html_filenames[:10]  # shows the filenames for the first ten items in our list

Now we're going to use Beautiful Soup to extract the URLs corresponding to each speech listed in the yearly indices.

In [None]:
from bs4 import BeautifulSoup

url_list = []

for file in html_filenames:
    with open(file) as file_in:
        page = file_in.read()
    soup = BeautifulSoup(page, 'lxml')
    for link in soup.find_all('a'):
        try:
            url_list.append('http://lanic.utexas.edu' + link['href'])
        except KeyError as e:
            # <a> tags with no href attribute end up here
            print(e)

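Each `'href'` printed above is an error message from an `<a>` tag that has no `href` attribute (for example, a named anchor such as `<a name="basicsearch"></a>`). These tags appear throughout the HTML files, and we can safely skip them.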
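
The cell above builds absolute URLs by gluing the site address onto each `href`. That works for links beginning with `/`, but not for relative links or page fragments. As a hedged alternative (this is not what the notebook itself does), the standard library's `urljoin` resolves each `href` against the address of the page it came from. The three `href` values below are copied from the 1959 index page shown earlier.

```python
from urllib.parse import urljoin

index_url = 'http://lanic.utexas.edu/project/castro/1959/'
hrefs = [
    '/project/castro/db/1959/19590103.html',  # path starting at the site root
    '../image/fidel3.jpg',                    # path relative to the index page
    '#maincontent',                           # fragment within the index page
]

absolute = [urljoin(index_url, href) for href in hrefs]
for url in absolute:
    print(url)
```

Either way, the `/db/` filter applied later keeps only the speech pages.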

In [None]:
url_list # prints out the content of the list

How many speech URLs did we scrape? Let's find out!

In [None]:
len(url_list) # len means length. We're finding the number of URLs in our list.

Let's create a filtered list of only the urls that correspond to speeches.

In [None]:
urls_filtered = []

for item in url_list:
    if '/db/' in item:
        urls_filtered.append(item)

urls_filtered[:50]

Great! We have a filtered list of all the URLs we need. Now let's copy this list to a text file!

In [None]:
os.chdir('..')
with open('urls-speeches.txt', 'w') as file_out:
    for url in urls_filtered:
        file_out.write(url)
        file_out.write('\n')

Now we have a text file containing all the URLs we want to scrape content from. Open the `castro_speeches` folder inside your shared folder on the desktop to find the `urls-speeches.txt` file we just made.

The next step is to create a new folder called 'speeches' where we'll dump all the text we scrape from the list of URLs we have.

In [None]:
!mkdir speeches

Now we're going to bulk scrape the HTML content of each URL that corresponds to a speech. Note that when we run the cell below, it'll take approximately 56 seconds. When that time has lapsed, we should have successfully downloaded 443 HTML files.

In [None]:
# -i reads the URLs from a file, --wait pauses between requests, -P sets the download folder
!wget -i urls-speeches.txt --wait=0.1 -P ./speeches/

Hooray! We now have an HTML file for each speech in the Castro Speech Database from 1959 to 1970!
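
If `wget` is not installed, the same polite download loop can be sketched in Python with `urllib`. This is an illustration under stated assumptions, not the method used above: the names `fetch_bytes` and `download_all` are hypothetical, and each file is saved under the last part of its URL path.

```python
import os
import time
from urllib import request
from urllib.parse import urlparse

def fetch_bytes(url):
    """Download one URL and return its raw bytes."""
    with request.urlopen(url) as response:
        return response.read()

def download_all(urls, dest, wait=0.1, fetch=fetch_bytes):
    """Save each URL into dest, pausing between requests; return the saved paths."""
    os.makedirs(dest, exist_ok=True)
    saved = []
    for url in urls:
        filename = os.path.basename(urlparse(url).path)
        target = os.path.join(dest, filename)
        with open(target, 'wb') as file_out:
            file_out.write(fetch(url))
        saved.append(target)
        time.sleep(wait)  # pause between requests, like wget's --wait
    return saved
```

To use it, read the non-empty lines of `urls-speeches.txt` into a list and call `download_all(urls, 'speeches')`. The `fetch` parameter exists only so the loop can be tried out with a stand-in function instead of a real download.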

Open one of the HTML files in a text editor, for example by Control-clicking it and selecting "TextWrangler" from the "Open With" dropdown menu.

You'll see that the speech text we want is still embedded in HTML. Let's run Beautiful Soup to extract the speech texts from all 443 HTML files. Then, we'll create plain text files for each speech!

In [None]:
os.chdir('speeches')

html_filenames = [item for item in os.listdir('./') if '.html' in item]

for file in html_filenames:
    try:
        # read the raw bytes and decode them as Latin-1
        with open(file, 'rb') as file_in:
            page = file_in.read().decode('latin1')
        soup = BeautifulSoup(page, 'lxml')
        raw_text = soup.select('pre')  # the speech text sits inside <pre> tags
        clean_text = raw_text[0].text
        with open(file.replace('.html', '.txt'), 'w') as file_out:
            file_out.write(clean_text)
    except Exception as e:
        # report which file failed and why, then carry on with the rest
        print(file, e)

You should now have clean, plain text files of all the 1959 to 1970 speeches. With a few modifications to the code above, you can easily extend this web scraping to the entire database for speeches from 1959 to 1996!