# Web Scraping Demo

### Congressional Legislation, Rhetoric, and Calendar Data


**Goals:**
* I want a list of all current Congress members.
* I want to match them to their Wikipedia pages.
* I want a database of Congressional Twitter rhetoric.
* I want a database of all Congressional bills.

In this demo we are going to use a suite of standard Python packages to demonstrate how to scrape and collect the Congressional data listed above.

Our primary source is going to be **GovTrack.us** (https://www.govtrack.us).


We will specifically be using:

* **`bs4`** - BeautifulSoup4, a package for parsing common web formats such as HTML and XML.
* **`json`** - A built-in Python module for reading and writing JSON data, which maps naturally onto Python dictionaries and lists.
* **`tweepy`** - A third-party Python client for the Twitter API, used here to download tweets and timelines.
* **`PDFMiner`** - A package that extracts text from PDF files.


### Exercise 1.) Parsing Bills / Members of Congress

We will examine how to parse bills / member data from various web formats using **`bs4`**.

**Part A.** Let's look at a single Congressional bill and its corresponding sponsor.

In [7]:
import bs4
import urllib2

# An example Congressional bill 
url = "https://www.govtrack.us/congress/bills/115/sres80"

# Make an HTTP request to retrieve the HTML page
response = urllib2.urlopen(url)

print response.info()

Server: nginx/1.4.6 (Ubuntu)
Date: Fri, 17 Mar 2017 03:32:53 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: public
X-Konklone-Force-HTTPS: TRUE
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload



In [8]:
# The 'soup' contains all the parsed data we need
soup = bs4.BeautifulSoup(response.read(), "html.parser")
print soup.prettify()[:1000]
print "..."

<!DOCTYPE html>
<html>
 <head>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible">
   <meta charset="utf-8">
    <meta content="width=device-width, initial-scale=1" name="viewport">
     <title>
      A resolution designating March 3, 2017, as “World Wildlife Day”. (S.Res. 80) - GovTrack.us
     </title>
     <link href="/favicon.ico" rel="shortcut icon" type="image/x-icon"/>
     <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css" integrity="sha384-BVYiiSIFeK1dGmJRAkycuHAHRg32OmUcww7on3RYdg4Va+PmSTsz/K68vbdEjh4u" rel="stylesheet">
      <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap-theme.min.css" integrity="sha384-rHyoN1iRsVXV4nD0JutlnGaslCJuC7uwjduW9SVrLvRYooPp2bWYgmgJQIXwl/Sp" rel="stylesheet">
       <link crossorigin="anonymous" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.3.0/css/font-awesome.min.css" integrity="sha256-VBrFgheoreGl4pKmWgZh3J23pJrhNlSUOBe

In [9]:

# Soup attributes correspond to HTML tags,
# SEE https://www.w3schools.com/tags/ for reference.
print soup.title
print soup.title.string

# We've identified a unique id for our data of interest
bill_overview_id = "bill-overview-panel"

overview_tag = soup.find_all(id=bill_overview_id)[0]

<title>A resolution designating March 3, 2017, as “World Wildlife Day”. (S.Res. 80) - GovTrack.us</title>
A resolution designating March 3, 2017, as “World Wildlife Day”. (S.Res. 80) - GovTrack.us
<p>Follow GovTrack on social media for more updates:</p>


Once you've narrowed down the relevant section of the HTML, you 'crawl' through it to collect the data you want. We have our overview panel, so let's clean it up now!

In [10]:
detail_name_tags = overview_tag.find_all("dt")
detail_data_tags = overview_tag.find_all("dd")

print detail_name_tags[0].text
print detail_data_tags[0].text

print "\n"

print detail_name_tags[1].text
print detail_data_tags[1].text.strip().split("\n")[0]

print "\n"

print detail_name_tags[2].text
print detail_data_tags[2].text.strip()

print "\n"

Introduced:
Mar 2, 2017
115th Congress, 2017–2019


Status:
Agreed To (Simple Resolution)


Sponsor:
Chris Coons
Junior Senator from Delaware
Democrat
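
The label/value pairs printed above can also be collected into a dictionary, which is easier to store than printed text. The following is a minimal sketch under a few assumptions: it uses the standard library's `html.parser` instead of `bs4` so that it runs without extra installs, it is written in Python 3 rather than the Python 2 of the cells above, and `SAMPLE_HTML` is a hand-written fragment shaped like the overview panel, not the real page.

```python
from html.parser import HTMLParser


class DefinitionListParser(HTMLParser):
    """Collect the text of each <dt> and <dd> tag, in document order."""

    def __init__(self):
        super().__init__()
        self.names = []      # text of each <dt>
        self.values = []     # text of each <dd>
        self._target = None  # list currently receiving text, if any

    def handle_starttag(self, tag, attrs):
        if tag == "dt":
            self.names.append("")
            self._target = self.names
        elif tag == "dd":
            self.values.append("")
            self._target = self.values

    def handle_endtag(self, tag):
        if tag in ("dt", "dd"):
            self._target = None

    def handle_data(self, data):
        # Only keep text that sits inside a <dt> or <dd>
        if self._target is not None:
            self._target[-1] += data


# Hand-written fragment shaped like the overview panel (an assumption).
SAMPLE_HTML = """
<dl id="bill-overview-panel">
  <dt>Introduced:</dt><dd>Mar 2, 2017</dd>
  <dt>Status:</dt><dd> Agreed To (Simple Resolution) </dd>
  <dt>Sponsor:</dt><dd>Chris Coons</dd>
</dl>
"""

parser = DefinitionListParser()
parser.feed(SAMPLE_HTML)

# Pair each label with its value, dropping the trailing colon on labels
overview = {
    name.strip().rstrip(":"): value.strip()
    for name, value in zip(parser.names, parser.values)
}
print(overview)
```

With `bs4`, the same pairing can be done by zipping the two `find_all` results from the cell above.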




**Can we match this to the sponsor's Wikipedia page?**

In [12]:
import bs4
import urllib2

d = urllib2.urlopen("https://en.wikipedia.org/wiki/Chris_Coons")

soup = bs4.BeautifulSoup(d.read(), "html.parser")

info_table = soup.find_all(class_="infobox vcard")

tbl_rows = info_table[0].find_all("tr")
for row in tbl_rows:
    cells = row.find_all("td")  # table cells
    header = row.find("th")
    if header is not None:
        print "\nTitle: " + header.text
    if len(cells) != 0:
        for cell in cells:
            cell_text = cell.text.strip()
            if len(cell_text) != 0:
                print "Desc: " + cell_text
    
    


Title: Chris Coons

Title: United States Senator
from Delaware
Desc: Incumbent
Desc: Assumed office
November 15, 2010
Serving with Tom Carper

Title: Preceded by
Desc: Ted Kaufman

Title: Vice Chair of the Senate Ethics Committee
Desc: Incumbent
Desc: Assumed office
January 3, 2017

Title: Preceded by
Desc: Barbara Boxer

Title: County Executive of New Castle County
Desc: In office
January 4, 2005 – November 15, 2010

Title: Preceded by
Desc: Thomas Gordon

Title: Succeeded by
Desc: Paul Clark

Title: President of the New Castle County Council
Desc: In office
January 2, 2001 – January 4, 2005

Title: Preceded by
Desc: Stephanie Hansen

Title: Succeeded by
Desc: Paul Clark

Title: Personal details

Title: Born
Desc: Christopher Andrew Coons
(1963-09-09) September 9, 1963 (age 53)
Greenwich, Connecticut, U.S.

Title: Political party
Desc: Republican (before 1988)
Democratic (1988–present)

Title: Spouse(s)
Desc: Annie Lingenfelter

Title: Children
Desc: 3

Title: Education
Desc: Amherst
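
The Wikipedia URL above was typed by hand. To match many members at once, a URL can be guessed from each member's name. The sketch below is only a heuristic, written in Python 3: the helper name `wikipedia_url` is hypothetical, and the guess will miss whenever the article title differs from the display name (a nickname, a middle initial, or a disambiguation suffix), so each guess should be checked.

```python
import re
from urllib.parse import quote


def wikipedia_url(display_name):
    """Guess an English Wikipedia URL from a GovTrack-style display name."""
    # Drop a leading title such as "Sen." or "Rep."
    name = re.sub(r"^(Sen|Rep)\.\s+", "", display_name.strip())
    # Drop a trailing party/state tag such as "[R-TN]"
    name = re.sub(r"\s*\[[^\]]*\]$", "", name)
    # Wikipedia titles use underscores in place of spaces
    return "https://en.wikipedia.org/wiki/" + quote(name.replace(" ", "_"))


print(wikipedia_url("Chris Coons"))
print(wikipedia_url("Sen. Lamar Alexander [R-TN]"))
```

A guessed URL that does not resolve to the right article is a signal to fall back to a manual lookup for that member.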

**Part B.** Now let's download an entire list of U.S. Senators.

In [14]:
# XML file of senators
url = "https://www.govtrack.us/api/v2/role?current=true&format=xml"

resp = urllib2.urlopen(url)
print "Reading in ..."
soup = bs4.BeautifulSoup(resp.read(), "lxml")

print "OK"

Reading in ...
OK


In [15]:
# Let's traverse immediate children of the root tag
mbrs = soup.find("objects").findChildren("item", recursive=False)
print "TOTAL: " + str(len(mbrs))

TOTAL: 100


In [16]:
children = mbrs[0].findChildren()
for child in children:
    print child.name + ": " + child.text

caucus: null
congress_numbers: 
114
115
116

item: 114
item: 115
item: 116
current: True
description: Senior Senator from Tennessee
district: null
enddate: 2021-01-03
extra: 
455 Dirksen Senate Office Building Washington DC 20510
http://www.alexander.senate.gov/public/index.cfm?p=Email
202-228-3398
455 Dirksen Senate Office Building
http://www.alexander.senate.gov/public/?a=rss.feed

address: 455 Dirksen Senate Office Building Washington DC 20510
contact_form: http://www.alexander.senate.gov/public/index.cfm?p=Email
fax: 202-228-3398
office: 455 Dirksen Senate Office Building
rss_url: http://www.alexander.senate.gov/public/?a=rss.feed
id: 43255
leadership_title: null
party: Republican
person: 
A000360
1940-07-03
5
Lamar
male
Male
300002
Alexander
https://www.govtrack.us/congress/members/lamar_alexander/300002
        
Sen. Lamar Alexander [R-TN]


N00009888
15691
Alexander, Lamar (Sen.) [R-TN]
SenAlexander
lamaralexander

bioguideid: A000360
birthday: 1940-07-03
cspanid: 5
firstname: L

In [17]:
# Let's do this again
# but 'parse out' the `person` field this time.

children = mbrs[0].findChildren()
for child in children:
    if child.name in ["birthday", "party", "twitterid"]:
        print child.name + ": " + child.text
    if child.name == "person":
        print "name: " + child.find("name").text
        print "gender: " + child.find("gender").text
        print "cspanid: " + child.find("cspanid").text

party: Republican
name: Sen. Lamar Alexander [R-TN]
gender: male
cspanid: 5
birthday: 1940-07-03
twitterid: SenAlexander
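
The same fields can be pulled out by path instead of by looping over every descendant tag. This is a sketch in Python 3 using the standard library's `xml.etree.ElementTree` rather than `bs4`. `SAMPLE_XML` is an abbreviated, hand-written document shaped like the output above; the root tag name `response` and the helper name `parse_members` are assumptions.

```python
import xml.etree.ElementTree as ET

# Abbreviated sample shaped like the API output above (not real API output).
SAMPLE_XML = """
<response>
  <objects>
    <item>
      <party>Republican</party>
      <person>
        <birthday>1940-07-03</birthday>
        <cspanid>5</cspanid>
        <name>Sen. Lamar Alexander [R-TN]</name>
        <twitterid>SenAlexander</twitterid>
      </person>
    </item>
    <item>
      <party>Independent</party>
      <person>
        <birthday>1944-03-31</birthday>
        <cspanid>37413</cspanid>
        <name>Sen. Angus King [I-ME]</name>
        <twitterid>SenAngusKing</twitterid>
      </person>
    </item>
  </objects>
</response>
"""


def parse_members(xml_text):
    """Return one dictionary per <item> under <objects>."""
    root = ET.fromstring(xml_text)
    members = []
    for item in root.findall("./objects/item"):
        person = item.find("person")
        members.append({
            "name": person.findtext("name"),
            "birthday": person.findtext("birthday"),
            "party": item.findtext("party"),      # lives on the role, not the person
            "twitterid": person.findtext("twitterid"),
            "cspanid": person.findtext("cspanid"),
        })
    return members


members = parse_members(SAMPLE_XML.strip())
for member in members:
    print(member)
```

Looking fields up by path makes it explicit which values belong to the role (`party`) and which to the nested `person` record.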


**We can now do this for every single Senator in our file! Why don't we turn this into a nice dataframe and save it?**

In [22]:
import pandas as pd 

columns=["name", "birthday", "party", "twitterid", "cspanid"]
df = pd.DataFrame([], columns=columns)


for mbr in mbrs:
    # Reset every field so values never carry over from the previous member
    name = birthday = party = twitterid = cspanid = None
    children = mbr.findChildren()
    for child in children:
        if child.name == "birthday":
            birthday = child.text
        if child.name == "party":
            party = child.text
        if child.name == "twitterid":
            twitterid = child.text
        if child.name == "person":
            name = child.find("name").text
            gender = child.find("gender").text
            cspanid = child.find("cspanid").text
    if name is not None:
        new_row_df = pd.DataFrame([[name, birthday, party, twitterid, cspanid]], columns=columns)
        df = df.append(new_row_df, ignore_index=True)

from IPython.display import display
display(df)



| | name | birthday | party | twitterid | cspanid |
|---|---|---|---|---|---|
| 0 | Sen. Lamar Alexander [R-TN] | 1940-07-03 | Republican | SenAlexander | 5 |
| 1 | Sen. Roger Wicker [R-MS] | 1951-07-05 | Republican | SenatorWicker | 18203 |
| 2 | Sen. Timothy Kaine [D-VA] | 1958-02-26 | Democrat | SenKaineOffice | 49219 |
| 3 | Sen. Ted Cruz [R-TX] | 1970-12-22 | Republican | SenTedCruz | 1019953 |
| 4 | Sen. Deb Fischer [R-NE] | 1951-03-01 | Republican | SenatorFischer | 1034067 |
| 5 | Sen. Heidi Heitkamp [D-ND] | 1955-10-30 | Democrat | SenatorHeitkamp | 95414 |
| 6 | Sen. Angus King [I-ME] | 1944-03-31 | Independent | SenAngusKing | 37413 |
| 7 | Sen. Elizabeth Warren [D-MA] | 1949-06-22 | Democrat | SenWarren | 1023023 |
| 8 | Sen. Joe Manchin [D-WV] | 1947-08-24 | Democrat | Sen_JoeManchin | 62864 |
| 9 | Sen. Martin Heinrich [D-NM] | 1971-10-17 | Democrat | MartinHeinrich | 1030686 |


**Try extending this on your own!**
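
The cells above build the table but never write it out. Here is a sketch of the saving step in Python 3 using only the standard library's `csv` module; with pandas, `df.to_csv` serves the same purpose. The helper names are hypothetical, the two rows are copied from the table above, and an in-memory buffer stands in for a real file.

```python
import csv
import io

COLUMNS = ["name", "birthday", "party", "twitterid", "cspanid"]

# Two rows copied from the Senator table above.
rows = [
    {"name": "Sen. Lamar Alexander [R-TN]", "birthday": "1940-07-03",
     "party": "Republican", "twitterid": "SenAlexander", "cspanid": "5"},
    {"name": "Sen. Roger Wicker [R-MS]", "birthday": "1951-07-05",
     "party": "Republican", "twitterid": "SenatorWicker", "cspanid": "18203"},
]


def write_members_csv(rows, handle):
    """Write one CSV line per member, preceded by a header line."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


def read_members_csv(handle):
    """Read the members back as a list of dictionaries."""
    return list(csv.DictReader(handle))


# An in-memory buffer stands in for a file on disk in this demo.
buffer = io.StringIO()
write_members_csv(rows, buffer)
buffer.seek(0)
restored = read_members_csv(buffer)
print(restored[0]["name"])
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")`, where `path` is whatever file name you choose.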

**Part C.** Let's parse some roll call votes for a specific bill ... but wait, it's in JSON format!

**.. How do we work with JSON data?**

In [19]:
import json

# File of roll call votes for a specific bill
url ="https://www.govtrack.us/data/congress/113/votes/2014/h108/data.json"
response = urllib2.urlopen(url)
dict_obj = json.loads(response.read())


In [20]:
print dict_obj.keys()
print dict_obj["amendment"]
print dict_obj["bill"]

[u'amendment', u'category', u'votes', u'congress', u'result_text', u'type', u'bill', u'question', u'number', u'source_url', u'updated_at', u'chamber', u'session', u'result', u'date', u'vote_id', u'requires']
{u'type': u'h-bill', u'number': 1, u'author': u'Jackson Lee of Texas Part C Amendment No. 1'}
{u'type': u'hr', u'number': 2641, u'congress': 113}


In [24]:
cols = ["name", "party", "state", "vote", "bill", "amendment", "year"]
df2 = pd.DataFrame([], columns=cols)

# The file is House roll call vote 108 of 2014; the bill and amendment
# numbers come from the record itself (H.R. 2641, amendment 1).
bill = "H.R. " + str(dict_obj["bill"]["number"])
amendment = str(dict_obj["amendment"]["number"])
year = "2014"

for aye in dict_obj["votes"]["Aye"]:
    name = aye["display_name"]
    party = aye["party"]
    state = aye["state"]
    vote = "aye"
    new_row_df = pd.DataFrame([[name, party, state, vote, bill, amendment, year]], columns=cols)
    df2 = df2.append(new_row_df, ignore_index=True)

display(df2)

| | name | party | state | vote | bill | amendment | year |
|---|---|---|---|---|---|---|---|
| 0 | Barber | D | AZ | aye | H.R. 2641 | 1 | 2014 |
| 1 | Bass | D | CA | aye | H.R. 2641 | 1 | 2014 |
| 2 | Beatty | D | OH | aye | H.R. 2641 | 1 | 2014 |
| 3 | Becerra | D | CA | aye | H.R. 2641 | 1 | 2014 |
| 4 | Bera (CA) | D | CA | aye | H.R. 2641 | 1 | 2014 |
| 5 | Bishop (NY) | D | NY | aye | H.R. 2641 | 1 | 2014 |
| 6 | Blumenauer | D | OR | aye | H.R. 2641 | 1 | 2014 |
| 7 | Bonamici | D | OR | aye | H.R. 2641 | 1 | 2014 |
| 8 | Brady (PA) | D | PA | aye | H.R. 2641 | 1 | 2014 |
| 9 | Braley (IA) | D | IA | aye | H.R. 2641 | 1 | 2014 |


**You could now extend this dataframe to cover any bill and all of its amendments!**
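
One step toward that extension is to collect every vote option in the file, not only the "Aye" list. The sketch below is in Python 3. `SAMPLE_JSON` is a hand-written record shaped like the file above: the "No" key and its placeholder voter are assumptions, and `flatten_votes` is a hypothetical helper name.

```python
import json

# Hand-written record shaped like the vote file (not the real file contents).
SAMPLE_JSON = """
{
  "bill": {"type": "hr", "number": 2641, "congress": 113},
  "votes": {
    "Aye": [
      {"display_name": "Barber", "party": "D", "state": "AZ"},
      {"display_name": "Bass", "party": "D", "state": "CA"}
    ],
    "No": [
      {"display_name": "Example (ZZ)", "party": "R", "state": "ZZ"}
    ]
  }
}
"""


def flatten_votes(vote_record):
    """Return one row per voter, across every vote option in the record."""
    bill = vote_record["bill"]
    bill_label = "{} {}".format(bill["type"], bill["number"])
    rows = []
    for option, voters in vote_record["votes"].items():
        for voter in voters:
            rows.append({
                "name": voter["display_name"],
                "party": voter["party"],
                "state": voter["state"],
                "vote": option.lower(),
                "bill": bill_label,
            })
    return rows


vote_rows = flatten_votes(json.loads(SAMPLE_JSON))
for row in vote_rows:
    print(row)
```

Running the same function over one file per roll call, and concatenating the results, gives a single long table of votes.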

### Exercise 2.) Scraping Congressional Tweets

For this segment, we will use the **`tweepy`** library to demonstrate how to use the Twitter API to download the Twitter timelines of our Senators.

Note that you will first need to create a developer account here:
https://dev.twitter.com/

In [2]:
import tweepy

# Fill this in with your own information
ckey    = "YOUR_CONSUMER_KEY"     # consumer key
csecret = "YOUR_CONSUMER_SECRET"  # consumer secret
atoken  = "YOUR_ACCESS_TOKEN"     # access token
asecret = "YOUR_ACCESS_SECRET"    # access secret
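
Credentials typed into a notebook are easy to leak when the notebook is shared. One alternative is to read them from environment variables. The following is a sketch in Python 3; the variable names and the helper `load_twitter_credentials` are hypothetical, and a plain dictionary stands in for `os.environ` in the demo.

```python
import os

# Hypothetical variable names; any consistent naming scheme works.
REQUIRED = (
    "TWITTER_CONSUMER_KEY",
    "TWITTER_CONSUMER_SECRET",
    "TWITTER_ACCESS_TOKEN",
    "TWITTER_ACCESS_SECRET",
)


def load_twitter_credentials(env=None):
    """Return (ckey, csecret, atoken, asecret) or raise if any is missing."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return tuple(env[name] for name in REQUIRED)


# Demo: a dictionary of placeholder values stands in for the real environment.
demo_env = {name: "placeholder-" + str(i) for i, name in enumerate(REQUIRED)}
ckey, csecret, atoken, asecret = load_twitter_credentials(demo_env)
print(ckey, atoken)
```

Calling `load_twitter_credentials()` with no argument reads the real environment, so the secrets never appear in the notebook itself.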
    

In [3]:
# Instantiate OAuthHandler and initialize it with your credentials
# (i.e. secure "handshake" with the API)
auth = tweepy.OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
 
# Instantiate the tweepy API object.
api = tweepy.API(auth)

**Now let's grab some tweets!**

In [6]:
# Example
example_user = "realDonaldTrump"
tweeter = api.get_user(example_user)

print "======================================="
print "@" + example_user
print "\nTweeter follower count: " + str(tweeter.followers_count)
print "Tweeter description: " + tweeter.description
print "======================================="

## Using the api object, we access the user timeline and nab 10 tweets
tweets = api.user_timeline(id=tweeter.id,count=10)
for tweet in tweets:
    print "Tweet: " + tweet.text
    print "\n"

@realDonaldTrump

Tweeter follower count: 26615381
Tweeter description: 45th President of the United States of America
Tweet: "The President Changed. So Has Small Businesses' Confidence"
https://t.co/daTGjPmYeJ


Tweet: North Korea is behaving very badly. They have been "playing" the United States for years. China has done little to help!


Tweet: RT @foxandfriends: FOX NEWS ALERT: Jihadis using religious visa to enter US, experts warn (via @FoxFriendsFirst) https://t.co/pwXeR9OMQC


Tweet: RT @foxandfriends: VIDEO: Rep. Scalise — GOP agrees on over 85 percent of health care bill https://t.co/05dtfjAUbx


Tweet: RT @AmericaFirstPol: MAJOR IMPACT: @POTUS Trump is 50 Days in and moving swiftly to get America back on the right track. #MAGA 
https://t.c…


Tweet: RT @FoxNews: Jobs created in February. https://t.co/sOaMDxxTA8


Tweet: Happy Lá Fheile Phadraig to all of my great Irish friends!


Tweet: RT @DiamondandSilk: When the President says "You're Fired"     That means: "Pack Yo Stuff 

**Now let's grab the 10 most recent tweets from each of our Senators (limited to the first 15 Senators for demo purposes)**


In [None]:
cols = ["name", "twitterid", "tweet", "hour", "day", "month", "year"]
df3 = pd.DataFrame([], columns=cols)

# Iterate over the first 15 Senators we previously parsed
for i in range(min(15, len(df.index))):
    name = df.loc[i, "name"]
    twitterid = df.loc[i, "twitterid"]
    print twitterid
    
    tweeter = api.get_user(twitterid)

    # Using the api object, we access the user timeline and nab 10 tweets
    tweets = api.user_timeline(id=tweeter.id,count=10)
    for tweet in tweets:
        text = tweet.text
        date = tweet.created_at
        new_row_df = pd.DataFrame([[name, twitterid, text, date.hour, date.day, date.month, date.year ]], columns=cols)
        df3 = df3.append(new_row_df, ignore_index=True)

display(df3)

**In theory, you can now extend this dataframe out for all senators and all tweets!**
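
Extending the loop to every Senator means more requests and more ways for a single request to fail: a member with no Twitter handle, a renamed account, or a rate limit. The sketch below, in Python 3, shows one way to keep going when a handle fails and to pause between requests. `collect_timelines` is a hypothetical helper, `fetch_timeline` stands for whatever function downloads one timeline (for example a wrapper around `api.user_timeline`), and treating the string "null" as a missing handle is an assumption based on how empty fields appear in the XML output earlier.

```python
import time


def collect_timelines(handles, fetch_timeline, pause_seconds=1.0):
    """Call fetch_timeline(handle) for each handle, skipping ones that fail.

    Returns (collected, failed): a dict of handle -> tweets, and a list of
    handles that were skipped or raised an error.
    """
    collected, failed = {}, []
    for handle in handles:
        # Assumption: members without an account show up as empty or "null"
        if not handle or handle == "null":
            failed.append(handle)
            continue
        try:
            collected[handle] = fetch_timeline(handle)
        except Exception as error:
            print("skipping {}: {}".format(handle, error))
            failed.append(handle)
        # Pause between requests to stay well under the API's rate limits
        time.sleep(pause_seconds)
    return collected, failed


# Demo with a stand-in fetch function instead of the real Twitter API.
def fake_fetch(handle):
    if handle == "BrokenHandle":
        raise ValueError("user not found")
    return [handle + " tweet"]


collected, failed = collect_timelines(
    ["SenAlexander", "null", "BrokenHandle", "SenWarren"],
    fake_fetch,
    pause_seconds=0,
)
print(sorted(collected), failed)
```

Keeping the list of failed handles makes it easy to retry them later or fix them by hand.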

### Exercise 3.) Extracting text from bill PDFs

Finally, to round out our dataset, we would like more comprehensive data on our bills.

In theory, we have now built a *HUGE* database of Congress members, roll call votes on bills of interest, as well as Congressional rhetoric from Twitter.

Now - for bills of our interest - we want to store the **entire bill text** as well as general metadata on each bill using the tool **`PDFMiner`**.

In [30]:
import pdfminer

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO


To use PDFMiner, we could write our own routine tailored to the *exact* structure of the PDF files we are parsing.

For the purposes of this workshop, we instead borrow the function **`convert_pdf_to_txt`**, which can be found through a quick Google search for `convert_pdf_to_txt`:

In [29]:
######################################################################################
# Implemented previously here:
# https://github.com/kshvmdn/find-me-a-job/blob/master/helpers/convert_pdf_to_txt.py
######################################################################################

def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text

# Example bill to parse 
url = "https://www.gpo.gov/fdsys/pkg/BILLS-111hr645ih/pdf/BILLS-111hr645ih.pdf"
response = urllib2.urlopen(url)

# Write the PDF data to disk 
with open('/tmp/my_bill.pdf', 'wb') as f:
    f.write(response.read())
    print "ok"
    
text = convert_pdf_to_txt('/tmp/my_bill.pdf') 


ok
Congress #: 111TH
Session: 1ST
Bill: 645




To direct the Secretary of Homeland Security to establish national emergency 

centers on military installations. 

IN  THE  HOUSE  OF  REPRESENTATIVES 

JANUARY 22, 2009 

Mr.  HASTINGS of  Florida  introduced  the  following  bill;  which  was  referred  to 

the  Committee  on  Transportation  and  Infrastructure,  and  in  addition  to 

the  Committee  on  Armed  Services,  for  a  period  to  be  subsequently  deter-

mined  by  the  Speaker,  in  each  case  for  consideration  of  such  provisions 

as fall within the jurisdiction of the committee concerned 

A  BILL 

To direct the Secretary of Homeland Security to establish 

national emergency centers on military installations. 

Be it enacted by the Senate and House of Representa-

tives of the United States of America in Congress assembled, 

SECTION 1. SHORT TITLE. 

This  Act  may  be  cited  as  the  ‘‘National  Emergency 

Centers Establishment Act’’. 

SEC.  2.  ESTABLIS

**Now to clean up our PDF text ...**

In [None]:

cleaned_text = []
for i, line in enumerate(text.split("\n")):
    if len(line.strip()) == 0:
        # blank line
        continue
    if len(line.strip()) == 1 or len(line.strip()) == 2:
        # margin line numberings
        continue
    elif i == 2:
        cong_num = line.split(" ")[0]
        print "Congress #: " + line.split(" ")[0]
    elif i == 4:
        session = line.split(" ")[0]
        billno  = line.split(" ")[-2]
        print "Session: " + line.split(" ")[0]
        print "Bill: " + line.split(" ")[-2]
    else:
        cleaned_text.append(line)
    
print "\n\n\n"
print "\n\n".join(cleaned_text)
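
The extracted text also contains words split across lines, such as "deter-" / "mined" and "Representa-" / "tives" in the output above. The following sketch, in Python 3, extends the cleaning step to rejoin them. `clean_bill_text` is a hypothetical helper, the sample lines are copied from the extracted text except for the lone "1", which is added here to stand for a margin line number, and the rejoining rule is a heuristic that will also merge genuine hyphenated compounds that happen to break at a line end.

```python
import re


def clean_bill_text(raw_text):
    """Drop blank and margin-number lines, then rejoin hyphenated words."""
    lines = []
    for line in raw_text.split("\n"):
        stripped = line.strip()
        if len(stripped) <= 2:
            # Blank lines and 1-2 character margin line numbers
            continue
        # Collapse the runs of spaces left over from PDF layout
        lines.append(re.sub(r"\s+", " ", stripped))
    text = "\n".join(lines)
    # Heuristic: a hyphen at the end of a line marks a split word
    return re.sub(r"(\w)-\n(\w)", r"\1\2", text)


SAMPLE = (
    "the  Committee  on  Armed  Services,  for  a  period  to  be  subsequently  deter-\n"
    "\n"
    "mined  by  the  Speaker,  in  each  case  for  consideration  of  such  provisions \n"
    "\n"
    "1 \n"
    "Be it enacted by the Senate and House of Representa-\n"
    "\n"
    "tives of the United States of America in Congress assembled, \n"
)

cleaned = clean_bill_text(SAMPLE)
print(cleaned)
```

Wrapping the cleanup in a function also makes it easy to apply the same steps to every bill PDF in a loop.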

### Conclusion

We have now covered how to work with...

1. HTML data (a single bill page / wikipedia page)
2. XML data  (a large collection of Senator entries)
3. JSON data (semi-structured data on roll call votes)
4. Twitter data (scraping tweets from matching handles)
5. PDF data (mining text from PDFs)



#### You now have a fundamental toolkit to go scrape any and all corners of the World Wide Web!