# Web Scraping Demo

### Congressional Legislation, Rhetoric, and Calendar Data


**Goals:**
* I want a list of all current Congress members.
* I want to match them to their Wikipedia pages.
* I want a database of Congressional Twitter rhetoric.
* I want a database of all Congressional bills.

In this demo we are going to use a suite of standard Python packages to demonstrate how to scrape and collect the aforementioned Congressional data.

Our primary source is going to be **GovTrack.us** (https://www.govtrack.us).


We will specifically be using:

* **`bs4`** - BeautifulSoup4, a package for parsing HTML and XML.
* **`json`** - A built-in Python module for reading and writing JSON, which maps naturally onto dictionaries and lists.
* **`tweepy`** - A Python client for the Twitter API, used here to fetch tweets and timelines.
* **`PDFMiner`** - A package that extracts text from PDFs.


### 1.) Parsing a Single Bill / Congressperson

In [389]:
import bs4
import urllib2

# An example Congressional bill 
url = "https://www.govtrack.us/congress/bills/115/sres80"

# Make an HTTP request to retrieve the HTML page
response = urllib2.urlopen(url)

print response.info()

Server: nginx/1.4.6 (Ubuntu)
Date: Fri, 17 Mar 2017 03:24:24 GMT
Content-Type: text/html; charset=utf-8
Transfer-Encoding: chunked
Connection: close
Vary: Accept-Encoding
Cache-Control: public
X-Konklone-Force-HTTPS: TRUE
Strict-Transport-Security: max-age=31536000; includeSubDomains; preload



In [None]:
# The 'soup' contains all the parsed data we need
# (`response` is the HTTP response opened in the previous cell)
soup = bs4.BeautifulSoup(response.read(), "html")
print soup.prettify()[:1000]

In [None]:

# Soup attributes correspond to HTML tags,
# SEE https://www.w3schools.com/tags/ for reference.
print soup.title
print soup.title.string
print soup.p


## To find all tags called x, simply use soup.find_all(x)

# soup.find_all('a')

## the best way to find ANY tag of interest in your html is by id or class ... find_all can do this!
## .. will return a list of candidates
overview_tag = soup.find_all(id="bill-overview-panel")[0]

## Once you've 'narrowed' down the section of your html, you want to 'crawl'
## through to collect the data you want (go back to chrome inspector and show.)

## We've got our overview data, so let's clean this up now!
## the `text` attribute of a tag is REALLY important ... once you've found a tag of interest
## and you know the text at the very bottom is important, just call text and it will go all the way down

detail_name_tags = overview_tag.find_all("dt")
detail_data_tags = overview_tag.find_all("dd")

print detail_name_tags[0].text
print detail_data_tags[0].text

print "\n"

print detail_name_tags[1].text
print detail_data_tags[1].text.strip().split("\n")[0]

print "\n"

print detail_name_tags[2].text
print detail_data_tags[2].text.strip()

print "\n"


## In short, that's how you do HTML scraping!!
## .. But hmm, that's not really scalable and it's pretty tedious.
## .. That's why we prefer to use APIs (things actually provided for us) or
##    exposed XML, JSON data and even *tabular* HTML rather than crawling free-form HTML by hand.
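Indexing `detail_name_tags[0]`, `[1]`, `[2]` one at a time gets repetitive. A minimal sketch of pairing every `dt` label with its `dd` value in a single pass is shown here. The HTML snippet and its field names are made up for illustration, and the real GovTrack markup may well differ.

```python
import bs4

# Hypothetical stand-in for the overview panel; not copied from the real page
html = """
<div id="bill-overview-panel">
  <dl>
    <dt>Sponsor</dt><dd> Jane Doe </dd>
    <dt>Status</dt><dd> Introduced </dd>
  </dl>
</div>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
panel = soup.find(id="bill-overview-panel")

# zip() pairs the n-th <dt> with the n-th <dd>; get_text(strip=True) trims whitespace
details = {
    dt.get_text(strip=True): dd.get_text(strip=True)
    for dt, dd in zip(panel.find_all("dt"), panel.find_all("dd"))
}
print(details)
```

This only works when labels and values come in matching order and number, so it is worth checking that the two `find_all` lists have the same length on a real page.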




In [161]:
#### Wikipedia example
import bs4
import urllib2

d = urllib2.urlopen("https://en.wikipedia.org/wiki/Chris_Coons")

soup = bs4.BeautifulSoup(d.read(), "html")

info_table = soup.find_all(class_="infobox vcard")
# print info_table[0].prettify()

tbl_rows = info_table[0].find_all("tr")
for row in tbl_rows:
    cells =  row.find_all("td") # table cells
    header = row.find("th")
    if header is not None:
        print "\nTitle: " + header.text
    if len(cells) != 0:
        for cell in cells:
            if len(cell.text.strip()) != 0:
                print "Desc: " + cell.text.strip()
    
    


Title: Chris Coons

Title: United States Senator
from Delaware
Desc: Incumbent
Desc: Assumed office
November 15, 2010
Serving with Tom Carper

Title: Preceded by
Desc: Ted Kaufman

Title: Vice Chair of the Senate Ethics Committee
Desc: Incumbent
Desc: Assumed office
January 3, 2017

Title: Preceded by
Desc: Barbara Boxer

Title: County Executive of New Castle County
Desc: In office
January 4, 2005 – November 15, 2010

Title: Preceded by
Desc: Thomas Gordon

Title: Succeeded by
Desc: Paul Clark

Title: President of the New Castle County Council
Desc: In office
January 2, 2001 – January 4, 2005

Title: Preceded by
Desc: Stephanie Hansen

Title: Succeeded by
Desc: Paul Clark

Title: Personal details

Title: Born
Desc: Christopher Andrew Coons
(1963-09-09) September 9, 1963 (age 53)
Greenwich, Connecticut, U.S.

Title: Political party
Desc: Republican (before 1988)
Democratic (1988–present)

Title: Spouse(s)
Desc: Annie Lingenfelter

Title: Children
Desc: 3

Title: Education
Desc: Amherst
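Printing titles and descriptions is fine for exploring, but for storage it helps to keep each label next to its value. The sketch below collects `th`/`td` pairs into a dictionary and skips header-only rows such as "Personal details". The table is a small hand-written stand-in that reuses two values from the output above; Wikipedia's real infobox markup is more complicated.

```python
import bs4

# Hypothetical, simplified infobox; the real table has many more rows and nested tags
html = """
<table class="infobox vcard">
  <tr><th>Personal details</th></tr>
  <tr><th>Born</th><td> September 9, 1963 </td></tr>
  <tr><th>Children</th><td>3</td></tr>
</table>
"""

soup = bs4.BeautifulSoup(html, "html.parser")
table = soup.find("table", class_="infobox")

info = {}
for row in table.find_all("tr"):
    header, cell = row.find("th"), row.find("td")
    # Keep only rows that have both a label and a value
    if header is not None and cell is not None:
        info[header.get_text(strip=True)] = cell.get_text(" ", strip=True)

print(info)
```

Repeated labels such as "Preceded by" would overwrite each other in a plain dictionary, so a list of pairs may be a better fit for the full page.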

In [250]:
# Part 1. (b.)

## Now we want to download some data on senators (from an XML file)

## TODO: for this one, collapse it all the way to display the structure
##       and then uncollapse one level at a time.

url = "https://www.govtrack.us/api/v2/role?current=true&format=xml"

## Similar drill as before
d = urllib2.urlopen(url)
print "Reading in ..."
soup = bs4.BeautifulSoup(d.read(), "lxml")

## Much more structured than the HTML ... we can directly just look through source and see what we need
print "OK"

Reading in ...
Total count: 100


In [262]:
## In our strategy, we look for immediate children instead of looking for a specific field
mbrs = soup.find("objects").findChildren("item", recursive=False)
# print "TOTAL: " + str(len(mbrs))

## Instead of going STRAIGHT to the text field, perhaps we want to explore ...
## .. how can we see what kind of children are present in the level below ours? (put this in separate cell)
# children = mbrs[0].findChildren()
# for child in children:
#     print child.name+": "+child.text
    
    
    
# children = mbrs[0].findChildren()
# for child in children:
#     if child.name in ["birthday", "party", "twitterid", "person"]:
#         print child.name+": "+child.text
        
    
## Note that `person` prints as one concatenated string: it is a nested element, not a text field, so we read its children individually

children = mbrs[0].findChildren()
for child in children:
    if child.name in ["birthday", "party", "twitterid"]:
        print child.name+": "+child.text
    if child.name == "person":
        print "name: " +child.find("name").text
        print "gender: "+child.find("gender").text
        print "cspanid: "+child.find("cspanid").text

party: Republican
name: Sen. Lamar Alexander [R-TN]
gender: male
cspanid: 5
birthday: 1940-07-03
twitterid: SenAlexander
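The same "find the items, then read their fields" pattern can also be written with the standard library's `xml.etree.ElementTree`. This is only a sketch: the sample document is hand-written, the root element name `response` is an assumption, and the second member is invented to show what happens when a field is missing.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; element names mirror the lowercase names used in the cell above
xml_text = """
<response>
  <objects>
    <item>
      <party>Republican</party>
      <person>
        <name>Sen. Lamar Alexander [R-TN]</name>
        <gender>male</gender>
        <cspanid>5</cspanid>
      </person>
      <twitterid>SenAlexander</twitterid>
    </item>
    <item>
      <party>Independent</party>
      <person>
        <name>Sen. Example Person [I-XX]</name>
        <gender>female</gender>
        <cspanid>0</cspanid>
      </person>
    </item>
  </objects>
</response>
""".strip()

root = ET.fromstring(xml_text)


def member_record(item):
    """Pull a few fields out of one <item>; missing elements come back as None."""
    return {
        "name": item.findtext("person/name"),
        "party": item.findtext("party"),
        "twitterid": item.findtext("twitterid"),
    }


members = [member_record(item) for item in root.findall("objects/item")]
for member in members:
    print(member)
```

Because `findtext` returns `None` for an absent element, a member without a Twitter handle does not raise an error and can be filtered out later.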


In [340]:
### We can now loop over and do this for every single person!

# for mbr in mbrs:
#     children = mbr.findChildren()
#     for child in children:
#         if child.name in ["person"]:
#             pass
#             print "name: " +child.find("name").text

### Why don't we turn this into a nice dataframe and save it?
import pandas as pd 

columns=["name", "birthday", "party", "twitterid", "cspanid"]
df = pd.DataFrame([], columns=columns)


for mbr in mbrs:
    # Reset every field so a value from the previous member cannot leak into this row
    name = birthday = party = twitterid = gender = cspanid = None
    children = mbr.findChildren()
    for child in children:
        if child.name == "birthday":
            birthday = child.text
        if child.name == "party":
            party = child.text
        if child.name == "twitterid":
            twitterid = child.text
        if child.name == "person":
            name = child.find("name").text
            gender = child.find("gender").text
            cspanid = child.find("cspanid").text
    if name is not None:
        new_row_df = pd.DataFrame([[name, birthday, party, twitterid, cspanid]], columns=columns)
        df = df.append(new_row_df, ignore_index=True)

from IPython.display import display
# display() renders the frame itself and returns None, so there is nothing to print
display(df)

# df.to_csv("my_data.csv")


### and that's the story

## There are endpoints for bill data, sessions, etc.

## Try extending this on your own!


Unnamed: 0,name,birthday,party,twitterid,cspanid
0,Sen. Lamar Alexander [R-TN],1940-07-03,Republican,SenAlexander,5
1,Sen. Roger Wicker [R-MS],1951-07-05,Republican,SenatorWicker,18203
2,Sen. Timothy Kaine [D-VA],1958-02-26,Democrat,SenKaineOffice,49219
3,Sen. Ted Cruz [R-TX],1970-12-22,Republican,SenTedCruz,1019953
4,Sen. Deb Fischer [R-NE],1951-03-01,Republican,SenatorFischer,1034067
5,Sen. Heidi Heitkamp [D-ND],1955-10-30,Democrat,SenatorHeitkamp,95414
6,Sen. Angus King [I-ME],1944-03-31,Independent,SenAngusKing,37413
7,Sen. Elizabeth Warren [D-MA],1949-06-22,Democrat,SenWarren,1023023
8,Sen. Joe Manchin [D-WV],1947-08-24,Democrat,Sen_JoeManchin,62864
9,Sen. Martin Heinrich [D-NM],1971-10-17,Democrat,MartinHeinrich,1030686
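Appending a one-row frame on every pass copies the whole frame each time, and `DataFrame.append` was removed in pandas 2.0. One common alternative, sketched here with made-up records (only the first mirrors a row from the table above), is to collect plain dictionaries and build the frame once at the end.

```python
import pandas as pd

columns = ["name", "birthday", "party", "twitterid", "cspanid"]

# Hypothetical parsed records; in the notebook these would come from the XML loop
records = [
    {"name": "Sen. Lamar Alexander [R-TN]", "birthday": "1940-07-03",
     "party": "Republican", "twitterid": "SenAlexander", "cspanid": "5"},
    {"name": "Sen. Example Person [I-XX]", "birthday": "1950-01-01",
     "party": "Independent", "twitterid": None, "cspanid": "0"},
]

# Build the frame in one step instead of appending row by row
members_df = pd.DataFrame(records, columns=columns)

# With no path, to_csv returns the CSV text; index=False leaves out the row index
csv_text = members_df.to_csv(index=False)
print(csv_text)
```

Writing with `index=False` also avoids the extra `Unnamed: 0` column that appears when a CSV saved with its index is read back in.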


In [341]:
## What about JSON data? How do we manipulate that?

## Let's look at a House roll call vote from 2014 (vote h108, 113th Congress)

url ="https://www.govtrack.us/data/congress/113/votes/2014/h108/data.json"

import json

response = urllib2.urlopen(url)
dict_obj = json.loads(response.read())

# print json.dumps(dict_obj, indent=4)

## JSON is simpler to index into, but it can be deeply nested and irregular, which makes it harder to store as a flat table

print dict_obj.keys()

print dict_obj["amendment"]
print dict_obj["bill"]


cols = ["name", "party", "state", "vote", "bill", "ammendment", "year"]
df2 = pd.DataFrame([], columns=cols)

for aye in dict_obj["votes"]["Aye"]:
    name = aye["display_name"]
    party = aye["party"]
    state = aye["state"]
    vote = "aye"
    bill = "H.R. 108"  # demo label taken from the vote number (h108); dict_obj["bill"] reports H.R. 2641
    year = "2014"
    ammendment = "1"
    new_row_df = pd.DataFrame([[name, party, state, vote, bill, ammendment, year]],  columns=cols)
    df2 = df2.append( new_row_df, ignore_index=True )

display(df2)

## You could imagine extending out this dataframe for all bills and all amendments
## TODO: write some pseudo code around that

[u'amendment', u'category', u'votes', u'congress', u'result_text', u'type', u'bill', u'question', u'number', u'source_url', u'updated_at', u'chamber', u'session', u'result', u'date', u'vote_id', u'requires']
{u'type': u'h-bill', u'number': 1, u'author': u'Jackson Lee of Texas Part C Amendment No. 1'}
{u'type': u'hr', u'number': 2641, u'congress': 113}


Unnamed: 0,name,party,state,vote,bill,ammendment,year
0,Barber,D,AZ,aye,H.R. 108,1,2014
1,Bass,D,CA,aye,H.R. 108,1,2014
2,Beatty,D,OH,aye,H.R. 108,1,2014
3,Becerra,D,CA,aye,H.R. 108,1,2014
4,Bera (CA),D,CA,aye,H.R. 108,1,2014
5,Bishop (NY),D,NY,aye,H.R. 108,1,2014
6,Blumenauer,D,OR,aye,H.R. 108,1,2014
7,Bonamici,D,OR,aye,H.R. 108,1,2014
8,Brady (PA),D,PA,aye,H.R. 108,1,2014
9,Braley (IA),D,IA,aye,H.R. 108,1,2014
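The loop above only walks the `Aye` list and hard-codes the bill label. A rough sketch of generalizing it is to iterate over every key under `votes` and read the bill from the JSON itself. The sample is a heavily trimmed, partly invented stand-in for `data.json`; the `No` key and its voter are assumptions, and the sketch assumes every entry under `votes` is a list of dictionaries with the same three fields as the `Aye` entries.

```python
import json

# Hypothetical, trimmed stand-in for data.json; only the fields used here are included
sample = json.loads("""
{
  "bill": {"type": "hr", "number": 2641, "congress": 113},
  "votes": {
    "Aye": [{"display_name": "Barber", "party": "D", "state": "AZ"}],
    "No":  [{"display_name": "Example", "party": "R", "state": "XX"}]
  }
}
""")


def vote_rows(vote):
    """Flatten one roll call vote into a list of row dicts, one per voter."""
    bill = vote["bill"]
    bill_label = "{0} {1}".format(bill["type"], bill["number"])
    rows = []
    for position, voters in vote["votes"].items():
        for voter in voters:
            rows.append({
                "name": voter["display_name"],
                "party": voter["party"],
                "state": voter["state"],
                "vote": position,
                "bill": bill_label,
            })
    return rows


rows = vote_rows(sample)
for row in rows:
    print(row)
```

Calling `vote_rows` on each downloaded vote file and concatenating the resulting lists would give one flat table covering many votes.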


In [325]:
## Part 2. Twitter API 

## Now that we have twitter IDs, let's create our database of tweets

import os
import tweepy

# Never hard-code API credentials in a notebook. Here they are read from
# environment variables (the variable names are only a suggested convention).
ckey    = os.environ["TWITTER_CONSUMER_KEY"]     # consumer key
csecret = os.environ["TWITTER_CONSUMER_SECRET"]  # consumer secret
atoken  = os.environ["TWITTER_ACCESS_TOKEN"]     # access token
asecret = os.environ["TWITTER_ACCESS_SECRET"]    # access secret

# instantiate OAuthHandler and initialize it with your credentials
auth = tweepy.OAuthHandler(ckey,csecret)
auth.set_access_token(atoken,asecret)

# instantiate the tweepy object, invoke the constructor with the just-created
# OAuthHandler object
api = tweepy.API(auth)


# Now let's get some tweets!!

## We grab a 'tweeter' object
tweeter = api.get_user("BarackObama")

print "\nTweeter follower count: " + str(tweeter.followers_count)
print "Tweeter description: " + tweeter.description

print "\nHere come the tweets..."

## Using the api object, we access the user timeline and nab 10 tweets
tweets = api.user_timeline(id=tweeter.id,count=10)
for tweet in tweets:
    print "Tweet: " + tweet.text
    


Tweeter follower count: 85858874
Tweeter description: Dad, husband, President, citizen.

Here come the tweets...
Tweet: On International Women’s Day, @MichelleObama and I are inspired by all of you who embrace your power to drive chang… https://t.co/Er9mIQlmgr
Tweet: RT @ObamaFoundation: Courage comes in many forms. Who in your local community or neighborhood leads by example? #ProfileInCourage https://t…
Tweet: Humbled to be recognized by a family with a legacy of service. Who's your #ProfileInCourage? Tell me about them:… https://t.co/25Ohhab8Xn
Tweet: We asked. You answered. https://t.co/mAJvko6VqR
Tweet: Happy Valentine’s Day, @michelleobama! Almost 28 years with you, but it always feels new. https://t.co/O0UhJWoqGN
Tweet: I read letters like these every single day. It was one of the best parts of the job – hearing from you. https://t.co/so1luBcszV
Tweet: RT @ObamaFoundation: Add your voice: https://t.co/mA9MSHmi7o https://t.co/Uf7oEvkZF3
Tweet: In the meantime, I want to hear wha

In [346]:
## Now let's grab the 10 most recent tweets from each of our Senators (only the first 15 Senators, for brevity)

cols = ["name", "twitterid", "tweet", "hour", "day", "month", "year"]
df3 = pd.DataFrame([], columns=cols)

for i in range(min(15, len(df.index))):
    # .loc replaces the deprecated .ix indexer; df has a default 0..n-1 index
    name = df.loc[i, "name"]
    twitterid = df.loc[i, "twitterid"]
    print twitterid
    
    tweeter = api.get_user(twitterid)

    ## Using the api object, we access the user timeline and nab 10 tweets
    tweets = api.user_timeline(id=tweeter.id,count=10)
    for tweet in tweets:
        text = tweet.text
        date = tweet.created_at
        new_row_df = pd.DataFrame([[name, twitterid, text, date.hour, date.day, date.month, date.year ]], columns=cols)
        df3 = df3.append(new_row_df)

display(df3)

# df3.to_csv("my_tweets.csv")


## You could imagine extending out this dataframe for all senators and all tweets!

        
        

SenAlexander
SenatorWicker
SenKaineOffice
SenTedCruz
SenatorFischer
SenatorHeitkamp
SenAngusKing
SenWarren
Sen_JoeManchin
MartinHeinrich
SenJohnBarrasso
SenBobCorker
SenWhitehouse
SenBobCasey
SenatorTester


Unnamed: 0,name,twitterid,tweet,hour,day,month,year
0,Sen. Lamar Alexander [R-TN],SenAlexander,Cheering for @VandyMBB in their game against N...,20.0,16.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,Good luck to @MT_MBB as they take on Minnesota...,20.0,16.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,Rooting for an @ETSU_MBB win in their game aga...,19.0,16.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,".@SenAlexander, @SenBobCorker joined President...",15.0,16.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,READ: Sen. Alexander’s op-ed in the @Tennessea...,21.0,15.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,.@SenAlexander: Republicans’ top priority in C...,20.0,15.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,"On this day in 1982, Alexander, then TN gov., ...",19.0,15.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,RT @memphisnews: Op-ed from @SenAlexander: Tru...,18.0,15.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,"RT @SenBobCorker: We just landed in Detroit, w...",17.0,15.0,3.0,2017.0
0,Sen. Lamar Alexander [R-TN],SenAlexander,"“230,000 Tennesseans …will likely have zero he...",16.0,15.0,3.0,2017.0
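The loop above mixes API calls with row building, which makes it hard to try out without credentials. One possible restructuring, sketched below, passes the `api` object in as an argument so that a stub can stand in for it. `collect_tweets`, `FakeTweet`, and `FakeAPI` are names invented for this sketch, the tweet texts are placeholders, and using `screen_name=` rather than `id=` is a choice made here, not something the cell above does.

```python
from collections import namedtuple
from datetime import datetime


def collect_tweets(api, accounts, count=10):
    """Return one row per tweet for each (name, twitterid) pair.

    `api` is any object with a tweepy-style user_timeline(screen_name=, count=) method.
    Members without a Twitter handle are skipped.
    """
    rows = []
    for name, twitterid in accounts:
        if not twitterid:
            continue
        for tweet in api.user_timeline(screen_name=twitterid, count=count):
            when = tweet.created_at
            rows.append([name, twitterid, tweet.text,
                         when.hour, when.day, when.month, when.year])
    return rows


# Hypothetical stand-ins for tweepy objects so the sketch runs offline
FakeTweet = namedtuple("FakeTweet", ["text", "created_at"])


class FakeAPI(object):
    def user_timeline(self, screen_name, count):
        tweets = [
            FakeTweet("placeholder tweet %d from %s" % (i, screen_name),
                      datetime(2017, 3, 16, 20 - i))
            for i in range(3)
        ]
        return tweets[:count]


accounts = [("Sen. Lamar Alexander [R-TN]", "SenAlexander"), ("Sen. No Handle", "")]
rows = collect_tweets(FakeAPI(), accounts, count=2)
for row in rows:
    print(row)
```

With a real `tweepy.API` instance in place of `FakeAPI()`, the same function would return rows in the column order used for `df3`.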


In [388]:
## Part 3. bill PDF


## Okay! By building on these examples we could assemble a large dataset: Wikipedia info, roll call votes, member details, and bills.

## ... remember the "bill text" we saw in the first example? How would we go about parsing that?

import pdfminer

from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage
from cStringIO import StringIO


def convert_pdf_to_txt(path):
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    codec = 'utf-8'
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, codec=codec, laparams=laparams)
    fp = open(path, 'rb')
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    for page in PDFPage.get_pages(fp, pagenos, maxpages=maxpages, password=password,caching=caching, check_extractable=True):
        interpreter.process_page(page)

    text = retstr.getvalue()

    fp.close()
    device.close()
    retstr.close()
    return text


url = "https://www.gpo.gov/fdsys/pkg/BILLS-111hr645ih/pdf/BILLS-111hr645ih.pdf"

response = urllib2.urlopen(url)

with open('/tmp/my_bill.pdf', 'wb') as f:
    f.write(response.read())
    print "ok"
    
    
text = convert_pdf_to_txt('/tmp/my_bill.pdf')

# Clean this line by line:
cleaned_text = []
for i, line in enumerate(text.split("\n")):
    if len(line.strip()) == 0:
        # blank line
        continue
    if len(line.strip()) == 1 or len(line.strip()) == 2:
        # margin line numberings
        continue
    elif i == 2:
        cong_num = line.split(" ")[0]
        print "Congress #: " + line.split(" ")[0]
    elif i == 4:
        session = line.split(" ")[0]
        billno  = line.split(" ")[-2]
        print "Session: " + line.split(" ")[0]
        print "Bill: " + line.split(" ")[-2]
    else:
        cleaned_text.append(line)
    
print "\n\n\n"
print "\n\n".join(cleaned_text)
        


ok
Congress #: 111TH
Session: 1ST
Bill: 645




To direct the Secretary of Homeland Security to establish national emergency 

centers on military installations. 

IN  THE  HOUSE  OF  REPRESENTATIVES 

JANUARY 22, 2009 

Mr.  HASTINGS of  Florida  introduced  the  following  bill;  which  was  referred  to 

the  Committee  on  Transportation  and  Infrastructure,  and  in  addition  to 

the  Committee  on  Armed  Services,  for  a  period  to  be  subsequently  deter-

mined  by  the  Speaker,  in  each  case  for  consideration  of  such  provisions 

as fall within the jurisdiction of the committee concerned 

A  BILL 

To direct the Secretary of Homeland Security to establish 

national emergency centers on military installations. 

Be it enacted by the Senate and House of Representa-

tives of the United States of America in Congress assembled, 

SECTION 1. SHORT TITLE. 

This  Act  may  be  cited  as  the  ‘‘National  Emergency 

Centers Establishment Act’’. 

SEC.  2.  ESTABLIS
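The extracted text still carries layout residue: doubled spaces between words and words hyphenated across line breaks ("deter-" / "mined"). A small post-processing sketch using only the standard library is shown below. `join_wrapped_lines` is a name made up for this example, and the rule of merging any line that ends in a hyphen is a simplification that would also merge genuine hyphenated compounds.

```python
import re


def join_wrapped_lines(lines):
    """Join lines into one string, collapsing repeated whitespace and
    merging words that were hyphenated across a line break."""
    text = ""
    for line in lines:
        line = re.sub(r"\s+", " ", line.strip())
        if not line:
            continue  # skip blank lines
        if text.endswith("-"):
            text = text[:-1] + line  # "deter-" + "mined" -> "determined"
        elif text:
            text += " " + line
        else:
            text = line
    return text


# Three lines taken from the extracted bill text above
sample_lines = [
    "the  Committee  on  Armed  Services,  for  a  period  to  be  subsequently  deter-",
    "",
    "mined  by  the  Speaker,",
]
joined = join_wrapped_lines(sample_lines)
print(joined)
```

Running this per paragraph rather than over the whole bill would keep section headings such as "SECTION 1. SHORT TITLE." from being glued onto the surrounding sentences.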

In [367]:


### We have covered how to scrape
1.) HTML data
2.) XML data
3.) JSON data
4.) Twitter data
5.) PDF data

### That is a solid basic toolkit to take out into the world!


