# Scrape Congress

This notebook demonstrates how to scrape http://www.tweetcongress.org for the contact information of all members of Congress.

### Part 1: Setup

I had to run `brew install geckodriver` first so that Selenium can drive Firefox.

These are the libraries you need:

In [104]:
import csv
import time

import requests
from bs4 import BeautifulSoup
from selenium import webdriver

# Instantiate a browser (opens a Firefox window driven by geckodriver)
browser = webdriver.Firefox()

### Part 2: Test One Page (Texas)

Let's test one page to see whether we can get at the relevant parts.

In [84]:
url = 'http://www.tweetcongress.org/tweeters/TX'
browser.set_script_timeout(20)
browser.get(url)

In [85]:
html_source = browser.page_source
soup = BeautifulSoup(html_source, "html5lib")



BeautifulSoup warns when no parser is named. To silence the warning, change `BeautifulSoup([your markup])` to `BeautifulSoup([your markup], "html5lib")`.


By using dev tools, we can inspect elements and find reasonable hooks that allow us to get the data we want:

In [86]:
guy = soup.find('div', class_="legislator-wrap")

In [87]:
guy.find(class_ = 'tweet-party').text

'R - TX'

In [31]:
guy.find('a')['title']

'John Cornyn'

In [35]:
guy.find(class_ = 'tw-legislator')['href']

'http://twitter.com/JohnCornyn'

In [37]:
guy.find(class_='legislator-phone').text

'202-224-2934'

In [38]:
guy.find(class_='legislator-email').text

'Sen.Cornyn@opencongress.org'

In [39]:
guy.find(class_='legislator-website').text

'https://www.cornyn.senate.gov'

### Part 3: Make a function to generalize the acquisition of information

In [110]:
def extract_profile_info(html):
    """
    Extract one legislator's contact details from a 'legislator-wrap' div.

    Returns a tuple: (name, party, twitter, email, phone, website).
    """
    party = html.find(class_='tweet-party').text
    name = html.find('a')['title']
    twitter = html.find(class_='tw-legislator')['href']
    email = html.find(class_='legislator-email').text
    phone = html.find(class_='legislator-phone').text
    website = html.find(class_='legislator-website').text
    return name, party, twitter, email, phone, website
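
`find` returns `None` when an element is missing, so `extract_profile_info` raises an exception as soon as one legislator lacks, say, an email element. A minimal sketch of a more tolerant variant follows. The helper names `text_or_blank`, `attr_or_blank`, and `extract_profile_info_safe`, as well as the sample markup, are hypothetical and only mirror the class names used above.

```python
from bs4 import BeautifulSoup


def text_or_blank(node, **kwargs):
    """Return the stripped text of the first match, or '' if none exists."""
    found = node.find(**kwargs)
    return found.get_text(strip=True) if found is not None else ''


def attr_or_blank(node, attr, *args, **kwargs):
    """Return one attribute of the first match, or '' if it is missing."""
    found = node.find(*args, **kwargs)
    return found.get(attr, '') if found is not None else ''


def extract_profile_info_safe(html):
    """Like extract_profile_info, but missing fields become ''."""
    return (
        attr_or_blank(html, 'title', 'a'),
        text_or_blank(html, class_='tweet-party'),
        attr_or_blank(html, 'href', class_='tw-legislator'),
        text_or_blank(html, class_='legislator-email'),
        text_or_blank(html, class_='legislator-phone'),
        text_or_blank(html, class_='legislator-website'),
    )


# Made-up markup: this legislator has no email or website element.
# html.parser ships with Python, so the sample needs no extra parser.
sample = BeautifulSoup(
    '<div class="legislator-wrap">'
    '<a title="Jane Doe" class="tw-legislator" href="http://twitter.com/janedoe"></a>'
    '<span class="tweet-party">D - TX</span>'
    '<span class="legislator-phone">202-555-0100</span>'
    '</div>',
    'html.parser',
)
profile = extract_profile_info_safe(sample.find('div', class_='legislator-wrap'))
print(profile)
```

Missing fields come back as empty strings, so the CSV row keeps its six columns even when a profile is incomplete.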

In [111]:
def extract_page_info(url='http://www.tweetcongress.org/tweeters/TX',
                      file='legislators.csv'):
    """
    Take a tweetcongress.org state URL and append one row per
    legislator on that page to a CSV file.
    """
    browser.get(url)
    # Wait 10 seconds to allow the page to load
    time.sleep(10)
    html_source = browser.page_source
    html_soup = BeautifulSoup(html_source, "html5lib")

    # Append mode: rows from earlier calls are kept
    with open(file, 'a') as csvfile:
        cwriter = csv.writer(csvfile, delimiter=',')
        for html in html_soup.find_all('div', class_="legislator-wrap"):
            cwriter.writerow(list(extract_profile_info(html)))
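
A fixed `time.sleep(10)` waits the full ten seconds even when the page is ready sooner, and it gives no signal when the page takes longer. Selenium provides `WebDriverWait` for this purpose. The sketch below is a dependency-free stand-in that shows the same idea by polling a condition until it holds or a deadline passes. The `wait_until` helper and the `FakeBrowser` class are hypothetical.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Poll condition() until it is truthy; return False on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        if condition():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


class FakeBrowser:
    """Stand-in for the Selenium browser: content shows up on the 3rd read."""

    def __init__(self):
        self.reads = 0

    @property
    def page_source(self):
        self.reads += 1
        return '<div class="legislator-wrap"></div>' if self.reads >= 3 else ''


fake = FakeBrowser()
loaded = wait_until(lambda: 'legislator-wrap' in fake.page_source,
                    timeout=2.0, interval=0.01)
print(loaded, fake.reads)
```

With the real browser, the condition could be `lambda: 'legislator-wrap' in browser.page_source`, checked in place of the fixed sleep.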

### Part 4: Get a list of valid states to scrape

I got the list of states that are available on their website. I could have just used a list of the 50 states, but I found this to be easier.

In [113]:
url = 'http://www.tweetcongress.org/tweeters'
browser.set_script_timeout(20)
browser.get(url)
html_source = browser.page_source
soup = BeautifulSoup(html_source, "html5lib")





### Part 5: Run this for all states

In [132]:
statehtml = soup.find_all('path')

In [134]:
# The second class on each <path> element is the state abbreviation
statelist = [html.attrs['class'][1] for html in statehtml]

In [146]:
statelist[-13:]

['PA', 'RI', 'SC', 'SD', 'TN', 'TX', 'UT', 'VT', 'VA', 'WA', 'WV', 'WI', 'WY']
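
Indexing `['class'][1]` assumes every `<path>` carries the state abbreviation as its second class. If the page ever contains other paths, a pattern check is a safer filter. A sketch, with a hypothetical `pick_state_code` helper and made-up class lists standing in for `html.attrs.get('class')`:

```python
import re

# A state abbreviation is exactly two uppercase letters.
STATE_CODE = re.compile(r'[A-Z]{2}')


def pick_state_code(classes):
    """Return the first class that looks like a two-letter state code."""
    for name in classes or []:
        if STATE_CODE.fullmatch(name):
            return name
    return None


# Each inner list plays the role of one path's html.attrs.get('class').
sample_classes = [['state', 'TX'], ['state', 'VT'], ['border'], None]
codes = [code for code in map(pick_state_code, sample_classes) if code]
print(codes)
```

Paths without a matching class are skipped instead of raising an `IndexError`.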

In [148]:
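# statelist[-12:] covers only the last 12 states (RI through WY);
# loop over statelist itself to scrape every state
for state in statelist[-12:]:
    extract_page_info('http://www.tweetcongress.org/tweeters/{st}'.format(st=state))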
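
Because `extract_page_info` opens the file in append mode, running the loop twice writes every legislator twice, and the file never gets a header row. One possible remedy is sketched below, assuming Python 3. The `append_unique_rows` function and the `FIELDS` column names are hypothetical and follow the tuple order returned by `extract_profile_info`.

```python
import csv
import os
import tempfile

FIELDS = ['name', 'party', 'twitter', 'email', 'phone', 'website']


def append_unique_rows(path, rows):
    """Append rows not already in the file; write a header if it is new."""
    seen = set()
    is_new = not os.path.exists(path) or os.path.getsize(path) == 0
    if not is_new:
        with open(path, newline='') as existing:
            seen = {tuple(row) for row in csv.reader(existing)}
    added = 0
    with open(path, 'a', newline='') as csvfile:
        writer = csv.writer(csvfile)
        if is_new:
            writer.writerow(FIELDS)
        for row in rows:
            key = tuple(row)
            if key not in seen:
                writer.writerow(key)
                seen.add(key)
                added += 1
    return added


# Demonstrate on a temporary file with the values from the Texas example.
cornyn = ('John Cornyn', 'R - TX', 'http://twitter.com/JohnCornyn',
          'Sen.Cornyn@opencongress.org', '202-224-2934',
          'https://www.cornyn.senate.gov')
demo_path = os.path.join(tempfile.mkdtemp(), 'legislators.csv')
first = append_unique_rows(demo_path, [cornyn])
second = append_unique_rows(demo_path, [cornyn])  # same row again
print(first, second)
```

The function returns the number of rows it actually wrote, which makes a repeated run easy to spot.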





### Part 6: Get CSV File

In [149]:
from IPython.display import FileLink

In [None]:
FileLink('legislators.csv')
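
Before downloading, it can help to sanity-check the file. The sketch below assumes the header-less layout written by `extract_page_info` (name, party, twitter, email, phone, website) and a party field shaped like `'R - TX'`. The `count_by_party` helper and the sample rows are made up for illustration.

```python
import csv
import io
from collections import Counter

FIELDS = ['name', 'party', 'twitter', 'email', 'phone', 'website']


def count_by_party(csvfile):
    """Count rows per party letter, taken from values like 'R - TX'."""
    reader = csv.DictReader(csvfile, fieldnames=FIELDS)
    return Counter(row['party'].split(' - ')[0].strip() for row in reader)


# Made-up rows in the same column order; pass open('legislators.csv')
# instead to check the real file.
sample = io.StringIO(
    'Person A,R - TX,http://twitter.com/a,a@example.org,202-555-0101,https://a.example.org\n'
    'Person B,D - TX,http://twitter.com/b,b@example.org,202-555-0102,https://b.example.org\n'
    'Person C,R - VT,http://twitter.com/c,c@example.org,202-555-0103,https://c.example.org\n'
)
counts = count_by_party(sample)
print(counts)
```

A total far below the expected number of legislators would suggest that some state pages had not finished loading when they were parsed.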