# Webscraping with Beautiful Soup
*****
In this lesson we'll learn about various techniques to scrape data from websites. This lesson will include:

0. Discussion of complying with Terms of Use
1. Using Python's `BeautifulSoup` library
2. Collecting data from one page
3. Following collected links
4. Exporting data to CSV


# 0. Terms of Use
We'll be scraping [information on the state senators of Illinois](http://www.ilga.gov/senate), as well as the [list of bills](http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True) from the Illinois General Assembly. Your first step before scraping should always be to read the Terms of Use or Terms of Agreement for a website. Many websites will explicitly prohibit scraping in any form. Moreover, if you're affiliated with an institution, you may be breaching existing contracts by engaging in scraping. UC Berkeley's Library [recommends](http://guides.lib.berkeley.edu/text-mining) following this workflow:

![UCB-library-workflow](img/UCB-library-workflow.png)

While our source's [Terms of Use](http://www.ilga.gov/disclaimer.asp) do not explicitly prohibit scraping (nor does its [robots.txt](http://www.ilga.gov/robots.txt)), it is still advisable to contact the web administrator of any site you plan to scrape. Oftentimes there is an easier way to get the data that you want. We will not be placing much stress on their servers today, but please keep server load in mind while following along and executing the code.
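
A `robots.txt` file can also be checked programmatically. The sketch below uses the standard library's `urllib.robotparser` on a made-up rule set, so it runs without contacting any server. The rules and the `example.com` URLs are assumptions for illustration, not the contents of the ILGA file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (NOT the real ILGA file), kept inline
# so the example runs offline.
sample_rules = """\
User-agent: *
Disallow: /private/
""".splitlines()

parser = RobotFileParser()
parser.parse(sample_rules)

# Ask whether a generic crawler ("*") may fetch each URL
allowed = parser.can_fetch("*", "http://www.example.com/senate/default.asp")
blocked = parser.can_fetch("*", "http://www.example.com/private/data.asp")

print(allowed, blocked)
```

For a live site you would typically call `set_url()` and `read()` on the parser instead of `parse()`. A permissive `robots.txt` does not replace reading the Terms of Use.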

Let's go ahead and import the Python libraries we'll need:

In [1]:
import requests  # to make GET request
from bs4 import BeautifulSoup  # to parse the HTML response
import time  # to pause between calls
import csv  # to write data to csv
import pandas  # to see CSV

# 1. Using `BeautifulSoup`
*****

## 1.1 Make a GET request and parse the HTML response

We use the `requests` library just as we did with APIs, but this time we'll get an HTML response back instead of JSON or XML.

In [2]:
# make a GET request
response = requests.get('http://www.ilga.gov/senate/default.asp')

# read the content of the server’s response as a string
page_source = response.text
print(page_source[:1000])

<html lang="en">

<title>Illinois General Assembly - Senate Members</title>

<head>

<link rel="stylesheet" type="text/css" href="/style/lis.css">
<link rel="stylesheet" type="text/css" media="print" href="/style/print.css">
<link rel="GILS" href="http://info.er.usgs.gov/public/gils/gilsexec.html">
<link rel="Shortcut Icon" HREF="/LISlogo1.ico">
<SCRIPT LANGUAGE="JavaScript" TYPE="text/javascript">
<!--

if(window.event + "" == "undefined") event = null;
function HM_f_PopUp(){return false};
function HM_f_PopDown(){return false};
popUp = HM_f_PopUp;
popDown = HM_f_PopDown;

//-->
</SCRIPT>
  
<!--
    option explicit
  -->
<meta http-equiv="PICS-Label" content='(PICS-1.1 "http://www.weburbia.com/safe/ratings.htm" l r (s 0))'>
<meta name="classification" content="Government">
<meta name="distribution" content="Global">
<meta name="rating" content="General">
<meta name="contactState" content="IL">
<meta name="siteTitle" content="Illinois General Assembly">
<meta name=

## 1.2 *soup* it

Now we pass the response text to the `BeautifulSoup` constructor, which parses it into an HTML tree. This returns an object (called a *soup* object) containing all of the HTML in the original document.

In [3]:
# parse the response into an HTML tree soup object
soup = BeautifulSoup(page_source, 'html5lib')

# take a look
print(soup.prettify()[:1000])

<html lang="en">
 <head>
  <title>
   Illinois General Assembly - Senate Members
  </title>
  <link href="/style/lis.css" rel="stylesheet" type="text/css"/>
  <link href="/style/print.css" media="print" rel="stylesheet" type="text/css"/>
  <link href="http://info.er.usgs.gov/public/gils/gilsexec.html" rel="GILS"/>
  <link href="/LISlogo1.ico" rel="Shortcut Icon"/>
  <script language="JavaScript" type="text/javascript">
   <!--

if(window.event + "" == "undefined") event = null;
function HM_f_PopUp(){return false};
function HM_f_PopDown(){return false};
popUp = HM_f_PopUp;
popDown = HM_f_PopDown;

//-->
  </script>
  <!--
    option explicit
  -->
  <meta content='(PICS-1.1 "http://www.weburbia.com/safe/ratings.htm" l r (s 0))' http-equiv="PICS-Label"/>
  <meta content="Government" name="classification"/>
  <meta content="Global" name="distribution"/>
  <meta content="General" name="rating"/>
  <meta content="IL" name="contactState"/>
  <meta content="Illinois General Assembly" name="si
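
To see the same steps on a page small enough to read in full, here is a minimal sketch that soups a short HTML string. The HTML is invented for illustration, and it uses Python's built-in `html.parser` rather than `html5lib` only so that no extra parser needs to be installed.

```python
from bs4 import BeautifulSoup

# A tiny, made-up HTML document
html = (
    "<html><head><title>Senate Members</title></head>"
    "<body><p>Hello</p></body></html>"
)

# Parse the string into a soup object
demo_soup = BeautifulSoup(html, "html.parser")

# Tags can be reached by name; .string gives the text inside the tag
title_text = demo_soup.title.string
print(title_text)

# prettify() returns the tree as an indented string
print(demo_soup.prettify())
```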

## 1.3 Find Elements

`BeautifulSoup` has a number of functions to find things on a page. Like other scraping tools, `BeautifulSoup` lets you find elements by their:

1. HTML tags
2. HTML Attributes
3. CSS Selectors


Let's search first for **HTML tags**. 

The method `find_all` searches the `soup` tree for all the elements with a particular HTML tag, and returns them as a list.

What does the example below do?

In [4]:
soup.find_all("a")

[<a href="/default.asp"><img alt="Illinois General Assembly" border="0" height="49" src="/images/logo_sm.gif" width="462"/></a>,
 <a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>,
 

**NB**: Because `find_all()` is the most popular method in the `BeautifulSoup` search library, you can use a shortcut for it. If you treat the `BeautifulSoup` object as though it were a function, then it’s the same as calling `find_all()` on that object. 

In [5]:
soup("a")

[<a href="/default.asp"><img alt="Illinois General Assembly" border="0" height="49" src="/images/logo_sm.gif" width="462"/></a>,
 <a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>,
 

That's a lot! Many elements on a page will have the same HTML tag. For instance, if you search for everything with the `a` tag, you're likely to get a lot of stuff, much of which you don't want. What if we wanted to search for HTML tags ONLY with certain attributes, like particular CSS classes? 

We can do this by passing an additional argument to `find_all`. In the example below, we find all the `a` tags and keep only those with `class="sidemenu"`. The keyword argument is spelled `class_` because `class` is a reserved word in Python.

In [6]:
# get only the 'a' tags in 'sidemenu' class
soup("a", class_="sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senateaudvid.asp">  Live Audio/Video  </a>]
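
The difference between an unfiltered and a filtered search is easier to see on a small example. The following sketch runs on an invented HTML fragment (loosely modeled on the menus above) and again assumes the built-in `html.parser`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one main-menu link, two side-menu links,
# and a paragraph that shares the 'sidemenu' class
html = """
<a class="mainmenu" href="/">Home</a>
<a class="sidemenu" href="/senate/default.asp">Members</a>
<a class="sidemenu" href="/senate/rules.asp">Rules</a>
<p class="sidemenu">Not a link</p>
"""
demo_soup = BeautifulSoup(html, "html.parser")

all_links = demo_soup.find_all("a")                      # every <a> tag
side_links = demo_soup.find_all("a", class_="sidemenu")  # <a> tags with that class
shortcut = demo_soup("a", class_="sidemenu")             # calling the soup is find_all

print(len(all_links), len(side_links))
print([link["href"] for link in shortcut])
```

Note that the `<p>` element is excluded even though it has the right class, because the tag name must match as well.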

Oftentimes a more convenient way to find things on a website is by **CSS selector**. For this we use a different method, `select()`. Pass it a CSS selector as a string, and it returns a list of all the elements that match that selector.

In the example above, we can use "a.sidemenu" as a CSS selector, which returns all `a` tags with class `sidemenu`.

In [7]:
# get elements with "a.sidemenu" CSS Selector.
soup.select("a.sidemenu")

[<a class="sidemenu" href="/senate/default.asp">  Members  </a>,
 <a class="sidemenu" href="/senate/committees/default.asp">  Committees  </a>,
 <a class="sidemenu" href="/senate/schedules/default.asp">  Schedules  </a>,
 <a class="sidemenu" href="/senate/journals/default.asp">  Journals  </a>,
 <a class="sidemenu" href="/senate/transcripts/default.asp">  Transcripts  </a>,
 <a class="sidemenu" href="/senate/rules.asp">  Rules  </a>,
 <a class="sidemenu" href="/senateaudvid.asp">  Live Audio/Video  </a>]

CSS is one way to organize how we style a website. It allows us to categorize and label certain HTML elements, and to use these categories and labels to apply specific styling. CSS selectors are what we use to identify these elements and then decide what style to apply. We won't have time today to go into detail about HTML and CSS, but it's worth talking about the three most important CSS selectors:

1. **element selector**: simply including the element type, such as `a` above, will select all elements on the page of that element type. Try using your development tools (Chrome, Firefox, or Safari) to change all elements of the type `a` to a background color of `red`.

2. **class selector**: if you put a period (`.`) before the name of a class, all elements belonging to that class will be selected. Try using your development tools to change all elements of the class `detail` to a background color of `red`.

3. **ID selector**: if you put a hash sign (`#`) before the name of an id, all elements with that id will be selected. Try using the development tools to change all elements with the id `Senate` to a background color of `red`.

The above three examples will take all elements with the given property, but oftentimes you only want certain elements within the hierarchy. We can do that by placing selectors side by side, separated by a space: for example, `td.detail a` matches only the `a` elements nested somewhere inside a `td` of class `detail`.
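
The selector types described above can be tried out with `select()`. This is a minimal sketch on an invented HTML fragment; the id `Senate` and class `detail` are borrowed from the exercises above, but the markup itself is an assumption and not the real page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only
html = """
<div id="Senate">
  <table>
    <tr><td class="detail"><a href="/a">First</a></td></tr>
    <tr><td class="other"><a href="/b">Second</a></td></tr>
  </table>
</div>
<a href="/c">Outside</a>
"""
demo_soup = BeautifulSoup(html, "html.parser")

by_element = demo_soup.select("a")         # element selector: all <a> tags
by_class = demo_soup.select(".detail")     # class selector: anything with class 'detail'
by_id = demo_soup.select("#Senate")        # ID selector: the element with id 'Senate'
nested = demo_soup.select("td.detail a")   # <a> tags nested inside td.detail

print(len(by_element), len(by_class), len(by_id))
print([a.text for a in nested])
```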

### Challenge 1

Find all the `<a>` elements in class `mainmenu`

In [8]:
soup.select("a.mainmenu") # your code here

[<a class="mainmenu" href="/">Home</a>,
 <a class="mainmenu" href="/legislation/" onblur="HM_f_PopDown('elMenu1')" onfocus="HM_f_PopUp('elMenu1',event)" onmouseout="HM_f_PopDown('elMenu1')" onmouseover="HM_f_PopUp('elMenu1',event)">Legislation &amp; Laws</a>,
 <a class="mainmenu" href="/senate/" onblur="HM_f_PopDown('elMenu3')" onfocus="HM_f_PopUp('elMenu3',event)" onmouseout="HM_f_PopDown('elMenu3')" onmouseover="HM_f_PopUp('elMenu3',event)">Senate</a>,
 <a class="mainmenu" href="/house/" onblur="HM_f_PopDown('elMenu2')" onfocus="HM_f_PopUp('elMenu2',event)" onmouseout="HM_f_PopDown('elMenu2')" onmouseover="HM_f_PopUp('elMenu2',event)">House</a>,
 <a class="mainmenu" href="/mylegislation/" onblur="HM_f_PopDown('elMenu4')" onfocus="HM_f_PopUp('elMenu4',event)" onmouseout="HM_f_PopDown('elMenu4')" onmouseover="HM_f_PopUp('elMenu4',event)">My Legislation</a>,
 <a class="mainmenu" href="/sitemap.asp">Site Map</a>]

## 1.4 Get Attributes and Text of Elements

Once we identify elements, we want to access information in that element. Oftentimes this means two things:

1. Text
2. Attributes

Getting the text inside an element is easy. All we have to do is use the `text` attribute of a `Tag` object:

In [9]:
# this is a list
soup.select("a.sidemenu")

# we first want to get an individual tag object
first_link = soup.select("a.sidemenu")[0]

# check out its class
print(type(first_link))

<class 'bs4.element.Tag'>


It's a tag! Which means it has a `text` member:

In [10]:
print(first_link.text)

  Members  


You'll see there is some extra spacing here. We can use the string method `strip` to remove it:

In [11]:
print(first_link.text.strip())

Members


Sometimes we want the value of certain attributes. This is particularly relevant for `a` tags, or links, where the `href` attribute tells us where the link goes.

You can access a tag’s attributes by treating the tag like a dictionary:

In [12]:
print(first_link['href'])

/senate/default.asp


Nice, but that doesn't look like a full URL! Don't worry, we'll get to this soon.
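
Before moving on, here is a short sketch of two related conveniences, run on an invented fragment with the built-in `html.parser`: `get_text(strip=True)` returns the text with surrounding whitespace already removed, and `.get()` returns `None` for a missing attribute where square brackets would raise a `KeyError`.

```python
from bs4 import BeautifulSoup

# Hypothetical fragment: one link with an href, one without
html = '<a class="sidemenu" href="/senate/default.asp">  Members  </a><a>No link</a>'
demo_soup = BeautifulSoup(html, "html.parser")
first, second = demo_soup.find_all("a")

label = first.get_text(strip=True)  # text with leading/trailing whitespace removed
href = first.get("href")            # same value as first["href"]
missing = second.get("href")        # None, because this tag has no href
attrs = first.attrs                 # all attributes as a dictionary

print(label, href, missing)
print(attrs)
```

Using `.get()` can be handy in loops over many tags, where one tag without the attribute would otherwise stop the whole scrape.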

### Challenge 2

Find all the `href` attributes (URLs) from the mainmenu by writing a list comprehension, and assign the result to `rel_paths`.

In [13]:
rel_paths = [link['href'] for link in soup.select('a.mainmenu')] # your code here

In [14]:
print(rel_paths)

['/', '/legislation/', '/senate/', '/house/', '/mylegislation/', '/sitemap.asp']


# 2. Collecting information
*****

Believe it or not, that's all you need to scrape a website. Let's apply these skills to scrape the [98th general assembly](http://www.ilga.gov/senate/default.asp?GA=98).

Our goal is to scrape information on each senator, including their:
* name
* district
* party

## 2.1 First, make the GET request and *soup* it

In [15]:
# make a GET request
response = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
page_source = response.text

# soup it
soup = BeautifulSoup(page_source, "html5lib")

## 2.2 Find the right elements and text

The senators are listed in a table on this page, so let's try to get a list of the rows in that table. Remember that rows are identified by the `tr` tag.

In [16]:
# get all tr elements
rows = soup.find_all("tr")
print(len(rows))

73


But remember, `find_all` gets all the elements with the `tr` tag. We can use smart CSS selectors to get only the rows we want.

In [17]:
# returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')
print(rows[2].prettify())

<tr>
 <td bgcolor="white" class="detail" width="40%">
  <a href="/senate/Senator.asp?GA=98&amp;MemberID=1911">
   Pamela J. Althoff
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenatorBills.asp?GA=98&amp;MemberID=1911">
   Bills
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  <a href="SenCommittees.asp?GA=98&amp;MemberID=1911">
   Committees
  </a>
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  32
 </td>
 <td align="center" bgcolor="white" class="detail" width="15%">
  R
 </td>
</tr>
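
To see why `tr tr tr` narrows the search, recall that a space between selectors means "nested inside". The sketch below uses invented nested tables (the real page's layout is more involved) to compare `find_all("tr")` with `select("tr tr tr")`.

```python
from bs4 import BeautifulSoup

# Hypothetical layout: a data table nested two tables deep
html = """
<table><tr><td>
  <table><tr><td>
    <table>
      <tr><td>Pamela J. Althoff</td><td>32</td><td>R</td></tr>
      <tr><td>Jason A. Barickman</td><td>53</td><td>R</td></tr>
    </table>
  </td></tr></table>
</td></tr></table>
"""
demo_soup = BeautifulSoup(html, "html.parser")

all_rows = demo_soup.find_all("tr")        # every row, including the layout rows
inner_rows = demo_soup.select("tr tr tr")  # only rows nested inside two other rows

print(len(all_rows), len(inner_rows))
print([row.td.text for row in inner_rows])
```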



We can use the `select` method on anything. Let's say we want to find everything with the CSS selector `td.detail` in an item of the list we created above.

In [18]:
# select only those 'td' tags with class 'detail'
row = rows[2]
detail_cells = row.select('td.detail')
detail_cells

[<td bgcolor="white" class="detail" width="40%"><a href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>,
 <td align="center" bgcolor="white" class="detail" width="15%">32</td>,
 <td align="center" bgcolor="white" class="detail" width="15%">R</td>]

Most of the time, we're interested in the actual **text** of a website, not its tags. Remember, to get the text of an HTML element, use the `text` member.

In [19]:
# Keep only the text in each of those cells
row_data = [cell.text for cell in detail_cells]
print(row_data)

['Pamela J. Althoff', 'Bills', 'Committees', '32', 'R']


Now we can combine the `BeautifulSoup` tools with our basic python skills to scrape an entire web page.

In [20]:
# check it out
print(row_data[0]) # name
print(row_data[3]) # district
print(row_data[4]) # party

Pamela J. Althoff
32
R


## 2.3 Loop it all together

Let's use a `for` loop to get 'em all! We'll start at the beginning with the request:

In [21]:
# make a GET request
response = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
page_source = response.text

# soup it
soup = BeautifulSoup(page_source, "html5lib")

# create empty list to store our data
members = []

# returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

# loop through all rows
for row in rows:

    # select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    
    # get rid of junk rows
    if len(detail_cells) != 5:
        continue
        
    # keep only the text in each of those cells
    row_data = [cell.text for cell in detail_cells]
    
    # collect information
    name = row_data[0]
    district = int(row_data[3])
    party = row_data[4]
    
    # store in a tuple
    tup = (name, district, party)
    
    # append to list
    members.append(tup)

In [22]:
print(len(members))
print()
print(members)

61

[('Pamela J. Althoff', 32, 'R'), ('Jason A. Barickman', 53, 'R'), ('Scott M Bennett', 52, 'D'), ('Jennifer Bertino-Tarrant', 49, 'D'), ('Daniel Biss', 9, 'D'), ('Tim Bivins', 45, 'R'), ('William E. Brady', 44, 'R'), ('Melinda Bush', 31, 'D'), ('James F. Clayborne, Jr.', 57, 'D'), ('Jacqueline Y. Collins', 16, 'D'), ('Michael Connelly', 21, 'R'), ('John J. Cullerton', 6, 'D'), ('Thomas Cullerton', 23, 'D'), ('Bill Cunningham', 18, 'D'), ('William Delgado', 2, 'D'), ('Kirk W. Dillard', 24, 'R'), ('Dan Duffy', 26, 'R'), ('Gary Forby', 59, 'D'), ('Michael W. Frerichs', 52, 'D'), ('William R. Haine', 56, 'D'), ('Don Harmon', 39, 'D'), ('Napoleon Harris, III', 15, 'D'), ('Michael E. Hastings', 19, 'D'), ('Linda Holmes', 42, 'D'), ('Mattie Hunter', 3, 'D'), ('Toi W. Hutchinson', 40, 'D'), ('Mike Jacobs', 36, 'D'), ('Emil Jones, III', 14, 'D'), ('David Koehler', 46, 'D'), ('Dan Kotowski', 28, 'D'), ('Darin M. LaHood', 37, 'R'), ('Steven M. Landek', 12, 'D'), ('Kimberly A. Lightford', 4, 'D
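
One possible refinement, sketched here rather than prescribed, is to move the per-row logic into a small function so it can be tried on a single row without making a request. The names `Member` and `parse_member_row` are hypothetical, and the HTML is a trimmed copy of the row shown earlier, parsed with the built-in `html.parser`.

```python
from collections import namedtuple
from bs4 import BeautifulSoup

# Hypothetical record type for one senator
Member = namedtuple("Member", ["name", "district", "party"])

def parse_member_row(row):
    """Return a Member for a senator row, or None for any other row."""
    cells = row.select("td.detail")
    if len(cells) != 5:  # junk rows do not have exactly five detail cells
        return None
    texts = [cell.get_text(strip=True) for cell in cells]
    return Member(name=texts[0], district=int(texts[3]), party=texts[4])

# One senator row (trimmed from the output above) plus one junk row
row_html = """
<table><tr>
 <td class="detail"><a href="/senate/Senator.asp?GA=98&amp;MemberID=1911">Pamela J. Althoff</a></td>
 <td class="detail"><a href="SenatorBills.asp?GA=98&amp;MemberID=1911">Bills</a></td>
 <td class="detail"><a href="SenCommittees.asp?GA=98&amp;MemberID=1911">Committees</a></td>
 <td class="detail">32</td>
 <td class="detail">R</td>
</tr><tr><td>spacer row</td></tr></table>
"""
demo_rows = BeautifulSoup(row_html, "html.parser").find_all("tr")
parsed = [parse_member_row(row) for row in demo_rows]
demo_members = [m for m in parsed if m is not None]

print(demo_members)
```

A named tuple behaves like the plain tuples used above, but fields can also be read by name, for example `demo_members[0].district`.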

### Challenge 3: Get HREF element pointing to members' bills

The code above retrieves information on:  

* the senator's name
* their district number
* and their party

We now want to retrieve the URL for each senator's list of bills. The format for the list of bills for a given senator is:

http://www.ilga.gov/senate/SenatorBills.asp + ? + GA=98 + &MemberID=**_memberID_** + &Primary=True

to get something like:

http://www.ilga.gov/senate/SenatorBills.asp?MemberID=1911&GA=98&Primary=True

You should be able to see that, unfortunately, _memberID_ is not currently something pulled out in our scraping code.

Your initial task is to modify the code above so that we also **retrieve the full URL which points to the corresponding page of primary-sponsored bills**, for each member, and return it along with their name, district, and party.

Tips: 

* To do this, you will want to get the appropriate anchor element (`<a>`) in each legislator's row of the table. You can again use the `.select()` method on the `row` object in the loop to do this — similar to the command that finds all of the `td.detail` cells in the row. Remember that we only want the link to the legislator's bills, not the committees or the legislator's profile page.
* The anchor elements' HTML will look like `<a href="/senate/Senator.asp/...">Bills</a>`. The string in the `href` attribute contains the **relative** link we are after. You can access an attribute of a `BeautifulSoup` `Tag` object the same way you access a Python dictionary: `anchor['attributeName']`. (See the <a href="http://www.crummy.com/software/BeautifulSoup/bs4/doc/#tag">documentation</a> for more details). There are a _lot_ of different ways to use BeautifulSoup to get things done; whatever you need to do to pull that `href` out is fine.
* Since we will only get a relative link, you'll have to do some concatenating to get the full URLs.
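
As an alternative to concatenating strings by hand, the standard library's `urllib.parse.urljoin` resolves a relative link against the URL of the page it was found on. This is only a sketch of that option; the two `href` values are copied from the row output shown earlier.

```python
from urllib.parse import urljoin

# The page on which the links were found
page_url = "http://www.ilga.gov/senate/default.asp?GA=98"

# Relative links as they appear in the href attributes
bills_href = "SenatorBills.asp?GA=98&MemberID=1911"       # relative to the current folder
profile_href = "/senate/Senator.asp?GA=98&MemberID=1911"  # relative to the site root

bills_url = urljoin(page_url, bills_href)
profile_url = urljoin(page_url, profile_href)

print(bills_url)
print(profile_url)
```

`urljoin` handles both kinds of relative link, so the same call works whether or not the `href` starts with a slash.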


**Use the code from the for-loop above and simply add the full path to the tuple. I've copied it in below.**

In [23]:
# make a GET request
response = requests.get('http://www.ilga.gov/senate/default.asp?GA=98')

# read the content of the server’s response
page_source = response.text

# soup it
soup = BeautifulSoup(page_source, "html5lib")

# create empty list to store our data
members = []

# returns every ‘tr tr tr’ css selector in the page
rows = soup.select('tr tr tr')

# loop through all rows
for row in rows:

    # select only those 'td' tags with class 'detail'
    detail_cells = row.select('td.detail')
    
    # get rid of junk rows
    if len(detail_cells) != 5:
        continue
        
    # keep the cell tags themselves (not just their text) so we can reach the links
    row_data = list(detail_cells)
    
    # collect information
    name = row_data[0].text
    district = int(row_data[3].text)
    party = row_data[4].text

    # build the full URL from the relative link in the "Bills" cell
    url = 'http://www.ilga.gov/senate/' + row_data[1].select('a')[0]['href'] + '&Primary=True'
    
    # store in a tuple
    tup = (name, district, party, url)  # ADD URL VARIABLE TO TUPLE
    
    # append to list
    members.append(tup)

In [24]:
members[:5]

[('Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True'),
 ('Jason A. Barickman',
  53,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2018&Primary=True'),
 ('Scott M Bennett',
  52,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2272&Primary=True'),
 ('Jennifer Bertino-Tarrant',
  49,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2022&Primary=True'),
 ('Daniel Biss',
  9,
  'D',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=2020&Primary=True')]

Cool! Now you can probably guess how to loop it all together by iterating through the links we just extracted.

# 3. Following links to scrape bills
*****

## 3.1 Writing a scraper function

Now we want to scrape the webpages corresponding to bills sponsored by each senator.

We're going to write a function called `get_bills(url)` to scrape the list of bills found at a given senator's bills URL. This will involve:

  - requesting the URL using the <a href="http://docs.python-requests.org/en/latest/">`requests`</a> library
  - using the features of the `BeautifulSoup` library to find all of the `<td>` elements with the class `billlist`
  - returning a `list` of tuples, each with:
      - description (2nd column)
      - chamber (S or H) (3rd column)
      - the last action (4th column)
      - the last action date (5th column)

In [25]:
def get_bills(url):
    
    # make the GET request
    response = requests.get(url)
    page_source = response.text
    soup = BeautifulSoup(page_source, "html5lib")
    
    # get the table rows
    rows = soup.select('tr tr tr')
    
    # make empty list to collect the info
    bills = []
    for row in rows:
        
        # get columns
        detail_cells = row.select('td.billlist')
        if len(detail_cells) != 5:
            continue
            
        # get text in each column
        row_data = [cell.text for cell in row]

        # append data in columns 2-5
        bills.append(tuple(row_data[2:6]))
        
    return bills

In [26]:
# function test
test_url = members[0][3]
print(test_url)
get_bills(test_url)[0:5]

http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True


[('MEDICAID BUDGET NOTE ACT', 'S', 'Session Sine Die', '1/13/2015'),
 ('HOMELESS VETERANS SHELTER ACT', 'S', 'Session Sine Die', '1/13/2015'),
 ('ROAD FUND-NO TRANSFERS', 'S', 'Session Sine Die', '1/13/2015'),
 ('EPA-RULES-DOCUMENT SUBMISSION',
  'S',
  'Public Act . . . . . . . . . 98-0072',
  '7/15/2013'),
 ('MIN WAGE-OVERTIME-ALTERN SHIFT', 'S', 'Session Sine Die', '1/13/2015')]
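
A scraper like this is easier to check when the parsing step can run on a saved string instead of a live page. The sketch below shows one way to split that step out. The function name `parse_bills`, the five-column layout, and the placeholder row are all assumptions for illustration; the real bills page may be structured differently.

```python
from bs4 import BeautifulSoup

def parse_bills(page_source):
    """Extract (description, chamber, last action, date) tuples from HTML text."""
    soup = BeautifulSoup(page_source, "html.parser")
    bills = []
    for row in soup.find_all("tr"):
        cells = row.select("td.billlist")
        if len(cells) != 5:  # skip header and spacer rows
            continue
        texts = [cell.get_text(strip=True) for cell in cells]
        bills.append(tuple(texts[1:5]))  # 2nd through 5th cells
    return bills

# Placeholder markup, NOT copied from the real site
sample_page = """
<table>
<tr><td>header row</td></tr>
<tr>
 <td class="billlist">BILL-ID</td>
 <td class="billlist">EXAMPLE DESCRIPTION</td>
 <td class="billlist">S</td>
 <td class="billlist">Example last action</td>
 <td class="billlist">1/1/2015</td>
</tr>
</table>
"""
sample_bills = parse_bills(sample_page)
print(sample_bills)
```

With this split, the function that makes the request only needs to fetch `response.text` and hand it to the parser.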

## 3.2 Get all the bills

Finally, we build a list, `bills_info`, in which each entry combines a senator's details (name, district, party, and bills URL) with one of the bills they sponsored. You can do this by looping over the senators in `members` and calling `get_bills()` on each senator's bills URL.

NOTE: Please call the function `time.sleep(5)` for each iteration of the loop, so that we don't destroy the state's web site.

In [27]:
bills_info = []
for member in members[:3]:  # only go through the first 3 members
    
    print(member[0])
    member_bills = get_bills(member[3])
    for b in member_bills:
        bill = list(member) + list(b)
        bills_info.append(bill)

    time.sleep(5)

Pamela J. Althoff
Jason A. Barickman
Scott M Bennett
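
The pause-between-requests pattern can also be wrapped in a small helper. The following is a sketch under stated assumptions: `fetch_politely` and `fake_fetch` are hypothetical names, and the stand-in fetch function makes no network calls so that the example runs offline with a short delay.

```python
import time

def fetch_politely(urls, fetch, delay_seconds=5.0):
    """Call fetch(url) for each URL, sleeping between consecutive calls."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            time.sleep(delay_seconds)  # pause before every call except the first
        results.append(fetch(url))
    return results

def fake_fetch(url):
    """Stand-in for a real request; just returns the length of the URL."""
    return len(url)

demo_urls = ["http://example.com/a", "http://example.com/bb"]

start = time.monotonic()
lengths = fetch_politely(demo_urls, fake_fetch, delay_seconds=0.05)
elapsed = time.monotonic() - start

print(lengths)
print(round(elapsed, 2))
```

In the scraping loop above, `fetch` would be `get_bills` and the delay would stay at 5 seconds.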


In [28]:
bills_info

[['Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True',
  'MEDICAID BUDGET NOTE ACT',
  'S',
  'Session Sine Die',
  '1/13/2015'],
 ['Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True',
  'HOMELESS VETERANS SHELTER ACT',
  'S',
  'Session Sine Die',
  '1/13/2015'],
 ['Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True',
  'ROAD FUND-NO TRANSFERS',
  'S',
  'Session Sine Die',
  '1/13/2015'],
 ['Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True',
  'EPA-RULES-DOCUMENT SUBMISSION',
  'S',
  'Public Act . . . . . . . . . 98-0072',
  '7/15/2013'],
 ['Pamela J. Althoff',
  32,
  'R',
  'http://www.ilga.gov/senate/SenatorBills.asp?GA=98&MemberID=1911&Primary=True',
  'MIN WAGE-OVERTIME-ALTERN SHIFT',
  'S',
  'Session Sine Die',
  '1/13/2015'],
 

# 4. Export to CSV

We can write this to a CSV too:

In [29]:
# manually decide on header names
header = ['Senator', 'District', 'Party', 'Bills Link', 'Description', 'Chamber', 'Last Action', 'Last Action Date']

with open('all-bills.csv', 'w', newline='') as output_file:
    csv_writer = csv.writer(output_file)
    csv_writer.writerow(header)
    csv_writer.writerows(bills_info)
    
pandas.read_csv('all-bills.csv')

Unnamed: 0,Senator,District,Party,Bills Link,Description,Chamber,Last Action,Last Action Date
0,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,MEDICAID BUDGET NOTE ACT,S,Session Sine Die,1/13/2015
1,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,HOMELESS VETERANS SHELTER ACT,S,Session Sine Die,1/13/2015
2,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,ROAD FUND-NO TRANSFERS,S,Session Sine Die,1/13/2015
3,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,EPA-RULES-DOCUMENT SUBMISSION,S,Public Act . . . . . . . . . 98-0072,7/15/2013
4,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,MIN WAGE-OVERTIME-ALTERN SHIFT,S,Session Sine Die,1/13/2015
5,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,TRANSPORTATION-TECH,S,Session Sine Die,1/13/2015
6,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,FOX WATERWAY FEE,S,Session Sine Die,1/13/2015
7,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,UNEMPLOYMENT INS PUBLIC SAFETY,S,Session Sine Die,1/13/2015
8,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,COUNTY BOARD MEMBERS,S,Public Act . . . . . . . . . 98-1159,1/9/2015
9,Pamela J. Althoff,32,R,http://www.ilga.gov/senate/SenatorBills.asp?GA...,VETERANS&MENTAL HEALTH COURTS,S,Public Act . . . . . . . . . 98-0152,8/2/2013
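
If you would rather check the export without pandas, the `csv` module can read the data back as well. The sketch below writes a few rows to an in-memory buffer instead of a file (an assumption made so it leaves nothing on disk) and reads them back with `csv.DictReader`. The two senators are taken from the list scraped earlier.

```python
import csv
import io

header = ["Senator", "District", "Party"]
rows = [
    ["James F. Clayborne, Jr.", 57, "D"],  # the name itself contains a comma
    ["Daniel Biss", 9, "D"],
]

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO(newline="")
writer = csv.writer(buffer)
writer.writerow(header)
writer.writerows(rows)

# Rewind and read the rows back as dictionaries keyed by the header
buffer.seek(0)
records = list(csv.DictReader(buffer))

print(records[0]["Senator"])
print(records[0]["District"])
```

The writer quotes the name that contains a comma, so it survives the round trip intact. Note that every value comes back as a string, so the district would need `int()` again.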
