# DSI Python Workshop, Part 2

Information about the U.S. Congress is available in analysis-friendly formats (CSV, JSON, etc.) from several online sources. One example is the [ProPublica Congress API][propublica-api]. Information about the California State Legislature, on the other hand, is less accessible.

Today we'll scrape the web to get biographical details about members of the California legislature, and then use this data to make a few visualizations.

[propublica-api]: https://propublica.github.io/congress-api-docs/

The official California Assembly and Senate websites list member names and addresses, but not much else. Wikipedia is a conveniently centralized source of information about public office holders. We'll start by scraping the [California State Legislature 2017-18 page][ca-legislature] at:
```
https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session
```

[ca-legislature]: https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session

## Notes

In [1]:
import requests
import pandas as pd
import lxml.html as lx

In [2]:
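url = "https://en.wikipedia.org/wiki/California_State_Legislature,_2017%E2%80%9318_session"
# A page that should not exist, for testing raise_for_status():
#url = "https://www.google.com/dgagegeh"

# Download the page and raise an exception if the status code signals an error.
response = requests.get(url)
response.raise_for_status()

# Parse the HTML and convert relative links into absolute URLs.
html = lx.fromstring(response.text)
html.make_links_absolute(url)

# Select the table at index 5 on the page, which holds the senator links.
tables = html.cssselect("table")
table = tables[5]

# Get the links in the third cell of each table row.
links = table.cssselect("tr td:nth-of-type(3) a")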

In [3]:
senator_links = [link.get("href") for link in links]

### Step 1: Downloading The Page

The first step is to download the page. We'll use __requests__ for this. __Requests__ is a Python package for sending HTTP requests--the same requests your web browser sends to download web pages.

There are a few [different kinds of HTTP requests][http-requests]. We're only going to send __GET__ requests, which are for getting web pages or other data from a server.

The server's response comes with an [HTTP status code][http-status] to indicate whether anything went wrong.
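
To make status codes concrete, here is a minimal sketch that uses only `http.HTTPStatus` from Python's standard library, so it runs without a network connection. The helper `is_error()` is a hypothetical name introduced for illustration. It mirrors the rule that `raise_for_status()` in __requests__ applies: codes from 400 through 599 count as errors.

```python
from http import HTTPStatus

def is_error(code):
    """Return True for client (4xx) and server (5xx) error codes."""
    return 400 <= code < 600

# Look up the standard phrase for a few common status codes.
for code in (200, 404, 500):
    status = HTTPStatus(code)
    print(code, status.phrase, "error" if is_error(code) else "ok")
```

A response with status 200 passes through `raise_for_status()` silently, while a 404 or 500 raises an exception.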

[http-status]: https://en.wikipedia.org/wiki/List_of_HTTP_status_codes
[http-requests]: https://en.wikipedia.org/wiki/Hypertext_Transfer_Protocol#Request_methods

### Step 2: Extracting Links

Now that we have the page, we need to extract a link for each of the 40 senators. Like most web pages, the page is written in HTML.

HTML uses "tags" to mark how text should be structured and formatted. Tags are written in angle brackets. Most tags come in pairs that surround the text they apply to. Closing tags begin with `</` rather than `<`. For instance, the `<b>` tag marks text as bold:
```html
This is regular text. <b>This is bold text.</b> This is more regular text.
```
Tags may include other tags as sub-elements.

Start tags can also include "attributes", which are additional information written after the tag name. As an example, the `<a>` tag marks a hyperlink and usually includes an `href` attribute with the URL to link to:
```html
<a href="http://www.google.com/" id="my_link">Google</a>
```
Attributes are space-separated; the tag above has two attributes.
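
As a quick illustration of tags and attributes, the sketch below feeds the `<a>` tag above to `html.parser` from Python's standard library. This is a stand-in chosen so the example has no dependencies; the workshop itself uses __lxml__. The class name `TagCollector` is made up for this example.

```python
from html.parser import HTMLParser

class TagCollector(HTMLParser):
    """Collect the name and attributes of every start tag."""

    def __init__(self):
        super().__init__()
        self.start_tags = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs, one per attribute.
        self.start_tags.append((tag, dict(attrs)))

parser = TagCollector()
parser.feed('<a href="http://www.google.com/" id="my_link">Google</a>')
print(parser.start_tags)
# [('a', {'href': 'http://www.google.com/', 'id': 'my_link'})]
```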

__lxml__ is a Python package for parsing HTML (and XML) documents. We can use __lxml__ to extract text from tags and attributes in the page we downloaded.

We'll use CSS selectors to tell __lxml__ which tags we want. You can learn more about CSS selectors through [this interactive tutorial][diner].

[diner]: http://flukeout.github.io/

### Your Turn!

Get links to the Wikipedia pages for each of California's 80 assembly members.

### Step 3: Downloading And Extracting Biographical Details

Extracting the biographical details for one senator follows the same procedure as the previous two steps. First we download the page with __requests__, and then parse the HTML with __lxml__.

The data are in a table again, and since we want the table's content rather than its attributes, it makes sense to use the `read_html()` function from __pandas__. This function uses __lxml__ behind the scenes to convert all of the tables in a page into data frames.

Finally, since we need to apply these operations to the pages for all 40 senators, we'll call the code from a loop over the senator URLs. 

Loop bodies are good candidates for functions, so we should move the code into a function before writing the loop. This is especially useful in Python, where loops can be written as comprehensions. Comprehensions are more concise and are often faster than equivalent loops that build a list with `append()`.
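
The pattern of "function first, then loop" can be sketched without any scraping. The function `page_title()` below is a hypothetical helper, not part of the workshop code; it only rearranges the text of a URL, so the example runs offline.

```python
from urllib.parse import unquote, urlsplit

def page_title(url):
    """Turn a Wikipedia-style URL into a readable page title."""
    last_part = urlsplit(url).path.rsplit("/", 1)[-1]
    return unquote(last_part).replace("_", " ")

urls = [
    "https://en.wikipedia.org/wiki/Ted_Gaines",
    "https://en.wikipedia.org/wiki/Richard_Pan",
]

# Regular loop: create an empty list, then append to it.
titles_loop = []
for u in urls:
    titles_loop.append(page_title(u))

# Comprehension: the same result in a single expression.
titles = [page_title(u) for u in urls]
print(titles)
# ['Ted Gaines', 'Richard Pan']
```

Because the loop body is a single function call, the function can be tested on one URL before it is applied to all of them.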

In [30]:
import pandas as pd
import time

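# For interactive testing, try one page first:
#url_senator = senator_links[0]

def scrape_bio(url_senator):
    # Print the URL to show progress.
    print(url_senator)
    response = requests.get(url_senator)
    response.raise_for_status()

    # The biographical details are in the first table with the infobox class.
    table = pd.read_html(response.text, attrs={"class": "infobox vcard"})[0]

    # The name is in the first cell of the infobox.
    name = table.iloc[0, 0]

    # Find the row labeled "Born" and take the value from its second column.
    has_born = table.iloc[:, 0].str.contains("Born", na=False)
    born = table.iloc[:, 1].loc[has_born].values[0]

    # Pause between requests to avoid overloading the server.
    time.sleep(0.5)

    return {"name": name, "born": born}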

In [27]:
bios = [scrape_bio(u) for u in senator_links]

https://en.wikipedia.org/wiki/Ted_Gaines
https://en.wikipedia.org/wiki/Mike_McGuire_(politician)
https://en.wikipedia.org/wiki/Bill_Dodd_(California_politician)
https://en.wikipedia.org/wiki/Jim_Nielsen
https://en.wikipedia.org/wiki/Cathleen_Galgiani
https://en.wikipedia.org/wiki/Richard_Pan
https://en.wikipedia.org/wiki/Steve_Glazer
https://en.wikipedia.org/wiki/Tom_Berryhill
https://en.wikipedia.org/wiki/Nancy_Skinner_(California_politician)
https://en.wikipedia.org/wiki/Bob_Wieckowski
https://en.wikipedia.org/wiki/Scott_Wiener
https://en.wikipedia.org/wiki/Anthony_Cannella
https://en.wikipedia.org/wiki/Jerry_Hill_(politician)
https://en.wikipedia.org/wiki/Andy_Vidak
https://en.wikipedia.org/wiki/Jim_Beall_(California_politician)
https://en.wikipedia.org/wiki/Jean_Fuller
https://en.wikipedia.org/wiki/Bill_Monning
https://en.wikipedia.org/wiki/Robert_Hertzberg
https://en.wikipedia.org/wiki/Hannah-Beth_Jackson
https://en.wikipedia.org/wiki/Connie_Leyva
https://en.wikipedia.org/wiki/Sco

In [29]:
pd.DataFrame(bios)

Unnamed: 0,born,name
0,"(1958-04-25) April 25, 1958 (age 58) Roseville...",Ted Gaines
1,"(1979-07-07) July 7, 1979 (age 37) Healdsburg,...",Mike McGuire
2,"William Harold Dodd (1956-06-10) June 10, 1956...",Bill Dodd
3,"(1944-07-31) July 31, 1944 (age 72) Fresno, Ca...",Jim Nielsen
4,"(1964-01-04) January 4, 1964 (age 53) Stockton...",Cathleen Galgiani
5,"(1965-10-28) October 28, 1965 (age 51)","Richard Pan M.D., M.P.H., FAAP"
6,"Steven Mitchell Glazer (1957-08-10) August 10,...",Steve Glazer
7,Thomas Charles Berryhill (1953-08-27) August 2...,Tom Berryhill
8,"(1954-08-12) August 12, 1954 (age 62)",Nancy Skinner
9,"(1955-02-18) February 18, 1955 (age 62) San Fr...",Bob Wieckowski


### Your Turn!

Modify `scrape_bio()` so that it also retrieves "Alma mater" and "Residence". If you finish early, try retrieving other details as well.

### Step 4: Scrubbing The Data

Some of the columns in the extracted data frame contain multiple biographical details. For instance, each senator's age is embedded in the `born` column.

Let's extract the ages with the text processing methods in __pandas__. With these we can operate on entire columns at once, which is more convenient than writing a loop (akin to vectorization in _R_).
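
One possible way to do this is with a regular expression and `str.extract()`, sketched below on two `born` values copied from the scraped output. This is an illustrative alternative, not necessarily the approach used in the workshop session, and the pattern assumes the age always appears as `(age NN)`.

```python
import pandas as pd

born = pd.Series([
    "(1958-04-25) April 25, 1958 (age\xa058) Roseville, California",
    "(1965-10-28) October 28, 1965 (age 51)",
])

# \s matches ordinary spaces as well as the non-breaking space (\xa0)
# that Wikipedia puts between "age" and the number.
age = born.str.extract(r"\(age\s+(\d+)\)", expand=False)

# The extracted values are strings, so convert them to numbers.
age = pd.to_numeric(age)
print(age.tolist())
# [58, 51]
```

The whole column is processed in one call, with no explicit loop.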

### Step 5: Visualizing The Data

__Bokeh__ is a Python package for creating semi-interactive visualizations. The package is [well-documented][bokeh-docs] and all of its plotting functions expect [tidy data][tidy] as input.

[bokeh-docs]: http://bokeh.pydata.org/en/latest/docs/user_guide.html
[tidy]: http://vita.had.co.nz/papers/tidy-data.html

In [6]:
senators = pd.read_csv("senators.csv")
senators.head()

Unnamed: 0,born,name,party,term
0,"(1958-04-25) April 25, 1958 (age 58) Roseville...",Ted Gaines,Republican,Member of the California Senate from the 1st d...
1,"(1979-07-07) July 7, 1979 (age 37) Healdsburg,...",Mike McGuire,Democratic,Member of the California State Senate from the...
2,"William Harold Dodd (1956-06-10) June 10, 1956...",Bill Dodd,Democratic (2013–present),Member of the California State Senate from the...
3,"(1944-07-31) July 31, 1944 (age 72) Fresno, Ca...",Jim Nielsen,Republican,Member of the California State Senate from the...
4,"(1964-01-04) January 4, 1964 (age 53) Stockton...",Cathleen Galgiani,Democratic,Member of the California State Senate from the...


In [7]:
senators.born[0]

'(1958-04-25) April 25, 1958 (age\xa058) Roseville, California'

In [18]:
senators["age"] = senators.born.str.rsplit(")").str.get(-2).str.rsplit().str.get(-1)
senators["age"] = pd.to_numeric(senators["age"])
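
The chained string methods above are dense, so here is the same sequence of steps applied to a single string in plain Python. The variable names are made up for this walkthrough.

```python
born = "(1958-04-25) April 25, 1958 (age\xa058) Roseville, California"

# Step 1: split on ")" to separate the date, the age, and the birthplace.
pieces = born.rsplit(")")

# Step 2: the second-to-last piece ends with the age.
before_last_paren = pieces[-2]

# Step 3: split on whitespace (including the non-breaking space \xa0)
# and keep the last word, which is the age.
words = before_last_paren.rsplit()
age = int(words[-1])
print(age)
# 58
```

In the data frame version, `.str.rsplit()` and `.str.get()` perform these steps on every row at once, and `pd.to_numeric()` plays the role of `int()`.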

In [20]:
import bokeh.charts as bkh
bkh.output_notebook()

In [21]:
plt = bkh.Histogram(senators, "age")
bkh.show(plt)

### Your Turn!

Extract each senator's birth state from the data frame. Make a bar plot that shows the number of senators born in each state.