# Bill Analysis

This script takes the most and least central bills and analyzes their texts.

We get the data about the most and least central bills from the Jupyter notebook that ranks senators and bills.

## Most / Least Central Bills

The most central bills from our ranking script include sres254, sres292, sres193, s1616, and sres184, all with vast cosponsorship and a custom centrality measure of 9.737875, as well as slightly less central bills s1182, sres6, sres173, s1598, and s722.

The least central bills include sres4, sconres1, sres16, sres7, s1848, s371, sres210, s1631, sres62, and s1662.

## Getting Bill Texts

Getting the full texts of bills can prove challenging. The ProPublica API does not supply them, but we can obtain them by screen-scraping the Government Publishing Office (GPO) website. For example, the page with the text for sres254 is https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm, while Senate bill 1616 has several versions, each of which represents the state of the legislation at one stage of consideration. Its final text can be found at https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm.

Some investigation and experimentation reveals that the URL we seek can be composed by concatenating the following parts, in order:

* `https://www.gpo.gov/fdsys/pkg/BILLS-115`
* bill stub (e.g. "sres254" or "s1616")
* code for stage of legislation
  - for bills we find: Introduced = "is", Referred in House = "rfh", Reported = "rs", Placed on Calendar = "pcs", Engrossed = "es", Enrolled = "enr"
  - for resolutions we find: Introduced = "is", Reported = "rs", Agreed to = "ats"
  - more info about these codes and what they mean can be found at https://www.senate.gov/reference/Printedlegislationkey.htm or https://www.gpo.gov/help/index.html#about_congressional_bills.htm
* `/html/BILLS-115`
* bill stub
* code for stage of legislation
* `.htm`
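
The composition above can be sketched as a small helper. This is only an illustration: the names `GPO_BASE`, `CONGRESS`, and `build_bill_url` are hypothetical and do not come from the original notebook, and the sketch assumes every bill of interest belongs to the 115th Congress, as in the example URLs.

```python
GPO_BASE = "https://www.gpo.gov/fdsys/pkg"  # prefix taken from the example URLs
CONGRESS = 115  # assumption: all bills here are from the 115th Congress


def build_bill_url(stub, stage, congress=CONGRESS):
    """Compose the GPO URL for a bill stub (e.g. 's1616') and a stage code (e.g. 'enr')."""
    # The package name appears twice in the URL: once as a directory, once as the file name.
    package = "BILLS-{}{}{}".format(congress, stub, stage)
    return "{}/{}/html/{}.htm".format(GPO_BASE, package, package)


print(build_bill_url("s1616", "enr"))
print(build_bill_url("sres254", "ats"))
```

Building the package name once and reusing it keeps the two halves of the URL from drifting apart.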

When we look at the HTML of a given bill or resolution, we also find a non-trivial amount of metadata, such as the list of legislative sponsors, which we'll want to discard before text analysis (we only want to analyze the contents of the legislation itself).

So, we have several challenges:

* identify the legislative stage of the bill or resolution so we can find the most up-to-date text
* construct the URL
* obtain, parse, and scrape the HTML
* from the text scraped, obtain just the bill text (as opposed to the list of sponsors, for example)

Let's set up a few variables with data we'll need.

In [2]:
most_central = ['sres254', 'sres292', 'sres193', 's1616', 'sres184', 's1182', 'sres6', 'sres173', 's1598', 's722']
least_central = ['sres4', 'sconres1', 'sres16', 'sres7', 's1848', 's371', 'sres210', 's1631', 'sres62', 's1662']
bill_status = ['enr', 'es', 'rs', 'rfh', 'is'] # latest to earliest stages
resolution_status = ['ats', 'rs', 'is']  # latest to earliest stages

Ideally we'd like to set up a system that constructs these URLs automatically. For now, we obtained them manually from the GPO website.

https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres292ats/html/BILLS-115sres292ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres193ats/html/BILLS-115sres193ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres184ats/html/BILLS-115sres184ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1182es/html/BILLS-115s1182es.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres6rs/html/BILLS-115sres6rs.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres173ats/html/BILLS-115sres173ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1598rs/html/BILLS-115s1598rs.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s722es/html/BILLS-115s722es.htm

https://www.gpo.gov/fdsys/pkg/BILLS-115sres4is/html/BILLS-115sres4is.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sconres1enr/html/BILLS-115sconres1enr.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres16ats/html/BILLS-115sres16ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres7ats/html/BILLS-115sres7ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1848pcs/html/BILLS-115s1848pcs.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s371enr/html/BILLS-115s371enr.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres210ats/html/BILLS-115sres210ats.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1631rs/html/BILLS-115s1631rs.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115sres62pcs/html/BILLS-115sres62pcs.htm
https://www.gpo.gov/fdsys/pkg/BILLS-115s1662pcs/html/BILLS-115s1662pcs.htm
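
One way the lookup might be automated is to try each stage code from latest to earliest and keep the first URL that exists. The sketch below is an assumption about how that could work, not the notebook's method: `latest_version_url` and its `url_exists` parameter are hypothetical names, and the existence check is passed in as a function so that a real HTTP check (for example, testing the status code of a `requests.head` call) can be swapped in. Note that the manual list above contains stage codes missing from the two status lists (`pcs` for bills, and `enr` and `pcs` for resolutions), so those lists would need extending before relying on this.

```python
bill_status = ['enr', 'es', 'rs', 'rfh', 'is']  # latest to earliest, as in the notebook
resolution_status = ['ats', 'rs', 'is']         # latest to earliest, as in the notebook


def latest_version_url(stub, stages, url_exists):
    """Return (stage, url) for the first stage whose URL exists, or None if none do."""
    for stage in stages:
        package = "BILLS-115{}{}".format(stub, stage)
        url = "https://www.gpo.gov/fdsys/pkg/{0}/html/{0}.htm".format(package)
        if url_exists(url):
            return stage, url
    return None


# Stand-in for a real HTTP check: a set of URLs we pretend are published.
pretend_published = {
    "https://www.gpo.gov/fdsys/pkg/BILLS-115s1182es/html/BILLS-115s1182es.htm",
    # Assumed for illustration only, to show that the later stage wins:
    "https://www.gpo.gov/fdsys/pkg/BILLS-115s1182is/html/BILLS-115s1182is.htm",
}

result = latest_version_url("s1182", bill_status, lambda url: url in pretend_published)
print(result)
```

Injecting the existence check keeps the search logic testable without touching the network.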

Obtain the text of the bill

In [17]:
from bs4 import BeautifulSoup
import requests
url = "https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm"
r = requests.get(url)

Make Soup

In [44]:
soup = BeautifulSoup(r.text, 'html.parser')
print(soup.prettify())

<html>
 <body>
  <pre>[Congressional Bills 115th Congress]
[From the U.S. Government Publishing Office]
[S. Res. 254 Agreed to Senate (ATS)]

&lt;DOC&gt;






115th CONGRESS
  1st Session
S. RES. 254

Relative to the death of Pietro ``Pete'' Vichi Domenici, former United 
              States Senator for the State of New Mexico.


_______________________________________________________________________


                   IN THE SENATE OF THE UNITED STATES

                           September 13, 2017

 Mr. McConnell (for himself, Mr. Schumer, Mr. Udall, Mr. Heinrich, Mr. 
 Alexander, Ms. Baldwin, Mr. Barrasso, Mr. Bennet, Mr. Blumenthal, Mr. 
Blunt, Mr. Booker, Mr. Boozman, Mr. Brown, Mr. Burr, Ms. Cantwell, Mrs. 
 Capito, Mr. Cardin, Mr. Carper, Mr. Casey, Mr. Cassidy, Mr. Cochran, 
 Ms. Collins, Mr. Coons, Mr. Corker, Mr. Cornyn, Ms. Cortez Masto, Mr. 
 Cotton, Mr. Crapo, Mr. Cruz, Mr. Daines, Mr. Donnelly, Ms. Duckworth, 
  Mr. Durbin, Mr. Enzi, Mrs. Ernst, Mrs. Feinstein, Mrs. F

Get just the "pre" tag content

In [64]:
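# Keep only the text nodes directly under <pre> (this returns a list of strings)
s = soup.find('pre').find_all(text=True, recursive=False)
# str(s) is the list's repr, so newlines show up as a literal backslash + n; drop them
print(str(s).replace('\\n',''))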

[u"[Congressional Bills 115th Congress][From the U.S. Government Publishing Office][S. Res. 254 Agreed to Senate (ATS)]<DOC>115th CONGRESS  1st SessionS. RES. 254Relative to the death of Pietro ``Pete'' Vichi Domenici, former United               States Senator for the State of New Mexico._______________________________________________________________________                   IN THE SENATE OF THE UNITED STATES                           September 13, 2017 Mr. McConnell (for himself, Mr. Schumer, Mr. Udall, Mr. Heinrich, Mr.  Alexander, Ms. Baldwin, Mr. Barrasso, Mr. Bennet, Mr. Blumenthal, Mr. Blunt, Mr. Booker, Mr. Boozman, Mr. Brown, Mr. Burr, Ms. Cantwell, Mrs.  Capito, Mr. Cardin, Mr. Carper, Mr. Casey, Mr. Cassidy, Mr. Cochran,  Ms. Collins, Mr. Coons, Mr. Corker, Mr. Cornyn, Ms. Cortez Masto, Mr.  Cotton, Mr. Crapo, Mr. Cruz, Mr. Daines, Mr. Donnelly, Ms. Duckworth,   Mr. Durbin, Mr. Enzi, Mrs. Ernst, Mrs. Feinstein, Mrs. Fischer, Mr.    Flake, Mr. Franken, Mr. Gardner, Mrs. Gill

Strip out everything up to and including the second (and last) line of underscores, which marks the end of the header and sponsor list.

In [69]:
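import re
# Replace the literal "\n" sequences in the list repr with spaces, then greedily
# delete everything up to and including the last run of four or more underscores
legislation = re.sub(r'(.+_{4,})', '', str(s).replace('\\n',' '))
# Remove any actual newline characters that remain
legislation = re.sub('\\n','', legislation)
legislation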

'                                 RESOLUTION     Relative to the death of Pietro ``Pete\'\' Vichi Domenici, former United                States Senator for the State of New Mexico.  Whereas Pete V. Domenici was born in Albuquerque, New Mexico in 1932; graduated          from the University of New Mexico and Denver University Law School; and          practiced law in Albuquerque; Whereas Pete V. Domenici was elected to the Albuquerque City Commission in 1966,          and as Chairman in 1967; Whereas Pete V. Domenici was first elected to the United States Senate in 1972          and served six terms as a Senator from the State of New Mexico with          honor and distinction, making him the longest serving Senator in New          Mexico history; Whereas Pete V. Domenici served the Senate as Chairman of the Committee on the          Budget for the One Hundred Fourth through One Hundred Sixth Congresses,          and during the One Hundred Seventh Congress; Whereas Pete V. Domenici serve
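
The same header-stripping idea can be sketched on the raw text rather than on the repr of a list. This is an alternative illustration, not the notebook's code: `strip_header` is a hypothetical name, and the sample string is an abbreviated stand-in for the page text (which could be obtained with something like `soup.find('pre').get_text()`).

```python
import re


def strip_header(raw_text):
    """Return the text after the last run of 4+ underscores, with whitespace collapsed.

    If the text contains no such run, the whole text is returned (collapsed).
    """
    body = re.split(r"_{4,}", raw_text)[-1]
    # split() with no arguments splits on any whitespace, including newlines
    return " ".join(body.split())


# Abbreviated stand-in for the scraped text of sres254
sample = (
    "[S. Res. 254 Agreed to Senate (ATS)]\n"
    "S. RES. 254\n"
    "____________________\n"
    " Mr. McConnell (for himself, Mr. Schumer, Mr. Udall)\n"
    "____________________\n"
    "\n"
    "          RESOLUTION\n"
    "\n"
    "Relative to the death of Pietro ``Pete'' Vichi Domenici\n"
)

print(strip_header(sample))
```

Working on the raw string avoids the escaped quotes and stray list brackets that come from calling `str()` on a list.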

In [70]:
from nltk import word_tokenize
# Split the cleaned text into word and punctuation tokens
tokens = word_tokenize(legislation)

In [71]:
tokens

['RESOLUTION',
 'Relative',
 'to',
 'the',
 'death',
 'of',
 'Pietro',
 '``',
 'Pete',
 "''",
 'Vichi',
 'Domenici',
 ',',
 'former',
 'United',
 'States',
 'Senator',
 'for',
 'the',
 'State',
 'of',
 'New',
 'Mexico',
 '.',
 'Whereas',
 'Pete',
 'V.',
 'Domenici',
 'was',
 'born',
 'in',
 'Albuquerque',
 ',',
 'New',
 'Mexico',
 'in',
 '1932',
 ';',
 'graduated',
 'from',
 'the',
 'University',
 'of',
 'New',
 'Mexico',
 'and',
 'Denver',
 'University',
 'Law',
 'School',
 ';',
 'and',
 'practiced',
 'law',
 'in',
 'Albuquerque',
 ';',
 'Whereas',
 'Pete',
 'V.',
 'Domenici',
 'was',
 'elected',
 'to',
 'the',
 'Albuquerque',
 'City',
 'Commission',
 'in',
 '1966',
 ',',
 'and',
 'as',
 'Chairman',
 'in',
 '1967',
 ';',
 'Whereas',
 'Pete',
 'V.',
 'Domenici',
 'was',
 'first',
 'elected',
 'to',
 'the',
 'United',
 'States',
 'Senate',
 'in',
 '1972',
 'and',
 'served',
 'six',
 'terms',
 'as',
 'a',
 'Senator',
 'from',
 'the',
 'State',
 'of',
 'New',
 'Mexico',
 'with',
 'honor',