# Bill Analysis

This script takes the most and least central bills and analyzes their texts.

The data about the most and least central bills comes from the Jupyter notebook that ranks senators and bills.

## Most / Least Central Bills

The most central bills from our ranking script are sres254, sres292, sres193, s1616, and sres184, each with extensive cosponsorship and a custom centrality measure of 9.737875. Slightly less central are s1182, sres6, sres173, s1598, and s722.

The least central bills include sres4, sconres1, sres16, sres7, s1848, s371, sres210, s1631, sres62, and s1662.

## Getting Bill Texts

Getting the full texts of bills can prove challenging. The ProPublica API does not supply them, but we can obtain them by screen-scraping the Government Publishing Office (GPO) website. For example, the page with the text for sres254 is https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm (the "agreed to" version). Senate bill 1616, by contrast, has several versions, each of which represents a state of the legislation as it goes through consideration. Its final text is at https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm (the enrolled version).

Some investigation and experimentation reveal that the URL we seek, for example `https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm`, can be composed by concatenating the following parts in order (115 is the number of the Congress):

* `https://www.gpo.gov/fdsys/pkg/BILLS-115`
* bill stub (e.g. "sres254" or "s1616")
* code for stage of legislation
  - for bills we find: Introduced = "is", Referred in House = "rfh", Reported = "rs", Placed on Calendar = "pcs", Engrossed = "es", Enrolled = "enr"
  - for resolutions we find: Introduced = "is", Reported = "rs", Agreed to = "ats"
  - more info about these codes and what they mean can be found at https://www.senate.gov/reference/Printedlegislationkey.htm or https://www.gpo.gov/help/index.html#about_congressional_bills.htm
* `/html/BILLS-115`
* bill stub
* code for stage of legislation
* `.htm`
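
The composition above can be sketched as a small helper. The names `GPO_BASE` and `build_gpo_url` are made up for this illustration and do not appear elsewhere in the notebook. The URL pattern is only the one inferred from the examples above, so it may not hold for every document on the site.

```python
# Hypothetical helper: assembles a GPO text URL from the parts listed above.
GPO_BASE = "https://www.gpo.gov/fdsys/pkg"

def build_gpo_url(stub, stage, congress=115):
    """Return the URL for a bill stub (e.g. 's1616') and stage code (e.g. 'enr')."""
    # The package name appears twice in the URL: once as a folder, once as the file name.
    package = f"BILLS-{congress}{stub}{stage}"
    return f"{GPO_BASE}/{package}/html/{package}.htm"

print(build_gpo_url("sres254", "ats"))
print(build_gpo_url("s1616", "enr"))
```

Keeping the Congress number as a parameter would make it easier to reuse the helper for other sessions, assuming the same URL layout applies to them.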

When we look at the HTML of a given bill or resolution, we also discover a non-trivial amount of metadata, such as the legislative sponsors, which we'll want to discard before text analysis (we only want to analyze the contents of the legislation itself). For example, resolutions have a long separator line of underscores followed by the word "RESOLUTION"; everything above that (sponsor names and other metadata) is of no interest to us. Bills are a little different: they also have a separator line, but it can be followed by either "A BILL" or "AN ACT".
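
A minimal sketch of that header-stripping step, run on a made-up page fragment rather than a real GPO page. The helper name `strip_header` and the sample text are illustrative assumptions. The pattern assumes the separator is a run of at least 20 underscores.

```python
import re

# Separator line (20+ underscores) followed by the heading that opens the legislative text.
HEADER_PATTERN = re.compile(r"_{20,}\s*(AN ACT|RESOLUTION|A BILL)")

def strip_header(page_text):
    """Return the text after the separator and heading, or None if no separator is found."""
    # With one capturing group and maxsplit=1, split() yields [before, heading, after].
    parts = HEADER_PATTERN.split(page_text, maxsplit=1)
    if len(parts) < 3:
        return None
    return parts[2].strip()

# Invented sample that only imitates the layout described above.
sample = (
    "115th CONGRESS\n"
    "S. RES. 000\n"
    "Mr. Example submitted the following resolution\n"
    + "_" * 70 + "\n\n"
    "                               RESOLUTION\n\n"
    "Recognizing something.\n"
)

print(strip_header(sample))
```

Returning `None` when no separator is found makes it easy to spot pages whose layout differs from the expected one, instead of silently keeping the metadata.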

So, we have several challenges:

* identify the legislative stage of the bill or resolution so we can find the most up-to-date text
* construct the URL
* obtain, parse, and scrape the html
* from the text scraped, obtain just the bill or resolution text (as opposed to the list of sponsors, for example)
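
The first two challenges could be handled together by trying stage codes from latest to earliest and keeping the first URL that exists. The sketch below is one possible approach, not what this notebook actually does. The function names, the ordering of the stage lists, and the rule that any stub containing "res" is a resolution are all assumptions. The existence check is passed in as a function so the logic can be tried without network access.

```python
# Assumed orderings, latest stage first; adjust as needed.
BILL_STAGES = ["enr", "es", "rs", "rfh", "is"]
RESOLUTION_STAGES = ["ats", "rs", "is"]

def gpo_url(stub, stage, congress=115):
    """Build the GPO URL for one stub and stage code."""
    package = f"BILLS-{congress}{stub}{stage}"
    return f"https://www.gpo.gov/fdsys/pkg/{package}/html/{package}.htm"

def stages_for(stub):
    """Pick the stage list; stubs containing 'res' are treated as resolutions (assumption)."""
    return RESOLUTION_STAGES if "res" in stub else BILL_STAGES

def latest_url(stub, url_exists):
    """Return the URL of the latest available stage, or None if none is found."""
    for stage in stages_for(stub):
        url = gpo_url(stub, stage)
        if url_exists(url):
            return url
    return None

# Offline demonstration with a pretend set of available pages.
available = {gpo_url("s1616", "is"), gpo_url("s1616", "enr"), gpo_url("sres6", "rs")}
print(latest_url("s1616", available.__contains__))
print(latest_url("sres6", available.__contains__))
```

Against the live site, `url_exists` could be something like `lambda u: requests.head(u).ok`, assuming the server answers missing pages with an error status. The stage lists would also need extending: the URLs collected by hand later in this notebook include `pcs` versions and an `enr` version of the concurrent resolution sconres1, none of which these lists would find.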

Let's set up a few variables with data we'll need.

In [2]:
most_central = ['sres254', 'sres292', 'sres193', 's1616', 'sres184', 's1182', 'sres6', 'sres173', 's1598', 's722']
least_central = ['sres4', 'sconres1', 'sres16', 'sres7', 's1848', 's371', 'sres210', 's1631', 'sres62', 's1662']
bill_status = ['enr', 'es', 'rs', 'rfh', 'is'] # latest to earliest stages
resolution_status = ['ats', 'rs', 'is']  # latest to earliest stages

Ideally we'd like to set up a system that would construct these URLs automatically. For now, we have collected them by hand from the GPO website.

In [138]:
more_central_legislation = ["https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres292ats/html/BILLS-115sres292ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres193ats/html/BILLS-115sres193ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres184ats/html/BILLS-115sres184ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1182es/html/BILLS-115s1182es.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres6rs/html/BILLS-115sres6rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres173ats/html/BILLS-115sres173ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1598rs/html/BILLS-115s1598rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s722es/html/BILLS-115s722es.htm"]

less_central_legislation = ["https://www.gpo.gov/fdsys/pkg/BILLS-115sres4is/html/BILLS-115sres4is.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres1enr/html/BILLS-115sconres1enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres16ats/html/BILLS-115sres16ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres7ats/html/BILLS-115sres7ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1848pcs/html/BILLS-115s1848pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s371enr/html/BILLS-115s371enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres210ats/html/BILLS-115sres210ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1631rs/html/BILLS-115s1631rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres62pcs/html/BILLS-115sres62pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1662pcs/html/BILLS-115s1662pcs.htm"]

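Next we obtain the text of each bill or resolution, discard the metadata above the separator line, and concatenate the results into one corpus per list.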

In [141]:
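import re

import requests
from bs4 import BeautifulSoup

# Separator line of underscores followed by the heading that opens the legislative text
HEADER_PATTERN = re.compile(r"_{20,}\s*(AN ACT|RESOLUTION|A BILL)")

def create_corpus(urllist):
    """Download each URL and concatenate the legislative text of all pages."""
    legislation_text = ""
    for url in urllist:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Assumes the text sits in the first <pre> element; keep only its direct text nodes
        s = soup.find('pre').find_all(string=True, recursive=False)
        # Join the text nodes and remove line breaks
        clean = "".join(s).replace('\n', '')
        # Drop everything before the separator; keep the heading and the body
        split = HEADER_PATTERN.split(clean)
        legislation = " ".join(split[1:])
        legislation_text += legislation
    return legislation_text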

In [142]:
more_central_legislation_text = create_corpus(more_central_legislation)
less_central_legislation_text = create_corpus(less_central_legislation)

300776

## Comparing Corpora

Now that we have two corpora (the most and the least central legislation in the 115th Senate), we can check whether there are any detectable linguistic differences between them. We're looking, specifically, for distinctive vocabulary. We can do this with TF-IDF (term frequency-inverse document frequency) analysis, using scikit-learn and NLTK.
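
To make the idea concrete, here is a hand-rolled TF-IDF sketch in plain Python on two invented one-sentence "corpora". It is not the scikit-learn computation used in this notebook. The tokenizer, the function names, and the smoothed IDF form ln((1 + N) / (1 + df)) + 1 are choices made for the illustration, and `TfidfVectorizer` applies additional normalization, so its numbers will differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic runs only (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())

def tfidf(docs):
    """Return one {term: score} dict per document."""
    counts = [Counter(tokenize(d)) for d in docs]
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for c in counts:
        df.update(c.keys())
    scores = []
    for c in counts:
        total = sum(c.values())
        scores.append({
            term: (n / total) * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term, n in c.items()
        })
    return scores

# Toy stand-ins for the two corpora; the sentences are invented.
docs = [
    "the senate honors the veterans of the armed forces",
    "the senate imposes sanctions on the government",
]
central, peripheral = tfidf(docs)

# Terms unique to one corpus outscore equally frequent terms shared by both.
print(central["veterans"] > central["senate"])
```

Very frequent words such as "the" still score highly here because of their raw counts, which is one reason to remove stop words or stem tokens before comparing real corpora.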

In [122]:
import nltk
from nltk import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem.porter import PorterStemmer


In [123]:
tokens

['To',
 'provide',
 'congressional',
 'review',
 'and',
 'to',
 'counter',
 'Iranian',
 'and',
 'Russian',
 'governments',
 "'",
 'aggression',
 '.',
 'Be',
 'it',
 'enacted',
 'by',
 'the',
 'Senate',
 'and',
 'House',
 'of',
 'Representatives',
 'of',
 'the',
 'United',
 'States',
 'of',
 'America',
 'in',
 'Congress',
 'assembled',
 ',',
 'SECTION',
 '1',
 '.',
 'SHORT',
 'TITLE',
 ';',
 'TABLE',
 'OF',
 'CONTENTS',
 '.',
 '(',
 'a',
 ')',
 'Short',
 'Title.',
 '--',
 'This',
 'Act',
 'may',
 'be',
 'cited',
 'as',
 'the',
 '``',
 'Countering',
 'Iran',
 "'s",
 'Destabilizing',
 'Activities',
 'Act',
 'of',
 '2017',
 "''",
 '.',
 '(',
 'b',
 ')',
 'Table',
 'of',
 'Contents.',
 '--',
 'The',
 'table',
 'of',
 'contents',
 'for',
 'this',
 'Act',
 'is',
 'as',
 'follows',
 ':',
 'Sec',
 '.',
 '1',
 '.',
 'Short',
 'title',
 ';',
 'table',
 'of',
 'contents.Sec',
 '.',
 '2',
 '.',
 'Definitions.Sec',
 '.',
 '3',
 '.',
 'Regional',
 'strategy',
 'for',
 'countering',
 'conventional',
 