# Bill Analysis

This script takes the most and least central bills and analyzes their texts.

We get the data about the most and least central bills from the Jupyter notebook that ranks senators and bills.

## Most / Least Central Bills

The most central bills from our ranking script include sres254, sres292, sres193, s1616, and sres184, all with vast cosponsorship and a custom centrality measure of 9.737875, as well as slightly less central bills s1182, sres6, sres173, s1598, and s722.

The least central bills include sres4, sconres1, sres16, sres7, s1848, s371, sres210, s1631, sres62, and s1662.

## Getting Bill Texts

Getting full texts of bills can prove challenging.  The ProPublica API does not supply full texts for bills.  We can obtain them by screen-scraping from the Government Publishing Office (GPO).  For example, the page with the text for sres254 is https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm, while Senate bill 1616 has several versions, each of which represents a state of the legislation as it goes through consideration.  Its final text can be found at https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm.

Some investigation and experimentation reveals that the URL we seek can be composed like this:

* `https://www.gpo.gov/fdsys/pkg/BILLS-115`
* bill stub (e.g. "sres254" or "s1616")
* code for stage of legislation
  - for bills we find: Introduced = "is", Referred in House = "rfh", Reported = "rs", Placed on Calendar = "pcs", Engrossed = "es", Enrolled = "enr"
  - for resolutions we find: Introduced = "is", Reported = "rs", Agreed to = "ats"
  - more info about these codes and what they mean can be found at https://www.senate.gov/reference/Printedlegislationkey.htm or https://www.gpo.gov/help/index.html#about_congressional_bills.htm
* `/html/BILLS-115`
* bill stub
* code for stage of legislation
* `.htm`
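
The pieces listed above can be glued together in a few lines. This is only a sketch: `build_gpo_url` and `GPO_BASE` are hypothetical names that do not appear elsewhere in this notebook, and the pattern is inferred from the example URLs rather than from any GPO documentation.

```python
# Hypothetical helper, not part of the original notebook.
GPO_BASE = "https://www.gpo.gov/fdsys/pkg/BILLS-"

def build_gpo_url(stub, stage, congress=115):
    """Compose the GPO URL for a bill stub (e.g. 's1616') and a stage code (e.g. 'enr')."""
    # The same "congress + stub + stage" package name appears twice in the URL.
    package = "{0}{1}{2}".format(congress, stub, stage)
    return "{0}{1}/html/BILLS-{1}.htm".format(GPO_BASE, package)

print(build_gpo_url("sres254", "ats"))
print(build_gpo_url("s1616", "enr"))
```

The two printed URLs should match the sres254 and s1616 examples given earlier.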

When we look at the HTML of a given bill or resolution, we also discover that it includes a non-trivial amount of metadata, such as the legislative sponsors, which we'll want to discard for text analysis (we only want to analyze the contents of the legislation itself).  For example, resolutions have a long separator line followed by the word "RESOLUTION".  Everything above that (sponsor names and other metadata) is of no interest to us.  Bills are a little different: they also have a separator line, but it is followed by "A BILL" or "AN ACT".
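
To make the separator idea concrete, here is a small sketch run on a made-up page. The sample text is invented to resemble the layout described above and is not real legislative text, and `strip_front_matter` is a hypothetical name. Unlike the scraping code later in this notebook, this variant also drops the heading word itself.

```python
import re

# A run of 20 or more underscores, then the heading that opens the body.
BODY_START = re.compile(r"_{20,}\s*(?:AN ACT|RESOLUTION|A BILL)")

def strip_front_matter(page_text):
    """Return the text after the separator heading, or '' if no separator is found."""
    parts = BODY_START.split(page_text, maxsplit=1)
    return parts[1].strip() if len(parts) == 2 else ""

# Invented sample shaped like a GPO page (assumption, not scraped data).
sample = (
    "115th CONGRESS\n"
    "S. RES. 000\n"
    "Mr. Example submitted the following resolution\n"
    + "_" * 40 + "\n\n"
    "                               RESOLUTION\n"
    "Honoring an example.\n"
    "Resolved, That the Senate honors examples.\n"
)
print(strip_front_matter(sample))
```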

So, we have several challenges:

* identify the legislative stage of the bill or resolution so we can find the most up-to-date text
* construct the URL
* obtain, parse, and scrape the html
* from the text scraped, obtain just the bill or resolution text (as opposed to the list of sponsors, for example)

Let's set up a few variables with data we'll need.

In [49]:
most_central = ['sres254', 'sres292', 'sres193', 's1616', 'sres184', 's1182', 'sres6', 'sres173', 's1598', 's722']
least_central = ['sres4', 'sconres1', 'sres16', 'sres7', 's1848', 's371', 'sres210', 's1631', 'sres62', 's1662']
bill_status = ['enr', 'rfh', 'es', 'pcs', 'rs', 'is'] # latest to earliest stages (approximate)
resolution_status = ['ats', 'rs', 'is']  # latest to earliest stages

Ideally we'd like to set up a system that would construct these URLs automatically.  For now, we obtained them manually from the GPO website.
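
One way such a system might work is to try the stage codes from latest to earliest and keep the first URL that exists. The sketch below is an assumption about how this could be done, not something this notebook implements: `latest_version_url` and `build_gpo_url` are hypothetical names, and the existence check is passed in as a function so the logic can be tried without touching the network.

```python
def build_gpo_url(stub, stage, congress=115):
    """Compose a GPO URL from a bill stub and a stage code (pattern inferred from examples)."""
    package = "{0}{1}{2}".format(congress, stub, stage)
    return "https://www.gpo.gov/fdsys/pkg/BILLS-{0}/html/BILLS-{0}.htm".format(package)

def latest_version_url(stub, stages, url_exists):
    """Return the URL for the first stage (ordered latest first) whose page exists, else None."""
    for stage in stages:
        url = build_gpo_url(stub, stage)
        if url_exists(url):
            return url
    return None

# Stand-in for a real HTTP check: pretend that only these two pages exist.
known_pages = {build_gpo_url("s1616", "is"), build_gpo_url("s1616", "enr")}
print(latest_version_url("s1616", ["enr", "es", "rs", "is"], known_pages.__contains__))
```

In practice `url_exists` could wrap something like `requests.head(url).status_code == 200`, on the untested assumption that the GPO site answers with a non-200 status for versions that do not exist.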

In [50]:
more_central_legislation = ["https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres292ats/html/BILLS-115sres292ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres193ats/html/BILLS-115sres193ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres184ats/html/BILLS-115sres184ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1182es/html/BILLS-115s1182es.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres6rs/html/BILLS-115sres6rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres173ats/html/BILLS-115sres173ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1598rs/html/BILLS-115s1598rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s722es/html/BILLS-115s722es.htm"]

less_central_legislation = ["https://www.gpo.gov/fdsys/pkg/BILLS-115sres4is/html/BILLS-115sres4is.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres1enr/html/BILLS-115sconres1enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres16ats/html/BILLS-115sres16ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres7ats/html/BILLS-115sres7ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1848pcs/html/BILLS-115s1848pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s371enr/html/BILLS-115s371enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres210ats/html/BILLS-115sres210ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1631rs/html/BILLS-115s1631rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres62pcs/html/BILLS-115sres62pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1662pcs/html/BILLS-115s1662pcs.htm"]

Obtain the text of each piece of legislation.

In [51]:
from bs4 import BeautifulSoup
import requests
import re

def create_corpus(urllist):
    """Download each GPO page and return a list of cleaned legislation texts."""
    legislation_text_list = []
    for url in urllist:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # The page text sits in a <pre> element; take its direct text nodes
        s = soup.find('pre').find_all(string=True, recursive=False)
        # str() of the node list shows newlines as the two characters "\n"; remove them
        clean = str(s).replace('\\n','')
        # Split at the long underscore separator plus the heading that opens the body
        split = re.split(r"_{20,}\s*(AN ACT|RESOLUTION|A BILL)", clean)
        # split[0] is the front matter (sponsors etc.); keep the heading and the body
        legislation = " ".join(split[1:])
        # Replace punctuation with spaces to keep "keep.this" from becoming "keepthis"
        for char in '.-`;]':
            legislation = legislation.replace(char, ' ')
        legislation_text_list.append(legislation)
    return legislation_text_list

more_central_legislation_text = create_corpus(more_central_legislation)
less_central_legislation_text = create_corpus(less_central_legislation)
legislation = more_central_legislation_text + less_central_legislation_text
legislation_category = ["more" for x in more_central_legislation_text] + ["less" for x in less_central_legislation_text]

## Comparing Corpora: Most vs. Least Central
Now that we have two sets of data, the most and the least central legislation in the 115th Senate, we can check whether there are any detectable linguistic differences between their texts. We're looking, specifically, for distinctive vocabulary. We can do this with TF-IDF (term frequency - inverse document frequency) analysis using scikit-learn and NLTK.

First, we'll check out the overall TF-IDF by category, just to get a feel for the overall tone of the more vs. less central legislation, then we'll do individual TF-IDF by piece of legislation to build a model for machine learning, to see if we can classify text reliably.
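
As a reminder of what TF-IDF measures, here is a small pure-Python sketch on two made-up token lists. It uses raw term counts, a smoothed inverse document frequency, and L2 normalization, which to our understanding matches the default weighting of scikit-learn's `TfidfVectorizer`. The function name `tfidf` and the toy documents are assumptions for illustration only.

```python
import math
from collections import Counter

def tfidf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        raw = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize so that document length does not dominate the weights
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({t: w / norm for t, w in raw.items()})
    return weighted

# Toy documents (invented): "committee" occurs in both, so it is less distinctive.
docs = [
    "sanctions russian federation sanctions committee".split(),
    "expenses funds committee expenses".split(),
]
for weights in tfidf(docs):
    print(sorted(weights.items(), key=lambda kv: -kv[1]))
```

Terms that are frequent in one document but absent from the other rise to the top, while shared terms such as "committee" sink to the bottom.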

In [52]:
from string import punctuation
import string
from nltk.corpus import stopwords
from nltk import word_tokenize

# Add some known stop words as well as punctuation
stop_words = stopwords.words('english') + list(punctuation) + \
            list(['`', 'united', 'states', 'resolution', 'act', 'bill', 'shall', 'section', 'subsection',
                 'sec', 'paragraph', 'subparagraph'])
def tokenize(text):
    words = word_tokenize(text)
    words = [w.lower() for w in words]
    return [w for w in words if w not in stop_words and not w.isdigit()]

In [53]:
vocabulary = set()
for x in legislation:
    vocabulary.update(tokenize(x))
vocabulary = list(vocabulary)

In [60]:
from sklearn.feature_extraction.text import TfidfVectorizer
 
tfidf_category_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_category_matrix = tfidf_category_model.fit_transform([' '.join(more_central_legislation_text),
                                         ' '.join(less_central_legislation_text)])
feature_names = tfidf_category_model.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

## Convert from Sparse Matrix to Understandable List

In [61]:
dense = tfidf_category_matrix.todense()
most = dense[0].tolist()[0]
least = dense[1].tolist()[0]

# Pair each score with its column index (the word id) so we can sort and look up terms
most_phrase_scores = list(enumerate(most))
least_phrase_scores = list(enumerate(least))

## Sort by Highest TF-IDF and display 

What are the 20 or so most distinctive terms in the more central legislation?  To find out, we use the TF-IDF scores computed on the *combination* of all the more central legislation.

In [62]:
sorted_most_phrase_scores = sorted(most_phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_most_phrase_scores
                      if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

sanctions            0.198727800339
russian              0.177296370891
federation           0.173881047315
secretary            0.173399747355
president            0.163658188515
person               0.157813253211
respect              0.139304291414
educational          0.137355979646
described            0.136381823762
program              0.135407667878
title                0.134433511994
assistance           0.116898706082
may                  0.114950394314
report               0.110079614894
ukraine              0.106793084178
foreign              0.10423467959
date                 0.103260523706
inserting            0.0954672766335
committee            0.0915706530975
appropriate          0.0896223413294


And in the least central legislation?

In [63]:
sorted_least_phrase_scores = sorted(least_phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_least_phrase_scores
    if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

expenses             0.191430546723
department           0.187872358122
committee            0.186449082682
available            0.172927965999
may                  0.170793052838
provided             0.170793052838
authorized           0.164388313357
exceed               0.155848660715
public               0.13236461595
including            0.128094789629
law                  0.127383151908
secretary            0.115285310666
services             0.112438759785
expended             0.110303846625
title                0.108880571184
funds                0.103187469423
period               0.098917643102
government           0.0932245413408
general              0.0910896281803
programs             0.0889547150198


Interestingly, the "more central" legislation has terms that refer to foreign countries and foreign affairs, like "sanctions", "russian", "federation", "ukraine", and "foreign", while the "less central" legislation has terms that refer to money, like "expenses", "exceed", "expended", and "funds".  Will differences like these be enough to distinguish them?  Let's set up some TF-IDF analysis and machine learning where we use each document as a data point (instead of just sticking them together for a quick overall look).

In [65]:
tfidf_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_matrix = tfidf_model.fit_transform(legislation)
feature_names = tfidf_model.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0

In [66]:
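from sklearn.naive_bayes import MultinomialNB

# Train a multinomial naive Bayes classifier on the per-document TF-IDF vectors
clf = MultinomialNB().fit(tfidf_matrix, legislation_category)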

In [68]:
predicted = clf.predict(tfidf_model.transform(legislation))

In [74]:
for i in range(0,len(predicted)):
    print("Actual centrality: " + legislation_category[i] + ",\tPredicted centrality: " + predicted[i])

Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: less
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
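
A listing like the one above can be condensed into an accuracy figure and a count of each (actual, predicted) pair. The sketch below is illustrative: `summarize` is a hypothetical helper, and the label lists are typed in by hand to mirror the shape of the output (one "more" document predicted as "less") rather than taken from `clf`.

```python
from collections import Counter

def summarize(actual, predicted):
    """Return (accuracy, Counter of (actual, predicted) label pairs)."""
    pairs = Counter(zip(actual, predicted))
    correct = sum(n for (a, p), n in pairs.items() if a == p)
    return correct / len(actual), pairs

# Hand-typed labels mirroring the listing: the fourth "more" document is predicted "less".
actual = ["more"] * 10 + ["less"] * 10
predicted = ["more"] * 3 + ["less"] + ["more"] * 6 + ["less"] * 10

accuracy, confusion = summarize(actual, predicted)
print(accuracy)
print(confusion)
```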

Not too shabby!  We only got one wrong out of 20, which is 95% accuracy.  Still, this figure is almost certainly inflated by overfitting, because we predicted the very documents the model was trained on.  Let's try predicting a few pieces of legislation that weren't used to create the model.  We'll pick a couple that are nearly as extreme in their centrality as the ones used to generate the model.

For example, let's check out 