# Bill Analysis

This script analyzes the texts of the most and least central bills, and does the same for bills with strong Democratic support and bills with strong Republican support.

We get the lists of the most and least central bills, as well as the most strongly partisan bills, from the Jupyter notebook that ranks senators and bills.

## Most / Least Central Bills

The 20 most central bills from our ranking script, of which we'll use 15 for training and 5 for testing, are sres184, sres292, s1616, sres193, sres254, s1182, sres6, sres173, s1598, s722, sres328, sjres49, sres69, sres211, s274, sres176, s1595, s204, s1094, and s324.

The 20 least central bills include s133, s99, s1866, s1609, sres1, sconres24, sres16, sres2, sres57, sres7, sres4, sconres1, sconres2, sres3, s371, s1848, s1631, sres210, sres62, and s1662.


## Strongly Partisan Bills

The 20 most strongly Republican bills include sres1, s723, s724, s590, s1866, s218, s1894, s34, s644, s466, sconres25, s21, s215, s189, s1116, sres298, sres306, s585, sres297, and sres133.

The 20 most strongly Democratic bills include s729, s617, sres147, sres318, s513, s502, s675, sres275, sjres22, s508, sconres14, s734, s607, s730, sres187, s861, sconres23, s1395, s55, and s225.


## Getting Bill Texts

Getting the full texts of bills can prove challenging. The ProPublica API does not supply full texts for bills, but we can obtain them by screen-scraping the Government Publishing Office (GPO) website. For example, the page with the text for sres254 is https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm, while Senate bill 1616 has several versions, each of which represents a state of the legislation as it goes through consideration. Its final text can be found at https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm .

Some investigation and experimentation reveals that the URL we seek can be composed by concatenating the following pieces, in order (115 is the number of the Congress):

* `https://www.gpo.gov/fdsys/pkg/BILLS-115`
* bill stub (e.g. "sres254" or "s1616")
* code for stage of legislation
 -  for bills we find: Introduced = "is", Referred in House = "rfh", Reported = "rs", Placed on Calendar = "pcs", Engrossed = "es", Enrolled = "enr"
 -  for resolutions we find: Introduced = "is", Reported = "rs", Agreed to = "ats"
 -  more info about these codes and what they mean can be found at https://www.senate.gov/reference/Printedlegislationkey.htm or https://www.gpo.gov/help/index.html#about_congressional_bills.htm
* `/html/BILLS-115`
* bill stub
* code for stage of legislation
* `.htm`
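
The composition above can be sketched as a small helper. The function name `build_gpo_url` and its `congress` parameter are our own illustrative choices, not part of any GPO API; the sketch simply concatenates the pieces in the order listed.

```python
def build_gpo_url(stub, stage, congress=115):
    """Compose the GPO URL for one version of a bill or resolution.

    stub  -- bill stub such as "s1616" or "sres254"
    stage -- stage code such as "enr" or "ats"
    """
    # The same "package" name appears twice in the URL.
    package = "BILLS-{0}{1}{2}".format(congress, stub, stage)
    return "https://www.gpo.gov/fdsys/pkg/{0}/html/{0}.htm".format(package)


print(build_gpo_url("s1616", "enr"))
print(build_gpo_url("sres254", "ats"))
```

Both calls reproduce the example URLs given earlier for s1616 and sres254.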

When we look at the HTML of a given bill or resolution, we also discover that it includes a non-trivial amount of metadata, such as the legislative sponsors, which we'll want to discard for text analysis (we only want to analyze the contents of the legislation itself). For example, resolutions have a long separator line followed by the word "RESOLUTION". Everything above that (sponsor names and other metadata) is of no interest to us. Bills are a little different: they also have a separator line, but it can be followed by either "A BILL" or "AN ACT".
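
As a minimal sketch of that idea, the snippet below looks for a run of underscores followed by one of the three headings and keeps only what comes after it. The helper name `strip_front_matter`, the threshold of 20 underscores, and the sample text are assumptions made for illustration; the sample is invented and not copied from a GPO page.

```python
import re

# A separator line (20 or more underscores) followed by the heading
# that opens the legislative text.
BODY_START = re.compile(r"_{20,}\s*(?:AN ACT|RESOLUTION|A BILL)")


def strip_front_matter(page_text):
    """Return the text after the separator and heading, or None if absent."""
    match = BODY_START.search(page_text)
    if match is None:
        return None
    return page_text[match.end():].strip()


# Made-up sample shaped roughly like the pages described above.
sample = (
    "S. RES. 000\n"
    "Mr. Example submitted the following resolution\n"
    + "_" * 60 + "\n"
    "\n"
    "                              RESOLUTION\n"
    "Recognizing an illustrative occasion.\n"
)
print(strip_front_matter(sample))
```

Returning `None` when no separator is found makes it easy to spot pages that do not follow the expected layout.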

So, we have several challenges:

* identify the legislative stage of the bill or resolution so we can find the most up-to-date text
* construct the URL
* obtain, parse, and scrape the html
* from the text scraped, obtain just the bill or resolution text (as opposed to the list of sponsors, for example)

Let's set up a few variables with data we'll need, if we want to extract these bills in an automated way.

In [1]:
most_central = ['sres184', 'sres292', 's1616', 'sres193', 'sres254', 's1182', 'sres6', 'sres173', 's1598', 's722', \
                'sres328', 'sjres49', 'sres69', 'sres211', 's274', 'sres176', 's1595', 's204', 's1094', 's324']
least_central = ['s133', 's99', 's1866', 's1609', 'sres1', 'sconres24', 'sres16', 'sres2', 'sres57', 'sres7', \
                 'sres4', 'sconres1', 'sconres2', 'sres3', 's371', 's1848', 's1631', 'sres210', 'sres62',  's1662']

rep = ['sres1', 's723', 's724', 's590', 's1866', 's218', 's1894', 's34', 's644', 's466', 'sconres25', 's21', 's215',\
       's189', 's1116', 'sres298', 'sres306', 's585', 'sres297', 'sres133']
dem = ['s729', 's617', 'sres147', 'sres318', 's513', 's502', 's675', 'sres275', 'sjres22', 's508', 'sconres14', \
       's734', 's607', 's730', 'sres187', 's861', 'sconres23', 's1395', 's55',  's225']

bill_status = ['enr', 'es', 'rs', 'rfh', 'is'] # latest to earliest stages
resolution_status = ['ats', 'rs', 'is']  # latest to earliest stages

Ideally we'd like to set up a system that would construct these URLs. For now we just manually obtained them from the GPO website by choosing the most recent version of each piece of legislation.
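
One way such a system might work is to try the stage codes from latest to earliest and keep the first URL that exists. The sketch below is only an outline under that assumption: `latest_version_url` and `url_exists` are hypothetical names, and the existence check is passed in as a function so the example runs offline. Note that the stage lists in In [1] omit "pcs", which several of the manually collected URLs use, so they would need extending.

```python
def latest_version_url(stub, stages, url_exists):
    """Return the URL of the first stage in `stages` that exists, else None.

    stages     -- stage codes ordered from latest to earliest
    url_exists -- callable taking a URL and returning True or False
    """
    for stage in stages:
        package = "BILLS-115{0}{1}".format(stub, stage)
        url = "https://www.gpo.gov/fdsys/pkg/{0}/html/{0}.htm".format(package)
        if url_exists(url):
            return url
    return None


# Stand-in for a network check. A real check might look like
# lambda url: requests.head(url).status_code == 200
# (assuming the site answers missing pages with a non-200 status).
known_pages = {
    "https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm",
    "https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm",
}
fake_exists = known_pages.__contains__

print(latest_version_url("s1616", ["enr", "es", "rs", "rfh", "is"], fake_exists))
print(latest_version_url("sres254", ["ats", "rs", "is"], fake_exists))
```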

In [2]:
dem_legislation =[\
"https://www.gpo.gov/fdsys/pkg/BILLS-115s729rs/html/BILLS-115s729rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s617rs/html/BILLS-115s617rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres147ats/html/BILLS-115sres147ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres318ats/html/BILLS-115sres318ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s513rs/html/BILLS-115s513rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s502rs/html/BILLS-115s502rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s675rs/html/BILLS-115s675rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres275ats/html/BILLS-115sres275ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sjres22es/html/BILLS-115sjres22es.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s508rs/html/BILLS-115s508rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres14enr/html/BILLS-115sconres14enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s734rs/html/BILLS-115s734rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s607rs/html/BILLS-115s607rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s730rs/html/BILLS-115s730rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres187ats/html/BILLS-115sres187ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s861pcs/html/BILLS-115s861pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres23enr/html/BILLS-115sconres23enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1395rs/html/BILLS-115s1395rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s55rs/html/BILLS-115s55rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s225rs/html/BILLS-115s225rs.htm"]


rep_legislation = [\
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres1ats/html/BILLS-115sres1ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s723rs/html/BILLS-115s723rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s724rs/html/BILLS-115s724rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s590rs/html/BILLS-115s590rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1866enr/html/BILLS-115s1866enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s218rs/html/BILLS-115s218rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1894pcs/html/BILLS-115s1894pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s34rs/html/BILLS-115s34rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s644rs/html/BILLS-115s644rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s466rs/html/BILLS-115s466rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres25pcs/html/BILLS-115sconres25pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s21rs/html/BILLS-115s21rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s215rs/html/BILLS-115s215rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s189rs/html/BILLS-115s189rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1116is/html/BILLS-115s1116is.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres298ats/html/BILLS-115sres298ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres306ats/html/BILLS-115sres306ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s585enr/html/BILLS-115s585enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres297ats/html/BILLS-115sres297ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres133ats/html/BILLS-115sres133ats.htm"]


more_central_legislation = [\
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres184ats/html/BILLS-115sres184ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres292ats/html/BILLS-115sres292ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1616enr/html/BILLS-115s1616enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres193ats/html/BILLS-115sres193ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres254ats/html/BILLS-115sres254ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1182es/html/BILLS-115s1182es.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres6rs/html/BILLS-115sres6rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres173ats/html/BILLS-115sres173ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1598rs/html/BILLS-115s1598rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s722es/html/BILLS-115s722es.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres328ats/html/BILLS-115sres328ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sjres49enr/html/BILLS-115sjres49enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres69ats/html/BILLS-115sres69ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres211ats/html/BILLS-115sres211ats.htm",      
"https://www.gpo.gov/fdsys/pkg/BILLS-115s274pcs/html/BILLS-115s274pcs.htm",                    
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres176ats/html/BILLS-115sres176ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1595rfh/html/BILLS-115s1595rfh.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s204rfh/html/BILLS-115s204rfh.htm",                  
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1094enr/html/BILLS-115s1094enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s324es/html/BILLS-115s324es.htm"]
                            
                            
less_central_legislation = [\
"https://www.gpo.gov/fdsys/pkg/BILLS-115s133rs/html/BILLS-115s133rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s99rs/html/BILLS-115s99rs.htm",                 
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1866enr/html/BILLS-115s1866enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1609pcs/html/BILLS-115s1609pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres1ats/html/BILLS-115sres1ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres24enr/html/BILLS-115sconres24enr.htm",                          
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres16ats/html/BILLS-115sres16ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres2ats/html/BILLS-115sres2ats.htm",          
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres57ats/html/BILLS-115sres57ats.htm",              
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres7ats/html/BILLS-115sres7ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres4is/html/BILLS-115sres4is.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres1enr/html/BILLS-115sconres1enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sconres2ats/html/BILLS-115sconres2ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres3ats/html/BILLS-115sres3ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s371enr/html/BILLS-115s371enr.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1848pcs/html/BILLS-115s1848pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1631rs/html/BILLS-115s1631rs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres210ats/html/BILLS-115sres210ats.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115sres62pcs/html/BILLS-115sres62pcs.htm",
"https://www.gpo.gov/fdsys/pkg/BILLS-115s1662pcs/html/BILLS-115s1662pcs.htm"]

Obtain the texts of the various pieces of legislation:

In [3]:
from bs4 import BeautifulSoup
import requests
import re

def create_corpus(urllist):
    """Download each URL and return a list of cleaned legislation texts."""
    legislation_text_list = []
    for url in urllist:
        r = requests.get(url)
        soup = BeautifulSoup(r.text, 'html.parser')
        # Assumes the page text sits in the first pre element; keep its direct text nodes
        s = soup.find('pre').find_all(text=True, recursive=False)
        # str() of the list shows newlines as a literal backslash-n, which we strip
        clean = str(s).replace('\\n','')
        # Split on the separator line plus heading; everything before it is metadata
        split = re.split(r"_{20,}\s*(AN ACT|RESOLUTION|A BILL)", clean)
        legislation = " ".join(split[1:])
        # Replace punctuation with spaces to keep.this from becoming keepthis
        for char in '.-`;]':
            legislation = legislation.replace(char, ' ')
        legislation_text_list.append(legislation)
    return legislation_text_list

more_central_legislation_text = create_corpus(more_central_legislation)
less_central_legislation_text = create_corpus(less_central_legislation)
rep_legislation_text = create_corpus(rep_legislation)
dem_legislation_text = create_corpus(dem_legislation)

We'll split our data into training and test data for machine learning.

In [4]:
from sklearn.model_selection import train_test_split
more_train, more_test, less_train, less_test = train_test_split(more_central_legislation_text, 
                                                                less_central_legislation_text, 
                                                                test_size=0.25, random_state=42)

legislation_train = more_train + less_train
legislation_test = more_test + less_test
legislation_category_train = ["more" for x in more_train] + ["less" for x in less_train]
legislation_category_test = ["more" for x in more_test] + ["less" for x in less_test]


dem_train, dem_test, rep_train, rep_test = train_test_split(dem_legislation_text, 
                                                                rep_legislation_text, 
                                                                test_size=0.25, random_state=42)

party_train = dem_train + rep_train
party_test = dem_test + rep_test
party_category_train = ["dem" for x in dem_train] + ["rep" for x in rep_train]
party_category_test = ["dem" for x in dem_test] + ["rep" for x in rep_test]
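
Passing two equal-length lists to `train_test_split` applies one shared shuffle to both and returns train and test parts for each list in turn, which is why the unpacking order above is `more_train, more_test, less_train, less_test`. The standard-library sketch below illustrates that behaviour. `paired_split` is a hypothetical helper and will not reproduce scikit-learn's exact shuffle; it only mirrors the shape of the result.

```python
import random


def paired_split(first, second, test_fraction=0.25, seed=42):
    """Split two equal-length lists using one shared shuffled index order."""
    assert len(first) == len(second)
    order = list(range(len(first)))
    random.Random(seed).shuffle(order)
    n_test = int(round(len(order) * test_fraction))
    test_idx, train_idx = order[:n_test], order[n_test:]

    def pick(items, idx):
        return [items[i] for i in idx]

    # Same return order as train_test_split: train/test for each input list.
    return (pick(first, train_idx), pick(first, test_idx),
            pick(second, train_idx), pick(second, test_idx))


a = ["more{0}".format(i) for i in range(20)]
b = ["less{0}".format(i) for i in range(20)]
a_train, a_test, b_train, b_test = paired_split(a, b)
print(len(a_train), len(a_test), len(b_train), len(b_test))
```

With 20 items per list and a test fraction of 0.25, each list yields 15 training and 5 test items.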

## Comparing and Classifying Corpora: Most vs. Least Central
Now that we have access to two sets of data -- most and least central legislation in the 115th Senate -- we can check whether there are any detectable linguistic differences in the text. We're looking, specifically, for distinctive vocabulary. We can do this with TF-IDF (term frequency - inverse document frequency) analysis, using scikit-learn and NLTK.

First, we'll check out the overall TF-IDF by category, just to get a feel for the overall tone of the more vs. less central legislation, then we'll do individual TF-IDF by piece of legislation to build a model for machine learning, to see if we can classify text reliably.
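
To make the scores concrete, here is a small standard-library sketch of one common TF-IDF variant: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and scaling each document vector to unit length. We believe this matches the default settings of `TfidfVectorizer`, but treat that as an assumption; the function name `tfidf` and the toy token lists are invented for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        raw = {t: c * (math.log((1.0 + n) / (1 + df[t])) + 1)
               for t, c in counts.items()}
        # Scale to unit Euclidean length (guarding against empty documents).
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


# Toy token lists, invented for illustration.
docs = [
    ["sanctions", "sanctions", "committee", "committee"],
    ["funds", "funds", "committee", "committee"],
]
for vec in tfidf(docs):
    print(sorted(vec.items(), key=lambda kv: -kv[1]))
```

Under this formula, a corpus of only two documents gives idf just two possible values: 1 for a term found in both documents and ln(3/2) + 1, about 1.405, for a term found in only one. The ranking is therefore driven mostly by term frequency, and a frequent term can score highly in both categories.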

In [5]:
from string import punctuation
from nltk.corpus import stopwords
from nltk import word_tokenize

# Add some known stop words as well as punctuation
stop_words = stopwords.words('english') + list(punctuation) + \
            ['`', 'united', 'states', 'resolution', 'act', 'bill', 'shall', 'section', 'subsection',
             'sec', 'paragraph', 'subparagraph']
def tokenize(text):
    words = word_tokenize(text)
    words = [w.lower() for w in words]
    return [w for w in words if w not in stop_words and not w.isdigit()]

In [6]:
vocabulary = set()
for x in legislation_train:
    vocabulary.update(tokenize(x))
vocabulary = list(vocabulary)

In [7]:
from sklearn.feature_extraction.text import TfidfVectorizer
 
tfidf_category_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_category_matrix = tfidf_category_model.fit_transform([' '.join(more_train),
                                         ' '.join(less_train)])
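# Note: scikit-learn 1.0 renamed this method to get_feature_names_out(); the old name was removed in 1.2
feature_names = tfidf_category_model.get_feature_names()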

## Convert from Sparse Matrix to Understandable List

In [8]:
dense = tfidf_category_matrix.todense()
most = dense[0].tolist()[0]
least = dense[1].tolist()[0]

# Pair each vocabulary index with its TF-IDF score
most_phrase_scores = list(enumerate(most))
least_phrase_scores = list(enumerate(least))

## Sort by Highest TF-IDF and display 

What are the 20 or so most distinctive terms in the more central legislation? To figure this out, we want TF-IDF on the *combination* of all the more central legislation, which is why we joined the training texts of each category into a single document above.

In [9]:
sorted_most_phrase_scores = sorted(most_phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_most_phrase_scores
                      if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

russian              0.280591045849
sanctions            0.235557425096
president            0.222881689216
federation           0.195968349482
person               0.191192349517
foreign              0.167953500405
respect              0.143658339969
committee            0.138376783352
described            0.134151538059
appropriate          0.121475802179
report               0.119363179533
ukraine              0.115799479239
secretary            0.11196900027
may                  0.106687443653
government           0.10563113233
financial            0.10246219836
hizballah            0.102438000866
congressional        0.100349575713
state                0.0961243304203
term                 0.095068019097


And in the least central legislation?

In [10]:
sorted_least_phrase_scores = sorted(least_phrase_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_least_phrase_scores
    if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

available            0.23035534531
expenses             0.227300234629
provided             0.212024681227
department           0.19736014996
may                  0.175974375197
authorized           0.160087799658
committee            0.156421666842
exceed               0.142368157711
including            0.133813847806
funds                0.130147714989
expended             0.128925670717
law                  0.122815449356
public               0.121593405084
title                0.114872161587
secretary            0.108761940226
services             0.103873763137
program              0.102040696729
activities           0.10020763032
necessary            0.0959304753677
year                 0.0940974089594


Interestingly, the "more central" legislation has terms that refer to foreign countries, like "sanctions", "russian", "federation", "ukraine", and "foreign", while the "less central" legislation has terms that refer to money, like "expenses", "exceed", "expended", and "funds". Will differences like these be enough to distinguish them? Let's set up some TF-IDF analysis and machine learning where we use each document as a data point (instead of just sticking them together for a quick overall look).

In [11]:
tfidf_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_matrix = tfidf_model.fit_transform(legislation_train)
feature_names = tfidf_model.get_feature_names() 

In [12]:
from sklearn.naive_bayes import MultinomialNB
clf = MultinomialNB().fit(tfidf_matrix, legislation_category_train)
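
For intuition about what the classifier is doing, here is a rough standard-library sketch of multinomial naive Bayes with add-one smoothing. It works on raw token counts rather than the TF-IDF weights used above, and `train_nb`, `predict_nb`, and the toy documents are all made up for illustration. It is not scikit-learn's implementation.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels, alpha=1.0):
    """Fit log priors and smoothed log likelihoods per class."""
    vocab = {t for doc in docs for t in doc}
    class_counts = Counter(labels)
    term_counts = defaultdict(Counter)
    for doc, label in zip(docs, labels):
        term_counts[label].update(doc)
    model = {}
    for label in class_counts:
        # Smoothed total: observed tokens plus alpha for every vocabulary term.
        total = sum(term_counts[label].values()) + alpha * len(vocab)
        model[label] = {
            "prior": math.log(class_counts[label] / float(len(labels))),
            "likelihood": {t: math.log((term_counts[label][t] + alpha) / total)
                           for t in vocab},
        }
    return model


def predict_nb(model, doc):
    """Return the label with the highest log posterior for one document."""
    def score(label):
        params = model[label]
        # Terms never seen in training are skipped.
        return params["prior"] + sum(params["likelihood"][t]
                                     for t in doc if t in params["likelihood"])
    return max(model, key=score)


train_docs = [["sanctions", "foreign"], ["sanctions", "report"],
              ["funds", "expenses"], ["funds", "report"]]
train_labels = ["more", "more", "less", "less"]
model = train_nb(train_docs, train_labels)
print(predict_nb(model, ["sanctions", "report"]))
print(predict_nb(model, ["expenses", "funds"]))
```

Each class gets a score made of its prior plus the summed log probabilities of the document's terms, so a handful of class-specific words can decide the prediction.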

Let's check our accuracy on our training set!

In [13]:
predicted_train = clf.predict(tfidf_model.transform(legislation_train))

for i in range(0,len(predicted_train)):
    print("Actual centrality: " + legislation_category_train[i] + ",\tPredicted centrality: " + predicted_train[i])

Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: less
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: less
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: less
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: less
Actual centr

Not too shabby! We only got three wrong out of 30, which is 90% accuracy. Still, accuracy measured on the training data itself is almost certainly inflated by overfitting. Let's try predicting a few pieces of legislation that weren't used to create the model.

In [14]:
predicted_test = clf.predict(tfidf_model.transform(legislation_test))

for i in range(0,len(predicted_test)):
    print("Actual centrality: " + legislation_category_test[i] + ",\tPredicted centrality: " + predicted_test[i])

Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: less
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: more,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: less
Actual centrality: less,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: more
Actual centrality: less,	Predicted centrality: less


Ugh, only 6/10 right! That's around what we'd expect from random guessing. It looks like the text of a piece of legislation is not very predictive of whether it will be more or less central, so it does not work well for classification into "more" and "less" central. Could it, however, help classify legislation as "Republican" or "Democratic"?
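
Rather than counting by eye, the tally can be computed. The sketch below transcribes the actual and predicted labels from the test output above and reports overall accuracy plus the share of each class that was predicted correctly; the variable names are our own.

```python
from collections import Counter

# Labels transcribed from the test output above.
actual = ["more"] * 5 + ["less"] * 5
predicted = ["more", "less", "more", "more", "more",
             "less", "more", "more", "more", "less"]

# Count each (actual, predicted) combination.
pairs = Counter(zip(actual, predicted))
correct = sum(n for (a, p), n in pairs.items() if a == p)
print("accuracy:", correct / float(len(actual)))
for label in ("more", "less"):
    print(label, "recall:", pairs[(label, label)] / float(actual.count(label)))
```

This gives 0.6 accuracy overall, with 4 of 5 "more" items but only 2 of 5 "less" items classified correctly; the model predicted "more" for 7 of the 10 test items.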

## Comparing and Classifying Corpora: Republican vs. Democratic
We also have two sets of partisan data -- the most strongly Democratic and most strongly Republican legislation in the 115th Senate -- so we can again check whether there are any detectable linguistic differences in the text. As before, we're looking for distinctive vocabulary, using TF-IDF (term frequency - inverse document frequency) analysis with scikit-learn and NLTK.

First, we'll check out the overall TF-IDF by category, just to get a feel for the overall tone of the partisan legislation, then we'll do individual TF-IDF by piece of legislation to build a model for machine learning, to see if we can classify text reliably.

In [15]:
vocabulary = set()
for x in party_train:
    vocabulary.update(tokenize(x))
vocabulary = list(vocabulary)

tfidf_category_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_category_matrix = tfidf_category_model.fit_transform([' '.join(dem_train),
                                         ' '.join(rep_train)])
feature_names = tfidf_category_model.get_feature_names() 

dense = tfidf_category_matrix.todense()
# Use new names here so the dem and rep bill lists from In [1] are not overwritten
dem_row = dense[0].tolist()[0]
rep_row = dense[1].tolist()[0]

dem_scores = list(enumerate(dem_row))
rep_scores = list(enumerate(rep_row))

Show Democratic distinctive words:

In [16]:
sorted_dem_scores = sorted(dem_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_dem_scores
                      if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

land                 0.321281019517
secretary            0.277799377778
business             0.210161268406
native               0.17634221372
eligible             0.169755361966
federal              0.157017039614
applicant            0.14938471853
long                 0.125618967855
management           0.125613631691
incubator            0.122223860615
date                 0.120782338164
sound                0.118828753376
island               0.118828753376
oregon               0.115951044638
public               0.101457164058
area                 0.09845810994
program              0.0966258705314
may                  0.0917945770048
general              0.0893789302416
businesses           0.0893789302416


Show Republican distinctive words:

In [17]:
sorted_rep_scores = sorted(rep_scores, key=lambda t: t[1] * -1)
for phrase, score in [(feature_names[word_id], score) for (word_id, score) in sorted_rep_scores
                      if len(feature_names[word_id]) >2][:20]:
   print('{0: <20} {1}'.format(phrase, score))

rule                 0.304083428846
indian               0.238307212623
joint                0.235171591404
congress             0.17245916703
house                0.163052303373
rules                0.141024198885
development          0.137967333624
national             0.137967333624
federal              0.128560469968
senate               0.12228922753
report               0.119153606311
motion               0.110175155379
university           0.106611121436
may                  0.103475500218
date                 0.103475500218
described            0.100339878999
programs             0.0972042577803
whereas              0.0972042577803
consideration        0.0925471305182
following            0.0909330153429


Build and test a machine learning model

In [18]:
tfidf_model = TfidfVectorizer(analyzer='word', stop_words=stop_words, tokenizer=tokenize, vocabulary=vocabulary)
 
# Transform a document into TfIdf coordinates
tfidf_matrix = tfidf_model.fit_transform(party_train)
feature_names = tfidf_model.get_feature_names() 

In [19]:
clf = MultinomialNB().fit(tfidf_matrix, party_category_train)

What's our success rate on our training data?

In [20]:
predicted_train = clf.predict(tfidf_model.transform(party_train))

for i in range(0,len(predicted_train)):
    print("Actual party: " + party_category_train[i] + ",\tPredicted party: " + predicted_train[i])

Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: rep


Again, 90% accuracy on training -- but how does it hold up in test?

In [21]:
predicted_test = clf.predict(tfidf_model.transform(party_test))

for i in range(0,len(predicted_test)):
    print("Actual party: " + party_category_test[i] + ",\tPredicted party: " + predicted_test[i])

Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: rep
Actual party: dem,	Predicted party: dem
Actual party: dem,	Predicted party: dem
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: dem
Actual party: rep,	Predicted party: rep
Actual party: rep,	Predicted party: dem
Actual party: rep,	Predicted party: dem


Ugh, not too fantastic: 6/10 again, the same as the "centrality" model. It looks like text classification of Senate bills either needs more texts to train on or is simply an unreliable method for this kind of classification.
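
To put a score of 6 correct out of 10 (which is what both test outputs above show) in context, we can ask how often pure coin-flip guessing would do at least that well. The sketch below computes the binomial tail probability; `prob_at_least` is a hypothetical helper, and the assumption is that each guess is independent with a 50% chance of being right.

```python
from math import factorial


def prob_at_least(k, n, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p)."""
    def choose(total, r):
        return factorial(total) // (factorial(r) * factorial(total - r))
    return sum(choose(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k, n + 1))


print(prob_at_least(6, 10))
```

The result is 386/1024, or roughly 0.38, so a random guesser would reach 6 or more correct out of 10 more than a third of the time. A test set of 10 items is too small to distinguish either model from chance.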