### Original

The data are at: opt/data/bills-out.csv

We will train 3 classifiers, likely SVM as number of categories is a lot:

1. predict 'Minor': column labeled 'minor'
2. predict 'Major': column labeled 'major'
3. predict only categories where topic_code.csv (also in opt/data) says include ==1 (I will update this soon.)


How?
1. Tokenize bill text
2. Break into 5000 word chunks per bill
3. Use SVM to predict --- cross-validate to tune
4. Show classification success and print out top coefficients for each category so that we can verify that the model makes sense
5. Predict congressional speech --- it is in opt/cong (or you can download via capitolwords API based on the script you wrote, we want to predict for each congress.)
   Predict each category. If you have 9 categories, you will get 9 columns. 
   
### Recap

1. Let us again remove Roberts rules bigrams/trigrams again
2. With just 10k tokens, I think we are selling ourselves short. Let us remove only bigrams/trigrams that are in 20 documents or less.
3. Let us try GradientBoostedClassifier also
4. Let us not chunk in 5k as bills seem to be ok length except for 5 or so bills which have less than 1000 characters

What to predict?
1. Major topics
2. Minor topics: but take out all with the label 'other' or 'general'. To find out the topic codes, see topic_code.csv. We can also take out minor topics that don't have more than 10 bills.
5. The third model will only include minor topics which in topic_code.csv column topic_1 ==1 And again we want minor topics that don't have more than 10 bills

Next steps after the models:
1. Output top 20 bigrams/trigrams for each topic where column_name = topic_label (from topic_code.csv) for each model

In [1]:
import pandas as pd
import numpy as np
import time

## Read Bills dataset (cleaned version)

In [2]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', nrows=10)
df.columns

Index([u'uid', u'Major', u'Minor', u'clean_text'], dtype='object')

In [3]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', usecols=['Major', 'Minor', 'clean_text'])
df

Unnamed: 0,Major,Minor,clean_text
0,20,2012,congression bill 103th congress us govern prin...
1,3,300,congression bill 103th congress us govern prin...
2,15,1520,congression bill 103th congress us govern prin...
3,20,2000,congression bill 103th congress us govern prin...
4,15,1522,congression bill 103th congress us govern prin...
5,1,107,congression bill 103th congress us govern prin...
6,14,1401,congression bill 103th congress us govern prin...
7,1,107,congression bill 103th congress us govern prin...
8,14,1406,congression bill 103th congress us govern prin...
9,3,331,congression bill 103th congress us govern prin...


## Split long bill to smaller chunk (5000 words)

In [4]:
import re
import textwrap

def insert_chars_split_marker(text, cc=2500):
    # FIXME: text still has number
    text = re.sub('\d+', '', text)
    out = '|'.join(textwrap.wrap(text, cc))
    return out

def insert_words_split_marker(text, wc=500):
    text = re.sub('\d+', '', text)
    words = text.split()
    out = ''
    for i, w in enumerate(words):
        if i != 0 and i % wc == 0:
            out += '|' + w
        else:
            out += ' ' + w
    return out

In [5]:
if False:
    df['clean_text'] = df['clean_text'].apply(lambda c: insert_words_split_marker(c, 5000))

In [6]:
if False:
    s = df['clean_text'].str.split('|', expand=True).stack()
    i = s.index.get_level_values(0)
    new_df = df.loc[i].copy()
    new_df['chunk'] = s.index.get_level_values(1)
    new_df['clean_text'] = s.values
    df = new_df
    df

## Vectorize

In [7]:
import nltk
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer
import re
import string

stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#def tokenize(text):
#    tokens = nltk.word_tokenize(text)
#    stems = stem_tokens(tokens, stemmer)
#    return stems

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split

In [9]:
with open('../roberts_rules/all_text.txt', 'rt') as f:
    text = f.read()
text = text.decode('ascii', 'ignore')
text = re.sub(r'\d+', '', text)

vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3)) 
vect.fit([text])
roberts_rules = set(vect.get_feature_names())

In [10]:
#vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3), min_df=0.01)
#vect = CountVectorizer(ngram_range=(2, 3), min_df=0.01) 
#vect = CountVectorizer(ngram_range=(2, 3)) 
vect = CountVectorizer(ngram_range=(2, 3), min_df=20) 
vect.fit(df.clean_text)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=1.0, max_features=None, min_df=20,
        ngram_range=(2, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
len(vect.get_feature_names())

247335

In [12]:
vocab = []
i = 0
for a in vect.vocabulary_:
    if a not in roberts_rules:
        vocab.append(a)
    else:
        #print a
        i += 1
print("Removed {0:d}".format(i))
print("Total {0:d}".format(len(vocab)))

Removed 2653
Total 244682


In [13]:
ng_df = pd.DataFrame(vocab)
ng_df.columns = ['ngram']
ng_df

Unnamed: 0,ngram
0,agenc sponsor
1,payment method
2,shall determin take
3,shall submit year
4,period commiss
5,improv awar
6,gener rulesubject
7,amount 10 percent
8,amount specifi paragraph
9,day includ


In [14]:
#ng_df.to_csv('../data/bills-23gram.csv', index=False)

In [15]:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier,  n=10):
    labelid = 3 # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " | ".join(feature_names[j] for j in top10)))

def get_top_features(vectorizer, clf, class_labels, n=20):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    top_features = {}
    for i, class_label in enumerate(class_labels):
        topN = np.argsort(clf.coef_[i])[-n:]
        top_features[class_label] = [feature_names[j] for j in topN][::-1]
    return top_features

## Model (Major)

In [16]:
df.groupby('Major').agg({'Major': 'count'})

Unnamed: 0_level_0,Major
Major,Unnamed: 1_level_1
1,1021
2,539
3,2869
4,558
5,1429
6,1153
7,1228
8,979
10,973
12,1570


In [17]:
X = df.clean_text
y = df.Major

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [18]:
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20, vocabulary=vocab)
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000, vocabulary=vocab)

In [19]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [20]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [21]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

Results for LinearSVC()
Training time: 39.974688s; Prediction time: 0.319057s
             precision    recall  f1-score   support

          1       0.63      0.67      0.65       204
          2       0.76      0.50      0.60       108
          3       0.85      0.91      0.88       574
          4       0.88      0.76      0.81       112
          5       0.77      0.77      0.77       286
          6       0.88      0.83      0.85       231
          7       0.81      0.77      0.79       246
          8       0.84      0.85      0.84       196
         10       0.82      0.85      0.83       195
         12       0.76      0.76      0.76       314
         13       0.79      0.73      0.76       130
         14       0.71      0.69      0.70        96
         15       0.74      0.74      0.74       337
         16       0.73      0.75      0.74       296
         17       0.71      0.70      0.71       103
         18       0.94      0.94      0.94       479
         19       0.

In [22]:
most_informative_feature_for_class(vect, classifier_liblinear, 99)
#most_informative_feature_for_class_svm(vect, classifier_linear)

99 committe judiciari 0.798276270577
99 provid relief 0.804124337037
99 refer committe judiciari 0.864308227234
99 author request 0.916199746916
99 emerg relief 0.952969615349
99 mean bill provid 1.0280892288
99 honolulu hawaii 1.12010521615
99 act relief 1.34671244907
99 judiciari bill relief 2.50058938693
99 bill relief 3.76298655211


In [23]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

1: busi interest | meal entertain | penalti relief | budget year | deficit reduct | mileag rate | capit gain | gift tax | tax system | public debt
2: religi freedom | civil right | privaci act | grammleachbliley act | drug test | vote right | person inform | sexual orient | genet inform | perform abort
3: health insur | medic malpractic | titl xviii social | medicar program | public health | amend titl xviii | titl xviii | health care | medic care | prescript drug
4: incom averag | farm incom | agricultur product | food safeti | genet engin | amend agricultur | anim drug | refer committe agricultur | committe agricultur | secretari agricultur
5: labor organ | individu retir | illeg alien | amend immigr nation | nation act | child care | immigr nation act | amend immigr | immigr nation | unemploy compens
6: depart educ | qualifi tuition | public school | educ loan | educ act | elementari secondari | higher educ | educ assist | secretari educ | student loan
7: hazard wast | hazard materi

In [24]:
topics_df = pd.read_csv('../data/topic_code.csv')

In [25]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
#top_features_df.to_csv('../data/major_bills_top20.csv', index=False)
top_features_df

Unnamed: 0,Macroeconomics,"Civil Rights, Minority Issues, and Civil Liberties",Health,Agriculture,"Labor, Employment, and Immigration",Education,Environment,Energy,Transportation,"Law, Crime, and Family Issues",Social Welfare,Community Development and Housing Issues,"Banking, Finance, and Domestic Commerce",Defense,"Space, Science, Technology and Communications",Foreign Trade,International Affairs and Foreign Aid,Government Operations,Public Lands and Water Management,"Other, Miscellaneous, and Human Interest"
0,public debt,perform abort,prescript drug,secretari agricultur,unemploy compens,student loan,solid wast,crude oil,secretari transport,foster care,social secur,hous act,small busi,arm forc,motion pictur,trade agreement,unit nation,district columbia,indian tribe,bill relief
1,tax system,genet inform,medic care,committe agricultur,immigr nation,secretari educ,environment protect,oil ga,air transport,committe judiciari,social servic,hous credit,15 usc,depart defens,commun act,exportimport bank,22 usc,joint committe,resourc bill,judiciari bill relief
2,gift tax,sexual orient,health care,refer committe agricultur,amend immigr,educ assist,administr shall,secretari energi,transport infrastructur,refer committe judiciari,food stamp,empower zone,antitrust law,homeland secur,commun commiss,suspens duti,human right,public build,american samoa,act relief
3,capit gain,person inform,titl xviii,anim drug,immigr nation act,higher educ,drink water,depart energi,air carrier,feder prison,nutrit act,nation hous act,secur exchang,nation guard,commerci space,hong kong,refer committe foreign,administr bill,25 usc,honolulu hawaii
4,mileag rate,vote right,amend titl xviii,amend agricultur,child care,elementari secondari,migratori bird,energi polici,highway vehicl,death penalti,older individu,rural develop,insur compani,arm servic bill,feder commun commiss,trade act 1974,peac corp,feder employe,indian reserv,mean bill provid
5,deficit reduct,drug test,public health,genet engin,nation act,educ act,protect agenc,electr energi,feder aviat,child support,home energi,invest save,disast relief,disabl veteran,feder commun,financ bill provid,develop countri,refer committe hous,refer committe resourc,emerg relief
6,budget year,grammleachbliley act,medicar program,food safeti,amend immigr nation,educ loan,endang speci,natur ga,rail passeng,firearm ammunit,boy scout,princip resid,flood insur,refer committe arm,commun act 1934,harmon tariff,secretari state,feder elect,committe resourc,author request
7,penalti relief,privaci act,titl xviii social,agricultur product,illeg alien,public school,environment protect agenc,energi properti,highway trust,adopt expens,individu disabl,renew commun,profession sport,air forc,47 usc,harmon tariff schedul,panama canal,polit parti,nation histor,refer committe judiciari
8,meal entertain,civil right,medic malpractic,farm incom,individu retir,qualifi tuition,hazard materi,energi effici,public transport,money launder,42 usc,mortgag bond,product safeti,closur realign,news inform,trade deficit,relat bill,independ counsel,indian tribal,provid relief
9,busi interest,religi freedom,health insur,incom averag,labor organ,depart educ,hazard wast,clean energi,transport system,domest violenc,deliveri mail,commun develop,credit card,secretari navi,satellit servic,19 usc,depart state,postal servic,puerto rico,committe judiciari


In [26]:
from sklearn.externals import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
#joblib.dump(classifier_liblinear, "../models/major_bills_clf_liblinear.joblib")

## Try SGDClassifier

In [27]:
if False:
    from sklearn.linear_model import SGDClassifier

    elastic_clf = SGDClassifier(loss='log', alpha=.00002, n_iter=200, penalty="elasticnet")
    t0 = time.time()
    elastic_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_elastic = elastic_clf.predict(X_test)
    t2 = time.time()
    time_elastic_train = t1-t0
    time_elastic_predict = t2-t1

    print("Results for Elastic Net")
    print("Training time: %fs; Prediction time: %fs" % (time_elastic_train, time_elastic_predict))
    print(classification_report(y_test, prediction_elastic))

In [28]:
if False:
    most_informative_feature_for_class(vect, elastic_clf, 99)

In [29]:
if False:
    print_top10(vect,  elastic_clf,  elastic_clf.classes_)

## Minor Model

#### Was removed!!!

## Model (topic_1 == 1)

In [30]:
selected_topics = topics_df[topics_df.topic_1 == 1].code.unique()

In [31]:
selected_topics 

array([ 101,  103,  105,  107,  201,  202,  206,  207,  209,  301,  302,
        323,  324,  331,  332,  333,  335,  403,  404,  501,  502,  503,
        504,  505,  508,  529,  601,  603,  607,  609,  701,  703,  704,
        705,  709,  710,  801,  802,  803,  806,  807,  900, 1002, 1003,
       1005, 1006, 1007, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209,
       1301, 1302, 1303, 1304, 1401, 1501, 1502, 1504, 1507, 1521, 1522,
       1523, 1525, 1602, 1603, 1605, 1609, 1610, 1612, 1615, 1701, 1706,
       1707, 1709, 1802, 1807, 1901, 1915, 1925, 1926, 1927, 2002, 2003,
       2006, 2010, 2011, 2012, 2013, 2101, 2102])

In [32]:
X = df[df.Minor.isin(selected_topics)].clean_text
y = df[df.Minor.isin(selected_topics)].Minor

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

ValueError: The least populated class in y has only 1 member, which is too few. The minimum number of labels for any class cannot be less than 2.

In [None]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [None]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [None]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

In [None]:
most_informative_feature_for_class(vect, classifier_liblinear, 501)

In [None]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

In [None]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
#top_features_df.to_csv('../data/topic_1_bills_top20.csv', index=False)
top_features_df

In [None]:
from sklearn.externals import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
#joblib.dump(classifier_liblinear, "../models/topic_1_bills_clf_liblinear.joblib")

## Try LogisticRegression

In [None]:
if True:
    from sklearn.linear_model import LogisticRegression

    # Perform classification with LogisticRegression
    logreg_clf = LogisticRegression()
    t0 = time.time()
    logreg_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_logreg = logreg_clf.predict(X_test)
    t2 = time.time()
    time_logreg_train = t1-t0
    time_logreg_predict = t2-t1

    print("Results for LogisticRegression()")
    print("Training time: %fs; Prediction time: %fs" % (time_logreg_train, time_logreg_predict))
    print(classification_report(y_test, prediction_logreg))
    most_informative_feature_for_class(vect, logreg_clf, 501)
    print_top10(vect, logreg_clf, logreg_clf.classes_)