### Latest

There is just one more iteration of models that I would like to try. We will only do major and topic_1 models (no minor topics model, and only SVM with linear kernel) with the following changes:

1. Use bills93-114.csv as it has more data. It will need to be cleaned and tokenized though.

*** use old bills dataset between 103 - 112 ***

2. Support for some topics is very low. For instance, just 2--5 rows for some minor topics. I have revised topic_1 in topic_code.csv to remove most of these topics. So that should improve our classification success a lot.

3. For test data, can we also output:
true class, prob_class_1, prob_class_2 ...., .....prob_class_k

Final steps for maj_topics (similar for topic_1):

1. Read bills93-114.csv
2. Clean and tokenize

*** use old bills dataset between 103 - 112 ***

3. Split long bills into 2.5k chunks. 5k may be still too long. (And I mean 2500 characters, not 2500 words, which is super long. 2500 characters ~ 500 words. I apologize for any confusion.) 
4. Read in Roberts rules and takes out those tokens
5. Remove tokens that appear in 10 or fewer bills. Also remove tokens that appear in more than 5,000 bills.
6. Fit SVM with linear kernel
7. Merge with topic_code
8. For test data, produce: true_class, prob_class_1,..... prob_class_k
9. Output top 20 most informative for each class
10. Predict all congress (after chunking into 2.5k characters) and all news 

Other Notes:

1. It doesn't appear we are tuning c or gamma in SVM. We can also decide on kernels using cross-validation. See this: 
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html 

It is ok for now. But something to look into in the future. We can also look into other models.


In [1]:
import pandas as pd
import numpy as np
import time

## Read Bills dataset (cleaned version)

In [2]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', nrows=10)
df.columns

Index([u'uid', u'Major', u'Minor', u'clean_text'], dtype='object')

In [3]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', usecols=['Major', 'Minor', 'clean_text'])
df

Unnamed: 0,Major,Minor,clean_text
0,20,2012,congression bill 103th congress us govern prin...
1,3,300,congression bill 103th congress us govern prin...
2,15,1520,congression bill 103th congress us govern prin...
3,20,2000,congression bill 103th congress us govern prin...
4,15,1522,congression bill 103th congress us govern prin...
5,1,107,congression bill 103th congress us govern prin...
6,14,1401,congression bill 103th congress us govern prin...
7,1,107,congression bill 103th congress us govern prin...
8,14,1406,congression bill 103th congress us govern prin...
9,3,331,congression bill 103th congress us govern prin...


## Split long bill to smaller chunk (2500 chars)

In [4]:
import re
import textwrap

def insert_chars_split_marker(text, cc=2500):
    # FIXME: text still has number
    text = re.sub('\d+', '', text)
    out = '|'.join(textwrap.wrap(text, cc))
    return out

def insert_words_split_marker(text, wc=500):
    text = re.sub('\d+', '', text)
    words = text.split()
    out = ''
    for i, w in enumerate(words):
        if i != 0 and i % wc == 0:
            out += '|' + w
        else:
            out += ' ' + w
    return out

In [5]:
if True:
    df['clean_text'] = df['clean_text'].apply(lambda c: insert_chars_split_marker(c, 2500))

In [6]:
if True:
    s = df['clean_text'].str.split('|', expand=True).stack()
    i = s.index.get_level_values(0)
    new_df = df.loc[i].copy()
    new_df['chunk'] = s.index.get_level_values(1)
    new_df['clean_text'] = s.values
    df = new_df

## Vectorize

In [7]:
import nltk
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer
import re
import string

stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#def tokenize(text):
#    tokens = nltk.word_tokenize(text)
#    stems = stem_tokens(tokens, stemmer)
#    return stems

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split

In [9]:
with open('../roberts_rules/all_text.txt', 'rt') as f:
    text = f.read()
text = text.decode('ascii', 'ignore')
text = re.sub(r'\d+', '', text)

vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3)) 
vect.fit([text])
roberts_rules = set(vect.get_feature_names())

In [10]:
#vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3), min_df=0.01)
#vect = CountVectorizer(ngram_range=(2, 3), min_df=0.01) 
#vect = CountVectorizer(ngram_range=(2, 3)) 
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20) 
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000) 
vect.fit(df.clean_text)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=5000, max_features=None, min_df=10,
        ngram_range=(2, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
len(vect.get_feature_names())

690480

In [12]:
vocab = []
i = 0
for a in vect.vocabulary_:
    if a not in roberts_rules:
        vocab.append(a)
    else:
        #print a
        i += 1
print("Removed {0:d}".format(i))
print("Total {0:d}".format(len(vocab)))

Removed 4013
Total 686467


In [13]:
ng_df = pd.DataFrame(vocab)
ng_df.columns = ['ngram']
ng_df

Unnamed: 0,ngram
0,author examin book
1,mill process
2,printer connect
3,bb approv
4,agenc sponsor
5,payment method
6,compon personnel polici
7,impos health
8,health center ii
9,shall submit year


In [14]:
ng_df.to_csv('../data/bills-23gram-new.csv', index=False)

In [15]:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier,  n=10):
    labelid = 3 # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " | ".join(feature_names[j] for j in top10)))

def get_top_features(vectorizer, clf, class_labels, n=20):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    top_features = {}
    for i, class_label in enumerate(class_labels):
        topN = np.argsort(clf.coef_[i])[-n:]
        top_features[class_label] = [feature_names[j] for j in topN][::-1]
    return top_features

## Model (Major)

In [16]:
X = df.clean_text
y = df.Major

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [17]:
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20, vocabulary=vocab)
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000, vocabulary=vocab)

In [18]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [19]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [20]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

Results for LinearSVC()
Training time: 118.641524s; Prediction time: 0.552515s
             precision    recall  f1-score   support

          1       0.60      0.56      0.58      1071
          2       0.73      0.62      0.67       368
          3       0.87      0.92      0.89      3062
          4       0.83      0.85      0.84       607
          5       0.76      0.77      0.77      1159
          6       0.85      0.83      0.84      1256
          7       0.82      0.80      0.81      1050
          8       0.82      0.84      0.83      1205
         10       0.86      0.90      0.88      1029
         12       0.80      0.78      0.79      1448
         13       0.70      0.72      0.71       819
         14       0.79      0.76      0.78       516
         15       0.81      0.80      0.80      1589
         16       0.82      0.83      0.83      1840
         17       0.75      0.65      0.70       434
         18       0.82      0.79      0.80       898
         19       0

In [21]:
most_informative_feature_for_class(vect, classifier_liblinear, 99)
#most_informative_feature_for_class_svm(vect, classifier_linear)

99 author request 1.02265473396
99 amend section amend 1.08523569314
99 mr barlow 1.23269931321
99 mean bill provid 1.29354438373
99 privat relief 1.31074900264
99 act relief 1.50230090366
99 st session relief 1.84302419851
99 judiciari bill relief 2.06457812132
99 session relief 2.55369540224
99 bill relief 2.56454572632


In [22]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

1: deficit reduct | gift tax | public debt | portland oregon | feder tax | amend relat section | qualifi manufactur | direct spend | sale tax | medicar choic
2: vote system | person health inform | perman partnership | free speech | perform abort | partialbirth abort | racial profil | vote right | sexual orient | genet inform
3: health insur | medic malpractic | amend titl xviii | secretari health human | protect individu | item servic | nation health | health servic act | health plan | prescript drug
4: crop year | depart agricultur | food product | agricultur research | agricultur product | genet engin | farmer rancher | anim drug | food safeti | agricultur commod
5: immigr servic | amend immigr | pension plan | compens act | immigr enforc | wto particip | infrastructur project | illeg alien | youth apprenticeship | child care
6: profession develop | number children | depart educ | educ save | educ act | educ assist | head start | secretari educ | higher educ | student loan
7: hazard

In [43]:
topics_df = pd.read_csv('../data/topic_code.csv')
topics_map = {}
for r in topics_df.iterrows():
    topics_map[r[1].code] = r[1].topic

In [24]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
top_features_df.to_csv('../data/major_bills_top20_new.csv', index=False)
top_features_df

Unnamed: 0,Macroeconomics,"Civil Rights, Minority Issues, and Civil Liberties",Health,Agriculture,"Labor, Employment, and Immigration",Education,Environment,Energy,Transportation,"Law, Crime, and Family Issues",Social Welfare,Community Development and Housing Issues,"Banking, Finance, and Domestic Commerce",Defense,"Space, Science, Technology and Communications",Foreign Trade,International Affairs and Foreign Aid,Government Operations,Public Lands and Water Management,"Other, Miscellaneous, and Human Interest"
0,medicar choic,genet inform,prescript drug,agricultur commod,child care,student loan,asbesto claim,oil ga,coast guard,shall imprison,older individu,hous act,small busi,depart defens,commun act,trade agreement,human right,postal servic,water resourc,bill relief
1,sale tax,sexual orient,health plan,food safeti,youth apprenticeship,higher educ,coral reef,crude oil,public transport,administr fema,individu disabl,empower zone,nation insur,homeland secur,geoloc inform,nafta countri,unit nation,district columbia,indian tribe,session relief
2,direct spend,vote right,health servic act,anim drug,illeg alien,secretari educ,ballast water,clean energi,transport infrastructur,bureau prison,nonprofit agenc,hous credit,offshor aquacultur,air forc,broadband servic,custom servic,depart state,feder elect,nativ hawaiian,judiciari bill relief
3,qualifi manufactur,racial profil,nation health,farmer rancher,infrastructur project,head start,toxic mold,feder power,air carrier,money launder,care account,consensu committe,interst insur,secretari defens,space transport,export control,peac corp,public build,miner activ,st session relief
4,amend relat section,partialbirth abort,item servic,genet engin,wto particip,educ assist,fisheri manag,energi polici,surfac transport,death penalti,account holder,delta region,depositori institut,arm forc,news inform,tariff act,foreign assist,joint committe,resourc bill,act relief
5,feder tax,perform abort,protect individu,agricultur product,immigr enforc,educ act,preserv area,secretari energi,rail carrier,committe judiciari,disabl beneficiari,hous agenc,flood insur,secretari navi,region ocean,custom offic,develop countri,dutch john,committe resourc,privat relief
6,portland oregon,free speech,secretari health human,agricultur research,compens act,educ save,conserv plan,natur ga,air transport,cover grant,state registri,commun develop,financi compani,war terror,telephon servic,exportimport bank,foreign servic,polit parti,critic miner,mean bill provid
7,public debt,perman partnership,amend titl xviii,food product,pension plan,depart educ,insur particip,energi laboratori,transport system,child support,opportun board,invest save,develop compani,feder director,video servic,amend act amend,secretari state,addit amount,dam safeti,mr barlow
8,gift tax,person health inform,medic malpractic,depart agricultur,amend immigr,number children,hazard substanc,fuel cell,secretari transport,volunt firefight,nonvisu access,grant amount,comptrol currenc,section ahva,motion pictur,trade deficit,australia unit,feder employe,indian reserv,amend section amend
9,deficit reduct,vote system,health insur,crop year,immigr servic,profession develop,hazard wast,energi effici,depart transport,feder prison,licens vendor,invest save account,profession box,reserv compon,protect comput,rough diamond,obstetr fistula,contract personnel,commonwealth guam,author request


In [25]:
from sklearn.externals import joblib

joblib.dump(vect, "../models/vec_count_bills_23gram_new.joblib")
joblib.dump(classifier_liblinear, "../models/major_bills_clf_liblinear_new.joblib")

['../models/major_bills_clf_liblinear_new.joblib',
 '../models/major_bills_clf_liblinear_new.joblib_01.npy',
 '../models/major_bills_clf_liblinear_new.joblib_02.npy',
 '../models/major_bills_clf_liblinear_new.joblib_03.npy']

In [27]:
y_test_df = pd.DataFrame(y_test)
y_test_df['true_value'] = y_test_df['Major'].apply(lambda c: topics_map[c])
y_test_df.reset_index(drop=True, inplace=True)

prob = classifier_liblinear.decision_function(X_test)
prob_df = pd.DataFrame(prob)
columns = []
for c in classifier_liblinear.classes_:
    cname = topics_map[c]
    columns.append(cname)
prob_df.columns = columns

result_df = pd.concat([y_test_df[['true_value']], prob_df], axis=1)
result_df.to_csv('../data/test_prediction_major.csv', index=False)
result_df

Unnamed: 0,true_value,Macroeconomics,"Civil Rights, Minority Issues, and Civil Liberties",Health,Agriculture,"Labor, Employment, and Immigration",Education,Environment,Energy,Transportation,...,Social Welfare,Community Development and Housing Issues,"Banking, Finance, and Domestic Commerce",Defense,General,Foreign Trade,International Affairs and Foreign Aid,Government Operations,Public Lands and Water Management,"Other, Miscellaneous, and Human Interest"
0,"Banking, Finance, and Domestic Commerce",-0.861343,-1.008931,-1.158614,-0.954937,-1.082820,-1.117373,-1.078556,-1.254831,-1.073032,...,-1.163921,-1.049487,0.183315,-0.980719,-0.901211,-1.097071,-1.044722,-1.031193,-0.947636,-1.100004
1,Energy,-0.753342,-1.247482,-1.305094,-1.210755,-1.222735,-1.175831,-0.880161,0.823387,-1.196501,...,-1.471674,-1.034546,-1.411772,-1.194345,-1.105871,-1.083693,-1.197433,-0.897949,-1.370701,-1.125865
2,Transportation,-1.069932,-1.048763,-0.985838,-1.142901,-1.180988,-1.064085,-0.995711,-1.109906,1.065710,...,-1.039921,-1.146749,-1.026901,-1.086161,-1.179275,-1.115897,-1.092789,-1.062279,-0.988133,-1.081797
3,Government Operations,-1.254428,-1.147355,-1.366535,-1.123537,-0.987607,-1.288791,-1.264218,-1.228389,-1.186798,...,-1.205806,-0.990163,-0.404475,-0.695648,-1.037077,-0.845048,-1.169742,0.700476,-1.391173,-1.116940
4,Health,-0.957739,-1.040966,-0.320155,-1.115559,-0.842429,-0.931291,-1.091276,-0.932796,-1.061089,...,-1.050637,-0.960605,-1.003033,-0.874497,-1.198142,-1.076320,-0.902582,-1.034207,-0.784423,-1.091652
5,Environment,-1.002318,-0.965677,-1.090658,-0.997184,-1.117146,-0.737097,-0.692105,-1.033451,-0.896662,...,-0.919928,-0.792269,-0.624518,-0.894662,-1.166832,-1.219741,-0.799114,-0.691679,-1.180109,-1.036864
6,Government Operations,-1.172486,-0.971226,-1.030317,-1.234571,-0.461194,-1.253081,-1.015368,-1.091266,-1.089628,...,-0.736440,-1.159611,-1.132853,-0.751268,-1.195207,-1.130024,-1.250744,-0.132233,-1.022317,-1.037608
7,Public Lands and Water Management,-1.104816,-0.903985,-0.840409,-1.054195,-0.995387,-1.031975,-1.117311,-0.975389,-0.759021,...,-0.984910,-1.013377,-1.010105,-0.966611,-1.105862,-0.924050,-0.946871,-1.349261,0.361563,-1.195482
8,Public Lands and Water Management,-1.055518,-1.222473,-1.040603,-1.077004,-1.325142,-1.107111,-0.800961,-0.947036,-1.151100,...,-1.315437,-1.139486,-1.230130,-1.289411,-1.008626,-1.119100,-1.239694,-1.175083,0.636826,-1.047380
9,Macroeconomics,0.302720,-1.206150,-1.203636,-0.991555,-0.985654,-1.122308,-1.154285,-1.252815,-1.036294,...,-1.167657,-1.247124,-1.278490,-1.176303,-1.145323,-1.209477,-1.382276,-0.374527,-1.291967,-1.109033


## Try SGDClassifier

In [28]:
if False:
    from sklearn.linear_model import SGDClassifier

    elastic_clf = SGDClassifier(loss='log', alpha=.00002, n_iter=200, penalty="elasticnet")
    t0 = time.time()
    elastic_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_elastic = elastic_clf.predict(X_test)
    t2 = time.time()
    time_elastic_train = t1-t0
    time_elastic_predict = t2-t1

    print("Results for Elastic Net")
    print("Training time: %fs; Prediction time: %fs" % (time_elastic_train, time_elastic_predict))
    print(classification_report(y_test, prediction_elastic))

In [29]:
if False:
    most_informative_feature_for_class(vect, elastic_clf, 99)

In [30]:
if False:
    print_top10(vect,  elastic_clf,  elastic_clf.classes_)

## Model (topic_1 == 1)

In [44]:
selected_topics = topics_df[topics_df.topic_1 == 1].code.unique()

In [45]:
selected_topics 

array([ 101,  105,  107,  201,  202,  206,  207,  301,  302,  323,  324,
        331,  332,  333,  335,  403,  404,  501,  502,  503,  504,  505,
        508,  529,  601,  603,  607,  609,  701,  703,  704,  705,  709,
        710,  801,  802,  803,  806,  807,  900, 1002, 1003, 1005, 1006,
       1007, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209, 1301, 1302,
       1303, 1304, 1401, 1501, 1502, 1504, 1507, 1521, 1522, 1523, 1525,
       1602, 1603, 1605, 1609, 1610, 1612, 1615, 1701, 1706, 1707, 1709,
       1802, 1807, 1901, 1925, 1926, 1927, 2002, 2003, 2006, 2011, 2012,
       2013, 2101, 2102])

In [46]:
X = df[df.Minor.isin(selected_topics)].clean_text
y = df[df.Minor.isin(selected_topics)].Minor

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [47]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [48]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [49]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

Results for LinearSVC()
Training time: 217.284066s; Prediction time: 1.189586s
             precision    recall  f1-score   support

        101       0.60      0.43      0.50         7
        105       0.63      0.60      0.61       312
        107       0.67      0.74      0.70       615
        201       0.80      0.67      0.73        12
        202       0.77      0.47      0.59        51
        206       0.94      0.68      0.79        25
        207       0.81      0.61      0.69        28
        301       0.68      0.77      0.72       864
        302       0.57      0.59      0.58       440
        323       0.54      0.36      0.43        73
        324       0.59      0.29      0.39        65
        331       0.68      0.59      0.63       156
        332       0.65      0.67      0.66       207
        333       0.69      0.64      0.67        67
        335       0.80      0.74      0.77       187
        403       0.82      0.96      0.89       102
        404       1

In [50]:
most_informative_feature_for_class(vect, classifier_liblinear, 501)

501 expos person 0.985945793682
501 occup safeti health 0.995477851399
501 occup safeti 1.04555188393
501 modif thereof 1.07830710963
501 safeti health act 1.09487221909
501 energi employe 1.14035913555
501 asbesto claim 1.15650860763
501 possibl modif 1.17003062869
501 drugfre workplac 1.1713101741
501 safeti health 1.59389728242


In [51]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

101: nation emerg | price goug | statist servic | function offic | feder statist servic | statist data center | statist data | price stabil | state shall effect | feder statist
105: commiss bill | poundag quota | nation dividend | budget year | budget author | deficit reduct | sequestr order | direct spend | public debt | medicar choic
107: amend intern revenu | amend relat section | feder tax | foreign corpor | capit gain | gift tax | busi entiti | revenu servic | sale tax | intern revenu servic
201: elimin racial profil | elimin racial | harass intimid | morril act | qualifi minor | servic personnel | cultur compet | institut slaveri | custom servic personnel | racial profil
202: basi sexual | women scientist | econom selfsuffici | perman partnership | administr judg | gender equiti | sexual harass | girl women | perman partner | sexual orient
206: elect assist | protect vote right | electron vote | parti candid | absente ballot | requir payment | servic voter | uniform servic voter 

In [52]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
top_features_df.to_csv('../data/topic_1_bills_top20_new.csv', index=False)
top_features_df

Unnamed: 0,"Inflation, Prices, and Interest Rates",National Budget and Debt,"Taxation, Tax policy, and Tax Reform",Ethnic Minority and Racial Group Discrimination,Gender and Sexual Orientation Discrimination,"Voting Rights, Participation, and Related Issues",Freedom of Speech & Religion,Comprehensive health care reform,"Insurance reform, availability, and cost",Provider and insurer payment and regulation,...,"International Organizations other than Finance: United Nations (UN), UNESCO, International Red Cross","Terrorism, Hijacking",Government Efficiency and Bureaucratic Oversight,Postal Service Issues (Including Mail Fraud),"Currency, Commemorative Coins, Medals, U.S. Mint","Federal Government Branch Relations and Administrative Issues, Congressional Operations","Regulation of Political Campaigns, Political Advertising, PAC regulation, Government Ethics",Census,"National Parks, Memorials, Historic Sites, and Recreation",Native American Affairs
0,feder statist,medicar choic,intern revenu servic,racial profil,sexual orient,vote right,partialbirth abort,region allianc,health insur,amend titl xviii,...,unit nation,militari commiss,regulatori action,postal servic,commemor coin,member employe,feder elect,decenni censu,nation park,indian tribe
1,state shall effect,public debt,sale tax,custom servic personnel,perman partner,vote system,perform abort,nation health,hapi plan,physician servic,...,foreclosur prevent,insur loss,inspector gener,postag stamp,gold medal,member hous repres,gener elect,redistrict plan,nation histor,nativ hawaiian
2,price stabil,direct spend,revenu servic,institut slaveri,girl women,uniform servic voter,unborn child,health secur,medihealth plan,mortgag fraud,...,special olymp,review commiss,public printer,postal regulatori,coin issu,legisl branch,lobbi disclosur,american resid,nation monument,nativ american
3,statist data,sequestr order,busi entiti,cultur compet,sexual harass,servic voter,religi freedom,state health,health coverag,longterm care provid,...,world health,terrorist organ,legisl propos,postal regulatori commiss,educ outreach,librari congress,lobbi activ,censu popul,heritag area,indian reserv
4,statist data center,deficit reduct,gift tax,servic personnel,gender equiti,requir payment,free speech,health board,independ home,regist nurs,...,state olymp,al qaeda,risk assess,post offic,congression gold,use forc,presidenti elect,resid abroad,scienc park,indian affair
5,feder statist servic,budget author,capit gain,qualifi minor,administr judg,absente ballot,reproduct health,commun health,transit care,fee schedul,...,strike constitut,classifi inform,public servic,bypass mail,congression gold medal,presidenti order,report individu,censu bureau,nation museum,game oper
6,function offic,budget year,foreign corpor,morril act,perman partnership,parti candid,flag unit,advanc direct,health benefit,sunsetthi section shall,...,unit state olymp,protect america,action purpos,rate postag,reserv note,librarian congress,elect campaign,apportion repres congress,nation memori,indian land
7,statist servic,nation dividend,feder tax,harass intimid,econom selfsuffici,electron vote,first amend,rural health,applic author,eye examin,...,peacekeep oper,plastic explos,nontax debt,postmast gener,feder reserv note,disapprov bill,polit committe,director censu,concess contract,indian child
8,price goug,poundag quota,amend relat section,elimin racial,women scientist,protect vote right,free exercis,health inform,valuebas payment,sunsetthi section,...,intern crimin court,militari commiss chapter,execut branch,frank mail,cent coin,war power,elect offici,independ redistrict,paleontolog resourc,indian game
9,nation emerg,commiss bill,amend intern revenu,elimin racial profil,basi sexual,elect assist,flag unit state,health plan,americar supplement,section uu,...,intern crimin,commiss chapter,elig invest,cover postal,coin act,consent decre,polit parti,bureau censu,nation heritag,indian tribal


In [53]:
from sklearn.externals import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
joblib.dump(classifier_liblinear, "../models/topic_1_bills_clf_liblinear_new.joblib")

['../models/topic_1_bills_clf_liblinear_new.joblib',
 '../models/topic_1_bills_clf_liblinear_new.joblib_01.npy',
 '../models/topic_1_bills_clf_liblinear_new.joblib_02.npy',
 '../models/topic_1_bills_clf_liblinear_new.joblib_03.npy']

In [54]:
y_test_df = pd.DataFrame(y_test)
y_test_df['true_value'] = y_test_df['Minor'].apply(lambda c: topics_map[c])
y_test_df.reset_index(drop=True, inplace=True)

prob = classifier_liblinear.decision_function(X_test)
prob_df = pd.DataFrame(prob)
columns = []
for c in classifier_liblinear.classes_:
    cname = topics_map[c]
    columns.append(cname)
prob_df.columns = columns

result_df = pd.concat([y_test_df[['true_value']], prob_df], axis=1)
result_df.to_csv('../data/test_prediction_topic_1.csv', index=False)
result_df

Unnamed: 0,true_value,"Inflation, Prices, and Interest Rates",National Budget and Debt,"Taxation, Tax policy, and Tax Reform",Ethnic Minority and Racial Group Discrimination,Gender and Sexual Orientation Discrimination,"Voting Rights, Participation, and Related Issues",Freedom of Speech & Religion,Comprehensive health care reform,"Insurance reform, availability, and cost",...,"International Organizations other than Finance: United Nations (UN), UNESCO, International Red Cross","Terrorism, Hijacking",Government Efficiency and Bureaucratic Oversight,Postal Service Issues (Including Mail Fraud),"Currency, Commemorative Coins, Medals, U.S. Mint","Federal Government Branch Relations and Administrative Issues, Congressional Operations","Regulation of Political Campaigns, Political Advertising, PAC regulation, Government Ethics",Census,"National Parks, Memorials, Historic Sites, and Recreation",Native American Affairs
0,"Regulation of Political Campaigns, Political A...",-1.107077,-1.549945,-1.596182,-1.079140,-1.100206,-1.221608,-0.995564,-1.311448,-1.351030,...,-1.066654,-1.148387,-1.223445,-1.028366,-1.082228,-1.216256,2.426858,-1.077312,-1.098698,-1.057382
1,Securities and Commodities Regulation,-1.032157,-1.197227,-0.668575,-1.049665,-1.151155,-1.079883,-1.026701,-1.272465,-1.012034,...,-1.050600,-1.017667,-1.047944,-1.057719,-1.052687,-1.086413,-1.076360,-1.038653,-1.022290,-0.995943
2,"Regulation of Political Campaigns, Political A...",-1.088610,-1.168366,-1.105994,-1.088262,-1.090961,-0.930258,-1.056814,-0.975562,-1.003319,...,-1.046787,-1.029106,-1.092031,-1.109533,-1.035212,-1.110178,0.893431,-1.035591,-1.016139,-0.988895
3,Copyrights and Patents,-1.045571,-1.058546,-1.130624,-1.049828,-1.029396,-1.092169,-1.128457,-0.791159,-1.140605,...,-1.054653,-1.062590,-1.056964,-0.999693,-1.061501,-1.034330,-0.883339,-1.029260,-0.839616,-1.115757
4,Pollution and Conservation in Coastal & Other ...,-1.046634,-1.109501,-1.143791,-1.052139,-1.043627,-1.051555,-1.084159,-1.072716,-1.153650,...,-1.032685,-1.059996,-1.159185,-1.026985,-1.064639,-1.107748,-1.082761,-1.050243,-1.151186,-0.939102
5,"Insurance reform, availability, and cost",-1.071205,-1.113711,-1.184860,-1.046072,-1.076841,-1.096624,-0.963748,-0.583206,-0.017169,...,-1.078286,-1.101059,-1.207267,-1.092806,-1.084644,-1.053536,-0.877470,-1.063181,-1.044151,-1.052684
6,"Highway Construction, Maintenance, and Safety",-1.052126,-0.873274,-0.791366,-1.056896,-1.050442,-1.081380,-1.037715,-0.924475,-1.241232,...,-1.071296,-1.048064,-1.072026,-1.071816,-1.039698,-1.106636,-1.062970,-1.050064,-1.073998,-1.089232
7,Energy Conservation,-1.063401,-1.083980,-1.062762,-1.048121,-1.079833,-1.041964,-1.046446,-1.229600,-1.089328,...,-1.049118,-1.072987,-0.927269,-1.084041,-1.038054,-1.085617,-1.053394,-1.068489,-1.094002,-1.065521
8,Employee Benefits,-1.073147,-1.280055,-0.847472,-1.025855,-1.136141,-1.070814,-1.097683,-1.459350,-1.224050,...,-1.057737,-1.124938,-1.130410,-1.091323,-1.086101,-1.122652,-1.065983,-1.059820,-1.105556,0.154583
9,Comprehensive health care reform,-1.064598,-0.752202,-0.736890,-1.071926,-1.052613,-1.067140,-1.060683,0.492241,-0.965557,...,-1.033672,-1.006227,-1.118688,-1.049169,-1.049387,-1.129119,-0.967398,-1.075158,-1.100709,-1.086444


## Try LogisticRegression

In [55]:
if False:
    from sklearn.linear_model import LogisticRegression

    # Perform classification with LogisticRegression
    logreg_clf = LogisticRegression()
    t0 = time.time()
    logreg_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_logreg = logreg_clf.predict(X_test)
    t2 = time.time()
    time_logreg_train = t1-t0
    time_logreg_predict = t2-t1

    print("Results for LogisticRegression()")
    print("Training time: %fs; Prediction time: %fs" % (time_logreg_train, time_logreg_predict))
    print(classification_report(y_test, prediction_logreg))
    most_informative_feature_for_class(vect, logreg_clf, 501)
    print_top10(vect, logreg_clf, logreg_clf.classes_)