### Latest

There is just one more iteration of models that I would like to try. We will only do major and topic_1 models (no minor topics model, and only SVM with linear kernel) with the following changes:

1. Use bills93-114.csv as it has more data. It will need to be cleaned and tokenized though.

*** use old bills dataset between 103 - 112 ***

2. Support for some topics is very low. For instance, just 2--5 rows for some minor topics. I have revised topic_1 in topic_code.csv to remove most of these topics. So that should improve our classification success a lot.

3. For test data, can we also output:
true class, prob_class_1, prob_class_2 ...., .....prob_class_k

Final steps for maj_topics (similar for topic_1):

1. Read bills93-114.csv
2. Clean and tokenize

*** use old bills dataset between 103 - 112 ***

3. Split long bills into 2.5k chunks. 5k may be still too long. (And I mean 2500 characters, not 2500 words, which is super long. 2500 characters ~ 500 words. I apologize for any confusion.) 
4. Read in Roberts rules and takes out those tokens
5. Remove tokens that appear in 10 or fewer bills. Also remove tokens that appear in more than 5,000 bills.
6. Fit SVM with linear kernel
7. Merge with topic_code
8. For test data, produce: true_class, prob_class_1,..... prob_class_k
9. Output top 20 most informative for each class
10. Predict all congress (after chunking into 2.5k characters) and all news 

Other Notes:

1. It doesn't appear we are tuning c or gamma in SVM. We can also decide on kernels using cross-validation. See this: 
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html 

It is ok for now. But something to look into in the future. We can also look into other models.


In [1]:
import pandas as pd
import numpy as np
import time

## Read Bills dataset (cleaned version)

In [2]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', nrows=10)
df.columns

Index([u'uid', u'Major', u'Minor', u'clean_text'], dtype='object')

In [3]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', usecols=['Major', 'Minor', 'clean_text'])
df

Unnamed: 0,Major,Minor,clean_text
0,20,2012,congression bill 103th congress us govern prin...
1,3,300,congression bill 103th congress us govern prin...
2,15,1520,congression bill 103th congress us govern prin...
3,20,2000,congression bill 103th congress us govern prin...
4,15,1522,congression bill 103th congress us govern prin...
5,1,107,congression bill 103th congress us govern prin...
6,14,1401,congression bill 103th congress us govern prin...
7,1,107,congression bill 103th congress us govern prin...
8,14,1406,congression bill 103th congress us govern prin...
9,3,331,congression bill 103th congress us govern prin...


## Split long bill to smaller chunk (2500 chars)

In [4]:
import re
import textwrap

def insert_chars_split_marker(text, cc=2500):
    # FIXME: text still has number
    text = re.sub('\d+', '', text)
    out = '|'.join(textwrap.wrap(text, cc))
    return out

def insert_words_split_marker(text, wc=500):
    text = re.sub('\d+', '', text)
    words = text.split()
    out = ''
    for i, w in enumerate(words):
        if i != 0 and i % wc == 0:
            out += '|' + w
        else:
            out += ' ' + w
    return out

In [5]:
if True:
    df['clean_text'] = df['clean_text'].apply(lambda c: insert_chars_split_marker(c, 2500))

In [92]:
if True:
    s = df['clean_text'].str.split('|', expand=True).stack()
    i = s.index.get_level_values(0)
    new_df = df.loc[i].copy()
    new_df['chunk'] = s.index.get_level_values(1)
    new_df['clean_text'] = s.values
    df = new_df

ValueError: Length of values does not match length of index

## Vectorize

In [7]:
import nltk
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer
import re
import string

stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#def tokenize(text):
#    tokens = nltk.word_tokenize(text)
#    stems = stem_tokens(tokens, stemmer)
#    return stems

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.cross_validation import train_test_split

In [9]:
with open('../roberts_rules/all_text.txt', 'rt') as f:
    text = f.read()
text = text.decode('ascii', 'ignore')
text = re.sub(r'\d+', '', text)

vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3)) 
vect.fit([text])
roberts_rules = set(vect.get_feature_names())

In [10]:
#vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3), min_df=0.01)
#vect = CountVectorizer(ngram_range=(2, 3), min_df=0.01) 
#vect = CountVectorizer(ngram_range=(2, 3)) 
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20) 
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000) 
vect.fit(df.clean_text)

CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
        dtype=<type 'numpy.int64'>, encoding=u'utf-8', input=u'content',
        lowercase=True, max_df=5000, max_features=None, min_df=10,
        ngram_range=(2, 3), preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
        tokenizer=None, vocabulary=None)

In [11]:
len(vect.get_feature_names())

690480

In [12]:
vocab = []
i = 0
for a in vect.vocabulary_:
    if a not in roberts_rules:
        vocab.append(a)
    else:
        #print a
        i += 1
print("Removed {0:d}".format(i))
print("Total {0:d}".format(len(vocab)))

Removed 4013
Total 686467


In [13]:
ng_df = pd.DataFrame(vocab)
ng_df.columns = ['ngram']
ng_df

Unnamed: 0,ngram
0,author examin book
1,mill process
2,printer connect
3,bb approv
4,agenc sponsor
5,payment method
6,compon personnel polici
7,impos health
8,health center ii
9,shall submit year


In [14]:
ng_df.to_csv('../data/bills-23gram-new.csv', index=False)

In [15]:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print classlabel, feat, coef

def most_informative_feature_for_class_svm(vectorizer, classifier,  n=10):
    labelid = 3 # this is the coef we're interested in. 
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray() 
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print feat, coef

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " | ".join(feature_names[j] for j in top10)))

def get_top_features(vectorizer, clf, class_labels, n=20):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    top_features = {}
    for i, class_label in enumerate(class_labels):
        topN = np.argsort(clf.coef_[i])[-n:]
        top_features[class_label] = [feature_names[j] for j in topN][::-1]
    return top_features

## Model (Major)

In [17]:
X = df.clean_text
y = df.Major

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [18]:
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20, vocabulary=vocab)
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000, vocabulary=vocab)

In [19]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [20]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [21]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

Results for LinearSVC()
Training time: 123.404556s; Prediction time: 0.618879s
             precision    recall  f1-score   support

          1       0.60      0.55      0.57      1071
          2       0.75      0.65      0.70       368
          3       0.86      0.92      0.89      3062
          4       0.82      0.82      0.82       607
          5       0.74      0.76      0.75      1159
          6       0.83      0.85      0.84      1256
          7       0.80      0.78      0.79      1050
          8       0.82      0.85      0.84      1205
         10       0.87      0.88      0.88      1029
         12       0.78      0.79      0.79      1448
         13       0.73      0.68      0.70       819
         14       0.80      0.76      0.78       516
         15       0.80      0.80      0.80      1589
         16       0.83      0.82      0.82      1840
         17       0.76      0.65      0.70       434
         18       0.83      0.80      0.81       898
         19       0

In [22]:
most_informative_feature_for_class(vect, classifier_liblinear, 99)
#most_informative_feature_for_class_svm(vect, classifier_linear)

99 date file petit 0.950818487668
99 amend section amend 0.964538504304
99 author request 0.992795516663
99 mean bill provid 1.12612501026
99 reaffirm agreement 1.26709192351
99 act relief 1.45874390103
99 judiciari bill relief 1.9664402076
99 st session relief 2.32482001213
99 session relief 2.66888917435
99 bill relief 2.71392383165


In [23]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

1: amend relat section | gift tax | deficit reduct | public debt | regular credit | tax administr | qualifi manufactur | direct spend | sale tax | medicar choic
2: partialbirth abort | vote system | girl women | perform abort | fair hous | person health inform | sexual orient | racial profil | vote right | genet inform
3: medic care | amend titl xviii | item servic | secretari health human | state health | medic malpractic | nation health | health servic act | health plan | prescript drug
4: agricultur product | depart agricultur | genet engin | food safeti | agricultur research | agricultur commod | crop year | anim drug | farmer rancher | food product
5: immigr enforc | amend immigr | commun board | pension plan | wto particip | unemploy compens | illeg alien | youth apprenticeship | infrastructur project | child care
6: local partnership | public school | educ save | elementari secondari | secretari educ | depart educ | higher educ | head start | educ assist | student loan
7: nuclea

In [24]:
topics_df = pd.read_csv('../data/topic_code.csv')
topics_map = {}
for r in topics_df.iterrows():
    topics_map[r[1].code] = r[1].topic

In [25]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
top_features_df.to_csv('../data/major_bills_top20_new.csv', index=False)
top_features_df

Unnamed: 0,Macroeconomics,"Civil Rights, Minority Issues, and Civil Liberties",Health,Agriculture,"Labor, Employment, and Immigration",Education,Environment,Energy,Transportation,"Law, Crime, and Family Issues",Social Welfare,Community Development and Housing Issues,"Banking, Finance, and Domestic Commerce",Defense,"Space, Science, Technology and Communications",Foreign Trade,International Affairs and Foreign Aid,Government Operations,Public Lands and Water Management,"Other, Miscellaneous, and Human Interest"
0,medicar choic,genet inform,prescript drug,food product,child care,student loan,asbesto claim,oil ga,coast guard,shall imprison,older individu,empower zone,small busi,depart defens,commun act,custom servic,human right,district columbia,water resourc,bill relief
1,sale tax,vote right,health plan,farmer rancher,infrastructur project,educ assist,fisheri manag,crude oil,public transport,committe judiciari,social servic,hous act,nation insur,arm forc,geoloc inform,trade agreement,unit nation,postal servic,indian tribe,session relief
2,direct spend,racial profil,health servic act,anim drug,youth apprenticeship,head start,coral reef,clean energi,transport infrastructur,money launder,obra amend,delta region,offshor aquacultur,homeland secur,broadband servic,export control,foreign assist,feder elect,indian reserv,st session relief
3,qualifi manufactur,sexual orient,nation health,crop year,illeg alien,higher educ,respons action,feder power,transport system,administr fema,welfar reform,consensu committe,depositori institut,secretari defens,realloc commiss,secretari trade,depart state,public build,nativ hawaiian,judiciari bill relief
4,tax administr,person health inform,medic malpractic,agricultur commod,unemploy compens,depart educ,ballast water,secretari energi,air carrier,bureau prison,welfar recipi,commun develop,interst insur,air forc,telecommun carrier,exportimport bank,peac corp,gener elect,miner activ,act relief
5,regular credit,fair hous,state health,agricultur research,wto particip,secretari educ,solid wast,natur ga,surfac transport,child support,account holder,grant amount,flood insur,war terror,region ocean,countervail duti,foreign servic,hass avocado,dam safeti,reaffirm agreement
6,public debt,perform abort,secretari health human,food safeti,pension plan,elementari secondari,insur particip,fuel cell,rail carrier,juvenil delinqu,individu disabl,afford hous,store valu,secretari navi,space transport,amend act amend,develop countri,independ counsel,critic miner,mean bill provid
7,deficit reduct,girl women,item servic,genet engin,commun board,educ save,water qualiti,pipelin safeti,air transport,feder prison,smart annuiti,hous agenc,profession box,chemic facil,commerci space,tariff act,australia unit,dutch john,nation forest,author request
8,gift tax,vote system,amend titl xviii,depart agricultur,amend immigr,public school,hazard wast,energi polici,transport plan,death penalti,nation servic,hous credit,depart commerc,militari instal,news inform,nafta countri,obstetr fistula,joint committe,nation park,amend section amend
9,amend relat section,partialbirth abort,medic care,agricultur product,immigr enforc,local partnership,nuclear wast,energi conserv,aviat secur,child abus,center organ,invest save,antitrust law,reserv compon,innov prize,trade deficit,govern sudan,execut agenc,nativ american,date file petit


In [26]:
from sklearn.externals import joblib

joblib.dump(vect, "../models/vec_count_bills_23gram_new.joblib")
joblib.dump(classifier_liblinear, "../models/major_bills_clf_liblinear_new.joblib")

['../models/major_bills_clf_liblinear_new.joblib',
 '../models/major_bills_clf_liblinear_new.joblib_01.npy',
 '../models/major_bills_clf_liblinear_new.joblib_02.npy',
 '../models/major_bills_clf_liblinear_new.joblib_03.npy']

In [None]:
y_test_df = pd.DataFrame(y_test)
y_test_df['true_value'] = y_test_df['Minor'].apply(lambda c: topics_map[c])
y_test_df.reset_index(drop=True, inplace=True)

prob = classifier_liblinear.decision_function(X_test)
prob_df = pd.DataFrame(prob)
columns = []
for c in classifier_liblinear.classes_:
    cname = topics_map[c]
    columns.append(cname)
prob_df.columns = columns

result_df = pd.concat([y_test_df[['true_value']], prob_df], axis=1)
result_df.to_csv('../data/test_prediction_major.csv', index=False)
result_df

## Try SGDClassifier

In [27]:
if False:
    from sklearn.linear_model import SGDClassifier

    elastic_clf = SGDClassifier(loss='log', alpha=.00002, n_iter=200, penalty="elasticnet")
    t0 = time.time()
    elastic_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_elastic = elastic_clf.predict(X_test)
    t2 = time.time()
    time_elastic_train = t1-t0
    time_elastic_predict = t2-t1

    print("Results for Elastic Net")
    print("Training time: %fs; Prediction time: %fs" % (time_elastic_train, time_elastic_predict))
    print(classification_report(y_test, prediction_elastic))

In [28]:
if False:
    most_informative_feature_for_class(vect, elastic_clf, 99)

In [29]:
if False:
    print_top10(vect,  elastic_clf,  elastic_clf.classes_)

## Model (topic_1 == 1)

In [30]:
selected_topics = topics_df[topics_df.topic_1 == 1].code.unique()

In [31]:
selected_topics 

array([ 101,  103,  105,  107,  201,  202,  206,  207,  209,  301,  302,
        323,  324,  331,  332,  333,  335,  403,  404,  501,  502,  503,
        504,  505,  508,  529,  601,  603,  607,  609,  701,  703,  704,
        705,  709,  710,  801,  802,  803,  806,  807,  900, 1002, 1003,
       1005, 1006, 1007, 1202, 1203, 1204, 1205, 1206, 1207, 1208, 1209,
       1301, 1302, 1303, 1304, 1401, 1501, 1502, 1504, 1507, 1521, 1522,
       1523, 1525, 1602, 1603, 1605, 1609, 1610, 1612, 1615, 1701, 1706,
       1707, 1709, 1802, 1807, 1901, 1915, 1925, 1926, 1927, 2002, 2003,
       2006, 2010, 2011, 2012, 2013, 2101, 2102])

In [32]:
X = df[df.Minor.isin(selected_topics)].clean_text
y = df[df.Minor.isin(selected_topics)].Minor

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [33]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [34]:
X_test = vect.transform(X_test)
transformer = TfidfTransformer()
X_test = transformer.fit_transform(X_test)

In [35]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

Results for LinearSVC()
Training time: 237.174071s; Prediction time: 1.236352s
             precision    recall  f1-score   support

        101       0.40      0.29      0.33         7
        103       0.00      0.00      0.00         3
        105       0.68      0.56      0.61       312
        107       0.66      0.74      0.70       615
        201       0.80      0.67      0.73        12
        202       0.88      0.43      0.58        51
        206       0.88      0.60      0.71        25
        207       0.90      0.64      0.75        28
        209       0.00      0.00      0.00         1
        301       0.67      0.75      0.71       864
        302       0.62      0.60      0.61       440
        323       0.62      0.42      0.50        73
        324       0.50      0.35      0.41        65
        331       0.68      0.63      0.65       156
        332       0.67      0.62      0.65       207
        333       0.64      0.67      0.66        67
        335       0

  'precision', 'predicted', average, warn_for)


In [36]:
most_informative_feature_for_class(vect, classifier_liblinear, 501)

501 drugfre workplac program 0.954687110151
501 expos person 0.960065124841
501 safeti health act 0.976369504038
501 cover ill 0.986308017987
501 asbesto claim 1.01969459724
501 modif thereof 1.10235644819
501 energi employe 1.18290098909
501 possibl modif 1.18658501104
501 safeti health 1.36630269879
501 drugfre workplac 1.40521522807


In [37]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

101: price goug | cost live | function offic | feder statist servic | statist servic | statist data center | statist data | price stabil | feder statist | state shall effect
103: regul guidanc | purpos tax impos | independ contractor | temporari increas | total wage | state workforc | unemploy rate | increas unemploy | unit state workforc | busi incub
105: commiss bill | budget author | committe budget | deficit reduct | sequestr order | act amend insert | spend reduct | direct spend | public debt | medicar choic
107: busi activ | portland oregon | capit gain | tax administr | foreign corpor | gift tax | busi entiti | revenu servic | intern revenu servic | sale tax
201: elimin racial profil | elimin racial | harass intimid | morril act | qualifi minor | servic personnel | cultur compet | institut slaveri | custom servic personnel | racial profil
202: conflict prevent | administr judg | women scientist | econom selfsuffici | girl women | perman partnership | gender equiti | sexual haras

In [38]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
top_features_df.to_csv('../data/topic_1_bills_top20_new.csv', index=False)
top_features_df

Unnamed: 0,"Inflation, Prices, and Interest Rates",Unemployment Rate,National Budget and Debt,"Taxation, Tax policy, and Tax Reform",Ethnic Minority and Racial Group Discrimination,Gender and Sexual Orientation Discrimination,"Voting Rights, Participation, and Related Issues",Freedom of Speech & Religion,Anti-Government Activities,Comprehensive health care reform,...,"Terrorism, Hijacking",Government Efficiency and Bureaucratic Oversight,Postal Service Issues (Including Mail Fraud),"Currency, Commemorative Coins, Medals, U.S. Mint",Presidential Impeachment & Scandal,"Federal Government Branch Relations and Administrative Issues, Congressional Operations","Regulation of Political Campaigns, Political Advertising, PAC regulation, Government Ethics",Census,"National Parks, Memorials, Historic Sites, and Recreation",Native American Affairs
0,state shall effect,busi incub,medicar choic,sale tax,racial profil,sexual orient,vote right,partialbirth abort,covert test,region allianc,...,militari commiss,regulatori action,postal servic,commemor coin,counterfeit mark,disapprov bill,feder elect,decenni censu,nation park,indian tribe
1,feder statist,unit state workforc,public debt,intern revenu servic,custom servic personnel,perman partner,vote system,unborn child,time war,nation health,...,al qaeda,inspector gener,postag stamp,gold medal,traffick counterfeit,member hous repres,gener elect,director censu,heritag area,nativ hawaiian
2,price stabil,increas unemploy,direct spend,revenu servic,institut slaveri,sexual harass,board advisor,perform abort,ensur citizen,advanc direct,...,plastic explos,public printer,postal regulatori,coin issu,code provid crimin,legisl branch,report individu,censu popul,nation histor,nativ american
3,statist data,unemploy rate,spend reduct,busi entiti,cultur compet,gender equiti,standard board,free speech,secur screener,health secur,...,insur loss,nation manufactur,postal regulatori commiss,educ outreach,penalti traffick,librari congress,elect campaign,redistrict plan,nation monument,indian reserv
4,statist data center,state workforc,act amend insert,gift tax,servic personnel,perman partnership,absente ballot,religi freedom,anoth countri,state health,...,terrorist organ,system transform,offici mail,travel promot,th cir,member employe,lobbi disclosur,american resid,scienc park,indian affair
5,statist servic,total wage,sequestr order,foreign corpor,qualifi minor,girl women,local elect,reproduct health,unless unit state,commun health,...,intern terror,risk assess,rate postag,bald eagl,provid crimin penalti,use forc,voter registr,apportion repres congress,smithsonian institut,indian land
6,feder statist servic,temporari increas,deficit reduct,tax administr,morril act,econom selfsuffici,uniform servic voter,relat abort,unless unit,health board,...,protect america,action purpos,postmast gener,state mint,provid crimin,independ counsel,polit committe,censu bureau,nation memori,game oper
7,function offic,independ contractor,committe budget,capit gain,harass intimid,women scientist,servic voter,sexual explicit,detain unit,health plan,...,review commiss,major rule,post offic,unit state mint,novemb attest secretari,librarian congress,polit organ,independ redistrict,nation heritag,indian game
8,cost live,purpos tax impos,budget author,portland oregon,elimin racial,administr judg,parti candid,flag unit,detain unit state,antimicrobi resist,...,foreign state,public pension,cover postal,coin act,senat novemb attest,avail internet,lobbi activ,resid abroad,commemor work,bureau indian
9,price goug,regul guidanc,commiss bill,busi activ,elimin racial profil,conflict prevent,individu convict,abort servic,unit state first,health inform,...,militari commiss chapter,public servic,bypass mail,congression gold medal,act amend titl,war power,elig candid,statist purpos,histor preserv,land claim


In [None]:
from sklearn.externals import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
joblib.dump(classifier_liblinear, "../models/topic_1_bills_clf_liblinear_new.joblib")

['../models/topic_1_bills_clf_liblinear_new.joblib',
 '../models/topic_1_bills_clf_liblinear_new.joblib_01.npy',
 '../models/topic_1_bills_clf_liblinear_new.joblib_02.npy',
 '../models/topic_1_bills_clf_liblinear_new.joblib_03.npy']

In [None]:
y_test_df = pd.DataFrame(y_test)
y_test_df['true_value'] = y_test_df['Minor'].apply(lambda c: topics_map[c])
y_test_df.reset_index(drop=True, inplace=True)

prob = classifier_liblinear.decision_function(X_test)
prob_df = pd.DataFrame(prob)
columns = []
for c in classifier_liblinear.classes_:
    cname = topics_map[c]
    columns.append(cname)
prob_df.columns = columns

result_df = pd.concat([y_test_df[['true_value']], prob_df], axis=1)
result_df.to_csv('../data/test_prediction_topic_1.csv', index=False)
result_df

## Try LogisticRegression

In [None]:
if False:
    from sklearn.linear_model import LogisticRegression

    # Perform classification with LogisticRegression
    logreg_clf = LogisticRegression()
    t0 = time.time()
    logreg_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_logreg = logreg_clf.predict(X_test)
    t2 = time.time()
    time_logreg_train = t1-t0
    time_logreg_predict = t2-t1

    print("Results for LogisticRegression()")
    print("Training time: %fs; Prediction time: %fs" % (time_logreg_train, time_logreg_predict))
    print(classification_report(y_test, prediction_logreg))
    most_informative_feature_for_class(vect, logreg_clf, 501)
    print_top10(vect, logreg_clf, logreg_clf.classes_)

Results for LogisticRegression()
Training time: 938.704970s; Prediction time: 1.289449s
             precision    recall  f1-score   support

        101       0.00      0.00      0.00         7
        103       0.00      0.00      0.00         3
        105       0.66      0.55      0.60       312
        107       0.51      0.84      0.64       615
        201       1.00      0.17      0.29        12
        202       0.94      0.29      0.45        51
        206       1.00      0.12      0.21        25
        207       1.00      0.18      0.30        28
        209       0.00      0.00      0.00         1
        301       0.43      0.84      0.57       864
        302       0.60      0.53      0.56       440
        323       0.71      0.16      0.27        73
        324       1.00      0.09      0.17        65
        331       0.79      0.40      0.53       156
        332       0.68      0.49      0.57       207
        333       0.87      0.49      0.63        67
        33