### Original

The data are at: opt/data/bills-out.csv

We will train 3 classifiers, likely SVMs, since the number of categories is large:

1. predict 'Minor': column labeled 'minor'
2. predict 'Major': column labeled 'major'
3. predict only categories where topic_code.csv (also in opt/data) says include ==1 (I will update this soon.)


How?
1. Tokenize bill text
2. Break into 5000 word chunks per bill
3. Use SVM to predict --- cross-validate to tune
4. Show classification success and print out top coefficients for each category so that we can verify that the model makes sense
5. Predict congressional speech. It is in opt/cong (or you can download it via the capitolwords API with the script you wrote). We want predictions for each congress.
   Predict each category. If you have 9 categories, you will get 9 columns. 
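
Step 3 above ("cross-validate to tune") can be sketched roughly as follows. This is only an illustration under assumptions: the texts and labels are made-up stand-ins for the bill data, and the grid of `C` values is arbitrary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy stand-in for bill text and major-topic codes (hypothetical data).
texts = [
    "health care insurance coverage", "hospital health care funding",
    "health insurance premium relief", "medical care health program",
    "farm subsidy crop insurance", "farm crop price support",
    "dairy farm subsidy program", "crop farm loan relief",
]
labels = [3, 3, 3, 3, 4, 4, 4, 4]

# Putting TF-IDF inside the pipeline means it is refit on each training fold only.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(ngram_range=(1, 2))),
    ("svm", LinearSVC()),
])

# Cross-validate over the regularisation strength C (grid values are arbitrary).
search = GridSearchCV(pipe, {"svm__C": [0.1, 1.0, 10.0]}, cv=2)
search.fit(texts, labels)

print(search.best_params_)
preds = search.predict(["health care program", "farm crop subsidy"])
print(preds)
```

On the real data the same pattern would apply to `df.clean_text` and `df.Major`, with a larger `cv` and the bigram/trigram settings used later in this notebook.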
   
### Recap

1. Let us remove Roberts rules bigrams/trigrams again
2. With just 10k tokens, I think we are selling ourselves short. Let us remove only bigrams/trigrams that are in 20 documents or fewer.
3. Let us try GradientBoostingClassifier also
4. Let us not chunk in 5k as bills seem to be ok length except for 5 or so bills which have less than 1000 characters
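
A minimal sketch of item 3, assuming scikit-learn's `GradientBoostingClassifier` is the model meant. The texts and codes are invented, and `min_df=2` and `n_estimators=20` are chosen only so the toy example runs; the real data would use the document-frequency cut-off from item 2.

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Hypothetical stand-ins for bill text and major-topic codes.
texts = [
    "health care insurance coverage", "health care funding hospital",
    "expand health care insurance", "health care insurance reform",
    "rural health care funding", "health care coverage reform",
    "farm crop subsidy program", "farm crop price support",
    "dairy farm crop subsidy", "farm crop loan relief",
    "farm crop subsidy reform", "family farm crop support",
]
labels = [3] * 6 + [4] * 6

X_tr, X_te, y_tr, y_te = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0)

# Bigrams/trigrams only, keeping n-grams seen in at least 2 training documents.
vect = CountVectorizer(ngram_range=(2, 3), min_df=2)
tfidf = TfidfTransformer()
X_tr_vec = tfidf.fit_transform(vect.fit_transform(X_tr))
X_te_vec = tfidf.transform(vect.transform(X_te))  # reuse the training IDF weights

gb_clf = GradientBoostingClassifier(n_estimators=20, random_state=0)
gb_clf.fit(X_tr_vec, y_tr)
gb_pred = gb_clf.predict(X_te_vec)
acc = accuracy_score(y_te, gb_pred)
print("accuracy: %.2f" % acc)
```

Gradient boosting is typically much slower than `LinearSVC` on wide sparse matrices, so it may be worth timing it on a sample of bills first.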

What to predict?
1. Major topics
2. Minor topics: but take out all with the label 'other' or 'general'. To find out the topic codes, see topic_code.csv. We can also take out minor topics that don't have more than 10 bills.
3. The third model will only include minor topics for which the topic_1 column in topic_code.csv equals 1. Again, we take out minor topics that don't have more than 10 bills.
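
The filtering described above could look roughly like this. It is a sketch on invented rows: the column names `code`, `topic` and `topic_1` follow how topic_code.csv is used later in this notebook, and matching the words "other"/"general" inside the topic label is an assumption about how those labels are written.

```python
import pandas as pd

# Hypothetical stand-ins for topic_code.csv and the bills table.
topics_df = pd.DataFrame({
    "code": [100, 101, 199, 300, 301],
    "topic": ["General", "Inflation", "Other", "General", "Health Reform"],
    "topic_1": [0, 1, 0, 0, 1],
})
df = pd.DataFrame({"Minor": [101] * 12 + [301] * 3 + [199] * 15 + [100] * 11})

# Codes whose label mentions 'other' or 'general' (case-insensitive).
is_catch_all = topics_df.topic.str.contains("other|general", case=False)
catch_all_codes = set(topics_df[is_catch_all].code)

# Minor codes with more than 10 bills.
counts = df.Minor.value_counts()
frequent_codes = set(counts[counts > 10].index)

# Model 2: frequent minor topics, excluding the catch-all labels.
minors = sorted(int(c) for c in frequent_codes - catch_all_codes)

# Model 3: additionally restrict to codes flagged with topic_1 == 1.
flagged = set(topics_df[topics_df.topic_1 == 1].code)
selected = sorted(int(c) for c in set(minors) & flagged)

print(minors, selected)
```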

Next steps after the models:
1. Output top 20 bigrams/trigrams for each topic where column_name = topic_label (from topic_code.csv) for each model

In [1]:
import pandas as pd
import numpy as np
import time

## Read Bills dataset (cleaned version)

In [2]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', nrows=10)
df.columns

Index([u'uid', u'Major', u'Minor', u'clean_text'], dtype='object')

In [3]:
df = pd.read_csv('../data/bills-out-clean.csv.gz', usecols=['Major', 'Minor', 'clean_text'])
df

Unnamed: 0,Major,Minor,clean_text
0,20,2012,congression bill 103th congress us govern prin...
1,3,300,congression bill 103th congress us govern prin...
2,15,1520,congression bill 103th congress us govern prin...
3,20,2000,congression bill 103th congress us govern prin...
4,15,1522,congression bill 103th congress us govern prin...
5,1,107,congression bill 103th congress us govern prin...
6,14,1401,congression bill 103th congress us govern prin...
7,1,107,congression bill 103th congress us govern prin...
8,14,1406,congression bill 103th congress us govern prin...
9,3,331,congression bill 103th congress us govern prin...


## Split long bills into smaller chunks (5000 words)

In [4]:
import re
import textwrap

def insert_chars_split_marker(text, cc=2500):
    # Strip digits left over from cleaning, then insert a '|' marker
    # roughly every cc characters (textwrap breaks on whitespace).
    text = re.sub(r'\d+', '', text)
    out = '|'.join(textwrap.wrap(text, cc))
    return out

def insert_words_split_marker(text, wc=500):
    # Strip digits, then insert a '|' marker after every wc words.
    text = re.sub(r'\d+', '', text)
    words = text.split()
    chunks = [' '.join(words[i:i + wc]) for i in range(0, len(words), wc)]
    return '|'.join(chunks)

In [5]:
if False:
    df['clean_text'] = df['clean_text'].apply(lambda c: insert_words_split_marker(c, 5000))

In [6]:
if False:
    s = df['clean_text'].str.split('|', expand=True).stack()
    i = s.index.get_level_values(0)
    new_df = df.loc[i].copy()
    new_df['chunk'] = s.index.get_level_values(1)
    new_df['clean_text'] = s.values
    df = new_df
    df
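
To make the reshaping in the cell above easier to follow, here is a small sketch of the same idea on a toy frame: one row per '|'-separated chunk, plus a `chunk` counter. It uses `DataFrame.explode` rather than `split(expand=True).stack()`, which is a swapped-in alternative and not the code used in this notebook; the data is made up.

```python
import pandas as pd

# Two hypothetical bills whose text already contains '|' split markers.
toy = pd.DataFrame({"Major": [20, 3], "clean_text": ["a b|c d|e", "f g"]})

# Split each text into a list of chunks, then give every chunk its own row.
chunks = toy.assign(clean_text=toy["clean_text"].str.split("|")).explode("clean_text")

# Number the chunks within each original bill (the index still identifies the bill).
chunks["chunk"] = chunks.groupby(level=0).cumcount()
chunks = chunks.reset_index(drop=True)
print(chunks)
```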

## Vectorize

In [7]:
import nltk
from nltk import word_tokenize          
from nltk.stem.porter import PorterStemmer
import re
import string

stemmer = PorterStemmer()
def stem_tokens(tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item))
    return stemmed

#def tokenize(text):
#    tokens = nltk.word_tokenize(text)
#    stems = stem_tokens(tokens, stemmer)
#    return stems

def tokenize(text):
    text = "".join([ch for ch in text if ch not in string.punctuation])
    tokens = nltk.word_tokenize(text)
    stems = stem_tokens(tokens, stemmer)
    return stems

In [8]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

In [None]:
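import io

# Read the Roberts rules text, dropping any non-ASCII characters
with io.open('../roberts_rules/all_text.txt', 'rt', encoding='ascii', errors='ignore') as f:
    text = f.read()
# Strip digits so the n-grams match the cleaned bill text
text = re.sub(r'\d+', '', text)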

vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3)) 
vect.fit([text])
roberts_rules = set(vect.get_feature_names())

In [None]:
#vect = CountVectorizer(tokenizer=tokenize, stop_words='english', ngram_range=(2, 3), min_df=0.01)
#vect = CountVectorizer(ngram_range=(2, 3), min_df=0.01) 
#vect = CountVectorizer(ngram_range=(2, 3)) 
vect = CountVectorizer(ngram_range=(2, 3), min_df=20) 
vect.fit(df.clean_text)

In [None]:
len(vect.get_feature_names())

In [None]:
# Keep only the bill n-grams that do not also occur in the Roberts rules text
vocab = [ng for ng in vect.vocabulary_ if ng not in roberts_rules]
removed = len(vect.vocabulary_) - len(vocab)
print("Removed {0:d}".format(removed))
print("Total {0:d}".format(len(vocab)))

In [None]:
ng_df = pd.DataFrame(vocab)
ng_df.columns = ['ngram']
ng_df

In [None]:
#ng_df.to_csv('../data/bills-23gram.csv', index=False)

In [None]:
def most_informative_feature_for_class(vectorizer, classifier, classlabel, n=10):
    labelid = list(classifier.classes_).index(classlabel)
    feature_names = vectorizer.get_feature_names()
    topn = sorted(zip(classifier.coef_[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print("%s %s %s" % (classlabel, feat, coef))

def most_informative_feature_for_class_svm(vectorizer, classifier, n=10, labelid=3):
    # labelid is the row of coef_ to inspect; it defaults to 3, the row previously hard-coded.
    # coef_ is assumed to be sparse here (e.g. svm.SVC(kernel='linear') fit on sparse input).
    feature_names = vectorizer.get_feature_names()
    svm_coef = classifier.coef_.toarray()
    topn = sorted(zip(svm_coef[labelid], feature_names))[-n:]

    for coef, feat in topn:
        print("%s %s" % (feat, coef))

def print_top10(vectorizer, clf, class_labels):
    """Prints features with the highest coefficient values, per class"""
    feature_names = vectorizer.get_feature_names()
    for i, class_label in enumerate(class_labels):
        top10 = np.argsort(clf.coef_[i])[-10:]
        print("%s: %s" % (class_label,
              " | ".join(feature_names[j] for j in top10)))

def get_top_features(vectorizer, clf, class_labels, n=20):
    """Returns the n features with the highest coefficient values per class, best first"""
    feature_names = vectorizer.get_feature_names()
    top_features = {}
    for i, class_label in enumerate(class_labels):
        topN = np.argsort(clf.coef_[i])[-n:]
        top_features[class_label] = [feature_names[j] for j in topN][::-1]
    return top_features

## Model (Major)

In [None]:
df.groupby('Major').agg({'Major': 'count'})

In [None]:
X = df.clean_text
y = df.Major

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [None]:
#vect = CountVectorizer(ngram_range=(2, 3), min_df=20, vocabulary=vocab)
# Note: min_df and max_df are ignored when a fixed vocabulary is supplied
vect = CountVectorizer(ngram_range=(2, 3), min_df=10, max_df=5000, vocabulary=vocab)

In [None]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [None]:
X_test = vect.transform(X_test)
# Reuse the IDF weights learned on the training set; do not refit on the test set
X_test = transformer.transform(X_test)

In [None]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

In [None]:
most_informative_feature_for_class(vect, classifier_liblinear, 99)
#most_informative_feature_for_class_svm(vect, classifier_linear)

In [None]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

In [None]:
topics_df = pd.read_csv('../data/topic_code.csv')

In [None]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
#top_features_df.to_csv('../data/major_bills_top20.csv', index=False)
top_features_df

In [None]:
import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
#joblib.dump(classifier_liblinear, "../models/major_bills_clf_liblinear.joblib")

## Try SGDClassifier

In [None]:
if False:
    from sklearn.linear_model import SGDClassifier

    elastic_clf = SGDClassifier(loss='log', alpha=.00002, n_iter=200, penalty="elasticnet")
    t0 = time.time()
    elastic_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_elastic = elastic_clf.predict(X_test)
    t2 = time.time()
    time_elastic_train = t1-t0
    time_elastic_predict = t2-t1

    print("Results for Elastic Net")
    print("Training time: %fs; Prediction time: %fs" % (time_elastic_train, time_elastic_predict))
    print(classification_report(y_test, prediction_elastic))

In [None]:
if False:
    most_informative_feature_for_class(vect, elastic_clf, 99)

In [None]:
if False:
    print_top10(vect,  elastic_clf,  elastic_clf.classes_)

## Minor Model

#### This section was removed.

## Model (topic_1 == 1)

In [None]:
selected_topics = topics_df[topics_df.topic_1 == 1].code.unique()

In [None]:
selected_topics 

In [None]:
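# NOTE: `minors` came from the removed Minor Model section. As an assumption based on
# the recap above, rebuild it as the minor codes that have more than 10 bills.
minor_counts = df.Minor.value_counts()
minors = minor_counts[minor_counts > 10].index

mask = df.Minor.isin(minors) & df.Minor.isin(selected_topics)
X = df[mask].clean_text
y = df[mask].Minor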

X_train,  X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y)

In [None]:
X_train = vect.transform(X_train)
transformer = TfidfTransformer()
X_train = transformer.fit_transform(X_train)

In [None]:
X_test = vect.transform(X_test)
# Reuse the IDF weights learned on the training set; do not refit on the test set
X_test = transformer.transform(X_test)

In [None]:
# Perform classification with SVM, kernel=linear
classifier_liblinear = svm.LinearSVC()
t0 = time.time()
classifier_liblinear.fit(X_train, y_train)
t1 = time.time()
prediction_liblinear = classifier_liblinear.predict(X_test)
t2 = time.time()
time_liblinear_train = t1-t0
time_liblinear_predict = t2-t1

print("Results for LinearSVC()")
print("Training time: %fs; Prediction time: %fs" % (time_liblinear_train, time_liblinear_predict))
print(classification_report(y_test, prediction_liblinear))

In [None]:
most_informative_feature_for_class(vect, classifier_liblinear, 501)

In [None]:
print_top10(vect, classifier_liblinear, classifier_liblinear.classes_)

In [None]:
top_features = get_top_features(vect, classifier_liblinear, classifier_liblinear.classes_)
top_features_df = pd.DataFrame(top_features)
columns = []
for c in top_features_df.columns:
    cname = topics_df[topics_df.code == c].topic.values[0]
    columns.append(cname)
top_features_df.columns = columns
#top_features_df.to_csv('../data/topic_1_bills_top20.csv', index=False)
top_features_df

In [None]:
import joblib

#joblib.dump(vect, "../models/vec_count_bills_23gram.joblib")
#joblib.dump(classifier_liblinear, "../models/topic_1_bills_clf_liblinear.joblib")

## Try LogisticRegression

In [None]:
if True:
    from sklearn.linear_model import LogisticRegression

    # Perform classification with LogisticRegression
    logreg_clf = LogisticRegression()
    t0 = time.time()
    logreg_clf.fit(X_train, y_train)
    t1 = time.time()
    prediction_logreg = logreg_clf.predict(X_test)
    t2 = time.time()
    time_logreg_train = t1-t0
    time_logreg_predict = t2-t1

    print("Results for LogisticRegression()")
    print("Training time: %fs; Prediction time: %fs" % (time_logreg_train, time_logreg_predict))
    print(classification_report(y_test, prediction_logreg))
    most_informative_feature_for_class(vect, logreg_clf, 501)
    print_top10(vect, logreg_clf, logreg_clf.classes_)
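
The last step of the original plan, scoring congressional speech so that each category becomes one column, is not implemented above. A rough sketch of one way to do it with `predict_proba` follows. The bills, codes and speeches are invented placeholders, and in practice the saved vectorizer and classifier would be applied to the text in opt/cong.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression

# Hypothetical training bills with made-up topic codes.
bills = [
    "health care insurance coverage", "health care funding plan",
    "farm crop subsidy program", "farm crop price support",
    "national defense budget increase", "national defense weapons program",
]
codes = [3, 3, 4, 4, 16, 16]

# Hypothetical speeches to score.
speeches = [
    "we must protect health care coverage",
    "the farm crop subsidy helps families",
]

vect = CountVectorizer(ngram_range=(2, 3))
tfidf = TfidfTransformer()
X_bills = tfidf.fit_transform(vect.fit_transform(bills))
clf = LogisticRegression().fit(X_bills, codes)

# Transform speeches with the vectorizer and IDF weights fitted on the bills.
X_speech = tfidf.transform(vect.transform(speeches))

# One column per category; each row holds that speech's class probabilities.
speech_df = pd.DataFrame(clf.predict_proba(X_speech), columns=clf.classes_)
print(speech_df)
```

With `LinearSVC`, which has no `predict_proba`, `decision_function` would give one score column per category in the same way.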