# Week 4-2: Text classification

For this assignment you will build a classifier that predicts the main topic of a bill from its title.

In [7]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
%matplotlib inline

## 1. Create document vectors

In [64]:
# Load bills.csv, a list of thousands of bill titles from the California legislature
# and their subject classifications
df = pd.read_csv('bills.csv', encoding='latin-1')
df

Unnamed: 0,text,topic
0,An act to amend Section 44277 of the Education...,Education
1,An act to add Section 8314.4 to the Government...,Public Services
2,"An act to amend Sections 226, 233, and 234 of,...",Labor and Employment
3,"An act to amend Sections 12920, 12921, 12926, ...",Labor and Employment
4,"An act to amend Section 186.8 of, and to add S...",Crime
5,An act to amend Section 13823.17 of the Penal ...,Social Issues
6,"An act to add Sections 5017.1, 5017.5, and 510...",Business and Consumers
7,An act to add Section 15817.5 to the Governmen...,Housing and Property
8,An act to amend Section 35012 of the Education...,Education
9,"An act to amend Sections 8869.82, 91501, 91502...",Commerce


In [37]:
# Vectorize these suckers with the CountVectorizer, removing stopwords
from sklearn.feature_extraction.text import CountVectorizer

In [38]:
vectorizer = CountVectorizer(stop_words='english', min_df=2)
matrix = vectorizer.fit_transform(df.text)
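
To see what `stop_words` and `min_df` do, here is a minimal sketch on three made-up titles (they are not taken from `bills.csv`, and the `toy_` names are only for illustration). Words that appear in fewer than two titles are dropped by `min_df=2`, and common English words are dropped by the stop word list.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three made-up titles, for illustration only
toy_titles = [
    "An act to amend the Education Code, relating to schools.",
    "An act to amend the Penal Code, relating to crime.",
    "An act relating to schools and crime.",
]

toy_vectorizer = CountVectorizer(stop_words='english', min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each kept word to its column index; sort the words by that index
toy_vocab = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)
print(toy_vocab)
print(toy_matrix.toarray())  # one row per title, one column per kept word
```

"education" and "penal" each occur in only one title, so they do not become features.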

In [62]:
vectorizer.get_feature_names()

['00',
 '0001',
 '001',
 '007',
 '01',
 '010',
 '011',
 '015',
 '018',
 '019',
 '02',
 '020',
 '023',
 '03',
 '030',
 '033',
 '04',
 '0439',
 '05',
 '06',
 '060',
 '0660',
 '07',
 '077',
 '08',
 '0820',
 '0890',
 '09',
 '10',
 '100',
 '1000',
 '100000',
 '1001',
 '10026',
 '10085',
 '10089',
 '101',
 '1010',
 '10105',
 '10111',
 '10112',
 '10119',
 '10123',
 '10127',
 '10128',
 '10133',
 '10140',
 '10144',
 '1016',
 '10177',
 '10181',
 '10198',
 '10199',
 '101990',
 '102',
 '10214',
 '1024',
 '10247',
 '10291',
 '10295',
 '103',
 '103628',
 '10384',
 '104',
 '104113',
 '105',
 '1050',
 '1051',
 '1052',
 '1055',
 '106',
 '10601',
 '10608',
 '10618',
 '1063',
 '10631',
 '10650',
 '107',
 '10700',
 '10708',
 '10709',
 '10710',
 '10711',
 '10712',
 '10713',
 '10752',
 '108',
 '10800',
 '10802',
 '10830',
 '1088',
 '109',
 '1090',
 '10902',
 '10920',
 '1095',
 '11',
 '110',
 '11004',
 '11011',
 '11014',
 '1103',
 '111',
 '11100',
 '11105',
 '11106',
 '1112',
 '1113',
 '11165',
 '1120',
 '11

## There are 3079 features in the 5750 bills

In [66]:
# How many different features do we have?
vectors = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
vectors.head()

Unnamed: 0,0558-001-0001,0650-001-0214,0650-011-0001,0650-101-0214,0820-001-3086,0840-001-0001,1.1,1.1b,1.2,1.5,...,wrongful,year,year-round,yellow,ymca,youth,zanger,zones,zoning,ì¢ââòan
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300391,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [65]:
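from sklearn.feature_extraction.text import TfidfVectorizer

# `tokenize` is a custom tokenizer function that must be defined before this cell runs
vectorizer = TfidfVectorizer(stop_words='english', tokenizer=tokenize)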

matrix = vectorizer.fit_transform(df.text)

# The easiest way to see what happened is to make a dataframe
tfidf = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
tfidf.head()

Unnamed: 0,0558-001-0001,0650-001-0214,0650-011-0001,0650-101-0214,0820-001-3086,0840-001-0001,1.1,1.1b,1.2,1.5,...,wrongful,year,year-round,yellow,ymca,youth,zanger,zones,zoning,ì¢ââòan
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.300391,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
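
The `tokenize` function passed to `TfidfVectorizer` is not defined anywhere in this notebook. Feature names such as `0558-001-0001`, `1.1b` and `year-round` suggest it splits on whitespace and keeps punctuation inside a token. The following is only a guess at such a function, written as a hypothetical stand-in rather than the original:

```python
import string

def tokenize(text):
    """Hypothetical tokenizer: lower-case, split on whitespace,
    and trim punctuation from both ends of each token."""
    tokens = []
    for raw in text.lower().split():
        token = raw.strip(string.punctuation)
        if token:  # skip tokens that were nothing but punctuation
            tokens.append(token)
    return tokens

print(tokenize("An act to amend Section 399.20 of the Public Utilities Code, relating to energy."))
```

Because only the ends of each token are trimmed, a section number like `399.20` survives as a single feature.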


### What words correspond to the first 20 features?
The numbers in the first 20 features above are probably section numbers, that is, sections of existing law that the bills reference repeatedly.

In [67]:
print_sorted_vector(tfidf.iloc[20])

('399.20', 0.7075118834651466)
('energy', 0.41281627170553725)
('utilities', 0.39594420212754344)
('public', 0.2656107565853799)
('amend', 0.16721930639081303)
('section', 0.1509106424746806)
('code', 0.13643690719458848)
('relating', 0.12821048606152977)
('act', 0.1260932136776669)
('ì¢\x89ââ\x81òan', 0.0)
('zoning', 0.0)
('zones', 0.0)
('zanger', 0.0)
('youth', 0.0)
('ymca', 0.0)
('yellow', 0.0)
('year-round', 0.0)
('year', 0.0)
('wrongful', 0.0)
('write-in', 0.0)
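
`print_sorted_vector` is also not defined in this notebook. Judging from the output, it prints the 20 highest-weighted `(term, weight)` pairs, with ties listed in reverse alphabetical order. A minimal sketch under those assumptions (the `sorted_pairs` helper and the `top_n` parameter are made up for illustration):

```python
import pandas as pd

def sorted_pairs(vector):
    """Return (term, weight) pairs from a pandas Series, highest weight first.
    Ties are broken by term, in reverse alphabetical order."""
    pairs = [(term, float(weight)) for term, weight in vector.items()]
    return sorted(pairs, key=lambda pair: (pair[1], pair[0]), reverse=True)

def print_sorted_vector(vector, top_n=20):
    """Hypothetical helper: print the top_n pairs, one per line."""
    for pair in sorted_pairs(vector)[:top_n]:
        print(pair)

# Made-up weights, for illustration only
example = pd.Series({'act': 0.13, 'energy': 0.41, 'zoning': 0.0, 'youth': 0.0})
print_sorted_vector(example, top_n=3)
```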


### The top words in all 5749 documents

In [69]:
docs = tfidf.iloc[:5749]
total = docs.sum(axis=0)
print_sorted_vector(total)

('act', 380.15261900177944)
('relating', 343.28508660004223)
('section', 337.24461395314495)
('code', 328.09589668762464)
('amend', 270.99268735643346)
('add', 222.50404918001374)
('health', 147.2043893651103)
('public', 146.6367778339316)
('sections', 141.23686491754327)
('education', 138.80273908643932)
('government', 131.76092174497018)
('relative', 127.11567762008956)
('taxation', 123.5003842683281)
('safety', 100.44180704540744)
('budget', 98.32867164470612)
('state', 92.8752218728166)
('repeal', 90.34143913369395)
('2009', 83.99980746461084)
('commencing', 76.35844884314109)
('immediately', 74.47646611714144)


## 2. Build a classifier

In [75]:
df.head(10)

Unnamed: 0,text,topic
0,An act to amend Section 44277 of the Education...,Education
1,An act to add Section 8314.4 to the Government...,Public Services
2,"An act to amend Sections 226, 233, and 234 of,...",Labor and Employment
3,"An act to amend Sections 12920, 12921, 12926, ...",Labor and Employment
4,"An act to amend Section 186.8 of, and to add S...",Crime
5,An act to amend Section 13823.17 of the Penal ...,Social Issues
6,"An act to add Sections 5017.1, 5017.5, and 510...",Business and Consumers
7,An act to add Section 15817.5 to the Governmen...,Housing and Property
8,An act to amend Section 35012 of the Education...,Education
9,"An act to amend Sections 8869.82, 91501, 91502...",Commerce


In [71]:
import csv, re, string
from sklearn.linear_model import LogisticRegression
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

In [73]:
# Some basic setup for data-cleaning purposes
punctuation = re.compile('[' + re.escape(string.punctuation) + ']' )

In [79]:
# Make the 'topic' column categorical, so we can print a pretty confusion matrix later
df['topic'] = df['topic'].astype('category')
df.topic.head()

0               Education
1         Public Services
2    Labor and Employment
3    Labor and Employment
4                   Crime
Name: topic, dtype: category
Categories (44, object): [Agriculture and Food, Animal Rights and Wildlife Issues, Arts and Humanities, Budget, Spending, and Taxes, ..., Technology and Communication, Trade, Transportation, Welfare and Poverty]

In [82]:
df.topic.cat.categories

Index(['Agriculture and Food', 'Animal Rights and Wildlife Issues',
       'Arts and Humanities', 'Budget, Spending, and Taxes',
       'Business and Consumers', 'Campaign Finance and Election Issues',
       'Civil Liberties and Civil Rights', 'Commerce', 'Crime', 'Drugs',
       'Education', 'Energy', 'Environmental', 'Executive Branch',
       'Family and Children Issues', 'Federal, State, and Local Relations',
       'Gambling and Gaming', 'Government Reform', 'Guns', 'Health',
       'Housing and Property', 'Immigration', 'Indigenous Peoples',
       'Insurance', 'Judiciary', 'Labor and Employment', 'Legal Issues',
       'Legislative Affairs', 'Military', 'Municipal and County Issues',
       'Other', 'Public Services', 'Recreation', 'Reproductive Issues',
       'Resolutions', 'Science and Medical Research', 'Senior Issues',
       'Sexual Orientation and Gender Issues', 'Social Issues',
       'State Agencies', 'Technology and Communication', 'Trade',
       'Transportation', '

In [83]:
df.topic.head().cat.codes

0    10
1    31
2    25
3    25
4     8
dtype: int8

In [87]:
# Glue the topics back together with the document vectors, into one dataframe
features_and_target = pd.concat([df.topic, vectors],axis=1)

In [88]:
# Now split 20% of combined data into a test set
train, test = train_test_split(features_and_target, test_size=0.20)

In [89]:
x_train = train.iloc[:,1:].values
y_train = train.iloc[:,0].values

dt = tree.DecisionTreeClassifier()
dt.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [None]:
# Build a decision tree on the training data
predict_col = 'topic'  # the column we are trying to predict
x = train.drop(predict_col, axis=1).values
y = train[predict_col].values

In [None]:
from sklearn.tree import export_graphviz
import graphviz

feature_names = list(vectors.columns)  # one name per document-vector column
export_graphviz(dt, 
                feature_names=feature_names, 
                rounded=True,
                out_file="mytree.dot")
with open("mytree.dot") as f:
    dot_graph = f.read()
graphviz.Source(dot_graph)

In [98]:
# Evaluate the tree on the test data and print out the accuracy
x_test = test.iloc[:,1:].values
y_test = test.iloc[:,0].values
y_test_pred = dt.predict(x_test)
metrics.accuracy_score(y_test, y_test_pred)

0.6608695652173913

In [97]:
# Now print out a nicely labelled confusion matrix.
# confusion_matrix uses its first argument for the rows and its second for the columns.
# The predictions are passed first here, so rows are the classifier's guesses
# and columns are the true topics.
truecats = "True " + df.topic.cat.categories
predcats = "Guessed " + df.topic.cat.categories
pd.DataFrame(metrics.confusion_matrix(y_test_pred, y_test, labels=df.topic.cat.categories), columns=truecats, index=predcats)

Unnamed: 0,True Agriculture and Food,True Animal Rights and Wildlife Issues,True Arts and Humanities,"True Budget, Spending, and Taxes",True Business and Consumers,True Campaign Finance and Election Issues,True Civil Liberties and Civil Rights,True Commerce,True Crime,True Drugs,...,True Resolutions,True Science and Medical Research,True Senior Issues,True Sexual Orientation and Gender Issues,True Social Issues,True State Agencies,True Technology and Communication,True Trade,True Transportation,True Welfare and Poverty
Guessed Agriculture and Food,6,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Guessed Animal Rights and Wildlife Issues,2,2,0,0,0,0,0,0,0,0,...,1,0,0,0,0,0,0,0,0,0
Guessed Arts and Humanities,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"Guessed Budget, Spending, and Taxes",0,0,0,56,0,0,0,3,0,0,...,0,0,0,0,0,0,1,0,2,0
Guessed Business and Consumers,0,0,0,0,14,0,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
Guessed Campaign Finance and Election Issues,0,0,0,0,0,36,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Guessed Civil Liberties and Civil Rights,0,0,0,0,0,0,2,0,0,0,...,0,0,0,0,0,0,0,0,0,0
Guessed Commerce,0,0,0,2,2,0,0,22,1,0,...,0,0,0,0,0,1,1,0,1,0
Guessed Crime,0,0,0,0,0,0,0,0,35,0,...,0,0,0,0,3,1,1,0,1,0
Guessed Drugs,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


What's a case -- an entry in this matrix -- where the classifier made a particularly large number of errors? Can you guess why?

Looking at this matrix, 7 documents were guessed "Housing and Property" when they're actually "Budget, Spending, and Taxes." It's possible these documents discussed property taxes, which caused them to be incorrectly classified.
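
Scanning a 44-by-44 matrix by eye is error-prone, so the largest mistake can also be located programmatically. The sketch below uses made-up labels rather than the bill data, and calls `confusion_matrix(y_true, y_pred)` so that in this toy matrix rows are true labels and columns are predictions.

```python
import numpy as np
from sklearn import metrics

# Made-up labels, for illustration only
y_true = ['Tax', 'Tax', 'Tax', 'Housing', 'Housing', 'Crime']
y_pred = ['Tax', 'Housing', 'Housing', 'Housing', 'Tax', 'Crime']
labels = ['Crime', 'Housing', 'Tax']

# Rows are true labels, columns are predicted labels
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)

errors = cm.copy()
np.fill_diagonal(errors, 0)  # the diagonal holds correct predictions, so ignore it
true_idx, pred_idx = np.unravel_index(errors.argmax(), errors.shape)
print(labels[true_idx], 'guessed as', labels[pred_idx], ':', errors[true_idx, pred_idx])
```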

## Bonus: try it on new data
How do we apply this to other bill titles? Ones that weren't originally in the test or training set?


In [12]:
# Here are some other bills
new_titles = [
    "Public postsecondary education: executive officer compensation.",
    "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
    "Political Reform Act of 1974: campaign disclosures.",
    "An act to add Section 236.3 to the Penal Code, relating to human trafficking."]

Your assignment is to vectorize these titles and predict their subject using the classifier we built.
The challenge here is to get these new documents encoded with the same features the classifier expects. We could just run them through a fresh `CountVectorizer`, but then `get_feature_names()` would give us a different set of columns, because the vocabulary of these documents is different.

The solution is to use the `vocabulary` parameter of `CountVectorizer` like this:


In [13]:
# Make a new vectorizer that maps the same words to the same feature positions as the old vectorizer
new_vectorizer = CountVectorizer(stop_words='english', vocabulary=vectorizer.get_feature_names())

In [5]:
# Now use this new_vectorizer to fit the new docs


In [6]:
# Predict the topics of the new documents, using our pre-existing classifier
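
One possible shape for these last two steps is sketched below on a tiny made-up training set, so it runs without `bills.csv`. The names `old_vectorizer`, `clf`, `train_titles` and `train_topics` are assumptions for illustration, and the fitted `vocabulary_` mapping is passed to the `vocabulary` parameter instead of the feature-name list. This is not the only way to do it: calling `transform` on the original fitted vectorizer yields the same columns.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.tree import DecisionTreeClassifier

# Tiny made-up training set standing in for bills.csv
train_titles = [
    "An act relating to school textbooks.",
    "An act relating to college tuition and schools.",
    "An act relating to human trafficking crimes.",
    "An act relating to criminal sentencing and crimes.",
]
train_topics = ["Education", "Education", "Crime", "Crime"]

old_vectorizer = CountVectorizer(stop_words='english')
x_train = old_vectorizer.fit_transform(train_titles)
clf = DecisionTreeClassifier(random_state=0).fit(x_train, train_topics)

# Reuse the learned word -> column mapping so the new columns line up with the old ones
new_vectorizer = CountVectorizer(stop_words='english', vocabulary=old_vectorizer.vocabulary_)
new_titles = ["An act relating to schools and textbooks.", "An act relating to crimes."]
x_new = new_vectorizer.transform(new_titles)

print(clf.predict(x_new))
```

Words in the new titles that were never seen during training are simply ignored, since they have no column in the fixed vocabulary.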
