# Week 4-2: Text classification

For this assignment you will build a classifier that figures out the main topics of a bill, from its title.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics

%matplotlib inline

## 1. Create document vectors

In [2]:
# Load up bills.csv This is a list of thousands of bill titles from the California legislature, 
# and their subject classifications
pd.set_option('max_colwidth',-1)
df = pd.read_csv('bills.csv', encoding='latin-1')
df.head()

Unnamed: 0,text,topic
0,"An act to amend Section 44277 of the Education Code, relating to teachers.",Education
1,"An act to add Section 8314.4 to the Government Code, relating to public funds.",Public Services
2,"An act to amend Sections 226, 233, and 234 of, and to add Article 1.5 (commencing with Section 245) to Chapter 1 of Part 1 of Division 2 of, the Labor Code, relating to employment.",Labor and Employment
3,"An act to amend Sections 12920, 12921, 12926, 12940, and 12955.2 of the Government Code, relating to employment.",Labor and Employment
4,"An act to amend Section 186.8 of, and to add Section 236.4 to, the Penal Code, relating to human trafficking.",Crime


In [3]:
# Vectorize these suckers with the CountVectorizer, removing stopwords
vectorizer = CountVectorizer()
matrix = vectorizer.fit_transform(df.text)

In [4]:
# How many different features do we have?
len(vectorizer.get_feature_names())

7547

In [5]:
# What words correspond to the first 20 features?
vectorizer.get_feature_names()[:20]

['00',
 '0001',
 '001',
 '003',
 '0046',
 '007',
 '008',
 '01',
 '010',
 '011',
 '012',
 '013',
 '014',
 '015',
 '016',
 '018',
 '019',
 '02',
 '020',
 '0214']

## 2. Build a classifier

In [6]:
# Make the 'topic' column categorical, so we can print a pretty confusion matrix later
df.topic = df.topic.astype('category')

In [7]:
vectors = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
vectors.head()

Unnamed: 0,00,0001,001,003,0046,007,008,01,010,011,...,yellow,ymca,your,youth,zanger,zones,zoning,ââ,ââå,òan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [8]:
# Glue the topics back together with the document vectors, into one dataframe
features_and_topic = pd.concat([df.topic, vectors],axis=1)
features_and_topic.head()

Unnamed: 0,topic,00,0001,001,003,0046,007,008,01,010,...,yellow,ymca,your,youth,zanger,zones,zoning,ââ,ââå,òan
0,Education,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Public Services,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Labor and Employment,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Labor and Employment,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Crime,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [9]:
# Now split 20% of combined data into a test set
train, test = train_test_split(features_and_topic, test_size=0.2)

In [10]:
# Build a decision tree on the training data
x_train = train.iloc[:,1:].values
y_train = train.iloc[:,0].values

dt = tree.DecisionTreeClassifier()
dt.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [11]:
# Evaluate the tree on the test data and print out the accuracy
x_test = test.iloc[:,1:].values
y_test = test.iloc[:,0].values
y_test_pred = dt.predict(x_test)
metrics.accuracy_score(y_test_pred, y_test)

0.6695652173913044

In [12]:
# Now print out a nicely labelled confusion natrix
truecats = "True " + df.topic.cat.categories
predcats = "Guessed " + df.topic.cat.categories
pd.DataFrame(metrics.confusion_matrix(y_test_pred, y_test, labels=df.topic.cat.categories), columns=predcats, index=truecats)

Unnamed: 0,Guessed Agriculture and Food,Guessed Animal Rights and Wildlife Issues,Guessed Arts and Humanities,"Guessed Budget, Spending, and Taxes",Guessed Business and Consumers,Guessed Campaign Finance and Election Issues,Guessed Civil Liberties and Civil Rights,Guessed Commerce,Guessed Crime,Guessed Drugs,...,Guessed Resolutions,Guessed Science and Medical Research,Guessed Senior Issues,Guessed Sexual Orientation and Gender Issues,Guessed Social Issues,Guessed State Agencies,Guessed Technology and Communication,Guessed Trade,Guessed Transportation,Guessed Welfare and Poverty
True Agriculture and Food,10,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
True Animal Rights and Wildlife Issues,1,6,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,1,0
True Arts and Humanities,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
"True Budget, Spending, and Taxes",0,0,0,52,2,0,0,3,0,0,...,0,0,0,0,0,3,1,0,1,0
True Business and Consumers,0,0,0,1,13,0,0,3,0,0,...,0,0,0,0,0,1,0,0,0,0
True Campaign Finance and Election Issues,0,0,0,1,0,36,0,1,0,0,...,0,0,0,0,0,0,0,0,0,0
True Civil Liberties and Civil Rights,0,0,0,0,0,0,4,0,0,0,...,0,0,0,0,0,0,0,0,0,0
True Commerce,1,0,0,0,0,1,0,19,2,0,...,0,0,0,0,0,1,0,0,1,0
True Crime,0,0,0,0,0,0,0,0,38,0,...,0,0,0,0,0,3,1,0,1,0
True Drugs,0,0,0,0,1,0,0,0,0,2,...,0,0,0,0,0,0,0,0,0,0


What's a case -- an entry in thie matrix -- where the classifier made a particularly large number of errors? Can you guess why?

Looking at this matrix, 7 documents were guessed "Budget, Spending, and Taxes" when they're actually "Housing and Property." It's possible these documents discussed property taxes, which caused them to be incorrectly classified.

## Bonus: try it on new data
How do we apply this to other bill titles? Ones that weren't originally in the test or training set?


In [13]:
# Here are some other bills
new_titles = [
    "Public postsecondary education: executive officer compensation.",
    "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
    "Political Reform Act of 1974: campaign disclosures.",
    "An act to add Section 236.3 to the Penal Code, relating to human trafficking."]

Your assighnment is to vectorize these titles, and predict their subject using the classifier we built.
The challenge here is to get these new documents encoded with the same features as the classifier expects. That is, we could just run them through `CountVectorizer` but then get_feature_names() would give us a different set of coluns, because the vocabulary of these documents is different.

The solution is to use the `vocabulary` parameter of `CountVectorizer` like this:


In [14]:
# Make a new vectorizer that maps the same words to the same feature positions as the old vectorizer
new_vectorizer = CountVectorizer(stop_words='english', vocabulary=vectorizer.get_feature_names())

In [15]:
# Now use this new_vectorizer to fit the new docs
new_matrix = new_vectorizer.fit_transform(new_titles)

In [16]:
new_vectors = pd.DataFrame(new_matrix.toarray(), columns=vectorizer.get_feature_names())
new_vectors

Unnamed: 0,00,0001,001,003,0046,007,008,01,010,011,...,yellow,ymca,your,youth,zanger,zones,zoning,ââ,ââå,òan
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [17]:
# Predict the topics of the new documents, using our pre-existing classifier
dt.predict(new_vectors.values)

array(['Education', 'Education', 'Campaign Finance and Election Issues',
       'Family and Children Issues'], dtype=object)