# Week 4-2: Text classification

In this assignment you will build a classifier that predicts the main topic of a bill from its title.

In [1]:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.ensemble import RandomForestClassifier
from sklearn import tree
from sklearn.model_selection import train_test_split
from sklearn import metrics
%matplotlib inline


## 1. Create document vectors

In [6]:
# Load up bills.csv. This is a list of thousands of bill titles from the California legislature,
# and their subject classifications

df = pd.read_csv('bills.csv', encoding='utf-8')
df.head()

Unnamed: 0,text,topic
0,An act to amend Section 44277 of the Education...,Education
1,An act to add Section 8314.4 to the Government...,Public Services
2,"An act to amend Sections 226, 233, and 234 of,...",Labor and Employment
3,"An act to amend Sections 12920, 12921, 12926, ...",Labor and Employment
4,"An act to amend Section 186.8 of, and to add S...",Crime


In [18]:
df.shape

(5749, 2)

In [26]:
# Vectorize these suckers with the CountVectorizer, removing stopwords

# min_df=2 throws out any word that doesn't appear in at least two docs
vectorizer = CountVectorizer(stop_words='english', min_df=2)
# Note: this fits on df.topic (the subject labels); the bill titles themselves are in df.text
matrix = vectorizer.fit_transform(df.topic)
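To see what `stop_words` and `min_df` do, here is a minimal sketch on three made-up titles (they are hypothetical and not taken from bills.csv). The names `toy_titles`, `toy_vectorizer`, `toy_matrix` and `kept_words` are only for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles, not taken from bills.csv
toy_titles = [
    "An act relating to school funding",
    "An act relating to school safety",
    "An act relating to water quality",
]

# Same settings as the cell above: drop English stopwords,
# and drop any word that appears in fewer than two documents
toy_vectorizer = CountVectorizer(stop_words="english", min_df=2)
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# vocabulary_ maps each kept word to its column index
kept_words = sorted(toy_vectorizer.vocabulary_)
print(kept_words)
print(toy_matrix.shape)  # (number of documents, number of features)
```

Words such as "funding" or "water" occur in only one toy title, so `min_df=2` removes them, while "an" and "to" are removed as stopwords.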

In [27]:
# How many different features do we have?
# (df.shape counts DataFrame rows and columns; the feature count is matrix.shape[1])
df.shape

(5749, 3)

In [32]:
# What words correspond to the first 20 features?
# (this line instead copies the first 20 characters of each topic into a 'target' column)
df['target'] = df['topic'].str[0:20]
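The two questions above (how many features, and which words the first 20 features correspond to) can be answered from the fitted vectorizer and the document-term matrix. The following is a sketch on hypothetical toy titles. It reads the `vocabulary_` attribute rather than calling a feature-name method, because the name of that method differs between scikit-learn versions.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy titles, not taken from bills.csv
toy_titles = [
    "An act relating to school funding",
    "An act relating to water quality",
    "An act relating to human trafficking",
]

toy_vectorizer = CountVectorizer(stop_words="english")
toy_matrix = toy_vectorizer.fit_transform(toy_titles)

# Number of features = number of columns in the document-term matrix
n_features = toy_matrix.shape[1]

# vocabulary_ maps word -> column index; sorting by index gives column order
feature_names = sorted(toy_vectorizer.vocabulary_, key=toy_vectorizer.vocabulary_.get)

print(n_features)
print(feature_names[:20])  # words behind the first 20 columns (fewer here)
```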

## 2. Build a classifier

In [22]:
# Put the document vectors into a DataFrame, one column per vocabulary word
# (scikit-learn 1.0+ names this method get_feature_names_out())
vectors = pd.DataFrame(matrix.toarray(), columns=vectorizer.get_feature_names())
vectors.head()

Unnamed: 0,affairs,agencies,agriculture,animal,arts,budget,business,campaign,children,civil,...,sexual,social,spending,state,taxes,technology,trade,transportation,welfare,wildlife
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [33]:
# Glue the topics back together with the document vectors, into one dataframe
features_and_target = pd.concat([df.target, vectors],axis=1)
features_and_target.head()

Unnamed: 0,target,affairs,agencies,agriculture,animal,arts,budget,business,campaign,children,...,sexual,social,spending,state,taxes,technology,trade,transportation,welfare,wildlife
0,Education,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,Public Services,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,Labor and Employment,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,Labor and Employment,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,Crime,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


In [36]:
# Now split 20% of combined data into a test set
train, test = train_test_split(features_and_target, test_size=0.2)

In [46]:
# Build a decision tree on the training data
x_train = train.drop('target', axis=1).values
y_train = train['target'].values  # 1-D array of labels

dt = tree.DecisionTreeClassifier()
dt.fit(x_train,y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=None,
            max_features=None, max_leaf_nodes=None,
            min_impurity_decrease=0.0, min_impurity_split=None,
            min_samples_leaf=1, min_samples_split=2,
            min_weight_fraction_leaf=0.0, presort=False, random_state=None,
            splitter='best')

In [48]:
# Evaluate the tree on the test data and print out the accuracy
x_test = test.drop('target', axis=1).values
y_test = test['target'].values  # 1-D array of labels
y_test_pred = dt.predict(x_test)
metrics.accuracy_score(y_test, y_test_pred)  # signature is (y_true, y_pred)

0.9991304347826087
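An accuracy this close to 1.0 deserves a second look. The features above were built from `df.topic`, the same column the target comes from, so the tree can read the label straight from its inputs. The sketch below illustrates the effect on made-up data: the same helper is scored once with features built from the labels and once with features built from the titles. All names and rows here are hypothetical; two of the toy titles are deliberately identical but carry different topics, so title-based features cannot separate them.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn import metrics, tree

# Hypothetical toy data, repeated so there is something to split
toy_titles = [
    "An act relating to school funding",  # Education
    "An act relating to school funding",  # Budget (same wording, different topic)
    "An act relating to hospitals",       # Health
] * 10
toy_topics = ["Education", "Budget", "Health"] * 10


def held_out_accuracy(documents, labels):
    """Vectorize documents, fit a tree on 80% of rows, score on the other 20%."""
    X = CountVectorizer(stop_words="english").fit_transform(documents).toarray()
    X_train, X_test, y_train, y_test = train_test_split(
        X, labels, test_size=0.2, random_state=0, stratify=labels
    )
    model = tree.DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    return metrics.accuracy_score(y_test, model.predict(X_test))


# Features built from the labels themselves: the answer is in the input
leaky = held_out_accuracy(toy_topics, toy_topics)

# Features built from the titles: the ambiguous rows cannot all be right
from_titles = held_out_accuracy(toy_titles, toy_topics)

print(leaky, from_titles)
```

On the real data, the equivalent change would be to pass `df.text` to the vectorizer, which is likely to give a lower and more informative accuracy.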

In [49]:
# Now print out a nicely labelled confusion matrix
metrics.confusion_matrix(y_test, y_test_pred)  # rows are true topics, columns are predicted

array([[12,  0,  0, ...,  0,  0,  0],
       [ 0,  6,  0, ...,  0,  0,  0],
       [ 0,  0, 64, ...,  0,  0,  0],
       ...,
       [ 0,  0,  0, ..., 11,  0,  0],
       [ 0,  0,  0, ...,  0, 76,  0],
       [ 0,  0,  0, ...,  0,  0,  1]], dtype=int64)

What's a case -- an entry in this matrix -- where the classifier made a particularly large number of errors? Can you guess why?

Looking at this matrix, 7 documents were guessed "Budget, Spending, and Taxes" when they're actually "Housing and Property." It's possible these documents discussed property taxes, which caused them to be incorrectly classified.
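A bare array makes it hard to tell which row and column belong to which topic. One way to get a labelled matrix, sketched here on hypothetical labels, is to pass an explicit `labels` list to `confusion_matrix` and wrap the result in a DataFrame. The variable names and the toy predictions are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn import metrics

# Hypothetical true and predicted topics
y_true = ["Crime", "Crime", "Education", "Health", "Health", "Health", "Health"]
y_pred = ["Crime", "Education", "Education", "Health", "Health", "Crime", "Crime"]

# Fix the label order so rows and columns can be named
labels = sorted(set(y_true) | set(y_pred))
cm = metrics.confusion_matrix(y_true, y_pred, labels=labels)

# Rows are true topics, columns are predicted topics
cm_df = pd.DataFrame(cm, index=labels, columns=labels)
cm_df.index.name = "actual"
cm_df.columns.name = "predicted"
print(cm_df)

# Zero the diagonal (correct predictions) and find the largest remaining entry
errors = cm.copy()
np.fill_diagonal(errors, 0)
i, j = np.unravel_index(errors.argmax(), errors.shape)
print(labels[i], "predicted as", labels[j], ":", errors[i, j])
```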

## Bonus: try it on new data
How do we apply this to other bill titles, ones that weren't in the training or test set?


In [12]:
# Here are some other bills
new_titles = [
    "Public postsecondary education: executive officer compensation.",
    "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
    "Political Reform Act of 1974: campaign disclosures.",
    "An act to add Section 236.3 to the Penal Code, relating to human trafficking."]

Your assignment is to vectorize these titles and predict their subject using the classifier we built.
The challenge here is to get these new documents encoded with the same features the classifier expects. That is, we could just run them through `CountVectorizer`, but then `get_feature_names()` would give us a different set of columns, because the vocabulary of these documents is different.

The solution is to use the `vocabulary` parameter of `CountVectorizer` like this:


In [13]:
# Make a new vectorizer that maps the same words to the same feature positions as the old vectorizer
new_vectorizer = CountVectorizer(stop_words='english', vocabulary=vectorizer.get_feature_names())
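To make the column-alignment problem concrete, here is a small sketch on hypothetical titles. It compares a vectorizer refit on the new documents with one whose vocabulary is pinned to the original. It passes the fitted `vocabulary_` mapping (word to column index) rather than a list of feature names, which is an assumption made so that the sketch does not depend on the scikit-learn version.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical documents, not taken from bills.csv
train_titles = ["school funding act", "prison funding act"]
new_titles_toy = ["textbook pricing act for school students"]

fitted = CountVectorizer(stop_words="english").fit(train_titles)

# Refitting on the new titles learns a different vocabulary, so columns differ
refit = CountVectorizer(stop_words="english").fit(new_titles_toy)
print(sorted(fitted.vocabulary_))
print(sorted(refit.vocabulary_))

# Pinning the vocabulary keeps every word in its original column;
# words the original vectorizer never saw are simply ignored
pinned = CountVectorizer(stop_words="english", vocabulary=fitted.vocabulary_)
new_matrix = pinned.transform(new_titles_toy)
print(new_matrix.toarray())
```

Calling `transform` on the already fitted vectorizer reuses the learned vocabulary too, so it should give the same columns without building a second vectorizer.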

In [5]:
# Now use this new_vectorizer to fit the new docs


In [6]:
# Predict the topics of the new documents, using our pre-existing classifier
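The two empty cells above could be filled in along the following lines. This is a self-contained sketch, not the assignment's reference solution: it trains a stand-in tree on four hypothetical titles (`toy_train_titles`, `toy_train_topics`) because bills.csv is not available here, then encodes two of the new titles with the pinned vocabulary and predicts their topics.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn import tree

# Hypothetical training rows standing in for bills.csv
toy_train_titles = [
    "An act relating to college textbooks",
    "An act relating to school teachers",
    "An act relating to human trafficking",
    "An act relating to prison sentences",
]
toy_train_topics = ["Education", "Education", "Crime", "Crime"]

# Fit the vectorizer and the tree on the training titles
toy_vectorizer = CountVectorizer(stop_words="english")
toy_clf = tree.DecisionTreeClassifier(random_state=0)
toy_clf.fit(toy_vectorizer.fit_transform(toy_train_titles).toarray(), toy_train_topics)

# Two of the new titles from the cell above
new_titles = [
    "An act to add Section 236.3 to the Education code, related to the pricing of college textbooks.",
    "An act to add Section 236.3 to the Penal Code, relating to human trafficking.",
]

# Same words in the same columns as the training vectors
toy_new_vectorizer = CountVectorizer(
    stop_words="english", vocabulary=toy_vectorizer.vocabulary_
)
new_vectors = toy_new_vectorizer.transform(new_titles).toarray()

# Predict with the classifier trained earlier
predicted = toy_clf.predict(new_vectors)
for title, topic in zip(new_titles, predicted):
    print(topic, "<-", title)
```

With a fixed vocabulary there is nothing left to learn from the new documents, so `transform` is enough; `fit_transform` would return the same matrix.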
