# Text classification: the basics

The aim of this notebook is to take what I learnt during the DSTL Multiple Label Classifcation data science challenge and distill it into prototype code so that I have a template for any text classification I do in future.

This competition did not allow you to retain the data so everything here is done using data provided by sklearn.


In [1]:
from __future__ import division

import re
import string
import unicodedata
import numpy as np
import pandas as pd

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem.snowball import SnowballStemmer

from xgboost import XGBClassifier

from sklearn.externals import joblib
from sklearn.datasets import fetch_20newsgroups
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

## Getting some data

Download some example data which has three catagories which appear evenly. Then store the text itself and the catagory in a pandas dataframe with two columns.

In [2]:
catagories = ['sci.med', 'sci.electronics', 'sci.space']
text_data = fetch_20newsgroups(categories=catagories,
                               random_state=42,
                               remove=('headers', 'footers', 'quotes'))

df = pd.DataFrame()
df['text'] = text_data.data
df['topic'] = text_data.target

# Remove blank lines
df = df[ df['text']!= "" ]

df.head(10)

Unnamed: 0,text,topic
0,Another fish to check out is Richard Rast -- h...,2
1,: As the subject says - Can I use a 4052 for d...,0
2,I am looking for current sources for lists of ...,1
3,"\n\nBut why do you characterize this as a ""fli...",1
4,\nIt was more than a theoretical concept; it w...,2
5,\n\n\nThe name is rather descriptive. It's a ...,2
6,My mom has just been diagnosed with cystic bre...,1
7,\n\nThe yearly chest x-ray provides a minute a...,1
9,I've recently listened to a tape by Dr. Stanis...,1
10,We've just been donated a large machine for us...,0


These catagories are lablled 0, 1, and 2. By resetting all the labels of catagory 2 to 0 we can simulate unbalanced classes on a simply 0/1 classification problem. Let's count the number of documents in each catagory before and after this transformation. 

In [3]:
df_catagories = pd.DataFrame()
df_catagories['count'] = df.groupby('topic').count()
df_catagories

Unnamed: 0,count
0,577
1,583
2,580


In [4]:
df.loc[df['topic']==2, 'topic' ] = 0

df_catagories = pd.DataFrame()
df_catagories['count'] = df.groupby('topic').count()
df_catagories

Unnamed: 0,count
0,1157
1,583


And finally split the data into a training set and a test set.

In [5]:
df_train, df_test = train_test_split(df, test_size=0.3, random_state=42)

## Preparing the text

Next we need to define a function which tokenise the text (in this case converting it from a string to a list of words), stems the words (so converts 'run', 'runs', and 'running' all to 'run') and generally cleans up the text by removing punctuation, putting all the text into lower case etc. Stemming and these cleaning steps may or may not improve the classifier. It is always worth turning these off and on to see whether they do improve accuracy.

In [6]:
def tokeniser(text):
    
    # Remove any whitespace at the start and end of the string
    # and remove any stray tabs and newline characters
    text = text.strip()
    
    # Remove any weird unicode characters
    if isinstance(text, unicode):
        text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore')
        
    # Convert hyphens and slashes to spaces
    text = re.sub(r'[-/]+',' ',text)
    
    # Remove remaining punctuation
    text = text.translate(None, string.punctuation)
    
    # Convert the text to lowercase and use nltk tokeniser
    tokens = word_tokenize(text.lower())
    
    # Define a list of stopwords apart from the word 'not'
    stops = set(stopwords.words('english')) - set(('not'))

    # Define stemmer
    stemmer = SnowballStemmer('english')

    return [stemmer.stem(i) for i in tokens if i not in stops]


Let us try out this function on a test string.

In [7]:
tokeniser("HERE is some text, albeit example/test text, THAT demonstrates what we are and aren't doing.")

[u'text', u'albeit', u'exampl', u'test', u'text', u'demonstr', u'arent']

## Fitting the model

When we create the model itself we are going to use weighting to correct for class imbalance. To do this we first need to know the number of 1s and 0s in the training data.

In [8]:
num_1s = df_train[df_train['topic']==1]['topic'].count()
num_0s = df_train[df_train['topic']==0]['topic'].count()

print "Number of 1s:", num_1s
print "Number of 0s:", num_0s

Number of 1s: 418
Number of 0s: 800


Next we define and fit a sklearn pipeline. This pipeline chains together a hashing vectorizer, a term frequency-inverse document frequency (TF-IDF) transform, singular value decomposition (SVD), and then classification using XGBoost.

The hashing vectoriser represents each unique word (after preprocessing which could include stemming and removing punctuation and stopwords) by a number and counts the number of times each word occurs in every document. Each document could then be represented by a vector where each element represents corresponds to a different word and the value of the element is equal to the number of times that word occurred in the document. A hashing vectoriser uses less memory then a count vectoriser but the downside is that you cannot go from the numerical representation of the words back to the words themselves, this makes interupting intermediate steps more difficult.

The TF-IDF transform weights each of the components of these vectors by a number that takes into account how common the word is across the whole corpus and how common it is in that particular document. This means that words that occur very frequently are down weighted and rare words are upweighted.

Truncated SVD is a form of principal component analysis which works well with sparce matrices. This reduces the number of dimensions by breaking down the matrix that descirbes the correlations between words across the documents into eigenvalues/eigenvevectors. Only a subset of eigenvectors are retained, this subset being the eigenvecots that capture the most variance. This approach corresponds to Latent Semantic Analysis.

After this the XGBoost version of a gradient boosted decision tree classifier is fitted to the transformed, dimensionally reduced data.

In [9]:
# Define the pipeline
vectoriser = Pipeline([
    ('vect', HashingVectorizer(tokenizer=tokeniser,
                               decode_error='replace',
                               #strip_accents='unicode',
                               ngram_range=(1,3))
    ),
    ('tfidf', TfidfTransformer()),
    ('svd', TruncatedSVD(n_components=100,
                         random_state=42)
    ),
    ('xgb', XGBClassifier(max_depth=6,
                          seed=42,
                          n_estimators=200,
                          scale_pos_weight=num_0s/num_1s)
    ),
])

# Fit the model 
vectoriser.fit(df_train['text'], df_train['topic'])

Pipeline(steps=[('vect', HashingVectorizer(analyzer=u'word', binary=False, decode_error='replace',
         dtype=<type 'numpy.float64'>, encoding=u'utf-8', input=u'content',
         lowercase=True, n_features=1048576, ngram_range=(1, 3),
         non_negative=False, norm=u'l2', preprocessor=None,
         ...eg_lambda=1,
       scale_pos_weight=1.9138755980861244, seed=42, silent=True,
       subsample=1))])

This model (which was fitted on the training data) can then be used to produce predictions for the topics contained within the test data. These predictions will later be compared to the actual, known values in order to determine the accuracy of the model.

Note, pandas can not keep track of all the copies of a dataframe so throws up an error with the line below which stores the predictions back into the test dataframe. However, this line does work so I have disabled warnings beforehand and re-activated them afterwards in order to avoid the warning message. This is not strictly necessary but makes the ouput prettier.

In [10]:
# Suppress pandas warning
pd.options.mode.chained_assignment = None

# Make predictions for the test data and store 
# the results back into the test dataframe
df_test['predict'] = vectoriser.predict(df_test['text'])

# Reset pandas warning
pd.options.mode.chained_assignment = 'warn'

# Show the first ten entries in the dataframe 
df_test.head(10)

Unnamed: 0,text,topic,predict
488,I am in the midst of designing a project which...,0,0
1539,\n\n\n\n,0,1
969,"\nWhatabout, Schools, Universities, Rich Indiv...",0,0
1025,"Umm, perhaps you could explain what 'rights' ...",0,0
719,"\nReading this definition, I wonder: when shou...",1,0
277,\n\n\nIf you want to have some fun.\n\nPlug th...,0,0
443,"Actually, they are legal! I not familiar with ...",0,0
1336,Subject: options before back surgery for protr...,1,1
1160,I have a HP 1740 scope that (I think) has a pr...,0,0
624,\nSure. Contact the World Space Foundation. ...,0,0


## Evaluate the model

The precision, recall and f1-score can be outputted to see how well the model is performing on the test data.

In [11]:
print classification_report(df_test['topic'], df_test['predict'])

             precision    recall  f1-score   support

          0       0.92      0.96      0.94       357
          1       0.91      0.81      0.86       165

avg / total       0.92      0.92      0.91       522



## Save the model

You can use joblib to save fitted sklearn models and then reload them later. For example, you might want to save a model and later use it to make more predictions, or if the model has a time consuming preprocessing step that you do not want to repeat whilst hyperparameter tuining these could be split out into a seperate model which is run and saved. For example, the tfidf and svd steps here could be split out into a seperate preprocessing pipeline and then could be fitted once before tuning the hyperparameters of the classifier itself.

Below are the joblib commands (commented out). Joblib is recommended over pickle for sklearn models.

In [12]:
# To save a model use dump
# joblib.dump(vectoriser, "./saved_model.pkl")

# To load a saved model use load
# vectoriser = joblib.load("./saved_model.pkl")

## Hyperparameter tuning

One method of choosing hyperparameters is to save the values you want to try in a dictionary and then use GridSearchCV to perform cross-validation to estimate the training error for all combinations, choosing the parameters that gives the best results according to the metric chosen. These best parameters can be outputted and the model outputted from the grid search can be used for prediction like regular sklearn models.

Here, the name given to the xgboost classifier ('xgb') has to be given so that GridSearchCV knowns which part of the pipeline these parameters are associated with.

In [13]:
params = dict(xgb__max_depth=[3,7], xgb__n_estimators=[100])

grid = GridSearchCV(vectoriser, param_grid=params, scoring='f1_micro', cv=2)

grid.fit(df_train['text'], df_train['topic'])

print grid.best_params_

{'xgb__n_estimators': 100, 'xgb__max_depth': 7}
