# Document Classification

This notebook demonstrates the automated classification of records into clusters identified in a prior auto-clustering step.

### Loading the Training Data

We use the results of auto-clustering the existing documents as training data for our classifier model. The first step is to load those clustering results.

Unpickling the clustering results requires the `tokenize()` function defined in the clustering notebook, so we have to restate its definition here for the load to work.

In [1]:
import nltk
import re

def tokenize(text):
    # get the english stopwords
    nltk.download('stopwords', quiet=True)
    stopwords = nltk.corpus.stopwords.words('english')
    # first tokenize by sentence, then by word to ensure that punctuation is caught as its own token
    nltk.download('punkt', quiet=True)
    tokens = [word.lower() for sent in nltk.sent_tokenize(text) for word in nltk.wordpunct_tokenize(sent)]
    filtered_tokens = []
    # filter out any tokens not containing letters (e.g., numeric tokens, raw punctuation)
    for token in tokens:
        # include only those that contain letters
        if re.search('[a-zA-Z]', token):
            # exclude stop words, those shorter than 3 characters, and those that
            # start with non-alphanumeric characters
            if token not in stopwords and len(token) > 2 and token[0].isalnum():
                filtered_tokens.append(token)
    return filtered_tokens
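
The reason for restating the function is that `pickle` stores functions by reference (module and name) rather than by value. The sketch below illustrates this with a standard-library function, `textwrap.dedent`, chosen only as a convenient stand-in. It assumes the pickled vectorizer holds a reference to `tokenize()` in the same way, which the clustering notebook would need to confirm.

```python
import pickle
import textwrap

# Pickle a module-level function from the standard library.
payload = pickle.dumps(textwrap.dedent)

# The payload records where to find the function, not its source code.
stored_by_reference = b"textwrap" in payload and b"dedent" in payload

# Unpickling imports the module and looks the name up again,
# so the name must be resolvable at load time.
restored = pickle.loads(payload)
print(stored_by_reference, restored is textwrap.dedent)
```

If the name cannot be found when loading, `pickle.load()` fails, which is why a function with the same name has to exist in this notebook before the load cell runs.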

In [2]:
import pickle

# Load the training data (i.e. the results of the clustering)
with open('temp/clusters.pkl', 'rb') as infile:
    clustering_output = pickle.load(infile)
clusters = clustering_output['clusters']
vectorizer = clustering_output['vectorizer']
feature_matrix = clustering_output['features']
cluster_terms = clustering_output['terms']
# Note: get_feature_names() was removed in scikit-learn 1.2;
# newer versions provide get_feature_names_out() instead
feature_names = vectorizer.get_feature_names()

### Building the Classifier Model

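We will use a classifier that implements regularized linear models with stochastic gradient descent (SGD) learning. With `loss='hinge'` and `penalty='l2'`, as configured in the next cell, the model is a linear support vector machine trained with SGD.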

Read more here: http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
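
To make the idea concrete, here is a minimal pure-Python sketch of SGD on an L2-regularized hinge loss for a two-class problem. It is not scikit-learn's implementation: the function name `sgd_hinge_fit`, the constant learning rate `eta`, and the toy data are assumptions made for illustration, whereas `SGDClassifier` uses its own learning-rate schedule and handles multiple classes.

```python
def sgd_hinge_fit(X, y, alpha=1e-3, eta=0.1, epochs=5):
    """Fit weights w and bias b with plain SGD; labels must be -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            # L2 penalty: shrink the weights a little on every step
            w = [wj * (1.0 - eta * alpha) for wj in w]
            # Hinge loss: update only when the sample is inside the margin
            if yi * score < 1.0:
                w = [wj + eta * yi * xj for wj, xj in zip(w, xi)]
                b += eta * yi
    return w, b


def predict(X, w, b):
    """Return +1 or -1 depending on the sign of the linear score."""
    return [1 if sum(wj * xj for wj, xj in zip(w, xi)) + b >= 0 else -1
            for xi in X]


# Toy, linearly separable data (hypothetical)
X_toy = [[2.0, 0.0], [3.0, 1.0], [-2.0, 0.0], [-3.0, -1.0]]
y_toy = [1, 1, -1, -1]

w, b = sgd_hinge_fit(X_toy, y_toy)
toy_predictions = predict(X_toy, w, b)
print(toy_predictions)
```

Here `alpha` plays the same role as the `alpha` option of `SGDClassifier` (the strength of the regularization), and `epochs` corresponds loosely to `max_iter`.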

In [3]:
from sklearn.linear_model import SGDClassifier

# Instantiate the classifier object and define the options
classifier = SGDClassifier(loss='hinge', penalty='l2', 
    alpha=1e-3, random_state=42, max_iter=5, tol=None)

# Perform the fitting of the feature matrix to the clusters
classifier.fit(feature_matrix, clusters['cluster'].values)

SGDClassifier(alpha=0.001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=5, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=42, shuffle=True,
       tol=None, verbose=0, warm_start=False)

The model can be saved for reuse later, instead of building again from training data:

In [4]:
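try:
    import joblib  # standalone package, needed for scikit-learn >= 0.23
except ImportError:
    from sklearn.externals import joblib  # older scikit-learn versions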

# We dump both the vectorizer and the classifier into a pickle file
outfile = 'temp/classifier.pkl'
joblib.dump((vectorizer, classifier), outfile)

['temp/classifier.pkl']
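
Reloading works the same way in reverse: `joblib.load(outfile)` returns the tuple that was dumped, which can be unpacked back into the vectorizer and the classifier. The sketch below shows the round trip using the standard-library `pickle` module and placeholder objects, so that it runs without the trained model. The stub dictionaries and the temporary path are hypothetical.

```python
import os
import pickle
import tempfile

# Stand-ins for the fitted objects (hypothetical placeholders, not real models)
vectorizer_stub = {"kind": "vectorizer"}
classifier_stub = {"kind": "classifier"}

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "classifier.pkl")

    # Dump both objects together as one tuple
    with open(path, "wb") as fh:
        pickle.dump((vectorizer_stub, classifier_stub), fh)

    # Load the tuple and unpack it in the same order it was saved
    with open(path, "rb") as fh:
        loaded_vectorizer, loaded_classifier = pickle.load(fh)

print(loaded_vectorizer["kind"], loaded_classifier["kind"])
```

As with the clustering results, a session that reloads the real vectorizer would presumably also need `tokenize()` defined beforehand.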

### Classifying New Records

Once a trained classifier model is built, it can be used to classify new records:

In [5]:
import pandas as pd

new_records = [
    ('record_id_1', 'Testimonials, companies, strategy, FinTech, session, website, financial, Enterprise, results'),
    ('record_id_2', 'Contact, Location, Biology, Accelerate, Learning, Machine, High-Throughput, Computational'),
    ('record_id_3', 'Personalized, Recipe, completely, different, Nutrition, country, another, Welcome, recipes, Research'),
    ('record_id_4', 'Contact, information, facilities, department, Categories'),
    ('record_id 5', 'Program, programs, Contact, students, Graduate, requirements, Curriculum')
]

ndf = pd.DataFrame(new_records, columns=['id', 'keywords'])
counts = vectorizer.transform(ndf['keywords'].values)
predictions = classifier.predict(counts)

# Show the predictions
predictions

array([1, 1, 1, 1, 1])
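
When interpreting predictions, it helps to remember that `transform()` maps tokens onto the vocabulary learned during fitting, and scikit-learn's text vectorizers ignore terms that are not in that vocabulary. The sketch below imitates this with plain term counts. The function `count_known_terms` and the three-word vocabulary are hypothetical, and the real vectorizer may apply a different weighting.

```python
def count_known_terms(tokens, vocabulary):
    """Build one feature row; tokens missing from the vocabulary are skipped."""
    row = [0] * len(vocabulary)
    for token in tokens:
        column = vocabulary.get(token)
        if column is not None:
            row[column] += 1
    return row


# Hypothetical vocabulary: term -> column index
vocabulary = {"contact": 0, "program": 1, "students": 2}

new_tokens = ["program", "programs", "contact", "students", "graduate"]
row = count_known_terms(new_tokens, vocabulary)
print(row)  # [1, 1, 1]
```

In this toy case, "programs" and "graduate" contribute nothing because they were never seen during fitting, so a new record with few known terms gives the classifier little to work with.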

In [7]:
# Let's make a dataframe to have a nice display of results
results = pd.DataFrame(list(zip(ndf['id'].values, predictions)), columns=['id', 'cluster'])
results

| | id | cluster |
|---|---|---|
| 0 | record_id_1 | 1 |
| 1 | record_id_2 | 1 |
| 2 | record_id_3 | 1 |
| 3 | record_id_4 | 1 |
| 4 | record_id 5 | 1 |


### Cross Validating the Model

We can then proceed to cross-validate the model and see how accurate it is in reclassifying our training data into their respective clusters.

In [8]:
from sklearn.metrics import accuracy_score
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline

# Chain the vectorizer and the classifier into a single pipeline
pipeline = Pipeline([
    ('vectorizer',  vectorizer),
    ('classifier',  classifier)
])

# Split the training data into 3 folds -- 2 folds for training, 1 fold for testing
# KFold rotates the folds so that each of the three is used once for testing
kf = KFold(n_splits=3)

scores = []

# Iterate over the train/test splits
for train_indices, test_indices in kf.split(clusters):
    
    # Extract the keywords and clusters for the training data
    train_text = clusters.iloc[train_indices]['keywords'].values
    train_y = clusters.iloc[train_indices]['cluster'].values
    
    # Extract the keywords and clusters for the testing data
    test_text = clusters.iloc[test_indices]['keywords'].values
    test_y = clusters.iloc[test_indices]['cluster'].values
    
    # Use the pipeline interface to build the classifier model
    pipeline.fit(train_text, train_y)
    
    # Classify the records in the testing set
    predictions = pipeline.predict(test_text)
    
    # Compute the accuracy
    score = accuracy_score(test_y, predictions)
    scores.append(score)

print('Total records classified:', len(clusters))
print('Accuracy score: {0:.2f}%'.format(sum(scores)/len(scores) * 100))

Total records classified: 10000
Accuracy score: 88.43%
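
To show what the splitter does, here is a small pure-Python sketch of k-fold index generation without shuffling, which is the default behaviour of `KFold`. The helper name `contiguous_folds` is hypothetical, and the sketch only mirrors the splitting logic, not the scikit-learn API.

```python
def contiguous_folds(n_samples, n_splits):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # The first (n_samples % n_splits) folds get one extra sample
    fold_sizes = [n_samples // n_splits] * n_splits
    for i in range(n_samples % n_splits):
        fold_sizes[i] += 1

    start = 0
    for size in fold_sizes:
        stop = start + size
        test = list(range(start, stop))
        train = [i for i in range(n_samples) if i < start or i >= stop]
        yield train, test
        start = stop


folds = list(contiguous_folds(10, 3))
for train, test in folds:
    print("train:", train, "test:", test)
```

Because the folds are contiguous, the ordering of the records matters. If the training data happens to be sorted by cluster, passing `shuffle=True` (optionally with a `random_state`) to `KFold` may give more representative folds.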
