# Applying the classifier to a real dataset

For the standard imports in this Notebook we will include `sklearn.neighbors.KNeighborsClassifier` and the `collections.Counter` class, which is used in the definition of `build_tf_vector`.

In [Notebook 22.1](22.1 Case study preliminaries - the vector space model.ipynb) and [Notebook 22.2](22.2 Preliminaries - building the classifier.ipynb) we saw how to represent a collection of textual documents, how to estimate the similarity between them, and how to use the documents and the similarity measure as a way of building a simple spam filter.

In this Notebook we will apply the work from those two Notebooks to a collection of real emails, which have been classified as either ham or spam. By using a subset of the classified data, we will be able to see how well the classifier behaves on unseen data.

Unless we state otherwise, the functions we use in this Notebook are the same as those defined in the two previous Notebooks. We will reuse them as much as possible, but we will find that we need to adapt the techniques to overcome the difficulties of real-world data.

We will start with the same imports and function definitions as in [Notebook 22.2](22.2 Preliminaries - building the classifier.ipynb).

In [1]:
# Standard imports
import pandas as pd

import math

from scipy.spatial.distance import cosine
from sklearn.neighbors import KNeighborsClassifier

from collections import Counter

If you are unclear about how the following functions are used, you should reread the previous Notebooks to refresh your memory.

In [2]:
def tokenise_document(docIn_str):
    '''Return a list of the tokens in the input string docIn_str'''
    return docIn_str.split()

In [3]:
def build_term_index(tokenisedDocuments_coll):
    '''Return a set of all the terms appearing in the 
       documents in tokenisedDocuments_coll
    '''
    allTerms_set = set()  # Store the tokens as a set to remove repetitions
    
    for tokens_coll in tokenisedDocuments_coll:
        allTerms_set = allTerms_set.union(set(tokens_coll))
        
    return list(allTerms_set)     # Return the members as a list

In [4]:
def build_tf_vector(tokenisedDocument_ls, termIndex_ls):
    '''Return a pandas Series representing the term 
       frequency vector of the tokenised document 
       tokenisedDocument_ls, and indexed with termIndex_ls
    '''
    
    return pd.Series(Counter(tokenisedDocument_ls),
                     index=termIndex_ls).fillna(0)
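
As a quick refresher, the following sketch runs the three functions on two made-up documents (the example strings are ours and are not taken from the corpus). The definitions are repeated so that the block runs on its own, and the index is sorted only to make the printed output predictable.

```python
from collections import Counter

import pandas as pd


def tokenise_document(docIn_str):
    return docIn_str.split()


def build_term_index(tokenisedDocuments_coll):
    allTerms_set = set()
    for tokens_coll in tokenisedDocuments_coll:
        allTerms_set = allTerms_set.union(set(tokens_coll))
    return list(allTerms_set)


def build_tf_vector(tokenisedDocument_ls, termIndex_ls):
    return pd.Series(Counter(tokenisedDocument_ls),
                     index=termIndex_ls).fillna(0)


# Two toy documents, invented for illustration only
toyDocs_ls = ['cheap pills cheap', 'meeting agenda attached']

toyTokens_ls = [tokenise_document(doc) for doc in toyDocs_ls]
toyIndex_ls = sorted(build_term_index(toyTokens_ls))

# Term frequency vector for the first document, over the shared index
toyVector_ss = build_tf_vector(toyTokens_ls[0], toyIndex_ls)
print(toyVector_ss)
```

Terms that are in the index but not in the document (such as `agenda` here) get a frequency of 0, which is what the `fillna(0)` call is for.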

## Importing the training and test corpora

So far, we have only applied the classifier to some toy data. In this Notebook we will use a subset of the Enron spam corpus to investigate how well the basic classifier works on real data.


Because the whole corpus is quite large, we have taken a random subset of the corpus so that we can carry out experiments more quickly. To reiterate what was said previously: the aim of this week's work is to illustrate how the techniques you have seen can be used in a practical application, rather than to give a complete account of how to do this in an efficient and scalable manner. 

All the documents are stored as text files.


For this Notebook we have selected 1000 ham documents and 1000 spam documents as training data.

The ham training data can be found in the folder:

    data/trainingData/ham/
    
and the spam training data in the folder:

    data/trainingData/spam/


We have also selected 200 ham documents and 200 spam documents to use as test data.

The ham test data can be found in the folder:

    data/testData/ham/
    
and the spam test data in the folder:

    data/testData/spam/

We will collect the training corpus into a list of strings. Because the file structures of the email folders are not standardised, we will need to use the `os.walk` function to find all the text files in the folder hierarchy:

In [None]:
import os

In [None]:
trainingCorpusDocuments_ls = []
trainingCorpusClasses_ls = []

# First collect the ham documents:
print("Reading ham files...")

for (path, dirs, files) in os.walk('./data/trainingData/ham/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            continue
        
        with open(os.path.join(path, file), 'r') as fileIn:
            trainingCorpusDocuments_ls.append(fileIn.read())
            trainingCorpusClasses_ls.append('ham')
            
print('{numHamFiles} ham files read'.format(numHamFiles=len(trainingCorpusDocuments_ls)))

# Next, collect the spam documents:
print("Reading spam files...")

for (path, dirs, files) in os.walk('./data/trainingData/spam/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            pass
        else:
            with open(os.path.join(path, file), 'r') as fileIn:
                trainingCorpusDocuments_ls.append(fileIn.read())
                trainingCorpusClasses_ls.append('spam')
            
print('{numSpamFiles} spam files read'.format(numSpamFiles=trainingCorpusClasses_ls.count('spam')))

Oh dear!

You should find that simply trying to read in the files as text has raised an error. It is characteristic of many data collections, but particularly text collections, that the documents will contain characters which cannot be parsed using an off-the-shelf text analyser such as the standard methods in Python. This is a practical occurrence of the issues you saw in Part 2, on file encoding, and Part 3 on data preparation.

But this error raises the first of our decisions: what to do about unparsable characters?

One possibility, given that we know we are dealing with corporate emails, would be to attempt to identify any known quirks of the email system used by that corporation. 

In this case, we will take a simple approach to the problem, and just ignore any unparsable characters. To do this, we take advantage of the `errors` argument of Python's `bytes.decode` method, setting it to `'ignore'`. Of course, this might not be the best solution: perhaps the spam emails are more likely to contain unparsable characters than the ham emails, and knowing this could be a useful pointer towards which emails are spam. However, for the moment we will just remove the offending characters.
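
To see what ignoring a character means in practice, here is a minimal sketch on a made-up byte string (it is not taken from the corpus). The byte `\xe9` is not valid UTF-8 in this position, so strict decoding fails, while decoding with `'ignore'` silently drops it.

```python
rawBytes = b'caf\xe9 latte'   # invented example containing one invalid byte

try:
    rawBytes.decode('utf-8')              # the default error handling is 'strict'
    strictFailed = False
except UnicodeDecodeError:
    strictFailed = True

cleaned_txt = rawBytes.decode('utf-8', 'ignore')   # drop the undecodable byte

print(strictFailed)
print(cleaned_txt)
```

Note that the character is lost without any warning, which is why this approach needs some care on real data.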

To do this, the files need to be read in binary mode, and then decoded as UTF-8 with the error handling set to `'ignore'`:

In [None]:
trainingCorpusDocuments_ls = []
trainingCorpusClasses_ls = []

# First collect the ham documents:
print("Reading ham training files...")

for (path, dirs, files) in os.walk('./data/trainingData/ham/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            continue
        
        with open(os.path.join(path, file), 'rb') as fileIn:
            docText = fileIn.read()
            docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8
            
            trainingCorpusDocuments_ls.append(docText)
            trainingCorpusClasses_ls.append('ham')

# Next, collect the spam documents:
print("Reading spam training files...")

for (path, dirs, files) in os.walk('./data/trainingData/spam/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            pass
        else:
            with open(os.path.join(path, file), 'rb') as fileIn:
                docText = fileIn.read()
                docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8

                trainingCorpusDocuments_ls.append(docText)
                trainingCorpusClasses_ls.append('spam')

print('{} ham training files read'.format(trainingCorpusClasses_ls.count('ham')))
print('{} spam training files read'.format(trainingCorpusClasses_ls.count('spam')))

Having loaded the training data, we can now load the test data in the same way:

In [None]:
testCorpusDocuments_ls = []
testCorpusClasses_ls = []

# First collect the ham documents:
print("Reading ham test files...")

for (path, dirs, files) in os.walk('./data/testData/ham/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            continue
        
        with open(os.path.join(path, file), 'rb') as fileIn:
            docText = fileIn.read()
            docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8
            
            testCorpusDocuments_ls.append(docText)
            testCorpusClasses_ls.append('ham')
            
# Next, collect the spam documents:
print("Reading spam test files...")

for (path, dirs, files) in os.walk('./data/testData/spam/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            pass
        else:
            with open(os.path.join(path, file), 'rb') as fileIn:
                docText = fileIn.read()
                docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8

                testCorpusDocuments_ls.append(docText)
                testCorpusClasses_ls.append('spam')

print('{} ham test files read'.format(testCorpusClasses_ls.count('ham')))
print('{} spam test files read'.format(testCorpusClasses_ls.count('spam')))
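
The four loops above all follow the same pattern, so they could be collected into a single helper. The sketch below is one possible way to do this; the function name `read_corpus_folder` is our own invention and is not used elsewhere in these Notebooks. The demonstration writes two small files to a temporary folder rather than touching the real data.

```python
import os
import tempfile


def read_corpus_folder(folder_path, class_label):
    '''Return (documents, classes) for every non-hidden file
       found anywhere below folder_path
    '''
    documents_ls = []

    for (path, dirs, files) in os.walk(folder_path):
        for file in files:
            if file.startswith('.'):   # Don't process hidden files
                continue

            with open(os.path.join(path, file), 'rb') as fileIn:
                documents_ls.append(fileIn.read().decode('utf-8', 'ignore'))

    return documents_ls, [class_label] * len(documents_ls)


# Demonstration on a throwaway folder containing one visible and one hidden file
with tempfile.TemporaryDirectory() as demo_dir:
    with open(os.path.join(demo_dir, 'a.txt'), 'wb') as fileOut:
        fileOut.write(b'hello \xff world')
    with open(os.path.join(demo_dir, '.hidden'), 'wb') as fileOut:
        fileOut.write(b'skip me')

    demoDocs_ls, demoClasses_ls = read_corpus_folder(demo_dir, 'ham')

print(demoDocs_ls, demoClasses_ls)
```

With such a helper, the ham and spam lists for each corpus could be built with two calls and concatenated, for example by calling it once on `./data/trainingData/ham/` and once on `./data/trainingData/spam/`.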

We now have a list of documents, and a list of their classification into *ham* or *spam*. For example, to see the training document with index 30, we can say:

In [None]:
trainingCorpusDocuments_ls[30]

and to see its classification:

In [None]:
trainingCorpusClasses_ls[30]

## Tokenising the dataset

In Notebooks [22.1](22.1 Case study preliminaries - the vector space model.ipynb) and [22.2](22.2 Preliminaries - building the classifier.ipynb) we used a simple tokenisation technique of splitting on whitespace. 

In the module materials we suggested that a better (although still far from perfect) tokenisation strategy for real text would be to:

1. include all the metadata for each email
2. assume that all the individual tokens in an email are separated by whitespace
3. cast all the tokens into lower case
4. remove all punctuation.

To implement this strategy we will create a new tokenisation function, `tokenise_email_document`. For punctuation, we will use the `punctuation` constant from Python's `string` module:

In [None]:
import string

string.punctuation     # A string containing the punctuation characters

In [None]:
def tokenise_email_document(emailDocIn_txt):
    '''Convert an input string to a list of tokens using the operations:
    
        - convert to lower case
        - split on whitespace
        - remove surrounding punctuation
    '''
    return [token.strip(string.punctuation)  # remove punctuation around tokens
            
            for token in emailDocIn_txt.lower().split()] # Convert to lower case and split
                                                         # on whitespace 

To see this function in action, let's call it on a document with mixed case and punctuation:

In [None]:
tokenise_email_document('"Hello!" said John, loudly.')

To compare with the original tokenisation function:

In [None]:
tokenise_document('"Hello!" said John, loudly.')
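
It is worth noticing that `strip` only removes punctuation from the start and end of each token. The sketch below, using an example sentence of our own, shows that punctuation inside a token survives and that a token made up entirely of punctuation becomes an empty string. Filtering out the empty strings is an optional refinement that we show for illustration; it is not part of `tokenise_email_document` as defined above.

```python
import string


def tokenise_email_document(emailDocIn_txt):
    return [token.strip(string.punctuation)
            for token in emailDocIn_txt.lower().split()]


sample_tokens = tokenise_email_document("Don't miss out --- e-mail us NOW!!!")
print(sample_tokens)

# Optional refinement (an assumption, not used in this Notebook):
# discard tokens that were nothing but punctuation
nonEmpty_tokens = [token for token in sample_tokens if token]
print(nonEmpty_tokens)
```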

Our new function is still very simplistic, but extends the original in a useful way. Tokenisation techniques can range from the simple (such as this) to the extremely complex, and for some languages, such as Chinese, a completely different approach is required. In practice, tokenisation techniques are often implemented using regular expressions, which strike a good balance between computational efficiency and appropriate expressive power.
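
As an illustration of the regular expression approach, here is a minimal sketch of an alternative tokeniser. The function name `tokenise_with_regex` and the choice of pattern are our own assumptions, and the function is not used in the rest of this Notebook.

```python
import re


def tokenise_with_regex(docIn_txt):
    '''Hypothetical alternative tokeniser: convert the text to lower
       case and return every run of letters, digits or apostrophes
    '''
    return re.findall(r"[a-z0-9']+", docIn_txt.lower())


regexTokens_ls = tokenise_with_regex('"Hello!" said John, loudly.')
print(regexTokens_ls)
```

Unlike the whitespace-based function, this pattern also splits tokens on internal punctuation, so `e-mail` would become the two tokens `e` and `mail`. Whether that is desirable depends on the application.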

## Building a DataFrame of training data

We will now follow the same pattern of analysis as in [Notebook 22.2](22.2 Preliminaries - building the classifier.ipynb) by building a DataFrame with which to train the classifier.

As before, the first task is to convert the collection of training documents into a list of tokenised documents:

In [None]:
trainingTokenisedDocuments_ls = [tokenise_email_document(doc_txt) for doc_txt in trainingCorpusDocuments_ls]

The *n*th member of `trainingTokenisedDocuments_ls` is a tokenised form of the *n*th member of `trainingCorpusDocuments_ls`:

In [None]:
n = 30

print(trainingCorpusDocuments_ls[n])
print()
print(trainingTokenisedDocuments_ls[n])

Next, we need a term index, which we build using the `build_term_index` function:

In [None]:
termIndex_ls = build_term_index(trainingTokenisedDocuments_ls)

Following the same process as in the previous Notebooks, we would now convert the list of tokenised documents into a list of term vectors:

<font color='red'>DO NOT RUN THE NEXT CELL!</font>

Although this step (calling `build_tf_vector` on each tokenised document with the full `termIndex_ls`) is just the same as we used in [Notebook 22.2](22.2 Preliminaries - building the classifier.ipynb), in this case the size of the index (which turns out to have around 100,000 entries) results in a memory error. (If you want to see what happens, feel free to run it as an executable code cell, but on some machines it can take a very long time before an error is finally raised.)

Although you might be thinking that this is a result of using a single computer, bear in mind that you are also using a very small dataset! For a company like Google, issues such as the most efficient selection of linguistic features are of critical importance, as small variations in the number of indexing terms can have major knock-on effects on the (financial) cost of storage and the (processing) cost of data analysis.
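
To get a feel for the sizes involved, the following back-of-the-envelope sketch estimates the memory needed just to hold the term frequencies. It assumes one 64-bit number per term per document and uses the approximate index size quoted above; the real requirement would be higher, because each pandas Series also carries its own overhead.

```python
numDocuments = 2000      # 1000 ham + 1000 spam training documents
numTerms = 100000        # approximate size of the full term index
bytesPerValue = 8        # assumption: one 64-bit number per term frequency

fullBytes = numDocuments * numTerms * bytesPerValue
print('Full index: {:.1f} GB'.format(fullBytes / 1e9))

# For comparison, the same estimate with an index of only 200 terms
shortBytes = numDocuments * 200 * bytesPerValue
print('200-term index: {:.1f} MB'.format(shortBytes / 1e6))
```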

In this case, to reduce the size of the DataFrame to something manageable, rather than use every term which appears in the index, we will use only the terms that occur most often in the training documents.

In [None]:
termFrequencyIndex_ss = pd.Series(0, index=termIndex_ls)

for tokenisedDoc_ls in trainingTokenisedDocuments_ls:
    for token in tokenisedDoc_ls:
        termFrequencyIndex_ss[token] += 1

termFrequencyIndex_ss.sort_values(ascending=False, inplace=True)
        
termFrequencyIndex_ss.head()

The index of `termFrequencyIndex_ss` now forms a list of the terms which appear in the training document collection, sorted by frequency in decreasing order. To consider the most common *n* terms in the dataset, we can use the first *n* members of `termFrequencyIndex_ss`'s index.
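
Updating a Series one element at a time, as in the cell above, can be slow on a large collection. As a hedged alternative, the same counts can be gathered with `collections.Counter`. The sketch below uses a toy list of tokenised documents that we have made up, and the function name `count_terms` is our own.

```python
from collections import Counter


def count_terms(tokenisedDocuments_ls):
    '''Return a Counter of how often each term occurs across
       all the tokenised documents
    '''
    termCounts = Counter()
    for tokenisedDoc_ls in tokenisedDocuments_ls:
        termCounts.update(tokenisedDoc_ls)
    return termCounts


# Toy data, invented for illustration only
toyTokenisedDocs_ls = [['buy', 'now', 'buy'],
                       ['meeting', 'now'],
                       ['buy', 'tickets']]

toyCounts = count_terms(toyTokenisedDocs_ls)

# The n most common terms, in decreasing order of frequency
mostCommon_ls = [term for (term, count) in toyCounts.most_common(2)]
print(mostCommon_ls)
```

If a Series is still wanted, `pd.Series(toyCounts).sort_values(ascending=False)` should give an equivalent result to the loop-based version.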

Let's try building the training DataFrame again, but this time we will use only a subset of all the terms as the index. Choosing an arbitrary length for the index, let's try building the training DataFrame with an index of the 200 most common terms.

In [None]:
shortTermIndex = termFrequencyIndex_ss.index[:200]

shortTermIndex

In [None]:
# RUN THIS CELL!!

trainingTfVectors_ls = [build_tf_vector(tokenisedDoc_ls, shortTermIndex)
                        for tokenisedDoc_ls in trainingTokenisedDocuments_ls]

And as before, convert the training vectors into a DataFrame:

In [None]:
trainingData_df = pd.DataFrame(trainingTfVectors_ls)

trainingData_df

## Training the classifier

We can now use this DataFrame and the training classes to build a *k*-NN classifier. Again, we will use *k*=3.

In [None]:
spamFilter3_knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', algorithm='brute')

In [None]:
spamFilter3_knn.fit(trainingData_df,
                    trainingCorpusClasses_ls)
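
As a reminder of what the classifier does conceptually, the following pure-Python sketch finds the *k* training vectors closest to a query vector by cosine distance and returns the majority class. It illustrates the idea only and is not how scikit-learn implements it; the function names and the toy vectors are our own, and the code assumes that no vector is all zeros.

```python
import math
from collections import Counter


def cosine_distance(vecA, vecB):
    '''Return 1 minus the cosine similarity of two non-zero vectors'''
    dot = sum(a * b for a, b in zip(vecA, vecB))
    normA = math.sqrt(sum(a * a for a in vecA))
    normB = math.sqrt(sum(b * b for b in vecB))
    return 1 - dot / (normA * normB)


def knn_predict(trainVectors, trainClasses, queryVector, k=3):
    '''Return the most common class among the k nearest training vectors'''
    distances = sorted((cosine_distance(vec, queryVector), cls)
                       for vec, cls in zip(trainVectors, trainClasses))
    nearestClasses = [cls for (dist, cls) in distances[:k]]
    return Counter(nearestClasses).most_common(1)[0][0]


# Toy term frequency vectors, invented for illustration only
toyVectors = [[3, 0, 1], [2, 0, 0], [4, 1, 0], [0, 2, 3], [0, 1, 2]]
toyClasses = ['spam', 'spam', 'spam', 'ham', 'ham']

print(knn_predict(toyVectors, toyClasses, [5, 0, 1]))
print(knn_predict(toyVectors, toyClasses, [0, 1, 1]))
```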

## Using the classifier to classify test data

To use this classifier to classify the test data, we need to convert the test documents to a DataFrame as well, using the same techniques.

First tokenise the data:

In [None]:
testTokenisedData_ls = [tokenise_email_document(doc_txt) for doc_txt in testCorpusDocuments_ls]

Then convert the tokenised data to term frequency vectors using the same index as for the training data:

In [None]:
testTfVectors_ls = [build_tf_vector(tokenisedDoc_ls, shortTermIndex)
                    for tokenisedDoc_ls in testTokenisedData_ls]

In [None]:
testData_df = pd.DataFrame(testTfVectors_ls)

testData_df

Finally, we use the classifier to predict a class for each test document, and store the predictions alongside the actual classes:

In [None]:
results_df = pd.DataFrame({'predicted':spamFilter3_knn.predict(testData_df),
                           'actual':testCorpusClasses_ls})

results_df

## Evaluating the filter

We have now constructed the necessary training set, test set, and classification function, and so we can apply the classifier to see how well it classifies emails into the spam and ham classes.

To evaluate the technique, we use the `results_df` DataFrame created above, in which the `actual` column contains the actual class of each test item, and the `predicted` column contains the class predicted by the classifier.

We can now evaluate how well the basic spam filter works by using the cross-tabulation function in pandas (you saw crosstab tables in Part 4 and Notebook 04.1):

In [None]:
tabulatedResults_df = pd.crosstab(results_df.predicted, results_df.actual, margins=True)

tabulatedResults_df
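
To make the orientation of this table clear, here is a minimal sketch on six made-up labels (these are not the real results). Because `predicted` is passed first, the rows are the predicted classes and the columns are the actual classes, so `toyTable_df['ham']['spam']` counts actual ham that was predicted to be spam.

```python
import pandas as pd

# Six invented test outcomes, for illustration only
toyResults_df = pd.DataFrame(
    {'predicted': ['ham', 'ham', 'spam', 'spam', 'spam', 'ham'],
     'actual':    ['ham', 'ham', 'spam', 'ham', 'spam', 'spam']})

toyTable_df = pd.crosstab(toyResults_df.predicted,
                          toyResults_df.actual,
                          margins=True)
print(toyTable_df)

# Accuracy: correctly classified items divided by all items
toyAccuracy = ((toyTable_df['ham']['ham'] + toyTable_df['spam']['spam'])
               / toyTable_df['All']['All'])
print('{:.1%}'.format(toyAccuracy))
```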

We can now print the results, and give an overall accuracy (the proportion of test emails that were correctly classified as *ham* or *spam*, shown as a percentage):

In [None]:
print('Ham correctly classified as ham: {}/{}'.format(tabulatedResults_df['ham']['ham'],
                                                      tabulatedResults_df['ham']['All']))

print('Ham incorrectly classified as spam: {}/{}'.format(tabulatedResults_df['ham']['spam'],
                                                         tabulatedResults_df['ham']['All']))

print('Spam incorrectly classified as ham: {}/{}'.format(tabulatedResults_df['spam']['ham'],
                                                         tabulatedResults_df['spam']['All']))

print('Spam correctly classified as spam: {}/{}'.format(tabulatedResults_df['spam']['spam'],
                                                        tabulatedResults_df['spam']['All']))

print('Overall system accuracy: {:.1%}'.format((tabulatedResults_df['ham']['ham'] +
                                                tabulatedResults_df['spam']['spam']) /
                                               tabulatedResults_df['All']['All']))


This is a very good baseline result. In fact, this is much better than is typical for machine-learning systems: the small size of the dataset, and the fact that all the ham files come from within the same organisation, make this a much easier task than building a full-blown spam filter that has to work on a wide range of emails.

In the next and final Notebook of Part 22, we will look at using inverse document frequency measures to try to improve the performance of the spam filter.

## What next?

If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, move on to [`22.4 Term frequency and inverse document frequency`](22.4 Term frequency and inverse document frequency.ipynb).