# Term frequency and inverse document frequency

In this final Notebook looking at how to handle textual data, we will consider how *inverse document frequency* is used to weight terms in the term frequency vectors. We will use the same training and test documents as in [the previous Notebook](22.3 Applying the classifier to a real dataset.ipynb) so that we can compare the performance of the two techniques directly.

In the module material we discuss the technique of inverse document frequency weighting as well as stopword removal. Although we will not look at stopword removal in the Notebooks, while working through this Notebook you should think about how different techniques can be used to improve the performance of your data investigations. When working on your own investigation, you should always be thinking about how you would go about selecting different ways of treating the data.

## Initial imports and function definitions

In [1]:
# Standard imports
import pandas as pd
import numpy as np   # needed for np.log when building the idf index

import math, string
import os 

from scipy.spatial.distance import cosine
from sklearn.neighbors import KNeighborsClassifier

from collections import Counter

We will use the same definitions for the main functions as in [Notebook 22.3](22.3 Applying the classifier to a real dataset.ipynb). In this case, we will use `tokenise_email_document` again, rather than the simpler `tokenise_document`.

In [2]:
def tokenise_email_document(emailDocIn_txt):
    '''Convert an input string to a list of tokens using the operations:
    
        - convert to lower case
        - split on whitespace
        - remove surrounding punctuation
    '''
    return [token.strip(string.punctuation)  # remove punctuation around tokens
            
            for token in emailDocIn_txt.lower().split()] # Convert to lower case and split
                                                         # on whitespace 
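
As a quick check of what the tokeniser produces, the sketch below applies the same function to a short string. The sample text is hypothetical and is not taken from the dataset.

```python
import string

def tokenise_email_document(emailDocIn_txt):
    '''Lower-case the text, split on whitespace, strip surrounding punctuation.'''
    return [token.strip(string.punctuation)
            for token in emailDocIn_txt.lower().split()]

# Hypothetical sample text, for illustration only
sample_txt = "Subject: Re: Meeting!  Don't be late... -- Bill"
tokens = tokenise_email_document(sample_txt)
print(tokens)
# ['subject', 're', 'meeting', "don't", 'be', 'late', '', 'bill']
```

Note that a token made up only of punctuation (here `--`) is reduced to the empty string, and that punctuation inside a token (the apostrophe in `don't`) is kept. This is consistent with the empty string appearing as a term later in this Notebook.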

In [3]:
def build_term_index(tokenisedDocuments_coll):
    '''Return a set of all the terms appearing in the 
       documents in tokenisedDocuments_coll
    '''
    allTerms_set = set()  # Store the tokens as a set to remove repetitions
    
    for tokens_coll in tokenisedDocuments_coll:
        allTerms_set = allTerms_set.union(set(tokens_coll))
        
    return list(allTerms_set)     # Return the members as a list

In [4]:
def build_tf_vector(tokenisedDocument_ls, termIndex_ls):
    '''Return a pandas Series representing the term 
       frequency vector of the tokenised document 
       tokenisedDocument_ls, and indexed with termIndex_ls
    '''
    
    return pd.Series(Counter(tokenisedDocument_ls),
                     index=termIndex_ls).fillna(0)
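
To make the behaviour of `build_tf_vector` concrete, here is a minimal sketch of the same idea in plain Python, using a list rather than a *pandas* Series so that the counting is easy to follow. The term index and document are hypothetical.

```python
from collections import Counter

# Hypothetical term index and tokenised document, for illustration only
term_index = ['bill', 'meeting', 'the', 'offer']
tokens = ['the', 'meeting', 'with', 'bill', 'and', 'the', 'team']

counts = Counter(tokens)

# A Counter returns 0 for a missing key, which plays the role of fillna(0)
tf_vector = [counts[term] for term in term_index]
print(dict(zip(term_index, tf_vector)))
# {'bill': 1, 'meeting': 1, 'the': 2, 'offer': 0}
```

Tokens that are not in the term index (`with`, `and`, `team`) are dropped, and index terms that do not occur in the document (`offer`) get a count of zero.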

## Import data

We also need to import the same training and test data as for the previous Notebook.

To recap, we are using 1000 ham documents and 1000 spam documents as training data.

The ham training data can be found in the folder:

    data/trainingData/ham/
    
and the spam training data in the folder:

    data/trainingData/spam/
   

We have also selected 200 ham documents and 200 spam documents to use as test data.

The ham test data can be found in the folder:

    data/testData/ham/
    
and the spam test data in the folder:

    data/testData/spam/

In [5]:
trainingCorpusDocuments_ls = []
trainingCorpusClasses_ls = []

# First collect the ham documents:
print("Reading ham training files...")

for (path, dirs, files) in os.walk('./data/trainingData/ham/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            continue
        
        with open(os.path.join(path, file), 'rb') as fileIn:
            docText = fileIn.read()
            docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8
            
            trainingCorpusDocuments_ls.append(docText)
            trainingCorpusClasses_ls.append('ham')

# Next, collect the spam documents:
print("Reading spam training files...")

for (path, dirs, files) in os.walk('./data/trainingData/spam/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            pass
        else:
            with open(os.path.join(path, file), 'rb') as fileIn:
                docText = fileIn.read()
                docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8

                trainingCorpusDocuments_ls.append(docText)
                trainingCorpusClasses_ls.append('spam')

print('{} ham training files read'.format(trainingCorpusClasses_ls.count('ham')))
print('{} spam training files read'.format(trainingCorpusClasses_ls.count('spam')))

Reading ham training files...
Reading spam training files...
993 ham training files read
1000 spam training files read


In [6]:
testCorpusDocuments_ls = []
testCorpusClasses_ls = []

# First collect the ham documents:
print("Reading ham test files...")

for (path, dirs, files) in os.walk('./data/testData/ham/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            continue
        
        with open(os.path.join(path, file), 'rb') as fileIn:
            docText = fileIn.read()
            docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8
            
            testCorpusDocuments_ls.append(docText)
            testCorpusClasses_ls.append('ham')
            
# Next, collect the spam documents:
print("Reading spam test files...")

for (path, dirs, files) in os.walk('./data/testData/spam/'):
    
    for file in files:
        if file[0] == '.':  # Don't process hidden files
            pass
        else:
            with open(os.path.join(path, file), 'rb') as fileIn:
                docText = fileIn.read()
                docText = docText.decode('utf-8', 'ignore')   # decoding the utf-8

                testCorpusDocuments_ls.append(docText)
                testCorpusClasses_ls.append('spam')

print('{} ham test files read'.format(testCorpusClasses_ls.count('ham')))
print('{} spam test files read'.format(testCorpusClasses_ls.count('spam')))

Reading ham test files...
Reading spam test files...
200 ham test files read
200 spam test files read


## Tokenising the dataset

As before, we will use the `tokenise_email_document` function to perform the tokenisation:

In [7]:
trainingTokenisedDocuments_ls = [tokenise_email_document(doc_txt) for doc_txt in trainingCorpusDocuments_ls]

## Building an inverse document frequency index

Having imported and tokenised the data, we now need to build the inverse document frequency index. Recall that the definition of inverse document frequency (idf) for some term is:

$$\text{idf}(term)=\log_e\left(\frac{\textrm{total number of documents}}{\textrm{number of documents containing }term}\right)$$

As with the term frequency index we built in Notebook 22.3, we can build a *pandas* Series which contains the inverse document frequency values for all the terms in the training set.

First, create the term index of all the terms which appear in the training set:

In [8]:
trainingTermIndex_ls = build_term_index(trainingTokenisedDocuments_ls)

Next, we want a Series which represents how many documents each term appears in (the 'number of documents containing `term`' in the definition of `idf`). We will start with a Series whose index is the terms which appear in the collection, and which has zero for each document frequency:

In [9]:
documentFrequencyIndex_ss = pd.Series(0, index=trainingTermIndex_ls)

We can populate the Series with the document frequency count for each term:

In [10]:
for tokenisedDoc_ls in trainingTokenisedDocuments_ls:
    for term in set(tokenisedDoc_ls):
        documentFrequencyIndex_ss[term] += 1

So, for example, to find out how many documents the term *bill* appears in, use:

In [11]:
documentFrequencyIndex_ss['bill']

200
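
The reason for iterating over `set(tokenisedDoc_ls)` rather than the token list itself is that a document should be counted once per term, however often the term occurs in it. The sketch below shows the difference on a hypothetical toy corpus, using `Counter` objects in place of the Series.

```python
from collections import Counter

# Hypothetical tokenised documents, for illustration only
docs = [['bill', 'bill', 'the', 'meeting'],
        ['the', 'offer', 'the'],
        ['bill', 'the']]

term_counts = Counter()   # total occurrences of each term
doc_freq = Counter()      # number of documents containing each term

for tokens in docs:
    term_counts.update(tokens)      # counts every occurrence
    doc_freq.update(set(tokens))    # counts each document at most once

print(term_counts['bill'], doc_freq['bill'])   # 3 2
print(term_counts['the'], doc_freq['the'])     # 4 3
```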

We can now create the idf index by dividing the number of documents in the training corpus by the document frequency of each term, and then using `np.log` to take the natural logarithm of the values in the Series:

In [12]:
idfIndex_ss = pd.Series(len(trainingCorpusDocuments_ls),  # Put the number of documents as
                        index=trainingTermIndex_ls)       # each value

idfIndex_ss = np.log(idfIndex_ss / documentFrequencyIndex_ss)  # Divide by the document 
                                                               # frequency and take the log

We can now compare the impact that different terms will have. Comparing the inverse document frequency values of the terms *the* and *bill*:

In [13]:
print(idfIndex_ss['the'])
print(idfIndex_ss['bill'])

0.307785798762
2.29907895366


shows that each occurrence of *bill* will be much more heavily weighted than each occurrence of *the*.
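
These values can be checked by hand against the definition of idf. The sketch below uses only numbers reported earlier in this Notebook: 993 ham and 1000 spam training files were read, and *bill* appears in 200 documents. The document frequency of *the* is not printed above, so the last step works backwards from its idf value and should be read as an inference rather than a reported figure.

```python
import math

n_documents = 993 + 1000   # ham + spam training files reported as read
df_bill = 200              # document frequency of 'bill' shown earlier

idf_bill = math.log(n_documents / df_bill)
print(round(idf_bill, 4))  # 2.2991

# Inferred, not reported: the document frequency implied by idf('the')
idf_the = 0.307785798762
df_the_estimate = n_documents / math.exp(idf_the)
print(round(df_the_estimate))  # 1465
```

In other words, a term found in roughly three quarters of the documents receives a weight of about 0.31, while a term found in about a tenth of them receives a weight of about 2.3.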

## Reducing the training data size

As in the previous Notebook, we will quickly run into memory problems if we try to create a DataFrame containing the complete set of terms for every training document, so we will again use only the most common terms:

In [14]:
termFrequencyIndex_ss = pd.Series(0, index=idfIndex_ss.index)

for tokenisedDoc_ls in trainingTokenisedDocuments_ls:
    for token in tokenisedDoc_ls:
        termFrequencyIndex_ss[token] += 1

termFrequencyIndex_ss.sort_values(ascending=False, inplace=True)
        
termFrequencyIndex_ss.head()

       18161
the    15306
to     13633
and     8801
of      7827
dtype: int64

Again, take the 200 most common terms and create an index containing only those terms:

In [15]:
shortTermIndex = termFrequencyIndex_ss.index[:200]

shortTermIndex

Index(['', 'the', 'to', 'and', 'of', 'a', 'in', 'from', 'for', 'you',
       ...
       'power', 'just', 'kaminski', 'office', 'within', 'call', 'invoked',
       'go', 'doctype', 'w3c//dtd'],
      dtype='object', length=200)
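
The selection of the most common terms can also be sketched without *pandas*. The example below is an illustrative alternative, not the approach used in this Notebook: it counts term occurrences with a `Counter` and keeps the top two terms of a hypothetical toy corpus.

```python
from collections import Counter

# Hypothetical tokenised documents, for illustration only
docs = [['the', 'offer', 'the', 'bill'],
        ['the', 'meeting', 'bill'],
        ['bill', 'the']]

# Count every occurrence of every term across the corpus
term_counts = Counter(token for tokens in docs for token in tokens)

# Keep only the two most frequent terms (the Notebook keeps 200)
short_term_index = [term for term, count in term_counts.most_common(2)]
print(short_term_index)
# ['the', 'bill']
```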

And use the `reindex` method to reduce the size of `idfIndex_ss`:

In [16]:
idfIndex_ss = idfIndex_ss.reindex(shortTermIndex)

idfIndex_ss

                             0.244955
the                          0.307786
to                           0.005030
and                          0.434999
of                           0.519898
a                            0.257209
in                           0.459323
from                         0.000000
for                          0.395480
you                          0.528373
is                           0.545541
by                           0.363219
with                         0.535205
this                         0.648499
br                           1.474904
on                           0.713934
i                            1.020927
tr                           2.190225
subject                      0.002009
that                         0.945824
your                         0.777380
content-type                 0.000000
be                           0.868768
td                           2.289129
date                         0.004022
we                           0.958829
content-tran

## Building and training the classifier

We can now build our set of training vectors. Previously, we used the term frequency for each term in the document. In this case, we multiply the term frequency by the inverse document frequency value for that term (to give tf.idf):

In [17]:
trainingTfIdfVectors_ls = [build_tf_vector(tokenisedDoc_ls, shortTermIndex) * idfIndex_ss
                           for tokenisedDoc_ls in trainingTokenisedDocuments_ls]
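
The element-wise multiplication can be followed on a small scale with the sketch below, which uses plain Python lists and a hypothetical three-document corpus. The helper name `tf_idf_vector` is introduced here for illustration only.

```python
import math
from collections import Counter

# Hypothetical corpus of tokenised documents, for illustration only
docs = [['the', 'bill', 'the'],
        ['the', 'offer'],
        ['the', 'bill', 'offer', 'offer']]
term_index = ['the', 'bill', 'offer']

n_docs = len(docs)
# Document frequency: each document counts once per term
doc_freq = Counter(term for tokens in docs for term in set(tokens))
idf = {term: math.log(n_docs / doc_freq[term]) for term in term_index}

def tf_idf_vector(tokens):
    '''Multiply each term frequency by the idf value of that term.'''
    counts = Counter(tokens)
    return [counts[term] * idf[term] for term in term_index]

vector = tf_idf_vector(docs[2])
print([round(value, 3) for value in vector])
# [0.0, 0.405, 0.811]
```

Because *the* occurs in every document of the toy corpus, its idf is log(1) = 0 and it contributes nothing to the vector, however often it occurs. The same effect can be seen in the idf index above, where *from* and *content-type* have an idf of zero.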

In [18]:
trainingData_df = pd.DataFrame(trainingTfIdfVectors_ls)

trainingData_df

Unnamed: 0,Unnamed: 1,the,to,and,of,a,in,from,for,you,...,power,just,kaminski,office,within,call,invoked,go,doctype,w3c//dtd
0,0.000000,0.000000,0.010060,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.528373,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.000000,1.538929,0.020121,0.434999,0.000000,0.257209,0.000000,0.0,0.00000,0.528373,...,0.000000,0.000000,4.558553,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
2,2.204597,3.693430,0.045272,0.869998,1.039797,0.257209,1.837293,0.0,1.18644,4.226983,...,0.000000,0.000000,6.837829,0.000000,2.522223,2.236104,0.000000,2.154979,0.000000,0.000000
3,0.244955,0.000000,0.015091,0.434999,0.000000,0.000000,0.000000,0.0,0.00000,0.528373,...,0.000000,0.000000,4.558553,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,0.979821,1.231143,0.045272,2.609993,0.519898,1.286047,0.918647,0.0,1.58192,3.170237,...,0.000000,2.025242,0.000000,2.398899,0.000000,2.236104,0.000000,0.000000,0.000000,0.000000
5,0.244955,0.000000,0.005030,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.000000,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
6,0.000000,0.307786,0.015091,0.000000,0.000000,0.000000,0.000000,0.0,0.39548,1.056746,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,0.244955,1.231143,0.005030,1.304996,1.039797,0.000000,0.459323,0.0,0.00000,0.000000,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,0.000000,0.000000,0.005030,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.528373,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.000000,0.000000,0.005030,0.000000,0.000000,0.000000,0.000000,0.0,0.00000,0.528373,...,0.000000,2.025242,2.279276,0.000000,0.000000,0.000000,0.000000,4.309957,0.000000,0.000000


And as before, use this DataFrame and the training classes to build a *k*-NN classifier. Again, we will use *k*=3.

In [19]:
spamFilter3_knn = KNeighborsClassifier(n_neighbors=3, metric='cosine', algorithm='brute')

In [20]:
spamFilter3_knn.fit(trainingData_df,
                    trainingCorpusClasses_ls)

KNeighborsClassifier(algorithm='brute', leaf_size=30, metric='cosine',
           metric_params=None, n_jobs=1, n_neighbors=3, p=2,
           weights='uniform')
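
To illustrate what a brute-force *k*-NN classifier with a cosine metric does in principle, here is a minimal sketch in plain Python. It is not how scikit-learn implements the classifier; the function names and the two-term vectors are hypothetical.

```python
import math
from collections import Counter

def cosine_distance(u, v):
    '''Return 1 - cosine similarity; assumes neither vector is all zeros.'''
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1 - dot / (norm_u * norm_v)

def knn_predict(train_vectors, train_classes, query, k=3):
    '''Majority vote among the k training vectors nearest to query.'''
    ranked = sorted(zip(train_vectors, train_classes),
                    key=lambda pair: cosine_distance(pair[0], query))
    votes = Counter(cls for _, cls in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical two-term tf.idf vectors, for illustration only
train_vectors = [[2.0, 0.1], [1.5, 0.0], [3.0, 0.5],
                 [0.1, 2.0], [0.0, 1.0], [0.3, 2.5]]
train_classes = ['ham', 'ham', 'ham', 'spam', 'spam', 'spam']

print(knn_predict(train_vectors, train_classes, [10.0, 1.0]))  # ham
print(knn_predict(train_vectors, train_classes, [0.2, 0.9]))   # spam
```

The first query vector is much longer than any training vector but points in the same direction as the ham examples, so it is still classified as ham: cosine distance compares the direction of two vectors, not their length.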

## Using the classifier to classify test data

To classify the test data, we need the tf.idf vector for each document in the test set. First tokenise the test data:

In [21]:
testTokenisedDocuments_ls = [tokenise_email_document(doc_txt) for doc_txt in testCorpusDocuments_ls]

and convert to tf.idf vectors:

In [22]:
testTfIdfVectors_ls = [build_tf_vector(tokenisedDoc_ls, shortTermIndex) * idfIndex_ss
                       for tokenisedDoc_ls in testTokenisedDocuments_ls]

In [23]:
testData_df = pd.DataFrame(testTfIdfVectors_ls)

testData_df

Unnamed: 0,Unnamed: 1,the,to,and,of,a,in,from,for,you,...,power,just,kaminski,office,within,call,invoked,go,doctype,w3c//dtd
0,12.982627,0.000000,0.030181,0.869998,0.000000,1.028838,0.918647,0.0,1.18644,1.056746,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
1,0.244955,1.231143,0.040242,2.174994,0.519898,0.771628,0.918647,0.0,1.18644,0.528373,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,2.154979,0.000000,0.000000
2,16.656955,1.538929,0.035211,0.869998,1.559695,0.771628,0.459323,0.0,1.58192,4.226983,...,0.000000,0.000000,9.117105,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
3,0.000000,0.615572,0.020121,1.739995,0.000000,0.257209,2.755940,0.0,0.00000,1.585119,...,0.000000,2.025242,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
4,1.469731,2.770072,0.070423,3.044992,0.519898,1.028838,2.296616,0.0,2.37288,5.283729,...,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
5,0.000000,14.773718,0.196177,3.914989,5.198983,4.886980,1.377970,0.0,1.58192,23.776780,...,0.000000,4.050485,2.279276,0.000000,0.000000,6.708312,0.000000,4.309957,0.000000,0.000000
6,1.224776,6.155716,0.035211,1.304996,1.559695,1.800466,1.837293,0.0,0.39548,0.528373,...,0.000000,2.025242,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
7,1.224776,2.154501,0.070423,2.174994,4.679084,1.028838,0.918647,0.0,2.76836,3.170237,...,0.000000,0.000000,4.558553,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
8,1.959642,3.693430,0.010060,0.869998,1.559695,1.543257,0.000000,0.0,2.37288,0.000000,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000
9,0.979821,1.846715,0.015091,1.739995,1.039797,0.514419,1.377970,0.0,0.79096,0.528373,...,0.000000,0.000000,2.279276,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000,0.000000


Finally, apply the classifier to the test data:

In [24]:
results_df = pd.DataFrame({'predicted':spamFilter3_knn.predict(testData_df),
                           'actual':testCorpusClasses_ls})

results_df

  actual predicted
0    ham       ham
1    ham       ham
2    ham       ham
3    ham       ham
4    ham       ham
5    ham      spam
6    ham       ham
7    ham       ham
8    ham       ham
9    ham       ham


## Evaluating the filter

Again, we can use the `pd.crosstab` function to present the results in a more readable way:

In [25]:
tabulatedResults_df = pd.crosstab(results_df.predicted, results_df.actual, margins=True)

tabulatedResults_df

actual     ham  spam  All
predicted                
ham        197     1  198
spam         3   199  202
All        200   200  400


We can now print the results, and give an overall percentage accuracy (total number of emails that were correctly classified into *ham* or *spam*):

In [26]:
print('Ham correctly classified as ham: {}/{}'.format(tabulatedResults_df['ham']['ham'],
                                                      tabulatedResults_df['ham']['All']))

print('Ham incorrectly classified as spam: {}/{}'.format(tabulatedResults_df['ham']['spam'],
                                                         tabulatedResults_df['ham']['All']))

print('Spam incorrectly classified as ham: {}/{}'.format(tabulatedResults_df['spam']['ham'],
                                                         tabulatedResults_df['spam']['All']))

print('Spam correctly classified as spam: {}/{}'.format(tabulatedResults_df['spam']['spam'],
                                                        tabulatedResults_df['spam']['All']))

print('Overall system accuracy: {:.1%}'.format((tabulatedResults_df['ham']['ham'] + 
                                                tabulatedResults_df['spam']['spam']) / 
                                                     tabulatedResults_df['All']['All']))

Ham correctly classified as ham: 197/200
Ham incorrectly classified as spam: 3/200
Spam incorrectly classified as ham: 1/200
Spam correctly classified as spam: 199/200
Overall system accuracy: 99.0%
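
The overall accuracy can be verified by hand from the four counts in the table. The sketch below repeats that arithmetic with the counts typed in directly, and also shows the fraction of each class that was correctly recovered; the variable names are chosen here for illustration.

```python
# Counts taken from the tabulated results above
ham_as_ham, ham_as_spam = 197, 3
spam_as_ham, spam_as_spam = 1, 199

total = ham_as_ham + ham_as_spam + spam_as_ham + spam_as_spam
accuracy = (ham_as_ham + spam_as_spam) / total
print('{:.1%}'.format(accuracy))   # 99.0%

# Fraction of each class that was classified correctly
ham_rate = ham_as_ham / (ham_as_ham + ham_as_spam)
spam_rate = spam_as_spam / (spam_as_ham + spam_as_spam)
print('{:.1%} {:.1%}'.format(ham_rate, spam_rate))   # 98.5% 99.5%
```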


This is an (even) stronger result than our previous attempt.

The key message to take away here is that we have managed to greatly improve the behaviour of our data application by considering the nature of the data (natural language documents), choosing an appropriate similarity measure (cosine similarity) and using knowledge of the dataset to improve the way the data is processed (tf.idf measures) in a way that is appropriate for the particular application.

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 22 Notebooks.