# GIAN 6: Classifying documents

This notebook illustrates some basic machine learning techniques for supervised text classification. It also shows you how to evaluate classification performance.

In [None]:
from collections import Counter, defaultdict
from matplotlib import pyplot as plt
%matplotlib inline

In this notebook, we introduce a new package called [scikit-learn](http://scikit-learn.org), abbreviated as *sklearn*. This package contains a wealth of machine learning techniques that can be applied in text mining. It should have been installed automatically with [Anaconda](https://anaconda.org/). If it is not installed yet, install the package named `scikit-learn` (for example with `conda install scikit-learn`).

## Supervised classification

In text mining, a supervised classification task requires:

+ A collection of text-containing elements. These elements can be books, Twitter messages, emails, etc.;

+ For each element, a label, usually provided by a human rater at some point.

The goal of supervised classification is to learn how labels can be predicted from characteristics of the text (also called features), and to be able to assign a label to a new element based on these features.

## Spam and ham

A typical text classification task is spam detection.

Spam detection requires a collection of messages that a human has labeled as either:

+ unwanted (SPAM)
+ wanted (HAM)

As an example we will use a collection of SMS messages that have been labeled SPAM or HAM.  

In [None]:
messages = []
labels = []
# Each line of the file holds a label and a message, separated by a tab
with open("GIAN6_data/SMSSpamCollection", encoding="utf-8") as f:
    for line in f:
        label, message = line.strip().split("\t")
        messages.append(message)
        labels.append(label.upper())

Let's look at the first ten messages and their labels:

In [None]:
for i in range(10):
    print(labels[i], ":", messages[i])

## Transforming the text for classification (feature extraction)

Our next step is to transform the texts into something that is useful for a classifier. This is also called feature extraction.

In this case, we will transform the text into a "bag of words". This means that we extract the word types from the document and "throw them together in a bag". The bag contains all of the different words in the text, but the words are unordered.
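
To make the "bag" idea concrete, here is a minimal sketch using only the standard library. The helper `bag_of_words` and the two sentences are hypothetical illustrations, and splitting on whitespace is a simplifying assumption. The sketch also keeps a count per word, which goes slightly beyond the set of word types.

```python
from collections import Counter

def bag_of_words(text):
    """Return the words of a text with their counts; word order is discarded."""
    # Assumption: lowercasing and splitting on whitespace is enough for this toy example
    return Counter(text.lower().split())

# Two hypothetical sentences that use the same words in a different order
bag_a = bag_of_words("the dog bites the man")
bag_b = bag_of_words("the man bites the dog")

print(set(bag_a))      # the word types in the bag
print(bag_a == bag_b)  # both sentences end up with the same bag
```

Because the order is thrown away, the two sentences become indistinguishable, which is the price paid for the simplicity of this representation.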

###  Vectorizing
As most machine learning methods expect numerical data, we need to transform the *bag of words* to a set of numbers.

Let's say we have three documents:

| Document | Content                               |
|----------|---------------------------------------|
| 1        | I've been to Hollywood                |
| 2        | I've been to Redwood                  |
| 3        | I've been a miner for a heart of gold |


After tokenizing, we can transform each document into a numeric vector (a row of numbers) by using all the possible words as columns.

We obtain a matrix with words as columns and documents as rows.

| Document | i | 've | been | to | hollywood | redwood | a | miner | for | heart | of | gold |
|----------|---|-----|------|----|-----------|---------|---|-------|-----|-------|----|------|
| 1        | 1 | 1   | 1    | 1  | 1         | 0       | 0 | 0     | 0   | 0     | 0  | 0    |
| 2        | 1 | 1   | 1    | 1  | 0         | 1       | 0 | 0     | 0   | 0     | 0  | 0    |
| 3        | 1 | 1   | 1    | 0  | 0         | 0       | 1 | 1     | 1   | 1     | 1  | 1    |

For clarity's sake we have put the words in the order they occur in the documents, but the actual order of the columns is irrelevant.
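
The construction of such a matrix can be sketched in plain Python. The tokenizer `simple_tokenize` below is a hypothetical stand-in (a regular expression that lowercases the text and splits off clitics such as `'ve`); it is only meant to reproduce the tokens used in this example.

```python
import re

corpus = ["I've been to Hollywood",
          "I've been to Redwood",
          "I've been a miner for a heart of gold"]

def simple_tokenize(text):
    # Hypothetical tokenizer: lowercase, then split into words and clitics like 've
    return re.findall(r"'\w+|\w+", text.lower())

tokenized = [simple_tokenize(doc) for doc in corpus]

# Build the vocabulary in order of first occurrence
vocabulary = []
for tokens in tokenized:
    for token in tokens:
        if token not in vocabulary:
            vocabulary.append(token)

# One row per document, one column per word: 1 if present, 0 if absent
matrix = [[1 if word in tokens else 0 for word in vocabulary]
          for tokens in tokenized]

print(vocabulary)
for row in matrix:
    print(row)
```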

In the example above, the cells in each row indicate whether the word in the column is present (1) or absent (0) in the document, but the cells may also include:

+ The frequency of the word in the document
+ The ${TF} \times {IDF}$ of the word
+ ... any other useful numerical measure
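
As a rough illustration of the first two alternatives, the sketch below computes word frequencies and a simple ${TF} \times {IDF}$ weight for the three example documents. It assumes one common textbook definition (raw count times $\log(N / df)$); the helper `tf_idf` is hypothetical, and sklearn's own implementation applies smoothing and normalization by default, so its numbers will differ.

```python
import math
from collections import Counter

# The three example documents, already tokenized
docs = [["i", "'ve", "been", "to", "hollywood"],
        ["i", "'ve", "been", "to", "redwood"],
        ["i", "'ve", "been", "a", "miner", "for", "a", "heart", "of", "gold"]]

n_docs = len(docs)
counts = [Counter(doc) for doc in docs]  # word frequencies per document

# Document frequency: in how many documents does each word occur?
doc_freq = Counter(word for doc in docs for word in set(doc))

def tf_idf(word, doc_index):
    # Assumption: tf is the raw count and idf is log(N / df);
    # only call this for words that occur in at least one document
    tf = counts[doc_index][word]
    idf = math.log(n_docs / doc_freq[word])
    return tf * idf

print(counts[2]["a"])          # frequency of "a" in the third document
print(tf_idf("i", 0))          # a word that occurs in every document
print(tf_idf("hollywood", 0))  # a word that occurs in only one document
```

Under this definition, a word that occurs in every document gets a weight of zero, while a word that is specific to one document gets a high weight.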


In [None]:
from sklearn.feature_extraction.text import CountVectorizer
vectorizer=CountVectorizer()

In [None]:
# Alternatively, use spaCy's tokenizer. Unlike sklearn's default tokenizer,
# it keeps one-character tokens and splits contractions such as "I've".
import spacy
import en_core_web_sm

nlp = en_core_web_sm.load(disable=["parser", "tagger", "ner"])

def my_tokenizer(text):
    # Return the lowercased form of every token in the text
    return [token.lower_ for token in nlp(text)]

vectorizer = CountVectorizer(tokenizer=my_tokenizer)

Load the corpus

In [None]:
corpus = ["I've been to Hollywood",
          "I've been to Redwood",
          "I've been a miner for a heart of gold"]

Transform the corpus into a matrix

In [None]:
corpus_m=vectorizer.fit_transform(corpus)

Show the column labels (words)

In [None]:
# Older sklearn versions (before 1.0) call this method get_feature_names()
vectorizer.get_feature_names_out()

Show the matrix. As you can see, it contains the count of each word in each document.

In [None]:
corpus_m.toarray()

## Using the naive Bayes classifier

We will now try to classify the SMS messages using a naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import MultinomialNB
cl_mnb=MultinomialNB()
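
To give an idea of what such a classifier does, here is a minimal hand-written sketch of multinomial naive Bayes with add-one smoothing. The toy training data and the `predict` helper are hypothetical, and this is not sklearn's implementation: it only illustrates the principle of picking the label with the highest prior probability times word probabilities.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy training data: (tokens, label) pairs
train = [(["win", "cash", "now"], "SPAM"),
         (["free", "cash", "prize"], "SPAM"),
         (["see", "you", "at", "lunch"], "HAM"),
         (["are", "you", "free", "now"], "HAM"),
         (["lunch", "at", "noon"], "HAM")]

label_counts = Counter(label for _, label in train)
word_counts = defaultdict(Counter)  # word frequencies per label
for tokens, label in train:
    word_counts[label].update(tokens)
vocabulary = {word for tokens, _ in train for word in tokens}

def predict(tokens):
    scores = {}
    for label in label_counts:
        # Log prior: share of training documents that carry this label
        score = math.log(label_counts[label] / len(train))
        total = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen during training
            # Add-one smoothing avoids zero probabilities for unseen word/label pairs
            p = (word_counts[label][token] + 1) / (total + len(vocabulary))
            score += math.log(p)
        scores[label] = score
    # Return the label with the highest log probability
    return max(scores, key=scores.get)

print(predict(["free", "cash"]))
print(predict(["lunch", "at", "noon"]))
```

Working with sums of logarithms instead of products of probabilities avoids numerical underflow when documents contain many words.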

Classification requires that we divide the messages into two sets:
+ The set that we train on (training set)
+ The set that we test the classifier on (test set)
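
Conceptually, such a split amounts to shuffling the data and cutting it in two while keeping messages and labels aligned. The sketch below shows this with the standard library only; the function `split_train_test`, its `test_size` and `seed` parameters, and the toy data are hypothetical.

```python
import random

def split_train_test(items, labels, test_size=0.3, seed=0):
    """Shuffle the data and cut it into a training part and a test part."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)  # a fixed seed makes the split repeatable
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    # Select items and labels with the same indices so they stay aligned
    return ([items[i] for i in train_idx], [items[i] for i in test_idx],
            [labels[i] for i in train_idx], [labels[i] for i in test_idx])

# Hypothetical toy data
toy_messages = ["msg %d" % i for i in range(10)]
toy_labels = ["SPAM" if i % 3 == 0 else "HAM" for i in range(10)]

m_train, m_test, l_train, l_test = split_train_test(toy_messages, toy_labels)
print(len(m_train), len(m_test))
```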

Sklearn includes a useful function to do this for us.

In [None]:
from sklearn.model_selection import train_test_split

Now we can split the SMS messages and labels into a training set and test set.

We will use 70% of the data for training and 30% for testing.

In [None]:
messages_train, messages_test, labels_train, labels_test = \
train_test_split(messages, labels, test_size=0.3)

Before we go on, let's not forget to vectorize the messages in the spam collection.

In [None]:
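# Learn the vocabulary from the training messages only,
# so that nothing from the test set leaks into the features
vectorizer.fit(messages_train)
messages_v_train = vectorizer.transform(messages_train)
messages_v_test = vectorizer.transform(messages_test)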

We can now train the naive Bayes classifier on the training data

In [None]:
cl_mnb.fit(messages_v_train, labels_train)

And see how well it scores on the test set! 

In [None]:
mnb_score=cl_mnb.score(messages_v_test, labels_test)

In [None]:
print(mnb_score)

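(The score is the accuracy: the fraction of test messages that received the correct label. A score of 1 means that all messages in the test set, both SPAM and HAM, were classified correctly.)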
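
Accuracy is simple to compute by hand, as the sketch below shows on a handful of hypothetical labels. It also illustrates a caveat: when most messages are HAM, a "classifier" that always answers HAM already reaches a fairly high accuracy.

```python
# Hypothetical true and predicted labels for six test messages
true_labels = ["HAM", "HAM", "SPAM", "HAM", "SPAM", "HAM"]
pred_labels = ["HAM", "HAM", "SPAM", "HAM", "HAM", "HAM"]

def accuracy(true, predicted):
    """Fraction of items whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(true, predicted))
    return correct / len(true)

print(accuracy(true_labels, pred_labels))

# Baseline that ignores the message and always predicts HAM
always_ham = ["HAM"] * len(true_labels)
print(accuracy(true_labels, always_ham))
```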

We can also print more detailed reports, including precision, recall, and f1-score

In [None]:
from sklearn.metrics import classification_report

In [None]:
plabels_test=cl_mnb.predict(messages_v_test)
print(classification_report(labels_test, plabels_test))

All of these measures are explained in the Wikipedia article on the [confusion matrix](https://en.wikipedia.org/wiki/Confusion_matrix).
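
As a small worked example, precision, recall, and F1 for the SPAM class can be computed from three counts: true positives (SPAM predicted as SPAM), false positives (HAM predicted as SPAM), and false negatives (SPAM predicted as HAM). The helper `precision_recall_f1` and the labels below are hypothetical.

```python
def precision_recall_f1(true, predicted, positive="SPAM"):
    pairs = list(zip(true, predicted))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    # Precision: which share of the predicted positives is really positive?
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Recall: which share of the real positives was found?
    recall = tp / (tp + fn) if tp + fn else 0.0
    # F1: harmonic mean of precision and recall
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Hypothetical true and predicted labels
true_labels = ["SPAM", "SPAM", "SPAM", "HAM", "HAM", "HAM", "HAM", "HAM"]
pred_labels = ["SPAM", "SPAM", "HAM", "SPAM", "HAM", "HAM", "HAM", "HAM"]

print(precision_recall_f1(true_labels, pred_labels))
```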

And of course, we can use some Python to look at which messages fall into each cell of the confusion matrix:


|             | Predicted SPAM      | Predicted HAM       |
|-------------|---------------------|---------------------|
| Actual SPAM | True Positive (TP)  | False Negative (FN) |
| Actual HAM  | False Positive (FP) | True Negative (TN)  |

In [None]:
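def confusion_dict(messages, xlabels, ylabels, ref):
    """Group messages by outcome; xlabels are the true labels, ylabels the predictions."""
    cd = defaultdict(list)
    for message, actual, predicted in zip(messages, xlabels, ylabels):
        if actual == predicted:
            # Correct prediction: positive if the true label is the reference class
            mtype = "TP" if actual == ref else "TN"
        else:
            # Wrong prediction: a missed reference item (FN) or a false alarm (FP)
            mtype = "FN" if actual == ref else "FP"
        cd[mtype].append(message)
    return cd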

In [None]:
cd_mnb=confusion_dict(messages_test, labels_test, plabels_test, "SPAM")

In [None]:
cd_mnb["TN"][:10]

## Using a k-nearest neighbors classifier

Since there are many classification algorithms to choose from, let's try another one.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
cl_knb=KNeighborsClassifier(n_neighbors=1)

In [None]:
cl_knb.fit(messages_v_train, labels_train)

In [None]:
cl_knb.score(messages_v_test, labels_test)
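
A nearest neighbors classifier labels a new document by looking at the most similar training documents and taking a majority vote among their labels. The sketch below illustrates the idea on hypothetical count vectors; the helper `knn_predict` and the use of Euclidean distance are assumptions made for this example.

```python
import math
from collections import Counter

def knn_predict(train_vectors, train_labels, vector, k=1):
    """Label a vector by majority vote among its k nearest training vectors."""
    # Assumption: Euclidean distance between count vectors
    distances = [(math.dist(v, vector), label)
                 for v, label in zip(train_vectors, train_labels)]
    distances.sort(key=lambda pair: pair[0])
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Hypothetical count vectors over the columns ["free", "cash", "lunch"]
train_vectors = [[2, 1, 0], [1, 2, 0], [0, 0, 1], [0, 0, 2], [1, 0, 1]]
train_labels = ["SPAM", "SPAM", "HAM", "HAM", "HAM"]

new_vector = [1, 1, 0]
for k in (1, 3, 5):
    print(k, knn_predict(train_vectors, train_labels, new_vector, k=k))
```

In this toy data, the prediction for the same vector changes once all five training documents get a vote, which shows why the number of neighbors matters.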

Above, the classifier based its decision on a single neighbor (`n_neighbors=1`); by default, sklearn uses 5 neighbors. Let's see what happens if we vary this value.

In [None]:
knb_scores=[]
for k in range(1, 16):
    cl_knb=KNeighborsClassifier(n_neighbors=k)
    cl_knb.fit(messages_v_train, labels_train)
    knb_score=cl_knb.score(messages_v_test, labels_test)
    knb_scores.append(knb_score)

In [None]:
plt.plot(range(1,16), knb_scores)
plt.xlabel("Number of neighbors")
plt.ylabel("Accuracy")
plt.show()

What would happen if we used a matrix with ${TF} \times {IDF}$ values instead of frequency values? 

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer_b=TfidfVectorizer(tokenizer=my_tokenizer)

In [None]:
# Learn the vocabulary and IDF weights from the training messages only
vectorizer_b.fit(messages_train)
messages_vb_train = vectorizer_b.transform(messages_train)
messages_vb_test = vectorizer_b.transform(messages_test)

In [None]:
knb_scores_b=[]
for k in range(1, 16):
    cl_knb=KNeighborsClassifier(n_neighbors=k)
    cl_knb.fit(messages_vb_train, labels_train)
    knb_score=cl_knb.score(messages_vb_test, labels_test)
    knb_scores_b.append(knb_score)

In [None]:
plt.plot(range(1,16), knb_scores, label="Frequency")
plt.plot(range(1,16), knb_scores_b, label=r"${TF} \times {IDF}$")
plt.xlabel("Number of neighbors")
plt.ylabel("Accuracy")
plt.legend(loc="best")
plt.show()