# From text to a numerical representation
Typically, Machine Learning algorithms require a vectorial representation for their data. In the IR module, we saw how text can be pre-processed, tokenized, and converted to vector representations. In this tutorial we will be re-using these ideas to create vector representations of documents for the purpose of training classifiers.

We will be using the `scikit-learn` Python library, which provides functionality for pre-processing text and for training and using classifiers.

In [None]:
## some configurations for notebook and importing modules
%matplotlib inline
%load_ext autoreload
%autoreload 2

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
np.random.seed(6490)

In this tutorial, we will be using the `newsgroup dataset`, in which each document belongs to one of nine categories.

We begin by loading the dataset into a `pandas` DataFrame, a two-dimensional labeled data structure (similar to a table) in which columns represent attributes and rows represent data instances.

In the resulting `dataset` DataFrame, each row is a document, and the columns hold the id, category, and text of that document.
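To make the table layout concrete, here is a minimal sketch of a DataFrame with the same kind of columns. The rows, the category names, and the exact column name `id` are made up for illustration; the real data comes from `read_as_df`.

```python
import pandas as pd

# Hypothetical toy rows; the real dataset is loaded by read_as_df.
toy = pd.DataFrame({
    "id": [101, 102, 103],
    "category": ["sci.space", "rec.autos", "sci.space"],
    "text": ["The shuttle launch was delayed.",
             "My car needs new brakes.",
             "Orbit insertion burn completed."],
})

print(toy.shape)                       # (number of rows, number of columns)
print(toy["category"].value_counts())  # how many documents each category has
print(toy.loc[0, "text"])              # text of the first document
```

Selecting a column (`toy["text"]`) returns a `pandas` Series, which is what `apply` operates on when a function is run over every document.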

In [None]:
from data import read_as_df
from prepros import preprocessor
import os.path

path_to_dataset = os.path.join('question_1_data', 'newsgroups')

dataset = read_as_df(path_to_dataset)
dataset.head()

We use the `preprocessor` function from the file `prepros.py` to pre-process each document in the dataset.
This will add a new column to the previously created Dataframe.

**NB** This next block might take a while to execute.

In [None]:
dataset['tokens'] = dataset['text'].apply(preprocessor)
dataset.head()


## CountVectorizer

We use scikit-learn's [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for vectorizing the tokens.

CountVectorizer can itself remove stop words and convert text into lowercase tokens.
However, we will not use these options, since we have already built a more sophisticated tokenizer that can also stem tokens (stemming is not readily available in scikit-learn).

To use our tokenizer, we define CountVectorizer with the option `tokenizer = lambda x: x`, which asks CountVectorizer to apply the identity function, since we already have a list of tokens for each document.

Note that we can set the `binary=True` option to use a boolean representation. Setting it to `False` outputs a term-frequency representation.
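The effect of the identity tokenizer and of the `binary` option can be seen on a tiny example. This is only a sketch on two made-up, already tokenized documents; the helper `make_vectorizer` is hypothetical, and `token_pattern=None` is added here only to silence the warning that recent scikit-learn versions emit when a custom tokenizer is supplied.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical pre-tokenized documents (each document is a list of tokens).
docs = [["cat", "sat", "cat"], ["dog", "sat"]]

def make_vectorizer(binary):
    # Hypothetical helper: builds a vectorizer that accepts token lists as input.
    return CountVectorizer(lowercase=False,
                           tokenizer=lambda tokens: tokens,  # identity: input is already tokenized
                           token_pattern=None,               # unused when a tokenizer is given
                           binary=binary)

tf_vectorizer = make_vectorizer(binary=False)
tf_counts = tf_vectorizer.fit_transform(docs).toarray()
bool_counts = make_vectorizer(binary=True).fit_transform(docs).toarray()

print(sorted(tf_vectorizer.vocabulary_))  # ['cat', 'dog', 'sat']
print(tf_counts.tolist())                 # [[2, 0, 1], [0, 1, 1]]
print(bool_counts.tolist())               # [[1, 0, 1], [0, 1, 1]]
```

The only difference between the two matrices is the entry for `cat` in the first document: it occurs twice, so the term-frequency representation stores 2 while the boolean representation stores 1.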

In [None]:
from sklearn.feature_extraction.text import CountVectorizer
bow_vectorizer = CountVectorizer(lowercase = False, 
                                     tokenizer = lambda x: x, # because we already have tokens available
                                     stop_words = None, ## stop words removal already done from NLTK
                                     max_features = 5000, ## pick top 5K words by frequency
                                     ngram_range = (1, 1), ## we want unigrams for now
                                     binary = False) ## we want frequency count features
text_vec = bow_vectorizer.fit_transform(dataset.tokens)
print(text_vec[0]) ## non-zero entries of the first document, printed as "(row, column) count"

The vectorial representation represents each document as a vector, in which each dimension corresponds to a word/token in the vocabulary built from the entire dataset/corpus.

For example, one of the lines in the output reads:

`(0, 3863)	6`

which means that the first document contains 6 occurrences of the feature (word) with index 3863.

Let's see which word corresponds to that index, and what the first 100 features are:

In [None]:
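## get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print(bow_vectorizer.get_feature_names_out()[3863])
## list of feature names (they are just tokens here)
print(bow_vectorizer.get_feature_names_out()[:100])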

## The sparsity of the feature space
The feature space is sparse, so CountVectorizer represents the documents using a [sparse matrix](https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.csr_matrix.html) instead of a [dense matrix](https://docs.scipy.org/doc/numpy-1.13.0/reference/arrays.html)
(if each document is a vector, a collection of documents corresponds to a matrix).
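As a small illustration of the difference, the sketch below builds a made-up dense matrix that is mostly zeros and converts it to SciPy's CSR format, which stores only the non-zero values. The numbers are arbitrary and unrelated to the newsgroup data.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical 3 x 6 document-term matrix that is mostly zeros.
dense = np.array([[0, 0, 3, 0, 0, 0],
                  [0, 1, 0, 0, 0, 0],
                  [0, 0, 0, 0, 0, 2]])
sparse = csr_matrix(dense)

# Fraction of cells that actually hold a value.
density = sparse.nnz / (sparse.shape[0] * sparse.shape[1])

print(sparse.nnz)  # number of stored (non-zero) values
print(density)     # fraction of non-zero cells
print(sparse[0])   # printed as "(row, column) value" pairs
```

Only three values are stored instead of eighteen. For a vocabulary of thousands of tokens, where each document uses only a few of them, the saving is far larger.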

To verify this, we can check how many features are enabled in the matrix corresponding to our dataset:

In [None]:
n_cells = text_vec.shape[0] * text_vec.shape[1]  ## documents x vocabulary size (max_features = 5000)
print('{} values are set, out of a maximum of {} = {:.2f}%'.format(
    text_vec.nnz, n_cells, 100.0 * text_vec.nnz / n_cells))

As we can see, only a small percentage of the matrix elements are set, so representing the data with a dense matrix would be an inefficient use of memory.
Hence, `scikit-learn`'s [CountVectorizer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) internally uses sparse representations.

# Question 1 [7 pts]

<span style="color:blue">
    
### Sections 1.2-1.5 of this notebook each have some tasks for you to complete; these are marked in blue. Completing these tasks is question 1 of the ML assignment, and you will need to submit this notebook along with your NLP Assignment.ipynb notebook.
</span>

# 1. Building a classifier

## 1.1 Naive Bayes Classifier

Having represented documents in the Vector Space Model, we can now start building a Naive Bayes textual classifier.

To do so, we will need counts of the term occurrences when computing $\hat P(t | c)$.
This amounts to constructing a term-frequency vectorial representation.
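To show how term counts turn into $\hat P(t | c)$, here is a minimal sketch on made-up documents. It assumes add-one (Laplace) smoothing, i.e. $\hat P(t | c) = \frac{count(t, c) + \alpha}{\sum_{t'} count(t', c) + \alpha |V|}$ with $\alpha = 1$; the documents, class names, and the helper `term_probs` are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized training documents with their classes.
train_docs = [
    (["rocket", "launch", "orbit"], "space"),
    (["rocket", "engine"], "space"),
    (["engine", "brake", "wheel"], "autos"),
]

# Vocabulary built from the whole (toy) corpus.
vocab = sorted({t for tokens, _ in train_docs for t in tokens})

def term_probs(cls, alpha=1.0):
    """Smoothed estimate of P(t | c) from the term counts of class `cls`."""
    counts = Counter(t for tokens, c in train_docs if c == cls for t in tokens)
    total = sum(counts.values())
    return {t: (counts[t] + alpha) / (total + alpha * len(vocab)) for t in vocab}

p_space = term_probs("space")
print(p_space["rocket"])  # (2 + 1) / (5 + 6) = 3/11
print(p_space["wheel"])   # never seen in class "space", yet non-zero: 1/11
```

The smoothing keeps unseen terms from receiving probability zero, which would otherwise zero out the whole product for a document containing such a term.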

Therefore, the [Naive Bayes classifier](http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB) implementation available in `scikit-learn` requires the document collection to be in a vectorial representation prior to training.

Next, we split the dataset into a training set (75% of the dataset) and a testing set (25% of the dataset). 
We train a Naive Bayes Classifier on the training set, and we perform the predictions on the test set.

**NB** We are using `LabelEncoder` here to encode the labels/classes of documents as numbers.
The 9 classes will be mapped into numbers from 0 to 8 using this label encoder. 
We require this to render the dataset compatible with `scikit-learn` and the plotting libraries.
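A small sketch of what the label encoder does, using made-up category names rather than the actual nine classes. The variable name `le_toy` is chosen so that it does not interfere with the encoder used in the rest of the notebook.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical category labels, one per document.
categories = ["sci.space", "rec.autos", "sci.space", "comp.graphics"]

le_toy = LabelEncoder()
encoded = le_toy.fit_transform(categories)

print(list(le_toy.classes_))                      # classes in sorted order
print(encoded.tolist())                           # [2, 1, 2, 0]
print(le_toy.inverse_transform([0, 2]).tolist())  # ['comp.graphics', 'sci.space']
```

The classes are numbered in sorted order, and `inverse_transform` maps the numbers back to the original names, which is useful when printing predictions.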

In [None]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.preprocessing import LabelEncoder

## random boolean mask: roughly 75% of the documents go to the training set
msk = np.random.rand(len(dataset)) < 0.75
le = LabelEncoder()

train_X = text_vec[msk]
test_X = text_vec[~msk]

y = le.fit_transform(dataset.category)
train_y = y[msk]
test_y = y[~msk]

We train the classifier using:

In [None]:
classifier =  MultinomialNB()
classifier.fit(train_X, train_y)

To make predictions, we use the `predict` method.

In [None]:
preds_bow = classifier.predict(test_X)
to_print = [le.inverse_transform([pred])[0] for pred in preds_bow ]
print(to_print[:100])

## 1.2 Evaluating the prediction performance [1 pts]
To evaluate how well the classifier performed, we compute the confusion matrix, as well as the overall accuracy, and the per-class precision, recall and F1 measure. 
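To make the definitions concrete, the sketch below derives each measure by hand from a made-up 3-class confusion matrix, with rows as true classes and columns as predicted classes. It only illustrates the arithmetic and is not tied to the classifier trained above.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
confusion = np.array([[8, 1, 1],
                      [2, 6, 2],
                      [0, 1, 9]])

true_pos = np.diag(confusion)                 # correctly classified documents per class
precision = true_pos / confusion.sum(axis=0)  # correct / everything predicted as the class
recall = true_pos / confusion.sum(axis=1)     # correct / everything truly in the class
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of precision and recall
accuracy = true_pos.sum() / confusion.sum()   # correct / all documents

print(precision)  # per-class precision
print(recall)     # per-class recall
print(f1)         # per-class F1
print(accuracy)   # overall accuracy
```

Note how class 1 has the same precision as class 2 but a much lower recall: many of its documents were assigned to other classes, which accuracy alone would not reveal.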

<span style="color:blue">

### You will need to fill in the code that computes these measures. Note that they are all implemented in the `scikit-learn` library, and you should make use of this. [1 pts]
</span>

In [None]:
def print_metrics(y, pred_y):
    # Correctly assign these variables
    raise NotImplementedError
    confusion = None
    acc = None
    precisions, recalls, f1_scores, _ = None

    print("accuracy = {}".format(acc))

    print("{:>25} {:>4} {:>4} {:>4}".format("", "prec", "rec", "F1"))
    for (idx, scores) in enumerate(zip(precisions, recalls, f1_scores)):
        print("{:>25} {:.2f} {:.2f} {:.2f}".format(
            le.inverse_transform([idx])[0], scores[0], scores[1], scores[2]
    ))

    print('confusion matrix:\n{}'.format( confusion) )
    
    return acc

acc_bow = print_metrics(test_y, preds_bow)

## 1.3 Improving the performance with different feature representations [2 pts]

We attempt to improve the classifier's performance using other features.
We start with boolean features: whenever a feature (token) appears in a document, we record a value of 1 instead of the number of occurrences of that token.

We can chain the vectorizer and the classifier into a pipeline in scikit-learn. Refer to [this documentation](http://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html) for more information.

In [None]:
le = LabelEncoder()

train_X = dataset.tokens[msk]
test_X = dataset.tokens[~msk]
y = le.fit_transform(dataset.category)
train_y = y[msk]
test_y = y[~msk]

Again we will use CountVectorizer; the difference here is the `binary = True` argument, which tells CountVectorizer to use binary features instead of term frequencies.

<span style="color:blue">

### You will need to fill in the code to fit the pipeline model on the training data and create predictions for the test data. [1 pts]
</span>

In [None]:
bin_vectorizer = CountVectorizer(lowercase = False, 
                                     tokenizer = lambda x: x, # because we already have tokens available
                                     stop_words = None, ## stop words removal already done from NLTK
                                     max_features = 5000, ## pick top 5K words by frequency
                                     ngram_range = (1, 1), ## we want unigrams now
                                     binary = True) ## Now it is Binary

from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('bow',  bin_vectorizer),
    ('naive-bayes',  MultinomialNB()) ])

# Train the pipeline and create predictions for the test set.
raise NotImplementedError
preds_bin = None

acc_bin = print_metrics(test_y, preds_bin)

<span style="color:blue">

### Did changing the model to use boolean features instead of term-frequency increase or decrease the model's performance? Explain why the performance did or did not change. [1 pts]
</span>

<span style="color:blue"> YOUR ANSWER HERE </span>

## 1.4 Improving the performance with TF-IDF [2 pts]
TF-IDF reflects how important a token/term is to a document, with respect to the entire collection of documents. 
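As a worked example of the idea, the sketch below computes weights by hand using the textbook form $tfidf(t, d) = tf(t, d) \cdot \log(N / df(t))$ on three made-up documents. This textbook formula is an assumption made for illustration; scikit-learn's `TfidfTransformer` by default uses a smoothed idf and normalizes each document vector, so its numbers will differ.

```python
import math

# Hypothetical tokenized documents.
docs = [["rocket", "launch", "rocket"],
        ["rocket", "engine"],
        ["engine", "brake"]]

n_docs = len(docs)
vocab = sorted({t for d in docs for t in d})
df = {t: sum(t in d for d in docs) for t in vocab}  # number of documents containing t
idf = {t: math.log(n_docs / df[t]) for t in vocab}  # rarer terms get a larger idf

def tfidf(doc):
    """Textbook TF-IDF weights for the tokens of one document."""
    return {t: doc.count(t) * idf[t] for t in set(doc)}

weights = tfidf(docs[0])
print(weights["rocket"])  # tf = 2, but appears in 2 of 3 documents
print(weights["launch"])  # tf = 1, but appears in only 1 of 3 documents
```

Although `rocket` occurs twice in the first document, `launch` ends up with the larger weight because it is rarer in the collection.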

<span style="color:blue">


### To build a pipeline with TF-IDF representation, you will need to add a [TfidfTransformer](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html) after the bag-of-words vectorizer. [1 pts]
</span>

This means that we are transforming the token counts using TF-IDF.

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# Create the pipeline and train and evaluate the model.
raise NotImplementedError
pipeline = None
preds_tfidf = None

acc_tfidf = print_metrics(test_y, preds_tfidf)

<span style="color:blue">

### Did changing the model to use tf-idf features increase or decrease the model's performance? Explain why the performance did or did not change. [1 pts]
</span>

<span style="color:blue"> YOUR ANSWER HERE </span>

In [None]:
accuracies = pd.DataFrame(
    [('tf', acc_bow), ('binary', acc_bin), ('tfidf', acc_tfidf)], 
    columns = ['feature_rep', 'accuracy']
).set_index('feature_rep')
accuracies.plot.bar(ylim = (0.7, 0.9))

## 1.5 Further improving performance with bigram features [2 pts]
Until now, we created feature representations using unigrams, i.e. taking one token as a feature. 
The main disadvantage of doing this is that we lose positional information in a unigram feature representation.
To address this, we can use n-grams as features: we use sequences of n consecutive tokens to construct features.
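The construction of n-grams from a token list can be sketched in a few lines of plain Python. The helper `ngrams` and the example sentence are hypothetical and only illustrate what an n-gram feature is.

```python
def ngrams(tokens, n):
    """Return all contiguous n-token sequences, each joined by a space."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["new", "york", "is", "not", "new", "jersey"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

print(unigrams)  # the tokens themselves
print(bigrams)   # ['new york', 'york is', 'is not', 'not new', 'new jersey']
```

With unigrams alone, both place names contribute the same token `new`; the bigrams `new york` and `new jersey` keep them apart, at the cost of a larger feature space.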

<span style="color:blue">

### You will need to train your best performing model, this time using both unigram and bigram features. [1 pts]
</span>

In [None]:
from sklearn.feature_extraction.text import TfidfTransformer

# Define the CountVectorizer to use bigrams as well as unigrams.
raise NotImplementedError
bigrams_bow_vectorizer = None
pipeline = None
preds_tfidf_bigrams = None

acc_tfidf_bigrams = print_metrics(test_y, preds_tfidf_bigrams)

<span style="color:blue">
    
### Did changing the model to use bigram and unigram features increase or decrease the model's performance? Explain why the performance did or did not change. [1 pts]
</span>

<span style="color:blue"> YOUR ANSWER HERE </span>

In [None]:
accuracies = pd.DataFrame(
    [('unigrams', acc_tfidf),  ('bigrams', acc_tfidf_bigrams)], 
    columns = ['n-grams', 'accuracy']
).set_index('n-grams')
accuracies.plot.bar(ylim = (0.8, 0.9))

Let's see what the bigram features look like.

In [None]:
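## get_feature_names_out() requires scikit-learn >= 1.0; older versions use get_feature_names()
print(bigrams_bow_vectorizer.get_feature_names_out()[:100])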

We can see that sequences of two tokens are now used as features.

# ADDITIONAL MATERIAL 

## Distance metrics and searching in the Vector Space Model

### Distance metrics for a document

In the previous tutorial, we used boolean logic on the set representation of features to perform exact document matching. With the vector representations introduced in this module, we can query for partial matches.

The degree of matching can be quantified by similarity metrics. 
Similarity metrics can be computed from distance metrics, where the distance between documents is computed in the vector space. There are two popular choices of distance metric in this space:
1. Cosine distance
2. Euclidean distance
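Both distances can be sketched with NumPy as follows. This assumes cosine distance is defined as one minus the cosine similarity; the helpers `cosine_dist` and `euclid_dist` are hypothetical and are not the functions provided in `dist.py`.

```python
import numpy as np

def cosine_dist(u, v):
    """1 - cosine similarity: depends only on the angle between u and v."""
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def euclid_dist(u, v):
    """Straight-line distance between u and v."""
    return np.linalg.norm(u - v)

a = np.array([1.0, 2.0, 0.0])
b = np.array([2.0, 4.0, 0.0])  # same direction as a, twice as long
c = np.array([0.0, 0.0, 3.0])  # orthogonal to a

print(cosine_dist(a, b), euclid_dist(a, b))  # cosine sees a and b as identical, Euclidean does not
print(cosine_dist(a, c), euclid_dist(a, c))  # orthogonal vectors: cosine distance of 1
```

Cosine distance ignores vector length, so a long document and a short one with the same word proportions count as identical, whereas Euclidean distance is sensitive to length.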

To search for similar documents, we return the candidates that have the minimum distance to the query's vector representation. We use the TF-IDF vectorizer with a unigram representation for this.

In [None]:
vectorizer = Pipeline([
    ('bow',  bow_vectorizer),
    ('tfidf',  TfidfTransformer())])
vectorizer.fit(dataset.tokens)
tfidf_vec = vectorizer.transform(dataset.tokens)

Let's use this vectorizer to create a vector representation of our previous two queries: `research seminar` and `scientific visualization`

In [None]:
tokens1 = preprocessor("research seminar")
query_vec1 = vectorizer.transform([tokens1])

tokens2 = preprocessor("scientific visualization")
query_vec2 = vectorizer.transform([tokens2])

The implementations of the distance metrics are provided in the file `dist.py`.
Have a look at the functions provided there.

We can use the provided `dist` function as follows:

In [None]:
from dist import dist, search
cosine_distance = dist(query_vec1.toarray().squeeze(), query_vec2.toarray().squeeze(), method = 'cosine')
euclid_distance = dist(query_vec1.toarray().squeeze(), query_vec2.toarray().squeeze(), method = 'euclid')

print('cosine distance = {}, euclid distance = {}'.format(cosine_distance, euclid_distance))

### Search and ranking

We use the same distance measures to search for similar documents for a query text. 
The most similar documents in our collection are the ones that have the lowest distance to the query string.
We can also use the distance to rank the search results.
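Ranking by distance amounts to sorting the document indices by their distance to the query and keeping the first few. The sketch below does this with cosine distance on made-up vectors; `rank_by_cosine` is a hypothetical helper and not the `search` function from `dist.py`, and it assumes no document vector is all zeros.

```python
import numpy as np

# Hypothetical document vectors (one row per document) and a query vector.
doc_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.7, 0.7, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
query = np.array([1.0, 0.2, 0.0])

def rank_by_cosine(docs, q, top_k=2):
    """Indices of the top_k documents with the smallest cosine distance to q."""
    sims = docs @ q / (np.linalg.norm(docs, axis=1) * np.linalg.norm(q))
    dists = 1.0 - sims
    return np.argsort(dists)[:top_k]  # ascending distance: best match first

top = rank_by_cosine(doc_vecs, query)
print(top.tolist())  # [0, 1]
```

The returned indices can be passed to `iloc` to retrieve the matching rows in ranked order.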

Have a look at the `search` function in `dist.py`.

Top 5 matches with `research seminar`:

In [None]:
## get indexes of the most similar documents 
idxs1 = search(tfidf_vec.toarray().squeeze(),
        query_vec1.toarray().squeeze(),
        dist_measure = 'cosine'
)

## ranked top 5 search results for query 'research seminar'
dataset.iloc[idxs1]

Top 5 matches with `scientific visualization` based on euclidean distance measure:

In [None]:
idxs2 = search(tfidf_vec.toarray().squeeze(),
        query_vec2.toarray().squeeze(),
        dist_measure = 'euclid'
)

## ranked top 5 search results for query 'scientific visualization'
dataset.iloc[idxs2]

## AUC and ROC 

Another popular metric for evaluating the per-class performance of a classifier is the [Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC)](https://en.wikipedia.org/wiki/Receiver_operating_characteristic).

The ROC curve plots the true positive rate (sensitivity) against the false positive rate (1 - specificity) for different cut-off points.
Each point on the ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold.

AUC is the area under the ROC curve. $AUC \in [0, 1]$, and a value of $0.5$ corresponds to a random classifier; values below $0.5$ indicate a classifier that does worse than random. Higher is better.
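A minimal sketch of these quantities on a made-up binary problem, using scikit-learn's `roc_curve` and `auc`. The labels and scores are hypothetical; in the multi-class setting, each class is treated as "positive" in turn against all the others.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Hypothetical binary labels and classifier scores (higher = more likely positive).
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# One (false positive rate, true positive rate) point per decision threshold.
fpr, tpr, thresholds = roc_curve(y_true, scores)
area = auc(fpr, tpr)

print(list(zip(fpr.tolist(), tpr.tolist())))  # points on the ROC curve
print(area)                                   # 0.75
```

The AUC is 0.75 here because three of the four (positive, negative) pairs are ranked correctly: only the positive with score 0.35 is scored below the negative with score 0.4.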

The block below demonstrates how we can use matplotlib (a Python plotting library) and scikit-learn's evaluation metric functions to plot the per-class performance of our classifier.
We will see the plot for our best performing classifier (i.e. TF-IDF with a bigram feature representation).

In [None]:
from itertools import cycle
from sklearn.metrics import roc_curve, auc

pipeline = Pipeline([
    ('bigram_bow',  bigrams_bow_vectorizer),
    ('tfidf',  TfidfTransformer()),
    ('naive-bayes',  MultinomialNB()) ])

## Fit the data
pipeline.fit(train_X, train_y)

## This function plots the ROC curve
def plot_roc(labels, probs, le):
    colors = cycle(['aqua', 'red', 'green', 'blue', 'yellow', 'cyan', 'magenta', 'violet', 'purple', 'black', 'grey'])
    fpr, tpr = dict(), dict()
    roc_auc = dict()
    for label in range(len(list(le.classes_))):
        color = next(colors)
        ## one-vs-rest: use the `labels` argument rather than the global test_y
        fpr[label], tpr[label], _ = roc_curve(labels == label, probs[:, label])
        roc_auc[label] = auc(fpr[label], tpr[label])
        plt.plot(fpr[label], tpr[label], color = color, lw=2,
                 label ='ROC of {0} | auc = {1:0.2f}'
                 ''.format(le.inverse_transform([label])[0], roc_auc[label]))
    plt.xlim([0.0, 1.1])
    plt.ylim([0.0, 1.1])
    plt.xlabel('False Positive Rate')
    plt.ylabel('True Positive Rate')
    plt.title('ROC Curve')
    plt.legend(loc="lower right")

plt.figure(figsize=(9, 6))  # the plot size you want

## predict class probabilities for the test set and plot the per-class ROC curves
preds_tfidf_bigrams = pipeline.predict_proba(test_X)
plot_roc(test_y, preds_tfidf_bigrams, le)