In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import gridspec
%matplotlib inline

In [None]:
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
## Setting this to False skips the lengthy grid search at the end of the notebook.
## A shorter grid search, taking around six minutes, is still executed.
RUN_GRID_SEARCH = False

# Text classification - Multinomial Bayes

The naive Bayes classifier is often used in text classification, e.g. for sentiment analysis or spam recognition. It assumes the so-called "bag of words" model. In this approach we treat a document as an unordered collection of words (tokens) and only count the number of times each word was used in the document. The list of all words that we consider is called the _vocabulary_.

We treat the vocabulary as __one__ gigantic categorical feature $X$. Then each document of length $n$ can be viewed as a draw from the multinomial distribution, i.e. we select $n$ words at random from the vocabulary _with replacement_. For more information see the references below.
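A minimal sketch of the bag of words idea, using only the standard library. The two documents are made up for illustration; they differ in word order but produce exactly the same counts.

```python
from collections import Counter

# Two hypothetical "documents": same words, different order.
doc_a = "good book good price"
doc_b = "price good good book"

def bag_of_words(text):
    """Count how many times each whitespace-separated token occurs."""
    return Counter(text.lower().split())

bag_a = bag_of_words(doc_a)
bag_b = bag_of_words(doc_b)

print(bag_a)           # counts per word
print(bag_a == bag_b)  # the order of words is ignored
```

Splitting on whitespace is a simplification assumed here; real tokenizers are more careful.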

### References 
 1. A. McCallum and K. Nigam (1998). [A comparison of event models for Naive Bayes text classification](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.46.1529). Proc. AAAI/ICML-98 Workshop on Learning for Text Categorization, pp. 41-48.
 1. V. Metsis, I. Androutsopoulos and G. Paliouras (2006). [Spam filtering with Naive Bayes – Which Naive Bayes?](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.61.5542) 3rd Conf. on Email and Anti-Spam (CEAS).


## Training

Our training data will consist of a number of documents with labels assigning them to $n_c$ different classes.

We start by converting documents into vectors. Each element of the vector corresponds to one word in the vocabulary and contains the number of times this word was used in the document. If $n_s$ is the number of documents (samples) and $n_w$ is the number of words in the vocabulary, then our data can be represented as an $n_s\times n_w$ matrix $\mathbf{x}$, and $y_i\in\{0,\ldots,n_c-1\}$ will represent the corresponding class labels.

Let $m_i$ denote the number of words in document $i$

$$m_i = \sum_{j=0}^{n_w-1} x_{ij}$$

As in the categorical Naive Bayes we split the matrix $\mathbf{x}$ into rows corresponding to different classes

$$\mathbf{x}^{(c)}\equiv\{\mathbf{x}_h: y_h=c\}$$ 

The total number of times the $i$-th word was used in documents belonging to class $c$ will be denoted by $n^{(c)}_i$

$$n^{(c)}_i= \sum_{j=0}^{n_s-1} x_{ji}\delta_{y_j,c},
\qquad \delta_{a,b}=  \begin{cases}
1 & a=b\\
0 & a\neq b
\end{cases}
$$

The total number of words in $\mathbf{x}^{(c)}$ will be denoted by $n^{(c)}$.

$$n^{(c)}= \sum_{i=0}^{n_w-1} n^{(c)}_i = \sum_{i=0}^{n_w-1}\sum_{j=0}^{n_s-1} x_{ji}\delta_{y_j,c}
$$

Then the smoothed estimator of the conditional probability

$$P(X=i|c)$$

is 

$$
p^{(c)}_{i} = \frac{n^{(c)}_i+\alpha}{n^{(c)}+n_w \alpha}, \quad \alpha\ge 0
$$
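The counts and the smoothed estimator above can be sketched in a few lines of numpy. The toy matrix `x`, the labels `y` and the value of `alpha` are made-up assumptions used only for illustration.

```python
import numpy as np

# Hypothetical toy data: 4 documents, a vocabulary of 3 words, 2 classes.
x = np.array([[2, 1, 0],
              [1, 0, 0],
              [0, 1, 3],
              [0, 0, 2]])
y = np.array([0, 0, 1, 1])
n_c = 2
alpha = 1.0

n_w = x.shape[1]
# n_ci[c, i]: number of times word i was used in documents of class c
n_ci = np.array([x[y == c].sum(axis=0) for c in range(n_c)])
# n_cls[c]: total number of words in documents of class c
n_cls = n_ci.sum(axis=1)
# smoothed estimate of P(X=i|c)
p = (n_ci + alpha) / (n_cls[:, None] + n_w * alpha)

print(p)
print(p.sum(axis=1))  # each row sums to one
```

With $\alpha>0$ no word gets zero probability, even if it never appeared in a given class.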

Once the $p^{(c)}_i$ are estimated, we compute the posterior probability of class $c$ for a document encoded as vector $\mathbf{x}_j$

$$P(C=c|\mathbf{x}_j) = \frac{P(\mathbf{x}_j|C=c)P(C=c)}{\sum_{c'} P(\mathbf{x}_j|C=c')P(C=c')}$$

where  $P(\mathbf{x}_j|C=c)$ is given by the [multinomial distribution](https://en.wikipedia.org/wiki/Multinomial_distribution)

$$P(\mathbf{x}_j|C=c)
=
\frac{(\sum_{i=0}^{n_w-1} x_{ji})!}{\prod_{i=0}^{n_w-1} x_{ji}!}
\prod_{i=0}^{n_w-1} \left(p^{(c)}_i\right)^{\displaystyle x_{ji}}$$

In practice most of the $x_{ji}$ are zero, so the sums and products in the above expression effectively range only over the unique tokens in the document.
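As a sketch of how the posterior can be evaluated numerically, note that the multinomial coefficient does not depend on the class, so it cancels between the numerator and the denominator. The probabilities `p`, the priors and the document `x_j` below are made-up assumptions; working with logarithms avoids underflow for long documents.

```python
import numpy as np

# Hypothetical word probabilities p[c, i] = P(X=i|c) for 2 classes and 3 words
p = np.array([[4/7, 2/7, 1/7],
              [1/9, 2/9, 6/9]])
prior = np.array([0.5, 0.5])   # P(C=c), assumed equal here
x_j = np.array([1, 0, 2])      # word counts of the document to classify

# log P(x_j|c) + log P(c), up to the class-independent multinomial coefficient
log_joint = x_j @ np.log(p).T + np.log(prior)
# normalize in a numerically stable way by subtracting the maximum first
log_joint -= log_joint.max()
posterior = np.exp(log_joint) / np.exp(log_joint).sum()

print(posterior)
```

The predicted class is simply the one with the largest posterior, here `posterior.argmax()`.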

## Amazon reviews

We will use the Amazon review data set. This data set is very handy because it contains both documents (reviews) and labels (ratings). We can use it to train a classifier that predicts ratings based on the review. The original data set is HUGE and can be found _e.g._ [here](http://jmcauley.ucsd.edu/data/amazon/). We will use the preprocessed data from
[Xiang Zhang's Google Drive dir](https://drive.google.com/open?id=0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M) that can be downloaded as a tar archive. However, it still has 1.5GB of data. So for the sake of this lecture I have prepared a smaller sample that I have additionally compressed with 'bz2', reducing the size to "only" 41MB. You can play with the original file by downloading it directly from the above link.

The data was selected using the code below. 

```
from sklearn.model_selection import train_test_split
seed = 85865
data = pd.read_csv("../../Data/amazon_reviews/train.csv",
                   names=["rating", "title", "review"])
small_data,_ = train_test_split(data,train_size=300000,  stratify=data['rating'], random_state=seed)
small_data.to_csv("../../Data/amazon_reviews/small.csv.bz2", index=False, compression='bz2')
```

The use of the `stratify` argument guarantees that the proportion of each rating will be preserved. In this case we will have the same number of documents with each rating. The smaller file can be read in using:

In [None]:
data = pd.read_csv("../../Data/amazon_reviews/small.csv.bz2", compression='bz2')

In [None]:
data.head(3)

The file contains

In [None]:
len(data)

reviews. We will split this set into training and testing sets: 

In [None]:
seed = 57576

train_data, test_data = train_test_split(data,train_size=250000, test_size=50000, stratify=data['rating'], random_state = seed)

We can check that we indeed have an equal number of documents for each rating.

In [None]:
train_data.rating.value_counts()

In [None]:
test_data.rating.value_counts()

In [None]:
train_data.head(5)

## Extracting features

The next step is to convert each document into a vector as described in the introduction. This is actually a non-trivial task. It requires first splitting the text into _tokens_ (e.g. words). Not all letter combinations are meaningful and we have to decide which tokens are valid. It may also be advantageous to do _stemming_: reducing a word to its _root_, e.g. 'flying' -> 'fly'.

The scikit-learn library contains a [CountVectorizer](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) class that implements a simple tokenizer and occurrence counting.

In [None]:
from sklearn.feature_extraction.text import CountVectorizer

In [None]:
vectorizer = CountVectorizer(stop_words='english')

You should consult the [documentation](https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.CountVectorizer.html) for the full description of the available arguments. I used only one: `stop_words='english'`, which enables the built-in list of tokens to reject (stop words), but see [the caveats](https://scikit-learn.org/stable/modules/feature_extraction.html#stop-words).

The method `fit_transform` takes the dataset, prepares the vocabulary and then encodes the dataset. It is a composition of the `fit` and `transform` methods.
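To make the two steps concrete, here is a rough pure-Python sketch of what fitting (building the vocabulary) and transforming (encoding documents as count vectors) amount to. The helper names `tokenize`, `fit` and `transform` and the sample documents are hypothetical, and this is not how scikit-learn implements it; in particular the real vectorizer returns a sparse matrix and can remove stop words.

```python
import re
from collections import Counter

TOKEN = re.compile(r"\b\w\w+\b")  # runs of two or more word characters

def tokenize(text):
    """Lowercase the text and split it into tokens."""
    return TOKEN.findall(text.lower())

def fit(documents):
    """Build the vocabulary: a mapping from token to column index."""
    tokens = sorted({t for doc in documents for t in tokenize(doc)})
    return {t: i for i, t in enumerate(tokens)}

def transform(documents, vocabulary):
    """Encode each document as a list of counts; unknown tokens are dropped."""
    rows = []
    for doc in documents:
        counts = Counter(tokenize(doc))
        rows.append([counts.get(t, 0) for t in vocabulary])
    return rows

train_docs = ["Great price, great book!", "A bad book."]
vocabulary = fit(train_docs)
encoded_train = transform(train_docs, vocabulary)
encoded_new = transform(["great new book"], vocabulary)

print(vocabulary)
print(encoded_train)
print(encoded_new)  # 'new' is not in the vocabulary, so it is ignored
```

This also shows why test data must only be transformed, never fitted: the columns have to keep the meaning they were given by the training vocabulary.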

In [None]:
%%time
review_features = vectorizer.fit_transform(train_data.review)

In [None]:
review_features.shape

As we can see, the resulting matrix is huge! And this is only a small version of the data... Even this "small" version would take

In [None]:
np.prod(review_features.shape)/2**30

gigabytes of memory, assuming only one byte per entry. However, each document contains only a very small subset of the total vocabulary, so each row of this matrix consists mostly of zeros. E.g. the number of non-zero entries in the first five rows is:

In [None]:
np.count_nonzero(review_features[:5,:].toarray(), axis = 1)

compared to the row length

In [None]:
review_features.shape[1]

That is why the matrix is stored in a compressed sparse format

In [None]:
type(review_features)

and we had to use the `toarray` method above to convert it to an ordinary numpy array. Be careful not to call this method on the whole matrix!!!

The `nnz` attribute stores the number of entries actually stored in the matrix

In [None]:
review_features.nnz

which is only a small fraction of all the entries

In [None]:
review_features.nnz/np.prod(review_features.shape)
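A back-of-the-envelope sketch of the savings. A compressed sparse row (CSR) matrix keeps one value and one column index per stored entry, plus one pointer per row. The sizes below are made-up numbers, not the ones from this data set, and the byte widths (8-byte values, 4-byte indices) are assumptions.

```python
def dense_bytes(n_rows, n_cols, item_bytes=8):
    """Memory needed to store every entry of the matrix."""
    return n_rows * n_cols * item_bytes

def csr_bytes(n_rows, nnz, item_bytes=8, index_bytes=4):
    """Memory needed by the three CSR arrays: data, indices and indptr."""
    return nnz * (item_bytes + index_bytes) + (n_rows + 1) * index_bytes

# Hypothetical sizes used only for illustration
n_rows, n_cols, nnz = 1000, 50000, 30000
dense = dense_bytes(n_rows, n_cols)
sparse = csr_bytes(n_rows, nnz)

print(dense / 2**20, "MiB dense")
print(sparse / 2**20, "MiB sparse")
```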

The vectorizer object stores the `vocabulary_` dictionary that maps words to   vector indices. 

In [None]:
price_i = vectorizer.vocabulary_['price']
print(price_i)

and a word list that does the opposite:

In [None]:
vectorizer.get_feature_names_out()[price_i]

## Classification

We will again use the invaluable scikit-learn library, which implements the multinomial naive Bayes classifier.

In [None]:
from sklearn.naive_bayes import MultinomialNB

In [None]:
base_rating_classifier = MultinomialNB()

In [None]:
base_rating_classifier.fit(review_features, train_data.rating)

In [None]:
%%time
test_features = vectorizer.transform(test_data.review)

In [None]:
base_rating_classifier.score(test_features, test_data.rating)

The `score` method implements the accuracy metric, which gives the fraction of correct classifications. The same number can be obtained with the `accuracy_score` function.

In [None]:
from sklearn.metrics import accuracy_score

In [None]:
base_predicted =  base_rating_classifier.predict(test_features)

In [None]:
accuracy_score(test_data.rating, base_predicted)

The accuracy does not look very good, but keep in mind that this is not a binary classification. We have five ratings, so random guessing would give us an accuracy of around 20%. All the scores we have introduced in the 'categorical' notebook give similar (or identical) results.

In [None]:
from sklearn.metrics import f1_score, precision_score, recall_score

In [None]:
recall_score(test_data.rating, base_predicted, average='macro')

In [None]:
precision_score(test_data.rating, base_predicted, average='macro')

In [None]:
f1_score(test_data.rating, base_predicted, average='macro')

Accuracy counts only the ratings we got exactly right. However, we are not dealing with truly categorical (nominal) data. Our labels are _ordinal_, meaning that there is an order to the ratings, and clearly predicting one instead of five is a bigger error than predicting four. In practice we may tolerate being _e.g._ one rating off.
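A small sketch of scores that take the ordering into account. The rating arrays are made up for illustration, and the names `within_one` and `mean_abs_error` are just labels chosen here.

```python
import numpy as np

# Hypothetical true and predicted ratings
true = np.array([1, 2, 3, 4, 5, 5, 1, 3])
pred = np.array([1, 3, 3, 2, 5, 4, 5, 3])

diffs = np.abs(pred - true)
accuracy = np.mean(diffs == 0)     # exactly right
within_one = np.mean(diffs <= 1)   # at most one rating off
mean_abs_error = diffs.mean()      # average distance in ratings

print(accuracy, within_one, mean_abs_error)
```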

Let's look more closely at how close our classification is.
The function below calculates the histograms of differences between predictions and true ratings and plots them.

In [None]:
def plot_results(true_ratings, predicted_ratings):
    fig, ax = plt.subplots(1,3, figsize=(18,5))
    diffs = predicted_ratings - true_ratings
    # distribution of signed differences
    ax[0].hist(diffs, bins=9, range=(-4.5,4.5), density=True);
    ax[0].set_xlabel(r"$\delta$")
    ax[0].set_ylabel(r"$P(\Delta r = \delta)$")
    # distribution of absolute differences
    ax[1].hist(np.abs(diffs), bins=5, range=(-0.5,4.5), density=True);
    ax[1].set_xlabel(r"$\delta$")
    ax[1].set_ylabel(r"$P(|\Delta r| = \delta)$")

    # cumulative distribution of absolute differences
    ax[2].hist(np.abs(diffs), bins=5, range=(-0.5,4.5), density=True, cumulative=True, histtype='step');
    ax[2].axhline(0.80, linewidth=0.75, color='grey');
    ax[2].set_xlabel(r"$\delta$")
    ax[2].set_ylabel(r"$P(|\Delta r| \leq \delta)$", fontsize=12)

In [None]:
plot_results(test_data.rating, base_predicted)

We see that in about 80% of cases we are at most one rating off.

We can see that the most extreme ratings are most often predicted correctly, which is perhaps not that surprising.

We can get similar results with the `confusion_matrix` function from scikit-learn

In [None]:
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

Please pay attention to the `normalize` parameter. With `normalize='all'` the function returns the relative frequencies of all possible outcome combinations: rows correspond to true labels, columns to predicted ones.

In [None]:
outcome_freq = confusion_matrix(test_data.rating, base_predicted, normalize='all')
outcome_freq

It adds up to one, as expected

In [None]:
outcome_freq.sum()

We can get the counts of each outcome by not requesting normalization

In [None]:
confusion_matrix(test_data.rating, base_predicted)

scikit-learn also provides a way to plot the confusion matrix directly from an estimator.

In [None]:
fig, ax = plt.subplots(figsize=(12,8))
disp = ConfusionMatrixDisplay.from_estimator(base_rating_classifier, test_features, test_data.rating, normalize='true', ax=ax);

Setting `normalize='true'` normalizes each row separately. We can check this by accessing the computed confusion matrix.

In [None]:
disp.confusion_matrix.sum(axis=1)
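The two normalizations can be reproduced by hand with numpy. This is only a sketch on made-up labels; `freq_all` and `freq_true` are names chosen here to mirror `normalize='all'` and `normalize='true'`.

```python
import numpy as np

# Hypothetical true and predicted labels
true = np.array([1, 1, 2, 2, 2, 3])
pred = np.array([1, 2, 2, 2, 3, 3])
labels = [1, 2, 3]

# counts[i, j]: number of samples with true label i and predicted label j
counts = np.zeros((3, 3), dtype=int)
for t, p in zip(true, pred):
    counts[labels.index(t), labels.index(p)] += 1

freq_all = counts / counts.sum()                        # whole matrix sums to one
freq_true = counts / counts.sum(axis=1, keepdims=True)  # each row sums to one

print(counts)
print(freq_true)
```

With row normalization the diagonal contains the recall of each class.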

If we want to display an already computed confusion matrix we can use the `ConfusionMatrixDisplay` class

In [None]:
from sklearn.metrics import ConfusionMatrixDisplay
cm_display = ConfusionMatrixDisplay(outcome_freq, display_labels=range(1,6))
fig, ax = plt.subplots(figsize=(12,8))
cm_display.plot(ax=ax);
ax.set_xlabel('predicted ratings')
ax.set_ylabel('true ratings')

#### Most probable words

Let's look at the most probable words in each category. `MultinomialNB` makes the logs of probabilities

$$\log P(X=i|R=r)=\log p^{(r)}_{i}$$

available as the `feature_log_prob_` attribute. We can extract the index of the highest number in an array using the numpy function `argmax`

In [None]:
most_probable = np.argmax(base_rating_classifier.feature_log_prob_,axis=1)
most_probable

To convert the indices to words in the vocabulary we will use an auxiliary function

In [None]:
# look the names up once instead of rebuilding the list on every call
feature_names = vectorizer.get_feature_names_out()

def word(i):
    return feature_names[i]

word = np.vectorize(word)

The numpy `vectorize` function takes a Python function and applies it elementwise over numpy arrays. Please note that this is essentially a Python loop, so it may be quite inefficient.

In [None]:
word(most_probable)

We can look at more words using the `argsort` function, which returns the indices that would sort the array in ascending order

In [None]:
a = np.random.uniform(0,1,6)
print(a)
ia = np.argsort(a)
print(ia)
print(a[ia])

Below we take the $n$ most probable words for each rating

In [None]:
n = 10 
most_probable = np.argsort(base_rating_classifier.feature_log_prob_,axis=1)[:,-1:-n-1:-1]

`argsort` sorts in ascending order. To list the last $n$ elements of an array in reverse order we use a slice with a negative stride.

`[-1:-n-1:-1]` means: start from the last element (first from the end) and go down to, but not including, the $(n+1)$-th element from the end.
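A small check of this slice on a made-up array, where the result is easy to verify by eye:

```python
import numpy as np

# Hypothetical scores: 2 rows (ratings), 4 columns (words)
log_prob = np.array([[0.1, 0.7, 0.3, 0.5],
                     [0.9, 0.2, 0.4, 0.6]])
n = 2
order = np.argsort(log_prob, axis=1)   # ascending order within each row
top_n = order[:, -1:-n-1:-1]           # last n columns, in reverse order

print(top_n)
```

In the first row the two largest values are 0.7 and 0.5, at columns 1 and 3; in the second row they are 0.9 and 0.6, at columns 0 and 3. Reversing all columns first and then taking the first $n$, `order[:, ::-1][:, :n]`, gives the same result.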

In [None]:
words = word(most_probable)

We can use a pandas dataframe to "pretty print" the resulting array

In [None]:
pd.DataFrame(words, index =[1,2,3,4,5], columns=range(1,11))

## Hyperparameter tuning

While developing the text classifier above we had to set a number of parameters. `MultinomialNB` has essentially only one: `alpha`, but the vectorizer has several. Those parameters are called _hyperparameters_ to distinguish them from parameters like $p^{(r)}_i$ that are learned during training. It is not clear a priori what values of those hyperparameters will give us the best results. Unfortunately, here we are moving away from the nice science of probability calculus and entering the realm of the black magic of hyperparameter tuning.

The most obvious and conceptually easiest strategy is an exhaustive search of the space of possible parameter values. In practice this is of course not possible, but we can at least scan a part of the parameter space and get some feeling for the performance.

Before we begin we must, however, think a little bit about how we are going to measure the performance of the classifier. I have already stated that we must not use the training set for evaluation. What about the testing set then? While not so obvious, when we are tuning the hyperparameters we are also in some sense training our model. So if we use the test set to choose the best hyperparameters, we cannot use this set for the overall final evaluation. The solution is to set aside yet another set called the _validation_ dataset and use this set for tuning hyperparameters. For more explanation see [this article](https://machinelearningmastery.com/difference-test-validation-datasets/).

### Cross validation

Unfortunately, we usually have a finite amount of data, and setting aside test and validation sets leaves less data for training. One of the strategies used to mitigate that is _k-fold cross validation_. It works as follows: we divide our data into $k$ equal parts called folds. Then we train our model $k$ times, each time using a different fold for testing and the $k-1$ remaining folds for training. That way we get $k$ evaluation results that we can e.g. average.
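The splitting itself can be sketched with numpy as below. The function `k_fold_indices` is a hypothetical helper written for illustration; the splitters in scikit-learn differ in details, e.g. for classifiers they keep the class proportions in each fold.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    rng = np.random.default_rng(seed)
    # shuffle the sample indices and cut them into k nearly equal folds
    folds = np.array_split(rng.permutation(n_samples), k)
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate(folds[:i] + folds[i + 1:])
        yield train_idx, test_idx

splits = list(k_fold_indices(10, 5))
for train_idx, test_idx in splits:
    print(len(train_idx), len(test_idx), sorted(test_idx))
```

Every sample ends up in the test fold exactly once, so all the data is used for both training and evaluation, just never at the same time.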

Fortunately, scikit-learn provides several functions to help us with cross validation.

In [None]:
from sklearn.model_selection import cross_val_score, cross_validate

In [None]:
for a in [0,1,2,3,4,5,10,20,30,50]:
    scores = cross_val_score(MultinomialNB(alpha=a), review_features, train_data.rating,cv = 5, n_jobs=-1 )
    print("{:4.1f} {:6.4f}".format(a, scores.mean()), scores)

The `cross_val_score` function can run its training and scoring tasks concurrently. This is enabled by the `n_jobs` parameter, which indicates the number of concurrent processes we want to use. By default this is `None`, which is equivalent to one (no concurrency). Setting it to -1 uses all available processors; on computers with hyperthreading this will typically
be twice the number of physical cores. For such small jobs as above this does not matter, but for larger jobs you should try different values of this parameter. More processes are not always better!

By default `cross_val_score` uses the `score` method of the estimator, which in this case is the accuracy score. We may change this by providing the `scoring` parameter

In [None]:
for a in [0,1,2,3,4,5,10,20,30,50]:
    scores = cross_val_score(MultinomialNB(alpha=a), review_features, train_data.rating,cv = 5, n_jobs=-1, scoring='f1_macro' )
    print("alpha = {:4.1f} {:6.4f}".format(a, scores.mean()), scores)

The function `cross_validate` does the same but allows for multiple metrics and reports more information, _e.g._ the time used for fitting and scoring.

In [None]:
cross_val_results = cross_validate(MultinomialNB(alpha=20), review_features, train_data.rating,cv = 5, n_jobs=-1, scoring=['accuracy', 'f1_macro', 'precision_macro'] )

`cross_validate` returns its results in a format that can be passed directly to a pandas dataframe, which results in nicer output

In [None]:
pd.DataFrame(cross_val_results)

### Grid search

With a bigger number of parameters, writing explicit search loops like the ones above can be tedious. We can use the `GridSearchCV` class to combine the search with cross validation.
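Conceptually, a grid search tries every combination of the listed values, and each combination is evaluated once per fold. A minimal sketch of that expansion, with a made-up grid whose parameter names are only for illustration:

```python
from itertools import product

# Hypothetical grid; the names and values are assumptions
param_grid = {'alpha': [1, 10, 20], 'max_features': [5000, None]}

names = list(param_grid)
# one dictionary per point of the grid
combinations = [dict(zip(names, values))
                for values in product(*param_grid.values())]

cv = 5
n_fits = len(combinations) * cv
print(len(combinations), "combinations,", n_fits, "fits")
for params in combinations:
    print(params)
```

The number of fits is the product of the lengths of all the lists times the number of folds, which is why adding parameters to the grid gets expensive quickly.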

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
param_grid = {'alpha': [0,1,2,3,4,5,10,20,30,40,50,60]}

In [None]:
grid_search = GridSearchCV(MultinomialNB(), param_grid=param_grid, n_jobs=6, cv=5, scoring='f1_macro')

In [None]:
%%time
grid_search.fit(review_features, train_data.rating)

The search results are reported in the `cv_results_` attribute, which can be used directly to construct a dataframe

In [None]:
pd.DataFrame(grid_search.cv_results_).sort_values(by='rank_test_score')

The `fit` method by default selects the best set of parameters according to the scoring metric and refits the estimator to all the data using those parameters. The resulting object is stored in the `best_estimator_` attribute

In [None]:
best_clf = grid_search.best_estimator_
best_predicted = best_clf.predict(test_features)

In [None]:
f1_score(test_data.rating, best_predicted, average='macro' )

Comparing to the "base" accuracy

In [None]:
acc_best = accuracy_score(test_data.rating, best_predicted)
acc_base = accuracy_score(test_data.rating, base_predicted)
print("base {:.3f} best {:.3f} diff {:.3f} rel diff {:3.0f}%".format(acc_base, acc_best, acc_best-acc_base,100*(acc_best-acc_base)/acc_base))

We see a very slight increase in accuracy. This does not seem like a lot, and for this particular example it isn't. But for some applications, e.g. if accuracy translates directly into profit, this gives us a 1% increase "for free".

In [None]:
def evaluate_metrics(data, label, estimator, metrics):
    """Evaluate one estimator on every metric in the `metrics` dictionary."""
    result = {}
    predicted = estimator.predict(data)
    for name, func in metrics.items():
        result[name] = func(label, predicted)
    return result

def evaluate_estimators_on_metrics(data, labels, estimators, metrics):
    """Evaluate every estimator in the `estimators` dictionary on every metric."""
    results = {}
    for name, est in estimators.items():
        results[name] = evaluate_metrics(data, labels, est, metrics)
    return results

def compare_to(evaluation, base):
    """Build table columns comparing every estimator to the `base` one."""
    result = {}
    base_res = evaluation[base]
    for m_name, m in base_res.items():
        result.setdefault('metric', []).append(m_name)
        result.setdefault(base, []).append(m)
        for e_name, res in evaluation.items():
            if e_name != base:
                m_value = res[m_name]
                result.setdefault(e_name, []).append(m_value)
                # absolute and relative difference to the base estimator
                result.setdefault(e_name+'_diff', []).append(m_value-m)
                result.setdefault(e_name+'_rel', []).append((m_value-m)/m)
    return result

In [None]:
f1_macro = lambda x,y: f1_score(x,y, average = 'macro')

In [None]:
pd.DataFrame(
    compare_to(
        evaluate_estimators_on_metrics(test_features, test_data.rating, {'base': base_rating_classifier, 'best' :best_clf}, {'accuracy': accuracy_score, 'f1_macro': f1_macro}),
        'base')
)

### Pipelines

This section is based on the scikit-learn [Working with text data](https://scikit-learn.org/stable/tutorial/text_analytics/working_with_text_data.html) example.

Now we would also like to optimize the parameters of the `CountVectorizer`, but `GridSearchCV` accepts a single estimator object as input. We could write a class that combines `CountVectorizer` and `MultinomialNB`, but we don't have to :) Scikit-learn provides an easy way to do it using its `Pipeline` class.

In [None]:
from sklearn.pipeline import Pipeline

In [None]:
rating_clf_pipe =  Pipeline([('vect', CountVectorizer(stop_words='english', max_df=0.25, min_df=10, max_features=None)),
                       ('clf', MultinomialNB())])

This creates a `Pipeline` object that has all the methods of an estimator object. The pipeline can consist of several transformer objects and an estimator object at the end (see the [documentation](https://scikit-learn.org/stable/developers/develop.html) for a description of estimators and other interfaces). The resulting object accepts data in a format suitable for the first transformer, passes it through all the transformers and then to the estimator at the end.
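The chaining can be illustrated with a toy pure-Python sketch. All three classes below (`Doubler`, `ThresholdClassifier`, `TinyPipeline`) are hypothetical and written only to show the flow of data; this is not how scikit-learn implements its `Pipeline`.

```python
class Doubler:
    """Hypothetical transformer: multiplies every value by two."""
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return [2 * v for v in X]

class ThresholdClassifier:
    """Hypothetical estimator: predicts 1 above the mean of the training data."""
    def fit(self, X, y=None):
        self.threshold_ = sum(X) / len(X)
        return self
    def predict(self, X):
        return [int(v > self.threshold_) for v in X]

class TinyPipeline:
    """Passes data through all transformers and then to the final estimator."""
    def __init__(self, transformers, estimator):
        self.transformers = transformers
        self.estimator = estimator
    def fit(self, X, y=None):
        for t in self.transformers:
            X = t.fit(X, y).transform(X)   # fit each stage on the output of the previous one
        self.estimator.fit(X, y)
        return self
    def predict(self, X):
        for t in self.transformers:
            X = t.transform(X)             # only transform, never refit, at prediction time
        return self.estimator.predict(X)

pipe = TinyPipeline([Doubler()], ThresholdClassifier()).fit([1, 2, 3, 6])
predictions = pipe.predict([2, 4])
print(predictions)
```

The point to notice is that `predict` accepts raw data: the transformations fitted during training are replayed automatically.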

So we can fit  the review data with one method call

In [None]:
rating_clf_pipe.fit(train_data.review, train_data.rating)

In [None]:
f1_score(test_data.rating,rating_clf_pipe.predict(test_data.review), average='macro')

Please note that these calls now take more time, as they include costly tokenizing and word counting, so our grid search will take much more time. The parameters of the grid search are passed as a dictionary or a list of dictionaries. The keys of the dictionaries denote the names of the parameters, prefixed with the name of the pipeline stage they refer to and two underscores:

In [None]:
ratings_smaller_grid  = {'clf__alpha' : [1,5,20,30,40,50], 'vect__max_features':[5000, 10000, None]}

In [None]:
pipeline_search = GridSearchCV(rating_clf_pipe, ratings_smaller_grid, n_jobs=6, cv=5, verbose=10)

Please be warned that the command below takes around six minutes on a 6-core i9 laptop :(

In [None]:
%%time
pipeline_search.fit(train_data.review, train_data.rating)
pd.DataFrame(pipeline_search.cv_results_)

We should of course also try to vary other parameters, at the expense of more CPU time ...

In [None]:
ratings_bigger_grid  = {'clf__alpha' : [25,30,35,40], 'vect__max_features':[5000, 10000, 15000, None], 'vect__max_df': [0.25, 0.5], 'vect__min_df':[1,10,25]}

In [None]:
pipeline_bigger_search = GridSearchCV(rating_clf_pipe, ratings_bigger_grid, n_jobs=6, cv=5, verbose=10)

Please be warned that if `RUN_GRID_SEARCH` is true, the command below takes over 30 minutes (!) on a 6-core i9 laptop :( I have included the results of this run in the repository so you do not have to wait :)

In [None]:
%%time
import ast
from sklearn.base import clone

if RUN_GRID_SEARCH:
    pipeline_bigger_search.fit(train_data.review, train_data.rating)
    bigger_search_results = pd.DataFrame(pipeline_bigger_search.cv_results_)  
    best_estimator = pipeline_bigger_search.best_estimator_
else:
    bigger_search_results = pd.read_csv("grid_search.csv")
    # rank 1 marks the best parameter set, so we need the row with the lowest rank
    best_row = bigger_search_results.rank_test_score.argmin()
    # the params column is read back as a string; parse it without using eval
    params = ast.literal_eval(bigger_search_results.params.iloc[best_row])
    best_estimator = clone(rating_clf_pipe)
    best_estimator.set_params(**params)
    best_estimator.fit(train_data.review, train_data.rating)
    

predicted = best_estimator.predict(test_data.review)
bigger_search_results_sorted = bigger_search_results.sort_values(by='rank_test_score')

print("f1 score = {:.4f} accuracy = {:.4f}".format(f1_score(test_data.rating,predicted , average='macro'), accuracy_score(test_data.rating, predicted) ))
bigger_search_results_sorted.head(4)    

Please note that now we get a much bigger optimal value of the parameter $\alpha$.