# Comparing text classifiers under the naive Bayes assumption

In this laboratory session we compare two probabilistic models for categorical data on a dataset:
1. multivariate Bernoulli
2. multinomial

We use a dataset of Twitter messages labelled with emotions (Joy vs Sadness).

The following program shows the loading of the data from a file.

Data are loaded into a matrix X using a sparse matrix representation, in order to save space and time.
The program describes the matrix with three "parallel" arrays that hold the values of the non-zero matrix cells and the indices of those cells. Strictly speaking, this triplet layout is the coordinate (COO) form; `csr_matrix` accepts it as constructor input and converts it to the CSR format internally.
The arrays are called `data`, `row` and `col`:

- data[i] stores the value of the matrix cell #i, whose indices are contained in row[i] and col[i];
- row[i] stores the row index of cell #i;
- col[i] stores the column index of cell #i.

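In effect, this is equivalent to an associative array of the form [(i, j) ~> value]. A matrix is called sparse when most of its elements are zero; there is no strict threshold, but the representation pays off only when the non-zero cells are a small fraction of the total. In this dataset only 71474 of the 11981 × 11288 cells (about 0.05%) are non-zero.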
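
As a minimal sketch of the triplet representation described above, the following builds a small 3 × 4 matrix from three parallel arrays. The values are made up for illustration and are not taken from the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical toy values: three non-zero cells of a 3x4 matrix
data = np.array([3, 1, 2])   # value of each non-zero cell
row = np.array([0, 0, 2])    # row index of each cell
col = np.array([1, 3, 0])    # column index of each cell

# Build the sparse matrix from the (data, (row, col)) triplets
toy = csr_matrix((data, (row, col)), shape=(3, 4))
dense = toy.toarray()        # dense copy, only for inspection

print(dense)
print(toy.nnz)               # number of stored cells
```

Only the three listed cells are stored; every other cell is implicitly zero.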

The data file is in csv format.
Each Twitter message has been preprocessed by a natural language processing pipeline that removed stop words and replaced each interesting document element with an integer identifier.
The interesting document elements may be words, emoji or emoticons. An element can be repeated within a document and is identified across all documents by the same integer number (named "element_id" in the program). This "element_id" is used as the column index of the data matrix when the data are stored.

Each row of the CSV file reports the content of one document (a Twitter message). It consists of a list of integer pairs, followed by a string which is the label of the document ("Joy" or "Sadness").
The first number of each pair is the identifier of a document element (the "element_id");
the second number of the pair is the count (frequency) of that element in that document.

An example of a line is:
38,3,264,1,635,1,2780,1,Joy

where 38 is the identifier of the first element occurring in that message, and 3 is the number of times (frequency count) that element appears in the message.
Overall, 38, 264, 635 and 2780 are the identifiers of the elements, and 3, 1, 1, 1 are their respective frequencies in that message.
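
The line format described above can be sketched as follows. This is only an illustration on the example line, not the loader used later in this notebook; the names `fields`, `numbers` and `pairs` are introduced here for the example.

```python
line = "38,3,264,1,635,1,2780,1,Joy"

fields = line.strip().split(",")
label = fields[-1]                         # the last field is the class label
numbers = [int(x) for x in fields[:-1]]    # the remaining fields are integers

# Pair each element_id with the count that follows it
pairs = list(zip(numbers[0::2], numbers[1::2]))

print(label)
print(pairs)
```

Each tuple in `pairs` holds an element_id and its count in the message.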

The dataset has:
- tot_n_docs (rows in the file) = n_rows = 11981
- n_features (total number of distinct words in the corpus, i.e. the vocabulary size) = 11288

The following program reads the data file and loads the matrix in sparse form using the `scipy.sparse` library.

In [16]:

import numpy as np
from scipy.sparse import csr_matrix

class_labels = ["Joy","Sadness"]
n_features = 11288                                    # number of columns in the matrix = number of features (distinct elements in the documents)
n_rows = 11981                                        # number rows of the matrix
n_elements = 71474                                    # number of the existing values in the matrix (not empty, to be loaded in the matrix in a sparse way)

path_training = "../datasets/"
file_name = "joy_sadness6000.txt"

# declare the row and col arrays with the indexes of the matrix cells (non empty) to be loaded from file
# they are needed because the matrix is sparse and we load in the matrix only the elements which are present
row = np.empty(n_elements, dtype=int)
col = np.empty(n_elements, dtype=int)
data = np.empty(n_elements, dtype=int)

row_n = 0                                             # Number of current row to be read and managed
cur_el = 0                                            # Position in the three arrays: row, col and data
twitter_labels = []                                   # List of class labels (target array) of the documents (twitter) that will be read from the input file
twitter_target = []                                   # List of 0/1 for class labels
with open(path_training + file_name, "r") as fi:
    for line in fi:
        el_list = line.split(',')                     # List of integers read from a row of the file
        l = len(el_list)
        last_el = el_list[-1]                         # Grab the last element in the list which is the class label
        class_name = last_el.strip()                  # Eliminate the '\n'
        twitter_labels.append(class_name)

        # twitter_labels contains the labels (Joy/Sadness);
        # twitter_target contains 0/1 for the respective labels
        if class_name == class_labels[0]:
            twitter_target.append(0)
        else:
            twitter_target.append(1)

        # Start by reading all the doc elements from the beginning of the list
        i = 0
        while (i<(l-1)):
            element_id = int(el_list[i])             # Identifier of the element in the document equivalent to the column index
            element_id = element_id-1                # The index starts from 0 (the read id starts from 1)
            i = i+1 
            value_cell = int(el_list[i])             # Make access to the following value in the file which is the count of the element in the document 
            i = i+1
            row[cur_el] = row_n                      # Load the data in the three arrays: the first two are the row and col indexes; the last one is the matrix cell value
            col[cur_el] = element_id
            data[cur_el] = value_cell
            cur_el = cur_el+1
            # Row i of the matrix represents the tweet; column j represents the element
        row_n = row_n+1
# The file is closed automatically when the with block exits

# Convert the target into a numpy array for later convenient use
twitter_target = np.array(twitter_target)

# Build the matrix from the indices and values in the three arrays just filled;
# note that .toarray() then converts the sparse matrix to a dense array
twitter_data = csr_matrix((data, (row, col)), shape = (n_rows, n_features)).toarray()

Write a program in the following cell that splits the data matrix into training and test sets (by random selection) and predicts the class (Joy/Sadness) of the messages on the basis of their words.
Consider the two possible models:
multivariate Bernoulli and multinomial.
Find the accuracy of the models and test whether the observed differences are significant.

In [17]:
from sklearn.model_selection import train_test_split

# Split the dataset between training/test set
X_train, X_test, y_train, y_test = train_test_split(twitter_data, twitter_target, test_size=0.3, random_state=42)

We now build the multivariate Bernoulli model, which is called `BernoulliNB` in the scikit-learn library.

In [18]:
from sklearn.naive_bayes import BernoulliNB

multivariate = BernoulliNB()  
multivariate.fit(X_train, y_train)

# Compute the accuracy of the classifier
multivariate.score(X_test, y_test)

0.9504867872044507

Now we do the same thing with the multinomial model, which is called `MultinomialNB`.

In [19]:
from sklearn.naive_bayes import MultinomialNB

multinomial = MultinomialNB()  
multinomial.fit(X_train, y_train)

# Compute the accuracy of the classifier
multinomial.score(X_test, y_test)

0.9471488178025035

As we can see, the multivariate Bernoulli model seems to perform slightly better on the test set (a gap of about 0.3 percentage points in accuracy).

## Evaluating differences using *paired t-test*
The observed difference suggests that the multivariate Bernoulli model performs better than the multinomial one. We now check this impression with a statistical test called the *paired t-test*.
The main idea is to run a k-fold cross-validation and record, for each fold, the difference in performance between the two models. We then assume (*null hypothesis*) that these differences come from a distribution with mean $\mu = 0$ and unknown variance. Because the variance must be estimated from the sample, the test statistic follows a Student's t distribution with *degrees of freedom* equal to the number of folds minus one ($k-1$).
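
The test statistic can be sketched by hand as follows. The fold scores below are invented for illustration and are not results from this dataset; the computation follows the standard paired t-test formula $t = \bar{d} / (s_d / \sqrt{k})$, where $\bar{d}$ and $s_d$ are the sample mean and standard deviation of the per-fold differences, and it is checked against `scipy.stats.ttest_rel`.

```python
import numpy as np
from scipy.stats import t as student_t, ttest_rel

# Hypothetical per-fold accuracies (made up, not from the dataset)
scores_a = np.array([0.95, 0.94, 0.96, 0.95, 0.93])
scores_b = np.array([0.94, 0.94, 0.95, 0.93, 0.93])

d = scores_a - scores_b                  # per-fold differences
n_folds = len(d)
dof = n_folds - 1                        # degrees of freedom

# t statistic: mean difference divided by its standard error
t_stat = d.mean() / (d.std(ddof=1) / np.sqrt(n_folds))
# Two-sided p-value from the Student's t distribution
p_manual = 2 * student_t.sf(abs(t_stat), df=dof)

# Same test computed by SciPy, for comparison
t_scipy, p_scipy = ttest_rel(scores_a, scores_b)
print(t_stat, p_manual)
print(t_scipy, p_scipy)
```

The two printed pairs should agree up to floating-point error, which shows that `ttest_rel` is a one-sample t-test on the per-fold differences.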

In [38]:
from sklearn.model_selection import KFold, cross_val_score
from scipy.stats import ttest_rel
import time

k = 10

# Declare the folds
kfold = KFold(n_splits=k)
print('Running cross validation on the Multivariate model')
scores_mv = cross_val_score(BernoulliNB(), twitter_data, twitter_target, cv = kfold)
print('Running cross validation on the Multinomial model')
scores_mn = cross_val_score(MultinomialNB(), twitter_data, twitter_target, cv = kfold)
    
# Now we can perform the paired t-test using SciPy
_, p_value = ttest_rel(scores_mv, scores_mn)
# Print the p_value
print(f'The p_value is: {p_value}')

Running cross validation on the Multivariate model
Running cross validation on the Multinomial model
The p_value is: 0.0050500833362442845


Since $p < 0.05$, we reject the null hypothesis and conclude that the difference between the multivariate Bernoulli and the multinomial models is statistically significant at the level $\alpha = 0.05$.