# Classifiers comparison on texts with naive Bayes assumption

In this lab session we compare two probabilistic models for categorical data: 
1. multivariate Bernoulli 
2. multinomial 

We use a dataset of Twitter messages, each labelled with an emotion (Joy vs Sadness).

The following program shows the loading of the data from a file.

***The stop words have already been removed from the dataset (by the instructor) and, to save space, each lemma has been replaced by a progressive numeric identifier (it identifies the feature in the data matrix, where each row corresponds to a document).***

***Since the number of words that occur in each document is much smaller than the total number of words in the vocabulary, we use a sparse matrix.***

Data are loaded into a matrix using a sparse representation, in order to save space and time.
The program stores the non-zero cells of the matrix in three "parallel" arrays (the cell values, their row indexes and their column indexes) and passes them to the csr_matrix constructor, which builds a matrix in CSR format from these coordinate triplets.
The arrays are called: 
- data *holds the values of the cells*
- row *holds the row index of each cell*
- col *holds the column index of each cell*

***The first element of the data array holds the value of the first cell, the first element of the row array holds the row index of the first cell, and the first element of col holds the column index of the first cell.***

***In the file each row is a document and its elements are separated by commas.
The numbers are read in pairs: the first element of each pair is the identifier of the feature (i.e. of the word) and the second is the frequency with which that word occurs in the document. The last element is the class label of the document.*** 

- data[i] stores the value of the matrix cell #i whose indexes are contained in row[i] and col[i] 
- row[i] stores the index of the row in the matrix of the cell #i, 
- col[i] stores the index of the column of the cell #i.
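
As a minimal sketch of the three parallel arrays described above, the following builds a tiny matrix from hand-written triplets. The values and the 2x4 shape are toy assumptions chosen for illustration; they are not taken from the dataset.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy values (hypothetical): three non-zero cells of a 2x4 matrix.
data = np.array([3, 1, 2])   # value of each non-zero cell
row = np.array([0, 0, 1])    # row index of each cell
col = np.array([0, 2, 3])    # column index of each cell

# csr_matrix accepts the (data, (row, col)) triplets and builds a CSR matrix from them.
toy = csr_matrix((data, (row, col)), shape=(2, 4))
dense = toy.toarray()        # dense copy, only sensible for small matrices
print(dense)
```

Cell #1, for instance, has value data[1]=1 and sits at row[1]=0, col[1]=2; every cell not listed in the arrays is zero.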


The data file is in csv format.
Each Twitter message has been preprocessed by a natural language processing pipeline that removed stop words and replaced every relevant document element with an integer identifier.  
The relevant document elements may be words, emoji or emoticons. An element can occur several times in the same document and is identified across all documents by the same integer (named "element_id" in the program). This "element_id" is used as the column index in the data matrix.

Each row of the CSV file reports the content of a document (a Twitter message). It is formed as a list of integer pairs, followed by a string which is the label of the document ("Joy" or "Sadness").
The first number of the pair is the identifier of a document element (the "element_id"); 
the second number of the pair is the count (frequency) of that element in that document.
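
The row layout described above can be sketched as a small parsing function. The name `parse_line` and the sample row are hypothetical, introduced only for illustration; the shift by one reflects the fact that identifiers in the file start from 1 while column indexes start from 0.

```python
def parse_line(line):
    """Split one CSV row into ({column_index: count}, label)."""
    fields = line.strip().split(",")
    label = fields[-1].strip()                 # last field is the class label
    numbers = [int(tok) for tok in fields[:-1]]
    # Numbers come in (element_id, count) pairs; ids in the file start from 1.
    counts = {numbers[i] - 1: numbers[i + 1] for i in range(0, len(numbers), 2)}
    return counts, label

# Hypothetical row: element 5 once, element 12 twice, element 40 once, label Joy.
counts, label = parse_line("5,1,12,2,40,1,Joy\n")
print(counts, label)
```

This sketch assumes well-formed rows with an even number of integers before the label; it does no error handling.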

The dataset has:

- tot_n_docs (rows in the file) = n_rows = 11981
  *The document count is not an even number because some messages contained only stop words.*
- n_features (total number of distinct words in the corpus) = 11288
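
To give a rough idea of why a sparse representation pays off here, the sketch below computes the fraction of non-zero cells. It assumes that the value declared as n_elements in the loading program (71474) is the number of non-zero cells.

```python
n_rows = 11981
n_features = 11288
n_elements = 71474  # assumed count of non-zero cells, as declared in the loading program

total_cells = n_rows * n_features   # cells a dense matrix would have to store
density = n_elements / total_cells  # fraction of cells that are non-zero
print(f"total cells: {total_cells}")
print(f"non-zero fraction: {density:.6f}")
```

Under that assumption, well under 0.1% of the cells hold a value, so storing only those cells saves most of the memory.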



The following program reads the data file and loads the matrix in sparse form using the scipy.sparse library.

In [None]:

from numpy import ndarray, zeros
import numpy as np
import scipy
from scipy.sparse import csr_matrix

class_labels = ["Joy","Sadness"]
n_features=11288 # number of columns in the matrix = number of features (distinct elements in the documents, i.e. words)
n_rows=11981 # number of rows in the matrix, i.e. number of documents
n_elements=71474 # number of the existing values in the matrix (not empty, to be loaded in the matrix in a sparse way)

#path_training="/Users/meo/Documents/Didattica/Laboratorio-15-16-Jupyter/"
path_training="/"
file_name="/content/joy_sadness6000.txt"

# declare the row and col arrays with the indexes of the matrix cells (non empty) to be loaded from file
# they are needed because the matrix is sparse and we load in the matrix only the elements which are present
row=np.empty(n_elements, dtype=int)
col=np.empty(n_elements, dtype=int)
data=np.empty(n_elements, dtype=int)

row_n=0 # number of current row to be read and managed
cur_el=0 # position in the three arrays: row, col and data
twitter_labels=[] # list of class labels (target array) of the documents (twitter) that will be read from the input file
twitter_target=[] # list of 0/1 for class labels
with open(path_training + file_name, "r") as fi:
    for line in fi: # for each line of the input file
        el_list=line.split(',')  # list of integers read from a row of the file
        l=len(el_list)
        last_el=el_list[l-1] # grab the last element of the list, which is the class label (e.g. Sadness or Joy)
        class_name=last_el.strip() # remove the trailing '\n'
        twitter_labels.append(class_name) # append the new label to the list of labels
        # twitter_labels contains the labels (Joy/Sadness); twitter_target contains 0/1 for the respective labels
        if (class_name==class_labels[0]): # class_name is Joy
            twitter_target.append(0)
        else:
            twitter_target.append(1) # class_name is Sadness
        i=0 # I start reading all the doc elements from the beginning of the list

        # This while loop scans a single document of length l and loads its values into the three arrays data, row and col
        while(i<(l-1)): # until the end of the document is reached
            element_id=int(el_list[i]) # identifier of the element in the document, equivalent to the column index
            element_id=element_id-1 # the ids in the file start from 1; subtract 1 so that the column indexes start from 0
            i=i+1 # move one position forward in the document to read the number of occurrences of element_id
            value_cell=int(el_list[i]) # the next value in the file is the count of the element in the document
            i=i+1
            row[cur_el]=row_n # load the data in the three arrays: the first two are the row and col indexes; the last one is the matrix cell value
            col[cur_el]=element_id
            data[cur_el]=value_cell
            cur_el=cur_el+1
        row_n=row_n+1
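# the file is closed automatically when the with block ends
print("final n_row="+str(row)) # note: this prints the array of row indexes, not the counter row_n
# build the matrix from the indexes and the values in the three arrays just filled;
# only the first cur_el positions were written, so the unused tail of the arrays is discarded
twitter_data=csr_matrix((data[:cur_el], (row[:cur_el], col[:cur_el])), shape=(n_rows, n_features)).toarray()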
print("resulting matrix:")
print(twitter_data)
print(twitter_labels)
print(twitter_target)


final n_row=[0 0 0 ... 0 0 0]
resulting matrix:
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]
['Joy', 'Joy', 'Joy', 'Joy', 'Joy', ... (output truncated)

# Exercise 1
Write a program in the following cell that splits the data matrix into training and test sets (by random selection) and predicts the class (Joy/Sadness) of the messages on the basis of the words. 
Consider the two possible models: multivariate Bernoulli and multinomial.
Find the accuracy of the models and test whether the observed differences are significant.
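
The difference between the two models can be illustrated on a toy count matrix: the multivariate Bernoulli model only looks at whether an element is present, while the multinomial model uses the counts. The sketch below relies on scikit-learn's BernoulliNB, whose default `binarize=0.0` turns every count greater than zero into 1; the toy matrix and labels are hypothetical.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB, MultinomialNB

# Hypothetical toy counts: 4 documents, 3 vocabulary elements.
X_toy = np.array([[2, 0, 1],
                  [3, 1, 0],
                  [0, 2, 0],
                  [0, 1, 4]])
y_toy = np.array([0, 0, 1, 1])

bern = BernoulliNB().fit(X_toy, y_toy)                         # counts are binarized internally
bern_bin = BernoulliNB().fit((X_toy > 0).astype(int), y_toy)   # explicit presence/absence
multi = MultinomialNB().fit(X_toy, y_toy)                      # uses the counts themselves

# The two Bernoulli fits learn the same parameters: only presence matters.
same = np.allclose(bern.feature_log_prob_, bern_bin.feature_log_prob_)
print(same)
# Per-class totals of each element, which the multinomial model is estimated from.
print(multi.feature_count_)
```

On this dataset most counts are likely to be 0 or 1, since tweets are short, so the two models may end up seeing very similar inputs.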

In [None]:
print(twitter_target) # 0 is Joy, 1 is Sadness
#np.set_printoptions(threshold=np.inf)
np.set_printoptions(threshold=5)
print(twitter_data[10])
#print(twitter_labels)

[0, 0, 0, 0, 0, ... (output truncated)

In [None]:
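# Alternative and much better way of splitting the data into training and test sets
from sklearn.model_selection import train_test_split
# train_test_split splits arrays or matrices into random train and test subsets.
# X and y are passed in a single call so that each row stays paired with its label:
# two separate calls shuffle them independently, which pairs messages with random labels
# and gives chance-level accuracy (about 0.49, as in the outputs recorded below).
X_train, X_test, y_train, y_test = train_test_split(twitter_data, twitter_target, test_size=0.2)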
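
Passing the features and the labels to a single train_test_split call applies the same random permutation to both, so each row keeps its own label. The following is a minimal check on toy data; the arrays and the `random_state` value are assumptions made for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: the label of each row equals the row's first value,
# so any misalignment between rows and labels is easy to detect.
X_toy = np.arange(20).reshape(10, 2)
y_toy = X_toy[:, 0].copy()

Xa, Xb, ya, yb = train_test_split(X_toy, y_toy, test_size=0.2, random_state=0)

# Rows and labels are still paired in both subsets.
aligned = np.array_equal(Xa[:, 0], ya) and np.array_equal(Xb[:, 0], yb)
print(aligned)
```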



In [None]:
# MULTIVARIATE (Bernoulli) case
from sklearn.naive_bayes import BernoulliNB
from sklearn.metrics import accuracy_score

bernulli_clf = BernoulliNB().fit(X_train, y_train) # fit the classifier on the training data

y_predicted = bernulli_clf.predict(X_test)

acc_score = accuracy_score(y_test, y_predicted)
print("Accuracy score: "+ str(acc_score))



Accuracy score: 0.4918648310387985


In [None]:
# MULTINOMIAL case
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score

multinomial_clf = MultinomialNB().fit(X_train, y_train)

y_predicted = multinomial_clf.predict(X_test)

acc_score = accuracy_score(y_test, y_predicted)
print("Accuracy score: "+ str(acc_score))

Accuracy score: 0.49144764288694204


In [None]:
from sklearn.model_selection import cross_val_score
# WARNING: running both takes about 2 minutes, but it works!
bernulli_score = cross_val_score(bernulli_clf,twitter_data,twitter_target,cv=10) # cv=10 splits the data into 10 folds
print("cross_validation on bernulli model: ")
print(bernulli_score)
print(np.mean(bernulli_score))
multinomial_score = cross_val_score(multinomial_clf,twitter_data,twitter_target,cv=10)
print("cross_validation on multinomial model: ")
print(multinomial_score)

cross_validation on bernulli model: 
[0.95663053 0.96577629 0.94991653 ... 0.95492487 0.9624374  0.93656093]
0.9527582111414491
cross_validation on multinomial model: 
[0.95412844 0.96494157 0.94991653 ... 0.94991653 0.96160267 0.93739566]


In [None]:
# Based on the notions from Machine Learning Experiments, we decide whether or not to reject the null hypothesis (i.e. that the two algorithms perform equally)

from scipy import stats
# Run a t-test to compare the performance of the two algorithms on the same dataset
t2, p2 = stats.ttest_ind(bernulli_score,multinomial_score)
# False -> fail to reject the null hypothesis: no significant difference in performance
# True -> reject the null hypothesis: the algorithms perform differently
print("Reject the null hypothesis? ",p2<0.05)


Reject the null hypothesis?  False
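
The test above treats the two sets of fold scores as independent samples. When both classifiers are scored on the same folds, a paired t-test (scipy's `stats.ttest_rel`) is a common alternative, because it compares the fold-by-fold differences. The sketch below only illustrates the two calls side by side: the score arrays are hypothetical values, not results from this dataset, and this is not the test used in the cell above.

```python
import numpy as np
from scipy import stats

# Hypothetical per-fold accuracies of two models evaluated on the same folds.
scores_a = np.array([0.95, 0.96, 0.94, 0.95, 0.97])
scores_b = np.array([0.94, 0.96, 0.93, 0.95, 0.95])

t_ind, p_ind = stats.ttest_ind(scores_a, scores_b)  # samples treated as independent
t_rel, p_rel = stats.ttest_rel(scores_a, scores_b)  # paired: works on fold-wise differences

print("independent p-value:", p_ind)
print("paired p-value:", p_rel)
```

The paired test removes the variation that both models share across folds, so the two tests can give different p-values on the same scores.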
