# Comparing classifiers on text under the naive Bayes assumption

In this laboratory session we compare two probabilistic models for categorical data on the same dataset:
1. multivariate Bernoulli
2. multinomial
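As a minimal sketch of how the two models look at the same document, the toy example below contrasts the count features a multinomial model works with and the presence/absence features a multivariate Bernoulli model works with. The count matrix is made up for illustration and is not taken from the dataset.

```python
import numpy as np

# Hypothetical toy data: 2 documents, 4 vocabulary elements (count of each element per document).
counts = np.array([[2, 0, 1, 0],
                   [0, 3, 0, 0]])

# Multinomial view: the counts themselves are the features.
multinomial_features = counts

# Multivariate Bernoulli view: only the presence (1) or absence (0) of each element matters.
bernoulli_features = (counts > 0).astype(int)

print(multinomial_features)
print(bernoulli_features)
```

scikit-learn's `BernoulliNB` applies a similar thresholding itself through its `binarize` parameter (0.0 by default), which is presumably why the same count matrix can be passed to both classifiers later in this notebook.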

We use a dataset of Twitter messages labelled with emotions (Joy vs Sadness).

The following cell loads the data from a file into a matrix X in a sparse representation.
The dataset is in CSV format.
Each Twitter message has been preprocessed by a natural language pipeline that removed stop words and replaced each element of interest in the document with an integer identifier.
The elements of interest may be words, emoji or emoticons. An element can occur several times in the same document, and it is identified by the same integer in every document.

Each row of the dataset is a list of integer pairs, followed by a string that is the label of the document (Joy or Sadness).
The first number of each pair is the identifier of an element (word, emoji or emoticon) and the second number is the count (frequency) of that element in the document.
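The row format described above can be sketched with a small parser. The helper name `parse_row` and the sample line are hypothetical, chosen only to illustrate the layout; the shift by one assumes, as the loading code in this notebook does, that identifiers in the file start from 1.

```python
def parse_row(line):
    """Split one CSV row into (element_index, count) pairs and the class label."""
    fields = line.strip().split(",")
    label = fields[-1]                       # the last field is the class label
    numbers = [int(v) for v in fields[:-1]]  # the rest are id,count,id,count,...
    # Identifiers are assumed to start from 1 in the file, so shift them to 0-based indexes.
    pairs = [(numbers[i] - 1, numbers[i + 1]) for i in range(0, len(numbers), 2)]
    return pairs, label

# Hypothetical row: element 5 once, element 12 twice, element 40 once, labelled Joy.
sample_line = "5,1,12,2,40,1,Joy\n"
pairs, label = parse_row(sample_line)
print(pairs, label)
```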

The dataset has 11981 documents (tot_n_docs = n_rows = 11981) and 11288 features (n_features, the distinct document elements).


The following program reads the data file and loads the matrix in sparse form using the scipy.sparse library.

In [3]:
import numpy as np
from scipy.sparse import csr_matrix

class_labels = ["Joy","Sadness"]
n_features=11288 # number of columns in the matrix = number of features (distinct elements in the documents)
n_rows=11981 # number rows of the matrix
n_elements=71474 # number of the existing values in the matrix (not empty, to be loaded in the matrix in a sparse way)

path_training="./"
file_name="joy_sadness6000.txt"

# declare the row and col arrays with the indexes of the matrix cells (non empty) to be loaded from file
# they are needed because the matrix is sparse and we load in the matrix only the elements which are present
row=np.empty(n_elements, dtype=int)
col=np.empty(n_elements, dtype=int)
data=np.empty(n_elements, dtype=int)

row_n=0 # number of current row to be read and managed
cur_el=0 # position in the arrays row, col and data
twitter_labels=[] # list of class labels (target array) of the documents (twitter) that will be read from the input file
twitter_target=[] # list of 0/1 for class labels
with open(path_training + file_name, "r") as fi:
    for line in fi:
        el_list=line.split(',')
        l=len(el_list)
        last_el=el_list[l-1] # I grab the last element in the list which is the class label
        class_name=last_el.strip() # eliminate the '\n'
        twitter_labels.append(class_name)
        # twitter_labels contains the labels (Joy/Sadness); twitter_target contains 0/1 for the respective labels
        if (class_name==class_labels[0]):
            twitter_target.append(0)
        else:
            twitter_target.append(1)
        i=0 # I start reading all the doc elements from the beginning of the list
        while(i<(l-1)):
            element_id=int(el_list[i])-1 # identifier of the element in the document; the index starts from 0 (the read id starts from 1)
            i=i+1
            value_cell=int(el_list[i]) # read the following value in the file, which is the count of the element in the document
            i=i+1
            row[cur_el]=row_n # load the data in the three arrays: the first two are the row and col indexes; the last one is the matrix cell value
            col[cur_el]=element_id
            data[cur_el]=value_cell
            cur_el=cur_el+1
        row_n=row_n+1
print("final n_row="+str(row_n))
# loads the matrix by means of the indexes and the values in the three arrays just filled
# (only the first cur_el positions were filled, so the arrays are sliced to that length)
twitter_data=csr_matrix((data[:cur_el], (row[:cur_el], col[:cur_el])), shape=(n_rows, n_features)).toarray()
print("resulting matrix:")
print(twitter_data)
print(twitter_labels)
print(twitter_target)


final n_row=11981
resulting matrix:
[[1 1 1 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 [0 0 0 ... 0 0 0]
 ...
 [0 0 0 ... 0 0 1]
 [0 0 0 ... 0 0 0]
 [0 1 0 ... 0 0 0]]
['Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', 'Joy', ...
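To make the loading step easier to follow, here is a minimal sketch of how `csr_matrix` builds a matrix from three parallel arrays of row indexes, column indexes and values. The toy triplets below are invented for illustration; only the constructor call mirrors the one used in the cell above.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Hypothetical triplets: cell (row[k], col[k]) receives the value data[k].
row = np.array([0, 0, 1, 2])
col = np.array([0, 2, 1, 3])
data = np.array([1, 2, 3, 1])

sparse_matrix = csr_matrix((data, (row, col)), shape=(3, 4))
dense_matrix = sparse_matrix.toarray()  # expand to a regular (dense) NumPy array

print(sparse_matrix.nnz)  # number of stored values
print(dense_matrix)
```

Calling `.toarray()` gives up the memory savings of the sparse format. `BernoulliNB` and `MultinomialNB` also accept sparse input, whereas `GaussianNB` needs a dense array, which may be one reason for the conversion here.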

# Solution

In [4]:
from sklearn.model_selection import train_test_split

X = twitter_data
y = twitter_target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)
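As a quick sanity check on the split, the sizes of the two parts can be worked out by hand. The sketch below assumes that `train_test_split` rounds the test size up to the next integer, which is consistent with the confusion-matrix totals reported later in this notebook (each sums to 3954).

```python
import math

n_docs = 11981    # number of documents in the dataset
test_size = 0.33  # fraction passed to train_test_split

# Assumption: the test part is rounded up, and the training part takes the remainder.
n_test = math.ceil(test_size * n_docs)
n_train = n_docs - n_test

print(n_train, n_test)
```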

In [5]:
def compute_number_error(true, predicted):
    error = 0
    for i in range(len(true)):
        if true[i] != predicted[i]:
            error = error + 1
    return error
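The same count can be obtained without an explicit loop. The following is a minimal sketch of a vectorized alternative; the name `count_errors` and the toy label lists are hypothetical and are not part of the original notebook.

```python
import numpy as np

def count_errors(true, predicted):
    """Return how many positions differ between the two label sequences."""
    true = np.asarray(true)
    predicted = np.asarray(predicted)
    return int(np.sum(true != predicted))

# Toy labels: the last two predictions are wrong.
toy_errors = count_errors([0, 1, 1, 0], [0, 1, 0, 1])
print(toy_errors)
```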

# Multivariate Bernoulli classifier

In [6]:
# http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html#sklearn.naive_bayes.BernoulliNB

from sklearn.naive_bayes import BernoulliNB
   
clf_bernoulliNB = BernoulliNB(alpha=1.0)
clf_bernoulliNB.fit(X_train, y_train)
proba_clf_bernoulliNB = clf_bernoulliNB.predict_proba(X_test)
predicted_clf_bernoulliNB = clf_bernoulliNB.predict(X_test)

In [7]:
error_bernoulliNB = compute_number_error(y_test, predicted_clf_bernoulliNB)
print(error_bernoulliNB)

202


In [8]:
from sklearn.metrics import accuracy_score

accuracy_score_bernoulliNB = accuracy_score(y_test, predicted_clf_bernoulliNB)
print(accuracy_score_bernoulliNB)

0.9489124936772888


In [9]:
from sklearn.metrics import confusion_matrix

tn_bernoulliNB, fp_bernoulliNB, fn_bernoulliNB, tp_bernoulliNB = confusion_matrix(y_test, predicted_clf_bernoulliNB).ravel()
print(tn_bernoulliNB, fp_bernoulliNB, fn_bernoulliNB, tp_bernoulliNB)

(1862, 160, 42, 1890)
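The four counts above tie the previous cells together: the number of errors is the sum of false positives and false negatives, and the accuracy is the share of correct predictions. The sketch below recomputes both from the printed values and also derives precision and recall, two measures the notebook does not report. It assumes Sadness is the positive class, since it is encoded as 1 when the data is loaded.

```python
# Values printed by the confusion-matrix cell for the Bernoulli classifier.
tn, fp, fn, tp = 1862, 160, 42, 1890

total = tn + fp + fn + tp
errors = fp + fn               # misclassified test documents
accuracy = (tn + tp) / total   # share of correct predictions
precision = tp / (tp + fp)     # of the documents predicted as Sadness, how many are Sadness
recall = tp / (tp + fn)        # of the Sadness documents, how many were found

print(errors, accuracy, precision, recall)
```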


# Multinomial classifier

In [10]:
# http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html#sklearn.naive_bayes.MultinomialNB

from sklearn.naive_bayes import MultinomialNB

clf_multinomialNB = MultinomialNB(alpha=1.0)
clf_multinomialNB.fit(X_train, y_train)
proba_clf_multinomialNB = clf_multinomialNB.predict_proba(X_test)
predicted_clf_multinomialNB = clf_multinomialNB.predict(X_test)

In [11]:
error_multinomialNB = compute_number_error(y_test, predicted_clf_multinomialNB)
print(error_multinomialNB)

209


In [12]:
accuracy_score_multinomialNB = accuracy_score(y_test, predicted_clf_multinomialNB)
print(accuracy_score_multinomialNB)

0.9471421345472939


In [13]:
tn_multinomialNB, fp_multinomialNB, fn_multinomialNB, tp_multinomialNB = confusion_matrix(y_test, predicted_clf_multinomialNB).ravel()
print(tn_multinomialNB, fp_multinomialNB, fn_multinomialNB, tp_multinomialNB)

(1851, 171, 38, 1894)


# Gaussian Naive Bayes classifier

In [16]:
# http://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html

from sklearn.naive_bayes import GaussianNB

clf_gaussianNB = GaussianNB()
clf_gaussianNB.fit(X_train, y_train)
proba_clf_gaussianNB = clf_gaussianNB.predict_proba(X_test)
predicted_clf_gaussianNB = clf_gaussianNB.predict(X_test)

In [17]:
error_gaussianNB = compute_number_error(y_test, predicted_clf_gaussianNB)
print(error_gaussianNB)

556


In [18]:
accuracy_score_gaussianNB = accuracy_score(y_test, predicted_clf_gaussianNB)
print(accuracy_score_gaussianNB)

0.8593829033889732


In [19]:
tn_gaussianNB, fp_gaussianNB, fn_gaussianNB, tp_gaussianNB = confusion_matrix(y_test, predicted_clf_gaussianNB).ravel()
print(tn_gaussianNB, fp_gaussianNB, fn_gaussianNB, tp_gaussianNB)

(1523, 499, 57, 1875)


In [20]:
mean_feature_each_class = clf_gaussianNB.theta_ # mean of each feature per class
variance_feature_each_class = clf_gaussianNB.sigma_ # variance of each feature per class (named var_ in recent scikit-learn releases, where sigma_ is no longer available)

# Measures

In [21]:
alg = ["Bernoulli", "Multinomial", "Gaussian"]
error_matrix = [error_bernoulliNB, error_multinomialNB, error_gaussianNB]
accuracy_matrix = [accuracy_score_bernoulliNB, accuracy_score_multinomialNB, accuracy_score_gaussianNB]
print("The lowest total number of errors is obtained by " + str(alg[np.argmin(error_matrix)]) + " with " + str(error_matrix[np.argmin(error_matrix)]) + " errors.")
print("The best accuracy is obtained by " + str(alg[np.argmax(accuracy_matrix)]) + " with " + str(accuracy_matrix[np.argmax(accuracy_matrix)]) + ".")

The lowest total number of errors is obtained by Bernoulli with 202 errors.
The best accuracy is obtained by Bernoulli with 0.9489124936772888.
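The comparison can also be reproduced from the confusion matrices alone, which additionally shows where each classifier goes wrong. The sketch below is illustrative: the dictionary layout and variable names are hypothetical, while the counts are the ones printed by the confusion-matrix cells of this notebook.

```python
# (tn, fp, fn, tp) as printed by the confusion-matrix cells of this notebook.
confusions = {
    "Bernoulli": (1862, 160, 42, 1890),
    "Multinomial": (1851, 171, 38, 1894),
    "Gaussian": (1523, 499, 57, 1875),
}

summary = {}
for name, (tn, fp, fn, tp) in confusions.items():
    total = tn + fp + fn + tp
    summary[name] = {
        "errors": fp + fn,
        "accuracy": (tn + tp) / total,
        "false_positives": fp,
        "false_negatives": fn,
    }

# The classifier with the fewest errors also has the best accuracy,
# because all three are evaluated on the same test set.
best = min(summary, key=lambda name: summary[name]["errors"])

for name, s in summary.items():
    print(name, s["errors"], round(s["accuracy"], 4), s["false_positives"], s["false_negatives"])
print("fewest errors:", best)
```

Read this way, most of the Gaussian classifier's extra errors are false positives (499 against 160 and 171), that is, Joy messages predicted as Sadness under the 0/1 encoding used when loading the data.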
