The probability that the output is class $c$ given that the input is the vector $\textbf{x}\in\mathbb{R}^d$ is $\mathbb{P}(c|\textbf{x})$. If this quantity can be computed, the class of each data point can be determined by selecting the class with the highest probability:

$$c = {\underset{c\in\{1, \ldots, C\}}{\operatorname{arg\,max}}}\mathbb{P}(c|\textbf{x}) = {\underset{c\in\{1, \ldots, C\}}{\operatorname{arg\,max}}}\frac{\mathbb{P}(\textbf{x}|c)\mathbb{P}(c)}{\mathbb{P}(\textbf{x})} = {\underset{c\in\{1, \ldots, C\}}{\operatorname{arg\,max}}}\mathbb{P}(\textbf{x}|c)\mathbb{P}(c)$$

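The third "=" holds because $\mathbb{P}(\textbf{x})$ does not depend on $c$. The prior $\mathbb{P}(c)$ is estimated from the training set. If the training set is large, it can be determined by maximum likelihood estimation (MLE), which gives the fraction of training points that belong to class $c$. If the training set is small, maximum a posteriori (MAP) estimation can be used instead.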

The name Naive Bayes Classifier (NBC) comes from the "naive" assumption that the components of the high-dimensional random variable $\textbf{x}$ are independent of each other given the class $c$. Although this assumption is very strict and real data rarely satisfy it, it often gives unexpectedly good results. Given $c$:

$$\mathbb{P}(\textbf{x}|c) = \mathbb{P}(x_1, x_2, \ldots, x_d|c) = \prod_{i=1}^{d}\mathbb{P}(x_i|c)$$

Then we get:

$$c = {\underset{c\in\{1, \ldots, C\}}{\operatorname{arg\,max}}}\mathbb{P}(c)\prod_{i=1}^{d}\mathbb{P}(x_i|c)$$

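When $d$ is large and the probabilities are small, the product on the right-hand side becomes an extremely small number, which can underflow to zero in floating-point arithmetic and produce numerical errors. To handle this, the rule is usually rewritten in an equivalent form by taking the $\log$ of the right-hand side:

$$c = {\underset{c\in\{1, \ldots, C\}}{\operatorname{arg\,max}}}\left(\log\mathbb{P}(c) + \sum_{i=1}^{d}\log\mathbb{P}(x_i|c)\right)$$

This does not change the result because $\log$ is a monotonically increasing function on the set of positive numbers.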
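To make the underflow issue concrete, here is a minimal sketch with made-up numbers: the likelihood values `0.010` and `0.011`, the dimension `d = 2000`, and the equal priors are assumptions chosen only for illustration. It compares the direct product with the sum of logarithms.

```python
import numpy as np

d = 2000  # hypothetical number of features

# Hypothetical per-feature likelihoods P(x_i | c) for two classes.
# Class 1 is slightly more likely than class 0 on every feature.
lik = np.vstack([np.full(d, 0.010), np.full(d, 0.011)])
prior = np.array([0.5, 0.5])

# Direct product: both scores underflow to exactly 0.0, so the classes tie.
naive_scores = prior * np.prod(lik, axis=1)

# Log-space scores: finite, and the ordering of the classes is preserved.
log_scores = np.log(prior) + np.sum(np.log(lik), axis=1)
predicted = int(np.argmax(log_scores))

print(naive_scores)
print(log_scores)
print(predicted)
```

In the direct product the two classes cannot be told apart, while the log-space scores still rank class 1 above class 0.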

Commonly used distributions in NBC:

+ Gaussian naive Bayes: For each feature $i$ and each class $c$, assume that $x_i \sim \mathcal{N}(\mu_{c_i}, \sigma_{c_i}^2)$. The parameter set $\theta = \{\mu_{c_i}, \sigma_{c_i}^2\}$ is estimated by MLE from the training points that belong to class $c$:

$$\mathbb{P}(x_i|c) = \mathbb{P}(x_i|\mu_{c_i}, \sigma_{c_i}^2) = \frac{1}{\sqrt{2\pi\sigma_{c_i}^2}}\exp{\left(-\frac{(x_i - \mu_{c_i})^2}{2\sigma_{c_i}^2}\right)}$$

+ Multinomial naive Bayes: This model is mainly used in text classification, where each feature $x_i$ counts how many times word $i$ appears in a document. The probability of word $i$ in class $c$ is estimated as:

$$\lambda_{c_i} = \mathbb{P}(x_i|c) = \frac{N_{c_i}}{N_c}$$

where:

\+ $N_{c_i}$ is the total number of times word $i$ appears in documents of class $c$, i.e. the sum of the $i$-th components of all feature vectors belonging to class $c$.

\+ $N_c$ is the total number of words (counting repetitions) appearing in class $c$, i.e. the total length of all documents in class $c$. It follows that $N_c = \sum_{i=1}^{d}N_{c_i}$ and $\sum_{i=1}^d\lambda_{c_i} = 1$, where $d$ is the number of words in the dictionary.

This estimate has a limitation: if a word never appears in class $c$ in the training set, $\lambda_{c_i}$ is zero, which makes the whole right-hand side of the expression for $c$ zero regardless of the other words. To solve this problem, a technique called *Laplace smoothing* is used:

$$\hat{\lambda_{c_i}} = \frac{N_{c_i} + \alpha}{N_c + d\alpha}$$

where $\alpha$ is a positive number (usually equal to 1). The term $d\alpha$ is added to the denominator to ensure that the probabilities sum to one, $\sum_{i=1}^d\hat{\lambda_{c_i}} = 1$. Each class $c$ is therefore described by a set of positive numbers whose sum is 1: $\hat{\lambda_c} = \{\hat{\lambda_{c_1}}, \hat{\lambda_{c_2}}, \ldots, \hat{\lambda_{c_d}}\}$.

+ Bernoulli naive Bayes: This model applies to data in which each feature is a binary value, 0 or 1. For example, with text data, instead of counting the total number of occurrences of a word in the text, we only care whether that word appears or not:

$$\mathbb{P}(x_i|c) = \mathbb{P}(i|c)x_i + [1 - \mathbb{P}(i|c)](1 - x_i)$$

Here $\mathbb{P}(i|c)$ can be understood as the probability that word $i$ appears in a text of class $c$.
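As a small illustration of the Laplace smoothing step in the multinomial model above, the sketch below uses made-up word counts. The array `counts` and the 4-word dictionary are assumptions for illustration only, not data from the example that follows.

```python
import numpy as np

# Hypothetical counts N_{c_i} for one class over a 4-word dictionary.
# The last word never appears in this class.
counts = np.array([3, 1, 2, 0])
alpha = 1.0
d = counts.size

# Unsmoothed estimate: the unseen word gets probability exactly 0.
lam_mle = counts / counts.sum()

# Laplace-smoothed estimate: every word gets a positive probability,
# and the probabilities still sum to 1.
lam_smooth = (counts + alpha) / (counts.sum() + d * alpha)

print(lam_mle)
print(lam_smooth)
print(lam_smooth.sum())
```

With $\alpha = 1$ each word is treated as if it had been seen once more than it actually was, so no single unseen word can force the class score to zero.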

Based on the example in document $[1]$, *pages 130-131*, we recompute the results with the `scikit-learn` library. Note that $\mathbb{P}(B|d5)$ and $\mathbb{P}(N|d5)$ are only proportional to $\mathbb{P}(B)\prod_{i=1}^d\mathbb{P}(x_i|B)$ and $\mathbb{P}(N)\prod_{i=1}^d\mathbb{P}(x_i|N)$, because $\mathbb{P}(\textbf{x})$ was ignored as mentioned before.

In [2]:
import numpy as np
import pandas as pd
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB

+ Multinomial naive Bayes:

In [3]:
# train data
d1 = [2, 1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 1, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 1, 1, 0, 0, 0]
d4 = [0, 1, 0, 0, 0, 0, 1, 1, 1]

train_data = np.array([d1, d2, d3, d4])
label = np.array(['B', 'B', 'B', 'N']) 

# test data
d5 = np.array([[2, 0, 0, 1, 0, 0, 0, 1, 0]])
d6 = np.array([[0, 1, 0, 0, 0, 0, 0, 1, 1]])

# call MultinomialNB
mnb = MultinomialNB()
# training 
mnb.fit(train_data, label)

# test
print('Predicting class of d5:', str(mnb.predict(d5)[0]))
print('Probability of d6 in each class:', mnb.predict_proba(d6)[0])

Predicting class of d5: B
Probability of d6 in each class: [0.29175335 0.70824665]
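The posterior reported for `d6` can be cross-checked by hand. The following is a sketch of the multinomial formulas with Laplace smoothing ($\alpha = 1$, the default of `MultinomialNB`) written directly in NumPy. It is not how `scikit-learn` is implemented internally, and the variable names are illustrative.

```python
import numpy as np

# Same toy training set as in the cell above
X = np.array([[2, 1, 1, 0, 0, 0, 0, 0, 0],
              [1, 1, 0, 1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 1, 1, 1]])
y = np.array(['B', 'B', 'B', 'N'])
d6 = np.array([0, 1, 0, 0, 0, 0, 0, 1, 1])

alpha = 1.0
classes = np.unique(y)  # sorted: ['B', 'N']

log_post = []
for c in classes:
    Xc = X[y == c]
    prior = Xc.shape[0] / X.shape[0]        # MLE of P(c)
    counts = Xc.sum(axis=0)                 # N_{c_i}
    lam = (counts + alpha) / (counts.sum() + alpha * X.shape[1])
    # log P(c) + sum_i x_i * log(lambda_{c_i})
    log_post.append(np.log(prior) + d6 @ np.log(lam))
log_post = np.array(log_post)

# Normalize in a numerically stable way to get P(c | d6)
posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
print(dict(zip(classes, posterior)))
```

The normalized values agree with the `predict_proba` output above, roughly 0.2918 for class `B` and 0.7082 for class `N`.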


+ Bernoulli naive Bayes:

In [4]:
# train data
d1 = [1, 1, 1, 0, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 1, 1, 0, 0, 0, 0]
d3 = [0, 1, 0, 0, 1, 1, 0, 0, 0]
d4 = [0, 1, 0, 0, 0, 0, 1, 1, 1]

train_data = np.array([d1, d2, d3, d4])
label = np.array(['B', 'B', 'B', 'N'])

# test data
d5 = np.array([[1, 0, 0, 1, 0, 0, 0, 1, 0]])
d6 = np.array([[0, 1, 0, 0, 0, 0, 0, 1, 1]])

# call BernoulliNB
clf = BernoulliNB()
# training 
clf.fit(train_data, label)

# test
print('Predicting class of d5:', str(clf.predict(d5)[0]))
print('Probability of d6 in each class:', clf.predict_proba(d6)[0])

Predicting class of d5: B
Probability of d6 in each class: [0.16948581 0.83051419]
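The Bernoulli result can be cross-checked in the same way. This sketch assumes the smoothed estimate $\mathbb{P}(i|c) = (n_{c_i} + \alpha)/(n_c + 2\alpha)$, where $n_{c_i}$ is the number of class-$c$ documents containing word $i$ and $n_c$ is the number of class-$c$ documents, with $\alpha = 1$ as in the default `BernoulliNB`. The variable names are illustrative.

```python
import numpy as np

# Same binary toy training set as in the cell above
X = np.array([[1, 1, 1, 0, 0, 0, 0, 0, 0],
              [1, 1, 0, 1, 1, 0, 0, 0, 0],
              [0, 1, 0, 0, 1, 1, 0, 0, 0],
              [0, 1, 0, 0, 0, 0, 1, 1, 1]])
y = np.array(['B', 'B', 'B', 'N'])
d6 = np.array([0, 1, 0, 0, 0, 0, 0, 1, 1])

alpha = 1.0
classes = np.unique(y)  # sorted: ['B', 'N']

log_post = []
for c in classes:
    Xc = X[y == c]
    prior = Xc.shape[0] / X.shape[0]
    # Smoothed probability that word i appears in a document of class c
    p = (Xc.sum(axis=0) + alpha) / (Xc.shape[0] + 2 * alpha)
    # Present words contribute log p, absent words contribute log(1 - p)
    log_lik = d6 @ np.log(p) + (1 - d6) @ np.log(1 - p)
    log_post.append(np.log(prior) + log_lik)
log_post = np.array(log_post)

posterior = np.exp(log_post - log_post.max())
posterior /= posterior.sum()
print(dict(zip(classes, posterior)))
```

Unlike the multinomial model, words that are absent from `d6` also contribute to the score through the factor $1 - \mathbb{P}(i|c)$.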


### **Naive Bayes Classifier to classify spam emails.**

In [16]:
import numpy as np
import pandas as pd
from scipy.sparse import coo_matrix  # see https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.coo_matrix.html
from sklearn.naive_bayes import MultinomialNB, BernoulliNB
from sklearn.metrics import accuracy_score # for evaluating results

# data path and file name 
path = 'ex6DataPrepared/'
train_data_fn = 'train-features.txt'
test_data_fn = 'test-features.txt'
train_label_fn = 'train-labels.txt'
test_label_fn = 'test-labels.txt'

In [27]:
nwords = 2500 

def read_data(data_fn, label_fn):
    ## read data_fn
    with open(path + data_fn) as f:
        content = f.readlines() # read all lines in the file
    content = [x.strip() for x in content] # remove '\n' at the end of each line

    ## read label_fn
    with open(path + label_fn) as f:
        label = f.readlines()
    label = [int(x.strip()) for x in label]

    dat = np.zeros((len(content), 3), dtype=int)
    for i, line in enumerate(content): 
        a = line.split(' ')
        dat[i, :] = np.array([int(a[0]), int(a[1]), int(a[2])])
    
    # subtract 1 from the row and column indices: the file is 1-based, Python is 0-based
    data = coo_matrix((dat[:, 2], (dat[:, 0] - 1, dat[:, 1] - 1)), shape=(len(label), nwords))
    
    return (data, label)
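To see what the `coo_matrix` call in `read_data` does, here is a minimal sketch on a tiny made-up input. The three `(document, word, count)` triplets and the `2 x 3` shape are assumptions for illustration and are not taken from the data files.

```python
import numpy as np
from scipy.sparse import coo_matrix

# Hypothetical triplets: (document index, word index, count), all 1-based
triplets = np.array([[1, 1, 2],
                     [1, 3, 1],
                     [2, 2, 5]])

rows = triplets[:, 0] - 1   # convert to 0-based row indices
cols = triplets[:, 1] - 1   # convert to 0-based column indices
vals = triplets[:, 2]

# 2 documents, 3 dictionary words; unspecified entries are zero
m = coo_matrix((vals, (rows, cols)), shape=(2, 3))
dense = m.toarray()
print(dense)
```

Only the non-zero counts are stored, which matters when each document uses a small fraction of the dictionary.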

In [38]:
(train_data, train_label) = read_data(train_data_fn, train_label_fn)
(test_data, test_label)  = read_data(test_data_fn, test_label_fn)

print(train_data, train_label, sep='\n')

clf = MultinomialNB()
clf.fit(train_data, train_label)

y_pred = clf.predict(test_data)
print('Training size = %d, accuracy = %.2f%%' % (train_data.shape[0], accuracy_score(test_label, y_pred) * 100))

  (0, 18)	2
  (0, 44)	1
  (0, 49)	1
  (0, 74)	1
  (0, 84)	1
  (0, 138)	1
  (0, 199)	1
  (0, 350)	1
  (0, 351)	1
  (0, 512)	1
  (0, 563)	1
  (0, 742)	1
  (0, 776)	1
  (0, 1130)	1
  (0, 1276)	1
  (0, 1638)	1
  (0, 1763)	1
  (0, 1815)	1
  (0, 1867)	1
  (1, 34)	5
  (1, 96)	1
  (1, 102)	2
  (1, 158)	1
  (1, 293)	2
  (1, 726)	1
  :	:
  (699, 2048)	4
  (699, 2054)	4
  (699, 2057)	2
  (699, 2071)	4
  (699, 2084)	2
  (699, 2108)	3
  (699, 2126)	1
  (699, 2172)	1
  (699, 2198)	3
  (699, 2226)	1
  (699, 2231)	1
  (699, 2236)	1
  (699, 2244)	1
  (699, 2325)	1
  (699, 2356)	3
  (699, 2377)	2
  (699, 2397)	4
  (699, 2401)	1
  (699, 2418)	1
  (699, 2432)	1
  (699, 2433)	1
  (699, 2471)	1
  (699, 2478)	2
  (699, 2480)	2
  (699, 2499)	3
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,

Next, we repeat the experiment with smaller training sets:

In [39]:
train_data_fn = 'train-features-100.txt'
train_label_fn = 'train-labels-100.txt'
test_data_fn = 'test-features.txt'
test_label_fn = 'test-labels.txt'

(train_data, train_label)  = read_data(train_data_fn, train_label_fn)
(test_data, test_label)  = read_data(test_data_fn, test_label_fn)
clf = MultinomialNB()
clf.fit(train_data, train_label)
y_pred = clf.predict(test_data)
print('Training size = %d, accuracy = %.2f%%' % (train_data.shape[0],accuracy_score(test_label, y_pred) * 100))

Training size = 100, accuracy = 97.69%


In [40]:
train_data_fn = 'train-features-50.txt'
train_label_fn = 'train-labels-50.txt'
test_data_fn = 'test-features.txt'
test_label_fn = 'test-labels.txt'

(train_data, train_label)  = read_data(train_data_fn, train_label_fn)
(test_data, test_label)  = read_data(test_data_fn, test_label_fn)
clf = MultinomialNB()
clf.fit(train_data, train_label)
y_pred = clf.predict(test_data)
print('Training size = %d, accuracy = %.2f%%' % (train_data.shape[0],accuracy_score(test_label, y_pred) * 100))

Training size = 50, accuracy = 97.31%


In [43]:
# trying with `BernoulliNB` model
clf = BernoulliNB()
clf.fit(train_data, train_label)
y_pred = clf.predict(test_data)
print('Training size = %d, accuracy = %.2f%%' % (train_data.shape[0],accuracy_score(test_label, y_pred)*100))

Training size = 50, accuracy = 69.62%
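Note that `BernoulliNB` was fitted here on word counts rather than on 0/1 features. With its default `binarize=0.0`, `scikit-learn` thresholds the input so that any positive count becomes 1. The sketch below checks this on the small toy count matrix from earlier; the test vector `x_new` is made up for illustration.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

counts = np.array([[2, 1, 1, 0, 0, 0, 0, 0, 0],
                   [1, 1, 0, 1, 1, 0, 0, 0, 0],
                   [0, 1, 0, 0, 1, 1, 0, 0, 0],
                   [0, 1, 0, 0, 0, 0, 1, 1, 1]])
labels = np.array(['B', 'B', 'B', 'N'])

# Default binarize=0.0: counts > 0 are mapped to 1 inside the model
clf_counts = BernoulliNB().fit(counts, labels)

# binarize=None: the input is assumed to be binary already
binary = (counts > 0).astype(int)
clf_binary = BernoulliNB(binarize=None).fit(binary, labels)

same_params = np.allclose(clf_counts.feature_log_prob_,
                          clf_binary.feature_log_prob_)

x_new = np.array([[3, 0, 0, 1, 0, 0, 0, 2, 0]])  # hypothetical count vector
same_proba = np.allclose(clf_counts.predict_proba(x_new),
                         clf_binary.predict_proba((x_new > 0).astype(int)))
print(same_params, same_proba)
```

So the Bernoulli model discards how often each word occurs and keeps only whether it occurs.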
