# Naive Bayes

Naive Bayes methods are a set of supervised learning algorithms based on applying Bayes' theorem with the "naive" assumption of conditional independence between every pair of features given the value of the class variable. Bayes' theorem states the following relationship, given class variable $y$ and dependent feature vector $x_1$ through $x_n$:

$$P(y \mid x_1, \dots, x_n) = \frac{P(y) P(x_1, \dots, x_n \mid y)}
                                 {P(x_1, \dots, x_n)}$$

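Under the naive conditional independence assumption, $P(x_1, \dots, x_n \mid y) = \prod_{i=1}^{n} P(x_i \mid y)$. Since the denominator $P(x_1, \dots, x_n)$ is constant for a given input, we obtain the classification rule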

$$\begin{aligned}P(y \mid x_1, \dots, x_n) \propto P(y) \prod_{i=1}^{n} P(x_i \mid y)\\\Downarrow\\\hat{y} = \arg\max_y P(y) \prod_{i=1}^{n} P(x_i \mid y)\end{aligned}$$

Then, we can use Maximum A Posteriori (MAP) estimation to estimate $P(y)$ and $P(x_i \mid y)$; the former is then the relative frequency of class $y$ in the training set.
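The decision rule above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the two classes, the three binary features, and the numbers in `prior` and `likelihood` are hypothetical and do not come from any dataset in this notebook. The sum of logarithms replaces the product to avoid numerical underflow.

```python
import numpy as np

# Hypothetical toy model: 2 classes, 3 binary features.
prior = np.array([0.6, 0.4])                 # P(y) for y = 0, 1
# likelihood[c, i] = P(x_i = 1 | y = c)
likelihood = np.array([[0.8, 0.1, 0.5],
                       [0.3, 0.7, 0.5]])

def naive_bayes_predict(x, prior, likelihood):
    """Return argmax_y P(y) * prod_i P(x_i | y) and the normalized posterior."""
    x = np.asarray(x)
    # P(x_i | y) is likelihood where x_i == 1, and 1 - likelihood otherwise.
    per_feature = np.where(x == 1, likelihood, 1.0 - likelihood)
    # Work in log space: log P(y) + sum_i log P(x_i | y).
    log_joint = np.log(prior) + np.log(per_feature).sum(axis=1)
    # Normalizing the joint recovers P(y | x_1, ..., x_n).
    posterior = np.exp(log_joint - log_joint.max())
    posterior /= posterior.sum()
    return int(np.argmax(log_joint)), posterior

y_hat, posterior = naive_bayes_predict([1, 0, 1], prior, likelihood)
print(y_hat, posterior)
```

For the input `[1, 0, 1]` the unnormalized scores are 0.6 × 0.8 × 0.9 × 0.5 = 0.216 for class 0 and 0.4 × 0.3 × 0.3 × 0.5 = 0.018 for class 1, so class 0 is predicted.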

*References*:
H. Zhang (2004). The optimality of Naive Bayes. Proc. FLAIRS.

# 1 Gaussian Naive Bayes

GaussianNB implements the Gaussian Naive Bayes algorithm for classification.   
The likelihood of the features is assumed to be Gaussian:

$$P(x_i \mid y) = \frac{1}{\sqrt{2\pi\sigma^2_y}} \exp\left(-\frac{(x_i - \mu_y)^2}{2\sigma^2_y}\right)$$

The parameters $\sigma_y$ and $\mu_y$ are estimated using maximum likelihood.
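As a rough sketch of what this estimation involves, the snippet below computes the maximum likelihood mean and variance by hand and evaluates the Gaussian log-likelihood for one new point. It assumes a separate mean and variance for each feature within each class, reuses the toy arrays from the example in this section, and omits the small variance smoothing term that `GaussianNB` adds, so it is not meant to reproduce the library exactly.

```python
import numpy as np

# Toy data (same values as the example in this section).
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]], dtype=float)
y = np.array([1, 1, 1, 2, 2, 2])

classes = np.unique(y)
# Maximum likelihood estimates, one row per class, one column per feature.
mu = np.array([X[y == c].mean(axis=0) for c in classes])
var = np.array([X[y == c].var(axis=0) for c in classes])  # ddof=0 is the MLE
log_prior = np.log(np.array([(y == c).mean() for c in classes]))

def gaussian_log_likelihood(x, mu, var):
    """Sum over features of log N(x_i; mu, var), evaluated for every class."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=1)

x_new = np.array([-0.8, -1.0])
log_joint = log_prior + gaussian_log_likelihood(x_new, mu, var)
y_hat = classes[np.argmax(log_joint)]
print(mu, var, y_hat)
```

The point `[-0.8, -1]` lies much closer to the class-1 mean `[-2, -1.33]` than to the class-2 mean, so the sketch assigns it to class 1.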

**Example** - The training data is generated as follows:

In [1]:
import numpy as np
X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])

**Q1**: Training a GaussianNB model:

In [2]:
from sklearn.naive_bayes import GaussianNB
gnb = GaussianNB()
gnb.fit(X, Y)

GaussianNB(priors=None, var_smoothing=1e-09)

**Q2**: Predict the labels of new data points, for example [-0.8, -1]:

In [3]:
X_test = np.array([[-0.8, -1], [-10, -10], [1, 1]])
y_pred = gnb.predict(X_test)
y_pred

array([1, 1, 2])
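Besides hard labels, the fitted model can also report class probabilities. The sketch below refits the same toy model so that it is self-contained, then uses `predict_proba` and `classes_` to recover the predicted labels; the variable names `proba` and `labels` are just illustrative.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
Y = np.array([1, 1, 1, 2, 2, 2])
gnb = GaussianNB().fit(X, Y)

X_test = np.array([[-0.8, -1], [-10, -10], [1, 1]])
# One row per test point, one column per class (ordered as in gnb.classes_).
proba = gnb.predict_proba(X_test)
# Taking the most probable class per row gives the same result as predict().
labels = gnb.classes_[np.argmax(proba, axis=1)]
print(proba)
print(labels)
```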

# 2 Multinomial Naive Bayes

The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). 

*References*   
C.D. Manning, P. Raghavan and H. Schuetze (2008). Introduction to Information Retrieval. Cambridge University Press, pp. 234-265. http://nlp.stanford.edu/IR-book/html/htmledition/naive-bayes-text-classification-1.html
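To make the count-based model more concrete, here is a minimal NumPy sketch of multinomial Naive Bayes with additive (Laplace) smoothing, where the per-class feature probability is estimated as $(N_{yi} + \alpha) / (N_y + \alpha n)$. The helper names `fit_multinomial_nb` and `predict_multinomial_nb` and the tiny word-count matrix are hypothetical, and the sketch is an approximation of the idea rather than the library's implementation.

```python
import numpy as np

def fit_multinomial_nb(X, y, alpha=1.0):
    """Estimate log priors and smoothed log feature probabilities per class."""
    classes = np.unique(y)
    # counts[c, i] = total count of feature i over all samples of class c.
    counts = np.array([X[y == c].sum(axis=0) for c in classes])
    smoothed = counts + alpha
    log_theta = np.log(smoothed) - np.log(smoothed.sum(axis=1, keepdims=True))
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    return classes, log_prior, log_theta

def predict_multinomial_nb(X, classes, log_prior, log_theta):
    """Pick the class maximizing log P(y) + sum_i x_i * log theta_{y,i}."""
    scores = X @ log_theta.T + log_prior
    return classes[np.argmax(scores, axis=1)]

# Hypothetical word counts: 4 documents, 3 vocabulary terms, 2 classes.
X_toy = np.array([[3, 0, 1], [2, 0, 0], [0, 4, 1], [0, 3, 2]])
y_toy = np.array([0, 0, 1, 1])

classes, log_prior, log_theta = fit_multinomial_nb(X_toy, y_toy)
pred = predict_multinomial_nb(np.array([[2, 0, 1], [0, 2, 1]]),
                              classes, log_prior, log_theta)
print(pred)
```

With `alpha=1.0`, a term that never occurs in a class still receives a small non-zero probability, which keeps the log scores finite.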

**Example** - The training data is generated as follows:

In [4]:
import numpy as np
X = np.random.randint(5, size=(6, 100))
y = np.array([1, 2, 3, 4, 5, 6])
X.shape

(6, 100)

**Q3**: Training a MultinomialNB model:

In [5]:
from sklearn.naive_bayes import MultinomialNB
mnb = MultinomialNB()
mnb.fit(X, y)

MultinomialNB(alpha=1.0, class_prior=None, fit_prior=True)

**Q4**: Predict the labels of the samples X[1:4]:

In [6]:
mnb.predict(X[1:4])

array([2, 3, 4])

In [7]:
X[1:4]

array([[3, 4, 1, 0, 4, 1, 4, 1, 3, 1, 2, 1, 2, 3, 4, 4, 1, 0, 2, 4, 3, 4,
        4, 2, 2, 1, 0, 2, 1, 0, 0, 1, 1, 4, 0, 2, 0, 0, 0, 3, 3, 1, 3, 0,
        0, 2, 2, 4, 2, 3, 4, 0, 0, 3, 1, 4, 1, 0, 2, 1, 4, 2, 3, 1, 1, 2,
        0, 1, 1, 4, 0, 4, 1, 0, 3, 2, 2, 2, 3, 2, 3, 0, 1, 3, 1, 2, 1, 2,
        1, 2, 1, 3, 2, 0, 0, 1, 1, 0, 3, 1],
       [1, 1, 0, 3, 4, 1, 2, 1, 1, 0, 1, 1, 4, 4, 3, 3, 0, 0, 4, 3, 3, 3,
        2, 2, 1, 1, 4, 3, 3, 1, 2, 4, 0, 1, 4, 0, 4, 2, 2, 0, 4, 2, 1, 1,
        3, 4, 3, 4, 3, 4, 4, 3, 3, 4, 1, 4, 1, 3, 1, 2, 4, 1, 4, 3, 1, 2,
        2, 4, 2, 4, 4, 0, 3, 2, 3, 4, 1, 0, 2, 3, 4, 2, 2, 4, 3, 4, 0, 0,
        2, 1, 2, 1, 4, 1, 0, 0, 2, 1, 0, 0],
       [1, 1, 0, 0, 2, 1, 0, 3, 1, 4, 4, 3, 0, 4, 2, 3, 2, 2, 1, 0, 0, 4,
        3, 0, 4, 1, 3, 2, 2, 4, 1, 2, 1, 3, 0, 4, 3, 4, 2, 4, 4, 0, 1, 1,
        3, 1, 0, 2, 4, 2, 3, 4, 0, 3, 1, 3, 4, 2, 3, 0, 0, 2, 2, 1, 3, 0,
        1, 3, 3, 1, 0, 3, 2, 2, 1, 3, 1, 4, 3, 0, 3, 2, 3, 4, 4, 0, 3, 2,
        1, 4, 1, 0, 4,

# 3 Process on 'Iris' Data

In Week 9, we studied how to use the KNN algorithm to classify the 'iris' data. Here, we use GaussianNB to perform the same task.

In [8]:
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
iris_dataset = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris_dataset['data'], iris_dataset['target'], random_state=0)

In [9]:
print('Dimensions of X_train: ', X_train.shape)
print('Dimensions of y_train: ', y_train.shape)
print('Dimensions of X_test: ', X_test.shape)
print('Dimensions of y_test: ', y_test.shape)

Dimensions of X_train:  (112, 4)
Dimensions of y_train:  (112,)
Dimensions of X_test:  (38, 4)
Dimensions of y_test:  (38,)


**Q5**: Report the accuracy on the test data:

In [10]:
clf1 = GaussianNB()
clf1.fit(X_train, y_train)
clf1_y_pred = clf1.predict(X_test)
print('Accuracy using Gaussian Naive Bayes: ', np.mean(clf1_y_pred == y_test))

Accuracy using Gaussian Naive Bayes:  1.0


In [11]:
clf2 = MultinomialNB()
clf2.fit(X_train, y_train)
clf2_y_pred = clf2.predict(X_test)
print('Accuracy using Multinomial Naive Bayes: ', np.mean(clf2_y_pred == y_test))

Accuracy using Multinomial Naive Bayes:  0.5789473684210527


# 4 Predict Human Activity Recognition (HAR)

The objective of this practice exercise is to predict the current human activity from physiological activity measurements in the [HAR dataset](http://groupware.les.inf.puc-rio.br/har#sbia_paper_section). Each file has 53 columns: the class label `classe` and 52 feature columns. The training (`har_train.csv`) and test (`har_validate.csv`) datasets are provided.

**Q6**: Build a Naive Bayes model, predict on the test dataset, and compute the [confusion matrix](https://towardsdatascience.com/understanding-confusion-matrix-a9ad42dcfd62). Note: Please refer to the [`sklearn.metrics.confusion_matrix`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) documentation.

In [12]:
import pandas as pd
har_train = pd.read_csv('data/har_train.csv')
har_validate = pd.read_csv('data/har_validate.csv')
print('Dimensions of har_train: ', har_train.shape)
print('Dimensions of har_validate: ', har_validate.shape)

Dimensions of har_train:  (13737, 53)
Dimensions of har_validate:  (5885, 53)


In [13]:
har_train

Unnamed: 0,classe,roll_belt,pitch_belt,yaw_belt,total_accel_belt,gyros_belt_x,gyros_belt_y,gyros_belt_z,accel_belt_x,accel_belt_y,...,total_accel_forearm,gyros_forearm_x,gyros_forearm_y,gyros_forearm_z,accel_forearm_x,accel_forearm_y,accel_forearm_z,magnet_forearm_x,magnet_forearm_y,magnet_forearm_z
0,A,1.41,8.07,-94.4,3,0.00,0.00,-0.02,-21,4,...,36,0.03,0.00,-0.02,192,203,-215,-17,654,476
1,A,1.41,8.07,-94.4,3,0.02,0.00,-0.02,-22,4,...,36,0.02,0.00,-0.02,192,203,-216,-18,661,473
2,A,1.42,8.07,-94.4,3,0.00,0.00,-0.02,-20,5,...,36,0.03,-0.02,0.00,196,204,-213,-18,658,469
3,A,1.48,8.05,-94.4,3,0.02,0.00,-0.03,-22,3,...,36,0.02,-0.02,0.00,189,206,-214,-16,658,469
4,A,1.45,8.06,-94.4,3,0.02,0.00,-0.02,-21,4,...,36,0.02,-0.02,-0.03,193,203,-215,-9,660,478
5,A,1.42,8.09,-94.4,3,0.02,0.00,-0.02,-22,3,...,36,0.02,0.00,-0.02,195,205,-215,-18,659,470
6,A,1.42,8.13,-94.4,3,0.02,0.00,-0.02,-22,4,...,36,0.02,-0.02,0.00,193,205,-213,-9,660,474
7,A,1.43,8.16,-94.4,3,0.02,0.00,-0.02,-20,2,...,36,0.03,0.00,-0.02,193,204,-214,-16,653,476
8,A,1.45,8.18,-94.4,3,0.03,0.00,-0.02,-21,2,...,36,0.02,-0.02,-0.02,193,205,-214,-17,657,465
9,A,1.43,8.18,-94.4,3,0.02,0.00,-0.02,-22,2,...,36,0.02,0.02,-0.03,191,203,-215,-11,657,478


In [14]:
X_train = har_train.drop('classe', axis = 1)
y_train = har_train['classe']
X_test = har_validate.drop('classe', axis = 1)
y_test = har_validate['classe']
print('Dimensions of X_train: ', X_train.shape)
print('Dimensions of y_train: ', y_train.shape)
print('Dimensions of X_test: ', X_test.shape)
print('Dimensions of y_test: ', y_test.shape)

Dimensions of X_train:  (13737, 52)
Dimensions of y_train:  (13737,)
Dimensions of X_test:  (5885, 52)
Dimensions of y_test:  (5885,)


In [15]:
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
print('Accuracy of Gaussian Naive Bayes: ', np.mean(y_test == y_pred))

Accuracy of Gaussian Naive Bayes:  0.5542905692438402


In [16]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, y_pred)

array([[1070,   95,  262,  212,   35],
       [ 127,  685,  145,   76,  106],
       [ 223,  106,  512,  136,   49],
       [ 102,   35,  271,  441,  115],
       [  51,  239,   95,  143,  554]])
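The confusion matrix can be summarized further. The sketch below copies the matrix printed above and derives the overall accuracy together with per-class precision and recall. In `sklearn.metrics.confusion_matrix`, rows are true labels and columns are predicted labels; the assumption here is that the five rows correspond to the sorted `classe` labels, which are referred to only by their row index.

```python
import numpy as np

# Confusion matrix copied from the output above (rows: true, columns: predicted).
cm = np.array([[1070,   95,  262,  212,   35],
               [ 127,  685,  145,   76,  106],
               [ 223,  106,  512,  136,   49],
               [ 102,   35,  271,  441,  115],
               [  51,  239,   95,  143,  554]])

# Correct predictions lie on the diagonal.
accuracy = np.trace(cm) / cm.sum()
# Recall: fraction of each true class that was predicted correctly.
recall = np.diag(cm) / cm.sum(axis=1)
# Precision: fraction of each predicted class that was actually correct.
precision = np.diag(cm) / cm.sum(axis=0)

for i, (p, r) in enumerate(zip(precision, recall)):
    print(f"class index {i}: precision={p:.3f}, recall={r:.3f}")
print(f"accuracy={accuracy:.4f}")
```

The diagonal sums to 3262 out of 5885 samples, which matches the accuracy of about 0.554 reported for Gaussian Naive Bayes.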

In [17]:
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

In [18]:
names = ["Nearest Neighbors", "Linear SVM", "RBF SVM", "Gaussian Process",
         "Decision Tree", "Random Forest", "Neural Net", "AdaBoost",
         "Naive Bayes", "QDA"]

classifiers = [KNeighborsClassifier(3),
               SVC(kernel="linear", C=0.025),
               SVC(gamma=2, C=1),
               GaussianProcessClassifier(1.0 * RBF(1.0)),
               DecisionTreeClassifier(max_depth=5),
               RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1),
               MLPClassifier(alpha=1, max_iter=1000),
               AdaBoostClassifier(),
               GaussianNB(),
               QuadraticDiscriminantAnalysis()]

In [19]:
#for name, clf in zip(names, classifiers):
#    print('classifier: ', name)
#    clf.fit(X_train, y_train)
#    score = clf.score(X_test, y_test)
#    print('Accuracy: ', score)

In [20]:
knn = KNeighborsClassifier(3)
svc_l = SVC(kernel="linear", C=0.025)
svc_g = SVC(gamma=2, C=1)
gpc = GaussianProcessClassifier(1.0 * RBF(1.0))
decision_tree = DecisionTreeClassifier(max_depth=5)
random_forest = RandomForestClassifier(max_depth=5, n_estimators=10, max_features=1)
mlp = MLPClassifier(alpha=1, max_iter=1000)
ada = AdaBoostClassifier()
qda = QuadraticDiscriminantAnalysis()

In [21]:
#KNeighborsClassifier
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
score = knn.score(X_test, y_test)
print('Accuracy of Nearest Neighbors: ', score)

Accuracy of Nearest Neighbors:  0.9378079864061173


In [22]:
#DecisionTreeClassifier
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)
score = decision_tree.score(X_test, y_test)
print('Accuracy of Decision Tree: ', score)

Accuracy of Decision Tree:  0.5587085811384876


In [23]:
#RandomForestClassifier
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)
score = random_forest.score(X_test, y_test)
print('Accuracy of Random Forest: ', score)

Accuracy of Random Forest:  0.5971112999150382


In [24]:
#AdaBoostClassifier
ada.fit(X_train, y_train)
y_pred = ada.predict(X_test)
score = ada.score(X_test, y_test)
print('Accuracy of AdaBoost: ', score)

Accuracy of AdaBoost:  0.7063721325403568


In [25]:
#QuadraticDiscriminantAnalysis
qda.fit(X_train, y_train)
y_pred = qda.predict(X_test)
score = qda.score(X_test, y_test)
print('Accuracy of QDA: ', score)

Accuracy of QDA:  0.897536108751062
