# Naive Bayes

## Bayes' Theorem

Suppose that we wish to classify an observation into one of $K$ classes.

Let $\pi_k$ represent the overall *prior* probability that a randomly chosen observation comes from the $k$th class.

Let $f_k(x) = \Pr(X=x|Y=k)$ denote the *likelihood*, that is, the density (or probability) of observing $X=x$ given that the observation comes from the $k$th class.

Then **Bayes' theorem** states that the *posterior* probability is given by,

$$\Pr(Y=k|X=x) = \frac{\pi_k \cdot f_k(x)}{\sum_{i=1}^{K}\pi_i \cdot f_i(x)}$$

In general, this model is called a **Bayes classifier**. To use it, we need to select a form for the likelihood function $f_k(x)$.

Note that $X$ represents our predictor variables, meaning that $X$ can actually be a vector of features $X = [X_1, X_2, \ldots, X_p]$. 

To "select a form for our likelihood function" means that we need to define the joint distribution over all our predictor random variables $[X_1, X_2, \ldots, X_p]$.
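As a minimal numeric sketch of the posterior formula above, assume some hypothetical priors and likelihood values for $K=3$ classes. The numbers are made up for illustration, and `bayes_posterior` is a helper name introduced here, not a library function.

```python
import numpy as np

def bayes_posterior(prior, likelihood):
    """Return Pr(Y=k | X=x) for every class k via Bayes' theorem."""
    prior = np.asarray(prior, dtype=float)
    likelihood = np.asarray(likelihood, dtype=float)
    joint = prior * likelihood   # numerator: pi_k * f_k(x)
    return joint / joint.sum()   # denominator: sum over all classes

# Hypothetical priors pi_k and likelihoods f_k(x) at one fixed x
prior = [0.5, 0.3, 0.2]
likelihood = [0.10, 0.40, 0.25]

posterior = bayes_posterior(prior, likelihood)
print(posterior)
print(posterior.argmax() + 1)  # 1-based index of the most probable class
```

Note that the class with the largest prior need not have the largest posterior, since the likelihood also enters the numerator.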

## Naive Bayes Classifier

Suppose we assume that *within the $k$th class, the $p$ predictors are independent*. **This assumption makes the model very fast to train**, because each $f_{kj}$ can be estimated separately from a single predictor, and it lets the likelihood factorize as

$$f_k(x) = f_{k1}(x_1) \cdot f_{k2}(x_2) \cdots f_{kp}(x_p)$$

where $f_{kj}$ is the density function of the $j$th predictor among observations in the $k$th class.

Under this assumption, our classification model is called a **Naive Bayes classifier**. Thus our posterior probability is given by,

$$\Pr(Y=k|X=x) = \frac{\pi_k \cdot f_{k1}(x_1) \cdot f_{k2}(x_2) \cdots f_{kp}(x_p)}{\sum_{i=1}^{K}\pi_i \cdot f_{i1}(x_1) \cdot f_{i2}(x_2) \cdots f_{ip}(x_p)}$$
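With many predictors, the product of small densities in the numerator can underflow numerically, so implementations commonly work with logarithms instead. Taking the log of the numerator above gives the following standard identity (stated here as a supplement, not as part of the original derivation):

$$\log\left(\pi_k \prod_{j=1}^{p} f_{kj}(x_j)\right) = \log \pi_k + \sum_{j=1}^{p} \log f_{kj}(x_j)$$

Since the denominator is the same for every class, the class that maximizes this sum is also the class with the largest posterior probability.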

### Choosing a Model for $f_{kj}$

If $X_j$ is numerical, we can assume that

$$X_j|Y=k \sim N(\mu_{jk}, \sigma_{jk}^2)$$

In other words, we say that the $j$th predictor is drawn from a univariate normal distribution within each class $k$.

If $X_j$ is categorical, then for each class $k$ we can simply estimate the probability of each category as the proportion of training observations in class $k$ that take that category.
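The two estimation rules above can be sketched on a tiny made-up dataset. The column names `x_num` and `x_cat` and all values are hypothetical, chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

# Made-up training data: one numerical and one categorical predictor
train = pd.DataFrame({
    "x_num": [1.0, 2.0, 3.0, 6.0, 8.0, 10.0],
    "x_cat": [0, 0, 1, 1, 1, 0],
    "y":     [0, 0, 0, 1, 1, 1],
})

params = {}
for k, group in train.groupby("y"):
    # Numerical predictor: class-conditional mean and MLE (ddof=0) standard deviation
    mu = group["x_num"].mean()
    sigma = group["x_num"].std(ddof=0)
    # Categorical predictor: proportion of each category within the class,
    # sorted so that the entries follow the category labels
    props = group["x_cat"].value_counts(normalize=True).sort_index()
    params[k] = (mu, sigma, props)
    print(k, mu, sigma, props.to_dict())
```

Each class ends up with one Gaussian (a mean and a standard deviation) per numerical predictor and one probability vector per categorical predictor.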

### Toy Example - Naive Bayes with Mixed Data Types

To understand what's happening, let's work through a toy example.

In [36]:
import numpy as np
import pandas as pd

df = pd.read_csv('nb_toy.csv')
df

Unnamed: 0,X1,X2,X3,y
0,26.24,-5.79,2,1
1,3.88,0.90,0,0
2,4.72,-2.39,1,1
3,-0.73,-1.63,0,0
4,18.65,-8.38,0,0
...,...,...,...,...
95,10.77,-10.51,2,0
96,6.56,-1.59,0,0
97,10.44,-2.32,0,1
98,3.80,-0.76,2,0


In this dataset, we have 3 predictors (2 numerical and 1 categorical with 3 categories) and 2 target classes.

>(In this case we don't one-hot encode (OHE) the categorical predictor, because doing so may not have a significant impact on the performance of the model.) -basti

First, let's set up our Naive Bayes classifier by defining the priors.

In [37]:
# Number of classes
k = 2

# Prior probabilities (uniform)
pi = np.ones(2)/2
pi

array([0.5, 0.5])
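The uniform prior above is one choice. Another common option is to estimate the priors from the class proportions in the training data. A minimal sketch, using made-up labels rather than the toy dataset (the names `y_demo` and `pi_empirical` are introduced here for illustration):

```python
import numpy as np

# Made-up labels: 7 observations of class 0 and 3 of class 1
y_demo = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# Empirical priors: the fraction of training observations in each class
classes, counts = np.unique(y_demo, return_counts=True)
pi_empirical = counts / counts.sum()
print(pi_empirical)
```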

Next, we need to estimate each of the $f_{kj}$. (j runs from 1 to 3 since we have 3 predictors)

That's $f_{11}, f_{12}, f_{13}$  for $k=1$, and $f_{21}, f_{22}, f_{23}$ for $k=2$.

So let's split the dataset by class. Note that class $k=1$ corresponds to the label `y = 0` and class $k=2$ to the label `y = 1`.

In [38]:
df_1 = df[df['y'] == 0] # get data with label 0
df_2 = df[df['y'] == 1] # get data with label 1

Let's estimate $f_{11}, f_{12}, f_{13}$. 

We should end up with 2 normal distributions and 1 categorical distribution.

In [39]:
# f_11: parameters of the Gaussian for X1 in class 1
# (these are estimates, not the density itself; we plug them into the normal pdf later)
mean_11 = np.mean(df_1['X1'])  # MLE of the mean
std_11 = np.std(df_1['X1'])    # MLE of the standard deviation (ddof=0)

print('f_11 mean =', round(mean_11,2))
print('f_11 std =', round(std_11,2))

# f_12: parameters of the Gaussian for X2 in class 1
mean_12 = np.mean(df_1['X2'])
std_12 = np.std(df_1['X2'])

print('f_12 mean =', round(mean_12,2))
print('f_12 std =', round(std_12,2))

# f_13 (probability vector)
# value_counts() orders by frequency, so sort_index() is needed for entry i to hold category i
f_13 = df_1['X3'].value_counts(normalize=True).sort_index().to_numpy()
f_13

f_11 mean = 11.12
f_11 std = 9.6
f_12 mean = -3.72
f_12 std = 3.92


array([0.47058824, 0.29411765, 0.23529412])

In [40]:
mean_11

11.116617647058824

Let's estimate $f_{21}, f_{22}, f_{23}$ in the same way as above.

In [41]:
# f_21
mean_21 = np.mean(df_2['X1'])
std_21 = np.std(df_2['X1'])

print('f_21 mean =', round(mean_21,2))
print('f_21 std =', round(std_21,2))

# f_22
mean_22 = np.mean(df_2['X2'])
std_22 = np.std(df_2['X2'])

print('f_22 mean =', round(mean_22,2))
print('f_22 std =', round(std_22,2))

# f_23 (probability vector; sort_index() makes entry i hold category i)
f_23 = df_2['X3'].value_counts(normalize=True).sort_index().to_numpy()
f_23

f_21 mean = 9.52
f_21 std = 6.89
f_22 mean = -2.68
f_22 std = 3.18


array([0.4375, 0.4375, 0.125 ])

Suppose we have a new observation $x^* = [x_1^*, x_2^*, x_3^*]$.

Then, using the Naive Bayes formula, the posterior probability of $x^*$ belonging to class $k$ is given by

$$\pi_k' = \frac{\pi_k \cdot f_{k1}(x_1^*) \cdot f_{k2}(x_2^*) \cdot f_{k3}(x_3^*)}{(\pi_1 \cdot f_{11}(x_1^*) \cdot f_{12}(x_2^*) \cdot f_{13}(x_3^*)) + (\pi_2 \cdot f_{21}(x_1^*) \cdot f_{22}(x_2^*) \cdot f_{23}(x_3^*))}$$

for $k=1,2$.

As an example, let $x^* = [8, -1, 0]$. Let's calculate the posterior probabilities for $x^*$.

In [42]:
from scipy.stats import norm

# New data point
x_new = np.array([8, -1, 0])

# Store posteriors here
pi_post = np.zeros(2) # for storage

# Calculate posteriors
prior1_lik1 = pi[0] * norm.pdf(x_new[0], mean_11, std_11) * norm.pdf(x_new[1], mean_12, std_12) * f_13[x_new[2]]
prior2_lik2 = pi[1] * norm.pdf(x_new[0], mean_21, std_21) * norm.pdf(x_new[1], mean_22, std_22) * f_23[x_new[2]]

pi_post[0] = prior1_lik1/(prior1_lik1 + prior2_lik2)
pi_post[1] = prior2_lik2/(prior1_lik1 + prior2_lik2)

print(pi_post)

[0.3545775 0.6454225]


In [43]:
def gaussian_pdf(x, mu, sigma):
    return (1/np.sqrt(2*np.pi*sigma**2))*np.exp(-(x-mu)**2/(2*sigma**2))

print(gaussian_pdf(x_new[0], mean_11, std_11))
norm.pdf(x_new[0], mean_11, std_11)

0.03943492349767363


0.039434923497673635

Therefore, we would classify this new observation as $k=2$ (label `y = 1`), since that class has the larger posterior probability (about 0.645).
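The same calculation can also be done in log space, which avoids multiplying many small numbers. The sketch below is an illustration, not the notebook's method: `nb_posterior` is a hypothetical helper, and the parameters are the rounded values printed above, so the result only approximates the exact posterior.

```python
import numpy as np

def nb_posterior(x_num, x_cat, prior, means, stds, cat_probs):
    """Naive Bayes posterior for numerical predictors plus one categorical predictor.

    means, stds: arrays of shape (n_classes, n_numerical)
    cat_probs:   array of shape (n_classes, n_categories)
    """
    # Log of the normal density, evaluated per class and per numerical predictor
    log_pdf = -0.5 * np.log(2 * np.pi * stds**2) - (x_num - means) ** 2 / (2 * stds**2)
    log_joint = np.log(prior) + log_pdf.sum(axis=1) + np.log(cat_probs[:, x_cat])
    log_joint -= log_joint.max()   # stabilize before exponentiating
    joint = np.exp(log_joint)
    return joint / joint.sum()

# Rounded estimates taken from the printed output above
prior = np.array([0.5, 0.5])
means = np.array([[11.12, -3.72], [9.52, -2.68]])
stds = np.array([[9.60, 3.92], [6.89, 3.18]])
# Probability vectors as printed above; only the category-0 entry is used here
cat_probs = np.array([[0.47058824, 0.29411765, 0.23529412],
                      [0.4375, 0.4375, 0.125]])

post = nb_posterior(np.array([8.0, -1.0]), 0, prior, means, stds, cat_probs)
print(post)
```

Subtracting the maximum log value before exponentiating leaves the normalized result unchanged and keeps the exponentials in a safe numeric range.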

### Verify using `sklearn` (Numeric Data Types)

Unfortunately, `sklearn` does not provide a single Naive Bayes estimator for mixed data types. However, we can still combine its estimators to handle this case.

Before that, let's see if we can replicate the posterior probabilities using only the numeric predictors.

In [None]:
# Store posteriors here
pi_post_num = np.zeros(2)

# Calculate posteriors (numeric predictors only, so the third, categorical term is dropped)
prior1_lik1_num = pi[0] * norm.pdf(x_new[0], mean_11, std_11) * norm.pdf(x_new[1], mean_12, std_12)
prior2_lik2_num = pi[1] * norm.pdf(x_new[0], mean_21, std_21) * norm.pdf(x_new[1], mean_22, std_22)

pi_post_num[0] = prior1_lik1_num/(prior1_lik1_num + prior2_lik2_num)
pi_post_num[1] = prior2_lik2_num/(prior1_lik1_num + prior2_lik2_num)

print(pi_post_num)

[0.33807489 0.66192511]


In [None]:
from sklearn.naive_bayes import GaussianNB

# Select columns by name so the code does not depend on column positions
X_num = df[['X1', 'X2']].to_numpy()
y = df['y']

model = GaussianNB(priors=pi).fit(X_num, y)
model.predict_proba(np.array([[8, -1]]))

array([[0.33807489, 0.66192511]])

### Verify using `sklearn` (Categorical Data Types)

Now let's see if we can replicate the posterior probabilities using only the categorical predictor.

In [None]:

# Store posteriors here
pi_post_cat = np.zeros(2)

# Calculate posteriors
prior1_lik1_cat = pi[0] * f_13[x_new[2]]
prior2_lik2_cat = pi[1] * f_23[x_new[2]]

pi_post_cat[0] = prior1_lik1_cat/(prior1_lik1_cat + prior2_lik2_cat)
pi_post_cat[1] = prior2_lik2_cat/(prior1_lik1_cat + prior2_lik2_cat)

print(pi_post_cat)

[0.51821862 0.48178138]


In [47]:

from sklearn.naive_bayes import CategoricalNB

X_cat = df[['X3']].to_numpy()  # 2-D array with a single categorical column
y = df['y']

# alpha controls Laplace smoothing; alpha=0 requests no smoothing
model = CategoricalNB(alpha=0, class_prior=pi).fit(X_cat, y)
model.predict_proba(np.array([[0]]))



array([[0.51821862, 0.48178138]])

### Exercise 1 - Laplace Smoothing


Modify our original implementation to include a [Laplace smoothing](https://scikit-learn.org/stable/modules/naive_bayes.html#categorical-naive-bayes) parameter $\alpha$.

In [48]:
# Reference result from sklearn with Laplace smoothing (alpha=1), to be replicated with our own code
model_replicate = CategoricalNB(alpha=1, class_prior=pi).fit(X_cat, y)
model_replicate.predict_proba(np.array([[0]]))

array([[0.52027027, 0.47972973]])

In [63]:
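a = 1  # Laplace smoothing parameter alpha
n = 3  # number of categories of X3

# Smoothed estimate per category: (count + a) / (class size + a*n).
# reindex() puts the counts in category order and gives unseen categories a count of 0.
counts_1 = df_1['X3'].value_counts().reindex(range(n), fill_value=0)
counts_2 = df_2['X3'].value_counts().reindex(range(n), fill_value=0)
f_13_lap = ((counts_1 + a)/(df_1['X3'].shape[0] + (a*n))).to_numpy()
f_23_lap = ((counts_2 + a)/(df_2['X3'].shape[0] + (a*n))).to_numpy()

# Store posteriors here
pi_post_cat = np.zeros(2)
# Calculate posteriors
prior1_lik1_cat = pi[0] * f_13_lap[x_new[2]]
prior2_lik2_cat = pi[1] * f_23_lap[x_new[2]]

pi_post_cat[0] = prior1_lik1_cat/(prior1_lik1_cat + prior2_lik2_cat)
pi_post_cat[1] = prior2_lik2_cat/(prior1_lik1_cat + prior2_lik2_cat)

print(pi_post_cat)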

[0.52027027 0.47972973]
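To see why smoothing matters, consider a category that never occurs within a class: without smoothing its estimated probability is exactly zero, which forces the whole Naive Bayes product for that class to zero. A small sketch with made-up counts (the helper `smoothed_probs` is introduced here for illustration):

```python
import numpy as np

def smoothed_probs(counts, alpha):
    """Category probabilities with additive (Laplace) smoothing."""
    counts = np.asarray(counts, dtype=float)
    return (counts + alpha) / (counts.sum() + alpha * len(counts))

# Made-up category counts within one class: category 2 was never observed
counts = np.array([6, 4, 0])

p_raw = smoothed_probs(counts, alpha=0)  # plain proportions
p_lap = smoothed_probs(counts, alpha=1)  # Laplace-smoothed proportions
print(p_raw)
print(p_lap)
```

With `alpha=1`, every category receives a small nonzero probability while the vector still sums to 1.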


### Exercise 2 - Naive Bayes Classifier with Mixed Data Types using `sklearn`

Write code that uses `GaussianNB` and `CategoricalNB` to calculate the posterior probabilities for the mixed data type observation $x^* = [8, -1, 0]$. 

*Very cool version: Write the code as a general wrapper function.*

In [66]:
# Your code here
from sklearn.naive_bayes import CategoricalNB, GaussianNB

y = df['y']

# Gaussian NB on the numerical predictors only
X_num = df[['X1', 'X2']].to_numpy()
model_gauss = GaussianNB(priors=pi).fit(X_num, y)
gauss_prob = model_gauss.predict_proba(np.array([[8, -1]]))

# Categorical NB on the categorical predictor only (alpha=0 requests no Laplace smoothing)
X_cat = df[['X3']].to_numpy()
model_cat = CategoricalNB(alpha=0, class_prior=pi).fit(X_cat, y)
cat_prob = model_cat.predict_proba(np.array([[0]]))

print('gauss', gauss_prob)
print('cat', cat_prob)
print('replicate', pi_post)

gauss [[0.33807489 0.66192511]]
cat [[0.51821862 0.48178138]]
replicate [0.3545775 0.6454225]




In [75]:
print('replicate', pi_post)
# Multiply the two posteriors elementwise, then renormalize so they sum to 1
(gauss_prob * cat_prob) / np.sum(gauss_prob * cat_prob)

replicate [0.3545775 0.6454225]


array([[0.3545775, 0.6454225]])

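For the mixed data type problem in Naive Bayes, you can combine the predictions of `sklearn`'s Gaussian and categorical NB models and then normalize the combination.

multiply: combines the two sets of predictions<br>
divide by `np.sum`: normalizes so that the probabilities sum to 1

Note that each posterior already includes the prior $\pi_k$, so the product counts the prior twice. With uniform priors, as here, the extra factor cancels in the normalization. With non-uniform priors, divide the product by $\pi_k$ once before normalizing.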
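One possible shape for the general wrapper is sketched below. The function name `mixed_nb_predict_proba` and the synthetic data are assumptions made for illustration. The one step beyond the cell above is dividing out one copy of the prior, since each `sklearn` posterior already contains it.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB, GaussianNB

def mixed_nb_predict_proba(X_num, X_cat, y, X_num_new, X_cat_new, priors, alpha=1.0):
    """Combine GaussianNB and CategoricalNB posteriors for mixed predictors."""
    priors = np.asarray(priors, dtype=float)
    gnb = GaussianNB(priors=priors).fit(X_num, y)
    cnb = CategoricalNB(alpha=alpha, class_prior=priors).fit(X_cat, y)
    p_num = gnb.predict_proba(X_num_new)
    p_cat = cnb.predict_proba(X_cat_new)
    # Each posterior already contains the prior once, so divide one copy out
    unnormalized = p_num * p_cat / priors
    return unnormalized / unnormalized.sum(axis=1, keepdims=True)

# Synthetic stand-in data (not nb_toy.csv)
rng = np.random.default_rng(0)
n = 200
y_demo = rng.integers(0, 2, size=n)
X_num_demo = rng.normal(loc=y_demo[:, None] * 2.0, scale=1.0, size=(n, 2))
X_cat_demo = rng.integers(0, 3, size=(n, 1))

proba = mixed_nb_predict_proba(
    X_num_demo, X_cat_demo, y_demo,
    X_num_new=np.array([[1.0, 0.5]]),
    X_cat_new=np.array([[0]]),
    priors=[0.7, 0.3],
)
print(proba)
```

With uniform priors the division is a constant factor and the result reduces to the multiply-then-normalize rule used above.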