$\newcommand{\xv}{\mathbf{x}}
 \newcommand{\wv}{\mathbf{w}}
 \newcommand{\yv}{\mathbf{y}}
 \newcommand{\zv}{\mathbf{z}}
 \newcommand{\uv}{\mathbf{u}}
 \newcommand{\vv}{\mathbf{v}}
 \newcommand{\Chi}{\mathcal{X}}
 \newcommand{\R}{\rm I\!R}
 \newcommand{\sign}{\text{sign}}
 \newcommand{\Tm}{\mathbf{T}}
 \newcommand{\Xm}{\mathbf{X}}
 \newcommand{\Zm}{\mathbf{Z}}
 \newcommand{\I}{\mathbf{I}}
 \newcommand{\Um}{\mathbf{U}}
 \newcommand{\Vm}{\mathbf{V}} 
 \newcommand{\muv}{\boldsymbol\mu}
 \newcommand{\Sigmav}{\boldsymbol\Sigma}
 \newcommand{\Lambdav}{\boldsymbol\Lambda}
$


# Naive Bayes


In this module, we learn how probability is used in learning. 
Naive Bayes is a simple classifier that makes a "naive" assumption of conditional independence among the input features.
As a fast, multi-class classifier, it has been used for real-time prediction, text classification, spam filtering, sentiment analysis, and recommendation systems. 
We start with a quick review of probability theory and then move on to the Naive Bayes algorithm. 

As we reviewed in Week 2, 
* For two (or more) random variables, a **joint probability** $p(X, Y)$ gives the probability that two (or more) random events occur at the same time. 
* A **conditional probability** $p(Y \mid X)$ measures the chance that an event ($Y$) occurs given that another event ($X$) has occurred. 
* We can **marginalize** the joint probability to obtain the distribution of a subset of the random variables by summing out the others:
$$
   p(X) = \sum_Y p(X, Y). 
$$
* Given the definition of conditional probability, 
$$
  p(Y \mid X) = \frac{p(X, Y)}{p(X)},  
$$
we can write the **product rule**
$$
  p(X, Y) = p(Y \mid X) p(X). 
$$
* When the two events $X$ and $Y$ are independent, $p(Y \mid X) = p(Y)$ because the chance that $Y$ occurs does not depend on $X$. In this case the product rule reduces to:
$$
  p(X, Y) = p(X) p(Y)
$$
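
The rules above can be checked numerically. The following is a minimal sketch, assuming a made-up joint distribution over two binary variables (the table `joint` and its numbers are hypothetical and chosen only for illustration).

```python
import numpy as np

# Hypothetical joint distribution p(X, Y) over two binary variables;
# rows index X, columns index Y. The numbers are made up for illustration.
joint = np.array([[0.1, 0.3],
                  [0.2, 0.4]])

# Marginalize: p(X) = sum_Y p(X, Y) and p(Y) = sum_X p(X, Y)
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Conditional: p(Y | X) = p(X, Y) / p(X), one row per value of X
p_y_given_x = joint / p_x[:, None]

# Product rule recovers the joint: p(X, Y) = p(Y | X) p(X)
reconstructed = p_y_given_x * p_x[:, None]

# Independence would require p(X, Y) = p(X) p(Y) in every cell
independent = np.allclose(joint, np.outer(p_x, p_y))

print(p_x, p_y)
print(p_y_given_x)
print(independent)
```

For this particular table the last check fails, so $X$ and $Y$ are not independent here.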

### Probability for Classification

Let $x$ be our input data and $y$ be the target class label. We can formulate classification as measuring the probability of the label $y$ given the input $x$, that is, the conditional probability 
$$
p(y \mid x).
$$
If we learn the distribution, we can make a prediction by picking a class label that has the maximum conditional probability. This is called the *Bayes-optimal* classifier. 

*Discriminative models* directly estimate $p(y \mid x)$, while *generative models* model the likelihood $p(x \mid y)$ and obtain the posterior using Bayes rule. 

### Bayes Rule

The joint probability is symmetric:
$$
p(X, Y) = p(Y, X). 
$$
From the product rule, 
$$
\begin{align}
p(X, Y) &= p(X \mid Y) p(Y), \\
p(Y, X) &= p(Y \mid X) p(X). 
\end{align}
$$

$$
\therefore p(X \mid Y) p(Y) = p(Y \mid X) p(X)
$$

This leads us to Bayes rule:
$$
    p(Y \mid X)  = \frac{p(X \mid Y) p(Y)}{p(X)}
$$

Remember the terminology:
* posterior: $p(Y \mid X)$
* likelihood: $p(X \mid Y)$
* prior: $p(Y)$
* evidence: $p(X)$
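
As a small worked example of Bayes rule, the sketch below computes a posterior from a prior and a likelihood. The spam/ham setting and all numbers are hypothetical, chosen only to make the arithmetic easy to follow.

```python
# Hypothetical numbers for illustration only
prior = {"spam": 0.2, "ham": 0.8}          # p(Y)
likelihood = {"spam": 0.6, "ham": 0.05}    # p(X = "word present" | Y)

# Evidence: p(X) = sum_Y p(X | Y) p(Y)
evidence = sum(likelihood[c] * prior[c] for c in prior)

# Posterior: p(Y | X) = p(X | Y) p(Y) / p(X)
posterior = {c: likelihood[c] * prior[c] / evidence for c in prior}

print(evidence)
print(posterior)
```

Here the evidence is $0.6 \times 0.2 + 0.05 \times 0.8 = 0.16$, so the posterior of "spam" is $0.12 / 0.16 = 0.75$ even though its prior is only $0.2$.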



### Maximum a posteriori

The maximum a posteriori (MAP) estimate is the mode of the posterior distribution. The evidence $p(X)$ does not depend on $Y$, so it can be dropped from the maximization:  
$$
  y_{MAP} = \arg \max_Y p(Y \mid X) = \arg \max_Y \frac{p(X \mid Y) p(Y)}{p(X)} = \arg \max_Y p(X \mid Y) p(Y)
$$

If we assume the prior distribution $p(Y)$ to be uniform or ignore it for a simple solution, we obtain the maximum likelihood: 
$$
  y_{ML} = \arg \max_Y p(X \mid Y).
$$
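
The two estimates can disagree when the prior is far from uniform. The sketch below uses two hypothetical classes "a" and "b" with made-up likelihood and prior values to show such a case.

```python
# Hypothetical values for illustration only
likelihood = {"a": 0.5, "b": 0.4}   # p(X | Y)
prior = {"a": 0.1, "b": 0.9}        # p(Y)

# Maximum likelihood ignores the prior
y_ml = max(likelihood, key=likelihood.get)

# MAP weighs the likelihood by the prior
y_map = max(prior, key=lambda c: likelihood[c] * prior[c])

print(y_ml, y_map)
```

Class "a" has the larger likelihood, but class "b" wins under MAP because $0.4 \times 0.9 > 0.5 \times 0.1$.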

## Naive Bayes Classifier

Naive Bayes models $P(X \mid Y)$ under the conditional independence assumption. 
That is, the input features are independent of each other, given the class label. 

#### Naive Bayes Assumption
$$
\begin{align*}
P(\Xm \mid Y) &= P(\xv_1, \xv_2, \ldots, \xv_N \mid Y) \\
              &=  P(\xv_1 \mid Y) P(\xv_2 \mid Y, \xv_1) P(\xv_3 \mid Y, \xv_1, \xv_2) \cdots P(\xv_N \mid Y, \xv_1, \ldots, \xv_{N-1})\\
              &= P(\xv_1 \mid Y) P(\xv_2 \mid Y) \cdots P(\xv_N \mid Y) \\
              &= \prod_{i=1}^N P(\xv_i \mid Y)
\end{align*}
$$

Using the likelihood model, we can make a prediction using the following decision rule:
$$
  y_{NB} = \arg \max_Y p(X \mid Y) p(Y) = \arg \max_Y p(Y) \prod_{i=1}^N P(\xv_i \mid Y)
$$
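
In practice the product of many small probabilities can underflow, so the decision rule is often evaluated as a sum of logarithms, which leaves the arg max unchanged. The sketch below is one possible way to do this. The function `nb_predict`, its arguments, and the toy probability tables are hypothetical, and it assumes every probability is nonzero because `math.log(0)` raises an error.

```python
import math

def nb_predict(priors, cond_probs, features):
    """Return the class maximizing log P(y) + sum_i log P(x_i | y)."""
    scores = {}
    for label, prior in priors.items():
        score = math.log(prior)
        for i, value in enumerate(features):
            # cond_probs[label][i] maps each value of feature i to P(value | label)
            score += math.log(cond_probs[label][i][value])
        scores[label] = score
    return max(scores, key=scores.get), scores

# Hypothetical two-class, two-feature model
priors = {"+": 0.5, "-": 0.5}
cond_probs = {
    "+": [{"a": 0.9, "b": 0.1}, {"c": 0.7, "d": 0.3}],
    "-": [{"a": 0.2, "b": 0.8}, {"c": 0.4, "d": 0.6}],
}

label, scores = nb_predict(priors, cond_probs, ["a", "d"])
print(label, scores)
```

For the input `["a", "d"]` the unnormalized scores are $0.5 \times 0.9 \times 0.3 = 0.135$ for "+" and $0.5 \times 0.2 \times 0.6 = 0.06$ for "-", so "+" is selected.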

### Training Naive Bayes 

The training process estimates the prior and the likelihood from the training data by counting, as below. 

$$
\hat{P}(y) = \frac{\vert \{i: y = y_i \} \vert} {N}, \\
\hat{P}(\xv_i \mid y) = \frac{\hat{P}(\xv_i, y)}{\hat{P}(y)} = \frac{ \vert \{i: \xv = \xv_i, y = y_i \} \vert \mathbin{/} N}{ \vert \{i: y = y_i \} \vert \mathbin{/} N } = \frac{\vert \{i: \xv = \xv_i, y = y_i \} \vert}{\vert \{i: y = y_i \} \vert}.
$$
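
These estimates are just normalized counts. As a minimal sketch, assuming a tiny made-up dataset with a single feature (the dataframe `toy` and its column names are hypothetical), pandas can produce them directly:

```python
import pandas as pd

# Hypothetical toy data: one categorical feature and a class label
toy = pd.DataFrame({
    "color": ["red", "red", "blue", "blue", "red"],
    "label": ["+", "+", "+", "-", "-"],
})

# Prior: fraction of samples in each class
prior_hat = toy["label"].value_counts(normalize=True)

# Likelihood: within each class (row), fraction of samples with each value
cond_hat = pd.crosstab(toy["label"], toy["color"], normalize="index")

print(prior_hat)
print(cond_hat)
```

Each row of `cond_hat` sums to one, since it is the distribution of the feature within a single class.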

### Example: Weather

Let us play with the simple weather data in the table below. Given the four input features, we want to train a classifier that predicts the binary label "Class." The last row, marked "?", is the sample we want to classify.


Outlook | Temperature | Humidity | Windy | Class
--|--|--|--|--
sunny | hot | high | false | -
sunny | hot | high | true | - 
overcast | hot | high | false | +
rain | mild | high | false | + 
rain | cool | normal | false | + 
rain | cool | normal | true | - 
overcast | cool | normal | true | +
sunny | mild | high | false | -
sunny | cool | normal | false | + 
rain | mild | normal | false | +
sunny | mild | normal | true | + 
overcast | mild | high | true | + 
overcast | hot | normal | false | + 
rain | mild | high | true | - 
sunny | cool | high | true | ? 

The %%writefile magic command in Jupyter Notebook creates a file named weather.csv with the listed content. 

In [1]:
%%writefile weather.csv
Outlook, Temperature, Humidity, Windy, Class
sunny, hot, high, false, -1
sunny, hot, high, true, -1 
overcast, hot, high, false, 1
rain, mild, high, false, 1 
rain, cool, normal, false, 1 
rain, cool, normal, true, -1 
overcast, cool, normal, true, 1
sunny, mild, high, false, -1
sunny, cool, normal, false, 1 
rain, mild, normal, false, 1
sunny, mild, normal, true, 1 
overcast, mild, high, true, 1 
overcast, hot, normal, false, 1 
rain, mild, high, true, -1 

Writing weather.csv


In [2]:
import numpy as np 
import pandas as pd

In [3]:
# load the csv file
df = pd.read_csv("weather.csv")
df

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Class
0,sunny,hot,high,False,-1
1,sunny,hot,high,True,-1
2,overcast,hot,high,False,1
3,rain,mild,high,False,1
4,rain,cool,normal,False,1
5,rain,cool,normal,True,-1
6,overcast,cool,normal,True,1
7,sunny,mild,high,False,-1
8,sunny,cool,normal,False,1
9,rain,mild,normal,False,1


We can count the values of each feature using the value_counts method in pandas. 

In [4]:
for col in df.columns:
    print(df[col].value_counts())

sunny       5
rain        5
overcast    4
Name: Outlook, dtype: int64
 mild    6
 hot     4
 cool    4
Name:  Temperature, dtype: int64
 high      7
 normal    7
Name:  Humidity, dtype: int64
 false    8
 true     6
Name:  Windy, dtype: int64
 1    9
-1    5
Name:  Class, dtype: int64


We can split the dataframe into two, one for each class.

In [6]:
dfpos = df[df.iloc[:, -1] == 1]
dfneg = df[df.iloc[:, -1] == -1]
dfpos

Unnamed: 0,Outlook,Temperature,Humidity,Windy,Class
2,overcast,hot,high,False,1
3,rain,mild,high,False,1
4,rain,cool,normal,False,1
6,overcast,cool,normal,True,1
8,sunny,cool,normal,False,1
9,rain,mild,normal,False,1
10,sunny,mild,normal,True,1
11,overcast,mild,high,True,1
12,overcast,hot,normal,False,1


The following prints the value counts for each class. 

In [7]:
for hdf in [dfpos, dfneg]:
    print("_______________________________________")
    for col in hdf.columns:
        print(hdf[col].value_counts())


_______________________________________
overcast    4
rain        3
sunny       2
Name: Outlook, dtype: int64
 mild    4
 cool    3
 hot     2
Name:  Temperature, dtype: int64
 normal    6
 high      3
Name:  Humidity, dtype: int64
 false    6
 true     3
Name:  Windy, dtype: int64
1    9
Name:  Class, dtype: int64
_______________________________________
sunny    3
rain     2
Name: Outlook, dtype: int64
 hot     2
 mild    2
 cool    1
Name:  Temperature, dtype: int64
 high      4
 normal    1
Name:  Humidity, dtype: int64
 true     3
 false    2
Name:  Windy, dtype: int64
-1    5
Name:  Class, dtype: int64


**TODO:** First, compute the priors using the number of samples for each class. 

In [8]:
# TODO: compute the prior of each class, ordered as [negative, positive]
priors = np.array([dfneg.shape[0], dfpos.shape[0]]) / df.shape[0]
priors

array([ 0.35714286,  0.64285714])


We use a Python dictionary to store the conditional probabilities for each feature given the class. 

In [9]:
# likelihood
# P(x_i | y) for every feature value x_i, stored per class y (1 and -1)

likeli = {1: {}, -1: {}}
for hdf in [dfpos, dfneg]:
    for col in df.columns[:-1]:
        Nh = hdf.shape[0]
        for val in df[col].unique():
            likeli[hdf.iloc[0, -1]][val.strip()] = np.sum(hdf[col] == val) / Nh

likeli

{-1: {'cool': 0.20000000000000001,
  'false': 0.40000000000000002,
  'high': 0.80000000000000004,
  'hot': 0.40000000000000002,
  'mild': 0.40000000000000002,
  'normal': 0.20000000000000001,
  'overcast': 0.0,
  'rain': 0.40000000000000002,
  'sunny': 0.59999999999999998,
  'true': 0.59999999999999998},
 1: {'cool': 0.33333333333333331,
  'false': 0.66666666666666663,
  'high': 0.33333333333333331,
  'hot': 0.22222222222222221,
  'mild': 0.44444444444444442,
  'normal': 0.66666666666666663,
  'overcast': 0.44444444444444442,
  'rain': 0.33333333333333331,
  'sunny': 0.22222222222222221,
  'true': 0.33333333333333331}}

**TODO:** Using the conditional independence, compute the likelihood.

In [11]:
# TODO: compute the likelihood estimate assuming the function has access to the global variable, likeli 
def likelihood(outlook, temp, humid, wind, target):
    return likeli[target][outlook] * likeli[target][temp] * \
        likeli[target][humid] * likeli[target][wind]


**TODO:** Now, compute likelihood $\times$ prior, $p(X \mid Y) p(Y)$. 

In [12]:
# discriminant for the example target: sunny, cool, high, true
neg_p = likelihood('sunny', 'cool', 'high', 'true', -1) * priors[0]
pos_p = likelihood('sunny', 'cool', 'high', 'true', 1) * priors[1]

print(neg_p, pos_p)


0.0205714285714 0.00529100529101


To get the posterior, we normalize by dividing each value by their sum. So, when it is sunny, cool, highly humid, and windy, the sample is classified as "negative" with a probability of about 80%. 

In [14]:
# normalize    
print(np.array([neg_p, pos_p]) / np.sum([neg_p, pos_p]))

[ 0.79541735  0.20458265]


So far, we have played with the Naive Bayes classifier. We can summarize its pros and cons as follows:


### Pros
- easy and fast to predict
- *when Naive Bayes assumption holds*, it works well
- performs well with categorical variables

### Cons

- *Zero Frequency*: a categorical value never observed together with a class gets probability 0
  - Worse when data is sparse
- Bad estimator: the probability outputs are not to be taken too seriously
- Naive Bayes assumption: features are rarely independent given the class in real data

# Laplacian Smoothing

As listed above, the *Zero Frequency* problem can be an issue with the original Naive Bayes. 
Even in the simple example above, we can observe 
$$
P(Outlook = overcast \mid -) = 0.
$$


When the data is sparse, we have many zero conditional probabilities, 
$$P(X = sparse \mid Y) = 0.$$

This can cause computation errors, such as a division by zero when every class has a zero numerator, as in this sample case: 

$$
P( Y=red \mid x ) = \frac{\prod_i P(x_i \mid Y=red) P(Y = red)}{ \sum_k^{colors} \Big( \prod_i P(x_i \mid Y=k) P(Y=k) \Big)} = \frac{0}{0}
$$
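
The effect is easy to reproduce. In the sketch below, the first part uses the unsmoothed conditional probabilities of the negative class computed earlier for the values overcast, cool, high, and true. The second part is a hypothetical case in which every class score is zero, so the normalization becomes $0/0$.

```python
import numpy as np

# P(overcast | -), P(cool | -), P(high | -), P(true | -) from the
# unsmoothed estimates computed earlier for the weather data
neg_conds = np.array([0.0, 0.2, 0.8, 0.6])
neg_score = neg_conds.prod() * (5 / 14)  # the single zero wipes out the product

# Hypothetical case in which every class has at least one zero factor
scores = np.array([0.0, 0.0])
with np.errstate(invalid="ignore"):  # silence the 0/0 RuntimeWarning
    posterior = scores / scores.sum()

print(neg_score)
print(posterior)
```

A single zero factor removes the class from consideration no matter how well the other features match, and when all scores are zero the posterior is undefined (NumPy returns nan).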



### Remedy
Laplacian smoothing fixes the zero frequency problem by simply adding 1 to the numerator and $k$, the number of categories of the feature, to the denominator. 


$$
P(X = sparse \mid Y) = \frac{\vert \{i: x = x_i, y = y_i \} \vert + 1}{\vert \{i: y = y_i \} \vert + k}
$$


In [15]:
# conditional probabilities with Laplacian smoothing
# P(x_i | y) for every feature value x_i, stored per class y (1 and -1)

likeli = {1: {}, -1: {}}
for hdf in [dfpos, dfneg]:
    for col in df.columns[:-1]:
        cats = df[col].unique()
        Nh = hdf.shape[0] + len(cats) # k = the number of categories
        for val in cats:
            likeli[hdf.iloc[0, -1]][val.strip()] = (np.sum(hdf[col] == val) + 1) / Nh

likeli

{-1: {'cool': 0.25,
  'false': 0.42857142857142855,
  'high': 0.7142857142857143,
  'hot': 0.375,
  'mild': 0.375,
  'normal': 0.2857142857142857,
  'overcast': 0.125,
  'rain': 0.375,
  'sunny': 0.5,
  'true': 0.5714285714285714},
 1: {'cool': 0.33333333333333331,
  'false': 0.63636363636363635,
  'high': 0.36363636363636365,
  'hot': 0.25,
  'mild': 0.41666666666666669,
  'normal': 0.63636363636363635,
  'overcast': 0.41666666666666669,
  'rain': 0.33333333333333331,
  'sunny': 0.25,
  'true': 0.36363636363636365}}
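
To see how smoothing changes the prediction for the query sample (sunny, cool, high, true), the following self-contained sketch refits the model with and without smoothing. The helpers `fit` and `posterior` and the parameter `alpha` are hypothetical names introduced here, where `alpha=0` reproduces the unsmoothed counts and `alpha=1` corresponds to the add-one smoothing above. Unlike the dictionaries above, the likelihoods are keyed by column as well as by value.

```python
import pandas as pd

rows = [
    ("sunny", "hot", "high", "false", -1),
    ("sunny", "hot", "high", "true", -1),
    ("overcast", "hot", "high", "false", 1),
    ("rain", "mild", "high", "false", 1),
    ("rain", "cool", "normal", "false", 1),
    ("rain", "cool", "normal", "true", -1),
    ("overcast", "cool", "normal", "true", 1),
    ("sunny", "mild", "high", "false", -1),
    ("sunny", "cool", "normal", "false", 1),
    ("rain", "mild", "normal", "false", 1),
    ("sunny", "mild", "normal", "true", 1),
    ("overcast", "mild", "high", "true", 1),
    ("overcast", "hot", "normal", "false", 1),
    ("rain", "mild", "high", "true", -1),
]
cols = ["Outlook", "Temperature", "Humidity", "Windy", "Class"]
df = pd.DataFrame(rows, columns=cols)

def fit(df, alpha):
    """Estimate priors and smoothed conditionals; alpha is the added count."""
    priors = {int(k): v for k, v in
              df["Class"].value_counts(normalize=True).items()}
    cond = {}
    for label, group in df.groupby("Class"):
        cond[int(label)] = {}
        for col in df.columns[:-1]:
            cats = df[col].unique()
            denom = len(group) + alpha * len(cats)  # k = len(cats)
            cond[int(label)][col] = {
                v: ((group[col] == v).sum() + alpha) / denom for v in cats
            }
    return priors, cond

def posterior(priors, cond, query):
    """Normalize prior * product of conditionals over the classes."""
    scores = {}
    for label, prior in priors.items():
        score = prior
        for col, value in query.items():
            score *= cond[label][col][value]
        scores[label] = score
    total = sum(scores.values())
    return {label: score / total for label, score in scores.items()}

query = {"Outlook": "sunny", "Temperature": "cool",
         "Humidity": "high", "Windy": "true"}
results = {}
for alpha in (0, 1):
    priors, cond = fit(df, alpha)
    results[alpha] = posterior(priors, cond, query)
    print(alpha, results[alpha])
```

With smoothing, the posterior of the negative class for this query drops from about 0.80 to about 0.72. The prediction stays "negative", but the estimate is less extreme.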