# [CPSC 322](https://github.com/GonzagaCPSC322) Data Science Algorithms
[Gonzaga University](https://www.gonzaga.edu/)

[Gina Sprint](http://cs.gonzaga.edu/faculty/sprint/)

# Naive Bayes
What are our learning objectives for this lesson?
* Learn about Bayes Theorem
* Learn about the Naive Bayes classification algorithm

Content used in this lesson is based upon information in the following sources:
* Dr. Shawn Bowers' Data Mining notes

## Warm-up Task(s)
TBA

## Today 
TBA

## Naive Bayesian Classification
Basic ideas
* Predict class labels based on probabilities (statistics)
* Naive Bayes is often comparable in performance to "fancier" approaches
* Relatively efficient on large datasets
* Assumes "conditional independence"
    * Effect of one attribute on a class is independent from other attributes
    * Why it is called "naive"
    * Helps with execution time (speed)

## Basic Probability
$P(H)$ ... the probability of event $H$
* $H$ (hypothesis) for us would be that any given instance is of a class $C$
* Called the prior probability

$P(X)$ ... the probability of event $X$
* For us, $X$ would be an instance (a row in a table)
* The probability that an instance would have $X$'s attribute values
* Also a prior probability

$P(X|H)$ ... the conditional probability of $X$ given $H$
* The probability of $X$'s attribute values assuming we know it is of class $C$
* Called the posterior probability of $X$ conditioned on $H$ (often called the likelihood elsewhere)

$P(H|X)$ ... the conditional probability of $H$ given $X$
* The probability that $X$ is of class $C$ given $X$'s attribute values
* Also a posterior probability
* This is the one we want to know to make predictions!
    * i.e., we want the $C$ that gives the highest probability
* We can estimate $P(H)$, $P(X)$, and $P(X|H)$ from the training set
* From these, we can use Bayes Theorem to estimate $P(H|X)$:

Bayes Theorem:
$$P(H|X) = \frac{P(X|H)P(H)}{P(X)}$$

Basic idea behind Bayes Theorem:
If $P(A \cap B)$ is the probability that both $A$ and $B$ occur, then:
$$P(A \cap B) = P(A|B)P(B) = P(B|A)P(A)$$

In other words:
* Let's say $A$ occurs $x$% of the time given (within) $B$
* And $B$ occurs $y$% of the time
* Then $A$ and $B$ occur together, i.e., $A \cap B$, $x\% \cdot y\%$ of the time


For example:
* Assume we have a bucket of Lego bricks
* 50% of the 1x2 bricks are Red
* 10% of the bricks in the bucket are 1x2's
* Then, 50% of the 10% of 1x2's are Red-1x2's (i.e., 50% $\cdot$ 10%)

We can use the equality to derive Bayes Theorem:
$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{P(B|A)P(A)}{P(B)}$$
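The Lego numbers can be plugged into these formulas directly. The following is a minimal sketch: the helper name `bayes` and the overall fraction of Red bricks (`p_red = 0.25`) are assumptions added for illustration and are not part of the example above.

```python
def bayes(p_b_given_a, p_a, p_b):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return p_b_given_a * p_a / p_b

# From the Lego example
p_red_given_1x2 = 0.50  # 50% of the 1x2 bricks are Red
p_1x2 = 0.10            # 10% of the bricks in the bucket are 1x2's

# P(Red and 1x2) = P(Red | 1x2) * P(1x2)
p_red_and_1x2 = p_red_given_1x2 * p_1x2

# ASSUMPTION (not given in the example): 25% of all bricks are Red
p_red = 0.25

# Bayes Theorem flips the condition: P(1x2 | Red)
p_1x2_given_red = bayes(p_red_given_1x2, p_1x2, p_red)

print(p_red_and_1x2)
print(p_1x2_given_red)
```

Under the assumed 25% figure, about 5% of all bricks are Red 1x2's, and roughly 20% of the Red bricks are 1x2's.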

## Classification Approach
Basic Approach:
1. We're given an instance $X = [v_1, v_2, ..., v_n]$ to classify
1. Among the classes $C_1, C_2, ... , C_m$, we want to find the class $C_i$ such that:
$$P(C_i|X) > P(C_j|X) \: \textrm{for} \: 1 \leq j \leq m, j \neq i$$
In other words, we want to find the class $C_i$ with the largest $P(C_i|X)$
1. Use Bayes Theorem to find each $P(C|X)$, i.e., for each $C_i$ calculate:
$$P(C_i|X) \propto P(X|C_i)P(C_i)$$
We leave out $P(X)$ since it is the same for all classes ($C_i$'s), so it does not change which class has the largest value
1. We estimate $P(C)$ as the percentage of $C$-labeled rows in the training set
$$P(C) = \frac{|C|}{D}$$
where $|C|$ is the number of instances classified as $C$ in the training set and $D$ is the training set size
1. We estimate $P(X|C)$ using the independence assumption of attributes:
$$P(X|C) = \prod_{k=1}^{n}P(v_k|C)$$
If attribute $k$ is categorical
    * We estimate $P(v_k|C)$ as the percentage of instances with value $v_k$ (in attribute $k$) across training set instances of class $C$
    
Some notes:
* Step 5 is an optimization: estimating $P(X|C)$ directly would mean comparing entire rows, which is expensive (esp. if many attributes)
* For smaller datasets, there may also not be any rows that match $X$ exactly
* Can extend the approach to support continuous attributes...
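The steps above can be sketched in a few lines of Python for categorical attributes. This is only an illustrative sketch, not a reference implementation: the function name `naive_bayes_scores` and the toy dataset are hypothetical and made up for this example.

```python
from collections import Counter

def naive_bayes_scores(X_train, y_train, instance):
    """Return {class: P(X|C) * P(C)} for a categorical instance."""
    class_counts = Counter(y_train)
    scores = {}
    for label, count in class_counts.items():
        # Step 4: prior P(C) = |C| / D
        score = count / len(y_train)
        rows = [row for row, y in zip(X_train, y_train) if y == label]
        # Step 5: multiply in P(v_k | C) for each attribute k
        for k, value in enumerate(instance):
            matches = sum(1 for row in rows if row[k] == value)
            score *= matches / count
        scores[label] = score
    return scores

# Hypothetical toy data (two categorical attributes, two classes)
X_train = [["a", "x"], ["a", "y"], ["b", "x"],
           ["b", "y"], ["a", "x"], ["b", "x"]]
y_train = ["pos", "pos", "pos", "neg", "neg", "neg"]

scores = naive_bayes_scores(X_train, y_train, ["a", "x"])
# Step 2: predict the class with the largest score
prediction = max(scores, key=scores.get)
print(scores)
print(prediction)
```

Note that in this sketch, a value that never appears with a class in the training set makes that class's whole product 0.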

## Lab Tasks
### Lab Task 1
Consider the following labeled dataset, where result denotes class information and the remaining columns have categorical values.

|att1|att2|result|
|-|-|-|
|1|5|yes|
|2|6|yes|
|1|5|no|
|1|5|no|
|1|6|yes|
|2|6|no|
|1|5|yes|
|1|6|yes|

1. Compute the priors for the dataset (e.g., what are $P(result = yes)$ and $P(result = no)$?)
1. Compute the posteriors (conditional probabilities) for the dataset by making a table like Bramer 3.2 (e.g., what are $P(att1 = 1|result = yes)$, $P(att1 = 2|result = yes)$, $P(att2 = 5|result = yes)$, ...?)
1. If $X = [1, 5]$, what is $P(result = yes|X)$ and $P(result = no|X)$ assuming conditional independence? Show your work.
    1. What would the class label prediction be for the instance $X = [1, 5]$? Show your work.
1. If $X = [1, 5]$, what is $P(result = yes|X)$ and $P(result = no|X)$ *without* assuming conditional independence? Show your work.
    1. What would the class label prediction be for the instance $X = [1, 5]$? Show your work.

### Lab Task 2
Example adapted from [this Naive Bayes example](https://www.geeksforgeeks.org/naive-bayes-classifiers/)

Suppose we have the following dataset that has four attributes and a class attribute (PLAY GOLF):

|OUTLOOK|TEMPERATURE|HUMIDITY|WINDY|PLAY GOLF|
|-|-|-|-|-|
|Rainy|Hot|High|False|No|
|Rainy|Hot|High|True|No|
|Overcast|Hot|High|False|Yes|
|Sunny|Mild|High|False|Yes|
|Sunny|Cool|Normal|False|Yes|
|Sunny|Cool|Normal|True|No|
|Overcast|Cool|Normal|True|Yes|
|Rainy|Mild|High|False|No|
|Rainy|Cool|Normal|False|Yes|
|Sunny|Mild|Normal|False|Yes|
|Rainy|Mild|Normal|True|Yes|
|Overcast|Mild|High|True|Yes|
|Overcast|Hot|Normal|False|Yes|
|Sunny|Mild|High|True|No|

Suppose we have a new instance X = \[Sunny, Hot, Normal, False\]. 
1. What is $P(PLAY GOLF = Yes|X)$? 
1. What is $P(PLAY GOLF = No|X)$? 
1. What is the prediction for X?

## Handling Continuous Data
Assume attribute $k$ is a continuous attribute sampled from a Gaussian distribution. We want to use attribute $k$ with Naive Bayes, so we need to be able to compute $P(v_k|C)$ for a value $v_k$:
* Let $\mu_C$ be the mean of attribute $k$ for instances labeled as $C$
* Let $\sigma_C$ be the standard deviation of attribute $k$ for instances labeled as $C$
* The probability $P(v_k|C)$ is defined as:
$$P(v_k|C) = g(v_k, \mu_C, \sigma_C)$$
where the Gaussian function $g$ is defined as:
$$g(x, \mu, \sigma) = \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
* This looks pretty messy, but it is relatively straightforward in Python:

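```python
import math

def gaussian(x, mean, sdev):
    # if sdev is not positive, both factors stay 0 and the result is 0
    first, second = 0, 0
    if sdev > 0:
        # normalizing factor 1 / (sqrt(2 * pi) * sigma)
        first = 1 / (math.sqrt(2 * math.pi) * sdev)
        # exponential term e^(-(x - mu)^2 / (2 * sigma^2))
        second = math.e ** (-((x - mean) ** 2) / (2 * (sdev ** 2)))
    return first * second
```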
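One way to gain confidence in a hand-written formula like this is to compare it against an existing implementation. The sketch below repeats the `gaussian()` function so it runs on its own and checks it against `statistics.NormalDist.pdf` from the Python standard library (Python 3.8+). The input values are hypothetical and chosen only for the check.

```python
import math
from statistics import NormalDist

def gaussian(x, mean, sdev):
    # same function as above, repeated so this example is self-contained
    first, second = 0, 0
    if sdev > 0:
        first = 1 / (math.sqrt(2 * math.pi) * sdev)
        second = math.e ** (-((x - mean) ** 2) / (2 * (sdev ** 2)))
    return first * second

# Hypothetical values, chosen only for this sanity check
x, mean, sdev = 2.0, 3.0, 1.5
ours = gaussian(x, mean, sdev)
reference = NormalDist(mu=mean, sigma=sdev).pdf(x)
print(math.isclose(ours, reference))

# The result is a density value, so it can be larger than 1
print(gaussian(0, 0, 0.1) > 1)
```

Because `gaussian()` returns a density rather than a true probability, the values are useful for comparing classes but should not be expected to sum to 1.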

## Lab Task 3
Open ClassificationFun/main.py. Let's assume both `att1` and `att2` are continuous and sampled from a normal (Gaussian) distribution.
1. What is $P(att1 = 2 | result = yes)$?
1. What is $P(att2 = 3 | result = yes)$?
1. What is $P(result = yes | X = [2, 3])$?
1. What is $P(result = no | X = [2, 3])$?
1. What is Naive Bayes' prediction for X = [2, 3]?
1. How does this compare to kNN's prediction for X?

What is $P(att1 = 2 | result = yes)$? First looking at class C = "yes", we need to compute the average and standard deviation of `att1` where C = "yes". Then we can compute the posterior $P(att1 = 2 | result = yes)$ using the `gaussian()` function and `att1=2`.

```python
import numpy as np

train = [
    [3, 2],
    [6, 6],
    [4, 1],
    [4, 4],
    [1, 2],
    [2, 0],
    [0, 3],
    [1, 6]
]
train_labels = ["no", "yes", "no", "no", "yes", "no", "yes", "yes"]
test = [2, 3]

# collect the att1 values of the instances labeled "yes"
att1_class_yes = []
for i, row in enumerate(train):
    if train_labels[i] == "yes":
        att1_class_yes.append(row[0])
print(att1_class_yes)
mean = np.mean(att1_class_yes)
# np.std computes the population standard deviation by default
stdev = np.std(att1_class_yes)
# note: v_k (x in the formula) is test[0], i.e., att1 = 2
p_att1_2_given_class_yes = gaussian(test[0], mean, stdev)
print(p_att1_2_given_class_yes)
```

```
[6, 1, 0, 1]
0.17010955993225252
```


What is $P(att2 = 3 | result = yes)$? Repeat the above for `att2` where C = "yes" and `att2 = 3`.

```python
# collect the att2 values of the instances labeled "yes"
att2_class_yes = []
for i, row in enumerate(train):
    if train_labels[i] == "yes":
        att2_class_yes.append(row[1])
print(att2_class_yes)
mean = np.mean(att2_class_yes)
stdev = np.std(att2_class_yes)
# note: v_k (x in the formula) is test[1], i.e., att2 = 3
p_att2_3_given_class_yes = gaussian(test[1], mean, stdev)
print(p_att2_3_given_class_yes)
```

```
[6, 2, 3, 6]
0.17488003967875107
```


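What is $P(result = yes | X = [2, 3])$? Leaving out $P(X)$ as in the classification approach, we compute $P(result = yes | X = [2, 3]) \propto P(att1 = 2 | result = yes) \times P(att2 = 3 | result = yes) \times P(result = yes)$, where $P(result = yes) = 4/8$ because 4 of the 8 training instances are labeled yes.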

```python
p_yes_given_test = p_att1_2_given_class_yes * p_att2_3_given_class_yes * (4 / 8)
print(p_yes_given_test)
```

```
0.014874383295343602
```
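The per-attribute cells above repeat the same steps, so they can be folded into one helper that works for any class label. The following is a hedged sketch rather than the course's solution: the function name `gaussian_nb_score` is hypothetical, and it uses `statistics.fmean` and `statistics.pstdev` from the standard library in place of `np.mean` and `np.std` (both `pstdev` and the default `np.std` compute the population standard deviation).

```python
import math
from statistics import fmean, pstdev

def gaussian(x, mean, sdev):
    # same function as earlier, repeated so this example is self-contained
    first, second = 0, 0
    if sdev > 0:
        first = 1 / (math.sqrt(2 * math.pi) * sdev)
        second = math.e ** (-((x - mean) ** 2) / (2 * (sdev ** 2)))
    return first * second

def gaussian_nb_score(train, train_labels, instance, label):
    """Return P(X|C) * P(C) for class `label`, treating every attribute as Gaussian."""
    rows = [row for row, y in zip(train, train_labels) if y == label]
    score = len(rows) / len(train)  # prior P(C)
    for k, value in enumerate(instance):
        column = [row[k] for row in rows]
        # class-conditional mean and population standard deviation of attribute k
        score *= gaussian(value, fmean(column), pstdev(column))
    return score

train = [[3, 2], [6, 6], [4, 1], [4, 4], [1, 2], [2, 0], [0, 3], [1, 6]]
train_labels = ["no", "yes", "no", "no", "yes", "no", "yes", "yes"]
test = [2, 3]

score_yes = gaussian_nb_score(train, train_labels, test, "yes")
print(score_yes)  # should be close to the 0.01487... value computed above
```

Calling the same helper with another label gives the score for that class, and the prediction is the label with the larger score.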


Try finishing the rest of the lab task questions from here :)