# Sect 28: Bayesian Classification

## Learning Objectives

- Understand how Bayes' theorem can be applied to classify data using conditional probabilities.

- Understand Gaussian Naive Bayes and how it uses the probability density function of a normal distribution.

- Understand the "underflow" issue and how to fix it.

- Apply Naive Bayes manually and with sklearn.

    - Activity 1: Gaussian Naive Bayes Lab
    - Activity 2: Document Classification with Naive Bayes

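## Bayes' Theorem Revisited

$$ \large P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$$

For a class $y$ and features $x_1, x_2, ..., x_n$, the "naive" assumption that the features are conditionally independent given the class lets the likelihood factor into a product:

$$ \Large P(y|x_1, x_2, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$ 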


***The Bayesian interpretation of this formula is:***

$$ \large P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$$


$$ \large \text{Posterior} = \dfrac{\text{Likelihood} \cdot \text{Prior}}{\text{Evidence}}$$
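To make the four terms concrete, here is a minimal numeric sketch. The scenario (a diagnostic test) and all three input probabilities are hypothetical values chosen only for illustration; they do not come from the lesson's data.

```python
# Hypothetical inputs, invented for illustration only
prior = 0.01                 # P(A): probability of having the condition
likelihood = 0.95            # P(B|A): probability of a positive test given the condition
false_positive_rate = 0.05   # P(B|not A): probability of a positive test without it

# Evidence P(B): total probability of a positive test over both cases
evidence = likelihood * prior + false_positive_rate * (1 - prior)

# Posterior P(A|B) = Likelihood * Prior / Evidence
posterior = likelihood * prior / evidence
print(round(posterior, 3))
```

Even with a fairly accurate test, the posterior stays low (about 0.16 here) because the prior is so small.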

## Gaussian Naive Bayes

- Gaussian Naive Bayes assumes that, within each class, every feature follows a normal distribution.
- It evaluates the probability density function (PDF) of that normal (Gaussian) distribution at the observed value to get a point estimate of the likelihood $P(x_i|y)$.

In [1]:
!pip install -U fsds_100719
from fsds_100719.imports import *

fsds_1007219  v0.7.19 loaded.  Read the docs: https://fsds.readthedocs.io/en/latest/ 


Handle,Package,Description
dp,IPython.display,Display modules with helpful display and clearing commands.
fs,fsds_100719,Custom data science bootcamp student package
mpl,matplotlib,Matplotlib's base OOP module with formatting artists
plt,matplotlib.pyplot,Matplotlib's matlab-like plotting module
np,numpy,scientific computing with Python
pd,pandas,High performance data structures and tools
sns,seaborn,High-level data visualization library based on matplotlib


[i] Pandas .iplot() method activated.


In [2]:
from scipy import stats
from sklearn import datasets
iris = datasets.load_iris()

X = pd.DataFrame(iris.data)
X.columns = iris.feature_names

y = pd.DataFrame(iris.target)
y.columns = ['Target']

df = pd.concat([X, y], axis=1)
df.head()

,sepal length (cm),sepal width (cm),petal length (cm),petal width (cm),Target
0,5.1,3.5,1.4,0.2,0
1,4.9,3.0,1.4,0.2,0
2,4.7,3.2,1.3,0.2,0
3,4.6,3.1,1.5,0.2,0
4,5.0,3.6,1.4,0.2,0


In [3]:
aggs = df.groupby('Target').agg(['mean', 'std'])
aggs

Target,sepal length (cm) mean,sepal length (cm) std,sepal width (cm) mean,sepal width (cm) std,petal length (cm) mean,petal length (cm) std,petal width (cm) mean,petal width (cm) std
0,5.006,0.35249,3.428,0.379064,1.462,0.173664,0.246,0.105386
1,5.936,0.516171,2.77,0.313798,4.26,0.469911,1.326,0.197753
2,6.588,0.63588,2.974,0.322497,5.552,0.551895,2.026,0.27465


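$$ \Large P(x_i|y) = \frac{1}{\sqrt{2 \pi \sigma_{i,y}^2}}e^{-\frac{(x_i-\mu_{i,y})^2}{2\sigma_{i,y}^2}}$$

Here $\mu_{i,y}$ and $\sigma_{i,y}$ are the mean and standard deviation of feature $i$ within class $y$, taken from the table above.

$$ \Large P(y|x_1, x_2, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$ 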
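As a sanity check on the density formula, the sketch below writes it out term by term and compares it with `scipy.stats.norm.pdf`. The helper name `gaussian_pdf` is hypothetical; the inputs are the first iris row's petal length (1.4) and the class 0 mean and standard deviation for petal length from the table above.

```python
import numpy as np
from scipy import stats

def gaussian_pdf(x, mu, sigma):
    """Normal density at x, written out to mirror the formula."""
    coeff = 1.0 / np.sqrt(2 * np.pi * sigma**2)
    exponent = -((x - mu) ** 2) / (2 * sigma**2)
    return coeff * np.exp(exponent)

# Petal length of the first iris row, with the class 0 mean and std from the table
x, mu, sigma = 1.4, 1.462, 0.173664

manual = gaussian_pdf(x, mu, sigma)
library = stats.norm.pdf(x, loc=mu, scale=sigma)
print(manual, library)
```

Both values agree and come out around 2.155, which shows that a density can exceed 1 when the standard deviation is small.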


In [5]:
def p_x_given_class(obs_row, feature, class_):
    """Density of one feature value under the class's fitted normal distribution."""
    # Per-class mean and standard deviation from the aggregated table
    mu = aggs[feature]['mean'][class_]
    std = aggs[feature]['std'][class_]
    
    # A single observation
    obs = df.iloc[obs_row][feature] 
    
    p_x_given_y = stats.norm.pdf(obs, loc=mu, scale=std)
    return p_x_given_y

# This is a probability density, not a true probability, so values > 1 are possible
p_x_given_class(0, 'petal length (cm)', 0) 

2.1553774365786804

In [6]:
row = 100
c_probs = []
for c in range(3):
    # Start from the prior: the relative frequency of the class
    p = len(df[df['Target'] == c])/len(df) 
    for feature in X.columns:
        # Multiply in the density of each feature given the class
        p *= p_x_given_class(row, feature, c) 
    # Record one unnormalized posterior per class
    c_probs.append(p)

c_probs

[9.529514999027405e-251,
 2.460149009916488e-12,
 0.023861042537402642]

In [9]:
def predict_class(row):
    """Return the class with the largest unnormalized posterior for one row."""
    c_probs = []
    for c in range(3):
        # Start from the prior: the relative frequency of the class
        p = len(df[df['Target'] == c])/len(df) 
        for feature in X.columns:
            p *= p_x_given_class(row, feature, c)
        c_probs.append(p)
    return np.argmax(c_probs)

In [10]:
row = 0
df.iloc[row]
predict_class(row)

0

In [11]:
df['Predictions'] =  [predict_class(row) for row in df.index]
df['Correct?'] = df['Target'] == df['Predictions']
df['Correct?'].value_counts(normalize=True)

True     0.96
False    0.04
Name: Correct?, dtype: float64
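For comparison with the manual loop, the same model can be fit with scikit-learn's `GaussianNB`. This is a minimal sketch, not the lab solution. Like the manual version, it scores on the training data itself, so the accuracy is optimistic; a held-out split would give a fairer estimate. scikit-learn estimates the variances slightly differently (including a small smoothing term), so individual probabilities may not match the manual ones exactly.

```python
from sklearn import datasets
from sklearn.naive_bayes import GaussianNB

iris = datasets.load_iris()

# Fit one Gaussian per feature per class, plus the class priors
model = GaussianNB()
model.fit(iris.data, iris.target)

# Accuracy on the training data (same setup as the manual loop)
train_accuracy = model.score(iris.data, iris.target)
print(train_accuracy)

# Per-class feature means learned by the model, shape (n_classes, n_features)
print(model.theta_)
```

The rows of `model.theta_` should line up with the `mean` columns of the `aggs` table.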

## Avoiding "underflow"

> "...repeatedly multiplying small probabilities can lead to underflow; rounding to zero due to numerical approximation limitations. As such, a common alternative is to add the logarithms of the probabilities as opposed to multiplying the raw probabilities themselves..."<br>
$$ \large \log(a \cdot b) = \log(a) + \log(b)$$  
$$ \large e^x \cdot e^y = e^{x+y}$$  
$$ \large \log_{e}(e)=1 $$  
$$ \large e^{\log(x)} = x$$ 
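The effect is easy to reproduce. The sketch below multiplies 500 small values (a hypothetical stand-in for per-feature likelihoods, e.g. word probabilities in a long document) and compares the result with the sum of their logarithms.

```python
import numpy as np

# 500 hypothetical likelihoods of 0.001 each, for illustration only
probs = np.full(500, 1e-3)

# The true product is 1e-1500, far below the smallest positive float64
product = np.prod(probs)

# The sum of logs stays a perfectly ordinary number
log_sum = np.sum(np.log(probs))

print(product)   # 0.0
print(log_sum)
```

Once the product has rounded to zero, every class ties at 0 and `argmax` can no longer tell them apart; the log sums remain comparable.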

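Because the logarithm is monotonically increasing, the class with the largest sum of log probabilities is also the class with the largest product of probabilities, so the prediction does not change. With that, here's an updated version of the function using log probabilities to avoid underflow: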

In [None]:
def predict_class_log(row):
    c_log_probs = []
    for c in range(3):
        # Start from the log of the prior (the relative frequency of the class)
        log_p = np.log(len(df[df['Target'] == c])/len(df))
        for feature in X.columns:
            # Adding logs replaces multiplying probabilities
            log_p += np.log(p_x_given_class(row, feature, c))
        c_log_probs.append(log_p)
    return np.argmax(c_log_probs)

In [12]:
row = 0

df.iloc[row]
print(predict_class_log(row))
df['Predictions'] =  [predict_class_log(row) for row in df.index]
df['Correct?'] = df['Target'] == df['Predictions']
df['Correct?'].value_counts(normalize=True)

0


True     0.96
False    0.04
Name: Correct?, dtype: float64
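Taking `np.log` of a density that has already rounded to zero still gives `-inf`, so a more robust variant stays in log space throughout. The sketch below is one way to do that, assuming `scipy.stats.norm.logpdf` for the log densities and `scipy.special.logsumexp` for the log of the evidence term; the names `X_arr`, `y_arr`, and `log_posteriors` are hypothetical and chosen so they do not overwrite the notebook's `X` and `y`.

```python
import numpy as np
from scipy import stats
from scipy.special import logsumexp
from sklearn import datasets

iris = datasets.load_iris()
X_arr, y_arr = iris.data, iris.target
classes = np.unique(y_arr)

# Per-class log priors, means, and sample standard deviations (ddof=1, as pandas uses)
log_priors = np.log([np.mean(y_arr == c) for c in classes])
means = np.array([X_arr[y_arr == c].mean(axis=0) for c in classes])
stds = np.array([X_arr[y_arr == c].std(axis=0, ddof=1) for c in classes])

def log_posteriors(x):
    """Normalized log posterior of each class for one observation x."""
    # logpdf computes the log density directly; broadcasting gives shape (3, 4)
    log_joint = log_priors + stats.norm.logpdf(x, loc=means, scale=stds).sum(axis=1)
    # Subtracting log(evidence) normalizes; logsumexp computes it stably
    return log_joint - logsumexp(log_joint)

# Posterior probabilities for row 100, recovered by exponentiating
post = np.exp(log_posteriors(X_arr[100]))
print(post.argmax(), post.sum())

# Training accuracy over all rows
preds = np.array([np.argmax(log_posteriors(x)) for x in X_arr])
accuracy = np.mean(preds == y_arr)
print(accuracy)
```

Dividing by the evidence does not change which class wins, but it turns the scores into probabilities that sum to 1, which is useful when a confidence value is needed.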

# Text Classification with Naive Bayes

 $$ \large P(\text{Spam | Word}) = \dfrac{P(\text{Word | Spam})P(\text{Spam})}{P(\text{Word})}$$  

- Where $P(\text{Word | Spam})$ is estimated from the training data as:

 $$ \large P(\text{Word | Spam}) = \dfrac{\text{Count of Word Across All Spam Documents}}{\text{Total Word Count Across All Spam Documents}}$$  

> "However, this formulation has a problem: **what if you encounter a word in the test set that was not present in the training set?** This new word would have a frequency of zero! To effectively counteract these issues, Laplacian smoothing is often used giving:"  

- ***Laplacian smoothing:***

 $$P(\text{Word | Spam}) = \dfrac{\text{Count of Word Across All Spam Documents} + 1}{\text{Total Word Count Across All Spam Documents} + \text{Number of Words in Corpus Vocabulary}}$$  
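The smoothed estimate can be sketched in a few lines of plain Python. The four-document corpus below is invented for illustration, and the helper names `p_word_given_class` and `predict` are hypothetical. Scores are summed in log space for the same underflow reason discussed earlier. For simplicity the vocabulary size is fixed from the training corpus, so an unseen word simply gets the minimum smoothed probability.

```python
from collections import Counter
import math

# Tiny hypothetical corpus, invented for illustration only
train = [
    ("win cash now", "spam"),
    ("cheap cash offer", "spam"),
    ("meeting at noon", "ham"),
    ("lunch at noon tomorrow", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

# Vocabulary: every distinct word seen in any training document
vocab = {w for counts in word_counts.values() for w in counts}

def p_word_given_class(word, label):
    """Laplace-smoothed estimate: (count + 1) / (class word total + vocabulary size)."""
    total = sum(word_counts[label].values())
    return (word_counts[label][word] + 1) / (total + len(vocab))

def predict(text):
    """Pick the class with the largest log prior plus summed log likelihoods."""
    scores = {}
    for label in word_counts:
        score = math.log(doc_counts[label] / len(train))
        for word in text.split():
            score += math.log(p_word_given_class(word, label))
        scores[label] = score
    return max(scores, key=scores.get)

print(p_word_given_class("cash", "spam"))     # seen twice in spam: (2 + 1) / (6 + 10)
print(p_word_given_class("bitcoin", "spam"))  # never seen: (0 + 1) / (6 + 10)
print(predict("cash offer tomorrow"))
```

Without the `+ 1`, the unseen word would contribute a probability of zero and wipe out the whole product (or produce `log(0)`), regardless of how strongly the other words point to a class.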


# Activity 1: Gaussian Naive Bayes Lab

- [Learn.co: Gaussian Naive Bayes Lab](https://learn.co/tracks/module-3-data-science-career-2-1/machine-learning/section-28-bayesian-classification/gaussian-naive-bayes-lab)

- Notebook Location:
    - `Repo Folder > labs_from_class > sect_28_bayesian_classification > gaussian_naive_ bayes_lab`

    - [LOCAL LINK: SG Version](http://localhost:8892/notebooks/labs_from_class/sect_28_bayesian_classification/gaussian_naive_%20bayes_lab/gauss_bayes_lab_instructor_SG.ipynb)

### INSTRUCTOR RESOURCES
- [LOCAL LINK: Instructor Version](http://localhost:8892/notebooks/labs_from_class/sect_28_bayesian_classification/gaussian_naive_%20bayes_lab/gauss_bayes_lab_instructor.ipynb)

- [Solution](https://github.com/learn-co-students/dsc-gaussian-naive-bayes-lab-online-ds-pt-100719/tree/solution)

# Activity 2:  Document Classification with Naive Bayes Lab

- [Learn.co: Document Classification with Naive Bayes Lab](https://learn.co/tracks/module-3-data-science-career-2-1/machine-learning/section-28-bayesian-classification/document-classification-with-naive-bayes-lab)

- Notebook Location:
    - `Repo Folder > labs_from_class > sect_28_bayesian_classification > document_classification_lab`

    - [LOCAL LINK: SG Version](http://localhost:8892/notebooks/labs_from_class/sect_28_bayesian_classification/document_classification_lab/document_classification_bayes_OOP_SG.ipynb)

### INSTRUCTOR RESOURCES
- [LOCAL LINK: Instructor OOP Version](http://localhost:8892/notebooks/labs_from_class/sect_28_bayesian_classification/document_classification_lab/document_classification_bayes_OOP_instructor.ipynb#)

- [Solution](https://github.com/learn-co-students/dsc-document-classification-with-naive-bayes-lab-online-ds-pt-100719/tree/solution)