# Naive Bayes

## Gaussian Naive Bayes

Naive Bayes extends Bayes' theorem to observations described by multiple features.

Recall that Bayes' theorem is:  

$$ \Large P(A|B) = \frac{P(B|A)\bullet P(A)}{P(B)}$$

Expanding to multiple features, the Naive Bayes formula is:  

$$ \Large P(y|x_1, x_2, ..., x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1, x_2, ..., x_n)}$$

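Here $y$ is an observation class while $x_1$ through $x_n$ are various features of the observation. The "naive" assumption is that these features are conditionally independent of one another given the class $y$, which is what allows the joint likelihood to be written as a product of the individual terms $P(x_i|y)$.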

In the numerator, you multiply the product of the conditional probabilities $P(x_i|y)$ by the probability of the class y. The denominator is the overall probability (across all classes) for the observed values of the various features. In practice, this can be difficult or impossible to calculate. Fortunately, doing so is typically not required, as you will simply be comparing the relative probabilities of the various classes.
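Because the denominator is identical for every class, the comparison can be written as a decision rule over the numerator alone. The symbol $\hat{y}$ (the predicted class) is notation introduced here for illustration:

$$ \Large \hat{y} = \underset{y}{\arg\max}\; P(y)\prod_{i=1}^{n}P(x_i|y)$$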

To calculate each of the conditional probabilities in the numerator, $P(x_i|y)$, the Gaussian Naive Bayes algorithm uses the Gaussian probability density function (PDF) to give a relative estimate of how likely the feature observation $x_i$ is for the class $y$. Strictly speaking, this value is a density rather than a probability: for a continuous variable, the probability of any single point on a PDF curve is 0, which is why some statisticians object to calling it a probability.

With that, you have:  

$$\Large P(x_i|y) = \frac{1}{\sqrt{2\pi \sigma_i^2}}e^{\frac{-(x_i-\mu_i)^2}{2\sigma_i^2}}$$

Where $\mu_i$ is the mean of feature $x_i$ for class $y$ and $\sigma_i^2$ is the variance of feature $x_i$ for class $y$.
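As a minimal sketch, the density formula can be evaluated directly with the standard library. The function name `gaussian_pdf` and the numeric values below are illustrative assumptions rather than part of the lesson's own code:

```python
import math

def gaussian_pdf(x, mu, sigma):
    """Normal density at x for a given mean (mu) and standard deviation (sigma)."""
    coefficient = 1.0 / math.sqrt(2.0 * math.pi * sigma ** 2)
    exponent = -((x - mu) ** 2) / (2.0 * sigma ** 2)
    return coefficient * math.exp(exponent)

# Illustrative values: a feature whose class mean is 1.462 and whose
# class standard deviation is 0.173664, observed at x = 1.4
density = gaussian_pdf(1.4, mu=1.462, sigma=0.173664)
print(density)
```

With a standard deviation this small, the density comes out larger than 1, which is a reminder that it is a relative measure and not a probability.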

## Example

In [4]:
from sklearn import datasets
from scipy import stats
import pandas as pd
import numpy as np

iris = datasets.load_iris()

X = pd.DataFrame(iris.data)
X.columns = iris.feature_names

y = pd.DataFrame(iris.target)
y.columns = ['Target']

df = pd.concat([X, y], axis=1)
df.head()

| | sepal length (cm) | sepal width (cm) | petal length (cm) | petal width (cm) | Target |
|---|---|---|---|---|---|
| 0 | 5.1 | 3.5 | 1.4 | 0.2 | 0 |
| 1 | 4.9 | 3.0 | 1.4 | 0.2 | 0 |
| 2 | 4.7 | 3.2 | 1.3 | 0.2 | 0 |
| 3 | 4.6 | 3.1 | 1.5 | 0.2 | 0 |
| 4 | 5.0 | 3.6 | 1.4 | 0.2 | 0 |


In [2]:
df['Target'].value_counts()

2    50
1    50
0    50
Name: Target, dtype: int64

Next, you calculate the mean and standard deviation within a class for each of the features. You'll then use these values to calculate the conditional probability of a particular feature observation for each of the classes.

In [3]:
aggs = df.groupby('Target').agg(['mean', 'std'])
aggs

| Target | sepal length (cm), mean | sepal length (cm), std | sepal width (cm), mean | sepal width (cm), std | petal length (cm), mean | petal length (cm), std | petal width (cm), mean | petal width (cm), std |
|---|---|---|---|---|---|---|---|---|
| 0 | 5.006 | 0.35249 | 3.428 | 0.379064 | 1.462 | 0.173664 | 0.246 | 0.105386 |
| 1 | 5.936 | 0.516171 | 2.77 | 0.313798 | 4.26 | 0.469911 | 1.326 | 0.197753 |
| 2 | 6.588 | 0.63588 | 2.974 | 0.322497 | 5.552 | 0.551895 | 2.026 | 0.27465 |


Calculate conditional probabilities using the Gaussian PDF.

In [5]:
def p_x_given_class(obs_row, feature, class_):
    mu = aggs[feature]['mean'][class_]
    std = aggs[feature]['std'][class_]
    
    # A single observation
    obs = df.iloc[obs_row][feature] 
    
    p_x_given_y = stats.norm.pdf(obs, loc=mu, scale=std)
    return p_x_given_y

# Notice how this is not a true probability; you can get values > 1
p_x_given_class(0, 'petal length (cm)', 0) 

2.1553774365786804

In [7]:
def predict_class(row):
    c_probs = []
    # Loop over the 3 classes (0, 1, 2)
    for c in range(3):
        # Initialize with the prior P(y): the relative frequency of class c
        p = len(df[df['Target'] == c])/len(df)
        for feature in X.columns:
            # Multiply by P(x_i | y)
            p *= p_x_given_class(row, feature, c)
        c_probs.append(p)
    # np.argmax returns the index of the largest value in c_probs,
    # which here is also the class label
    return np.argmax(c_probs)

In [8]:
# Example using first observation
row = 0
df.iloc[row]

sepal length (cm)    5.1
sepal width (cm)     3.5
petal length (cm)    1.4
petal width (cm)     0.2
Target               0.0
Name: 0, dtype: float64

In [9]:
# Make predictions
predict_class(row)

0

The prediction, 0, matches the true class of this observation.
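To see why the denominator can be skipped, note that under the model it equals the sum of the per-class numerators. The sketch below recomputes the numerators for the first observation in plain Python, using the rounded means and standard deviations from the summary table, and then normalizes them. The names `stats_by_class`, `numerators`, and `posteriors` are introduced here for illustration only:

```python
import math

# Per-class (mean, std) for each feature, rounded from the summary table
stats_by_class = {
    0: [(5.006, 0.35249), (3.428, 0.379064), (1.462, 0.173664), (0.246, 0.105386)],
    1: [(5.936, 0.516171), (2.77, 0.313798), (4.26, 0.469911), (1.326, 0.197753)],
    2: [(6.588, 0.63588), (2.974, 0.322497), (5.552, 0.551895), (2.026, 0.27465)],
}
prior = 50 / 150  # each class holds 50 of the 150 observations
observation = [5.1, 3.5, 1.4, 0.2]  # first row of the dataset

def gaussian_pdf(x, mu, sigma):
    """Normal density at x for mean mu and standard deviation sigma."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)

# Numerator of the Naive Bayes formula for each class: P(y) * prod P(x_i | y)
numerators = {}
for c, feature_stats in stats_by_class.items():
    score = prior
    for x, (mu, sigma) in zip(observation, feature_stats):
        score *= gaussian_pdf(x, mu, sigma)
    numerators[c] = score

# The denominator is the sum of the numerators over all classes
evidence = sum(numerators.values())
posteriors = {c: score / evidence for c, score in numerators.items()}

# Dividing every numerator by the same constant cannot change the winner
predicted = max(numerators, key=numerators.get)
print(predicted, posteriors)
```

Normalizing turns the scores into values that sum to 1, but the class with the largest numerator is also the class with the largest posterior, so the prediction is unchanged.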

Let's evaluate our classifier.

In [10]:
# Create a Predictions column by predicting the class of each row
df['Predictions'] = [predict_class(row) for row in df.index]
# Create a column that is True where Target matches Predictions
df['Correct?'] = df['Target'] == df['Predictions']
# Normalized value counts give the proportion of correct predictions
df['Correct?'].value_counts(normalize=True)

True     0.96
False    0.04
Name: Correct?, dtype: float64

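Our classifier was correct 96% of the time. Note that this accuracy is measured on the same observations used to estimate the class means and standard deviations, so it is a training accuracy rather than an estimate of performance on unseen data.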

## Document Classification

Recall Bayes' theorem:

 $$ \large  P(A|B) = \dfrac{P(B|A)P(A)}{P(B)}$$
 
Applied to a document, one common implementation of Bayes' theorem is to use a bag of words representation. A bag of words representation takes a text document and converts it into a word frequency representation.

In [17]:
doc = "A bag of words representation takes a text document and converts it into a word frequency representation"
bag = {}
for word in doc.split():
    # Get the current count for this word, or 0 if it has not been seen yet; add 1
    bag[word] = bag.get(word, 0) + 1 
bag

{'A': 1,
 'bag': 1,
 'of': 1,
 'words': 1,
 'representation': 2,
 'takes': 1,
 'a': 2,
 'text': 1,
 'document': 1,
 'and': 1,
 'converts': 1,
 'it': 1,
 'into': 1,
 'word': 1,
 'frequency': 1}
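Notice that 'A' and 'a' are counted as different words. A possible variation, sketched here as an assumption and not as part of the original approach, is to lowercase the text first and let `collections.Counter` from the standard library do the counting:

```python
from collections import Counter

doc = "A bag of words representation takes a text document and converts it into a word frequency representation"

# Lowercasing first is a design choice: it merges 'A' and 'a' into one entry
bag = Counter(doc.lower().split())
print(bag.most_common(2))
```

A `Counter` also returns 0 for words it has never seen, which is convenient when looking up words that are absent from a document.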

A common example of using Bayes' theorem to classify documents is a spam filtering algorithm. To do this, you examine the question "given this word (in the document) what is the probability that it is spam versus not spam?" For example, perhaps you get a lot of "special offer" spam. In that case, the words "special" and "offer" may increase the probability that a given message is spam.

You would have:

 $$ P(\text{Spam | Word}) = \dfrac{P(\text{Word | Spam})P(\text{Spam})}{P(\text{Word})}$$  
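As a small worked example of this formula, the sketch below plugs in made-up numbers. All three input probabilities are hypothetical and chosen only for illustration; $P(\text{Word})$ is obtained from the law of total probability over the two classes (spam and not spam):

```python
# Hypothetical probabilities, chosen only for illustration
p_spam = 0.3                  # P(Spam)
p_word_given_spam = 0.2       # P(Word | Spam)
p_word_given_not_spam = 0.01  # P(Word | Not Spam)

# Law of total probability: P(Word) summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_not_spam * (1 - p_spam)

# Bayes' theorem
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(round(p_spam_given_word, 3))
```

With these assumed inputs, seeing the word raises the probability of spam from the prior of 0.3 to roughly 0.9.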

Using the bag of words representation, you can then define $P(\text{Word | Spam})$ as

 $$P(\text{Word | Spam}) = \dfrac{\text{Word Frequency in Document}}{\text{Word Frequency Across All Spam Documents}}$$ 
 
However, this formulation has a problem: what if you encounter a word in the test set that was not present in the training set? This new word would have a frequency of **zero**, which causes two problems. First, there would be a division-by-zero error. Second, the numerator would also be zero: even if you modified only the denominator, a term with zero probability would make the probability for the entire document zero once the conditional probabilities are multiplied together in multinomial Naive Bayes. To counteract these issues, **Laplacian smoothing** (also called Laplace or add-one smoothing) is often used, giving:   

 $$P(\text{Word | Spam}) = \dfrac{\text{Word Frequency in Document} + 1}{\text{Word Frequency Across All Spam Documents} + \text{Number of Words in Corpus Vocabulary}}$$  
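The sketch below illustrates add-one smoothing on a toy corpus. It uses one common formulation of multinomial Naive Bayes, in which the numerator counts the word across the spam training documents and the denominator is the total number of words in those documents plus the vocabulary size. That choice of counts is an assumption of this sketch and may differ from the exact quantities intended above. The documents and the names `spam_docs`, `ham_docs`, and `p_word_given_spam` are hypothetical:

```python
from collections import Counter

# Hypothetical toy training data (not from the original text)
spam_docs = ["special offer today", "special prize offer"]
ham_docs = ["meeting notes today", "lunch offer"]

# Word counts across all spam documents
spam_counts = Counter(word for doc in spam_docs for word in doc.split())
total_spam_words = sum(spam_counts.values())

# Vocabulary: every distinct word in the whole training corpus
vocabulary = {word for doc in spam_docs + ham_docs for word in doc.split()}

def p_word_given_spam(word):
    """Add-one smoothed estimate: (count + 1) / (total spam words + vocabulary size)."""
    return (spam_counts[word] + 1) / (total_spam_words + len(vocabulary))

smoothed = {word: p_word_given_spam(word) for word in sorted(vocabulary)}
print(smoothed["special"], smoothed["meeting"])
```

A word that never appears in spam, such as "meeting" here, receives a small nonzero probability, so it no longer zeroes out the product for the whole document.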