# Naive Bayes

We have vectorised our lyrics corpus using Bag of Words. How can we run predictions on that? 
- Using Naive Bayes!

Naive Bayes classifiers are their own family of Machine Learning models - "Bayes" means they belong to a group of models that give you probability scores!

They are very useful and have lots of uses: for example, we can introduce thresholds to say "only show us results where you are really sure"!
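A minimal sketch of such a threshold, assuming the model has already produced one probability score per class. The function name `predict_if_confident`, the 0.9 cut-off and the scores themselves are made up for illustration.

```python
def predict_if_confident(class_probs, threshold=0.9):
    """Return the most probable class, or None if the model is not sure enough."""
    best_class = max(class_probs, key=class_probs.get)
    if class_probs[best_class] >= threshold:
        return best_class
    return None


# Hypothetical probability scores for two lyrics snippets
confident = {"Eminem": 0.97, "Madonna": 0.03}
unsure = {"Eminem": 0.55, "Madonna": 0.45}

print(predict_if_confident(confident))  # Eminem
print(predict_if_confident(unsure))     # None
```

Raising the threshold trades coverage for reliability: fewer snippets get a label, but the labels that are shown come with higher scores.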

- Based on Bayes' Theorem - it **describes the probability of an event, based on prior knowledge of conditions that might be related to the event.** For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer, compared to the assessment of the probability of cancer made without knowledge of the person's age.

### What is a prior?
- A prior is the assumed probability of an event before taking any data into account.

For instance, if we look at the word “yeah” in documents, we would expect it to occur with the average frequency over all documents, before looking at an individual document. The probability associated with the average frequency is the prior of “yeah”.

Naive Bayes - what we want to calculate:

\begin{align}
P(A \mid doc) = \frac{P(doc \mid A)\,P(A)}{P(doc)}
\end{align}

- P(A|doc) is the posterior probability.
- P(A) is the prior.
- P(doc) is the marginal probability. It has a nice property: we can ignore it! We're not interested in the absolute probability, we just want to know the relative probability of Eminem v Madonna.
                                           
\begin{align}
P(doc \mid A) = P(W_1 \mid A) \cdot P(W_2 \mid A) \cdots
\end{align}     
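The product over words can be sketched in a few lines. The per-word probabilities below are invented for illustration, and the names `word_probs`, `doc_likelihood` and `doc_log_likelihood` are hypothetical. The log version is included because multiplying many small probabilities can underflow; summing logarithms gives the same ranking.

```python
import math

# Hypothetical per-word probabilities P(W_i | A) for two artists (made-up numbers)
word_probs = {
    "Eminem": {"yeah": 0.05, "love": 0.01, "rap": 0.04},
    "Madonna": {"yeah": 0.02, "love": 0.06, "rap": 0.001},
}


def doc_likelihood(words, artist):
    """P(doc | A) as the product of the individual word probabilities."""
    p = 1.0
    for word in words:
        p *= word_probs[artist][word]
    return p


def doc_log_likelihood(words, artist):
    """log P(doc | A) as a sum of logs, which avoids underflow on long documents."""
    return sum(math.log(word_probs[artist][word]) for word in words)


doc = ["yeah", "rap", "yeah"]
for artist in word_probs:
    print(artist, doc_likelihood(doc, artist), doc_log_likelihood(doc, artist))
```

With these made-up numbers the document is far more likely under "Eminem" than under "Madonna", in both the plain and the log form.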
              
              
### Naive Bayes Probabilistic Model

In plain English, using Bayesian probability terminology, the above equation can be written as

\begin{align}
\text{posterior} = \frac{\text{prior} \times \text{likelihood}}{\text{evidence}}
\end{align}            

**The posterior probability is the probability after looking at the data.**

#### Bayes' theorem is stated mathematically as the following equation:

\begin{align}
\displaystyle P(A\mid B)={\frac {P(B\mid A)\,P(A)}{P(B)}}
\end{align} 

where A and B are events and P(B) is not equal to 0.

- P(A|B) is a conditional probability: the likelihood of event A occurring given that B is true.

- P(B|A) is also a conditional probability: the likelihood of event B occurring given that A is true.

- P(A) and P(B) are the probabilities of observing A and B independently of each other; this is known as the marginal probability.
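To make the terms concrete, here is a small numeric sketch of the theorem. All probabilities are hypothetical, and the helper name `bayes_posterior` is made up. P(B) is obtained from the law of total probability.

```python
def bayes_posterior(prior_a, likelihood_b_given_a, marginal_b):
    """P(A | B) = P(B | A) * P(A) / P(B)."""
    if marginal_b == 0:
        raise ValueError("P(B) must not be 0")
    return likelihood_b_given_a * prior_a / marginal_b


# Hypothetical numbers
p_a = 0.01              # P(A): prior probability of event A
p_b_given_a = 0.8       # P(B | A)
p_b_given_not_a = 0.1   # P(B | not A)

# Marginal probability of B via the law of total probability
p_b = p_b_given_a * p_a + p_b_given_not_a * (1 - p_a)

posterior = bayes_posterior(p_a, p_b_given_a, p_b)
print(round(posterior, 4))  # 0.0748
```

Observing B raises the probability of A from the prior of 1% to roughly 7.5%.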

This might be too abstract, so let us replace some of the variables to make it more concrete. In a Bayes classifier, we are interested in finding out the class (e.g. male or female, spam or ham) of an observation given the data:

\begin{align}
P(\text{class} \mid \text{data}) = \frac{P(\text{data} \mid \text{class}) \times P(\text{class})}{P(\text{data})}
\end{align}            


Here, **P(class) is the prior**. **P(data) is called the marginal probability**. In a classifier, we **can usually ignore the latter**, because we **only need to know the ratio between the classes.**

- class is a particular class (e.g. male)
- data is an observation’s data
- p(class | data) is called the posterior
- p(data | class) is called the likelihood
- p(class) is called the prior
- p(data) is called the marginal probability
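A short sketch of why the marginal probability can be dropped: dividing every class by the same P(data) rescales the scores but cannot change which class is largest. The priors and likelihoods below are hypothetical.

```python
# Hypothetical priors P(class) and likelihoods P(data | class)
priors = {"spam": 0.3, "ham": 0.7}
likelihoods = {"spam": 0.02, "ham": 0.001}

# Numerator of the posterior for each class: likelihood * prior
numerators = {c: likelihoods[c] * priors[c] for c in priors}

# The marginal P(data) is the sum of the numerators over all classes
marginal = sum(numerators.values())
posteriors = {c: numerators[c] / marginal for c in numerators}

best_by_numerator = max(numerators, key=numerators.get)
best_by_posterior = max(posteriors, key=posteriors.get)

print(best_by_numerator, best_by_posterior)  # spam spam
```

The posteriors sum to 1 and the numerators do not, but both pick the same class.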

### The Bayes Error
If we knew the underlying distribution of the data, we could build a perfect Bayesian model. Even then, there would be a residual error due to noise in the data. We call this the **Bayes Error**.
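As an illustration, suppose the joint distribution P(x, class) were fully known for a single discrete feature. The numbers and the names `joint` and `bayes_error` below are hypothetical. Even the best possible classifier, which picks the most probable class for every value of x, is wrong whenever the other class occurs, and that leftover probability is the Bayes Error.

```python
# Hypothetical, fully known joint distribution P(x, class)
joint = {
    "short": {"A": 0.30, "B": 0.10},
    "medium": {"A": 0.15, "B": 0.15},
    "long": {"A": 0.05, "B": 0.25},
}

# Sanity check: a joint distribution sums to 1
total = sum(sum(p.values()) for p in joint.values())

# For each x the optimal classifier keeps the largest probability mass
# and loses the rest; summing the lost mass gives the Bayes Error.
bayes_error = sum(sum(p.values()) - max(p.values()) for p in joint.values())

print(round(total, 10), round(bayes_error, 10))  # 1.0 0.3
```

No model trained on data from this distribution can be expected to do better than a 30% error rate.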

#### Advantages:

- fast

- accurate

- probability scores given

- works well even with small datasets

- often very intuitive

#### Disadvantages:

- Overfitting - you will often have to tune the hyperparameter alpha, and this can be quite difficult
    - You can try to optimise it using GridSearch or TPOT (an issue with this was mentioned for Naive Bayes)

- Computationally costly in large data

- A prior must be chosen
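To see what alpha does, here is a sketch that assumes alpha is the additive (Laplace/Lidstone) smoothing parameter, as it is in scikit-learn's `MultinomialNB`. The helper `smoothed_word_prob` and the counts are hypothetical.

```python
def smoothed_word_prob(count, total_count, vocab_size, alpha=1.0):
    """Smoothed estimate of P(word | class) with additive smoothing."""
    return (count + alpha) / (total_count + alpha * vocab_size)


# Hypothetical counts: a word never seen for this class,
# 1000 word tokens in the class, 5000 distinct words in the vocabulary
for alpha in (0.0, 0.1, 1.0, 10.0):
    print(alpha, smoothed_word_prob(0, 1000, 5000, alpha))
```

With alpha = 0 an unseen word gets probability 0, which zeroes out the whole product of word probabilities. Larger values of alpha pull all word probabilities towards a uniform distribution, so the value has to be balanced between the two extremes.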

## Naive Bayes Classifier From Scratch 
- Copyright © Chris Albon, 2020

#### Create Data

Our dataset contains data on eight individuals. We will use the dataset to construct a classifier that takes in the height, weight, and foot size of an individual and outputs a prediction for their gender.

In [1]:
import pandas as pd
import numpy as np

In [2]:
# Create an empty dataframe
data = pd.DataFrame()

# Create our target variable
data['Gender'] = ['male','male','male','male','female','female','female','female']

# Create our feature variables
data['Height'] = [6,5.92,5.58,5.92,5,5.5,5.42,5.75]
data['Weight'] = [180,190,170,165,100,150,130,150]
data['Foot_Size'] = [12,11,12,10,6,8,7,9]

# View the data
data


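| | Gender | Height | Weight | Foot_Size |
|---|---|---|---|---|
| 0 | male | 6.0 | 180 | 12 |
| 1 | male | 5.92 | 190 | 11 |
| 2 | male | 5.58 | 170 | 12 |
| 3 | male | 5.92 | 165 | 10 |
| 4 | female | 5.0 | 100 | 6 |
| 5 | female | 5.5 | 150 | 8 |
| 6 | female | 5.42 | 130 | 7 |
| 7 | female | 5.75 | 150 | 9 |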


The dataset above is used to construct our classifier. Below we will create a new person for whom we know their feature values but not their gender. Our goal is to predict their gender.

In [3]:
# Create an empty dataframe
person = pd.DataFrame()

# Create some feature values for this single row
person['Height'] = [6]
person['Weight'] = [130]
person['Foot_Size'] = [8]

# View the data 
person

| | Height | Weight | Foot_Size |
|---|---|---|---|
| 0 | 6 | 130 | 8 |


In a Bayes classifier, we calculate the posterior (technically we only calculate the numerator of the posterior, but ignore that for now) for every class for each observation. Then, we classify the observation based on the class with the largest posterior value. In our example, we have one observation to predict and two possible classes (male and female), so we will calculate two posteriors: one for male and one for female.

\begin{align}
p(\text{person is male} \mid \mathbf {\text{person’s data}} )={\frac {p(\mathbf {\text{person’s data}} \mid \text{person is male}) * p(\text{person is male})}{p(\mathbf {\text{person’s data}} )}}
\end{align}


\begin{align}
p(\text{person is female} \mid \mathbf {\text{person’s data}} )={\frac {p(\mathbf {\text{person’s data}} \mid \text{person is female}) * p(\text{person is female})}{p(\mathbf {\text{person’s data}} )}}
\end{align}

### Gaussian Naive Bayes Classifier
Gaussian Naive Bayes is probably the most popular type of Bayes classifier. To explain what the name means, let us look at what the Bayes equations look like when we apply our two classes (male and female) and three feature variables (height, weight, and foot size):

\begin{align}
{\displaystyle {\text{posterior (male)}}={\frac {P({\text{male}})\,p({\text{height}}\mid{\text{male}})\,p({\text{weight}}\mid{\text{male}})\,p({\text{foot size}}\mid{\text{male}})}{\text{marginal probability}}}}
\end{align}


\begin{align}
{\displaystyle {\text{posterior (female)}}={\frac {P({\text{female}})\,p({\text{height}}\mid{\text{female}})\,p({\text{weight}}\mid{\text{female}})\,p({\text{foot size}}\mid{\text{female}})}{\text{marginal probability}}}}
\end{align}

Now let us unpack the top equation a bit:

- P(male) is the prior probability. It is, as you can see, simply the probability that an observation is male. This is just the number of males in the dataset divided by the total number of people in the dataset.
- p(height∣male)p(weight∣male)p(foot size∣male) is the likelihood. Notice that we have unpacked `person’s data` so it is now every feature in the dataset. The “gaussian” and “naive” come from two assumptions present in this likelihood:
    1. If you look at each term in the likelihood you will notice that we assume each feature is independent of the others within a class. That is, foot size is independent of weight or height, etc. This is obviously not true, and is a “naive” assumption - hence the name “naive bayes.”
    2. Second, we assume that the values of the features (e.g. the height of women, the weight of women) are normally (gaussian) distributed. This means that, for example, p(height∣female) is calculated by inputting the required parameters into the probability density function of the normal distribution:
    
\begin{align}
p(\text{height}\mid\text{female})=\frac{1}{\sqrt{2\pi\text{variance of female height in the data}}}\,e^{ -\frac{(\text{observation’s height}-\text{average height of females in the data})^2}{2\text{variance of female height in the data}} }
\end{align}

- **marginal probability** - is probably one of the most confusing parts of bayesian approaches. In toy examples (including ours) it is completely possible to calculate the marginal probability. However, in many real-world cases, it is either extremely difficult or impossible to find the value of the marginal probability (explaining why is beyond the scope of this tutorial). This is not as much of a problem for our classifier as you might think. Why? Because we don’t care what the true posterior value is, we only care which class has the highest posterior value. And because the marginal probability is the same for all classes, we can 1) ignore the denominator, 2) calculate only the posterior’s numerator for each class, and 3) pick the largest numerator. That is, we can ignore the posterior’s denominator and make a prediction based solely on the relative values of the posterior’s numerator.


### Calculate Priors
Priors can be either constants or probability distributions. In our example, the prior is simply the proportion of each gender in the dataset. Calculating this is simple:

In [4]:
# Number of males
n_male = data['Gender'][data['Gender'] == 'male'].count()

# Number of females
n_female = data['Gender'][data['Gender'] == 'female'].count()

# Total rows
total_ppl = data['Gender'].count()

In [9]:
n_female

4

In [10]:
total_ppl

8

In [7]:
# Number of males divided by the total rows
P_male = n_male/total_ppl

# Number of females divided by the total rows
P_female = n_female/total_ppl

In [8]:
P_male

0.5
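The same priors can be cross-checked without pandas. This is only a sketch using the standard library; the list `genders` restates the `Gender` column, and the name `priors` is hypothetical.

```python
from collections import Counter

# Same labels as the Gender column of the dataset
genders = ['male'] * 4 + ['female'] * 4

# Count each class and divide by the total number of rows
counts = Counter(genders)
priors = {label: n / len(genders) for label, n in counts.items()}

print(priors)  # {'male': 0.5, 'female': 0.5}
```

Computing all priors in one dictionary scales to any number of classes without a separate variable per class.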

### Calculate Likelihood
Remember that each term (e.g. p(height∣female)) in our likelihood is assumed to be a normal pdf. For example:

\begin{align}
p(\text{height}\mid\text{female})=\frac{1}{\sqrt{2\pi\text{variance of female height in the data}}}\,e^{ -\frac{(\text{observation’s height}-\text{average height of females in the data})^2}{2\text{variance of female height in the data}} }
\end{align}

This means that for each class (e.g. female) and feature (e.g. height) combination we need to calculate the variance and mean value from the data. Pandas makes this easy:

In [11]:
# Group the data by gender and calculate the means of each feature
data_means = data.groupby('Gender').mean()

# View the values
data_means


| Gender | Height | Weight | Foot_Size |
|---|---|---|---|
| female | 5.4175 | 132.5 | 7.5 |
| male | 5.855 | 176.25 | 11.25 |


In [12]:
# Group the data by gender and calculate the variance of each feature
data_variance = data.groupby('Gender').var()

# View the values
data_variance

| Gender | Height | Weight | Foot_Size |
|---|---|---|---|
| female | 0.097225 | 558.333333 | 1.666667 |
| male | 0.035033 | 122.916667 | 0.916667 |
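One detail worth knowing: pandas' `.var()` returns the sample variance (dividing by n - 1) by default, not the population variance (dividing by n). The sketch below reproduces the female height variance with the standard library to show the difference. The list `female_heights` restates the values from the dataset.

```python
import statistics

# Female heights from the dataset
female_heights = [5, 5.5, 5.42, 5.75]

sample_var = statistics.variance(female_heights)       # divides by n - 1
population_var = statistics.pvariance(female_heights)  # divides by n

print(round(sample_var, 6))      # 0.097225, matches the table above
print(round(population_var, 6))  # 0.072919
```

With only four observations per class the two estimates differ noticeably, so the choice affects the likelihoods calculated later.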


Now we can create all the variables we need. The code below might look complex but all we are doing is creating a variable out of each cell in both of the tables above.

In [14]:
# Means for male (select by row label and column name)
male_height_mean = data_means.loc['male', 'Height']
male_weight_mean = data_means.loc['male', 'Weight']
male_footsize_mean = data_means.loc['male', 'Foot_Size']

# Variance for male
male_height_variance = data_variance.loc['male', 'Height']
male_weight_variance = data_variance.loc['male', 'Weight']
male_footsize_variance = data_variance.loc['male', 'Foot_Size']

# Means for female
female_height_mean = data_means.loc['female', 'Height']
female_weight_mean = data_means.loc['female', 'Weight']
female_footsize_mean = data_means.loc['female', 'Foot_Size']

# Variance for female
female_height_variance = data_variance.loc['female', 'Height']
female_weight_variance = data_variance.loc['female', 'Weight']
female_footsize_variance = data_variance.loc['female', 'Foot_Size']

#### Finally, we need to create a function to calculate the probability density of each of the terms of the likelihood (e.g. p(height|female)).

In [15]:
# Create a function that calculates p(x | y):
def p_x_given_y(x, mean_y, variance_y):

    # Input the arguments into a probability density function
    p = 1/(np.sqrt(2*np.pi*variance_y)) * np.exp((-(x-mean_y)**2)/(2*variance_y))
    
    # return p
    return p
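A quick sanity check of this function, sketched with the standard library only: `p_x_given_y` is restated here with `math` instead of NumPy so the snippet runs on its own, and it is compared with `statistics.NormalDist`, which takes the standard deviation rather than the variance. The inputs are the male height mean and variance from the tables above.

```python
import math
from statistics import NormalDist


def p_x_given_y(x, mean_y, variance_y):
    """Normal probability density of x for a given mean and variance."""
    return 1 / math.sqrt(2 * math.pi * variance_y) * math.exp(-(x - mean_y) ** 2 / (2 * variance_y))


# Male height statistics from the tables above, and the new person's height
x, mean, variance = 6, 5.855, 0.035033

ours = p_x_given_y(x, mean, variance)
reference = NormalDist(mu=mean, sigma=math.sqrt(variance)).pdf(x)

print(ours, reference)
```

Both values agree, and both are larger than 1. That is not a bug: the function returns a probability density, not a probability, so values above 1 are possible when the variance is small.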

### Apply Bayes Classifier To New Data Point
Alright, our Bayes classifier is ready. Remember that since we can ignore the marginal probability (the denominator), what we are actually calculating is this:

\begin{align}
{\displaystyle {\text{numerator of the posterior}}={P({\text{female}})\,p({\text{height}}\mid{\text{female}})\,p({\text{weight}}\mid{\text{female}})\,p({\text{foot size}}\mid{\text{female}})}}
\end{align}


To do this, we just need to plug in the values of the unclassified person (height = 6), the variables of the dataset (e.g. mean of female height), and the function (p_x_given_y) we made above:

In [16]:
# Numerator of the posterior if the unclassified observation is a male
P_male * \
p_x_given_y(person['Height'][0], male_height_mean, male_height_variance) * \
p_x_given_y(person['Weight'][0], male_weight_mean, male_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], male_footsize_mean, male_footsize_variance)

6.197071843878078e-09

In [17]:
# Numerator of the posterior if the unclassified observation is a female
P_female * \
p_x_given_y(person['Height'][0], female_height_mean, female_height_variance) * \
p_x_given_y(person['Weight'][0], female_weight_mean, female_weight_variance) * \
p_x_given_y(person['Foot_Size'][0], female_footsize_mean, female_footsize_variance)

0.0005377909183630018

#### Because the numerator of the posterior for female is greater than that for male, we predict that the person is female.
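The whole walkthrough can be condensed into one sketch. The version below is a restatement that uses only the standard library (`math`, `statistics`) instead of pandas and NumPy. The names `train`, `gaussian_pdf`, `numerators` and `posteriors` are illustrative and do not appear in the notebook above. It also normalises the two numerators to show the actual posterior probabilities, which is feasible here because there are only two classes.

```python
import math
import statistics

# Training data from the dataset above, grouped by class
train = {
    "male": {"Height": [6, 5.92, 5.58, 5.92], "Weight": [180, 190, 170, 165], "Foot_Size": [12, 11, 12, 10]},
    "female": {"Height": [5, 5.5, 5.42, 5.75], "Weight": [100, 150, 130, 150], "Foot_Size": [6, 8, 7, 9]},
}
person = {"Height": 6, "Weight": 130, "Foot_Size": 8}


def gaussian_pdf(x, mean, variance):
    """Normal probability density of x for a given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2 * variance)) / math.sqrt(2 * math.pi * variance)


n_total = sum(len(features["Height"]) for features in train.values())

numerators = {}
for label, features in train.items():
    prior = len(features["Height"]) / n_total
    likelihood = 1.0
    for name, values in features.items():
        # Sample variance (n - 1), matching the pandas default used above
        likelihood *= gaussian_pdf(person[name], statistics.mean(values), statistics.variance(values))
    numerators[label] = prior * likelihood

# Normalise by the marginal (the sum of the numerators) to get real posteriors
marginal = sum(numerators.values())
posteriors = {label: value / marginal for label, value in numerators.items()}
prediction = max(numerators, key=numerators.get)

print(numerators)
print(posteriors)
print(prediction)  # female
```

The numerators agree with the two values calculated above, and after normalisation the posterior for female is greater than 0.9999.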