In [13]:
import pandas as pd
import numpy as np

# Laboratory practice 2.4: Naive Bayes

The Naive Bayes algorithm is a supervised classification algorithm based on Bayes’ theorem.

### Bayes’ Theorem

In probability theory and statistics, Bayes’ theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event:

$
P(y|x_1,...,x_n) = \frac{P(x_1,...,x_n|y)P(y)}{P(x_1,...,x_n)}
$

where:

* $x_1,...,x_n$ are the $n$ features, assumed to be independent of each other given the class, and $y$ is the dependent variable (the class).
* $P(y|x_1,...,x_n)$ is the posterior probability of class $y$ given the features.
* $P(x_1,...,x_n|y)$ is the likelihood of features $x_1$ to $x_n$ given that their class is $y$.
* $P(y)$ is the prior probability of class $y$.
* $P(x_1,...,x_n)$ is the marginal probability of the features.

### Assumptions
* All predictors are independent of each other given the class.
* All predictors have an equal effect on the outcome.
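
As a short derivation sketch (not spelled out in the original text), the independence assumption lets the likelihood in Bayes’ theorem be written as a product of per-feature terms. Because the marginal probability does not depend on $y$, the predicted class $\hat{y}$ (a symbol introduced here for illustration) is the one that maximizes the numerator:

$
P(x_1,...,x_n|y) = \prod_{i=1}^{n} P(x_i|y)
$

$
P(y|x_1,...,x_n) \propto P(y)\prod_{i=1}^{n} P(x_i|y)
$

$
\hat{y} = \arg\max_{y} P(y)\prod_{i=1}^{n} P(x_i|y)
$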

### Types of Naive Bayes Classifier
#### Multinomial Naive Bayes:
The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. For example, each variable can be the number of times a given word appears in a text.

More info: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.MultinomialNB.html
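
A minimal sketch of how `MultinomialNB` might be used on count data. The word-count matrix, the labels, and all variable names below are made up for illustration and are not part of the lab data.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count matrix: one row per document, one column per word.
X_counts = np.array([[3, 0, 1],
                     [2, 0, 0],
                     [0, 3, 1],
                     [0, 2, 2]])
y_labels = np.array([0, 0, 1, 1])  # hypothetical class labels

# The default alpha=1.0 applies additive smoothing to the counts.
multinomial_model = MultinomialNB()
multinomial_model.fit(X_counts, y_labels)

# A new document that only contains the first word.
multinomial_pred = multinomial_model.predict(np.array([[4, 0, 0]]))
print(multinomial_pred)
```

The new document only uses a word that is frequent in class 0, so the model should assign it to that class.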

#### Bernoulli Naive Bayes:
This is similar to multinomial Naive Bayes, but the predictors are boolean variables. The features used to predict the class take only the values yes or no, for example whether a word occurs in the text or not.

More info: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.BernoulliNB.html
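
A minimal sketch of `BernoulliNB` on boolean features. The presence/absence matrix and the labels are hypothetical and only illustrate the input format.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Hypothetical presence/absence matrix: 1 if the word occurs in the document.
X_binary = np.array([[1, 0, 1],
                     [1, 0, 0],
                     [0, 1, 1],
                     [0, 1, 0]])
y_labels = np.array([0, 0, 1, 1])  # hypothetical class labels

bernoulli_model = BernoulliNB()
bernoulli_model.fit(X_binary, y_labels)

# A new document containing only the first word.
bernoulli_pred = bernoulli_model.predict(np.array([[1, 0, 0]]))
print(bernoulli_pred)
```

Unlike the multinomial variant, the absence of a word (a 0) also contributes to the class score here.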

#### Gaussian Naive Bayes:
When the predictors take continuous rather than discrete values, we assume that these values are sampled from a Gaussian distribution within each class.

More info: https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html
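
A minimal sketch of `GaussianNB` on continuous features. The measurements and labels below are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Hypothetical continuous measurements: two features per sample.
X_continuous = np.array([[1.0, 2.1],
                         [1.2, 1.9],
                         [3.8, 4.0],
                         [4.1, 4.2]])
y_labels = np.array([0, 0, 1, 1])  # hypothetical class labels

# GaussianNB estimates a mean and a variance per feature and per class.
gaussian_toy_model = GaussianNB()
gaussian_toy_model.fit(X_continuous, y_labels)

# A new sample close to the class-0 samples.
gaussian_pred = gaussian_toy_model.predict(np.array([[1.1, 2.0]]))
print(gaussian_pred)
```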


#### Exercise 1: 

Program the Naive Bayes algorithm:

Here is an example of how it could be programmed given the following data:

We will be using the weather dataset. This dataset includes the features *weather* and *temp*, and the corresponding target variable *play*. The goal is to predict whether players will play or not based on the given weather conditions.

In [51]:
weather = ['Sunny','Sunny','Overcast','Rainy','Rainy','Rainy','Overcast','Sunny',
         'Sunny','Rainy','Sunny','Overcast','Overcast','Rainy']
temp = ['Hot','Hot','Hot','Mild','Cool','Cool','Cool','Mild','Cool','Mild','Mild',
        'Mild','Hot','Mild']

play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']

# Build the DataFrame in one step from the three lists
df = pd.DataFrame({"weather": weather, "temp": temp, "play": play})

In [52]:
df.head(5)

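| | weather | temp | play |
| ----------- | ----------- | ----------- | ----------- |
| 0 | Sunny | Hot | No |
| 1 | Sunny | Hot | No |
| 2 | Overcast | Hot | Yes |
| 3 | Rainy | Mild | Yes |
| 4 | Rainy | Cool | Yes |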


**Step 1**: Calculate Prior Probability of Classes $P(y)$

#Frequency table

| play      | count / total = $P(y)$ |
| ----------- | ----------- |
| Yes      | 9/14 = 0.64       |
| No   | 5/14 = 0.36        |
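
As a minimal sketch, the prior probabilities in the table above can be reproduced with pandas. The `play` list is repeated so the snippet runs on its own; the name `priors` is chosen here for illustration.

```python
import pandas as pd

play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
        'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

# Relative frequency of each class: count of the class / total rows.
priors = pd.Series(play, name="play").value_counts(normalize=True)
print(priors)
```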

**Step 2**: Calculate the likelihood table for all features:

#weather

| play| Sunny | Overcast | Rainy |
| ----------- | ----------- |-----------|-----------|
| Yes      | 2/9        |4/9|3/9| 
| No   | 3/5       |0/5|2/5|

#temp

| play| Hot | Mild | Cool |
| ----------- | ----------- |-----------|-----------|
| Yes      | 2/9        |4/9|3/9| 
| No   | 2/5       |2/5|1/5|
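
One possible way to build the two likelihood tables above is `pd.crosstab` with `normalize="index"`, which divides each row (each class) by its own total. The data lists are repeated so the snippet is self-contained; the names `likelihood_weather` and `likelihood_temp` are illustrative.

```python
import pandas as pd

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool',
        'Mild', 'Mild', 'Mild', 'Hot', 'Mild']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No',
        'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
df = pd.DataFrame({"weather": weather, "temp": temp, "play": play})

# Rows are classes, columns are feature values; each row sums to 1,
# so each cell is P(feature value | class).
likelihood_weather = pd.crosstab(df["play"], df["weather"], normalize="index")
likelihood_temp = pd.crosstab(df["play"], df["temp"], normalize="index")

print(likelihood_weather)
print(likelihood_temp)
```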

**Step 3**: Calculate marginal probabilities

#weather

| Sunny| Overcast | Rainy | 
| ----------- | ----------- |-----------|
| 5/14      | 4/14        |5/14|

#temp

| Hot| Mild | Cool | 
| ----------- | ----------- |-----------|
| 4/14      | 6/14        |4/14|
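
The marginal probabilities can be checked in the same way, here sketched with `value_counts(normalize=True)` on each feature. The lists are repeated so the snippet runs on its own; the variable names are illustrative.

```python
import pandas as pd

weather = ['Sunny', 'Sunny', 'Overcast', 'Rainy', 'Rainy', 'Rainy', 'Overcast',
           'Sunny', 'Sunny', 'Rainy', 'Sunny', 'Overcast', 'Overcast', 'Rainy']
temp = ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool',
        'Mild', 'Mild', 'Mild', 'Hot', 'Mild']

# Relative frequency of each feature value over all 14 rows.
marginal_weather = pd.Series(weather).value_counts(normalize=True)
marginal_temp = pd.Series(temp).value_counts(normalize=True)

print(marginal_weather)
print(marginal_temp)
```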


**Step 4**: Calculate the posterior probability:

$
P(y=Yes|x) = P(Yes|Rainy,Mild) = \frac{P(Rainy,Mild|Yes) \cdot P(Yes)}{P(Rainy,Mild)}
$

$
P(Yes|Rainy,Mild) = \frac{P(Rainy|Yes) \cdot P(Mild|Yes) \cdot P(Yes)}{P(Rainy) \cdot P(Mild)}
$

$
P(Yes|Rainy,Mild) = \frac{(3/9) \cdot (4/9) \cdot (9/14)}{(5/14) \cdot (6/14)} \approx 0.62
$
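
The arithmetic of Step 4 can be checked with exact fractions, and the same calculation repeated for the class *No*. This is only a sketch using the values from the tables above. Note that writing the denominator as $P(Rainy) \cdot P(Mild)$ is itself an approximation, so the two posteriors do not add up to exactly 1; dividing each numerator by the sum of the numerators is a common alternative, added here as an assumption rather than as part of the original steps.

```python
from fractions import Fraction

# Values taken from the prior and likelihood tables above.
prior = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}
p_rainy = {"Yes": Fraction(3, 9), "No": Fraction(2, 5)}
p_mild = {"Yes": Fraction(4, 9), "No": Fraction(2, 5)}

# Denominator used in Step 4: P(Rainy) * P(Mild).
evidence = Fraction(5, 14) * Fraction(6, 14)

# Numerator for each class: P(Rainy|y) * P(Mild|y) * P(y).
joint = {c: p_rainy[c] * p_mild[c] * prior[c] for c in prior}

# Posterior as computed in Step 4.
posterior = {c: joint[c] / evidence for c in joint}

# Alternative (assumption): normalize so the posteriors sum to 1.
total = sum(joint.values())
normalized = {c: joint[c] / total for c in joint}

print(float(posterior["Yes"]))   # 28/45, about 0.62
print(float(posterior["No"]))    # 28/75, about 0.37
print(float(normalized["Yes"]))  # 5/8 = 0.625
```

Either way the class *Yes* has the larger value, so the prediction for (Rainy, Mild) is *Yes*.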


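Beware of the zero-frequency problem: the model assigns zero probability to a category that appears in the test dataset but was not available in the training dataset. The same happens when a category never occurs together with a class in the training data (for example, *Overcast* with *No* in the likelihood table, 0/5), because a single zero factor makes the whole product for that class zero.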
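
A common remedy, not covered in the steps above, is additive (Laplace) smoothing: add a small constant to every count before dividing. The sketch below applies it to the *weather* counts for the class *No*; the helper `smoothed_likelihood` and its parameter names are hypothetical, introduced only for illustration.

```python
from fractions import Fraction


def smoothed_likelihood(count, class_total, n_categories, alpha=1):
    """Additive smoothing: (count + alpha) / (class_total + alpha * n_categories)."""
    return Fraction(count + alpha, class_total + alpha * n_categories)


# Counts of each weather value among the 5 rows with play == 'No'.
weather_counts_no = {"Sunny": 3, "Overcast": 0, "Rainy": 2}

smoothed_no = {
    value: smoothed_likelihood(count, class_total=5, n_categories=3)
    for value, count in weather_counts_no.items()
}

for value, prob in smoothed_no.items():
    print(value, prob)
```

With `alpha=1`, $P(Overcast|No)$ becomes 1/8 instead of 0, and the three smoothed probabilities still sum to 1.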

#### Exercise 2:

Given the following dataset, create a model using the Naive Bayes variant you consider most appropriate.

Remember to follow the corresponding steps and justify why it is a good or bad model.

In [44]:
from sklearn.datasets import load_breast_cancer

# Load the dataset and put the features and the target in a single DataFrame
cancer = load_breast_cancer()
df = pd.DataFrame(np.c_[cancer['data'], cancer['target']],
                  columns=np.append(cancer['feature_names'], ['target']))
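
One possible starting point, offered as a sketch rather than as the expected solution: the features of this dataset are continuous measurements, so `GaussianNB` is a natural first candidate. The split proportion, the `random_state`, and the use of accuracy as the metric are assumptions made here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

# Load features and target directly as arrays.
X, y = load_breast_cancer(return_X_y=True)

# Hold out part of the data to evaluate on samples the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

gaussian_model = GaussianNB()
gaussian_model.fit(X_train, y_train)

# Accuracy on the held-out set; other metrics may suit this problem better.
test_accuracy = accuracy_score(y_test, gaussian_model.predict(X_test))
print(f"Test accuracy: {test_accuracy:.3f}")
```

Justifying whether the model is good or bad would still require looking beyond a single accuracy figure, for example at the confusion matrix and at how the classes are balanced.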
