### Naive Bayes

Import the required packages:

In [32]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline

from sklearn import naive_bayes
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

Sources: 

https://towardsdatascience.com/what-is-bayes-rule-bb6598d8a2fd

https://www.geeksforgeeks.org/naive-bayes-classifiers/

### Independence
Two events are independent if the occurrence of one has no effect on the probability of the other. If two events are independent, then $P(A \text{ and } B) = P(A) P(B)$. For example, since one toss of a die is independent of the next, the probability of getting a 1 on my first toss and a 5 on my second toss is $(1/6)*(1/6)=1/36$. In contrast, if I draw two cards from a deck without replacement, the two draws are dependent, so the probability of getting a queen on both my first and second draw is not $(4/52)*(4/52)$ but $(4/52)*(3/51)$.
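
As a quick check of the two examples above, here is a minimal sketch using exact fractions. The variable names are illustrative and not part of the original notebook.

```python
from fractions import Fraction

# Two independent die tosses: multiply the marginal probabilities.
p_one = Fraction(1, 6)
p_five = Fraction(1, 6)
p_one_then_five = p_one * p_five

# Two queens without replacement: the second draw depends on the first,
# so the second factor is a conditional probability.
p_first_queen = Fraction(4, 52)
p_second_queen_given_first = Fraction(3, 51)
p_two_queens = p_first_queen * p_second_queen_given_first

# What we would (wrongly) get by treating the two draws as independent.
p_two_queens_if_independent = Fraction(4, 52) ** 2

print(p_one_then_five)              # 1/36
print(p_two_queens)                 # 1/221
print(p_two_queens_if_independent)  # 1/169
```

The gap between 1/221 and 1/169 is the error introduced by wrongly assuming independence.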

### Bayes Rule
Bayes rule provides us with a way to update our beliefs based on the arrival of new, relevant pieces of evidence. For example, if we were trying to provide the probability that a given person has cancer, we would initially just say it is whatever percent of the population has cancer. However, given additional evidence such as the fact that the person is a smoker, we can update our probability, since the probability of having cancer is higher given that the person is a smoker. This allows us to utilize prior knowledge to improve our probability estimations.

**Bayes Rule** states that:

$P(A|B)=\frac{P(B|A)P(A)}{P(B)}$

The rule has a very simple derivation that directly leads from the relationship between joint and conditional probabilities. First, note that $P(A \text{ and } B) = P(A|B)P(B)$ and $P( A\text{ and } B) = P(B|A)P(A)$. Setting these equal to each other and rearranging gives us Bayes rule.
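
Written out step by step, the derivation looks roughly like this (assuming $P(B) > 0$ so that the division is defined):

$P(A|B)P(B) = P(A \text{ and } B) = P(B|A)P(A)$

$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$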

In this formula, A is the event we want the probability of, and B is the new evidence that is related to A in some way.


P(A|B) is called the **posterior**; this is what we are trying to estimate. In the above example, this would be the “probability of having cancer given that the person is a smoker”.


P(B|A) is called the **likelihood**; this is the probability of observing the new evidence, given our initial hypothesis. In the above example, this would be the “probability of being a smoker given that the person has cancer”.


P(A) is called the **prior**; this is the probability of our hypothesis before taking any additional evidence into account. In the above example, this would be the “probability of having cancer”.


P(B) is called the **marginal likelihood**; this is the total probability of observing the evidence. In the above example, this would be the “probability of being a smoker”. In many applications of Bayes Rule, this is ignored, as it mainly serves as normalization.

### Bayes Example
Using the cancer diagnosis example, we can show that Bayes rule allows us to obtain a much better estimate. We will put some made-up numbers into the example so we can assess the difference that Bayes rule makes. Assume that the probability of having cancer is 0.05, meaning that 5% of people have cancer. Assume also that the probability of being a smoker is 0.10, meaning that 10% of people are smokers, and that 20% of people with cancer are smokers, so P(smoker|cancer) = 0.20. Initially, our probability for cancer is simply our prior, 0.05. However, using the new evidence, we can instead calculate P(cancer|smoker), which is equal to (P(smoker|cancer) * P(cancer)) / P(smoker) = (0.20 * 0.05) / (0.10) = 0.10.


By introducing new evidence, we therefore obtained a better probability estimation. Initially we had a probability of 0.05, but using the smoker evidence, we were able to get to a more accurate probability that was double our prior.
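
The same arithmetic can be sketched in a few lines of Python. The helper name `bayes_posterior` is hypothetical, and the numbers are the made-up ones from the example above.

```python
def bayes_posterior(likelihood, prior, evidence):
    """Return P(A|B) = P(B|A) * P(A) / P(B)."""
    return likelihood * prior / evidence


p_cancer = 0.05               # prior, P(cancer)
p_smoker = 0.10               # marginal likelihood, P(smoker)
p_smoker_given_cancer = 0.20  # likelihood, P(smoker|cancer)

p_cancer_given_smoker = bayes_posterior(p_smoker_given_cancer, p_cancer, p_smoker)

print("prior:    ", p_cancer)
print("posterior:", round(p_cancer_given_smoker, 4))
```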

### Naive Bayes

Naive Bayes classifiers are a collection of classification algorithms based on Bayes’ Theorem. Naive Bayes is not a single algorithm but a family of algorithms that share a common principle: every pair of features is assumed to be independent of each other, given the class. This independence assumption is why the method is called "naive".

Consider a fictional dataset that describes the weather conditions for playing a game of golf. Given the weather conditions, each row classifies the conditions as fit (“Yes”) or unfit (“No”) for playing golf. The dataset is below:

In [4]:
df = pd.read_csv("data/golf.csv")
df

Unnamed: 0,Outlook,Temp,Humidity,Windy,Play Golf
0,Rainy,Hot,High,False,No
1,Rainy,Hot,High,True,No
2,Overcast,Hot,High,False,Yes
3,Sunny,Mild,High,False,Yes
4,Sunny,Cool,Normal,False,Yes
5,Sunny,Cool,Normal,True,No
6,Overcast,Cool,Normal,True,Yes
7,Rainy,Mild,High,False,No
8,Rainy,Cool,Normal,False,Yes
9,Sunny,Mild,Normal,False,Yes


The dataset is divided into two parts: the **feature matrix**, X, and the **response vector**, y.

The **feature matrix** contains all the vectors (rows) of the dataset, where each vector holds the values of the features. In the above dataset, the features are ‘Outlook’, ‘Temp’, ‘Humidity’ and ‘Windy’.

The **response vector** contains the value of the class variable (prediction or output) for each row of the feature matrix. In the above dataset, the class variable is ‘Play Golf’.

The fundamental Naive Bayes assumption is that each feature makes an:

- independent

- equal

contribution to the outcome.

With relation to our dataset, this concept can be understood as:

- We assume that no pair of features is dependent. For example, the temperature being ‘Hot’ has nothing to do with the humidity, and the outlook being ‘Rainy’ has no effect on the winds. Hence, the features are assumed to be independent.

- Secondly, each feature is given the same weight (or importance). For example, knowing only the temperature and humidity can’t predict the outcome accurately. None of the attributes is irrelevant, and each is assumed to contribute equally to the outcome.

- Note: The assumptions made by Naive Bayes are not generally correct in real-world situations. In fact, the independence assumption rarely holds exactly, but the method often works well in practice.

Now, with regard to our dataset, we can apply Bayes’ theorem in the following way:

 $P(y|X) = \frac{P(X|y) P(y)}{P(X)} $

where y is the class variable and X is a feature vector (of size n):

 $X = (x_1,x_2,x_3,.....,x_n) $
 
Just to be clear, an example of a feature vector and its corresponding class variable is (refer to the 1st row of the dataset):

```
X = (Rainy, Hot, High, False)
y = No
```

So basically, P(y|X) here means the probability of “Not playing golf” given that the weather conditions are “Rainy outlook”, “Temperature is hot”, “high humidity” and “no wind”.

Remember that we are assuming the features are independent given the class, so we can multiply the conditional probabilities: $P(X|y) = P(x_1|y)P(x_2|y)...P(x_n|y)$

Hence, we reach the result:

$P(y|x_1,...,x_n) = \frac{ P(x_1|y)P(x_2|y)...P(x_n|y)P(y)}{P(x_1,...,x_n)} $

which can be expressed as:

$ P(y|x_1,...,x_n) = \frac{P(y)\prod_{i=1}^{n}P(x_i|y)}{P(x_1,...,x_n)} $

Now, as the denominator remains constant for a given input, we can remove that term:

 $P(y|x_1,...,x_n)\propto P(y)\prod_{i=1}^{n}P(x_i|y) $

Now, we need to create a classifier model. For this, we compute the probability of the given set of inputs for every possible value of the class variable y and pick the output with the maximum probability.
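
In symbols, this decision rule can be sketched as follows, where $\hat{y}$ (a symbol not used elsewhere in this text) denotes the predicted class:

$\hat{y} = \arg\max_{y} P(y)\prod_{i=1}^{n}P(x_i|y)$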

Let us try to apply the above formula manually on our weather dataset. For this, we need to do some precomputations on our dataset.

We need to find $P(x_i | y_j)$ for each $x_i$ in X and $y_j$ in y. All these calculations have been demonstrated in the tables below:

<img src="images/golf1.png" width="500">

<img src="images/golf2.png" width="200">

For example, the probability that the temperature is cool given that golf is played is P(temp. = cool | play golf = Yes) = 3/9.

We also need the class probabilities P(y), which are calculated in table 5. For example, P(play golf = Yes) = 9/14.

So now, we are done with our pre-computations and the classifier is ready!
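
The counting behind such tables can be sketched with the standard library alone. This sketch uses only the ten rows printed in the dataframe output above. The fractions quoted in the text have denominators 9 and 14, so they evidently come from a larger table, and the numbers printed here will differ from them. The helper names `prior` and `likelihood` are hypothetical.

```python
from collections import Counter, defaultdict
from fractions import Fraction

# The ten rows shown in the dataframe output:
# (Outlook, Temp, Humidity, Windy, Play Golf)
rows = [
    ("Rainy", "Hot", "High", False, "No"),
    ("Rainy", "Hot", "High", True, "No"),
    ("Overcast", "Hot", "High", False, "Yes"),
    ("Sunny", "Mild", "High", False, "Yes"),
    ("Sunny", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Cool", "Normal", True, "No"),
    ("Overcast", "Cool", "Normal", True, "Yes"),
    ("Rainy", "Mild", "High", False, "No"),
    ("Rainy", "Cool", "Normal", False, "Yes"),
    ("Sunny", "Mild", "Normal", False, "Yes"),
]
feature_names = ["Outlook", "Temp", "Humidity", "Windy"]

# Number of rows per class label.
class_counts = Counter(row[-1] for row in rows)

# value_counts[(feature, label)][value] = rows of that class with that value.
value_counts = defaultdict(Counter)
for *features, label in rows:
    for name, value in zip(feature_names, features):
        value_counts[(name, label)][value] += 1


def prior(label):
    """Estimate P(y = label) as a relative frequency."""
    return Fraction(class_counts[label], len(rows))


def likelihood(name, value, label):
    """Estimate P(feature = value | y = label) as a relative frequency."""
    return Fraction(value_counts[(name, label)][value], class_counts[label])


print("P(Yes) =", prior("Yes"))
print("P(Temp=Cool | Yes) =", likelihood("Temp", "Cool", "Yes"))
```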

Let us test it on a new set of features (let us call it today):

```today = (Sunny, Hot, Normal, False)```

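So, the probability of playing golf is given by:

 $P(\text{Yes} | \text{today}) = \frac{P(\text{Sunny Outlook}|\text{Yes})P(\text{Hot Temperature}|\text{Yes})P(\text{Normal Humidity}|\text{Yes})P(\text{No Wind}|\text{Yes})P(\text{Yes})}{P(\text{today})} $

and the probability of not playing golf is given by:

 $P(\text{No} | \text{today}) = \frac{P(\text{Sunny Outlook}|\text{No})P(\text{Hot Temperature}|\text{No})P(\text{Normal Humidity}|\text{No})P(\text{No Wind}|\text{No})P(\text{No})}{P(\text{today})} $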
 
 
Since P(today) is common to both probabilities, we can ignore it and find proportional probabilities:

 $P(\text{Yes} | \text{today}) \propto \frac{2}{9} \cdot \frac{2}{9} \cdot \frac{6}{9} \cdot \frac{6}{9} \cdot \frac{9}{14} \approx 0.0141 $

and

$ P(\text{No} | \text{today}) \propto \frac{3}{5} \cdot \frac{2}{5} \cdot \frac{1}{5} \cdot \frac{2}{5} \cdot \frac{5}{14} \approx 0.0069 $

Since

 $P(\text{Yes} | \text{today}) + P(\text{No} | \text{today}) = 1 $

these numbers can be converted into probabilities by normalizing them so that they sum to 1:

$ P(\text{Yes} | \text{today}) = \frac{0.0141}{0.0141 + 0.0069} \approx 0.67 $

and

$ P(\text{No} | \text{today}) = \frac{0.0069}{0.0141 + 0.0069} \approx 0.33 $

Since

 $P(\text{Yes} | \text{today}) > P(\text{No} | \text{today}) $

the prediction for whether golf would be played is ‘Yes’.
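
The hand calculation can be reproduced with exact fractions, as in the sketch below. The conditional probabilities and priors are the ones quoted in the text for `today`. The dictionary layout and variable names are illustrative choices, not part of the original notebook.

```python
from fractions import Fraction
from math import prod

# P(x_i | y) for today = (Sunny, Hot, Normal, False), as quoted in the text.
likelihoods = {
    "Yes": [Fraction(2, 9), Fraction(2, 9), Fraction(6, 9), Fraction(6, 9)],
    "No": [Fraction(3, 5), Fraction(2, 5), Fraction(1, 5), Fraction(2, 5)],
}
# Class priors P(y), as quoted in the text.
priors = {"Yes": Fraction(9, 14), "No": Fraction(5, 14)}

# Unnormalized score for each class: P(y) * prod_i P(x_i | y).
scores = {label: priors[label] * prod(likelihoods[label]) for label in priors}

# Normalize so that the posteriors sum to 1.
total = sum(scores.values())
posteriors = {label: score / total for label, score in scores.items()}

# Predict the class with the largest posterior.
prediction = max(posteriors, key=posteriors.get)

for label in scores:
    print(label, float(scores[label]), float(posteriors[label]))
print("prediction:", prediction)
```

Working with `Fraction` avoids rounding until the final print, so the normalized values are not affected by how the intermediate scores are rounded.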

Some other popular Naive Bayes classifiers are **Multinomial Naive Bayes** and **Bernoulli Naive Bayes**, which are commonly used for document classification (e.g., spam or not spam). We'll get more into these later when we learn about natural language processing.


In spite of their apparently over-simplified assumptions, naive Bayes classifiers have worked quite well in many real-world situations, famously document classification and spam filtering. They require a small amount of training data to estimate the necessary parameters.


Naive Bayes learners and classifiers can be extremely fast compared to more sophisticated methods. The decoupling of the class conditional feature distributions means that each distribution can be independently estimated as a one dimensional distribution. This in turn helps to alleviate problems stemming from the curse of dimensionality (which we'll also cover later when we talk more about linear algebra).

### House Votes
Let's use Naive Bayes to try to predict whether a politician was a Democrat or a Republican based on their votes. The real dataset from 1984 is located here:

https://archive.ics.uci.edu/ml/datasets/congressional+voting+records

There are 16 key votes making up the dataset labeled as:

1. handicapped-infants: 2 (y,n) 
2. water-project-cost-sharing: 2 (y,n) 
3. adoption-of-the-budget-resolution: 2 (y,n) 
4. physician-fee-freeze: 2 (y,n) 
5. el-salvador-aid: 2 (y,n) 
6. religious-groups-in-schools: 2 (y,n) 
7. anti-satellite-test-ban: 2 (y,n) 
8. aid-to-nicaraguan-contras: 2 (y,n) 
9. mx-missile: 2 (y,n) 
10. immigration: 2 (y,n) 
11. synfuels-corporation-cutback: 2 (y,n) 
12. education-spending: 2 (y,n) 
13. superfund-right-to-sue: 2 (y,n) 
14. crime: 2 (y,n) 
15. duty-free-exports: 2 (y,n) 
16. export-administration-act-south-africa: 2 (y,n)

Here is the dataset:

In [49]:
df = pd.read_csv('data/votes.csv', index_col = 0)
df.head()

Unnamed: 0,party,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16
0,republican,0,1,0,1,1,1,0,0,0,1,0,1,1,1,0,1
1,republican,0,1,0,1,1,1,0,0,0,0,0,1,1,1,0,1
2,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,1,0,0
3,democrat,0,1,1,0,1,1,0,0,0,0,1,0,1,0,0,1
4,democrat,1,1,1,0,1,1,0,0,0,0,1,0,1,1,1,1


Now let's break our dataset up into a test and train set and run the Naive Bayes classifier:

In [50]:
X = df.drop(columns = ["party"])
y = df["party"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=4444)

model = naive_bayes.GaussianNB()
model.fit(X_train,y_train)

print('train accuracy', model.score(X_train, y_train))
print('test accuracy', model.score(X_test, y_test))

train accuracy 0.9342105263157895
test accuracy 0.9389312977099237


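We can also view the confusion matrix. In scikit-learn, rows correspond to the true labels and columns to the predicted labels, with the labels in sorted order (democrat, then republican). So 2 democrats were incorrectly classified as republicans and 6 republicans were incorrectly classified as democrats: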

In [51]:
confusion_matrix(y_test, model.predict(X_test))

array([[75,  2],
       [ 6, 48]])
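
As a small sanity check, the matrix can be read programmatically. The sketch below copies the four counts from the output above and relies on scikit-learn's convention that rows are true labels, columns are predicted labels, and string labels are sorted when none are passed explicitly. The variable names are illustrative.

```python
# Confusion matrix values copied from the output above.
# Rows: true labels. Columns: predicted labels. Order: sorted label names.
labels = ["democrat", "republican"]
matrix = [[75, 2],
          [6, 48]]

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(labels)))
accuracy = correct / total

# Off-diagonal entries are the misclassifications.
for i, true_label in enumerate(labels):
    for j, predicted_label in enumerate(labels):
        if i != j:
            print(f"{matrix[i][j]} true {true_label} rows predicted as {predicted_label}")

print("test set size:", total)
print("accuracy:", accuracy)
```

The accuracy recovered from the diagonal should agree with the test accuracy printed by `model.score` earlier.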

### Using Dummy Variables

The voting dataset was convenient because the features already took on numeric values (0's and 1's). What if they didn't? For example, let's change the 0's to No's and the 1's to Yes's. Maybe there were even a few "maybes" in there. Let's also reduce our dataset to just two votes for simplicity:

In [52]:
df = df[['party', '1', '2']]
df = df.replace(0, "No")
df = df.replace(1, "Yes")
df.loc[0, '1'] = 'Maybe'
df.head()

Unnamed: 0,party,1,2
0,republican,Maybe,Yes
1,republican,No,Yes
2,democrat,No,Yes
3,democrat,No,Yes
4,democrat,Yes,Yes


In this case, you'll first want to create a one-hot matrix, as we did in the multiple regression chapter:

In [53]:
one_hot = pd.get_dummies(df[['1', '2']])
one_hot.head()

Unnamed: 0,1_Maybe,1_No,1_Yes,2_No,2_Yes
0,1,0,0,0,1
1,0,1,0,0,1
2,0,1,0,0,1
3,0,1,0,0,1
4,0,0,1,0,1
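
To make the encoding concrete, here is a minimal pure-Python sketch of the idea behind one-hot encoding, applied to column `1` of the five rows shown above. The helper `one_hot` is hypothetical and only mimics the column naming seen in the output. It is not how `pd.get_dummies` is implemented.

```python
def one_hot(column_name, values):
    """Build one 0/1 indicator column per distinct value, sorted by name."""
    categories = sorted(set(values))
    return {
        f"{column_name}_{category}": [int(value == category) for value in values]
        for category in categories
    }


# Column '1' of the five rows displayed by df.head().
vote_1 = ["Maybe", "No", "No", "No", "Yes"]

encoded = one_hot("1", vote_1)
for name, indicators in encoded.items():
    print(name, indicators)
```

Each row has exactly one 1 across the indicator columns of a given original column, which is what lets a numeric model consume categorical values.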


Now, you would be ready to run your Naive Bayes classifier:

In [55]:
X = one_hot
y = df["party"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.30, random_state=4444)

model = naive_bayes.GaussianNB()
model.fit(X_train,y_train)

print('train accuracy', model.score(X_train, y_train))
print('test accuracy', model.score(X_test, y_test))

train accuracy 0.6282894736842105
test accuracy 0.5877862595419847
