*Disclaimer: This is my first tutorial ever, so please leave any comments down below on how I can improve the content of this piece, or let me know if it helped you!*


# A Short Introduction to the Naive Bayes Classifier



Let's say that we are looking at two teams, Team A and Team B, and the conditions under which they win or lose. There are two conditions in which they play: the weather is either clear or snowy.


What we want to do is predict whether Team A or Team B wins in their next game. A simple way to do this would be to use a Naive Bayes Classifier, which uses the famous [Bayes Theorem](https://en.wikipedia.org/wiki/Bayes%27_theorem) to calculate the probability that a team wins given certain conditions.


Here’s what __Bayes Theorem__ looks like: 

$$P(y|x) = \frac{P(x|y)P(y)}{\sum_{k=1}^n P(x|y_{k})P(y_{k})}$$

Here’s what each value means:
- $P(y|x)$: This is the probability that we wish to calculate. It means “the probability of y happening given that x has happened”. In our case, this is “the probability that Team y wins given condition x”. Another name for this is the __posterior probability__.


- $P(x|y)$: This means “the probability of x happening given that y happened”, or “the probability that condition x occurs given that Team y won”. This is referred to as the __likelihood__.


- $P(y)$: This is the probability that Team y wins, regardless of the conditions. It is called the __prior probability__. Multiplying it by the likelihood gives the numerator, which is the joint probability of x and y.


- $\sum_{k=1}^n P(x|y_{k})P(y_{k})$: This adds up the likelihood times the prior over every possible outcome $y_k$ (here, Team A winning and Team B winning). The result is $P(x)$, or “the probability that condition x occurs”.
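
To make the formula a bit more concrete, here is a minimal sketch that works through it by hand in plain Python. The game counts are made up purely for illustration (they are not taken from the dataset used later), and the helper names `prior`, `likelihood`, and `posterior` are just my own labels for the terms above.

```python
# Hypothetical game counts, invented purely for illustration
# (these are NOT taken from naive_bayes.csv).
counts = {
    ("A", "Clear"): 6,
    ("A", "Snowy"): 1,
    ("B", "Clear"): 2,
    ("B", "Snowy"): 3,
}
teams = ["A", "B"]
total_games = sum(counts.values())


def prior(team):
    """P(y): fraction of all games won by `team`."""
    wins = sum(n for (t, _), n in counts.items() if t == team)
    return wins / total_games


def likelihood(weather, team):
    """P(x|y): fraction of `team`'s wins that were played in `weather`."""
    wins = sum(n for (t, _), n in counts.items() if t == team)
    return counts[(team, weather)] / wins


def posterior(team, weather):
    """P(y|x), computed with Bayes' theorem."""
    # Denominator: likelihood times prior, summed over every team, i.e. P(x).
    evidence = sum(likelihood(weather, t) * prior(t) for t in teams)
    return likelihood(weather, team) * prior(team) / evidence


p_a_snowy = posterior("A", "Snowy")
print(f"P(A wins | Snowy) = {p_a_snowy:.2f}")
```

With these made-up counts, 4 games were snowy and Team A won 1 of them, so the posterior works out to 0.25. That is the same answer you would get by counting directly, which is a handy sanity check on the formula.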


Now that we have a grasp on how to use Bayes Theorem, let’s use the Naive Bayes Classifier from scikit-learn to check if our math is right and make more predictions. 

Make sure you have Jupyter Notebook and scikit-learn installed. Here are instructions for [how to install Jupyter Notebook](https://jupyter.readthedocs.io/en/latest/install.html) and on [how to install scikit-learn](http://scikit-learn.org/stable/install.html). 

Download [this dataset](./naive_bayes.csv) first.

## Coding a Naive Bayes Classifier

Now we can start coding our own Naive Bayes Classifier.

First, handle the imports.

In [1]:
import pandas as pd
import numpy as np
from sklearn.naive_bayes import GaussianNB

Let's read in the data.

The Weather column refers to the conditions, or **features**, of our dataset. We want to predict who will win a game given these conditions.

The Winner column refers to who won the game. I will refer to these as **labels**.
This is what we want to predict.

In [2]:
data = pd.read_csv('./naive_bayes.csv')
data.head()

| Unnamed: 0 | Weather | Winner |
|---|---|---|
| 0 | Clear | A |
| 1 | Clear | A |
| 2 | Snowy | B |
| 3 | Clear | A |
| 4 | Snowy | B |
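
Before encoding anything, it can help to tabulate how often each team wins under each kind of weather, since those conditional frequencies are what Bayes Theorem is describing. Here is a rough sketch using `pd.crosstab`. The `toy` frame is a made-up stand-in with the same two columns, so the numbers it prints are not the ones from the real dataset.

```python
import pandas as pd

# Made-up rows with the same columns as the dataset (illustration only).
toy = pd.DataFrame({
    "Weather": ["Clear", "Clear", "Snowy", "Clear", "Snowy", "Snowy"],
    "Winner": ["A", "A", "B", "B", "B", "A"],
})

# normalize="columns" makes each weather column sum to 1,
# so a cell can be read as P(Winner | Weather).
cond = pd.crosstab(toy["Winner"], toy["Weather"], normalize="columns")
print(cond)

# Empirical P(Team A wins | Snowy) in the toy data.
p_a_given_snowy = cond.loc["A", "Snowy"]
print(p_a_given_snowy)
```

Swapping `toy` for the real `data` frame should give the empirical frequencies to compare against whatever the classifier reports later.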


Looking at the data, it is clear we are dealing with categorical variables. scikit-learn's Naive Bayes classifiers expect numeric features, so they cannot work with text values like "Clear" or "Snowy" directly. In order to use one, we have to encode these columns as numbers.

Luckily, pandas provides a way to do this via dummy variables.

For the Winner column, it is easier to just encode A winning as 0 and B winning as 1. This keeps the encoding in a single column, which makes it easier to feed into the Naive Bayes Classifier.

In [6]:
encoded_weather = pd.get_dummies(data[['Weather']])

encoded_winners = pd.DataFrame(np.where(data[['Winner']] == 'A', 0, 1))
encoded_winners.columns = ['Winner']
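
To see what these two encodings actually produce, here is a small sketch on a made-up three-row frame (the `toy` names are hypothetical and only there for illustration). I pass `dtype=int` to `get_dummies` because recent pandas versions return booleans by default. That is my own choice for readability, not something the original cell does.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real CSV: these rows are made up for illustration.
toy = pd.DataFrame({
    "Weather": ["Clear", "Snowy", "Clear"],
    "Winner": ["A", "B", "B"],
})

# One 0/1 column per weather category, named <column>_<category>.
toy_weather = pd.get_dummies(toy[["Weather"]], dtype=int)

# Single label column: 0 means Team A won, 1 means Team B won.
toy_winner = pd.Series(np.where(toy["Winner"] == "A", 0, 1), name="Winner")

print(toy_weather)
print(toy_winner.tolist())
```

The weather ends up as two columns, `Weather_Clear` and `Weather_Snowy`, with exactly one of them set to 1 in each row, while the winner stays a single column of 0s and 1s.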

Now that we've encoded both columns, we can use the encoded weather as our features and the encoded winners as our labels, and fit a classifier on them.

In [7]:
# Features: the one-hot encoded weather columns
X = encoded_weather
# Labels: the encoded winners (0 = Team A, 1 = Team B)
y = encoded_winners['Winner']

# Create and fit the Gaussian Naive Bayes model
clf = GaussianNB()
clf.fit(X, y)

GaussianNB(priors=None)

And that's it! 

Now, we can make predictions. For instance, let's say that the weather is snowy. What is the probability that Team A wins?

In [8]:
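# One game with Weather_Clear = 0 and Weather_Snowy = 1;
# reusing X.columns keeps the feature names consistent with training
sample_X = pd.DataFrame([[0, 1]], columns=X.columns)

# Column 0 of predict_proba corresponds to the first class, Team A
print('Probability for GaussianNB: {:.2f}'.format(clf.predict_proba(sample_X)[0][0]))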

Probability for GaussianNB: 0.08


As you can see, the probability that Team A wins given that it is snowy is very small: only about 0.08 according to this model.
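
One thing that is easy to get wrong here is which column of `predict_proba` belongs to which team. As a rough sketch, you can pair the columns with the fitted `classes_` attribute instead of hard-coding an index. The training rows below are made up so the example runs on its own, so the probabilities it prints will not match the 0.08 above.

```python
import pandas as pd
from sklearn.naive_bayes import GaussianNB

# Made-up training rows, used only to make the sketch runnable.
toy_X = pd.DataFrame({
    "Weather_Clear": [1, 1, 0, 0, 1, 0],
    "Weather_Snowy": [0, 0, 1, 1, 0, 1],
})
toy_y = pd.Series(["A", "A", "B", "B", "B", "A"], name="Winner")

toy_clf = GaussianNB().fit(toy_X, toy_y)

# Reuse the training column names so the sample lines up with the features.
snowy = pd.DataFrame([[0, 1]], columns=toy_X.columns)
proba = toy_clf.predict_proba(snowy)[0]

# classes_ lists the labels in the same order as the predict_proba columns.
by_label = dict(zip(toy_clf.classes_, proba))
print(by_label)
```

Looking the probability up by label (`by_label["A"]`) keeps working even if the labels are encoded differently or more teams are added later.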