# Naive Bayes Modeling

In [None]:
import numpy as np
import pandas as pd
from scipy import stats
from sklearn.naive_bayes import MultinomialNB, GaussianNB
    # There is also a BernoulliNB for a dataset with binary predictors
import seaborn as sns
from matplotlib import pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.metrics import plot_confusion_matrix
from sklearn.preprocessing import OneHotEncoder

%matplotlib inline

## Agenda

SWBAT (students will be able to):

- describe how Bayes's Theorem can be used to make predictions of a target;
- identify the appropriate variant of Naive Bayes models for a particular business problem.

## Using Bayes's Theorem for Classification

Let's recall Bayes's Theorem:

$\large P(h|e) = \frac{P(h)P(e|h)}{P(e)}$

**Does this look like a classification problem?**

- Suppose we have three competing hypotheses $\{h_1, h_2, h_3\}$ that would explain our evidence $e$.
    - Then we could use Bayes's Theorem to calculate the posterior probabilities for each of these three:
        - $P(h_1|e) = \frac{P(h_1)P(e|h_1)}{P(e)}$
        - $P(h_2|e) = \frac{P(h_2)P(e|h_2)}{P(e)}$
        - $P(h_3|e) = \frac{P(h_3)P(e|h_3)}{P(e)}$
        
- Suppose the evidence is a collection of elephant measurements (e.g. height or weight).

- Suppose each of the three hypotheses claims that the elephant whose measurements we have belongs to one of the three extant elephant species (*L. africana*, *L. cyclotis*, and *E. maximus*).

In that case the left-hand sides of these equations represent the probability that the elephant in question belongs to a given species.

If we think of the species as our target, then **this is just an ordinary classification problem**.
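
As a minimal sketch of the three-hypothesis calculation above, each posterior is the product of a prior and a likelihood, divided by the sum of those products (the probability of the evidence). The function name `posteriors` and the numbers below are illustrative assumptions, not values from the elephant dataset.

```python
def posteriors(priors, likelihoods):
    """Return P(h_i | e) for each hypothesis via Bayes's Theorem."""
    # Numerators: P(h_i) * P(e | h_i)
    joint = [p * l for p, l in zip(priors, likelihoods)]
    # Denominator: P(e), by the law of total probability
    evidence = sum(joint)
    return [j / evidence for j in joint]


# Hypothetical priors and likelihoods for three hypotheses h_1, h_2, h_3
example = posteriors(priors=[0.5, 0.3, 0.2], likelihoods=[0.10, 0.40, 0.25])
print(example)
```

The hypothesis with the largest posterior is the predicted class. Note that the denominator is the same for every hypothesis, so it only rescales the numerators.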

What about the right-hand sides of the equations? **These other probabilities we can calculate from our dataset.**

- The priors can simply be taken to be the percentages of the different classes in the dataset.
- What about the likelihoods?
    - If the relevant features are **categorical**, we can simply count the numbers of each category in the dataset. For example, if the feature is whether the elephant has tusks or not, then, to calculate the likelihoods, we'll just count the tusked and non-tusked elephants per species.
    - If the relevant features are **numerical**, we'll have to do something else. A good way of proceeding is to rely on (presumed) underlying distributions of the data. [Here](https://medium.com/analytics-vidhya/use-naive-bayes-algorithm-for-categorical-and-numerical-data-classification-935d90ab273f) is an example of using the normal distribution to calculate likelihoods. We'll follow this idea below for our elephant data.
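
The counting approach for priors and categorical likelihoods might look like the sketch below. The toy DataFrame, including its `tusked` column and every row in it, is made up for illustration and is not part of `elephants.csv`.

```python
import pandas as pd

# Hypothetical toy data: the 'tusked' column and these rows are made up
toy = pd.DataFrame({
    'species': ['africana', 'africana', 'africana', 'maximus', 'maximus', 'cyclotis'],
    'tusked':  ['yes', 'yes', 'no', 'yes', 'no', 'yes'],
})

# Priors: share of each class in the dataset
priors = toy['species'].value_counts(normalize=True)

# Likelihoods: P(tusked | species), from counts within each species
likelihoods = pd.crosstab(toy['species'], toy['tusked'], normalize='index')

print(priors)
print(likelihoods)
```

Each row of `likelihoods` sums to 1 because `normalize='index'` divides the counts by the number of elephants in that species.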

## Elephant Example

Suppose we have a dataset that looks like this:

In [None]:
elephs = pd.read_csv('data/elephants.csv', usecols=['height (in)',
                                                   'species'])

In [None]:
elephs.head()

In [None]:
plt.style.use('fivethirtyeight')
fig, ax = plt.subplots()

sns.kdeplot(data=elephs[elephs['species'] == 'maximus']['height (in)'],
            ax=ax, label='maximus')
sns.kdeplot(data=elephs[elephs['species'] == 'africana']['height (in)'],
            ax=ax, label='africana')
sns.kdeplot(data=elephs[elephs['species'] == 'cyclotis']['height (in)'],
            ax=ax, label='cyclotis');

### Naive Bayes by Hand

Suppose we want to make a prediction of species for some new elephant whose height we've just recorded. We'll suppose the new elephant has:

In [None]:
new_ht = 263

What we want to calculate is the mean and standard deviation of height for each elephant species. We'll use these to calculate the relevant likelihoods.

So:

In [None]:
max_stats = elephs[elephs['species'] == 'maximus'].describe().loc[['mean', 'std'], :]
max_stats

In [None]:
cyc_stats = elephs[elephs['species'] == 'cyclotis'].describe().loc[['mean', 'std'], :]
cyc_stats

In [None]:
afr_stats = elephs[elephs['species'] == 'africana'].describe().loc[['mean', 'std'], :]
afr_stats

In [None]:
elephs['species'].value_counts()

### Calculation of Likelihoods

We'll use the PDFs of the normal distributions with the discovered means and stds to calculate likelihoods:

In [None]:
# Look up the statistics by label rather than by position
stats.norm(loc=max_stats.loc['mean', 'height (in)'],
           scale=max_stats.loc['std', 'height (in)']).pdf(new_ht)

In [None]:
stats.norm(loc=cyc_stats.loc['mean', 'height (in)'],
           scale=cyc_stats.loc['std', 'height (in)']).pdf(new_ht)

In [None]:
stats.norm(loc=afr_stats.loc['mean', 'height (in)'],
           scale=afr_stats.loc['std', 'height (in)']).pdf(new_ht)
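
For reference, the normal PDF evaluated in the cells above is $f(x) = \frac{1}{\sigma\sqrt{2\pi}}e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$. The sketch below checks a hand-written version against `scipy.stats.norm`. The helper `normal_pdf` and the mean and standard deviation are assumptions chosen for illustration, not statistics from the elephant data.

```python
import numpy as np
from scipy import stats


def normal_pdf(x, mean, std):
    """Density of a normal distribution evaluated at x."""
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2 * np.pi))


# Hypothetical mean and standard deviation, not taken from elephants.csv
mean, std = 250.0, 20.0
by_hand = normal_pdf(263, mean, std)
with_scipy = stats.norm(loc=mean, scale=std).pdf(263)
print(by_hand, with_scipy)
```

The value is a density rather than a probability, which is why likelihoods across classes need not sum to 1.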

### Posteriors

What we have just calculated are the likelihoods (strictly speaking, values of the probability density), i.e.:

- P(height=263 | species=maximus) = 0.0204,
- P(height=263 | species=cyclotis) = 0.0150, and
- P(height=263 | species=africana) = 0.0090.

(Notice that they do NOT sum to 1!) But what we'd really like to know are the posteriors. I.e. what are:

- P(species=maximus | height=263),
- P(species=cyclotis | height=263), and
- P(species=africana | height=263)?

Since we have equal numbers of each species, every prior is equal to $\frac{1}{3}$. Thus we can calculate the probability of the evidence:

P(height=263) = $\frac{1}{3}(0.0204 + 0.0150 + 0.0090) = 0.0148$,

and therefore calculate the posteriors using Bayes's Theorem:

- P(species=maximus | height=263) = $\frac{1}{3}\frac{0.0204}{0.0148} = 45.9\%$;
- P(species=cyclotis | height=263) = $\frac{1}{3}\frac{0.0150}{0.0148} = 33.8\%$;
- P(species=africana | height=263) = $\frac{1}{3}\frac{0.0090}{0.0148} = 20.3\%$.

Bayes's Theorem shows us that the largest posterior belongs to the *maximus* species. (Note also that, since the priors are all the same, the largest posterior will necessarily belong to the species with the largest likelihood!)

Therefore, the *maximus* species will be our prediction for an elephant of this height.
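
The arithmetic above can be reproduced in a few lines. This sketch uses only the rounded likelihoods quoted in this section; the variable names are our own.

```python
# Likelihoods quoted above for an elephant with height 263
likelihoods = {'maximus': 0.0204, 'cyclotis': 0.0150, 'africana': 0.0090}
prior = 1 / 3  # equal class sizes, so every prior is 1/3

# P(height=263): sum of prior * likelihood over the species
evidence = sum(prior * lik for lik in likelihoods.values())

# Posterior for each species via Bayes's Theorem
posteriors = {sp: prior * lik / evidence for sp, lik in likelihoods.items()}
prediction = max(posteriors, key=posteriors.get)

for sp, post in posteriors.items():
    print(f'{sp}: {post:.3f}')
print('prediction:', prediction)
```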

### More Dimensions

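In fact, we also have elephant *weight* data available in addition to their heights. To accommodate multiple features we can make use of **multivariate normal** distributions. Note that fitting a full covariance matrix, as we do below, also models the correlation between height and weight; Naive Bayes proper assumes the features are independent within each class, which corresponds to a diagonal covariance matrix.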

![multivariate-normal](https://upload.wikimedia.org/wikipedia/commons/thumb/8/8e/MultivariateNormal.png/440px-MultivariateNormal.png)

In [None]:
elephants = pd.read_csv('data/elephants.csv',
                       usecols=['height (in)', 'weight (lbs)', 'species'])

In [None]:
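# Keep only the numeric columns, in the order [height, weight] used for the new point
features = ['height (in)', 'weight (lbs)']
maximus = elephants.loc[elephants['species'] == 'maximus', features]
cyclotis = elephants.loc[elephants['species'] == 'cyclotis', features]
africana = elephants.loc[elephants['species'] == 'africana', features]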

In [None]:
likeli_max = stats.multivariate_normal(mean=maximus.mean(),
                          cov=maximus.cov()).pdf([263, 7009])
likeli_max

In [None]:
likeli_cyc = stats.multivariate_normal(mean=cyclotis.mean(),
                         cov=cyclotis.cov()).pdf([263, 7009])
likeli_cyc

In [None]:
likeli_afr = stats.multivariate_normal(mean=africana.mean(),
                         cov=africana.cov()).pdf([263, 7009])
likeli_afr

#### Posteriors

In [None]:
post_max = likeli_max / sum([likeli_max, likeli_cyc, likeli_afr])
post_cyc = likeli_cyc / sum([likeli_max, likeli_cyc, likeli_afr])
post_afr = likeli_afr / sum([likeli_max, likeli_cyc, likeli_afr])

print(post_max)
print(post_cyc)
print(post_afr)
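
The "naive" assumption is that features are independent within each class, so the joint likelihood is a product of one-dimensional densities. The sketch below shows that this product equals a multivariate normal density with a diagonal covariance matrix. The class mean and standard deviations are hypothetical, not computed from `elephants.csv`.

```python
import numpy as np
from scipy import stats

# Hypothetical class statistics (not from elephants.csv): [height, weight]
mean = np.array([260.0, 7000.0])
std = np.array([15.0, 900.0])
x = np.array([263.0, 7009.0])

# Naive assumption: multiply the one-dimensional densities of each feature
naive = (stats.norm(loc=mean[0], scale=std[0]).pdf(x[0])
         * stats.norm(loc=mean[1], scale=std[1]).pdf(x[1]))

# Same density from a multivariate normal whose covariance matrix is diagonal
diagonal = stats.multivariate_normal(mean=mean, cov=np.diag(std ** 2)).pdf(x)

print(naive, diagonal)
```

A full covariance matrix with non-zero off-diagonal entries would generally give a different value, so a by-hand result computed that way need not match a Naive Bayes model exactly.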

### `GaussianNB`

In [None]:
gnb = GaussianNB(priors=[1/3, 1/3, 1/3])

In [None]:
X = elephants.drop('species', axis=1)
y = elephants['species']

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
gnb.fit(X_train, y_train)

In [None]:
gnb.predict_proba(np.array([263, 7009]).reshape(1, -1))

In [None]:
gnb.score(X_test, y_test)

In [None]:
plot_confusion_matrix(gnb, X_test, y_test);

## Comma Survey Example

In [None]:
commas = pd.read_csv('data/comma-survey.csv')

In [None]:
commas.head()

The first question on the survey was about the Oxford comma.

In [None]:
commas.isna().sum()

We'll go ahead and drop the NaNs:

In [None]:
commas = commas.dropna()

In [None]:
commas.shape

In [None]:
commas['In your opinion, which sentence is more gramatically correct?'].value_counts()

Personally, I like the Oxford comma, since it can help eliminate ambiguities, such as:

"This book is dedicated to my parents, Ayn Rand, and God" <br/> vs. <br/>
"This book is dedicated to my parents, Ayn Rand and God"

Let's see how a Naive Bayes model would make a prediction here. We'll think of the comma preference as our target.

In [None]:
commas['Age'].value_counts()

Suppose we want to make a prediction about Oxford comma usage for a new person who falls into the **45-60 age group**.

### Calculating Priors and Likelihoods

The following code makes a table of values that count up the number of survey respondents who fall into each of eight bins (the four age groups and the two answers to the first comma question). 

In [None]:
question = 'In your opinion, which sentence is more gramatically correct?'
oxford = "It's important for a person to be honest, kind, and loyal."
no_oxford = "It's important for a person to be honest, kind and loyal."

# Row 0: prefers the Oxford comma; row 1: does not.
# Columns: age groups, in the order returned by value_counts()
table = np.zeros((2, 4))

for idx, value in enumerate(commas['Age'].value_counts().index):
    in_group = commas['Age'] == value
    table[0, idx] = ((commas[question] == oxford) & in_group).sum()
    table[1, idx] = ((commas[question] == no_oxford) & in_group).sum()

In [None]:
table

In [None]:
df = pd.DataFrame(table, columns=['Age45-60',
                            'Age>60',
                            'Age30-44',
                            'Age18-29'])
df

In [None]:
df['Oxford'] = [True, False]
df = df[['Age>60', 'Age45-60', 'Age30-44', 'Age18-29', 'Oxford']]
df
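
A contingency table like this one can also be built with `pd.crosstab`, which takes its row and column labels from the data, so the age-group order does not need to be hard-coded. The sketch below uses a made-up toy DataFrame; the column names `age` and `oxford` and all rows are hypothetical.

```python
import pandas as pd

# Hypothetical responses; column names 'age' and 'oxford' are made up
toy = pd.DataFrame({
    'age':    ['18-29', '18-29', '30-44', '45-60', '45-60', '> 60'],
    'oxford': ['yes', 'no', 'yes', 'no', 'no', 'yes'],
})

# Rows: answer to the comma question; columns: age group; cells: respondent counts
counts = pd.crosstab(toy['oxford'], toy['age'])
print(counts)
```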

Since all we have here is a single categorical feature, we can read our likelihoods and priors right off of this table:

Likelihoods:

- Age45-60:
    - P(Age45-60 | Oxford=True) = $\frac{123}{470} = 0.2617$;
    - P(Age45-60 | Oxford=False) = $\frac{125}{355} = 0.3521$.

Priors:

- P(Oxford=True) = $\frac{470}{825} = 0.5697$;
- P(Oxford=False) = $\frac{355}{825} = 0.4303$.

### Calculating Posteriors

First we'll calculate the probability of the evidence:

- P(Age45-60) = P(Age45-60 | Oxford=True) * P(Oxford=True) + P(Age45-60 | Oxford=False) * P(Oxford=False) = 0.2617 * 0.5697 + 0.3521 * 0.4303 = 0.3006

In [None]:
(123+125)/825

Now use Bayes's Theorem to calculate the posteriors:

- P(Oxford=True | Age45-60) = P(Oxford=True) * P(Age45-60 | Oxford=True) / P(Age45-60) = 0.5697 * 0.2617 / 0.3006 = 0.4960;
- P(Oxford=False | Age45-60) = P(Oxford=False) * P(Age45-60 | Oxford=False) / P(Age45-60) = 0.4303 * 0.3521 / 0.3006 = 0.5040.

Close! But our prediction for someone in the 45-60 age group will be that they **do not** favor the Oxford comma.
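
The same calculation in code, using only the counts quoted above (the variable names are our own):

```python
# Counts read off the table above
n_oxford, n_no_oxford = 470, 355            # respondents per answer
n_4560_oxford, n_4560_no_oxford = 123, 125  # of those, aged 45-60
n_total = n_oxford + n_no_oxford

# Priors and likelihoods from the counts
prior_true = n_oxford / n_total
prior_false = n_no_oxford / n_total
lik_true = n_4560_oxford / n_oxford
lik_false = n_4560_no_oxford / n_no_oxford

# Probability of the evidence, then the posteriors via Bayes's Theorem
evidence = lik_true * prior_true + lik_false * prior_false
post_true = prior_true * lik_true / evidence
post_false = prior_false * lik_false / evidence
print(round(post_true, 4), round(post_false, 4))
```

With a single feature the posteriors reduce to 123/248 and 125/248, i.e. the share of each answer within the 45-60 group.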

### Comparison with `MultinomialNB`

In [None]:
comma_model = MultinomialNB()

ohe = OneHotEncoder()
ohe.fit(commas['Age'].values.reshape(-1, 1))

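# .toarray() returns a plain ndarray; the np.matrix from .todense() is not accepted by recent scikit-learn versions
X = ohe.transform(commas['Age'].values.reshape(-1, 1)).toarray()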
y = commas['In your opinion, which sentence is more gramatically correct?']

In [None]:
comma_model.fit(X, y)

In [None]:
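# One-hot columns follow the sorted categories in ohe.categories_;
# the third slot is assumed here to be the 45-60 age group
comma_model.predict_proba(np.array([0, 0, 1, 0]).reshape(1, -1))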
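
The model's output may differ slightly from the by-hand posteriors because `MultinomialNB` applies additive (Laplace) smoothing, with `alpha=1.0` by default. The sketch below applies that smoothing to the counts from the table, assuming four one-hot age columns. It is an approximation of what the model computes, not a transcript of its output, and the column order of `predict_proba` follows `comma_model.classes_`.

```python
alpha = 1.0      # MultinomialNB's default smoothing parameter
n_features = 4   # one one-hot column per age group

# Smoothed likelihoods of the 45-60 column, using the counts from the table
lik_true = (123 + alpha) / (470 + alpha * n_features)
lik_false = (125 + alpha) / (355 + alpha * n_features)

# Priors are the unsmoothed class frequencies
prior_true, prior_false = 470 / 825, 355 / 825

# Normalize the joint probabilities to get the posterior for Oxford=True
joint_true, joint_false = prior_true * lik_true, prior_false * lik_false
post_true = joint_true / (joint_true + joint_false)
print(post_true, 1 - post_true)
```

Smoothing nudges the likelihoods a little, but the predicted class for the 45-60 group stays the same here.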