<table align="center">
   <td align="center"><a target="_blank" href="https://colab.research.google.com/github/ds5110/summer-2021/blob/master/09a-naive-bayes-iris.ipynb">
<img src="https://github.com/ds5110/summer-2021/raw/master/colab.png"  style="padding-bottom:5px;" />Run in Google Colab</a></td>
</table>

# 09a -- Naive Bayes with the iris dataset

References: 

* [Section 5.05 from VanderPlas](https://jakevdp.github.io/PythonDataScienceHandbook/05.05-naive-bayes.html) -- github
* [1.9 Naive Bayes](https://scikit-learn.org/stable/modules/naive_bayes.html) -- scikit-learn.org
* [Naive Bayes classifier](https://en.wikipedia.org/wiki/Naive_Bayes_classifier) -- wikipedia

In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Set some plotting parameters
from matplotlib.colors import ListedColormap
plt.rcParams.update({'font.size': 16})
colors = ('red', 'blue')
cmap = ListedColormap(colors)

In [None]:
from sklearn.datasets import make_blobs

std_true = 3    # original: 1.5
n_samples = 100   # original: 100

# Generate some random Gaussian blobs as data
X, y = make_blobs(n_samples=n_samples, n_features=2, centers=2,
                  random_state=2, cluster_std=std_true)

# Plot the data
plt.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=cmap, alpha=.7);

In [None]:
# Plot the data on the Gaussian probability densities
fig, ax = plt.subplots()

ax.scatter(X[:, 0], X[:, 1], c=y, s=50, cmap=cmap)
ax.set_title('Naive Bayes Model', size=14)

xlim = (-8, 8)
ylim = (-15, 5)

xg = np.linspace(xlim[0], xlim[1], 60)
yg = np.linspace(ylim[0], ylim[1], 40)
xx, yy = np.meshgrid(xg, yg)
Xgrid = np.vstack([xx.ravel(), yy.ravel()]).T

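for label, color in enumerate(colors):
    mask = (y == label)
    # Per-class mean and standard deviation of each feature
    mu, std = X[mask].mean(0), X[mask].std(0)
    # Unnormalized product of independent 1D Gaussians, one per feature
    # (the "naive" independence assumption); the peak value is 1
    P = np.exp(-0.5 * (Xgrid - mu) ** 2 / std ** 2).prod(1)
    # Mask out low-density regions so they are not shaded
    Pm = np.ma.masked_array(P, P < 0.03)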
    ax.pcolorfast(xg, yg, Pm.reshape(xx.shape), alpha=0.5,
                  cmap=color.title() + 's')
    ax.contour(xx, yy, P.reshape(xx.shape),
               levels=[0.01, 0.1, 0.5, 0.9],
               colors=color, alpha=0.2)
    
ax.set(xlim=xlim, ylim=ylim);
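
The plot above shows one Gaussian per class and per feature. As a rough sketch of how those Gaussians become a classifier, the cell below fits per-class means and variances by hand and picks the class with the largest log posterior. The helper names `fit_gaussian_nb` and `predict_gaussian_nb` are hypothetical, introduced only for this illustration. scikit-learn's `GaussianNB` additionally applies a small variance smoothing, so the two are compared by agreement rate rather than assumed identical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.naive_bayes import GaussianNB

# Same kind of synthetic data as above (two Gaussian blobs)
X, y = make_blobs(n_samples=100, n_features=2, centers=2,
                  random_state=2, cluster_std=3)

def fit_gaussian_nb(X, y):
    """Hypothetical helper: class labels, priors, per-feature means and variances."""
    classes = np.unique(y)
    priors = np.array([(y == c).mean() for c in classes])
    means = np.array([X[y == c].mean(axis=0) for c in classes])
    variances = np.array([X[y == c].var(axis=0) for c in classes])
    return classes, priors, means, variances

def predict_gaussian_nb(X, classes, priors, means, variances):
    """Hypothetical helper: argmax of log prior + summed per-feature log likelihood."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff2 = (X[:, None, :] - means[None, :, :]) ** 2
    log_like = -0.5 * (np.log(2 * np.pi * variances)[None] + diff2 / variances[None])
    # Summing over features is the log of the product of independent Gaussians
    log_post = np.log(priors)[None, :] + log_like.sum(axis=2)
    return classes[np.argmax(log_post, axis=1)]

params = fit_gaussian_nb(X, y)
y_hand = predict_gaussian_nb(X, *params)
y_sklearn = GaussianNB().fit(X, y).predict(X)

agreement = (y_hand == y_sklearn).mean()
print('agreement with GaussianNB: {:.2f}'.format(agreement))
```

Working in log space avoids numerical underflow when many small probabilities are multiplied together.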


# Gaussian Naive Bayes

In [None]:
import seaborn as sns; sns.set()

iris = sns.load_dataset('iris')
iris

In [None]:
sns.pairplot(iris, hue='species');

In [None]:
# Separate the features (predictors) and target variables
X_iris = iris.drop('species', axis=1)
y_iris = iris['species']

print('X:', X_iris.shape)
print('y:', y_iris.shape)

<img src="https://raw.githubusercontent.com/jakevdp/PythonDataScienceHandbook/master/notebooks/figures/05.02-samples-features.png">

Figure credit: [05.02-Introducing-Scikit-Learn.ipynb](https://github.com/jakevdp/PythonDataScienceHandbook/blob/master/notebooks/05.02-Introducing-Scikit-Learn.ipynb) (VanderPlas) -- github

# Train & test datasets

* [sklearn.model_selection.train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) (API reference docs) -- scikit-learn.org

In [None]:
# Create train and test datasets (with sklearn)
# NOTE: You can change the train/test ratio with the test_size or train_size
# arguments -- look at the API reference docs
from sklearn.model_selection import train_test_split

Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)
print(Xtrain.shape, ytrain.shape)
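
As a short illustration of changing the train/test ratio, the sketch below compares the default split with an explicit 80/20 split. It assumes scikit-learn's bundled copy of the iris data (`load_iris`) instead of the seaborn copy used in this notebook, so that the cell runs without a download. The variable names are chosen only for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

# Assumption: scikit-learn's bundled iris data (150 samples, 4 features)
X, y = load_iris(return_X_y=True)

# Default split: test_size is 0.25 when neither size is given
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)
default_shapes = (X_tr.shape, X_te.shape)

# Explicit 80/20 split
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=1)
custom_shapes = (X_tr.shape, X_te.shape)

print(default_shapes)  # ((112, 4), (38, 4))
print(custom_shapes)   # ((120, 4), (30, 4))
```

Fixing `random_state` makes the shuffle reproducible, so repeated runs give the same split.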

In [None]:
# You can also create 75/25 train/test datasets by hand, slicing rows by position with .iloc
n_train = 112 # n_train = 112 corresponds to a 75/25 ratio for 150 samples
iris_train = iris.iloc[:n_train, :]
iris_test = iris.iloc[n_train:, :]

Xtrain = iris_train.drop('species', axis=1)
ytrain = iris_train['species']

Xtest = iris_test.drop('species', axis=1)
ytest = iris_test['species']

In [None]:
# Perform Gaussian Naive Bayes classification
from sklearn.naive_bayes import GaussianNB # 1. choose model class
model = GaussianNB()                       # 2. instantiate model
model.fit(Xtrain, ytrain)                  # 3. fit model to data
y_model = model.predict(Xtest)             # 4. predict on new data

## EXERCISE

Explain the accuracy reported by the next cell. For comparison, VanderPlas obtains 97% accuracy.

In [None]:
# Assess model accuracy with scikit-learn
from sklearn.metrics import accuracy_score
accuracy = accuracy_score(ytest, y_model)

print('accuracy: {:.2f}'.format(accuracy))

In [None]:
# Assess model accuracy (by hand)
accuracy = (ytest == y_model).sum() / ytest.shape[0]

print('accuracy: {:.2f}'.format(accuracy))

## EXERCISE

Why does the next cell throw an error?

In [None]:
# Assess model accuracy (by hand)
#accuracy = (y_model == ytest).sum() / ytest.shape[0]
#print('accuracy: {:.2f}'.format(accuracy))

# Confusion matrix

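The confusion matrix is another way to assess model accuracy: for each true class, it counts how many samples were assigned to each predicted class.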

<img src="https://github.com/rasbt/python-machine-learning-book-3rd-edition/raw/master/ch06/images/06_08.png" width="300"/>

The code cell after the references below computes the confusion matrix and plots it as an annotated heatmap.
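
As a small illustration of how to read the matrix, the sketch below uses six made-up labels (not results from this notebook). In scikit-learn's convention, rows are true classes and columns are predicted classes, so correct predictions sit on the diagonal and accuracy is the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration only
labels = ['setosa', 'versicolor', 'virginica']
y_true = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica', 'virginica']
y_pred = ['setosa', 'setosa', 'versicolor', 'virginica', 'virginica', 'versicolor']

# Passing labels fixes the row/column order and keeps the matrix 3x3
toy_mat = confusion_matrix(y_true, y_pred, labels=labels)
print(toy_mat)
# [[2 0 0]
#  [0 1 1]
#  [0 1 1]]

# Diagonal entries are the correct predictions
toy_accuracy = np.trace(toy_mat) / toy_mat.sum()
print('accuracy: {:.2f}'.format(toy_accuracy))  # accuracy: 0.67
```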

### References 

* Raschka's [ch06.ipynb](https://github.com/rasbt/python-machine-learning-book-3rd-edition/blob/master/ch06/ch06.ipynb)
* [confusion_matrix](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) -- scikit-learn.org
* [heatmap](https://matplotlib.org/3.3.1/gallery/images_contours_and_fields/image_annotated_heatmap.html) -- matplotlib.org

In [None]:
from sklearn.metrics import confusion_matrix
import matplotlib.pyplot as plt

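# Rows are true classes, columns are predicted classes.
# Passing labels keeps every class in the matrix, even one missing from ytest.
labels = model.classes_
mat = confusion_matrix(ytest, y_model, labels=labels)

sns.heatmap(mat, square=True, annot=True, cbar=False,
            xticklabels=labels, yticklabels=labels)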
plt.xlabel('predicted value')
plt.ylabel('true value');

# Balanced datasets

**EXERCISE:** Repeat the analysis with balanced train/test datasets
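
One way to check whether a split is balanced is to count the labels in each subset. The sketch below does this for a split made by slicing rows in their stored order. It assumes scikit-learn's bundled iris data (`load_iris`), which, like the seaborn copy, stores the samples grouped by species. The variable names are chosen only for this example.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: scikit-learn's bundled iris data; labels are 0, 1, 2 in class order
X, y = load_iris(return_X_y=True)

# Split by position, without shuffling
n_train = 112
train_counts = np.bincount(y[:n_train], minlength=3).tolist()
test_counts = np.bincount(y[n_train:], minlength=3).tolist()

print('train:', train_counts)  # train: [50, 50, 12]
print('test:', test_counts)    # test: [0, 0, 38]
```

The same counting check can be applied to the output of any splitting approach you try in the cell below.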

In [None]:
# Modify this cell to obtain balanced datasets
Xtrain, Xtest, ytrain, ytest = train_test_split(X_iris, y_iris, random_state=1)

model = GaussianNB()
model.fit(Xtrain, ytrain)
y_model = model.predict(Xtest)

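# Rows are true classes, columns are predicted classes
labels = model.classes_
mat = confusion_matrix(ytest, y_model, labels=labels)

sns.heatmap(mat, square=True, annot=True, cbar=False,
            xticklabels=labels, yticklabels=labels)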
plt.xlabel('predicted value')
plt.ylabel('true value');