# Lecture 4: Supervised Models, [Gaussian Naïve Bayes (GNB)](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html) Classifier

* How to train a Gaussian Naïve Bayes classifier with `sklearn`.
* How to use a Gaussian Naïve Bayes classifier for prediction.

## Imports

In [1]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns; sns.set()

## Data

Instead of generating our own data, we use the well-known Iris data set with three classes:

      Iris-Setosa            Iris-Versicolour      Iris-Virginica
  
  
<img src="https://upload.wikimedia.org/wikipedia/commons/5/56/Kosaciec_szczecinkowaty_Iris_setosa.jpg" alt="Iris-Setosa" style="height: 200px; display: inline; margin-top: 0"/>
<img src="https://upload.wikimedia.org/wikipedia/commons/2/27/Blue_Flag%2C_Ottawa.jpg" alt="Iris-Versicolour" style="width: 200px; display: inline; margin-top: 0"/>
<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/f/f8/Iris_virginica_2.jpg/1280px-Iris_virginica_2.jpg" alt="Iris-Virginica" style="width: 200px; display: inline; margin-top: 0"/>

We can load this data by calling [`load_iris()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_iris.html):

In [2]:
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

`X` and `y` are `ndarray`s.
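As a small aside, calling `load_iris()` without `return_X_y=True` returns an object that also carries the column and class names. The sketch below only prints that metadata; the variable names `iris`, `feature_names` and `target_names` are our own.

```python
from sklearn.datasets import load_iris

# Without return_X_y=True, load_iris() returns a Bunch that includes metadata.
iris = load_iris()

feature_names = list(iris.feature_names)  # names of the four columns of X
target_names = list(iris.target_names)    # class names behind the codes 0, 1, 2

print(feature_names)
print(target_names)
```

The position of a name in `target_names` corresponds to the integer code used in `y`.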

Data inspection:
* the $X$'s are continuous

In [3]:
X[:5]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

* the $y$'s are integer class labels

In [4]:
y[:5]

array([0, 0, 0, 0, 0])

* unique $y$'s

In [5]:
np.unique(y)

array([0, 1, 2])
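It can also help to check how many cases each class has, because accuracy is easiest to interpret when the classes are balanced. A minimal sketch using `np.unique()` with `return_counts=True`:

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# return_counts=True additionally reports how often each label occurs
labels, counts = np.unique(y, return_counts=True)
print(dict(zip(labels.tolist(), counts.tolist())))
```

The Iris data contain 50 cases of each class, so no class dominates the accuracy figure.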

## Model Training

We follow the `sklearn` training process as seen on the slides.

### 1. Choose a Model

In [6]:
from sklearn.naive_bayes import GaussianNB

### 2. Choose Hyperparameters

None for now; we keep the default values.

In [7]:
model = GaussianNB()
model

GaussianNB(priors=None, var_smoothing=1e-09)
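For illustration only, the two hyperparameters shown in the output above can be set when the model is created. The values below are arbitrary assumptions, not recommendations, and `custom_model` is a name we made up.

```python
from sklearn.naive_bayes import GaussianNB

# Illustrative values only: equal prior probabilities for three classes
# and a variance smoothing term larger than the default of 1e-09.
custom_model = GaussianNB(priors=[1/3, 1/3, 1/3], var_smoothing=1e-8)

params = custom_model.get_params()
print(params)
```

If `priors` is left at `None`, the class priors are estimated from the training data instead.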

### 3. Arrange Data in Feature Matrix and Target Vector

* `y` needs to be a vector of length `n_samples`
* `X` needs to be a `[n_samples, n_features]` matrix

In [8]:
X.shape

(150, 4)

In [9]:
y.shape

(150,)
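The Iris arrays already have the right shapes. If you only had a single feature stored as a one-dimensional array, it would first need to be turned into a column. A minimal sketch with made-up values (`sepal_length` and `X_single` are hypothetical names):

```python
import numpy as np

# Hypothetical single feature with three samples
sepal_length = np.array([5.1, 4.9, 4.7])
print(sepal_length.shape)  # one-dimensional: (3,)

# reshape(-1, 1) turns it into a matrix with n_samples rows and 1 column
X_single = sepal_length.reshape(-1, 1)
print(X_single.shape)
```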

### 4. Fit the Model to the Data

In [10]:
model.fit(X, y)

GaussianNB(priors=None, var_smoothing=1e-09)

We're not usually interested in a GNB's parameters. Unlike OLS, its parameters do not have a causal interpretation.
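If you do want to look at them, the fitted model stores what it has learned in attributes ending in an underscore. The sketch below refits the model so that it runs on its own and prints a few of these attributes.

```python
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

print(model.classes_)      # class labels seen during fitting
print(model.class_prior_)  # estimated prior probability of each class
print(model.theta_)        # per-class feature means, shape [n_classes, n_features]
```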

The main performance metric for a GNB model is its accuracy: $\frac{\text{correct decisions}}{\text{all decisions}}$

The model can compute its accuracy automatically for us with the [`score()`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.score) method:

In [11]:
model.score(X, y)

0.96

This means that the model classified 96% of the cases in the data (144 of 150) correctly.
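To connect `score()` to the accuracy formula, the share of correct decisions can also be computed by hand. A minimal sketch that refits the model so it is self-contained (`manual_accuracy` is our own name):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# Share of correct decisions: the mean of a True/False array is the
# fraction of True values.
manual_accuracy = np.mean(model.predict(X) == y)
print(manual_accuracy, model.score(X, y))
```

Both numbers should agree, since `score()` reports the mean accuracy for classifiers.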

We can use [`predict()`](https://scikit-learn.org/stable/modules/generated/sklearn.naive_bayes.GaussianNB.html#sklearn.naive_bayes.GaussianNB.predict) to see which cases have been misclassified:

In [12]:
y_pred = model.predict(X)
y_pred

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
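`predict()` is not limited to the training data. As a rough sketch, the same call classifies a new observation, and `predict_proba()` returns the estimated class probabilities. The measurements in `X_new` are made up for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)

# Hypothetical new flower; predict() expects a 2-D array [n_samples, n_features]
X_new = np.array([[5.0, 3.4, 1.5, 0.2]])

label = model.predict(X_new)        # predicted class code
proba = model.predict_proba(X_new)  # one probability per class
print(label, proba)
```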

In [13]:
# first row: true labels of the misclassified cases, second row: predicted labels
print(y[y != y_pred], y_pred[y != y_pred], sep='\n')

[1 1 1 2 2 2]
[2 2 2 1 1 1]
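The same information can be summarised for all cases at once in a confusion matrix. This is an optional extra, sketched here with `confusion_matrix()` from `sklearn.metrics`; the model is refit so the block runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
model = GaussianNB().fit(X, y)
y_pred = model.predict(X)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y, y_pred)
print(cm)
```

Correct decisions sit on the diagonal; every off-diagonal entry counts one kind of mistake.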


In this case, we don't have test data, but evaluating the predictions on test data works the same way. The only difference is that instead of `r2_score()` we use [`accuracy_score()`](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html) to calculate the model's accuracy (also imported from `sklearn.metrics`).
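As a sketch of what that could look like, we can hold out part of the Iris data ourselves. The 30% test share, the stratified split and `random_state=0` are assumptions made for this example, not part of the lecture's setup.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)

# Illustrative split: hold out 30% of the cases, keeping class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit on the training part only, then evaluate on the held-out part
model = GaussianNB().fit(X_train, y_train)
test_accuracy = accuracy_score(y_test, model.predict(X_test))
print(test_accuracy)
```

Note the argument order of `accuracy_score()`: true labels first, predicted labels second.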

© 2023 Philipp Cornelius