<a href="https://colab.research.google.com/github/matthewpecsok/4482_fall_2022/blob/main/tutorials/NB-iris-Tutorial.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

NB Iris Tutorial
Converted to Python by Matthew Pecsok from Dr. Olivia Sheng's original tutorial in R
June 12, 2021

1 Data description

2 Library Setup

3 Overall data inspection

4 NB model building using sklearn package

5 Explanatory data exploration

6 Simple hold-out evaluation

7 Generate performance metrics


# 1 Data description


This is perhaps the best-known data set in the pattern recognition literature. It contains 3 classes of 50 instances each, where each class refers to a type of iris plant. One class is linearly separable from the other 2; the latter are NOT linearly separable from each other.

Target variable: class of iris plant.

Number of Instances: 150 (50 in each of three classes)

Number of Attributes: 4 numeric, predictive attributes and the class variable

Attribute Information:

1. sepal length in cm
2. sepal width in cm
3. petal length in cm
4. petal width in cm
5. class: Iris Setosa, Iris Versicolour, Iris Virginica
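
The description above can be checked directly against the copy of the data that ships with scikit-learn. The following is a minimal sketch, not part of the original tutorial: the variable names and the 2.5 cm cut-off on petal length are illustrative assumptions, chosen only to show that one class (setosa) can be split off from the other two with a single threshold.

```python
import numpy as np
from sklearn.datasets import load_iris

bunch = load_iris()
X, y = bunch.data, bunch.target

# 150 instances, 4 numeric predictive attributes
print(X.shape)

# 50 instances in each of the three classes
class_counts = np.bincount(y)
print(dict(zip(bunch.target_names, class_counts)))

# Setosa (class 0) versus the rest on petal length (column index 2).
# The 2.5 cm threshold is an illustrative choice, not a value from the tutorial.
petal_length = X[:, 2]
setosa_max = petal_length[y == 0].max()
others_min = petal_length[y != 0].min()
print(setosa_max, others_min)

# True when every setosa flower falls below the cut and every other flower above it
separable = setosa_max < 2.5 < others_min
print(separable)
```

A single-feature threshold is the simplest kind of linear boundary, so finding one is enough to show that setosa is linearly separable from the other two species. It says nothing about versicolor versus virginica, which overlap.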


# 2 Library Setup

https://scikit-learn.org/stable/modules/naive_bayes.html

In [None]:
import pandas as pd
import numpy as np


from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

import matplotlib.pyplot as plt

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

from sklearn import metrics


# 3 Overall data inspection

In [None]:
iris_bunch = load_iris()

In [None]:
type(iris_bunch)

https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html

In [None]:
iris = iris_bunch.data

In [None]:
iris.shape

In [None]:
type(iris)

In [None]:
iris_bunch.feature_names

Transform the data from a NumPy array and a list of column names into a pandas DataFrame for exploratory data analysis.

In [None]:
iris_df = pd.DataFrame(iris,columns=iris_bunch.feature_names)
iris_df['species'] = iris_bunch.target
iris_df
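
As an alternative sketch of the same transformation, load_iris can return a pandas DataFrame directly through its as_frame argument. The names iris_frame and species_name below are hypothetical and are not used elsewhere in this tutorial; the extra column simply maps the numeric class codes to readable species names.

```python
from sklearn.datasets import load_iris

# as_frame=True makes the Bunch carry a ready-made DataFrame in .frame,
# with the four feature columns plus a numeric 'target' column.
bunch = load_iris(as_frame=True)

# Rename 'target' to 'species' to match the column name used in this tutorial.
iris_frame = bunch.frame.rename(columns={'target': 'species'})

# Hypothetical helper column: map class codes 0, 1, 2 to the species names.
code_to_name = dict(enumerate(bunch.target_names))
iris_frame['species_name'] = iris_frame['species'].map(code_to_name)

print(iris_frame.head())
```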

In [None]:
iris_df.info()

In [None]:
iris_df.describe()

# 4 NB model building using sklearn package

In [None]:
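gnb = GaussianNB()
# Fit on all 150 rows and predict those same rows. This only demonstrates the API;
# the hold-out evaluation in section 6 gives a fairer picture of performance.
X_all = iris_df.drop('species', axis=1)
y_pred = gnb.fit(X_all, iris_df['species']).predict(X_all)
y_pred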

# 5 Explanatory data exploration

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
iris_df.boxplot(column='petal length (cm)',by='species',ax=ax)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
iris_df.boxplot(column='sepal length (cm)',by='species',ax=ax)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
iris_df.boxplot(column='petal width (cm)',by='species',ax=ax)
plt.show()

In [None]:
fig, ax = plt.subplots(figsize=(10,6))
iris_df.boxplot(column='sepal width (cm)',by='species',ax=ax)
plt.show()

In [None]:
iris_df[iris_df['species']==1].describe()

In [None]:
iris_df[iris_df['species']==2].describe()

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['species']==0].drop('species',axis=1),figsize=(15, 10),alpha=1)
plt.show()

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['species']==1].drop('species',axis=1),figsize=(15, 10),alpha=1)
plt.show()

In [None]:
pd.plotting.scatter_matrix(iris_df[iris_df['species']==2].drop('species',axis=1),figsize=(15, 10),alpha=1)
plt.show()

# 6 Simple hold-out evaluation

In [None]:
X_train, X_test, y_train, y_test = train_test_split(iris_df.drop('species',axis=1), iris_df['species'], test_size=0.4, random_state=0,stratify=iris_df['species'])

Note that we passed the stratify argument to train_test_split, set to the species column, so that each target class appears in the same proportion in the training and test sets. sklearn does NOT stratify by default, whereas createDataPartition in R's caret package does.
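
To see what stratification changes, the sketch below runs the same split twice, once with stratify and once without, and counts the classes that land in each test set. It is an illustration under the same settings as above (test_size=0.4, random_state=0); the variable names are assumptions of this example.

```python
from collections import Counter

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Stratified split: 40% of each class goes to the test set.
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)

# Unstratified split: rows are drawn at random, so class counts can drift.
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.4, random_state=0)

strat_counts = Counter(y_test_strat.tolist())
plain_counts = Counter(y_test_plain.tolist())

print("stratified test counts:  ", sorted(strat_counts.items()))
print("unstratified test counts:", sorted(plain_counts.items()))
```

With 50 rows per class and a 40% test fraction, the stratified split places exactly 20 rows of each class in the test set. The unstratified counts depend on the random draw and need not be equal.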

In [None]:
X_train.describe()

In [None]:
X_test.describe()

In [None]:
y_train.value_counts()

In [None]:
y_test.value_counts()

# 7 Generate performance metrics

In [None]:
gnb = GaussianNB()

gnb = gnb.fit(X_train, y_train)

y_pred_test = gnb.predict(X_test)
y_pred_train = gnb.predict(X_train)

In [None]:
gnb.class_prior_

In [None]:
iris_bunch.feature_names

In [None]:
iris_bunch.target_names

In [None]:
gnb.classes_

In [None]:
# means of features by class
gnb.theta_

In [None]:
# variance of features by class
# (var_ replaces sigma_, which was deprecated in scikit-learn 1.0 and removed in 1.2)
gnb.var_
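
The class priors, per-class means, and per-class variances shown above are all a Gaussian naive Bayes model needs to make a prediction. As a rough sketch (not the library's actual code), the block below refits the model so that it is self-contained, then scores each class as log prior plus the sum of per-feature Gaussian log densities, and picks the class with the highest score. The function name manual_predict is hypothetical, and the fallback to sigma_ is an assumption made only to cover scikit-learn versions older than 1.0.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
gnb = GaussianNB().fit(X_train, y_train)

means = gnb.theta_                                   # shape (n_classes, n_features)
# var_ on current scikit-learn; sigma_ only on versions before 1.0
variances = gnb.var_ if hasattr(gnb, "var_") else gnb.sigma_
log_prior = np.log(gnb.class_prior_)                 # shape (n_classes,)


def manual_predict(X_new):
    """Hypothetical helper: predict classes from the fitted parameters."""
    # Broadcast to shape (n_samples, n_classes, n_features)
    diff = X_new[:, None, :] - means[None, :, :]
    # Log density of an independent Gaussian per feature, summed over features
    log_likelihood = -0.5 * (
        np.log(2.0 * np.pi * variances)[None, :, :]
        + diff ** 2 / variances[None, :, :]
    ).sum(axis=2)
    joint = log_prior[None, :] + log_likelihood      # shape (n_samples, n_classes)
    return gnb.classes_[np.argmax(joint, axis=1)]


manual = manual_predict(X_test)
agree = np.array_equal(manual, gnb.predict(X_test))
print("manual predictions match gnb.predict:", agree)
```

Working in log space avoids multiplying many small probabilities together, which is why the prior and the per-feature densities are added rather than multiplied.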

In [None]:
confusion_matrix(y_train,y_pred_train)

In [None]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_train,y_pred_train),
    display_labels=iris_bunch.target_names
    )
disp.plot(values_format='',cmap=plt.cm.Blues)
plt.show()

In [None]:
confusion_matrix(y_test,y_pred_test)

In [None]:
disp = ConfusionMatrixDisplay(
    confusion_matrix=confusion_matrix(y_test,y_pred_test),
    display_labels=iris_bunch.target_names
    )
disp.plot(values_format='',cmap=plt.cm.Blues)
plt.show()


In [None]:
iris_bunch.target_names

In [None]:
print(metrics.classification_report(y_train,y_pred_train, target_names=iris_bunch.target_names))

In [None]:
print(metrics.classification_report(y_test,y_pred_test, target_names=iris_bunch.target_names))
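
The numbers in the classification report come straight from the confusion matrix. The sketch below, which refits the model so it can run on its own, derives per-class precision and recall by hand and compares them with scikit-learn's precision_score and recall_score. The variable names are assumptions of this example, and it relies on every class being predicted at least once so that no division by zero occurs.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0, stratify=y)
y_pred = GaussianNB().fit(X_train, y_train).predict(X_test)

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
true_pos = np.diag(cm)

precision = true_pos / cm.sum(axis=0)   # correct / all predicted as that class
recall = true_pos / cm.sum(axis=1)      # correct / all truly in that class
accuracy = true_pos.sum() / cm.sum()    # correct / all test rows

print("precision:", precision)
print("recall:   ", recall)
print("accuracy: ", accuracy)

# Cross-check against scikit-learn's own per-class scores
print(np.allclose(precision, precision_score(y_test, y_pred, average=None)))
print(np.allclose(recall, recall_score(y_test, y_pred, average=None)))
```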

In [None]:
# Collect the per-class mean and variance of every feature into one table.
# pd.concat replaces DataFrame.append (removed in pandas 2.0);
# var_ replaces sigma_ (removed in scikit-learn 1.2).
frames = []

for i, feature in enumerate(iris_bunch.feature_names):
  frames.append(pd.DataFrame(data={'species':iris_bunch.target_names,'feature':feature,'mean':gnb.theta_[:,i],'var':gnb.var_[:,i]}))

model_info = pd.concat(frames, ignore_index=True)

# standard deviation is easier to read than variance because it is in cm
model_info['sd'] = np.sqrt(model_info['var'])
model_info