# Exercise 00: Examples

## Installing Modules

In [None]:
import sys
!{sys.executable} -m pip install -U pip
!{sys.executable} -m pip install -U scikit-learn matplotlib seaborn pandas

## Loading and Assessing the Iris Dataset

Load the pre-canned iris dataset:

In [None]:
from sklearn import datasets
dataset = datasets.load_iris()

`dataset` has type [`Bunch`](https://scikit-learn.org/stable/modules/generated/sklearn.utils.Bunch.html), which is essentially a fancy dictionary:

In [None]:
type(dataset)

In [None]:
dataset.keys()
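To illustrate what "fancy dictionary" means, here is a minimal sketch of a dict whose keys can also be read as attributes. `MiniBunch` is a hypothetical stand-in written for this example, not scikit-learn's actual implementation, and the values are made up.

```python
class MiniBunch(dict):
    """Hypothetical stand-in for a Bunch: a dict whose keys are also attributes."""

    def __getattr__(self, key):
        # Only called when normal attribute lookup fails,
        # so regular dict methods such as keys() still work.
        try:
            return self[key]
        except KeyError:
            raise AttributeError(key) from None


# Made-up values, only to show the two access styles.
bunch = MiniBunch(data=[[1.0, 2.0]], target=[0])

# Key access and attribute access return the very same object.
same_object = bunch["data"] is bunch.data
print(same_object)
print(sorted(bunch.keys()))
```

This is why the rest of this notebook can use `dataset['data']` and `dataset.data` interchangeably.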

The keys we're most interested in are `'data'`, `'target'`, `'target_names'`, and `'feature_names'`.

`'target_names'` contains the labels for each class:

In [None]:
dataset['target_names']

and `'target'` contains, for each entry in `'data'`, the **index** into `'target_names'` of its correct class:

In [None]:
dataset['target']
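The index relationship can be sketched without loading anything. The class names below are the three iris species; `sample_targets` is a short made-up list standing in for the real 150-element array.

```python
# The three iris class names, in index order.
target_names = ["setosa", "versicolor", "virginica"]

# Made-up target indices standing in for dataset['target'].
sample_targets = [0, 0, 1, 2, 1]

# Each target value is a position in target_names.
labels = [target_names[i] for i in sample_targets]
print(labels)
```

With the real NumPy arrays, `dataset['target_names'][dataset['target']]` performs the same lookup for every example at once.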

The `'data'` array can be quite large. Each row is an example and each column is a feature for that example. In this case, there are 150 examples, each with 4 features:

In [None]:
dataset['data'].shape

In [None]:
dataset['data'][0]

What do these numbers mean?  The `'feature_names'` array tells us:

In [None]:
dataset['feature_names']

We can get a feel for the iris dataset by creating scatter plots from pairs of features.

Documentation of interest:
- [subplots](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.subplots.html)
- [scatter](https://matplotlib.org/stable/api/_as_gen/matplotlib.pyplot.scatter.html)
- [Axes.set](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.set.html)
- [Axes.legend](https://matplotlib.org/stable/api/_as_gen/matplotlib.axes.Axes.legend.html)

In [None]:
import matplotlib.pyplot as plt

_, ax = plt.subplots()

scatter = ax.scatter(
    dataset.data[:, 0],
    dataset.data[:, 1],
    c=dataset.target
)

ax.set(
    xlabel=dataset.feature_names[0],
    ylabel=dataset.feature_names[1]
)

ax.legend(
    scatter.legend_elements()[0],
    dataset.target_names,
    loc="lower right",
    title="Classes"
);
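The plot above covers only one pair of features. As a rough sketch of how many such plots there are, the block below enumerates every distinct pair of the four features. The feature names are written out by hand so the block runs without scikit-learn.

```python
from itertools import combinations

# The four iris feature names, in column order.
feature_names = [
    "sepal length (cm)",
    "sepal width (cm)",
    "petal length (cm)",
    "petal width (cm)",
]

# Every unordered pair of column indices: (0, 1), (0, 2), ...
feature_pairs = list(combinations(range(len(feature_names)), 2))

for x_idx, y_idx in feature_pairs:
    print(f"{feature_names[x_idx]} vs {feature_names[y_idx]}")

print(len(feature_pairs), "pairs")
```

Each `(x_idx, y_idx)` pair could be passed to the scatter code above in place of columns `0` and `1`.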

Creating these plots manually can be tedious, so we'll use Seaborn's `pairplot` to plot every feature combination in one command.

First, we need to convert the `scikit-learn` dataset into a `pandas` dataframe.

In [None]:
import pandas as pd
import numpy as np
import seaborn as sns

sns.set(style="ticks", color_codes=True)

df = pd.DataFrame(
    data=np.c_[dataset['data'], dataset['target']],   # concatenate 'data' and 'target' values
    columns=dataset['feature_names'] + ['Species'] # concatenate 'feature_names' and "Species"
)
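As a rough picture of what the column concatenation does, the sketch below appends one target value to the end of each feature row using plain Python lists. The numbers and the `f0`..`f3` column names are made up for illustration.

```python
# Made-up feature rows and targets standing in for the real arrays.
features = [
    [1.0, 2.0, 3.0, 4.0],
    [5.0, 6.0, 7.0, 8.0],
]
targets = [0, 1]

# Append each target to its row; the float() mirrors how a single
# floating-point array stores the integer targets as floats.
rows = [row + [float(t)] for row, t in zip(features, targets)]

# Hypothetical column names: one per feature, plus the label column.
columns = ["f0", "f1", "f2", "f3"] + ["Species"]

print(columns)
for row in rows:
    print(row)
```

Because the combined array is floating point, the `Species` column in `df` initially holds `0.0`, `1.0`, and `2.0` rather than names.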

In [None]:
df

Next, we'll convert the numeric `'Species'` values to strings by converting that column to "category" data and then renaming the categories.

In [None]:
df['Species'] = df['Species'].astype('category')
df['Species'] = df['Species'].cat.rename_categories(dataset['target_names'])

Useful documentation:
* [pairplot](https://seaborn.pydata.org/generated/seaborn.pairplot.html)

**NOTE: THIS CAN TAKE A WHILE TO RENDER!**

In [None]:
g = sns.pairplot(
    df,
    hue="Species"
)

## Training vs Testing

We want to split the data we have into two chunks:
* a training set that we use to teach our model
* a testing set that we use to assess model performance

It is **critical** that our model gets no information about the testing dataset,
and that the statistics of the training and testing datasets are consistent.

In this example, we'll use 60% of our data for training, and 40% for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    dataset['data'], dataset['target'], test_size=0.4, random_state=10
)
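Conceptually, the split shuffles the example indices and cuts them into two disjoint groups. The sketch below shows that idea with the standard library only. `shuffle_split` is a hypothetical helper, and its shuffle order will not match what `train_test_split` produces for the same seed.

```python
import random


def shuffle_split(n_samples, test_size, seed):
    """Hypothetical helper: return (train_indices, test_indices)."""
    indices = list(range(n_samples))
    # A seeded generator makes the shuffle reproducible.
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_size)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = shuffle_split(150, test_size=0.4, seed=10)

print(len(train_idx), len(test_idx))
# No example appears in both sets.
print(set(train_idx).isdisjoint(test_idx))
```

A purely random split does not guarantee matching class proportions in both sets. `train_test_split` accepts a `stratify` argument (for example `stratify=dataset['target']`) for cases where that matters.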

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Classify each test example by a vote among its 3 nearest training examples.
clf = KNeighborsClassifier(n_neighbors=3)
clf.fit(X_train, y_train)
predicted = clf.predict(X_test)

print(
    f"Classification report for classifier {clf}:\n"
    f"{metrics.classification_report(y_test, predicted)}\n"
)

disp = metrics.ConfusionMatrixDisplay.from_predictions(y_test, predicted)
print(f"Confusion matrix:\n{disp.confusion_matrix}")
disp.figure_.suptitle("K Nearest Neighbors Confusion Matrix")
plt.show()
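To show roughly what the classifier and the confusion matrix are doing, here is a minimal pure-Python sketch: a 3-nearest-neighbor majority vote using Euclidean distance, followed by a tally of true versus predicted classes. The `toy_*` data and the `knn_predict` helper are made up for this example. scikit-learn's implementation is far more efficient and handles ties and distance metrics in its own way.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Hypothetical helper: majority label among the k nearest training points."""
    by_distance = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )
    votes = Counter(train_labels[i] for i in by_distance[:k])
    return votes.most_common(1)[0][0]


# Made-up 2D data: class 0 clusters near (0, 0), class 1 near (1, 1).
toy_train_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
                    (1.0, 1.0), (1.1, 0.9), (0.9, 1.2)]
toy_train_labels = [0, 0, 0, 1, 1, 1]

toy_test_points = [(0.1, 0.1), (1.0, 1.1), (0.3, 0.3)]
# The last true label deliberately disagrees with its neighbors.
toy_test_labels = [0, 1, 1]

toy_predicted = [
    knn_predict(toy_train_points, toy_train_labels, point)
    for point in toy_test_points
]

# confusion[i][j] counts examples of true class i predicted as class j.
confusion = [[0, 0], [0, 0]]
for true_label, predicted_label in zip(toy_test_labels, toy_predicted):
    confusion[true_label][predicted_label] += 1

print(toy_predicted)
print(confusion)
```

Rows correspond to the true class and columns to the predicted class, the same layout used in the confusion matrix above, so off-diagonal entries are misclassifications.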