<a href="https://colab.research.google.com/github/moktan456/Data-Mining/blob/main/05_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Practical 05 - Data Classification

This practical is designed to be run over two weeks.
You are encouraged to complete as much of the prac as you can before the second week so that we can spend time addressing questions and problems that people have.

As usual, you should save a copy of this notebook in Google Drive (or on your own system if you are not using Colaboratory).


# Q1 Iris classification
We will begin by running some classification algorithms on the Iris data set.
This is one of the most commonly used machine learning datasets for classification. The data can be found at the [UCI repository](https://archive.ics.uci.edu/ml/datasets/iris); however, its small size and great popularity mean that many machine learning libraries come bundled with the data.

In this task you will explore some of the basics of data classification using [scikit-learn](https://scikit-learn.org/stable/index.html) (`sklearn`).

1. Import the Iris data from `sklearn`. Read the documentation that describes the data, what the attributes are, and what the classification task is.
1. Split the data into two subsets:
  - A training subset comprising 75% of the data
  - A testing subset comprising 25% of the data
1. Using the k-NN classifier (`sklearn.neighbors.KNeighborsClassifier`):
  - Train the classifier with the following options and record the error rate:
    - `weights = uniform` or `distance`
    - `k = 1, 3, 7, 11, 17,` or `21`
  - Of the 12 combinations of the above, choose the one with the lowest error rate as your *champion*.
  - Train your *champion* using the entire training data set, and evaluate it on the test set.
  - Create a confusion matrix by comparing the predicted and actual classes for the test data.
1. Using a decision tree classifier:
 - Train the classifier using both the `Gini index` and `entropy` criteria for splitting.
 - Choose the classifier which has the highest F1 score as your best classifier.
 - Plot the decision tree for your best classifier.
1. Using a naive-Bayes classifier:
 - Train the classifier on all the training data.
 - Predict the classes of the test data.
 - Plot a confusion matrix.

## Import the Iris data from sklearn.
Read the documentation that describes the data, what the attributes are, and what the classification task is.

In [1]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets

In [None]:
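iris = datasets.load_iris()
# Later cells refer to the data matrix as X and the class labels as y
X, y = iris['data'], iris['target']
# Inspect the data structure
iris.keys()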

In [None]:
# read the description to learn more about the data set
print(iris['DESCR'])

## Split the data into two subsets
 Split the data into two subsets:
  - A training subset comprising 75% of the data
  - A testing subset comprising 25% of the data

In [None]:
# Normally we are given the train/test data separately,
# however for this prac we will take 25% of the iris data and pretend that it's test data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(?, ?, # this should be the data matrix and the class labels
                                                    test_size=?, # use a test size of 25%
                                                    random_state=4) # this random state ensures that we get the same subset each time we call this cell

In [None]:
X_train.shape, X_test.shape
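
As a point of comparison, here is a minimal self-contained sketch of one way the 75%/25% split could be made. It assumes the data matrix and labels are taken directly from `load_iris()` and reuses the `random_state=4` from the cell above; the per-class count at the end is an extra check that is not part of the original task.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# test_size=0.25 holds out a quarter of the rows; random_state fixes the shuffle
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4)

print(X_train.shape, X_test.shape)
# Count how many instances of each class landed in the test subset
print(np.bincount(y_test, minlength=3))
```

With 150 instances, a 25% test size is rounded up to 38 test rows, leaving 112 for training.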

## Explore different ways to split data for cross validation

Among the splitters that sklearn provides, we will look at three methods to divide data into train/test sets:
- ShuffleSplit
  - Random sampling
- KFold
  - Ordered sampling
- StratifiedKFold
  - Stratified sampling

Use each of the above methods to create a 10-fold split of the data for cross validation and visualise the splits.

In [None]:
from sklearn.model_selection import StratifiedKFold, KFold, ShuffleSplit

In [None]:
# This is random sampling
ss = ShuffleSplit(n_splits=10, test_size=15, random_state=4)
# This is non-random sampling, we just break the data into 10 contiguous subsets
kf = KFold(n_splits=10)
# Ensuring the balance between classes in the model/validate sets
# means we should use stratified sampling
skf = StratifiedKFold(n_splits=10)


In [None]:
# This cell sets up a nice visualisation that I found on the scikit-learn documentation page.
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
cmap_data = plt.cm.Paired
cmap_cv = plt.cm.coolwarm

def plot_cv_indices(cv, X, y, group, ax, n_splits, lw=10):
    """
    Create a sample plot for indices of a cross-validation object.
    Adapted from https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#define-a-function-to-visualize-cross-validation-behavior

    Parameters
    ----------
    cv: cross validation method

    X : training data

    y : data labels

    group : group labels

    ax : matplotlib axes object

    n_splits : number of splits

    lw : line width for plotting
    """

    # Generate the training/testing visualizations for each CV split
    for ii, (tr, tt) in enumerate(cv.split(X=X, y=y, groups=group)):
        # Fill in indices with the training/test groups
        indices = np.array([np.nan] * len(X))
        indices[tt] = 1
        indices[tr] = 0

        # Visualize the results
        ax.scatter(range(len(indices)), [ii + .5] * len(indices),
                   c=indices, marker='_', lw=lw, cmap=cmap_cv,
                   vmin=-.2, vmax=1.2)

    # Plot the data classes and groups at the end
    ax.scatter(range(len(X)), [ii + 1.5] * len(X),
               c=y, marker='_', lw=lw, cmap=cmap_data)

    ax.scatter(range(len(X)), [ii + 2.5] * len(X),
               c=group, marker='_', lw=lw, cmap=cmap_data)

    # Formatting
    yticklabels = list(range(n_splits)) + ['class', 'group']
    ax.set(yticks=np.arange(n_splits+2) + .5, yticklabels=yticklabels,
           xlabel='Sample index', ylabel="CV iteration",
           ylim=[n_splits+2.2, -.2])
    ax.set_title('{}'.format(type(cv).__name__), fontsize=15)
    return ax

In [None]:
# Set up a figure with three subplots
fig, ax = plt.subplots(1,3, figsize=(18,6))
# visualise the ShuffleSplit algorithm
plot_cv_indices(ss,
                X,
                y,
                group=None,
                ax=ax[0],
                n_splits=10)
# visualise the KFold algorithm
plot_cv_indices(kf,
                X,
                y,
                group=None,
                ax=ax[1],
                n_splits=10)
# visualise the StratifiedKFold algorithm
plot_cv_indices(skf,
                X,
                y,
                group=None,
                ax=ax[2],
                n_splits=10)
plt.show()

Have a look at the above figure and note the following:
- The horizontal bars represent the 150 instances in our data set, with their index shown on the horizontal axis.
- The vertical axis shows the different cross validation iterations, plus a visual indicator of the class for each instance.
  - The blue color indicates training data, while orange represents test data. Note how this changes between the three splitting methods.
  - There are three classes of equal number, so we have three equal-length bars in the second-to-last row. The data are sorted so that the first 50 instances are all of class 0, etc.
- Ignore the "group" row, it's not useful here.

From the above figure, decide which splitting algorithm is likely to give us the best results.
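
The same comparison can also be made numerically. The sketch below is an illustration rather than part of the original exercise: it rebuilds the three splitters with the settings used earlier and counts how many instances of each class fall into every test fold. The `splitters` and `fold_counts` names are introduced here only for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import KFold, ShuffleSplit, StratifiedKFold

X, y = datasets.load_iris(return_X_y=True)

splitters = {
    'ShuffleSplit': ShuffleSplit(n_splits=10, test_size=15, random_state=4),
    'KFold': KFold(n_splits=10),
    'StratifiedKFold': StratifiedKFold(n_splits=10),
}

# For every splitter, count the class labels in each of the 10 test folds
fold_counts = {}
for name, cv in splitters.items():
    fold_counts[name] = np.array(
        [np.bincount(y[test_idx], minlength=3) for _, test_idx in cv.split(X, y)])
    print(name)
    print(fold_counts[name])
```

Each row of the printed arrays is one fold and each column is one class, so rows dominated by a single class indicate folds whose test set does not represent all three species.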

## Use the k-NN classifier
Using the k-NN classifier (`sklearn.neighbors.KNeighborsClassifier`):
  - Train the classifier with the following options and record the error rate:
    - `weights = uniform` or `distance`
    - `n_neighbors = 1, 3, 7, 11, 17,` or `21`
  - Of the 12 combinations of the above, choose the one with highest accuracy as your *champion*.
  - Create a confusion matrix by comparing the predicted and actual classes for the test data.

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics
from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.metrics import ConfusionMatrixDisplay

In [None]:
# Create a dictionary of all the parameters we'll be iterating over
parameters = {'weights': (?,?), # this should be the different weighting schemes
              'n_neighbors':[?]} # this should be a list of the numbers of nearest neighbours
# make a classifier object
knn = KNeighborsClassifier()
# create a GridSearchCV object to do the training with cross validation
gscv = GridSearchCV(estimator=knn,
                    param_grid=parameters,
                    cv=?,  # the cross validation folding pattern
                    scoring='accuracy')
# now train our model
best_knn = gscv.fit(X_train, y_train)

In [None]:
best_knn.best_params_, best_knn.best_score_
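
The task asks for the error rate of every combination, not just the best one. A possible way to get it, sketched below, is to read the `cv_results_` attribute of the fitted `GridSearchCV` object and convert each mean accuracy into an error rate (1 minus accuracy). The 10-fold `StratifiedKFold` used for `cv` is an assumption made for this example, as is the `error_rates` name.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4)

parameters = {'weights': ('uniform', 'distance'),
              'n_neighbors': [1, 3, 7, 11, 17, 21]}
gscv = GridSearchCV(estimator=KNeighborsClassifier(),
                    param_grid=parameters,
                    cv=StratifiedKFold(n_splits=10),  # assumed folding pattern
                    scoring='accuracy')
gscv.fit(X_train, y_train)

# cv_results_ holds one entry per parameter combination
error_rates = [(params, 1.0 - score)
               for params, score in zip(gscv.cv_results_['params'],
                                        gscv.cv_results_['mean_test_score'])]
# Print the combinations from lowest to highest cross-validated error rate
for params, err in sorted(error_rates, key=lambda item: item[1]):
    print(params, round(err, 4))
```

The first printed line corresponds to the champion, and its error rate matches `1 - gscv.best_score_`.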

In [None]:
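knn = KNeighborsClassifier(weights=best_knn.best_params_['weights'],
                           n_neighbors=best_knn.best_params_['n_neighbors'])
knn.fit(X_train, y_train)  # train the champion on the entire training set
knn.score(X_test, y_test)  # accuracy on the held-out test set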

In [None]:
fig, ax = plt.subplots(1,1, figsize=(6, 6))

ConfusionMatrixDisplay.from_estimator(best_knn,
                                      X_test, y_test,
                                      display_labels=iris['target_names'],
                                      ax=ax)
plt.tight_layout()
plt.show()
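
If the raw numbers behind the plot are needed, the confusion matrix can also be computed directly. This is a minimal sketch that assumes a hypothetical champion with `n_neighbors=3` and uniform weights; substitute whatever the grid search actually selected.

```python
from sklearn import datasets, metrics
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=4)

# Hypothetical champion settings, used here only for illustration
knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)

# Rows are the actual classes, columns are the predicted classes
cm = metrics.confusion_matrix(y_test, y_pred)
accuracy = metrics.accuracy_score(y_test, y_pred)
print(cm)
print('accuracy:', accuracy, 'error rate:', 1 - accuracy)
```

The diagonal of the matrix counts the correctly classified instances, so the accuracy equals the trace divided by the total number of test instances.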

## Inspect the splitting schemes
In the previous plot we found that the test data set had unbalanced classes, even though the input data has an even ratio of the three classes.
This is because our initial split of test/train data was done without regard to the class labels.

Now we will explore the effect of different splitting schemes on the training of our data.
We'll split the data using `ShuffleSplit`, `KFold`, and `StratifiedKFold`, and see how that affects the training of the classifier.
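
As an aside, the initial train/test split itself can be made class-aware. The sketch below is an optional variation, not something the practical requires: it passes the labels to the `stratify` argument of `train_test_split` and compares the class counts in the test subset with those from the plain split. The `_plain` and `_strat` suffixes are names chosen for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split

X, y = datasets.load_iris(return_X_y=True)

# Plain split: class labels are ignored when choosing the test rows
_, _, _, y_test_plain = train_test_split(
    X, y, test_size=0.25, random_state=4)

# Stratified split: class proportions are preserved in both subsets
_, _, _, y_test_strat = train_test_split(
    X, y, test_size=0.25, random_state=4, stratify=y)

plain_counts = np.bincount(y_test_plain, minlength=3)
strat_counts = np.bincount(y_test_strat, minlength=3)
print('plain split     :', plain_counts)
print('stratified split:', strat_counts)
```

With 38 test instances and three equally sized classes, the stratified counts cannot differ from each other by more than one.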

In [None]:
fig, ax = plt.subplots(2,5, figsize=(18,6))
# remake this object so that we get back to the same random initial state
ss = ShuffleSplit(n_splits=10, test_size=15, random_state=4)
print("Using ShuffleSplit")
for i, (model, validate) in enumerate(ss.split(X, y)):
  knn = KNeighborsClassifier(n_neighbors=3, weights='uniform')
  classifier = knn.fit(X[model], y[model])
  ConfusionMatrixDisplay.from_estimator(classifier,
                                        X[validate], y[validate],
                                        display_labels=iris['target_names'],
                                        ax=ax.ravel()[i])


plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(2,5, figsize=(18,6))

print("Using KFolds")
for i, (model, validate) in enumerate(kf.split(X, y)):
  knn = KNeighborsClassifier(n_neighbors=1)
  classifier = knn.fit(X[model], y[model])
  ConfusionMatrixDisplay.from_estimator(classifier,
                                        X[validate], y[validate],
                                        display_labels=iris['target_names'],
                                        ax=ax.ravel()[i])


plt.tight_layout()
plt.show()

In [None]:
fig, ax = plt.subplots(2,5, figsize=(18,6))

print("Using StratifiedKFolds")
for i, (model, validate) in enumerate(skf.split(X, y)):
  knn = KNeighborsClassifier(n_neighbors=1)
  classifier = knn.fit(X[model], y[model])
  ConfusionMatrixDisplay.from_estimator(classifier,
                                        X[validate], y[validate],
                                        display_labels=iris['target_names'],
                                        ax=ax.ravel()[i])


plt.tight_layout()
plt.show()

## Use a decision tree classifier
Using a decision tree classifier:
 - Train the classifier using both the `Gini index` and `entropy` criteria for splitting, and a range of `min_samples_split` between 3 and 20.
 - Choose the classifier which has the highest accuracy score as your best classifier.
 - Plot the decision tree for your best classifier.

In [None]:
from sklearn import tree

In [None]:
# Create a dictionary of all the parameters we'll be iterating over
parameters = {'criterion': (?,?),  # this should be the different splitting criteria
              'min_samples_split':[?]} # this should be the different values for min_samples_split
dtc = tree.DecisionTreeClassifier()
gscv = GridSearchCV(estimator=dtc,
                    param_grid=parameters,
                    cv=5,
                    scoring='accuracy')
best_dtc = gscv.fit(X_train, y_train)
best_dtc.best_params_, best_dtc.best_score_

In [None]:
fig, ax = plt.subplots(1,1, figsize=(12,12))
tree.plot_tree(best_dtc.best_estimator_,
               filled=True, # color the nodes based on class/purity
               ax=ax, fontsize=12)
plt.show()
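
When a figure is inconvenient (for example in a terminal session), the same tree can be printed as text with `tree.export_text`. The sketch below assumes hypothetical settings (`criterion='gini'`, `min_samples_split=5`) instead of the grid search result, and fixes `random_state=0` only so that repeated runs give the same tree.

```python
from sklearn import datasets, tree
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

# Hypothetical settings; replace them with the best parameters you found
dtc = tree.DecisionTreeClassifier(criterion='gini',
                                  min_samples_split=5,
                                  random_state=0)
dtc.fit(X_train, y_train)

# One line per node: the split threshold for internal nodes, the class for leaves
rules = tree.export_text(dtc, feature_names=list(iris.feature_names))
print(rules)
```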

In [None]:
fig, ax = plt.subplots(1,1, figsize=(6, 6))

ConfusionMatrixDisplay.from_estimator(best_dtc,
                                      X_test, y_test,
                                      display_labels=iris['target_names'],
                                      ax=ax)
plt.tight_layout()
plt.show()

## Use a naive-Bayes classifier
Using a naive-Bayes classifier:
 - Train the classifier on all the training data.
 - Predict the classes of the test data.
 - Plot a confusion matrix.

In [None]:
from sklearn import naive_bayes

In [None]:
# no parameters to adjust so no need to optimise, just train
fig, ax = plt.subplots(1,1)
nb = naive_bayes.GaussianNB()
nb.fit(X_train, y_train)
ConfusionMatrixDisplay.from_estimator(nb,
                                      X_test, y_test,
                                      display_labels=iris['target_names'],
                                      ax=ax)
plt.tight_layout()
plt.show()
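
The plotting call above predicts the test classes internally. To see the predictions themselves, a sketch along the following lines may help; it repeats the split so that it runs on its own, and the per-class report at the end is an addition that the task does not ask for.

```python
import numpy as np
from sklearn import datasets, metrics, naive_bayes
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, iris.target, test_size=0.25, random_state=4)

nb = naive_bayes.GaussianNB()
nb.fit(X_train, y_train)

# Hard class predictions and the class probabilities behind them
y_pred = nb.predict(X_test)
proba = nb.predict_proba(X_test)

print(y_pred)
print(np.round(proba[:5], 3))
# Precision, recall and F1 score for each species
print(metrics.classification_report(y_test, y_pred,
                                    labels=[0, 1, 2],
                                    target_names=iris.target_names))
```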

# Helpful tools

## Corner plot

A useful plot for visualising multi-dimensional data is the corner-plot or pair plot.
There is a function built into pandas called `scatter_matrix`, and the plotting package `seaborn` also has a function called `pairplot`.
Let's have a look at them below.

In [None]:
import pandas as pd
import seaborn as sns

In [None]:
# Load the data and labels as data frames and then join them to make a new one
df1 = pd.DataFrame(X, columns=iris.feature_names)
df2 = pd.DataFrame(y, columns=['class'])
df = df1.join(df2)

In [None]:
df.describe()

In [None]:
pd.plotting.scatter_matrix(df1,c=df['class'], figsize=(15, 15), marker='o',
                                 hist_kwds={'bins': 20}, s=60, alpha=.8)
print('Plotted with pandas')

In [None]:
sns.pairplot(df, hue='class', palette=sns.color_palette('colorblind',3))
print("Plotted with seaborn")

The thing that I most prefer about the seaborn plot is that the diagonal entries are still separated by class.
From this plot it is clear that the last two features are good at separating the three classes, whereas the first two attributes are not so useful.
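
One rough way to check this impression is to train the same classifier on each pair of features and compare the cross-validated accuracy. The sketch below is an illustration under assumed settings (a 5-nearest-neighbour classifier and 5-fold cross validation), not part of the original practical.

```python
from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = datasets.load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5)

# Columns 0-1 are the sepal measurements, columns 2-3 the petal measurements
sepal_acc = cross_val_score(knn, X[:, :2], y, cv=5).mean()
petal_acc = cross_val_score(knn, X[:, 2:], y, cv=5).mean()

print('sepal features only:', round(sepal_acc, 3))
print('petal features only:', round(petal_acc, 3))
```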

## Correlation plot

A correlation matrix is similar to the corner plot above but it simply reports the correlation between each of the attributes.

We can compute the correlation matrix using pandas with the `df.corr()` method, and the plot using either `matplotlib` or `seaborn`.

In [None]:
# compute correlation matrix
cor = df.corr()

In [None]:
# plot the correlation matrix with matplotlib
fig, ax = plt.subplots(1,1, figsize=(8,8))
im = ax.imshow(cor)
cb = plt.colorbar(ax=ax, mappable=im)
plt.show()

In [None]:
# use seaborn to do the plot
sns.heatmap(df.corr(), annot=True, cmap=plt.cm.Reds)

Looking at the correlation plot we can see that the petal length/width are highly correlated with the class attribute and are likely useful attributes.
The fact that they are also highly correlated with each other means that we might be able to use just one of the two features.

The sepal width has much lower correlation and so is probably not so useful.
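
The values read off the heatmap can also be listed directly. This sketch uses `np.corrcoef` in place of `df.corr()` so that it runs without pandas; both compute Pearson correlations. The `class_corr` name is introduced here for illustration.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Append the class label as a fifth column and correlate all columns
corr = np.corrcoef(np.column_stack([X, y]), rowvar=False)

# The last column holds each attribute's correlation with the class label
class_corr = corr[:-1, -1]
for name, r in zip(iris.feature_names, class_corr):
    print(f'{name:20s} {r:+.3f}')

# Correlation between petal length and petal width
print('petal length vs petal width:', round(corr[2, 3], 3))
```

Note that treating the class label as a number only makes sense here because the three species happen to be ordered along the petal measurements; for unordered classes this shortcut would be misleading.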