# Workbook 6: Supervised Machine Learning

## Description and aims

This tutorial is designed to give you your first experience of machine learning in practice by implementing a simple nearest-neighbour classifier.

The learning outcomes are:
- experience of implementing the K Nearest Neighbours classification algorithm
- experience of using the sklearn DecisionTree classification algorithm
- experience of working through different preprocessing steps to try to improve the performance of your classifier

<div class="alert alert-warning" style="color:black">
    <h1>Activity 1: Loading and Visualising Data</h1>
   We will start by importing and visualising the two datasets used as examples in the lecture: student marks and Iris.
<ul>
    <li>You should already have uploaded the data and figures from the lecture materials folder - if not, do so now.</li>
    <li>Then run the 5 code cells below to load and display the two datasets</li>
            </ul></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt

import workbook6_utilities as wb6
%matplotlib inline



### The Student marks dataset

In [None]:

grades, result, simpleResult = wb6.load_student_marks_dataset("../lectures/data/assessment-grades-2features.csv")

wb6.plot_student_marks(grades,result,simpleResult)

### Example 2:  Iris flowers <img src="../lectures/figures/ML/Iris-image.png" style="float:right">
- classic Machine Learning Data set
- 4 measurements: sepal and petal width and length
- 50 examples from each of 3 sub-species of iris flowers
- three-class problem:
 - so for some types of algorithm we have to decide whether to make  
   a 3-way classifier or nested 1-vs-rest classifiers
- most ML classifiers can reach over 90% accuracy
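
The shape and class balance described above can be checked directly. This is a minimal sketch using only `sklearn.datasets.load_iris` and NumPy; the variable names `X` and `y` are illustrative and not part of the workbook utilities.

```python
import numpy as np
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

print(X.shape)         # (150, 4): 150 flowers, 4 measurements each
print(np.unique(y))    # [0 1 2]: the three sub-species, encoded as integers
print(np.bincount(y))  # [50 50 50]: 50 examples of each class
```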



In [None]:
import sklearn.datasets
irisX,irisy = sklearn.datasets.load_iris(return_X_y=True)
columnLabels= ("sepal_length", "sepal_width", "petal_length", "petal_width")
title="Scatterplots of 2D slices through the 4D Iris data"
wb6.show_scatterplot_matrix(irisX,irisy,columnLabels,title)

<div class="alert alert-warning" style="color:black">
    <h1>Activity 2: Implementing K-Nearest Neighbours</h1>
</div>
            
Basic process for predicting the label of a new point from the training set:
1. Measure the distance to the new point from every member of the training set
2. Find the K Nearest Neighbours,  
   in other words, the K members of the training set with the smallest distances (*calculated in step 1*)
3. Count the labels that were provided for those K training items,  
   and return the most common one as the predicted label.
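
The three steps can be traced on a tiny example before looking at any class-based implementation. The sketch below uses made-up training data and NumPy shortcuts (`np.argsort`, `collections.Counter`); it is one possible way to express the steps, not the implementation the activities ask you to write.

```python
import numpy as np
from collections import Counter

# Toy training set (made-up values): 5 items, 2 features, labels 0 or 1
X_train = np.array([[1.0, 1.0], [2.0, 1.5], [8.0, 9.0], [9.0, 9.0], [1.0, 2.0]])
y_train = np.array([0, 0, 1, 1, 0])
new_item = np.array([2.0, 2.0])
K = 3

# Step 1: Euclidean distance from the new item to every training item
distances = np.linalg.norm(X_train - new_item, axis=1)

# Step 2: indexes of the K training items with the smallest distances
closest_k = np.argsort(distances)[:K]

# Step 3: majority vote among the labels of those K items
votes = Counter(y_train[closest_k].tolist())
predicted_label = votes.most_common(1)[0][0]

print(closest_k)        # [1 4 0]
print(predicted_label)  # 0
```

All three nearest neighbours carry label 0 here, so the vote is unanimous. With an even K, or more than two classes, ties become possible and an implementation has to decide how to break them.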

Below is a figure illustrating the start and first two steps of the process.  
It is followed by a code cell with a simple implementation of a class for 1-Nearest neighbours. 

Read through the code  to get a sense for how it implements the algorithm.  
Your tutor will discuss it with you in the lab sessions.
<img src="../lectures/figures/ML/kNN-steps.png">



In [None]:
# Example for K = 1 

class simple_1NN:

    def __init__(self,verbose = True):
        # this version only looks at the single nearest neighbour
        self.K=1
        self.verbose= verbose
        
    def fit(self,X,y):
        # ask the data how big it is and store that info
        self.numExemplars = X.shape[0]
        self.numFeatures = X.shape[1]
        # store a copy of the data (X) and the labels (y)
        self.modelX = X
        self.modelY = y
        self.labelsPresent = np.unique(self.modelY) # list the unique values found in the labels provided
        if (self.verbose):
            print("There are {} training examples, each described by values for {} features".format(self.numExemplars,self.numFeatures))
            print("So self.modelX is a 2D array of shape {}".format(self.modelX.shape))
            print("self.modelY is a list with {} entries, each being one of these labels {}".format(len(self.modelY), self.labelsPresent))
        
    def predict(self,newItems):
        # read how many new items there are
        numToPredict = newItems.shape[0]
        # make an empty array to hold their predicted labels
        predictions = np.empty(numToPredict)
        
        # loop through the new items one at a time
        for item in range(numToPredict):
            # predicting its label
            thisPrediction = self.PredictNewItem(newItems[item])
            # adding that prediction to our array
            predictions[item] = thisPrediction
        return predictions
    
    def PredictNewItem(self,newItem):
        
        # Step 1: measure and store distance to each training item
        distFromNewItem = np.zeros((self.numExemplars)) # array with one entry for each training set item, initialised to zero
        for exemplar in range (self.numExemplars):
            distFromNewItem[exemplar] = self.EuclideanDistance(newItem,  self.modelX[exemplar])
  
        # Step 2: find the single closest training example (this is K=1)
        closest = 0
        for trainingExample in range (0, self.numExemplars):
            if  ( distFromNewItem[trainingExample] < distFromNewItem[closest] ):
                closest=trainingExample
 
        # Step 3: count the votes - with K=1 there is only one neighbour, so we don't need to take a vote
        labelOfClosest = self.modelY[closest]
        return labelOfClosest
    
    def EuclideanDistance(self,a,b):
        ## this numpy function calculates the euclidean distance
        return np.linalg.norm(a-b)
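
`np.linalg.norm(a-b)` returns the length of the difference vector, that is, the square root of the sum of the squared per-feature differences. A quick check with made-up values (the names `by_hand` and `by_numpy` are illustrative):

```python
import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

# Written out: square each per-feature difference, add them up, take the root
by_hand = np.sqrt(np.sum((a - b) ** 2))

# The same calculation via the NumPy helper used in EuclideanDistance()
by_numpy = np.linalg.norm(a - b)

print(by_hand, by_numpy)  # both 5.0 (a 3-4-5 triangle)
```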
  


<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.1</h2>
    Run the code provided for K=1 with the two datasets and make sure you understand the outputs and how they are produced
<ul>
    <li>For the marks dataset this creates a plot to show a decision surface</li>
    <li>For the  iris data set this uses a confusion matrix <br> (google what a confusion matrix is if you're not sure)</li>
    </ul>
    </div>

**The Marks dataset - illustrating a 2D Decision surface**

In [None]:



# create and train the classifier
myKNNmodel = simple_1NN()
myKNNmodel.fit(grades,simpleResult) 


#visualise the decision surface
wb6.PlotDecisionSurface(grades, simpleResult, myKNNmodel,"1-NN simplified outcomes", ("exam","cw"),minZero=True)



**The Iris dataset - illustrating a confusion matrix**

In [None]:
# make train/test split 
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
irisX,irisy = load_iris(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)

irisClassNames = ("setosa","versicolor","virginica")

model = simple_1NN()
model.fit(X_train,y_train)
ypred = model.predict(X_test)
cm = confusion_matrix(y_test, ypred)
CMPlot=ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=irisClassNames)
CMPlot.plot()
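
If the confusion matrix is unfamiliar, a hand-sized example may help. In scikit-learn's `confusion_matrix`, each row is a true class and each column a predicted class, so correct predictions lie on the diagonal. The labels below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels for six test items from a three-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 0, 1, 2, 2, 2]

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0 0]
#  [0 1 1]
#  [0 0 2]]

# Correct predictions sit on the diagonal, so accuracy = diagonal sum / total
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 5 of the 6 items are classified correctly
```

The single off-diagonal entry (row 1, column 2) records one item of class 1 that was predicted as class 2.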


<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.3: Create your own implementation of K-Nearest Neighbours</h2>
    Using the code above,  extend the predict method for the class simple_1NN  to use the votes from K>1 neighbours.


<ul>
    <li>Start by creating an empty class called simple_KNN and copying in the pseudocode as comments</li>
    <li>Then copy the code from the simple_1NN class into the relevant places</li>
    <li>You should only need to make minor changes to the __init__ method to set the value of K</li>
    <li>In the PredictNewItem() method you will need to change step 2 and step 3</li>
    <li>The pseudocode suggests one possible way of doing step 2.</li>
    </ul>
    <b> It's often helpful to put in some print() statements to show what is going on as you develop your code</b><br>
        And if you can write your code so that it runs in a 'partially completed' state then you can build it up in bits.
</div>

### Pseudocode for K-Nearest Neighbours
**init()**  :  
SPECIFY function to calculate distance metric d(i,j) for any two items *i* and *j*     
  e.g. Euclidean (continuous variables) or Hamming (categorical)  
SET value of K

**fit(trainingData)** :  

SET numExemplars = READ(number of rows in training data)  
SET numFeatures = READ(number of columns in training data) 

*#Just store a local copy of the training data as two arrays:*   
CREATE_AND_FILL(X_train of shape (numExemplars , numFeatures)).     
CREATE_AND_FILL(y_train of shape( numExemplars))
  
**predict(newItems)** :  
SET numToPredict = READ(number of rows in newItems)  
SET predictions = CREATE_EMPTYARRAY( numToPredict)
 
FOREACH item in (0...,numToPredict-1)    
...SET predictions[item] = predictNewItem ( newItems[item]) 
 
RETURN predictions  


**predictNewItem(newItem)**:

*Step 1:   Make 1D array of distances from newItem to each training set item*   
FOREACH exemplar in (0,...,numExemplars -1)  
...SET distFromNewItem [exemplar] = d (newItem , X_train[exemplar] )   

*Step 2: Get indexes of the K nearest neighbours for our new item*        
SET closestK = GET_IDS_OF_K_CLOSEST(K,distFromNewItem)
 
  
*Step 3: Take a majority vote among the labels of the K nearest neighbours*     
SET labelcounts = CREATE(1D array with one zero value per distinct label)  

FOREACH  k in (0,...K-1)   
... SET thisindex = closestK[k]  
... SET thislabel = y_train[thisindex]  
... INCREMENT labelcounts[thislabel]  

SET thisPrediction = READ(index of labelcounts with highest value)    

RETURN thisPrediction

FUNCTION GET_IDS_OF_K_CLOSEST  
PARAMETER distFromNewItem # 1D array of distances  
PARAMETER K  



SET closestK= EMPTYLIST  
SET arraySize = len(distFromNewItem)  

FOR k in (0,...,K-1)  
... SET thisClosest=0  
... FOR exemplar in (1,...,arraySize -1)  
......IF ( distFromNewItem[exemplar] < distFromNewItem[thisClosest]  )  
......... SET thisClosest = exemplar  
... SET closestK[k] = thisClosest # store this id  
... SET distFromNewItem[thisClosest] = BigNumber # so we don't pick it again in next loop

RETURN closestK
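
The GET_IDS_OF_K_CLOSEST pseudocode translates fairly directly into Python. The sketch below is one possible reading of it: the function name `get_ids_of_k_closest` is illustrative, `np.inf` stands in for BigNumber, and it works on a copy of the distances (an added assumption, so the caller's array is not overwritten). It assumes K is no larger than the number of distances.

```python
import numpy as np

def get_ids_of_k_closest(K, dist_from_new_item):
    """Return the indexes of the K smallest distances, nearest first."""
    # Work on a copy so the caller's distance array is left unchanged
    distances = np.array(dist_from_new_item, dtype=float)
    closest_k = []
    for _ in range(K):
        this_closest = 0
        for exemplar in range(1, len(distances)):
            if distances[exemplar] < distances[this_closest]:
                this_closest = exemplar
        closest_k.append(this_closest)
        # Mask the chosen entry so it is not picked again on the next pass
        distances[this_closest] = np.inf
    return closest_k

example_distances = np.array([4.0, 0.5, 7.0, 1.5, 3.0])
print(get_ids_of_k_closest(3, example_distances))  # [1, 3, 4]
print(np.argsort(example_distances)[:3])           # [1 3 4]
```

The last line shows that `np.argsort` gives the same indexes in a single call for this example, which can be a useful cross-check while you develop your own version.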


In [None]:
# your KNN class code here

class simple_KNN:
    pass  # placeholder so the cell runs: replace with your implementation

<div class="alert alert-warning" style="color:black">
<h2> Activity 2.4: Test your implementation on the two example datasets</h2>
Use the toolbar to copy and paste the two cells from activity 2.1 below here. <br>
Then edit them so that they create and use objects of your new class, instead of the class simple_1NN

Start with K=1 - this should produce the same results as you got in activity 2.1, then try with K = {3,5,7}
<ul>
    <li>Use the student marks for <b>qualitative</b> judgements: how does the decision surface change?</li>
    <li>Use the Iris data set for <b>quantitative</b> judgements :  how does the confusion matrix change?</li>
    </ul>
    </div>

<div class="alert alert-warning" style="color:black" >
<h1> Activity 3: Decision Trees</h1></div>





The image below illustrates how the tree induction process works for the student marks dataset.  
- It was generated by calling the decision tree repeatedly for increasing depths.
- For depth 0 I've just created a text box with the relevant stats in.
<img src="DecisionTreeExample-studentMarks.png">

<div class="alert alert-warning" style="color:black">
<h2> Activity 3.1: exploring how to control tree-growth to prevent over-fitting</h2>
The aim of this activity is for you to experiment with what happens when you change three parameters that affect how big and complex the tree is allowed to get.
<ul>
    <li> max_depth</li>
    <li>min_samples_split, (default value is 2)</li>
    <li>min_samples_leaf, (default value is 1)</li>
    </ul>


Experiment with the Iris data set below to see if you can work out what each of these parameters does, and how it affects the tree 
<ul>
<li> Each time you run the  cell below, it will give you a different train-test split of the Iris data.<br>
    Does this affect what tree you get? </li>
    <li> Is there a combination of values that means you consistently get similar trees?</li>
    <li>    What is a good way of judging 'similarity'?</li>
    </ul>
    </div>

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree


# load iris dataset and split into train:test
iris = sklearn.datasets.load_iris()
irisX = iris.data
irisy = iris.target
X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)

model = DecisionTreeClassifier(random_state=1234, max_depth=None,min_samples_split=2,min_samples_leaf=1)
model.fit(X_train,y_train)
ypred = model.predict(X_test)

cm = confusion_matrix(y_test, ypred)
CMPlot=ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=iris.target_names)
CMPlot.plot()


fig = plt.figure(figsize=(12,12))
_ = tree.plot_tree(model, feature_names=iris.feature_names,  class_names=iris.target_names, filled=True)
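
One way to compare trees without inspecting each plot is to record their size. The sketch below varies only `max_depth` and reports the depth and number of leaves of each fitted tree using `get_depth()` and `get_n_leaves()`. For simplicity it fits on the whole Iris data set rather than a train/test split, which is an assumption of this example; the same loop could be repeated for `min_samples_split` or `min_samples_leaf`.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Fit one tree per setting and record how large each tree grew
results = []
for max_depth in (1, 2, 3, None):
    model = DecisionTreeClassifier(random_state=1234, max_depth=max_depth)
    model.fit(X, y)
    results.append((max_depth, model.get_depth(), model.get_n_leaves()))

for max_depth, depth, n_leaves in results:
    print("max_depth={}: depth={}, leaves={}".format(max_depth, depth, n_leaves))
```

Depth and leaf count are only a coarse measure of 'similarity': two trees of the same size can still split on different features.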


<div class="alert alert-warning" style="color:black" > <h1> Activity 4: (stretch)</h1></div>
Using the code from last week,  apply a StandardScaler to the Iris data set and evaluate the effect this has on the accuracy.

Because there is a random element in how the data set is split into training and test sets, it is not valid just to split the data once and then compare the results with and without scaling.

Instead  you will need to do ten repeats  of:
- Use the sklearn method to split the data into 66:34 train/test sets
- Construct,  train, and test,  an instance of your kNN model on the unscaled data and store its accuracy 
- Create an instance of the standard scaler and then:
  - call its fit() method to set its parameters from the training set.
  - call its transform() method for both the training and test sets
  - Construct,  train, and test,  an instance of your kNN model on this scaled data and store its accuracy 

That should give you ten pairs of values (one per repeat) for the scaled and raw data accuracy.  
Use an online statistical tool (e.g. https://www.graphpad.com/quickcalcs/ttest1.cfm) that lets you copy your data in, then perform a 'paired t-test' to find out whether scaling the data makes a statistically significant difference to prediction accuracy.
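
The experiment can be sketched roughly as follows. Several choices here are assumptions rather than part of the activity: scikit-learn's `KNeighborsClassifier` with `n_neighbors=3` stands in for your own kNN class, `random_state=repeat` is added only to make the splits repeatable, and `scipy.stats.ttest_rel` is used as an alternative to the online calculator.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
raw_acc, scaled_acc = [], []

for repeat in range(10):
    # A fresh stratified 66:34 split on every repeat
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.34, stratify=y, random_state=repeat)

    # Unscaled data (stand-in model: swap in your own kNN class here)
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    raw_acc.append(knn.score(X_test, y_test))

    # Fit the scaler on the training set only, then transform both sets
    scaler = StandardScaler().fit(X_train)
    knn = KNeighborsClassifier(n_neighbors=3).fit(
        scaler.transform(X_train), y_train)
    scaled_acc.append(knn.score(scaler.transform(X_test), y_test))

# Paired t-test: each repeat contributes one (raw, scaled) pair
t_stat, p_value = ttest_rel(raw_acc, scaled_acc)
print(np.mean(raw_acc), np.mean(scaled_acc), p_value)
```

With your own class, which has no `score()` method, accuracy can be computed as `np.mean(model.predict(X_test) == y_test)`. Fitting the scaler on the training set alone keeps information about the test set out of the preprocessing step.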

<div class="alert alert-block alert-danger"> Please save your work (click the save icon) then shutdown the notebook when you have finished with this tutorial (menu->file->close and shutdown notebook)</div>

<div class="alert alert-block alert-danger"> Remember to download and save your work if you are not running this notebook locally.</div>