# Workbook 6: Supervised Machine Learning

## Description and aims

This tutorial is designed to give you your first experience of machine learning in practice by implementing a simple nearest-neighbour classifier.

The learning outcomes are:
- experience of implementing the K Nearest Neighbours classification algorithm
- experience of using the sklearn DecisionTree classification algorithm
- experience of working through different preprocessing steps to try to improve the performance of your classifier

<div class="alert alert-warning" style="color:black">
    <h1>Activity 1: Loading and Visualising Data</h1>
   We will start by importing and visualising the Iris dataset used in the lecture.
<ul>
    <li><b>Run the 2 code cells below</b> to load and display the iris dataset</li>
            </ul></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math

import week7_utils as W7utils
%matplotlib inline



## Iris flowers <img src="figures/Iris-image.png" style="float:right">
- classic Machine Learning Data set
- 4 measurements: sepal and petal width and length
- 50 examples from each of the 3 sub-species of iris flowers
- three-class problem:
 - so for some types of algorithm you have to decide whether to make  
   a 3-way classifier or nested 1-vs-rest classifiers
- most ML classifiers can achieve over 90% accuracy
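
As a quick check of the facts listed above, the following sketch loads the data and counts the rows, features and examples per class. It only uses `sklearn.datasets.load_iris` and NumPy; the variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the measurements (X) and the integer class labels (y)
X, y = load_iris(return_X_y=True)

print(X.shape)  # rows = flowers, columns = measurements

# Count how many examples carry each label
labels, counts = np.unique(y, return_counts=True)
print(labels, counts)
```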



In [None]:
import sklearn.datasets
irisX,irisy = sklearn.datasets.load_iris(return_X_y=True)
title="Scatterplots of 2D slices through the 4D Iris data"

iris_features= ("sepal_length", "sepal_width", "petal_length", "petal_width")
iris_names= ['setosa','versicolor','virginica']
W7utils.show_scatterplot_matrix(irisX,irisy,iris_features,title)

<div class="alert alert-warning" style="color:black">
    <h1>Activity 2: Implementing K-Nearest Neighbours</h1>
</div>
            
Basic process for predicting the label of a new point from the training set:
1. Measure distance to new point from every member of the training set
2. Find the K Nearest Neighbours  
   in other words, the K members of the training set with the smallest distances  (*calculated in step 1*)
3. Count the labels that were provided for those K training items,  
   and return the most common one as the predicted label.
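
The three steps above can be sketched on a tiny made-up data set. This is only an illustration: it uses NumPy shortcuts (`np.linalg.norm`, `np.argsort`, `np.bincount`) rather than explicit loops, and the points, labels and the name `knn_predict_one` are invented for the example.

```python
import numpy as np

def knn_predict_one(train_X, train_y, new_item, K):
    """Predict a label for one new item (illustrative sketch)."""
    # Step 1: straight-line distance from the new item to every training item
    distances = np.linalg.norm(train_X - new_item, axis=1)
    # Step 2: indexes of the K training items with the smallest distances
    nearest = np.argsort(distances)[:K]
    # Step 3: count the labels of those K items and return the most common
    votes = np.bincount(train_y[nearest])
    return int(np.argmax(votes))

# Made-up 2D training set: three points of class 0, three of class 1
toy_X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
toy_y = np.array([0, 0, 0, 1, 1, 1])

near_origin = knn_predict_one(toy_X, toy_y, np.array([0.5, 0.5]), K=3)
near_far_cluster = knn_predict_one(toy_X, toy_y, np.array([5.5, 5.5]), K=3)
print(near_origin, near_far_cluster)
```

Note that `np.argmax` returns the first maximum, so in this sketch a tied vote goes to the lowest-numbered label.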

Below is a figure illustrating the start and the first two steps of the process.  
It is followed by a code cell with a simple implementation of a class for 1-Nearest neighbours. 

<b>Read through the code  to get a sense for how it implements the algorithm. </b><br>
Your tutor will discuss it with you in the lab sessions.
<img src="figures/kNN-steps.png">



In [None]:
# Example for K = 1 

class simple_1NN:

    def __init__(self, verbose = True):
        #we'll use straight line distance -code from week6 reproduced in this week's utils file
        self.distance = W7utils.euclidean_distance
        # this version only looks at the single nearest neighbour
        self.K=1
        
        #just affects prints to screen
        self.verbose= verbose
        
    def fit(self,X,y):
        # ask the data how big it is and store that info
        self.numTrainingItems = X.shape[0]
        self.numFeatures = X.shape[1]
        # store a copy of the data (X) and the labels (y)
        self.modelX = X
        self.modelY = y
        self.labelsPresent = np.unique(self.modelY) # list the unique values found in the labels provided
        if (self.verbose):
            print(f"There are {self.numTrainingItems} training examples, each described by values for {self.numFeatures} features")
            print(f"So self.modelX is a 2D array of shape {self.modelX.shape}")
            print(f"self.modelY is a list with {len(self.modelY)} entries, each being one of these labels {self.labelsPresent}")
        
    def predict(self,newItems):
        # read how many  newitems there are
        numToPredict = newItems.shape[0]
        # make an empty list to hold their predicted labels
        predictions = np.empty(numToPredict)
        
        # loop through the new items one at a time
        for item in range(numToPredict):
            # predicting its label
            thisPrediction = self.predict_new_item ( newItems[item])
            # adding that prediction to our list
            predictions[item] = thisPrediction
        return predictions
    
    def predict_new_item(self,newItem):
        
        # Step 1: measure and store distance to each training item
        distFromNewItem = np.zeros((self.numTrainingItems)) # array with one entry for each training set item, initialised to zero
        for stored_example in range (self.numTrainingItems):
            distFromNewItem[stored_example] = self.distance(newItem,  self.modelX[stored_example])
  
        # Step 2: find the single closest training example (this is K=1)
        closest = 0
        for stored_example in range (0, self.numTrainingItems):
            if  ( distFromNewItem[stored_example] < distFromNewItem[closest] ):
                closest=stored_example
 
        # Step 3: count the votes - because this is K=1 we don't need to take a vote
        labelOfClosest = self.modelY[closest]
        return labelOfClosest
    

  


<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.1</h2>
    <b>Run the code provided below</b> for K=1 with the two datasets and make sure you understand the outputs and how they are produced
<ul>
    <li>For the marks dataset this creates a plot to show a decision surface<br>
    (you do not need to understand how the PlotDecisionSurface() method works)</li>
    <li>For the  iris data set this uses a confusion matrix <br> (ask the internet what a confusion matrix is if you're not sure)</li>
    </ul>
    </div>

**The Iris dataset - illustrating a confusion matrix**
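
If you have not met a confusion matrix before, this small sketch may help. It uses made-up labels for six items from three classes (not the Iris results) and sklearn's `confusion_matrix` function: each row is an actual class, each column a predicted class.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up true and predicted labels for six items from three classes
actual = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 0, 1, 2, 2, 2])

# Rows are the actual class, columns are the predicted class
cm = confusion_matrix(actual, predicted)
print(cm)

# Correct predictions sit on the diagonal, so accuracy is trace / total
accuracy = 100 * np.trace(cm) / cm.sum()
print(f"Accuracy = {accuracy:.1f} %")
```

Here the single off-diagonal entry, in row 1 and column 2, records one item of class 1 that was wrongly predicted as class 2.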

In [None]:
# make train/test split 
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris
import numpy as np

irisX,irisy = load_iris(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)


myKNNmodel = simple_1NN()
myKNNmodel.fit(X_train,y_train)
y_pred = myKNNmodel.predict(X_test)
print(y_pred.T)  # .T transposes; y_pred is 1D so it already prints as a row


In [None]:
print ( (y_test==y_pred))
accuracy = 100* ( y_test == y_pred).sum() / y_test.shape[0]
print(f"Overall Accuracy = {accuracy} %")

confusionMatrix = np.zeros((3,3),int)
for i in range(y_test.shape[0]):  # loop over every test item rather than a hard-coded 50
    actual = int(y_test[i])
    predicted = int(y_pred[i])
    confusionMatrix[actual][predicted] += 1
print(confusionMatrix)

#and here's sklearn's built-in method
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(y_test, y_pred,display_labels= iris_names )

### The iris data set - illustrating the decision surface
We will only use the two petal features so we can visualise it in 2D

In [None]:
petals = X_train[:,2:4]
myKNNmodel.fit(petals,y_train)
y_pred = myKNNmodel.predict(X_test[:,2:4])
accuracy = 100* ( y_test == y_pred).sum() / y_test.shape[0]
print(f"Overall Accuracy in 2D = {accuracy} %")

title= "1-Nearest Neighbour on petal features"
W7utils.PlotDecisionSurface(petals,y_train,myKNNmodel, title, iris_features[2:4],stepSize= 0.1)

<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.3: Create your own implementation of K-Nearest Neighbours</h2>
    Using the code above, extend the predict method for the class simple_1NN to use the votes from K>1 neighbours.


<ul>
    <li>I have started you off by creating an empty class called simple_KNN</li>
    <li>Then I copied in the pseudo-code as comments starting ##</li>
    <li>I have also helped you by copying in the code from the simple_1NN class into the relevant places</li>
    <li>I have marked everywhere you need to write code with #==></li>
    <li><b>It is a total of 18 lines, most are very obvious</b></li>
    </ul>
    <p><b> It's often helpful to put in some print() statements to show what is going on as you develop your code</b><br>
        And if you can write your code so that it runs in a 'partially completed' state then you can build it up in bits.
</div>

### Pseudocode for K-Nearest Neighbours
**init()**  :  
SPECIFY function to calculate distance metric d(i,j) for any two items *i* and *j*     
  e.g. Euclidean (continuous variables) or Hamming (categorical)  
SET value of K

**fit(trainingData)** :  

SET numExemplars = READ(number of rows in training data)  
SET numFeatures = READ(number of columns in training data) 

*#Just store a local copy of the training data as two arrays:*   
CREATE_AND_FILL(X_train of shape (numExemplars , numFeatures)).     
CREATE_AND_FILL(y_train of shape( numExemplars))
  
**predict(newItems)** :  
SET numToPredict = READ(number of rows in newItems)  
SET predictions = CREATE_EMPTYARRAY( numToPredict)
 
FOREACH item in (0...,numToPredict-1)    
...SET predictions[item] = predictNewItem ( newItems[item]) 
 
RETURN predictions  


**predictNewItem(newItem)**:

*Step 1: Make 1D array of distances from newItem to each training set item*   
FOREACH exemplar in (0,...,numExemplars -1)  
...SET distFromNewItem [exemplar] = d (newItem , X_train[exemplar] )   

*Step 2: Get indexes of the k nearest neighbours for our new item*        
SET closestK = GET_IDS_OF_K_CLOSEST(distFromNewItem, K)
 
  
*Step 3: Calculate most popular of the m possible labels*     
SET labelcounts = CREATE(1D array with m zero values)  

FOREACH  k in (0,...K-1)   
... SET thisindex = closestK[k]   
... SET thislabel = y_train[thisindex]  
... INCREMENT labelcounts[thislabel]  

SET thisPrediction = READ(index of labelcounts with highest value)    

RETURN thisPrediction

**get_ids_of_k_closest(distFromNewItem, K):**

SET closestK= CREATE(1D array with K values)  
SET arraySize = len(distFromNewItem)  

FOR k in (0,...,K-1)  
... SET thisClosest=0  
... FOR exemplar in (0,...,arraySize -1)  
......IF ( distFromNewItem[exemplar] < distFromNewItem[thisClosest]  )  
......... SET thisClosest = exemplar  
... SET closestK[k] = thisClosest # store this id  
... SET distFromNewItem[thisClosest] = BigNumber # so we don't pick it again in next loop

RETURN closestK
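
The init() pseudocode above mentions two possible distance metrics. As a hedged sketch of what they might look like: the function names below are hypothetical and are not the ones provided in `W7utils`.

```python
import numpy as np

def euclidean(a, b):
    """Straight-line distance between two numeric feature vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

def hamming(a, b):
    """Number of positions at which two categorical vectors differ."""
    a, b = np.asarray(a), np.asarray(b)
    return int(np.sum(a != b))

print(euclidean([0, 0], [3, 4]))                 # 5.0
print(hamming(["red", "small", "round"],
              ["red", "large", "round"]))        # 1
```

Euclidean distance suits continuous measurements such as the Iris features, whereas Hamming distance simply counts mismatches and so makes sense for categorical values.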


In [None]:
# your KNN class code here

class simple_KNN:

    def __init__(self, verbose = True):
        """init function, Needs adapting to take an argument K with default 1"""
        
        ## SPECIFY function to calculate distance metric d(i,j) for any two items *i* and *j*
        self.distance= W7utils.euclidean_distance
        ## SET value of K
        #===> change line below to take K from an argument to this init() method <====
        self.K=1
        
        #just affects prints to screen
        self.verbose= verbose     


    def fit(self,X,y):
        """stores the dataset values X and labels y. Same code as 1-NN"""
        
        ##SET numExemplars = READ(number of rows in training data)  
        self.numTrainingItems = X.shape[0]
        
        ##SET numFeatures = READ(number of columns in training data) 
        self.numFeatures = X.shape[1]
        
        # Just store a local copy of the training data as two arrays:   
        ## CREATE_AND_FILL(X_train of shape (numExemplars , numFeatures)).     
        self.modelX = X
        ## CREATE_AND_FILL(y_train of shape( numExemplars))
        self.modelY = y
        
        #additional reporting -  not part of algorithm
        self.labelsPresent = np.unique(self.modelY) # list the unique values found in the labels provided
        if (self.verbose):
            print(f"There are {self.numTrainingItems} training examples, each described by values for {self.numFeatures} features")
            print(f"So self.modelX is a 2D array of shape {self.modelX.shape}")
            print(f"self.modelY is a list with {len(self.modelY)} entries, each being one of these labels {self.labelsPresent}")


  
    def predict(self,newItems):
        """ make a prediction for each new item - same code as 1-NN"""
        
        ## SET numToPredict = READ(number of rows in newItems) 
        numToPredict = newItems.shape[0]
        
        ## SET predictions = CREATE_EMPTYARRAY( numToPredict)
        predictions = np.empty(numToPredict)
        
        ##FOREACH item in (0...,numToPredict-1) 
        for item in range(numToPredict):
        
            ##...SET predictions[item] = predictNewItem ( newItems[item]) 
            thisPrediction = self.predict_new_item ( newItems[item])
            predictions[item] = thisPrediction
            
            
        ## RETURN predictions    
        return predictions
 

 
    def predict_new_item(self,newItem):
        """make prediction for single item. Step 1 is same as 1-NN steps 2 and 3 need writing"""

        ## Step 1:   
        ## Make 1D array of distances from newItem to each training set item   
        distFromNewItem = np.zeros((self.numTrainingItems)) 

        ## FOREACH exemplar in (0,...,numExemplars -1)  
        for stored_example in range (self.numTrainingItems):
            ## ...SET distFromNewItem [exemplar] = d (newItem , X_train[exemplar] )   
            distFromNewItem[stored_example] = self.distance(newItem,  self.modelX[stored_example])
  

        ## Step 2: Get indexes of the k nearest neighbours for our new item    
    
        ## SET closestK = GET_IDS_OF_K_CLOSEST(distFromNewItem, K)
        # closestK is an array with K elements  
        #===> add one line of code to call the new function <===       

 
        ## Step 3: Calculate most popular of the m possible labels 
    
        ## SET labelcounts = CREATE(1D array with m zero values)  
        #==> add one line of code using numpy.zeros to do this.  <===
        # remember that in fit() we created self.labelsPresent
        # so m = len(self.labelsPresent) 
      
        ## FOREACH  k in (0,...K-1)  
        #==> add line of code putting in a for() loop here <===
    
            ##... SET thisindex = closestK[k] 
            #==> add line of code to do this
            
            ##... SET thislabel = y_train[thisindex]  
            #==> add line of code to do this
            
            ##... INCREMENT labelcounts[thislabel] 
            #==> add line of code to do this

        ## SET thisPrediction = READ(index of labelcounts with highest value)    
        #==> add one or two lines of code to do this
        # suggest you google "python highest value in numpy array" 
        
        ## RETURN thisPrediction   
        return thisPrediction
    
    
  
                
                
def get_ids_of_k_closest(distFromNewItem, K):
    """new function that returns array containing indexes of K closest items"""
    
    # Several ways of doing this.  
    # This one just does K iterations of the loop from 1-NN that found the single closest 

    ## SET closestK= CREATE(1D array with K values) 
    #==> add line of code to do this using np.empty(K,dtype=int)  <==

    ##SET arraySize = len(distFromNewItem)  
    #==> add line of code to do this, 
    #distFromNewItem is a numpy array so you use its .shape[0] attribute <===
    
    ## FOR k in (0,...,K-1)  
    #==> add line of code to do this
    # look at 1-NN predict_new_item() for inspiration for the contents of this loop

        ##... SET thisClosest=0
        #==> add line of code to do this
    
        ##... FOR exemplar in (0,...,arraySize -1) 
        #==> add line of code to do this

            ##......IF ( distFromNewItem[exemplar] < distFromNewItem[thisClosest]  )  
            #==> add line of code to do this

                ##......... SET thisClosest = exemplar
                #==> add line of code to do this

        ##... SET closestK[k] = thisClosest  
        #===> add line of code to do this
                
        ##... SET distFromNewItem[thisClosest] = BigNumber 
        # so we don't pick it again in next loop
        #==>add line of code to do this, you could use 100000 for bignum

    ##RETURN closestK
    #==> add line of code to do this
                
                

<div class="alert alert-warning" style="color:black">
<h2> Activity 2.4: Test your implementation on the iris dataset</h2>
Use the toolbar to copy and paste the two cells from activity 2.1 below here. <br>
Then edit them so that they create and use objects of your new class, instead of the class simple_1NN

Start with K=1 - this should produce the same results as you got in activity 2.1, then try with K = {3,5,7}
<ul>
    <li>Make <b>qualitative</b> judgements: how does the decision surface change?</li>
    <li>Make <b>quantitative</b> judgements: how does the confusion matrix change?</li>
    </ul>
    </div>

<div class="alert alert-warning" style="color:black" >
<h1> Activity 3: Decision Trees</h1></div>

In the lecture notebook we illustrated how the decision tree is created by a process of expanding nodes.

We often want to control how we learn a model (in this case, how we grow a tree) to avoid a phenomenon called **over-fitting**.

- This is where the model captures fine details of the training set and so fails to generalise from the training set to the real world.
- like in the images where all the dogs faced left
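
One way to see over-fitting in numbers is to compare accuracy on the training set with accuracy on held-out test data. The sketch below assumes the Iris data and sklearn's `DecisionTreeClassifier`; the `random_state` values and the choice of `max_depth=2` are arbitrary, made only so the example is repeatable.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0)

results = {}
for depth in (None, 2):  # None lets the tree grow until its leaves are pure
    tree_model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree_model.fit(X_tr, y_tr)
    # score() returns the fraction of correct predictions
    results[depth] = (tree_model.score(X_tr, y_tr),
                      tree_model.score(X_te, y_te))
    print(f"max_depth={depth}: train={results[depth][0]:.2f}, "
          f"test={results[depth][1]:.2f}")
```

A large gap between training and test accuracy is the usual sign of over-fitting. On a data set as small and clean as Iris the gap may turn out to be modest.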

<div class="alert alert-warning" style="color:black">
<h2> Activity 3.1: exploring how to control tree-growth to prevent over-fitting</h2>
The aim of this activity is for you to experiment with what happens when you change three parameters that affect how big and complex the tree is allowed to get.
<ul>
    <li> max_depth</li>
    <li>min_samples_split, (default value is 2)</li>
    <li>min_samples_leaf, (default value is 1)</li>
    </ul>


Experiment with the Iris data set below to see if you can work out what each of these parameters does, and how it affects the tree.
<ul>
<li> Each time you run the cell below, it will give you a different train-test split of the Iris data.<br>
    Does this affect what tree you get? </li>
    <li> Is there a combination of values that means you consistently get similar trees?</li>
    <li> What is a good way of judging 'similarity'?</li>
    </ul>
    </div>
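
One possible way to compare trees across different splits is to record a few simple summaries of each fitted tree: its depth, its number of leaves, and the feature tested at the root. The sketch below uses `get_depth()`, `get_n_leaves()` and the `tree_.feature` array of a fitted `DecisionTreeClassifier`; the parameter values and the five seeds are arbitrary choices for illustration, and other measures of similarity are equally valid.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris_data = load_iris()
summaries = []
for seed in range(5):  # five different train/test splits
    X_tr, X_te, y_tr, y_te = train_test_split(
        iris_data.data, iris_data.target, test_size=0.33,
        stratify=iris_data.target, random_state=seed)
    model = DecisionTreeClassifier(max_depth=3, min_samples_leaf=3,
                                   random_state=1234)
    model.fit(X_tr, y_tr)
    # tree_.feature[0] is the index of the feature tested at the root node
    root_feature = iris_data.feature_names[model.tree_.feature[0]]
    summaries.append((model.get_depth(), model.get_n_leaves(), root_feature))
    print(f"split {seed}: depth={summaries[-1][0]}, "
          f"leaves={summaries[-1][1]}, root splits on {root_feature}")
```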

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree


# load iris dataset and split into train:test
iris = sklearn.datasets.load_iris()
irisX = iris.data
irisy = iris.target
X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)



## Experiment with changing these values
depth= 1 #  try 2,3,4,5
minsplit = 2 #try 3,4,5
minleaf=1 #try 3,4,5

model = DecisionTreeClassifier(random_state=1234, max_depth=depth,min_samples_split=minsplit,min_samples_leaf=minleaf)
model.fit(X_train,y_train)
y_pred = model.predict(X_test)


CMPlot=ConfusionMatrixDisplay.from_predictions(y_test,y_pred, display_labels=iris_names)



fig = plt.figure(figsize=(12,12))
_ = tree.plot_tree(model, feature_names=iris.feature_names,  class_names=list(iris.target_names), filled=True)  # list() because target_names is a numpy array


<div class="alert alert-warning" style="color:black" > <h1> Activity 4: (stretch)</h1></div>
Using the code from last week,  apply a StandardScaler to the Iris data set and evaluate the effect this has on the accuracy.

Because there is a random element in how  the data set is split into training / test split,  it is not valid just to split the data once then compare the results with / without scaling.

Instead  you will need to do ten repeats  of:
- Use the sklearn method to split the data into 66:34 train/test sets
- Construct,  train, and test,  an instance of your kNN model on the unscaled data and store its accuracy 
- Create an instance of the standard scaler and then:
  - call its fit() method to set its parameters from the training set.
  - call its transform() method for both the training and test sets
  - Construct,  train, and test,  an instance of your kNN model on this scaled data and store its accuracy 

That should give you ten pairs of values (one per repeat) for the scaled and raw data accuracy.  
Use an online statistical tool (e.g. https://www.graphpad.com/quickcalcs/ttest1.cfm) that lets you paste in your data and perform a 'paired t-test' to judge whether the difference in prediction accuracy between the scaled and raw data is statistically significant.
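
The repeat-and-compare procedure might be sketched as follows. This is an outline under stated assumptions rather than a model answer: it swaps in sklearn's `KNeighborsClassifier` as a stand-in for your own kNN class, picks K=3 arbitrarily, fixes `random_state` per repeat so the run is reproducible, and uses `scipy.stats.ttest_rel` for the paired t-test instead of the online tool.

```python
import numpy as np
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
raw_acc, scaled_acc = [], []

for repeat in range(10):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.33, stratify=y, random_state=repeat)

    # Train and test on the unscaled data
    knn = KNeighborsClassifier(n_neighbors=3).fit(X_tr, y_tr)
    raw_acc.append(knn.score(X_te, y_te))

    # The scaler's parameters come from the training set only
    scaler = StandardScaler().fit(X_tr)
    knn = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_tr), y_tr)
    scaled_acc.append(knn.score(scaler.transform(X_te), y_te))

print("raw:   ", np.round(raw_acc, 3))
print("scaled:", np.round(scaled_acc, 3))

# Paired t-test: each repeat contributes one (scaled, raw) pair
result = ttest_rel(scaled_acc, raw_acc)
print(f"t = {result.statistic:.3f}, p = {result.pvalue:.3f}")
```

Using the same split for both models within each repeat is what makes the comparison paired.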

<div class="alert alert-block alert-danger"> Please save your work (click the save icon) then shut down the notebook when you have finished with this tutorial (menu->file->close and shutdown notebook)</div>

<div class="alert alert-block alert-danger"> Remember to download and save your work if you are not running this notebook locally.</div>