# Workbook 6: Supervised Machine Learning

## Description and aims

This tutorial is designed to give you your first experience of machine learning in practice by implementing a simple nearest-neighbour classifier.

The learning outcomes are:
- experience of implementing the K Nearest Neighbours classification algorithm
- experience of using the sklearn DecisionTree classification algorithm
- experience of working through different preprocessing steps to try to improve the performance of your classifier

and from the perspective of your programming skills:
- more experience of class inheritance 
- experience of using numpy's argmin method
- more experience of using python dictionaries

<div class="alert alert-warning" style="color:black">
    <h1>Activity 1: Loading and Visualising Data</h1>
   We will start by importing and visualising the Iris dataset used in the lecture.
<ul>
    <li><b>Run the 2 code cells below</b> to load and display the iris dataset</li>
            </ul></div>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import math

import week6_utils as w6utils
%matplotlib inline



## Iris flowers <img src="figures/Iris-image.png" style="float:right">
- classic Machine Learning Data set
- 4 measurements: sepal and petal width and length
- 50 examples from each of 3 sub-species of iris flowers
- three-class problem:
 - so for some types of algorithm you have to decide whether to make  
   a 3-way classifier or nested 1-vs-rest classifiers
- most ML classifiers can get over 90% accuracy
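
To make the "three-class problem" point concrete, here is a minimal sketch (not part of the original workbook code) that checks the size of the dataset and shows one way the labels could be re-coded for 1-vs-rest classifiers. The variable names `class_counts` and `one_vs_rest_targets` are just illustrative choices.

```python
import numpy as np
from sklearn.datasets import load_iris

# Load the features (one row per flower) and integer class labels (0, 1, 2)
x, y = load_iris(return_X_y=True)
print(f"data shape: {x.shape}")

# Count how many examples there are of each class
class_counts = np.bincount(y)
print(f"examples per class: {class_counts}")

# 1-vs-rest re-coding: one binary target vector per class,
# holding 1 where the example belongs to that class and 0 otherwise
one_vs_rest_targets = {int(c): (y == c).astype(int) for c in np.unique(y)}
for c, target in one_vs_rest_targets.items():
    print(f"class {c} vs rest: {target.sum()} positives, {len(target) - target.sum()} negatives")
```

A 3-way classifier would be trained once on `y`, whereas a 1-vs-rest approach would train one binary classifier per entry of the dictionary.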



In [None]:
import sklearn.datasets
iris_x,iris_y = sklearn.datasets.load_iris(return_X_y=True)
title="Scatterplots of 2D slices through the 4D Iris data"

iris_features= ("sepal_length", "sepal_width", "petal_length", "petal_width")
iris_names= ['setosa','versicolor','virginica']
w6utils.show_scatterplot_matrix(iris_x,iris_y,iris_features,title)

<div class="alert alert-warning" style="color:black">
    <h1>Activity 2: Implementing K-Nearest Neighbours</h1>
</div>
            
Basic process for predicting the label of a new point from the training set:
1. Measure distance to new point from every member of the training set
2. Find the K Nearest Neighbours  
   in other words, the K members of the training set with the smallest distances  (*calculated in step 1*)
3. Count the labels that were provided for those K training items,  
   and return the most common one as the predicted label.

Below is a figure illustrating the starting point and the first two steps of the process.  
It is followed by a code cell with a simple implementation of a class for 1-Nearest neighbours. 

<b>Read through the code  to get a sense for how it implements the algorithm. </b><br>
Your tutor will discuss it with you in the lab sessions.
<img src="figures/kNN-steps.png">
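
Before reading the class, it may help to see the three steps above on a tiny, made-up dataset for the special case K=1. This is only a sketch: the points and labels are invented for illustration, and it uses a vectorised distance calculation rather than the nested loops in the class below.

```python
import numpy as np

# Tiny made-up training set: 4 points in 2D with labels 0 or 1
train_points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0]])
train_labels = np.array([0, 0, 1, 1])
new_point = np.array([4.0, 4.0])

# Step 1: euclidean distance from the new point to every training point
distances = np.linalg.norm(train_points - new_point, axis=1)

# Step 2 (with K = 1): index of the single closest training point
nearest = np.argmin(distances)

# Step 3 (with K = 1): the "vote" is just that one neighbour's label
predicted_label = train_labels[nearest]

print(f"distances: {distances}")
print(f"nearest training item: {nearest}, predicted label: {predicted_label}")
```

With K greater than 1, step 2 returns several indexes and step 3 has to count the labels, which is what Activity 2.2 asks you to build.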



In [None]:
# Example for K = 1 

class Simple1NNClassifier:
    """ 
    Simple example class for 1-Nearest Neighbours algorithm.
    Assumes numpy is imported as np and uses euclidean distance
    """    
    def dist_a_b(self,a:np.ndarray,b:np.ndarray)->float:
        """ euclidean distance between same-size vectors a and b"""
        assert a.shape==b.shape, 'vectors not same size calculating distance'
        return np.linalg.norm(a-b) 
    
    def fit(self,x:np.ndarray,y:np.ndarray):
        """ just stores the data for k-nearest neighbour"""
        self.num_training_items = x.shape[0]
        self.num_features = x.shape[1]
        self.model_x = x
        self.model_y = y
        
    def predict(self,new_items:np.ndarray):
        """ makes predictions for an array of new items"""
        num_to_predict = new_items.shape[0]
        y_pred = np.zeros((num_to_predict),dtype=int)
        
        # measure distances - creates an array with num_to_predict rows and num_training_items columns
        dist = np.zeros((num_to_predict,self.num_training_items))
        for new_item in range(num_to_predict):
            for stored_example in range(self.num_training_items):
                dist[new_item][stored_example]= self.dist_a_b(new_items[new_item],
                                                              self.model_x[stored_example ])

        #make predictions: one for each row of dist (i.e. for each item to predict)
        for item_idx in range(num_to_predict):
            y_pred[item_idx] = self.predict_one(item_idx, dist)
        return y_pred
    
    def predict_one(self,item_idx:int,distances:np.ndarray):
        """ makes a class prediction for a single new item
        This version is just for 1 Nearest Neighbour
        Parameters
        ----------
        item_idx (int): item to make prediction for - i.e. idx of row in distances matrix
        distances (numpy ndarray): array of distances between new items (rows) and training set records (columns)
        """
        # we're going to use numpy's argmin method (google it)
        # which gives us the index of the column with the lowest value in an array
        idx_of_nearest_neighbour = np.argmin (distances[item_idx])
        return self.model_y[ idx_of_nearest_neighbour]
        
  


<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.1</h2>
    <b>Run the code provided below</b> for K=1 and make sure you understand the outputs and how they are produced
<ul>
    <li>For the full iris data set this uses a confusion matrix <br> (ask the internet what a confusion matrix is if you're not sure)</li>
    <li>For the 2-D (petal features only) version of the iris data set this creates a plot to show a decision surface<br>
    (you do not need to understand how the <code>plot_decision_surface()</code> function works)</li>
    </ul>
    </div>
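
If you have not met a confusion matrix before, the following sketch on a handful of made-up labels may help. It uses `sklearn.metrics.confusion_matrix`, where each row is an actual class and each column is a predicted class, so correct predictions sit on the diagonal. The label arrays here are invented purely for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up actual and predicted labels for six items from three classes
actual = np.array([0, 0, 1, 1, 2, 2])
predicted = np.array([0, 1, 1, 1, 2, 0])

# Entry [i][j] counts the items whose actual class is i and predicted class is j
cm = confusion_matrix(actual, predicted)
print(cm)

# Correct predictions are on the diagonal, so accuracy is trace / total
accuracy = np.trace(cm) / cm.sum()
print(f"accuracy = {100 * accuracy:.1f} %")
```

The off-diagonal entries tell you which classes are being confused with which, something a single accuracy figure hides.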

In [None]:
# make train/test split of datasets
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(iris_x, iris_y, test_size=0.33,stratify=iris_y)

In [None]:
#make a model
my_1NN_model = Simple1NNClassifier()

# fit it to the training data
my_1NN_model.fit(train_x,train_y)

# use it to make predictions for test data
predictions = my_1NN_model.predict(test_x)
print(f' predictions are {predictions.T}') #.T would turn a column into a row so it shows on screen better


# make array of whether two arrays have equal values
print ( f'individual matches to actual values are {test_y==predictions}')

# do some counting to get the accuracy
accuracy = 100* ( test_y == predictions).sum() / test_y.shape[0]
print(f"\nOverall Accuracy = {accuracy} %")

confusionMatrix = np.zeros((3,3),int)
for i in range(len(test_y)): # loop over every test item rather than assuming there are 50
    actual = int(test_y[i])
    predicted = int(predictions[i])
    confusionMatrix[actual][predicted] += 1
print(confusionMatrix)

#and here's sklearn's built-in method
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_predictions(test_y, predictions,display_labels= iris_names )

### Using a 2-D version of the iris data set to illustrate the decision surface
We will only use the two petal features so we can visualise it in 2d


In [None]:
#make data - labels are the same as before
petals_train = train_x[:,2:4]
petals_test = test_x[:,2:4]

#make model
my_1NN_model2 = Simple1NNClassifier()
# fit it to data
my_1NN_model2.fit(petals_train,train_y)

#make predictions, score them 
y_pred = my_1NN_model2.predict(petals_test)
accuracy = 100* ( test_y == y_pred).sum() / test_y.shape[0]
print(f"Overall Accuracy in 2D = {accuracy} %")

title= "1-Nearest Neighbour on petal features"
w6utils.plot_decision_surface(petals_train,train_y,my_1NN_model2, title, iris_features[2:4],step_size= 0.1)

<div class="alert alert-warning" style="color:black" >
<h2> Activity 2.2: Create your own implementation of K-Nearest Neighbours</h2>
    <p> Using the code above,  extend the predict method for the class Simple1NNClassifier  to use the votes from K>1 neighbours.</p>
    <ol>
        <li>Create a class that inherits most of the code: <code>class SimpleKNNClassifier (Simple1NNClassifier):</code> </li>
        <li> Create a new initialisation method that takes one parameter: the number of neighbours to consider (K)<br>
and saves it in <code>self.K</code></li>
        <li> Over-ride the <code>predict_one()</code> method <br>
        so that instead of just finding the label of the single closest neighbour it:
         <ol> 
             <li> Finds the indexes of the <code>self.K</code> nearest neighbours.<br>
                 HINT: you can replace <code>np.argmin</code> with <code>np.argpartition</code> <br>
                 <a href = https://stackoverflow.com/questions/34226400/find-the-index-of-the-k-smallest-values-of-a-numpy-array>
                 This question</a> is the same and the first answer is really useful.
             </li>
             <li>Stores the labels of these instances.<br>
                 The most general way to do this without making assumptions is to use a dictionary<br>
             but this will mean explicitly casting labels to strings to be safe</li>
             <li> Iterates through the labels to see which is most popular. <br>
             You may find the reminder below useful if you are not used to python dictionaries</li>
             <li> returns the most popular label as the prediction for item</li>
         </ol>
        </li>
    </ol>
</div>
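
As a hedged illustration of the hint about `np.argpartition`, the sketch below applies it to a small made-up array of distances. It only shows how to get the indexes of the K smallest values; storing and counting the labels is still left for you to do.

```python
import numpy as np

# Made-up distances from one new item to five training items
distances = np.array([0.9, 0.1, 0.5, 0.3, 0.7])
k = 3

# Partitioning around position k-1 puts the indexes of the k smallest
# values in the first k slots, but in no guaranteed order
k_nearest = np.argpartition(distances, k - 1)[:k]
print(f"indexes of the {k} nearest items (unordered): {k_nearest}")

# If the order matters, sort just those k indexes by their distance
ordered = k_nearest[np.argsort(distances[k_nearest])]
print(f"same indexes, closest first: {ordered}")
```

For voting the order does not matter, so the unordered result is usually enough.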

<div style="background:lightblue;color:black">
    <h3> Reminder: Storing data in python dictionaries and iterating through their contents</h3>
    <p> Python dictionaries are a way of storing data that can be accessed via a key<br>
for example: <code> my_dict= {'name':'jim','familyname':"Smith", 'job':'professor'}</code><br>
<b>Keys are usually strings</b>, but the values associated with a key can be any type, including numbers.</p>

<p> The following snippets of code might be useful to you - <b>after</b> you have edited them.</p>
<p> Make a new code cell in the notebook, copy the snippets in and run it, then edit it as you need.</p>
<p><pre style='background:lightblue;color:black'>    
labels = ['a','b','a','c','a','d','b']
indexes = [1,4,6]
mydict={}
<span style="color:green">for</span> idx <span style="color:green">in</span> indexes:
    <span style="color:green">if</span> labels[idx] <span style="color:green">in</span> mydict.keys():
        mydict[labels[idx]] += 1
    <span style="color:green">else</span>: #create a new dictionary entry if needed
        mydict[labels[idx]] = 1
<span style="color:green">print</span>(f'mydict is {mydict}')

leastvotes=99
<span style="color:green">for</span> key,val <span style="color:green">in</span> mydict.items():
    <span style="color:green">if</span> val < leastvotes:
        unpopular= key
        leastvotes=val
<span style="color:green">print</span>(f'{unpopular}, {leastvotes}')
    </pre></p>
    </div>

In [None]:
class SimpleKNNClassifier(Simple1NNClassifier):
    """
    Complete this class to produce a KNN classifier"""
    
    def __init__(self):
        """ your code here
        you will need to change the function signature
        """
        
    def predict_one(self,item_idx:int,distances:np.ndarray):
        """ makes a class prediction for a single new item
        You should write this to accept any number of neighbours K
        Parameters
        ----------
        item_idx (int): item to make prediction for - i.e. idx of row in distances matrix
        distances (numpy ndarray): array of distances between new items (rows) and training set records (columns)
        """
        prediction = -99999 #dummy value
        
        # YOUR CODE HERE
        
        return prediction
        

<div class="alert alert-warning" style="color:black">
<h2> Activity 2.3: Test your implementation on the iris dataset</h2>
<p>Use the toolbar to copy and paste the second and third  cells from activity 2.1 below here. <br>
Then edit them so that they create and use objects of your new class, instead of the class Simple1NNClassifier.

Start with K=1 - this should produce the same results as you got with my code in activity 2.1, then try with K = {3,5,7}
<ul>
    <li>Make <b>qualitative</b> judgements: how does the decision surface change?</li>
    <li>Make <b>quantitative</b> judgements :  how does the confusion matrix change?</li>
    <li> In Machine Learning we talk about algorithms having  <b>hyper-parameters</b> that control their behaviour.<br>
        Adapt your code to investigate:
        <ul>
        <li>What value for the hyper-parameter <b>K</b> gives the best accuracy on the <b>training</b> set?</li>
        <li>What value for the hyper-parameter <b>K</b> gives the best accuracy on the <b>test</b> set?</li>
            <li> If these are not the same, can you explain why not?</li>
        </ul>
    </li>
    </ul>
    </div>
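
One possible way to organise the hyper-parameter investigation is sketched below. To keep it self-contained it swaps in sklearn's own `KNeighborsClassifier` rather than your `SimpleKNNClassifier`, makes its own split of the iris data with an arbitrary `random_state=0`, and stores the results in a dictionary called `scores`; all of those are illustrative assumptions rather than part of the workbook.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Make a fresh, repeatable train/test split of the iris data
x, y = load_iris(return_X_y=True)
tr_x, te_x, tr_y, te_y = train_test_split(
    x, y, test_size=0.33, stratify=y, random_state=0
)

# For each value of K store (training accuracy, test accuracy)
scores = {}
for k in (1, 3, 5, 7):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(tr_x, tr_y)
    train_acc = knn.score(tr_x, tr_y)  # accuracy on data the model has seen
    test_acc = knn.score(te_x, te_y)   # accuracy on unseen data
    scores[k] = (train_acc, test_acc)
    print(f"K={k}: train accuracy = {train_acc:.3f}, test accuracy = {test_acc:.3f}")
```

Comparing the two columns for each K is a good starting point for explaining why the best value on one set need not be the best on the other.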

<div class="alert alert-warning" style="color:black" >
<h1> Activity 3: Decision Trees</h1></div>

In the lecture notebook we illustrated how the decision tree is created by a process of expanding nodes.

We often want to control how we learn a model (in this case, grow a tree) to avoid a phenomenon called **over-fitting**.

- This is where the model captures fine details of the training set and so fails to generalise from the training set to the real world.
- like in the images where all the dogs faced left

<div class="alert alert-warning" style="color:black">
<h2> Activity 3.1: exploring how to control tree-growth to prevent over-fitting</h2>
The aim of this activity is for you to experiment with what happens when you change three <b>hyper-parameters</b> that affect how big and complex the tree is allowed to get.
<ul>
    <li>max_depth: default value is None</li>
    <li>min_samples_split: default value is 2</li>
    <li>min_samples_leaf: default value is 1</li>
    </ul>


Experiment with the Iris data set we loaded earlier to see if you can work out what each of these hyper-parameters does, and how it affects the tree. 
<ul>
<li> If you uncomment the first line after the imports, it will give you a different train-test split of the Iris data each time you run it.<br>
    Does this affect what tree you get? </li>
    <li> Is there a combination of hyper-parameter values that means you consistently get similar trees?</li>
    <li>    What is a good way of judging 'similarity'?</li>
    </ul>
    </div>
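
As a starting point for these experiments, the sketch below shows one way to put numbers on how big a tree is, using the `get_depth()` and `get_n_leaves()` methods of `DecisionTreeClassifier`. It makes its own repeatable split (`random_state=0` is an arbitrary choice) and only varies `max_depth`; the dictionary name `tree_sizes` is just for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Make a fresh, repeatable train/test split of the iris data
x, y = load_iris(return_X_y=True)
tr_x, te_x, tr_y, te_y = train_test_split(
    x, y, test_size=0.33, stratify=y, random_state=0
)

# For each depth limit store (actual depth, number of leaves) of the fitted tree
tree_sizes = {}
for depth in (1, 3, None):  # None means the depth is not limited
    model = DecisionTreeClassifier(random_state=1234, max_depth=depth)
    model.fit(tr_x, tr_y)
    tree_sizes[depth] = (model.get_depth(), model.get_n_leaves())
    print(
        f"max_depth={depth}: depth={model.get_depth()}, "
        f"leaves={model.get_n_leaves()}, "
        f"train accuracy={model.score(tr_x, tr_y):.3f}, "
        f"test accuracy={model.score(te_x, te_y):.3f}"
    )
```

The same loop could be repeated for `min_samples_split` and `min_samples_leaf`, and depth and leaf count are one simple (if crude) basis for comparing trees.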

In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree


# do a new split of the data into train:test
#train_x, test_x, train_y, test_y = train_test_split(iris_x, iris_y, test_size=0.33,stratify=iris_y)



## Experiment with changing these values
depth= 1 #  try 1,3,5
minsplit = 2 #try 2,5
minleaf=1 #try 1,5

#make a model with those hyper-parameters
model = DecisionTreeClassifier(random_state=1234, max_depth=depth,min_samples_split=minsplit,min_samples_leaf=minleaf)
model.fit(train_x,train_y)
predictions = model.predict(test_x)

accuracy = 100* ( test_y == predictions).sum() / test_y.shape[0]
print(f"Overall Accuracy = {accuracy} %")


CMPlot=ConfusionMatrixDisplay.from_predictions(test_y,predictions, display_labels=iris_names)



fig = plt.figure(figsize=(10,10)) #you may need to increase the figure size for larger trees
_ = tree.plot_tree(model, feature_names=iris_features,  class_names=iris_names, filled=True)


<div class="alert alert-warning" style="color:black" > <h1> Activity 4: Creating a test harness for comparing ML algorithms on a dataset</h1>
<p> Now you have done some manual experimenting with different hyper-parameter values for algorithms, it's time to think about automating that process.</p>
<p>Complete the cell below to create a method that: </p>
<ul>
    <li> Takes train and test data arrays as parameters <br>
        HINT: develop your code using train_x, test_x,train_y,test_y for the iris data from above</li>
    <li> Runs your SimpleKNNClassifier with K={1,3,5,7,9} and stores the test accuracy for each <br>
    HINT: you could use:
        <ul>
            <li>a for loop to run the algorithm with different values k for the number of neighbours (K),</li>
            <li> an <a href=https://www.geeksforgeeks.org/formatted-string-literals-f-strings-python/>f-string</a> e.g. <code>experiment_name= f'KNN_K={k}'</code> to create a meaningful name for each run </li>
            <li>a   dictionary to store your results, where each experiment has the string <em>experiment_name</em> as the key and the accuracy as the value </li>
            </ul></li>
    <li> Runs a DecisionTreeClassifier with all the different combinations of hyper-parameters from activity 3<br>
       HINT: You could do this in the same way as I've suggested above but with nested for-loops (one for each hyper-parameter) and a more complex python f-string to create the name (key), then store the results in the same dictionary.  </li>
    <li> Reports the results and which algorithm-hyperparameter combination has the highest test accuracy</li>
</ul>
</div>
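
The naming-and-dictionary pattern suggested in the hints might look something like the sketch below. It is deliberately partial: it only tries a decision tree with two of the hyper-parameters, uses `itertools.product` in place of nested for-loops, and makes its own repeatable split (`random_state=0` is arbitrary). The names `results` and `best_name` are illustrative, not required.

```python
from itertools import product

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Make a fresh, repeatable train/test split of the iris data
x, y = load_iris(return_X_y=True)
tr_x, te_x, tr_y, te_y = train_test_split(
    x, y, test_size=0.33, stratify=y, random_state=0
)

results = {}  # experiment name -> test accuracy (%)

# product() yields every (depth, minleaf) pair, like two nested for-loops
for depth, minleaf in product((1, 3), (1, 5)):
    experiment_name = f"DT_depth={depth}_minleaf={minleaf}"
    model = DecisionTreeClassifier(
        random_state=1234, max_depth=depth, min_samples_leaf=minleaf
    )
    model.fit(tr_x, tr_y)
    results[experiment_name] = 100 * model.score(te_x, te_y)

# Report every result, then the key with the highest stored accuracy
for name, acc in results.items():
    print(f"{name}: {acc:.1f} %")
best_name = max(results, key=results.get)
print(f"best: {best_name} with {results[best_name]:.1f} %")
```

Note that `max` returns the first key it meets if several experiments tie for the highest accuracy, so you may want to report ties explicitly.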


In [None]:
def first_ml_test_harness(train_x:np.ndarray,train_y:np.ndarray,
                          test_x:np.ndarray,test_y:np.ndarray):
    """ code to compare supervised machine learning algorithms on a dataset"""
    # your code here
    
    print('not implemented yet')
    

In [None]:
#now run your code for the iris data
first_ml_test_harness(train_x, train_y, test_x, test_y)

<div class="alert alert-block alert-danger"> Please save your work (click the save icon) then shutdown the notebook when you have finished with this tutorial (menu->file->close and shutdown notebook)</div>

<div class="alert alert-block alert-danger"> Remember to download and save your work if you are not running this notebook locally.</div>