# Workbook 6: Supervised Machine Learning

## Description and aims

This tutorial is designed to give you your first experience of machine learning in practice by implementing a simple nearest-neighbour classifier.

The learning outcomes are:
- experience of implementing the K Nearest Neighbours classification algorithm
- experience of using the sklearn DecisionTree classification algorithm
-  experience of working through different preprocessing steps to try and improve the performance of your classifier

## Activity 1: Getting to know your data

We will start by importing and visualising the two datasets used as examples in the lecture: student marks and Iris.
### You should already have uploaded the data and figures from the lecture materials folder - if not, do so now.
### Then run the 5 code cells below to load and display the two datasets

In [None]:
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

def show_scatterplot_matrix(X,y,featureNames,title=None):
    f = X.shape[1]
    if(len(y) != X.shape[0]):
        print("Error,   the y array  must have the same length as there are rows in X")
        return
    fig, ax = plt.subplots(f,f,figsize=(12,12))
    plt.set_cmap('jet')
    for feature1 in range(f):
        ax[feature1,0].set_ylabel( featureNames[feature1])
        ax[0,feature1].set_xlabel( featureNames[feature1])
        ax[0,feature1].xaxis.set_label_position('top') 
        for feature2 in range(f):
            xdata = X[:,feature1]
            ydata = X[:,feature2]
            ax[feature1, feature2].scatter(xdata,ydata,c=y)
    if title is not None:
        fig.suptitle(title,fontsize=16,y=0.925)
        
        
# simple function - currently only works for 2D data - but could easily be extended
def PlotDecisionSurface(trainX,trainy,theClassifier,theTitle,featureNames,xvar=0,yvar=1,stepSize=2.0,minZero=False):
    #create and prettify the plot
    cmap="Set3"
    fig,ax= plt.subplots(figsize=(8, 8))
    ax.set_title(theTitle)
    ax.set_xlabel(featureNames[xvar])
    ax.set_ylabel(featureNames[yvar])

    #define a grid we use to plot the decision boundaries
      #get max/min values for grid edges
    columnMax,columnMin = np.max(trainX,axis=0), np.min(trainX,axis=0)
    if(minZero==True):
        x_min , y_min= 0,0
    else:
        x_min, y_min = columnMin[ xvar]*0.95, columnMin[yvar]*0.95
    x_max, y_max = columnMax[xvar]*1.05, columnMax[yvar]*1.05 
    #make the grid
    xx, yy = np.meshgrid(np.arange(x_min, x_max, stepSize),np.arange(y_min, y_max, stepSize))

    #predict and plot for every point on the grid
    Z = theClassifier.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    ax.contourf(xx, yy, Z,cmap=cmap)

    # Plot also the training points
    ax.scatter(x=trainX[:,xvar ],y= trainX[:, yvar], c=trainy.astype(float), alpha=1.0, cmap=cmap, edgecolor="black")
            

### The Student marks dataset

In [None]:

grades= np.genfromtxt("../lectures/data/assessment-grades-2features.csv", delimiter= ',',skip_header=1)

featureNames=("exam", "CW_mean")
nStudents = grades.shape[0]

outcomes= ("Pass","Resit Exam", "Resit Coursework","Resit Both")
simpleoutcomes= ("pass","resit")

# make target labels
result = np.empty(nStudents, dtype=np.int8)

for row in range (nStudents):
    exam = grades[row][0]
    cw   = grades[row][1]
    if (exam>=35 and cw>=35 and (exam +cw >=80) ):
        result[row] = 0 # PASS 

    elif ( cw>=40 and exam < 40):
        result[row] = 1 #resit just exam 
    elif ( cw<40 and exam>=40):
        result[row]= 2 # resit just coursework
    else:
        result[row] = 3  # resit both
        
simpleResult = np.where(result<1,0,1)

In [None]:
# easiest to split the data into 4/2 subgroups to plot the outcomes / simplified outcomes

passStudents = np.empty((0,2))
resitCWStudents = np.empty((0,2))
resitExamStudents = np.empty((0,2))
resitBothStudents = np.empty((0,2))

for student in range (nStudents):
    if (result[student]==0):
        passStudents = np.vstack( (passStudents,grades[student]) )
    elif (result[student]==1):
        resitExamStudents = np.vstack( (resitExamStudents,grades[student]) )
    elif (result[student]==2):
        resitCWStudents = np.vstack( (resitCWStudents,grades[student]) )
    else:
        resitBothStudents = np.vstack( (resitBothStudents,grades[student]) )
simpleResitStudents = np.vstack( (resitExamStudents,resitCWStudents,resitBothStudents))

print(passStudents.shape)
print(resitExamStudents.shape)
print(resitCWStudents.shape)
print(resitBothStudents.shape)
print(simpleResitStudents.shape)
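
The same subgroups can also be pulled out with boolean masks instead of a loop. The sketch below is an alternative, not the approach used in the cell above. It uses a tiny made-up `toyGrades` array and matching `toyResult` labels (both hypothetical) so that it runs on its own.

```python
import numpy as np

# hypothetical toy data: one student per outcome, columns are (exam, coursework)
toyGrades = np.array([[70.0, 60.0], [30.0, 50.0], [50.0, 30.0], [20.0, 20.0]])
toyResult = np.array([0, 1, 2, 3], dtype=np.int8)

# a boolean mask keeps every row whose label satisfies the condition
toyPass = toyGrades[toyResult == 0]
toyResit = toyGrades[toyResult != 0]

print(toyPass.shape)   # (1, 2)
print(toyResit.shape)  # (3, 2)
```

One difference to be aware of: masking keeps the rows in their original order, whereas stacking the three resit groups with `np.vstack` orders them group by group.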

In [None]:
fig,ax = plt.subplots(1,2,figsize=(14,5))
for axis in ax:
    axis.set_xlabel("Exam")
    axis.set_ylabel("Coursework")
ax[0].set_title("Outcomes")
ax[1].set_title("Simplified Outcomes")

ax[0].scatter(passStudents[:,0],passStudents[:,1],label = "Pass" )
ax[0].scatter(resitExamStudents[:,0],resitExamStudents[:,1],label = "Resit Exam" )
ax[0].scatter(resitCWStudents[:,0],resitCWStudents[:,1],label = "Resit CW" )
ax[0].scatter(resitBothStudents[:,0],resitBothStudents[:,1],label = "Resit Both" )
ax[1].scatter(passStudents[:,0],passStudents[:,1],label = "Pass" )
ax[1].scatter(simpleResitStudents[:,0],simpleResitStudents[:,1],label = "Resit" )

ax[0].legend(loc='lower right')
ax[1].legend(loc='lower right') 

### Example 2:  Iris flowers <img src="../lectures/figures/ML/Iris-image.png" style="float:right">
- classic Machine Learning Data set
- 4 measurements: sepal and petal width and length
- 50 examples from each of 3 sub-species of iris flowers
- three class problem:
 - so for some types of algorithm we have to decide whether to make  
   a 3-way classifier or nested 1-vs-rest classifiers
- most ML classifiers can get over 90% accuracy



In [None]:
import sklearn.datasets
irisX,irisy = sklearn.datasets.load_iris(return_X_y=True)
columnLabels= ("sepal_length", "sepal_width", "petal_length", "petal_width")
title="Scatterplots of 2D slices through the 4D Iris data"
show_scatterplot_matrix(irisX,irisy,columnLabels,title)

## Activity 2: Implementing K-Nearest Neighbours
Below is the pseudocode for the K-nearest Neighbours algorithm.
- Make sure you understand this,   
- Then read the cell below which is my implementation for K=1

### Pseudocode for KNearest Neighbours
**init()**  :  
SPECIFY function to calculate distance metric d(i,j) for any two items *i* and *j*     
  e.g. Euclidean (continuous variables) or Hamming (categorical)  
SET value of K

**fit(trainingData)** :  

SET numExemplars = READ(number of rows in training data)  
SET numFeatures = READ(number of columns in training data)  
*#Just store a local copy of the training data as two arrays:*   
CREATE_AND_FILL(X_train of shape (numExemplars , numFeatures)).     
CREATE_AND_FILL(y_train of shape( numExemplars))
  
**predict(newItems)** :  


*Step 1:   Make 2D array distances of shape (num_newItems , numExemplars)*   
SET numToPredict = READ(number of rows in newItems)  
FOREACH newItem in (0...,numToPredict-1)    
...FOREACH exemplar in (0,...,numExemplars -1)    
.....SET distances [newItem] [exemplar] = d (newItems[newItem] , X_train[exemplar] )   

*Step 2: Get indexes of the k nearest neighbours for each new item*    
SET closestK = CREATE(2DArray with numToPredict rows and K columns)  
FOREACH newItem in (0...,numToPredict-1)        
...SET distFromNewItem = CREATE(2D array with  2 columns, and numExemplars rows)  
...FOREACH exemplar in (0,...,numExemplars -1)    
.......SET distFromNewItem[exemplar][0] = distances[newItem][exemplar] *#column 0 holds distance from new item*  
.......SET distFromNewItem[exemplar][1] = exemplar          *#column 1 holds the index in the training set*  
...SET  sortedByDist = SORT (rows of distFromNewItem by increasing distances (column 0) )  
...FOREACH k in ( 0,...,K-1)  
.....SET closestK[newItem][k] = sortedByDist[k][1] *#column 1 of each row holds the index*   
 
  
*Step 3: Store majority vote in a  1D array y_pred with numToPredict entries*     
FOREACH newItem in (0...,numToPredict-1)  
...SET labelcounts = CREATE(1D array with one zero value per class label)  
...FOREACH  k in (0,...K-1)   
...... SET thisindex = closestK[newItem][k]  
...... SET thislabel = y_train[thisindex]  
...... INCREMENT labelcounts[thislabel]  
...SET thisPrediction = READ(index of labelcounts with highest value)    
...SET y_pred[newItem] = thisPrediction  
 
RETURN y_pred  
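
Two pieces of the pseudocode can be sketched in isolation: the distance metrics named in **init()** and the majority vote of Step 3. The function names `euclidean`, `hamming` and `majority_vote` are illustrative assumptions, not part of the workbook's code. Ties in the vote go to the lowest label here, simply because that is how `np.argmax` behaves.

```python
import numpy as np

def euclidean(a, b):
    # straight-line distance between two continuous feature vectors
    return float(np.sqrt(np.sum((a - b) ** 2)))

def hamming(a, b):
    # number of positions at which two categorical vectors differ
    return int(np.sum(a != b))

def majority_vote(neighbour_labels, num_classes):
    # count how often each label occurs, then pick the most common one
    labelcounts = np.bincount(neighbour_labels, minlength=num_classes)
    return int(np.argmax(labelcounts))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # 5.0
print(hamming(np.array([1, 0, 2]), np.array([1, 1, 2])))      # 1
print(majority_vote(np.array([2, 0, 2]), num_classes=3))      # 2
```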



In [None]:
# Example for K = 1 
from sklearn.metrics.pairwise import euclidean_distances
class simple_1NN:

    def __init__(self):
        self.K=1
        
    def fit(self,X,y):
        self.numExemplars = X.shape[0]
        self.numFeatures = X.shape[1]
        self.modelX = X
        self.modelY = y
        
    def predict(self,newItems):
        numToPredict = newItems.shape[0]
        yPred = np.zeros(numToPredict)  # 1D array: one predicted label per new item
        
        # measure distances - creates an array with numToPredict rows and num_trainItems columns
        dist = euclidean_distances(newItems,self.modelX)

        #make predictions: This is K=1, TO DO- in your own time extend to work with K>1
        closest = np.argmin(dist, axis=1) 
        # this is a 1D array with numToPredict entries, 
        # closest[i] holds the index (0...self.numExemplars-1) of the column j with the smallest value of dist[i][j] 
        for item in range(numToPredict):
            yPred[item] = self.modelY [ closest[item]]
        
        return yPred

### Activity 2.1 Run the code provided for K=1 with the two datasets and make sure you understand the outputs and how they are produced
- for the marks dataset this creates a plot to show a decision surface
- for the Iris data set this uses a confusion matrix

**The Marks dataset - illustrating a 2D Decision surface**

In [None]:


# create and train the classifier
myKNNmodel = simple_1NN()
myKNNmodel.fit(grades,simpleResult) 

#visualise the decision surface
PlotDecisionSurface(grades, simpleResult, myKNNmodel,"1-NN simplified outcomes", ("exam","cw"),minZero=True)

**The Iris dataset - illustrating a confusion matrix**

Example: a confusion matrix for lateral flow tests

| Actual | Predicted Covid | Predicted Not Covid |
|---|---|---|
| Covid | 50 | 50 |
| Not Covid | 0 | 100 |
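
To see what a confusion matrix tells you, the sketch below reads a few standard summary measures off the lateral flow example. The numbers are the ones in the table; the variable names are my own.

```python
import numpy as np

# rows are actual (covid, not covid), columns are predicted (covid, not covid)
cm = np.array([[50, 50],
               [0, 100]])

tp, fn = cm[0]   # actual covid: correctly flagged / missed
fp, tn = cm[1]   # actual not covid: wrongly flagged / correctly cleared

accuracy = (tp + tn) / cm.sum()   # fraction of all tests that were right
sensitivity = tp / (tp + fn)      # fraction of covid cases detected
specificity = tn / (tn + fp)      # fraction of non-covid cases cleared

print(accuracy, sensitivity, specificity)  # 0.75 0.5 1.0
```

The overall accuracy of 0.75 hides the fact that the test misses half of the actual covid cases, which is why the full matrix is more informative than a single number.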
 



In [None]:
# make train/test split 
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

irisX,irisy = load_iris(return_X_y = True)
X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)


model = simple_1NN()
model.fit(X_train,y_train)
ypred = model.predict(X_test)
confusionMatrix = np.zeros((3,3),int)
for i in range(len(y_test)):
    actual = int(y_test[i])
    predicted = int(ypred[i])
    confusionMatrix[actual][predicted] += 1
print(confusionMatrix)
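
If you want to cross-check a hand-built confusion matrix, sklearn provides `confusion_matrix` in `sklearn.metrics`. The sketch below is self-contained, so it uses sklearn's `KNeighborsClassifier` with `n_neighbors=1` as a stand-in for `simple_1NN`, and a fixed `random_state` (my assumption, only there to make the run repeatable).

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

irisX, irisy = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    irisX, irisy, test_size=0.33, stratify=irisy, random_state=0)

# stand-in for simple_1NN so that this sketch runs on its own
standIn = KNeighborsClassifier(n_neighbors=1)
standIn.fit(X_train, y_train)
ypred = standIn.predict(X_test)

# rows are actual classes, columns are predicted classes
cm = confusion_matrix(y_test, ypred)
print(cm)

# correct predictions sit on the diagonal, so accuracy = diagonal sum / total
accuracy = cm.trace() / cm.sum()
print(accuracy)
```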



### Activity 2.2 (stretch): edit the code in the cells above to produce:
- a confusion matrix for the marks dataset
- a decision surface for the four-class version of the marks dataset  
  i.e. using the labels held in the array "result" instead of "simpleResult"
- a decision surface for the Iris Data  
  you will need to choose just two features and use 'slicing' to create a training set with just those columns
  - you could use the first two, or the best two you identified last week
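
As a reminder of the slicing syntax, the sketch below pulls two columns out of the Iris data in two different ways. Which pair of features is "best" is left for you to decide; the names `firstTwo` and `petalOnly` are just illustrative.

```python
from sklearn.datasets import load_iris

irisX, irisy = load_iris(return_X_y=True)

# all rows, first two columns (sepal length and sepal width)
firstTwo = irisX[:, :2]

# all rows, an arbitrary pair of columns chosen by index (here the two petal measurements)
petalOnly = irisX[:, [2, 3]]

print(firstTwo.shape, petalOnly.shape)  # (150, 2) (150, 2)
```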

### Activity 2.3: Create your own implementation of K-Nearest Neighbours
Using the code above,  extend the predict method for the class simple_1NN  to use the votes from K>1 neighbours.


These are the lines you will need to change:
````  
#make predictions: This is K=1, TO DO- in your own time extend to work with K>1
        closest = np.argmin(dist, axis=1) 
        for item in range(numToPredict):
            yPred[item] = self.modelY [ closest[item]]
```` 

Some hints: 
- Make sure you change the class name to something appropriate
- You should  set (and store) the value of K in the constructor method
- Suppose I have an array myDist of (say, for simplicity) five values,  
   and I want to sort them by size while keeping track of the item ids,  
   so that I can find the K items with the smallest values in the original array. 
   
   The image shows the idea of how to do this <img src="howto-get-K-closest.png" style="float:right" width = 50%>


   I can do it using the code below, which:
   - creates a 2d array holding each value and its index in the original array
   - then makes a new array which is the 'unSorted' 2d array, but sorted by the values in the first column ([0])
   - then looks in the second column (which holds the indexes) for the first K rows

In [None]:
# This cell just contains some hints about how to produce a sorted array
myDist = np.array([4,7,1,3,9])

myUnsortedArray = np.empty((5,2))
print(myUnsortedArray.shape)
for row in range (5):
    myUnsortedArray [row][0] = myDist[row]
    myUnsortedArray [row][1] = row
print('myUnsortedArray contents before sorting')
print(myUnsortedArray)

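# sort the rows by the values in column 0 (one way to do it: index the rows with np.argsort)
mySortedArray = myUnsortedArray[np.argsort(myUnsortedArray[:, 0])]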

print('mySortedArray contents after sorting')
print(mySortedArray)

print('Last week we learned about slicing- we will now use slicing to pull out all rows but just specific columns')
print(' the values in array myDist in ascending order are: {}'.format(mySortedArray[:,0]))
print(' the positions those values were in, again by ascending order of value are : {}'.format(mySortedArray[:,1]))
print('Now remember you do not have to choose every row ... ')
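
A shorter route to the same indexes is `np.argsort`, which returns the positions that would sort an array. This is only a sketch of the idea for a single row of distances; `K` and `closestK` are names I have chosen for illustration.

```python
import numpy as np

myDist = np.array([4, 7, 1, 3, 9])
K = 3

# positions of the values in ascending order of value
order = np.argsort(myDist)
print(order)             # [2 3 0 1 4]

# slicing keeps just the first K positions, i.e. the K smallest values
closestK = order[:K]
print(closestK)          # [2 3 0]
print(myDist[closestK])  # [1 3 4]
```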

In [None]:
# your KNN class code here

### Activity 2.4: Test your implementation on the two example datasets

**Use the student marks for qualitative judgements** : how does the decision surface change?  
**Use the Iris data set for quantitative judgements** :  how does the confusion matrix change?

- use the toolbar to copy and paste the two cells from activity 2.1 below here
- then edit them so that they create and use objects of your new class, instead of the class simple_1NN

- start with K=1 - this should produce the same results as you got in activity 2.1
- then try with K = {3,5,7}
- what happens to the accuracy?
- what happens to the decision surface?
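
If you would like a reference point to compare your own class against, the sketch below runs the same experiment with sklearn's `KNeighborsClassifier`. That is a swapped-in library classifier, not your implementation, and the fixed `random_state` is my assumption so that every value of K sees the same split.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

irisX, irisy = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    irisX, irisy, test_size=0.33, stratify=irisy, random_state=0)

accuracies = {}
for K in (1, 3, 5, 7):
    reference = KNeighborsClassifier(n_neighbors=K)
    reference.fit(X_train, y_train)
    # score() returns the fraction of test items predicted correctly
    accuracies[K] = reference.score(X_test, y_test)

for K, accuracy in accuracies.items():
    print("K={}: accuracy={:.3f}".format(K, accuracy))
```

Small differences from your own results are possible, for example in how ties between neighbours are broken.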


## Activity 3: Decision Trees





### Activity 3.1: Run the next cell to remind yourself how the decision tree model is grown
The next cell just illustrates how the tree induction process works for the student marks dataset.
It just calls the decision tree repeatedly for increasing depths.
For depth 0 I've just created a text box with the relevant stats in.
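
The `gini` value in the depth-0 box can be reproduced from the class counts with the standard Gini impurity formula: one minus the sum of the squared class proportions. The helper name `gini_impurity` is my own; the counts are the `value=[138,12]` shown in that box.

```python
import numpy as np

def gini_impurity(class_counts):
    # proportion of samples in each class
    counts = np.asarray(class_counts, dtype=float)
    proportions = counts / counts.sum()
    # 0 means a perfectly pure node; higher values mean more mixing
    return 1.0 - np.sum(proportions ** 2)

print(round(gini_impurity([138, 12]), 3))  # 0.147
print(gini_impurity([150, 0]))             # 0.0
```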


In [None]:
from sklearn.tree import DecisionTreeClassifier 
from sklearn import tree

fig,ax = plt.subplots(1,3,figsize=(18,8))
fig.suptitle("Illustration of how Decision Trees select and insert nodes to increase data purity")
for depth in range (0,3):
    if(depth==0):
        ax[0].text(0.25, 0.6, " gini=0.147\n samples=150,\n value=[138,12],\n class=pass",fontsize=14, 
        bbox={'facecolor': 'darkOrange', 'alpha': 0.5, 'pad': 10})
        ax[0].axes.get_yaxis().set_visible(False)
        ax[0].axes.get_xaxis().set_visible(False)
        ax[0].set_frame_on(False)
        ax[0].set_title("Depth 0")
    else:
        model = DecisionTreeClassifier(random_state=1234, max_depth=depth,min_samples_split=2,min_samples_leaf=1)
        model.fit(grades,simpleResult)
        _ = tree.plot_tree(model, feature_names=("exam","coursework"), class_names= ("pass","resit"),filled=True,ax=ax[depth])
        ax[depth].set_title("Depth "+str(depth))
        
fig.savefig("DecisionTreeExample-studentMarks.png")

### Activity 3.2: exploring how to control tree-growth to prevent over-fitting

The aim of this activity is for you to experiment with what happens when you change three parameters that affect how big and complex the tree is allowed to get.
- max_depth
- min_samples_split, (default value is 2)
- min_samples_leaf, (default value is 1)

Experiment with the Iris data set below to see if you can work out what each of these parameters does, and how it affects the tree 

- Each time you run the  cell below, it will give you a different train-test split of the Iris data.
  Does this affect what tree you get?
  
- Is there a combination of values that means you consistently get similar trees?
- What is a good way of judging 'similarity'?
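
One possible (and fairly crude) way to compare trees is by their size. The sketch below records the depth and number of leaves over several different splits, for an unconstrained tree and for one set of constraints. The helper `tree_sizes`, the seeds and the particular parameter values are all assumptions for illustration; you should try your own.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

irisX, irisy = load_iris(return_X_y=True)

def tree_sizes(numRepeats=5, **treeParams):
    # fit one tree per train/test split and record (depth, number of leaves)
    sizes = []
    for seed in range(numRepeats):
        X_train, X_test, y_train, y_test = train_test_split(
            irisX, irisy, test_size=0.33, stratify=irisy, random_state=seed)
        model = DecisionTreeClassifier(random_state=1234, **treeParams)
        model.fit(X_train, y_train)
        sizes.append((model.get_depth(), model.get_n_leaves()))
    return sizes

unconstrained = tree_sizes()
constrained = tree_sizes(max_depth=2, min_samples_leaf=5)

print("unconstrained (depth, leaves):", unconstrained)
print("constrained   (depth, leaves):", constrained)
```

Trees of the same size can still split on different features or thresholds, so size alone does not settle the question.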

In [None]:

# load iris dataset and split into train:test
iris = sklearn.datasets.load_iris()
irisX = iris.data
irisy = iris.target

X_train, X_test, y_train, y_test = train_test_split(irisX, irisy, test_size=0.33,stratify=irisy)

model = DecisionTreeClassifier(random_state=1234, max_depth=None,min_samples_split=2,min_samples_leaf=1)
model.fit(X_train,y_train)
ypred = model.predict(X_test)
confusionMatrix = np.zeros((3,3),int)
for i in range(len(y_test)):
    actual = int(y_test[i])
    predicted = int(ypred[i])
    confusionMatrix[actual][predicted] += 1
print(confusionMatrix)



fig = plt.figure(figsize=(12,12))
_ = tree.plot_tree(model, 
                   feature_names=iris.feature_names,  
                   class_names=iris.target_names,
                   filled=True)


## Activity 4: (stretch)
Using the code from last week,  apply a StandardScaler to the Iris data set and evaluate the effect this has on the accuracy.

Because there is a random element in how  the data set is split into training / test split,  it is not valid just to split the data once then compare the results with / without scaling.

Instead  you will need to do ten repeats  of:
- Use the sklearn method to split the data into 67:33 train/test sets (test_size=0.33, as before)
- Construct,  train, and test,  an instance of your kNN model on the unscaled data and store its accuracy 
- Create an instance of the standard scaler and then:
  - call its fit() method to set its parameters from the training set.
  - call its transform() method for both the training and test sets
  - Construct,  train, and test,  an instance of your kNN model on this scaled data and store its accuracy 

That should give you ten pairs of values (one per repeat) for the scaled and raw data accuracy.  
Use an online statistical tool (e.g. https://www.graphpad.com/quickcalcs/ttest1.cfm) that lets you copy your data in, then perform a 'paired t-test' to find out whether the difference in accuracy between the scaled and raw data is statistically significant.
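
If you prefer to stay inside the notebook, `scipy.stats.ttest_rel` performs a paired t-test. The sketch below shows the shape of the whole experiment under some assumptions of mine: sklearn's `KNeighborsClassifier` with K=3 stands in for your own kNN class, and the repeat number is used as the `random_state` so that the ten splits differ but are repeatable.

```python
from scipy.stats import ttest_rel
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

irisX, irisy = load_iris(return_X_y=True)
raw_acc, scaled_acc = [], []

for repeat in range(10):
    X_train, X_test, y_train, y_test = train_test_split(
        irisX, irisy, test_size=0.33, stratify=irisy, random_state=repeat)

    # unscaled data
    rawModel = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)
    raw_acc.append(rawModel.score(X_test, y_test))

    # scaler parameters come from the training set only, then are applied to both sets
    scaler = StandardScaler().fit(X_train)
    scaledModel = KNeighborsClassifier(n_neighbors=3).fit(scaler.transform(X_train), y_train)
    scaled_acc.append(scaledModel.score(scaler.transform(X_test), y_test))

# paired test: each repeat contributes one (scaled, raw) pair from the same split
t_stat, p_value = ttest_rel(scaled_acc, raw_acc)
print("raw accuracies:   ", raw_acc)
print("scaled accuracies:", scaled_acc)
print("t = {}, p = {}".format(t_stat, p_value))
```

A small p-value suggests the difference between the paired accuracies is unlikely to be down to the random splits alone; it does not by itself say which version is better, so look at the sign of the differences too.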

<div class="alert alert-block alert-danger"> Please save your work (click the save icon) then shut down the notebook when you have finished with this tutorial (menu->file->close and shutdown notebook).</div>

<div class="alert alert-block alert-danger"> Remember to download and save your work if you are not running this notebook locally.</div>