# The leave-one-out algorithm

This Notebook demonstrates how to use the leave-one-out algorithm to estimate the best value of *k* for a *k*-NN classifier built on the patient dataset shown in Figure 20.3. As you saw in Section 3.4 of Part 20, the algorithm lets us compare candidate values of *k* using only the labelled data we already have.

You should spend about 1 hour on this Notebook.

In [None]:
# This Notebook uses the following libraries:

import pandas as pd
from sklearn.neighbors import KNeighborsClassifier


# This line suppresses a warning about a future deprecation in
# the KNeighborsClassifier functions; you should ignore it
import warnings
warnings.filterwarnings("ignore", category=DeprecationWarning) 

## The leave-one-out algorithm

In this case, our challenge is to use the leave-one-out algorithm to work out the optimum value of *k* for estimating whether a new patient should be classified as belonging to group A or group B. The algorithm was described in Part 20, and is repeated here (with Steps 6 and 7 added):

Step 1: Select a value of *k* to examine.
    
Step 2: Select one of the *n* labelled data points as the validation data. The remaining *n*-1 data points are used as a training set.

Step 3: Build a *k*-NN classifier with the *n*-1 training data points, and use this to predict the class of the validation data point. Check the predicted class against the actual class of the validation data point.

Step 4: Repeat Steps 2 and 3 for each of the *n* labelled data points, each time choosing a different data point as the validation data and using the remaining *n*-1 data points as training data.

Step 5: Calculate an error rate as the ratio of incorrect classifications (*f*) to the total number of labelled data points (*n*), i.e. error rate = *f*/*n*.

Step 6: With a different value of *k*, repeat Steps 2 to 5. Repeat this step until all values of *k* are examined.

Step 7: Choose the value of *k* with the lowest error rate as an empirical optimal value. If there is a tie, choose the smallest *k*.
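The seven steps above can be sketched without any libraries, which makes the structure of the algorithm easier to see. The following is a minimal illustration only, not the implementation developed in this Notebook: the data values are made up, and the names `points`, `labels`, `knn_predict` and `leave_one_out_error` are hypothetical.

```python
from collections import Counter
import math

# Made-up toy data for illustration only: two well-separated groups.
points = [(1.0, 6.0), (1.5, 6.5), (2.0, 6.0), (1.0, 7.0),
          (5.0, 8.0), (5.5, 8.5), (6.0, 8.0), (5.0, 9.0)]
labels = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']


def knn_predict(train_points, train_labels, query, k):
    """Majority vote among the k training points nearest to query."""
    by_distance = sorted(zip(train_points, train_labels),
                         key=lambda pair: math.dist(pair[0], query))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]


def leave_one_out_error(points, labels, k):
    """Steps 2 to 5: return the error rate f/n for one value of k."""
    n = len(points)
    failures = 0
    for i in range(n):
        # Step 2: hold out point i and train on the other n-1 points
        train_points = points[:i] + points[i + 1:]
        train_labels = labels[:i] + labels[i + 1:]
        # Step 3: compare the predicted class with the actual class
        if knn_predict(train_points, train_labels, points[i], k) != labels[i]:
            failures += 1
    # Step 5: error rate = f/n
    return failures / n


# Steps 1 and 6: examine each candidate value of k
error_rates = {k: leave_one_out_error(points, labels, k) for k in (1, 3, 5, 7)}

# Step 7: lowest error rate wins; on a tie, the smallest k wins
best_k = min(error_rates, key=lambda k: (error_rates[k], k))

print(error_rates)
print(best_k)
```

On this toy data the error rate is 0 for *k*=1, 3 and 5, and 1 for *k*=7 (each held-out point then has four neighbours from the other group and only three from its own), so the tie-break in Step 7 selects *k*=1.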


The task of implementing the leave-one-out algorithm here is best carried out in two stages. 


For stage 1, we will develop a function which takes a single member of a dataset, and uses the remaining data to classify it with the *k*-NN algorithm. 

For stage 2, we will develop a second function which uses the function from stage 1 to calculate how many members of the dataset were correctly classified.


For each of the two stages, we have provided a description of the required function and a suggested solution, which you can use. However, you will gain much more benefit if you attempt to write the functions yourself before looking at our proposed solutions, even if you do not manage to build complete working functions.

### Stage 1

As stated above, we first require a function which will take a single member of a dataset, and use the remaining data to classify it with the *k*-NN algorithm.

We will call the function `classify_single_case`. This function should take four arguments: a DataFrame of training data, a Series of target values, an index term, and a value *k* for the classifier.

The function should then create a *k*-NN classifier, train it with all the data *except* the data point indexed by the index term, and return the class which the trained classifier predicts for the indexed term.
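As a hint, the two pandas operations that such a function relies on can be tried out on a toy DataFrame first. The sketch below uses made-up values and hypothetical names (`toy_df`, `toy_ss`). It shows how `drop` removes one row by its index label, and how selecting with a list of labels keeps the held-out row as a one-row DataFrame, which is the two-dimensional shape that a scikit-learn classifier's `predict` method expects.

```python
import pandas as pd

# Made-up values for illustration only
toy_df = pd.DataFrame({'x': [1.0, 2.0, 3.0],
                       'y': [4.0, 5.0, 6.0]})
toy_ss = pd.Series(['A', 'B', 'A'])

ix = 1

# Everything except the row labelled ix
rest_df = toy_df.drop(ix, axis='index')
rest_ss = toy_ss.drop(ix)

# A list of labels gives a one-row DataFrame (two-dimensional)
held_out_df = toy_df.loc[[ix]]

# A single label gives a Series (one-dimensional)
held_out_row = toy_df.loc[ix]

print(rest_df.shape, held_out_df.shape, held_out_row.shape)
```

Here `rest_df` keeps the rows labelled 0 and 2, `held_out_df` has shape `(1, 2)`, and `held_out_row` has shape `(2,)`.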

The following cell contains a suggested implementation of the function `classify_single_case`. Before reading the cell, you should consider how you would write the function, and possibly attempt to implement it.

In [None]:
def classify_single_case(trainingData_df, targetValues_ss, ix, k):
    '''Use k-NN to classify the member of trainingData_df with index
       ix using a k-nearest neighbours classifier. The classifier is
       trained on the data in trainingData_df and the classes in
       targetValues_ss, with the data point indexed by ix omitted.
       Returns the class assigned to the data point with index ix.
    '''

    # Create a classifier instance to do k-nearest neighbours
    myClassifier = KNeighborsClassifier(n_neighbors=k,
                                        metric='euclidean',
                                        weights='uniform')

    # Now train the classifier on all data points except
    # the one indexed by ix
    myClassifier.fit(trainingData_df.drop(ix, axis='index'),
                     targetValues_ss.drop(ix))

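    # Return the class predicted by the trained classifier.
    # predict expects a two-dimensional input, so the row is selected
    # with a list of labels to give a one-row DataFrame, not a Series.
    return myClassifier.predict(trainingData_df.loc[[ix]])[0]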

We can check the operation of `classify_single_case` by comparing its outputs to those in Figure 20.8 in Part 20, Section 3.4. There, we can see that the point B8 is correctly classified by the algorithm when *k*=3, but incorrectly classified when *k*=5.

We can try the function on the patients dataset by applying it to point B8 (which has index 17 in the DataFrame):

In [None]:
'''

Predict the class of the data point with index 17, using a k-NN classifier
with k=3

The actual class of the data point with this index is 'B'

'''
    
# Import the patient data as a DataFrame
patientData_df = pd.read_csv('data/patients.csv')

# Use the two columns 'Exercise time (hours)' and 
# 'Sleep time (hours)' for the training data
trainingData_df = patientData_df[['Exercise time (hours)', 'Sleep time (hours)']]

# Use the column 'Patient group' as the target values
targetValues_ss = patientData_df['Patient group']

# Return the predicted value of the data point with index 17 for k=3:
classify_single_case(trainingData_df,
                     targetValues_ss,
                     17,
                     3)

In [None]:
'''

Predict the class of the data point with index 17, using a k-NN classifier
with k=5

The actual class of the data point with this index is 'B'

'''
    
# Import the patient data as a DataFrame
patientData_df = pd.read_csv('data/patients.csv')

# Use the two columns 'Exercise time (hours)' and 
# 'Sleep time (hours)' for the training data
trainingData_df = patientData_df[['Exercise time (hours)', 'Sleep time (hours)']]

# Use the column 'Patient group' as the target values
targetValues_ss = patientData_df['Patient group']

# Return the predicted value of the data point with index 17 for k=5:
classify_single_case(trainingData_df,
                     targetValues_ss,
                     17,
                     5)

As we would hope, the function classifies the point correctly (as 'B') for *k*=3, but incorrectly for *k*=5.

### Stage 2

Having defined the `classify_single_case` function in Stage 1, we can now use it to estimate the best value of *k* for a classifier applied to the [Wholesale customers data full.csv](./data/Wholesale%20customers%20data%20full.csv) dataset.

As before, we will use the `Channel` column as the target values, and the `Fresh`, `Milk`, `Grocery`, `Frozen`, `Detergents_Paper` and `Delicatessen` columns as the training data (that is, we will not use the `Region` column). First, import the data file, and create a DataFrame with the appropriate training data, and a Series with the appropriate classes.

In [None]:
icmaData_df = pd.read_csv('data/Wholesale customers data full.csv')

icmaTrainingData_df = icmaData_df[['Fresh', 'Milk', 'Grocery', 'Frozen', 
                                   'Detergents_Paper','Delicatessen']]
icmaTargetValues_ss = icmaData_df['Channel']


Next, to obtain a list of predicted values for some *k*, we apply the function `classify_single_case` to the training data for each data point. For example, the predicted values for *k*=5 are:

In [None]:
[classify_single_case(icmaTrainingData_df,
                      icmaTargetValues_ss,
                      i,
                      5)
 for i in icmaTrainingData_df.index]

To identify the number of discrepancies between the predicted values and the actual values, we can compare the list of predicted classes with the Series of actual classes. The result is a Series in which `True` means the predicted class is the same as the actual class, and `False` means that they are different:

In [None]:
[classify_single_case(icmaTrainingData_df,
                      icmaTargetValues_ss,
                      i,
                      5)
 for i in icmaTrainingData_df.index] == icmaTargetValues_ss

After converting the result to a list, we can use the list's `count` method to find the total number of correct predictions:

In [None]:
list([classify_single_case(icmaTrainingData_df,
                           icmaTargetValues_ss,
                           i,
                           5)
      for i in icmaTrainingData_df.index] == icmaTargetValues_ss).count(True)
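
The count of correct predictions can be converted into the error rate from Step 5 of the algorithm. The following minimal sketch uses a few made-up labels rather than the wholesale customers data, and the names `actual_ss`, `predicted` and `matches_ss` are hypothetical. It also shows that summing the Boolean Series gives the same count as `list(...).count(True)`, because `True` counts as 1 and `False` as 0.

```python
import pandas as pd

# Made-up labels for illustration only
actual_ss = pd.Series(['A', 'B', 'B', 'A'])
predicted = ['A', 'B', 'A', 'A']

# Element-wise comparison gives a Series of True/False values
matches_ss = predicted == actual_ss

n = len(actual_ss)

# Two equivalent ways of counting the correct predictions
n_correct_by_count = list(matches_ss).count(True)
n_correct_by_sum = int(matches_ss.sum())

# Step 5: error rate = f/n, where f is the number of misclassifications
error_rate = (n - n_correct_by_sum) / n

print(n_correct_by_count, n_correct_by_sum, error_rate)
```

In this example three of the four predictions match, so the error rate is 1/4 = 0.25.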

To find the optimum value of *k*, we want the value that gives a correct prediction most often (equivalently, the value with the lowest error rate). To determine this value, we just carry out the above calculation for a range of values of *k*. Here, we will try values from 1 to 7:

In [None]:
for k in range(1, 8):
    print('{}\t{}'.format(k,
                          list([classify_single_case(icmaTrainingData_df,
                                                     icmaTargetValues_ss,
                                                     i,
                                                     k)
                                for i in icmaTrainingData_df.index
                               ] == icmaTargetValues_ss
                              ).count(True)))
    

The highest counts of correct predictions are for *k*=3 and *k*=4, so when building a classifier for unseen data, we would probably use one of these values (following Step 7, the smaller of the two if they tie).
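
scikit-learn also provides a built-in `LeaveOneOut` splitter, which can serve as a cross-check on a hand-written loop. The following is a minimal sketch on made-up toy data (not the wholesale customers dataset), and it is an alternative to, not a replacement for, the approach used in this Notebook. The names `toy_X`, `toy_y` and `accuracy_by_k` are hypothetical.

```python
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Made-up toy data for illustration only: two well-separated groups
toy_X = [[0, 0], [0, 1], [1, 0], [1, 1],
         [5, 5], [5, 6], [6, 5], [6, 6]]
toy_y = ['A', 'A', 'A', 'A', 'B', 'B', 'B', 'B']

accuracy_by_k = {}
for k in (1, 3, 5, 7):
    clf = KNeighborsClassifier(n_neighbors=k,
                               metric='euclidean',
                               weights='uniform')
    # One score per held-out point: 1.0 if classified correctly, else 0.0
    scores = cross_val_score(clf, toy_X, toy_y, cv=LeaveOneOut())
    accuracy_by_k[k] = scores.mean()

for k, accuracy in accuracy_by_k.items():
    print('{}\t{:.2f}'.format(k, accuracy))
```

On this toy data, *k*=1, 3 and 5 classify every held-out point correctly, whereas *k*=7 misclassifies all of them, because the seven neighbours always include four points from the other group. The accuracy reported here is 1 minus the error rate *f*/*n* from Step 5.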

## What next?
If you are working through this Notebook as part of an inline exercise, return to the module materials now.

If you are working through this set of Notebooks as a whole, you've completed the Part 20 Notebooks.
