# Lab 1 - KNN

**Due: September 19th at 11:59pm**
 
### Learning Goals:
This first lab is intended to introduce several important machine learning concepts, libraries, and algorithms.  As with any programming assignment, you'll also be practicing and improving your general CS skills, like problem decomposition, algorithmic thinking, implementation and testing, language syntax, etc.  Here are some of the specific things you should learn while completing this assignment:

* How to begin analysing a dataset to understand the real-world problem it represents
* How to load and use the SciKit Learn library to import data and train and evaluate ML models
* How to implement the $k$-Nearest Neighbor algorithm

The assignment is presented in the form of an interactive Python notebook; it contains a mix of pre-made examples to help you understand how to do things, scaffolding with missing parts where you'll write your own code, and written short-answer questions.  Your job is to **fill in answers to the written questions** (which may require you to modify scaffolding code and re-run it, or may require you to write your own code and run it), as well as **write the indicated functions** to complete the programming portion of the assignment.  Look for the <span style="color:red">TODO: ...</span> markers to help you spot places you'll need to complete things.

Please be sure to ***remove* the TODO markers** as you complete each aspect (i.e. don't leave a TODO that's for something you've actually done, that's poor style because it's confusing to anyone reading your code/documentation/etc.)

This lab is intended to be done with a partner (i.e. teams of two), though it can be done solo; partners will be assigned based on the form you filled out (or arbitrarily if you didn't fill out the form). 

You'll have approximately **one week** to complete this lab, so be sure to start early and plan properly to get it all done in a timely fashion.  It's recommended that you start at the top and work your way down (i.e. the later parts are more difficult than the earlier ones). 

# Part 1: Playing with Libraries

Modify this markdown cell to add the answers to the following short-answer questions.  In doing so, consult the code cells below, which show an example of using SciKit Learn to load and classify a simple data set.  Note that you may need to make minor modifications to the example code for some answers (e.g. changing the number of neighbors used, running on the train vs test set).  **NOTE:** remember that to edit a markdown cell, you need to double-click on it, then change the text, and finally 'run' the cell to get the pretty formatted version.

1. What is the full name of the data set used here?

    Iris plants dataset (_iris_dataset)

2. What sort of real-world user might be interested in a system that could successfully solve this classification problem?

    A botanist, biologist, or gardener looking to classify irises for personal interest or research purposes might be interested in a system that could solve this classification problem.

3. What are the stakes for this problem?  In other words, who might be hurt if the system makes a mistake?  How bad are the consequences?

    This depends on what the classification is used for. If a gardener were using this to find out what kinds of irises are growing in their garden, the stakes would be relatively low; misclassifying an iris in this context would spread only minor misinformation. If a biologist were using this for publications or pharmaceutical research, the stakes would be higher, since a mistake could result in flawed research or medical accidents.

4. How many _features_ (**not** including the class label) does each example in the data set have?

    Four: sepal length, sepal width, petal length, petal width. 

5. How many _examples_ does the data set contain?

    150

6. What are the available class labels? Give both the encoding in the data set (i.e. the raw value) and the human-readable label associated with each value.

    Iris-Setosa(0), Iris-Versicolour(1), Iris-Virginica(2)

7. When run with a 60/40 train/test split, what is the accuracy of a Nearest Neighbor classifier (i.e. $k=1$) on the _testing set_?
    
   0.9166666666666666

8. Accuracy of 3-nearest neighbor?

    0.9333333333333333

9. Accuracy of 5-nearest neighbor?

    0.95

10. Accuracy of 20-nearest neighbor?

    0.9166666666666666

11. Accuracy of 40-nearest neighbor?

    0.8666666666666667

12. Accuracy of 80-nearest neighbor?

    0.6

13. What is the accuracy of nearest neighbor ($k=1$) on the _training set_ (*not* the testing set)?

    100%

14. Is there a difference between the accuracy on the _training set_ and accuracy on the _testing set_?  Explain why the observed behavior occurs.
    
    Yes. With $k=1$, the accuracy on the _training set_ is always 100% (as long as no two identical examples carry different labels), while the accuracy on the _testing set_ may be lower. When the model is evaluated on its own training set, the nearest neighbor of each query point is the point itself, since it is stored as one of the training points. The algorithm therefore returns each point's own label, giving an overall prediction accuracy of 100%, which says nothing about how well the classifier performs on a novel query point. The _testing set_, on the other hand, contains query points the model has not seen, so there is no guarantee of accuracy, due to noise in the data, etc. 
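
The self-match effect described in answer 14 can be illustrated without any library. The following is a minimal sketch on made-up toy data; the points, the labels, and the helper name `predict_1nn` are illustrative assumptions and are not part of the lab scaffolding. It scores a 1-nearest-neighbor rule on the same points it memorized.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

# Toy training data, invented for illustration only.
train = [(1.0, 1.0), (1.2, 0.9), (5.0, 5.1), (4.8, 5.3)]
labels = [0, 0, 1, 1]

def predict_1nn(query):
    """Return the label of the stored point closest to `query`."""
    best = min(range(len(train)), key=lambda i: dist(train[i], query))
    return labels[best]

# Scoring on the training set: each query is at distance 0 from itself,
# so it is always matched with its own label.
train_hits = sum(predict_1nn(p) == y for p, y in zip(train, labels))
train_accuracy = train_hits / len(train)
print(train_accuracy)
```

Because every training point finds itself first, this score stays perfect no matter how noisy the labels are, which is why it cannot be used to judge performance on unseen data.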

In [1]:
# Import the libraries we'll need...
import numpy as np     # the data we'll be loading is in the form of numpy arrays 
from sklearn import datasets  # here we grab a module from the SciKit-Learn library that contains example data sets
from sklearn import neighbors # this module contains an implementation of a KNN classifier
from sklearn.model_selection import train_test_split  # and here we get one method from the model_selection module

In [2]:
# once we've imported the 'datasets' module, we can use it to load some data...
iris = datasets.load_iris()

In [4]:
# let's check out the description:
print(iris.DESCR)

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica

    :Summary Statistics:

                    Min  Max   Mean    SD   Class Correlation
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)

In [5]:
# let's see what it looks like
iris.data.shape

(150, 4)

In [6]:
iris.target.shape

(150,)

**Note:** this next cell produces a very long output! If you don't want to look at it, you can click the blue bar to the left of a cell to collapse it (i.e. hide it from view until you click the blue bar again to expand it).  After examining the output of this cell, it's recommended that you collapse it so you don't have to scroll past the whole thing repeatedly.  You may want to do this with some other cells as well (e.g. the data description from above may similarly be annoying to scroll past, so you may want to hide it when you're not using it).

In [None]:
iris.data

In [None]:
iris.target

In [None]:
# note that we can index into our data set to get a single example:
iris.data[1]

In [None]:
# now let's train a nearest neighbor classifier...
# note that we can use the parameter 'n_neighbors' to control the "K" in our KNN
classifierA = neighbors.KNeighborsClassifier(metric='euclidean', n_neighbors=1)
classifierA.fit(iris.data, iris.target) # this step trains classifierA with the data and targets

In [None]:
# ...then test it out to see how it performs (the result is the percentage of examples it gets right)
classifierA.score(iris.data, iris.target)

In [None]:
# ...hmm, that seems too good to be true; probably because we used the same data for training and testing! 
# Let's try splitting our data into two parts so we can get a more realistic idea of how this would work.
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=0)

print("training shape:", X_train.shape, ", testing shape:", X_test.shape)

In [None]:
# now we can re-train our classifier using just the "training" part of the data:
# note: you'll need to modify this cell in different ways to answer questions 8-12
classifierB = neighbors.KNeighborsClassifier(metric='euclidean', n_neighbors=1)
classifierB.fit(X_train, y_train)
print("accuracy: ", str(classifierB.score(X_test, y_test)))

accuracy:  0.9166666666666666

***
Note that SciKit learn has quite good documentation, so if you'd like to know more information about how to use these methods (e.g. what parameters are available, what do they do, etc.) you can look them up fairly easily.  Generally, a google search for `sklearn <myMethod>` will give you a link to the docs within the first couple of hits.

Note also that the sklearn documentation is broken into two parts, the [Users Guide](https://scikit-learn.org/stable/user_guide.html) and the [API](https://scikit-learn.org/stable/modules/classes.html) Reference; the former is written more like a textbook, and contains long-form explanations and examples, while the latter is a more typical API reference with complete and detailed explanations of the interface for each class and method.  The two are also cross-linked; pretty much every page in the API has a link to the relevant Users Guide page, and each mention of a class or method in the Users Guide is a link to the API.

For instance, here's the user's guide page on KNNs: https://scikit-learn.org/stable/modules/neighbors.html

And here's the API page for KNeighborsClassifier: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

You'll typically want to refer to _both_ of them to get a full understanding of how to use a given module from the library.
***

# Part 2: Nearest Neighbor from Scratch

This part of this assignment asks you to implement your own Nearest Neighbor classifier.  You’ve already got scaffolding (above) that shows how to use the Nearest Neighbor classifier from the toolkit. Your job is to write your own nearest neighbor classifier, which _should_ give the same answer as the one from Scikit Learn.  Here's a recommended development plan:

1. Implement the `distance()` function to compute the distance between two examples, then test it to be sure it works correctly
1. Implement and test the `nearestNeighbor()` function to find the nearest example to a single novel query
1. Implement and test the `testNearestNeighborClassifier()` function to apply nearestNeighbor to each example in a labeled test set and compute the accuracy of your nearest neighbor classifier
1. Extend your nearest neighbor classifier to a full $k$-nearest neighbor classifier
1. Finally, go back and refactor and/or optimize your code as necessary


For this course, we're not worried about low-level optimization (i.e. the kind of stuff you might do in C or C++), but we _are_ concerned about big-O efficiency.  Correctness is far more important than efficiency, but efficiency shouldn't be completely ignored either.  For instance, if your implementation is incorrect (i.e. does not work) you'll lose a lot of points.  If your implementation gives the correct result but it's $O(n)$ in space when it should be $O(1)$, or $O(n^2)$ time when it should be $O(n)$, you'll lose a few points.

As a result, you should always start out by implementing an algorithm in the simplest way possible, even if it's less efficient, since you're less likely to make mistakes when writing a simple solution as compared to a complex one.  Then, once you've got a working baseline, you can try to improve it and make sure the behavior doesn't change (i.e. you can tell easily if your "optimization" ends up breaking something).

For this assignment, detailed descriptions of each aspect of the development plan are given in the cells below, along with recommended prototypes for the functions described.

***
### Distance

* The `distance()` function should take two inputs (you can assume each will be an indexable collection of numbers, e.g. a list or numpy array), and return the Euclidean distance between them.  

* Recall that the Euclidean distance (the $L^2$-norm of the difference) between two $d$-dimensional vectors $\vec{a}$ and $\vec{b}$ is calculated as: $$\| \vec{a} - \vec{b} \|_2 = \sqrt{\sum_{i=1}^d (a_i - b_i)^2}$$

* Remember that the $\sum$ in math notation translates to an accumulator-pattern for-loop in code, and that the sub-script indexing in math notation ($x_i$) translates to array-style indexing (`x[i]`).

* Note that this function should be able to handle input vectors of any length (e.g. it should handle distances in any Cartesian space, regardless of how many dimensions it has, so long as the two inputs are the same length as each other); don't just hard-code it to handle the particular properties of the data set we happen to be using for testing.

* Note also that while the mathematical notation goes from 1 to $d$, you'll likely want your code to include a loop from 0 to $(d-1)$.  This is because mathematicians tend to index starting at 1, but computer scientists tend to index starting at 0.  This sort of thing is a frequent source of off-by-one errors when converting from math notation to code notation, so keep an eye out for it going forward.


In [None]:
# note that the lack of types in Python means argument types can be confusing;
# in this case, 'a' and 'b' should be arrays, and the function should compute Euclidean distance between them,
# which should be returned as a scalar

from math import sqrt  # you may want to use this function for calculating Euclidean distance;
                       # alternatively, you can try raising something to the power of 0.5

def distance(a, b):    
    summation = 0
    for i in range(len(a)):
        summation += (a[i] - b[i])**2
        
    dist = sqrt(summation)
    
    return dist
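
One hedged way to sanity-check a hand-written distance function is to compare it against Python's built-in `math.dist` (available in Python 3.8 and later) on a few pairs whose answers are easy to verify by hand. The sketch below uses its own stand-in loop, named `euclidean_loop` purely for illustration, so that it runs on its own; the test pairs are made up.

```python
from math import dist, isclose, sqrt

def euclidean_loop(a, b):
    """Accumulator-pattern Euclidean distance (illustrative stand-in)."""
    total = 0.0
    for i in range(len(a)):
        total += (a[i] - b[i]) ** 2
    return sqrt(total)

# Made-up pairs in 2, 4, and 1 dimensions.
pairs = [([0, 0], [3, 4]), ([1, 2, 3, 4], [1, 2, 5, 6]), ([2.5], [-1.5])]
for a, b in pairs:
    ours = euclidean_loop(a, b)
    reference = dist(a, b)
    # isclose tolerates tiny floating-point differences between the two
    assert isclose(ours, reference)
    print(a, b, ours)
```

Comparing with `isclose` rather than `==` avoids false alarms from floating-point rounding.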

#### Testing Distance

- Test the distance function thoroughly to be sure it works as intended.  

- Be sure to do some basic error-checking 
    - remember that Python is dynamically typed, so there is no compile-time type checker to catch these errors for you 
    - e.g. in this case if the two inputs aren't both one-dimensional lists/arrays of numbers, or don't have the same number of elements, that's a problem the user should be alerted to right away, but Python won't automatically catch it
    - having your code check for and report errors like this will save you headaches later on when debugging code that calls this function.
    - make sure that whatever error messages get produced are **helpful and informative** so that they actually make later debugging easier; just having the program crash with no info isn't very helpful (though it's still better than continuing to run and silently propagating errors that pop up elsewhere).

    - remember that there are several ways of indicating an error to the user; simply printing an obvious error message and returning a 'dummy' value is one way.  An alternative is throwing an exception.
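
The two error-reporting styles mentioned in the last bullet can be contrasted in a short sketch. The function names `distance_or_none` and `distance_or_raise` and the private helper `_euclidean` are hypothetical, chosen only for this illustration; they only check for a length mismatch, not for every problem listed above.

```python
from math import sqrt

def _euclidean(a, b):
    """Plain Euclidean distance, assuming the inputs were already checked."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def distance_or_none(a, b):
    """Style 1: print a message and return a dummy value (None)."""
    if len(a) != len(b):
        print(f"distance error: lengths differ ({len(a)} vs {len(b)})")
        return None
    return _euclidean(a, b)

def distance_or_raise(a, b):
    """Style 2: raise an exception that the caller has to handle."""
    if len(a) != len(b):
        raise ValueError(f"inputs must have equal length, got {len(a)} and {len(b)}")
    return _euclidean(a, b)

# Style 1 keeps running; the caller must remember to check for None.
result = distance_or_none([0, 1, 3], [9, 8])
print(result)

# Style 2 stops at the point of failure unless the caller catches it.
try:
    distance_or_raise([0, 1, 3], [9, 8])
except ValueError as err:
    print(f"caught: {err}")
```

A dummy return value can flow into later arithmetic unnoticed, whereas an exception forces the problem to be dealt with where it occurs; which one fits best is a design choice.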

Here are a few examples of simple tests to get you started:

In [None]:
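def checkError(a, b):
    # both inputs must be the same kind of collection (e.g. two lists or two numpy arrays)
    if type(a) != type(b):
        raise TypeError(f"a and b must be of the same type, got {type(a).__name__} and {type(b).__name__}")
    # a length mismatch is a problem with the values rather than the types, so raise ValueError
    if len(a) != len(b):
        raise ValueError(f"a and b must be of the same length, got {len(a)} and {len(b)}")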

In [None]:
# check this result by hand to make sure your distance function works; 
# then check it using some other point pairs to be sure
a = [0,0]
b = [3,2]
print(f"a = {a}")
print(f"b = {b}")
checkError(a, b)
d = distance(a, b)
print(f"calculated distance between (a, b) = {d}")

In [None]:
# here's another test using the first two points from the training data;
# again, check the result by hand to make sure your function works correctly
a = X_train[0]
b = X_train[1]
print(f"a = {a}")
print(f"b = {b}")
checkError(a, b)
d = distance(a, b)
print(f"distance between (a, b) = {d}")

In [None]:
# some tests to see what happens if we call the function on different length inputs
# NOTE that other than the first example, the rest of these SHOULD produce some sort of
# noticeable error message, since the distance function is only well defined for
# inputs of the same length
for (a, b) in [ ([0, 1], [2, 3]), (1, [2, 3]), ([1, 2], "hello"), ([1, 2], [[1,2], [2, 3], [3, 4]]) ]:
    print(f"for a: {a}, b: {b}")
    try:
        checkError(a, b)
        d = distance(a, b)
        print(f"distance between (a, b) = {d}")
    except Exception as e:
        print(f"!!Exception: {e}")
    print("-----")

In [None]:
for (a, b) in [([0, 0], [0,2]), ([1,2,3,4], [1,2,5,6]), ([0, 1, 3], [9, 8]) ]:
    print(f"for a: {a}, b: {b}")
    try:
        checkError(a, b)
        d = distance(a, b)
        print(f"distance between (a, b) = {d}")
    except Exception as e:
        print(f"!!Exception: {e}")
    print("-----")

### Nearest Neighbor
- The `nearestNeighbor()` function should take in a set of training examples (both features and labels), as well as a novel query point, and return the label of the point in the training set closest to the novel query point.

- Be sure your function returns the *label* of the winning point, not the distance to that point or the coordinates of the point itself

- You should make use of the `distance()` function you wrote above

- You may also create additional helper functions if you want

- You _may_ use library functions if you want, so long as they aren't shortcuts around the learning process (i.e.  don't make assumptions that trivialize the assignment)

- On the other hand, you shouldn't _need_ to use any library functions, and properly written the function is pretty short even without them (my reference implementation is 8 lines)

- Note that there are several ways to solve this problem, but not all are equally computationally efficient.  It can be done in $O(1)$ space (discounting the size of the inputs) and $O(n)$ time.  
  - if you use helper functions or library functions, also keep in mind their efficiency; e.g. if you use Python's built-in `sort()` function, your runtime will be **at best** $O(n \log(n))$ regardless of what you put in your own function
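
The efficiency point in the last bullets can be made concrete with a small comparison. This is a sketch under stated assumptions: the function names `nearest_label_scan` and `nearest_label_sorted` and the toy data are invented for illustration, and `math.dist` (Python 3.8+) stands in for a hand-written distance function.

```python
from math import dist  # Python 3.8+

def nearest_label_scan(train, train_labels, query):
    """Single pass over the data: O(n) time, O(1) extra space."""
    best_index = min(range(len(train)), key=lambda i: dist(train[i], query))
    return train_labels[best_index]

def nearest_label_sorted(train, train_labels, query):
    """Sorts all indices first: O(n log n) time, O(n) extra space."""
    order = sorted(range(len(train)), key=lambda i: dist(train[i], query))
    return train_labels[order[0]]

# Toy data, invented for illustration only.
train = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
train_labels = ["a", "b", "c"]
query = [4.0, 4.5]

print(nearest_label_scan(train, train_labels, query))
print(nearest_label_sorted(train, train_labels, query))
```

Both versions return the same label; the sorted version simply does more work than the task requires, because only the single smallest distance is needed.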

In [None]:
# this function should return the training label associated with the training
# example that is closest to the query point

def nearestNeighbor(train, trainLabels, query):
    # train = feature vectors of the training examples
    # trainLabels = class label (iris species) of each training example
    # query = feature vector of the flower we're trying to classify
    
    # find label of the neighbor that has the smallest distance to query
    minDist = distance(train[0], query)
    label = trainLabels[0]
    
    for i in range(1, len(train)):
        currDist = distance(train[i], query)
        if currDist < minDist:
            minDist = currDist
            label = trainLabels[i]
    
    return label

#### Tests for Nearest Neighbor
- Again, be sure to test this function and ensure it works correctly before moving on

In [None]:
# test code for nearestNeighbor
# the data points and their labels that we would use to train our model
train = ([15.1, 3.5, 1.4, 0.2],
       [20.9, 3.0, 1.4, 0.2],
       [4.7, 13.2, 1.3, 0.2],
       [300.6, 9.32, 1.5, 0.2],
       [7.0, 1.2, 1.9, 0.2],
       [0.4, 3.9, 1.7, 0.4])

trainLabels = ([1, 0, 3, 4, 2, 5])

# since each query point is designed to be nearest to exactly one training point, we expect the labels 1, 0, 3, 4, 2, 5 (in that order)
queryList = ([15, 3.5, 1.4, 0.2],
       [21, 3.0, 1.4, 0.2],
       [4.8, 13.1, 1.3, 0.2],
       [300, 9.32, 1.5, 0.2],
       [7.0, 1.2, 1.8, 0.1],
       [0.4, 4, 1.7, 0.4])

for vector in queryList:
    print(f"nearest neighbor of {vector} is: " + str(nearestNeighbor(train, trainLabels, vector)))

***
### Evaluating a Nearest Neighbor classifier
- Define the `testNearestNeighborClassifier()` function to take 4 inputs: training data, training labels, testing data, and testing labels. 

- Fill in the body of this function so that it loops over the examples in the test set, and for each one performs a nearest-neighbor classification. 

- The function should print out the overall accuracy (i.e. the number of times the predicted label matched the true label for the testing examples, divided by the total number of testing examples). 

- This function should make use of the helper functions you wrote in the previous steps.

- Again, note that you may create additional helper functions if you wish

In [None]:
# this function should print the average accuracy of a nearest neighbor classifier 
#      applied to each of the test points
def testNearestNeighborClassifier(train, trainLabels, test, testLabels):
    # predict the label of each test query point
    predictedLabels = []
    for query in test:
        predictedLabel = nearestNeighbor(train, trainLabels, query)
        predictedLabels.append(predictedLabel)
    
    # calculate accuracy
    count = 0
    for i in range(len(testLabels)):
        if predictedLabels[i] == testLabels[i]:
            count += 1
    accuracy = count / len(testLabels)
    
    print("test accuracy: ", str(accuracy))

#### Testing nearest neighbor classifier
- Test this function on the same train/test data you used with the SciKit Learn nearest neighbor classifier; your function should produce the same accuracy as the version from the toolkit (at least to the first few significant figures).
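
Matching "to the first few significant figures" can be checked with a tolerance instead of by eye. The sketch below assumes both accuracies are available as plain numbers (for example, if the evaluation function returned its accuracy instead of only printing it); the helper name `accuracies_match` and the tolerance are illustrative assumptions. The fraction 55/60 is used only because it reproduces the 0.9166666666666666 reported in Part 1.

```python
from math import isclose

def accuracies_match(mine, reference, rel_tol=1e-6):
    """True when two accuracy values agree up to a small relative tolerance."""
    return isclose(mine, reference, rel_tol=rel_tol)

# 55 correct out of 60 reproduces the value reported for k=1 above.
print(accuracies_match(55 / 60, 0.9166666666666666))

# A clearly different accuracy is reported as a mismatch.
print(accuracies_match(0.90, 0.9166666666666666))
```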

In [None]:
# note that this should give you the same value as the library implementation of 
# KNeighborsClassifier with n_neighbors=1
testNearestNeighborClassifier(X_train, y_train, X_test, y_test)

classifierB = neighbors.KNeighborsClassifier(metric='euclidean', n_neighbors=1)
classifierB.fit(X_train, y_train)
print("sklearn accuracy: ", str(classifierB.score(X_test, y_test)))

***
### K Nearest Neighbor classifier

- Now that you've done the basic single-nearest neighbor, implement $k$-nearest neighbors below

- Create the function `testKNearestNeighborClassifier()` to take a number of neighbors as a parameter, in addition to the four parameters you used above. 

- You may want to start by copying your code from above and modifying it

- You'll likely want to create some new helper functions for this as well (e.g. `kNearestNeighbor()` as an analog to the `nearestNeighbor()` you wrote above)

- You can add as many cells and/or helper functions as you like; be sure to test these helper functions too.

- You may also want to look at the documentation for `numpy`, as it has some functions that you might find helpful.  

- Once again, an _efficient_ implementation of `kNearestNeighbor()` (i.e. finding the label for one example) should be linear time in $n$ and constant space (though the constant here will be $k$ times what it was for the earlier version of `nearestNeighbor()`).
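
One possible way to keep the extra space proportional to $k$ is a bounded max-heap that never holds more than $k$ candidates. The sketch below is an illustration under assumptions, not the reference solution: the name `k_nearest_labels` and the toy data are hypothetical, and `math.dist` (Python 3.8+) stands in for a hand-written distance function. It runs in $O(n \log k)$ time, which is linear in $n$ for a fixed $k$.

```python
import heapq
from math import dist  # Python 3.8+

def k_nearest_labels(train, train_labels, query, k):
    """Return the labels of the k closest training points, nearest first."""
    if not 1 <= k <= len(train):
        raise ValueError(f"k must be between 1 and {len(train)}, got {k}")
    # heapq is a min-heap, so distances are negated: heap[0] is then the
    # farthest of the candidates kept so far.
    heap = []
    for i, point in enumerate(train):
        entry = (-dist(point, query), i)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry[0] > heap[0][0]:  # strictly closer than the farthest kept
            heapq.heapreplace(heap, entry)
    ordered = sorted(heap, reverse=True)  # least negative distance first
    return [train_labels[i] for _, i in ordered]

# Toy one-dimensional layout embedded in 2-D, invented for illustration.
train = [[0, 0], [1, 0], [4, 0], [10, 0]]
train_labels = ["a", "b", "c", "d"]
print(k_nearest_labels(train, train_labels, [0.9, 0], 2))
```

With this layout the two nearest points to the query are the second and first training points, in that order.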

In [None]:
import heapq

def kNearestNeighbor1(train, trainLabels, query, k):
    # solution 1: collect every distance, then select the k smallest.
    # time complexity: O(n log k), space complexity: O(n)
    
    # pair up distances with labels, time: O(n), space: O(n)
    pairs = []
    for i in range(len(train)):
        currDist = distance(train[i], query)
        pairs.append((currDist, trainLabels[i]))
    
    # nsmallest maintains its own heap of at most k items, so no heapify call is needed
    kNearest = heapq.nsmallest(k, pairs)
    
    # get labels of neighbors
    nearestLabels = [n[1] for n in kNearest]
    
    return nearestLabels

In [None]:
import numpy as np

def kNearestNeighbor2(train, trainLabels, query, k):
    # solution 2: keep only the k best candidates in a fixed-size array.
    # time complexity: O(n*k) (linear in n for a fixed k), space complexity: O(k)
    
    kNeighbors = np.zeros((k, 2))  # each row holds [distance, label]
    
    # go through the training points and record only the ones with the k smallest distances
    n = 0 # tracks the number of rows filled so far
    for i in range(len(train)):
        currDist = distance(train[i], query)
        # the if/else block ensures only the smallest k distances are kept in the array
        if n < k:
            kNeighbors[n] = [currDist, trainLabels[i]]
            n += 1
        else:
            idxReplace = kNeighbors[:, 0].argmax()  # row of the farthest neighbor kept so far
            if currDist < kNeighbors[idxReplace, 0]:
                kNeighbors[idxReplace] = [currDist, trainLabels[i]]

    # keep only the filled rows, and slice out the second column (the labels)
    nearestLabels = kNeighbors[:n, 1].astype('int64')
    return nearestLabels

In [None]:
def testKNN():
    # the training points and their labels
    train = ([300.1, 3.5, 1.4, 0.2], # 0
       [300.9, 3.5, 1.4, 0.2], # 1
       [4.7, 13.2, 7000, 0.2], # 2
       [300.6, 9.32, 1.5, 0.2], # 3
       [.0, 1.2, 7000, 0.2], # 4
       [0.4, 3.9, 7000, 0.4], # 5 
       [0.4, 3.9, 7100, 0.4]) # 6 

    trainLabels = ([1, 6, 3, 4, 5, 10, 2])
    
    # as the query points are designed to be significantly closer to some training points than to the others, the labels of their nearest neighbors are listed below
    queryList = ([300, 3.5, 1.4, 0.2],   # closest to #0, #1, #3 (not listed by distance)
           [20.9, 3.0, 7000, 0.2])   # closest to  #2, #4, #5, #6 (not listed by distance)

    for vector in queryList:
        for k in range(1, 4):
            print(f"1. labels of the {k} nearest neighbor of {vector} are " + str(kNearestNeighbor1(train, trainLabels, vector, k)))
            print(f"2. labels of the {k} nearest neighbor of {vector} are " + str(kNearestNeighbor2(train, trainLabels, vector, k)))
        print("-----")

testKNN()

In [None]:
# this function tests k nearest neighbor classifier

import numpy as np

def testKNearestNeighborClassifier(train, trainLabels, test, testLabels, k):
    predictedLabels = []
    for query in test:
        neighbors = kNearestNeighbor2(train, trainLabels, query, k) # get the labels of neighbors
        np_neighbors = np.array(neighbors)
        nLabel = np.bincount(np_neighbors).argmax() # get the plurality vote from the labels
        predictedLabels.append(nLabel) # append the vote to a list for the accuracy check later
    
    # count how many of our predicted labels match the actual labels and calculate the accuracy
    count = 0
    for i in range(len(predictedLabels)):
        if predictedLabels[i] == testLabels[i]:
            count += 1
    
    acc = count / len(testLabels)
    print("test accuracy: ", str(acc))

#### Testing K nearest neighbor

- Again, you should be able to test this function by running it with different numbers of neighbors, and you should  get the same results as you would with SciKit Learn, modulo possible tie-breaking issues 
    - (i.e. for some values of $k$ your code may be 'correct' and still return a different value than sklearn if you use a different method to resolve voting ties than the library does; this is especially likely for even values of $k$). 

- You may also want to print `X_train`, `y_train`, `X_test`, and `y_test` to look at them and make sure you understand what they contain (and how they are formatted)

- Note that this function should give you the same result for k=1 that your previous nearest neighbor code gives
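
Because tie-breaking is where two correct implementations can disagree, it helps to make the rule explicit. The following sketch shows one deterministic choice, where ties go to the smallest label value; this is the same outcome `np.bincount(...).argmax()` produces for non-negative integer labels. The helper name `plurality_vote` is hypothetical, and whether the library resolves ties the same way is an assumption to verify against its documentation.

```python
from collections import Counter

def plurality_vote(neighbor_labels):
    """Most common label; ties are broken in favor of the smallest label."""
    counts = Counter(neighbor_labels)
    top = max(counts.values())
    # among the labels that share the top count, pick the smallest one
    return min(label for label, count in counts.items() if count == top)

print(plurality_vote([2, 1, 1, 2]))  # two votes each for 1 and 2: a tie
print(plurality_vote([0, 2, 2]))     # label 2 wins outright
```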

In [None]:
# again, you can test against the library for other values of k, though keep in 
# mind that the library might do tie breaking differently, so you can get different 
# results even with a "correct" algorithm, particularly for even values of K
k_values = [1, 3, 5, 20, 40, 80]
for k in k_values:
    print("when k = ", k) # print k
    # print our algorithm's accuracy
    testKNearestNeighborClassifier(X_train, y_train, X_test, y_test, k)
    
    # print SciKit accuracy
    classifierB = neighbors.KNeighborsClassifier(metric='euclidean', n_neighbors=k)
    classifierB.fit(X_train, y_train)
    trueAccuracy = classifierB.score(X_test, y_test)
    print("sklearn accuracy: ", trueAccuracy)
    
    print("-----")