# Scikit Learn and the K-nearest Neighbor Algorithm

In this notebook we'll introduce the `sklearn` package and a few important concepts in machine learning:

* Splitting data into test, train, and validation sets.
* Fitting models to a dataset.
* And using "Hyperparameters" to tune models. 

Lets revisit the example we saw in the first class:

In [7]:
import pandas as pd
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split

# Load the data
heart_dataset = pd.read_csv('../data/heart-disease.csv')
heart_dataset.head()

Unnamed: 0,age,sex,cp,trestbps,chol,fbs,restecg,thalach,exang,oldpeak,slope,ca,thal,target
0,63,1,3,145,233,1,0,150,0,2.3,0,0,1,1
1,37,1,2,130,250,0,1,187,0,3.5,0,0,2,1
2,41,0,1,130,204,0,0,172,0,1.4,2,0,2,1
3,56,1,1,120,236,0,1,178,0,0.8,2,0,2,1
4,57,0,0,120,354,0,1,163,1,0.6,2,0,2,1


In [8]:
# Split the data into input and labels
labels = heart_dataset['target']
input_data = heart_dataset.drop(columns=['target'])

# Split the data into training and test
training_data, validation_data, training_labels, validation_labels = train_test_split(
    input_data, 
    labels, 
    test_size=0.20
)

# Build the model
model = KNeighborsClassifier()
model.fit(training_data, training_labels)

# See how it did.
print("Test accuracy: ", model.score(test_data, test_labels))

Test accuracy:  0.7540983606557377


# SKLearn's API

Scikit learn has a wonderfully unified API that always follows this pattern: 

* Create a model from a class.
    * This is where you set the "hyperparameters" of the model.
* Call that model's `.fit` method using the training data to train the model.
* Call that model's `.score` method to evaluate the model against the validation/test data.

For example:

In [13]:
# Lets build multiple models using a few different "hyperparameters"
model_one = KNeighborsClassifier()
model_two = KNeighborsClassifier(weights='distance')
model_three = KNeighborsClassifier(n_neighbors=1, weights='distance')

for i, model in enumerate([model_one, model_two, model_three]):
    model.fit(training_data, training_labels)
    print(f' {i+1} validation accuracy: ', model.score(validation_data, validation_labels))

 1 validation accuracy:  0.5901639344262295
 2 validation accuracy:  0.6065573770491803
 3 validation accuracy:  0.5737704918032787


# The K-Nearest Neighbor's Model

So what is the actual difference between these three models? How does KNN actually work?

KNN is a relatively straightforward model. When you want to make a prediction with KNN you simply compare the item you're making a prediction about to the training dataset using a distance function and based on the class of the "nearest" neighbors the model makes a prediction.

K is how many neighbors to look at, if k is 5 the model looks at the 5 nearest neighbors and whichever class is most common among those 5 neighbors is the one selected. Lets look at some pictures from the pre-reading (https://towardsdatascience.com/laymans-introduction-to-knn-c793ed392bc2):

![](https://miro.medium.com/max/552/1*6YK2xQ4wxBGGrCaegT9JfA.png)

![](https://miro.medium.com/max/552/1*z-y9I2aHAGj4GtMI5cR1OA.png)

![](https://miro.medium.com/max/552/1*7tSKxmXPca1IlgjRHtwOGg.png)

![](https://miro.medium.com/max/552/1*_EYdoVX941aZXa5BH6XnHQ.png)

These examples are all in 2-dimensional space, but the algorithm generalizes to n-dimensions (based on the number of features in our training data). 

K is controlled in `sklearn` by the `n_neighbors` parameter. 

Another hyperparameter in KNN is the `weights` parameter, which has 3 possible values, from the docs (https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html):

* ‘uniform’ : uniform weights. All points in each neighborhood are weighted equally.
* ‘distance’ : weight points by the inverse of their distance. in this case, closer neighbors of a query point will have a greater influence than neighbors which are further away.
* [callable] : a user-defined function which accepts an array of distances, and returns an array of the same shape containing the weights.

Similarly, the distance metric can be provided:

> metric: str or callable, default=’minkowski’

> the distance metric to use for the tree. The default metric is minkowski, and with p=2 is equivalent to the standard Euclidean metric. See the documentation of DistanceMetric for a list of available metrics. If metric is “precomputed”, X is assumed to be a distance matrix and must be square during fit. X may be a sparse graph, in which case only “nonzero” elements may be considered neighbors.
