<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

#  K-Nearest Neighbors with Scikit-Learn

_Authors: Alex Sherman (DC)_

<a id="learning-objectives"></a>
## Learning Objectives

1. Apply a KNN model to the iris dataset
2. Implement a KNN model with scikit-learn
3. Assess the fit of a KNN model using scikit-learn

### Lesson Guide
- [Learning Objectives](#learning-objectives)
- [Overview of the iris dataset](#overview-of-the-iris-dataset)
	- [Terminology](#terminology)
- [Exercise: "Human learning" with iris data](#exercise-human-learning-with-iris-data)
- [Human learning on the iris dataset](#human-learning-on-the-iris-dataset)
- [K-nearest neighbors (KNN) classification](#k-nearest-neighbors-knn-classification)
	- [Using the train/test split procedure (K=1)](#using-the-traintest-split-procedure-k)
- [Tuning a KNN model](#tuning-a-knn-model)
	- [What happens if we view the accuracy of our training data?](#what-happen-if-we-view-the-accuracy-of-our-training-data)
	- [Training error versus testing error](#training-error-versus-testing-error)
- [Standardizing features](#standardizing-features)
	- [Use StandardScaler to standardize our data.](#use-standardscaler-to-standardize-our-data)
- [Comparing KNN with other models](#comparing-knn-with-other-models)


<a id="overview-of-the-iris-dataset"></a>
## Overview of the iris dataset
---

In [None]:
# read the iris data into a DataFrame
import pandas as pd
import numpy as np
data = './assets/dataset/iris.data'
iris = pd.read_csv(data)

In [None]:
iris.head()

<a id="terminology"></a>
### Terminology

- **150 observations** (n=150): each observation is one iris flower
- **4 features** (p=4): sepal length, sepal width, petal length, and petal width
- **Response**: iris species
- **Classification problem** since response is categorical

<a id="exercise-human-learning-with-iris-data"></a>
## Exercise: "Human learning" with iris data

**Question:** Can you predict the species of an iris using petal and sepal measurements?

1. Read the iris data into a Pandas DataFrame, including column names.
2. Gather some basic information about the data.
3. Use sorting, split-apply-combine, and/or visualization to look for differences between species.
4. Write down a set of rules that could be used to predict species based on iris measurements.

**BONUS:** Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

In [None]:
import pandas as pd
import matplotlib.pyplot as plt

# display plots in the notebook
%matplotlib inline

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (8, 6)
plt.rcParams['font.size'] = 14

#### Read the iris data into a pandas DataFrame, including column names.

In [None]:
# define the path from which to read the data (as a string)
path = './assets/dataset/iris.data'

# read the CSV file (the column names are taken from the file's header row)
iris = pd.read_csv(path)

#### Gather some basic information about the data.

In [None]:
# observe the first 30 rows of data
iris.head(30)

In [None]:
iris.shape

In [None]:
iris.dtypes

In [None]:
iris.describe()

In [None]:
iris.species.value_counts()

In [None]:
iris.isnull().sum()

#### Use sorting, split-apply-combine, and/or visualization to look for differences between species.

In [None]:
iris.head()

In [None]:
# sort the DataFrame by petal_width
iris.sort_values(by='petal_width', ascending=True, inplace=True)

In [None]:
iris.head()

In [None]:
# sort the DataFrame by petal_width and display the NumPy array
iris.sort_values(by='petal_width', ascending=True).values[0:5]

#### Split-apply-combine: Explore the data while using a groupby on 'species'

In [None]:
# mean of sepal_length grouped by species
iris.groupby('species').sepal_length.mean()

In [None]:
# mean of all numeric columns grouped by species
iris.groupby('species').mean()

In [None]:
# description of all numeric columns grouped by species
iris.groupby('species').describe()

In [None]:
# box plot of petal_width grouped by species
iris.boxplot(column='petal_width', by='species')

In [None]:
# box plot of all numeric columns grouped by species
iris.boxplot(by='species', rot=45)

In [None]:
# map species to a numeric value so that plots can be colored by species
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

# alternative method
iris['species_num'] = iris.species.factorize()[0]

In [None]:
iris.head()

In [None]:
# scatter plot of petal_length vs petal_width colored by species
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap='brg')

In [None]:
# scatter matrix of all features colored by species
pd.plotting.scatter_matrix(iris.drop('species_num', axis=1), c=iris.species_num, figsize=(12, 10))

#### Write down a set of rules that could be used to predict species based on iris measurements.

In [None]:
# define a new feature that represents petal area ("feature engineering")
# since iris petals are closer to oval than rectangular,
# we use the formula for the area of an ellipse:
# r1 * r2 * pi, where r1 and r2 are half the petal length and half the petal width
iris['petal_area'] = (iris.petal_length / 2) * (iris.petal_width / 2) * np.pi

In [None]:
# description of petal_area grouped by species
iris.groupby('species').petal_area.describe().unstack()

In [None]:
# box plot of petal_area grouped by species
iris.boxplot(column='petal_area', by='species',figsize=(5,8))

In [None]:
# only show irises with a petal_area between 3 and 7
iris[(iris.petal_area > 3) & (iris.petal_area < 7)].sort_values('petal_area')

My set of rules for predicting species:

- If petal_area is less than 2, predict **setosa**.
- Else if petal_area is less than 7.4, predict **versicolor**.
- Otherwise, predict **virginica**.

#### Bonus: Define a function that accepts a row of data and returns a predicted species. Then, use that function to make predictions for all existing rows of data, and check the accuracy of your predictions.

In [None]:
val_a,val_b = ('a','b')
val_a

In [None]:
def predict_flower(df):
    preds = []
    for ind, row in df.iterrows():        
        if row.petal_area < 2:
            prediction = 'Iris-setosa'
        elif row.petal_area < 7.4:
            prediction = 'Iris-versicolor'
        else:
            prediction = 'Iris-virginica'
        preds.append(prediction)
    
    df['prediction'] = preds   
    
    
predict_flower(iris)

In [None]:
iris.head()

In [None]:
# accuracy: fraction of rows where the prediction matches the true species
(iris.species == iris.prediction).mean()

<a id="human-learning-on-the-iris-dataset"></a>
## Human learning on the iris dataset
---

How did we (as humans) predict the species of an iris?

1. We observed that the different species had (somewhat) dissimilar measurements.
2. We focused on features that seemed to correlate with the response.
3. We created a set of rules (using those features) to predict the species of an unknown iris.

We assumed that if an **unknown iris** has measurements similar to **previous irises**, then its species is most likely the same as those previous irises.

In [None]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt

# increase default figure and font sizes for easier viewing
plt.rcParams['figure.figsize'] = (10, 8)
plt.rcParams['font.size'] = 14

# create a custom colormap
from matplotlib.colors import ListedColormap
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

In [None]:
# map each iris species to a number
iris['species_num'] = iris.species.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})

In [None]:
# box plot of all numeric columns grouped by species
iris.drop('species_num', axis=1).boxplot(by='species', rot=45)

In [None]:
# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by SPECIES
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='species_num', colormap=cmap_bold)

In [None]:
iris['pred_num'] = iris.prediction.map({'Iris-setosa':0, 'Iris-versicolor':1, 'Iris-virginica':2})



# create a scatter plot of PETAL LENGTH versus PETAL WIDTH and color by PREDICTION
iris.plot(kind='scatter', x='petal_length', y='petal_width', c='pred_num', colormap=cmap_bold)

---

<a id="k-nearest-neighbors-knn-classification"></a>
## K-nearest neighbors (KNN) classification
---

K-nearest neighbors (KNN) classification is, as its name implies, a classification model that uses the K most similar observations to make a prediction.

KNN is a supervised learning method, so the K most similar observations must have a known target value.

The process of prediction using KNN is fairly straightforward.

1. Pick a value for K.
2. Search for the K observations in the data that are "nearest" to the measurements of the unknown iris.
    - Euclidean distance is often used as the distance metric, but other metrics are allowed.
3. Use the most popular response value from the K "nearest neighbors" as the predicted response value for the unknown iris.
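
The three steps above can be sketched in a few lines of NumPy. This is only an illustration, not how scikit-learn implements KNN: the function name `knn_predict` and the toy measurements are made up for this example.

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x_new, k=3):
    """Predict the label of x_new by majority vote among its k nearest neighbors."""
    X_train = np.asarray(X_train, dtype=float)
    x_new = np.asarray(x_new, dtype=float)

    # step 2: Euclidean distance from x_new to every training observation
    distances = np.sqrt(((X_train - x_new) ** 2).sum(axis=1))

    # indices of the k smallest distances
    nearest = np.argsort(distances)[:k]

    # step 3: most common label among the k nearest neighbors
    votes = Counter(y_train[i] for i in nearest)
    return votes.most_common(1)[0][0]

# toy data (made up for illustration): [petal_length, petal_width]
X_toy = [[1.4, 0.2], [1.5, 0.2], [4.5, 1.5], [4.7, 1.4], [6.0, 2.5]]
y_toy = ['setosa', 'setosa', 'versicolor', 'versicolor', 'virginica']

print(knn_predict(X_toy, y_toy, [1.6, 0.3], k=3))
print(knn_predict(X_toy, y_toy, [4.6, 1.5], k=3))
```

Ties in the vote are broken here by whichever label was counted first; a real implementation may handle ties differently.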

The visualizations below show how the prediction for a given region can change as K changes. Colored points represent true values, and colored areas represent the prediction space: if an unknown point fell in a region, its predicted value would be the color of that region.

<a id="knn-classification-map-for-iris-k"></a>
### KNN classification map for iris (K=1)

![1NN classification map](./assets/images/iris_01nn_map.png)

### KNN classification map for iris (K=5)

![5NN classification map](./assets/images/iris_05nn_map.png)

### KNN classification map for iris (K=15)

![15NN classification map](./assets/images/iris_15nn_map.png)

<a id="knn-classification-map-for-iris-k"></a>
### KNN classification map for iris (K=50)

![50NN classification map](./assets/images/iris_50nn_map.png)

We can see that as K increases, the borders between classification regions become smoother and more distinct. However, the regions are not perfectly pure: some known points fall inside a region of a different color.
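
Maps like these can be approximated by predicting a class for every point on a grid. The sketch below rests on two assumptions: it uses scikit-learn's bundled copy of iris (`load_iris`) rather than the file read earlier, and it uses only the two petal features, which may not be the features used in the images.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# assumption: scikit-learn's bundled iris data, petal features only
iris_bunch = load_iris()
X2 = iris_bunch.data[:, 2:4]   # petal length, petal width
y2 = iris_bunch.target

# build a grid that covers the feature space
xx, yy = np.meshgrid(
    np.linspace(X2[:, 0].min() - 1, X2[:, 0].max() + 1, 100),
    np.linspace(X2[:, 1].min() - 1, X2[:, 1].max() + 1, 100),
)
grid = np.c_[xx.ravel(), yy.ravel()]

region_maps = {}
for k in [1, 5, 15, 50]:
    knn_map = KNeighborsClassifier(n_neighbors=k).fit(X2, y2)
    # predicted class for every grid point, reshaped back to the grid
    region_maps[k] = knn_map.predict(grid).reshape(xx.shape)
    # training accuracy: how "pure" the regions are for the known points
    print(k, round(knn_map.score(X2, y2), 3))
```

Passing `xx`, `yy`, and `region_maps[k]` to `plt.contourf` would draw the colored regions for a given K.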

**Question:** What's the "best" value for K in this case?

**Answer:** ...

In [None]:
# read the NBA data into a DataFrame
import pandas as pd
path = './assets/dataset/NBA_players_2015.csv'
nba = pd.read_csv(path, index_col=0)

In [None]:
# map positions to numbers
nba['pos_num'] = nba.pos.map({'C':0, 'F':1, 'G':2})

In [None]:
# create feature matrix (X)
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]

In [None]:
# create response vector (y)
y = nba.pos_num

<a id="using-the-traintest-split-procedure-k"></a>
### Using the train/test split procedure (K=1)

In [None]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn import metrics

#### STEP 1: split X and y into training and testing sets (using random_state for reproducibility)

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### STEP 2: train the model on the training set (using K=1)

In [None]:
knn = KNeighborsClassifier(n_neighbors=1)
knn.fit(X_train, y_train)

#### STEP 3: test the model on the testing set, and check the accuracy

In [None]:
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

#### Repeating for K=50

In [None]:
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
print(metrics.accuracy_score(y_test, y_pred_class))

#### Comparing testing accuracy with null accuracy

Null accuracy is the accuracy that could be achieved by **always predicting the most frequent class**. It is a benchmark against which you may want to measure your classification model.

#### Examine the class distribution

In [None]:
y_test.value_counts()

#### Compute null accuracy

In [None]:
y_test.value_counts().head(1) / len(y_test)
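
As a cross-check, the same kind of baseline can be produced with scikit-learn's `DummyClassifier`. The sketch below uses made-up toy labels (`y_toy`, `X_toy`) rather than the NBA data, so the numbers are only illustrative.

```python
import pandas as pd
from sklearn.dummy import DummyClassifier

# toy labels (made up for illustration): class 1 is the most frequent
y_toy = pd.Series([1, 1, 1, 2, 2, 0])
X_toy = [[0]] * len(y_toy)   # features are ignored by the 'most_frequent' strategy

# manual null accuracy: share of the most frequent class
manual_null = y_toy.value_counts().iloc[0] / len(y_toy)

# a baseline classifier that always predicts the most frequent class it saw in fit()
dummy = DummyClassifier(strategy='most_frequent').fit(X_toy, y_toy)
dummy_null = dummy.score(X_toy, y_toy)

print(manual_null, dummy_null)
```

In the notebook's setting, fitting the dummy on `X_train, y_train` and scoring it on `X_test, y_test` should give a comparable baseline, provided the most frequent class is the same in both sets.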

<a id="tuning-a-knn-model"></a>
## Tuning a KNN model
---

In [None]:
# instantiate the model (using the value K=5)
knn = KNeighborsClassifier(n_neighbors=5)

# fit the model with data
knn.fit(X, y)

# store the predicted response values
y_pred_class = knn.predict(X)

**Question:** Which model produced the correct predictions for the two unknown irises?

**Answer:** ...

**Question:** Does that mean that we have to guess how well our models are likely to do?

**Answer:** ...

In [None]:
# calculate predicted probabilities of class membership
knn.predict_proba(X)

<a id="what-happen-if-we-view-the-accuracy-of-our-training-data"></a>
### What happens if we view the accuracy of our training data?

In [None]:
scores = []
for k in range(1,100):
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X,y)
    pred = knn.predict(X)
    score = float(sum(pred == y)) / len(y)
    scores.append([k, score])

In [None]:
data = pd.DataFrame(scores,columns=['k','score'])
data.plot.line(x='k',y='score')   

#### Searching for the "best" value of K

In [None]:
# calculate TRAINING ERROR and TESTING ERROR for K=1 through 100

k_range = range(1, 101)
training_error = []
testing_error = []

for k in k_range:

    # instantiate the model with the current K value
    knn = KNeighborsClassifier(n_neighbors=k)

    # fit the model on the training set
    knn.fit(X_train, y_train)

    # calculate training error (predict on the same data the model was fit on)
    y_pred_class = knn.predict(X_train)
    training_accuracy = metrics.accuracy_score(y_train, y_pred_class)
    training_error.append(1 - training_accuracy)

    # calculate testing error
    y_pred_class = knn.predict(X_test)
    testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
    testing_error.append(1 - testing_accuracy)

In [None]:
# allow plots to appear in the notebook
%matplotlib inline
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')

In [None]:
# create a DataFrame of K, training error, and testing error
column_dict = {'K': k_range, 'training error':training_error, 'testing error':testing_error}
df = pd.DataFrame(column_dict).set_index('K').sort_index(ascending=False)
df.head()

In [None]:
# plot the relationship between K (HIGH TO LOW) and TESTING ERROR
df.plot(y='testing error')
plt.xlabel('Value of K for KNN')
plt.ylabel('Error (lower is better)')

In [None]:
# find the minimum testing error and the associated K value
df.sort_values('testing error').head()

In [None]:
# alternative method
min(zip(testing_error, k_range))

<a id="training-error-versus-testing-error"></a>
### Training error versus testing error

In [None]:
# plot the relationship between K (HIGH TO LOW) and both TRAINING ERROR and TESTING ERROR
df.plot()
plt.xlabel('Value of K for KNN')
plt.ylabel('Error (lower is better)')

- **Training error** decreases as model complexity increases (lower value of K)
- **Testing error** is minimized at the optimum model complexity

#### Making predictions on out-of-sample data

Given the statistics of a (truly) unknown player, how do we predict his position?

In [None]:
import numpy as np
# instantiate the model with the best known parameters
knn = KNeighborsClassifier(n_neighbors=14)

# re-train the model with X and y (not X_train and y_train) - why?
knn.fit(X, y)

# make a prediction for an out-of-sample observation
knn.predict(np.array([2, 1, 0, 1, 2]).reshape(1, -1))

What could we conclude?

- When using KNN on this dataset with these features, the **best value for K** is likely to be around 14.
- Given the statistics of an **unknown player**, we estimate that we would be able to correctly predict his position about 74% of the time.

<a id="standardizing-features"></a>
## Standardizing features
---

One major issue applies to many machine learning models: they are sensitive to feature scale.

This means that it matters whether our features are centered around zero and have similar variance to each other.

In the case of KNN on the iris data set, imagine we measure sepal length in kilometers but sepal width in millimeters. Our data will show variation in sepal width, but almost no variation in sepal length, so the distances between flowers will be determined almost entirely by sepal width.
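
To make this concrete, here is a small sketch with two made-up flowers. The helper `length_share` is hypothetical and simply reports what fraction of the squared Euclidean distance comes from sepal length.

```python
import numpy as np

# two made-up flowers: [sepal_length, sepal_width], both in centimeters
a_cm = np.array([5.0, 3.0])
b_cm = np.array([7.0, 3.1])

# the same flowers with length in kilometers and width in millimeters
a_mixed = np.array([a_cm[0] / 100000, a_cm[1] * 10])
b_mixed = np.array([b_cm[0] / 100000, b_cm[1] * 10])

def length_share(u, v):
    """Fraction of the squared Euclidean distance contributed by the first feature."""
    sq = (u - v) ** 2
    return sq[0] / sq.sum()

print(length_share(a_cm, b_cm))        # length dominates when both are in cm
print(length_share(a_mixed, b_mixed))  # length all but vanishes in km vs mm
```

The flowers are identical in both cases; only the units changed, yet the feature that drives the distance flips.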

Unfortunately, KNN cannot automatically adjust to this. Other models can struggle with scale as well, including linear regression once you get into more advanced methods such as regularization.

Fortunately, this is an easy fix.

<a id="use-standardscaler-to-standardize-our-data"></a>
### Use StandardScaler to standardize our data.

StandardScaler standardizes our data by subtracting the mean from each feature and dividing by its standard deviation.
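
A quick way to see what this means is to standardize a small matrix by hand and compare the result with `StandardScaler`. The matrix `X_demo` below is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# made-up feature matrix: two features on very different scales
X_demo = np.array([[1.0, 100.0],
                   [2.0, 300.0],
                   [3.0, 500.0]])

# manual standardization: subtract each column's mean, divide by its standard deviation
# (StandardScaler uses the population standard deviation, i.e. ddof=0, like np.std's default)
X_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

X_scaled = StandardScaler().fit_transform(X_demo)

print(np.allclose(X_manual, X_scaled))
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so neither feature dominates a distance calculation because of its units.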

#### Separate feature matrix and response for sklearn.

In [None]:
# create feature matrix (X)
feature_cols = ['ast', 'stl', 'blk', 'tov', 'pf']
X = nba[feature_cols]
# create response vector (y)
y = nba.pos_num

#### Create train-test-split.

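Notice that we create the train/test split first. If we standardized the full data set before splitting, the scaler's mean and standard deviation would be computed partly from the test rows, which would leak information about our testing data into training.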

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=99)

#### Instantiate and fit standard scaler.

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# fit the scaler on the training data only, then apply the same transformation to the test data
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

#### Fit a KNN model and look at the testing error.
Can you find a number of neighbors that improves our results from before?

In [None]:
# calculate testing error
knn = KNeighborsClassifier(n_neighbors=11)
knn.fit(X_train, y_train)
y_pred_class = knn.predict(X_test)
testing_accuracy = metrics.accuracy_score(y_test, y_pred_class)
testing_error = 1 - testing_accuracy
print(testing_error)
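
One way to make sure the scaler only ever learns from training data is to bundle it with the model in a scikit-learn pipeline. This is a sketch under assumptions: scikit-learn's bundled iris data stands in for the notebook's data, and the `*_demo` / `Xd_*` names are made up for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# assumption: scikit-learn's bundled iris data stands in for the notebook's data
X_demo, y_demo = load_iris(return_X_y=True)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(X_demo, y_demo, random_state=99)

# fit() scales with statistics from the training data only;
# score() reuses those training means and standard deviations on the test data
pipe = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))
pipe.fit(Xd_train, yd_train)

print(1 - pipe.score(Xd_test, yd_test))   # testing error
```

Because scaling and prediction travel together, the same `pipe` object can be refit for each candidate K without repeating the scaling steps by hand.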

<a id="comparing-knn-with-other-models"></a>
## Comparing KNN with other models
---

**Advantages of KNN:**

- Simple to understand and explain
- Model training is fast
- Can be used for classification and regression
- Being a non-parametric method, it is often successful in classification situations where the decision boundary is very irregular

**Disadvantages of KNN:**

- Must store all of the training data
- Prediction phase can be slow when n is large
- Sensitive to irrelevant features
- Sensitive to the scale of the data
- Accuracy is (generally) not competitive with the best supervised learning methods