# Classification-1

---

**intro to classification with nearest neighbors**

In contrast to regression, classification is used when we need to predict if an observation falls into a discrete bin (yes or no, tree or flower). In this notebook we'll go over a fundamental classification algorithm - K Nearest Neighbors - by creating it from scratch before implementing it with sklearn, tuning parameters, and generating some visualizations.

In practice, i.e. outside of these notebooks, the most important thing you can do as a problem-solving-person is know your data intimately. Without understanding what you're working with on a fundamental level, you'll only be able to derive valuable insights if you get lucky. Because I love metaphor: building a model without understanding your data is like building a bridge out of silver and gold. It might look pretty, but those solid gold I-beams will buckle as soon as you sneeze in their direction. 

So, before we get into the guts of this one, we'll take a "step 0."

**Contents:**

0. Data Exploration
1. KNN from scratch
2. sklearn, metrics, visualizations

---

### Data Exploration

In [None]:
# standard imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [None]:
# reading in some data, iris is a classic dataset
# you can copy the link (remove /iris.data) to view the source

cols = ['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'class']

df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                 names = cols)

df.head()  # you probably noticed I use .head(), you may prefer .sample(n)

In [None]:
# use "class" to make binary variables so we can visualize the distribution

# dummy variables are a way to code categorical variables for computation,
# if we were using "class" to build a model we would hold-out one of the
# levels to avoid perfect collinearity

class_dummies = pd.get_dummies(df['class']) 

class_dummies.sample(5)

In [None]:
# now concatenate on columns and describe()
# without the dummies, we wouldn't be able to show any metrics for "class"

df = pd.concat([df, class_dummies], axis=1)

df.describe().round(3)  # df.round(n) is helpful for interpretability

In [None]:
# our classes are evenly distributed (50 each); recode 'class' as a numeric variable for later

df['class'] = [0 if c == 'Iris-setosa' 
               else 1 if c == 'Iris-versicolor'  # it's possible to do elif in a list comprehension
               else 2  # no if for the last else
               for c in df['class']]

In [None]:
# I'm not the best at mental math or holding all of these numbers in my head
# so I prefer to have a visualization to point to, rather than recalling which
# variables had outliers and what the boundaries are

# let's make box plots of the four measurement variables

fig, axs = plt.subplots(2, 2, figsize=(10, 10))

axs[0, 0].boxplot(df['sepal_length'])
axs[0, 0].set_title('sepal length')

axs[0, 1].boxplot(df['sepal_width'])
axs[0, 1].set_title('sepal width')

axs[1, 0].boxplot(df['petal_length'])
axs[1, 0].set_title('petal length')

axs[1, 1].boxplot(df['petal_width'])
axs[1, 1].set_title('petal width')

plt.show()

In [None]:
# there are a few points in sepal width that might be outliers, but nothing egregious.
# it has the smallest box relative to its full range, so the default whiskers
# (1.5 * IQR past the box) flag more points; a looser cutoff would likely keep them

# we can make two logical plots, sepal size (l*w) and petal size (l*w)
# that will give us a clearer visual if there are any points that really stand out

colors = ['g' if c == 0  # assigning a color vector based on "class"
          else 'c' if c == 1  
          else 'm'  
          for c in df['class']]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(20, 10))  # two horizontal subplots

# sepal width v length
ax1.scatter(df['sepal_width'], df['sepal_length'], color=colors)
ax1.set_title('sepal size', fontsize=16)
ax1.set_xlabel('sepal width', fontsize=16)
ax1.set_ylabel('sepal length', fontsize=16)
ax1.grid()

# petal width v length
ax2.scatter(df['petal_width'], df['petal_length'], color=colors)
ax2.set_title('petal size', fontsize=16)
ax2.set_xlabel('petal width', fontsize=16)
ax2.set_ylabel('petal length', fontsize=16)
ax2.grid()

plt.show()

Deviating from the practical approach here: solving this problem with the petal data would be trivial, and if this were a "real-world" problem we would be extremely excited to see such a clearly good option. Since the sepal data is messier and more realistic, we'll use that later on. But first...

---

### KNN from scratch

Nearest neighbors is a logically intuitive algorithm even if you don't go through it step by step. It's simple pattern recognition that mimics the idea that things near each other are like each other. For example: plants in a desert are like other plants found in a desert, and not like plants in a rainforest.

The K in K Nearest Neighbors can be any natural number that you want. The ideal K for a dataset will depend on the number of observations and how they're spread out across the dimensions of interest. Generally speaking, a lower K leads to a more sensitive and "wild looking" decision boundary while a higher K leads to a smoother, less tolerant boundary (unless you go too high, where one class swallows the rest).

Digging into K, we may want to look at how we use those nearest neighbors to make a prediction about our observed point. A simple majority is the most basic version, and the one we're about to create, but stepping from there into other metrics (like taking a mean) is actually quite trivial.
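
To make the two options concrete, here is a minimal sketch that turns one set of neighbor targets into a prediction both ways. The `neighbor_targets` values are made up for illustration; with a 0/1 target, the mean is just the share of neighbors in class 1.

```python
from collections import Counter

import numpy as np

# hypothetical target values of the K = 5 nearest neighbors of one point
neighbor_targets = np.array([1, 0, 1, 1, 0])

# simple majority: the most common target among the neighbors wins
majority = Counter(neighbor_targets.tolist()).most_common(1)[0][0]

# mean: average the neighbors' targets instead of counting votes
mean_value = neighbor_targets.mean()

print(majority)    # 1
print(mean_value)  # 0.6
```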

**The logical steps for KNN:**
1. choose a K
2. for point $i_n$, calculate the distance to all other observations
3. sort, and keep the K closest observations
4. use the target values of the observations to predict $y_{i_n}$
5. repeat 2-4 for all $i_n$

Note: in step 2, KNN may start to break down or lose useful interpretability at higher dimensions due to "the curse of dimensionality." To imagine the curse, think of two points on a line about 1 inch apart, then make the line a plane where they're also 1 inch apart on the second axis. Now they're $\sqrt{1^2 + 1^2} \approx 1.41$ inches apart, at 4 dimensions they're $\sqrt{1^2 + 1^2 + 1^2 + 1^2} = 2$ inches apart, and so on until so many dimensions are added that even a large number of random points will all be very far apart.
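
The same arithmetic can be checked quickly in code. This is only a sketch of the toy example (two points 1 unit apart on every axis), not a general statement about real datasets; the dimension counts in the loop are arbitrary picks.

```python
import numpy as np

dists = {}

# two points that differ by exactly 1 unit along every axis
for d in [1, 2, 4, 9, 100]:
    a = np.zeros(d)
    b = np.ones(d)
    dists[d] = float(np.linalg.norm(a - b))  # works out to sqrt(d)

for d, dist in dists.items():
    print(d, round(dist, 3))
```

The distance grows with the square root of the number of dimensions, even though the points never moved apart on any single axis.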

In [None]:
# step 1: assume K = 2, that was easy

# step 2: calculate all distances between a point and every other point

# say we have the following 7 points. We know 3 are in the blue group and 3 are in the red;
# the green one we're unsure about:

ex_points = [np.array((1, 1)),
             np.array((2, 2)),
             np.array((3, 2)),
             np.array((2, 5)),
             np.array((4, 4)),
             np.array((4, 5)),
             np.array((5, 4))]

colors = ['b']*3 + ['g'] + ['r']*3

# now plot
plt.figure(figsize=(8, 8))
plt.scatter([p[0] for p in ex_points],
            [p[1] for p in ex_points],
            c=colors,
            s=48)

plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)

plt.show()

In [None]:
# we can loop through all of the points, finding distances between all others

distance_matrix = []

for point in ex_points:
    point_dists = []  # these will be our rows
    for x in ex_points:  
        point_dists.append(np.linalg.norm(point - x))  # the norm of the difference of two vectors is their distance
    distance_matrix.append(point_dists)
    
dist_df = pd.DataFrame(data=distance_matrix)  # matrix to df
print(dist_df.shape)  # .shape is a tuple of the dataframe's (rows, columns)
dist_df

In [None]:
# now we can go through the distances and find the shortest ones
# note: each point is closest to itself, but a point can't be its own neighbor

nearest_neighbor = []

for idx in range(dist_df.shape[0]):  # iterate through row indices
    row = dist_df.iloc[idx]  # locate index
    row = row.drop(idx)  # drop self
    nearest_neighbor.append(row.idxmin())  # get the index of the minimum distance
    
nearest_neighbor  # verify against our above df or plot

In [None]:
# time to expand to nearest two neighbors

nearest_two = []

for idx in range(dist_df.shape[0]):
    row = dist_df.iloc[idx].drop(idx)  # combine two steps to clean up
    row = row.sort_values()  # sort to get more than just the min index
    near_idx = row.index[:2]  # row.index holds the row labels (a pandas Index), we want the first 2
    nearest_two.append([i for i in near_idx])  # we can pull out the indices with a list comprehension
    
nearest_two  # again it's fairly trivial to scroll up and verify

In [None]:
# visually, we can check what class our green point is in
# but let's create our own little classifier

classes = []

for n in nearest_two:  # each list within the list
    c = []
    for idx in n:  # each number is an index
        c.append(colors[idx])  # here colors are 'classes'
    if len(set(c)) > 1:  # mix of 'r' and 'b'
        classes.append('g')
    else:  # only 'r' or 'b'
        classes.append(c[0])  
        
# copy paste and plot
plt.figure(figsize=(8, 8))
plt.scatter([p[0] for p in ex_points],
            [p[1] for p in ex_points],
            c=classes,  # only change
            s=48)

plt.xlabel('x', fontsize=16)
plt.ylabel('y', fontsize=16)

plt.show()  # voila

**Therefore**...

we should classify our 'uncertain' point as red, since its two closest neighbors are the points at indices 5 and 4 (both red).
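
For reference, the loops above can be condensed into one small function. This is a sketch rather than a drop-in replacement: `knn_predict` is a hypothetical helper name, it predicts for a single query point against the labeled points only, and on a tied vote it returns the label seen first among the nearest neighbors instead of marking the point green.

```python
from collections import Counter

import numpy as np


def knn_predict(points, labels, query, k=2):
    """Predict a label for `query` by majority vote of its k nearest points."""
    points = np.asarray(points, dtype=float)
    # step 2: Euclidean distance from the query to every labeled point
    dists = np.linalg.norm(points - np.asarray(query, dtype=float), axis=1)
    # step 3: indices of the k smallest distances
    nearest = np.argsort(dists, kind="stable")[:k]
    # step 4: majority vote over the neighbors' labels
    votes = Counter(labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# the six labeled points from the example, with the green point held out
labeled = [(1, 1), (2, 2), (3, 2), (4, 4), (4, 5), (5, 4)]
labels = ['b', 'b', 'b', 'r', 'r', 'r']

prediction = knn_predict(labeled, labels, query=(2, 5), k=2)
print(prediction)  # r
```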

---

### sklearn, metrics, visualizations

Now that we have a solid understanding of what's going on under the hood, it's time to pull out sklearn and bring our flower data back into the mix.

In keeping with a practical lens, we'll need to train-test split our data before we train a model.

**Important: Do not use testing data to fit/train a model.** 

In order to determine which K is going to be the best for predicting species, we need some kind of metric to make a decision against. To keep things simple, we'll use a confusion matrix and our best judgement. 

Steps:

1. train-test split (80-20 is typical)
2. with train, generate models for some K's
3. use metrics & visualizations to determine best K
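
Before leaning on a confusion matrix, it helps to see how one is read. The sketch below builds one by hand on made-up labels, using the same layout sklearn's `confusion_matrix` uses: rows are the true class, columns are the predicted class. The `y_true_demo` and `y_pred_demo` names and values are hypothetical.

```python
import numpy as np

# hypothetical true and predicted classes for six observations
y_true_demo = np.array([0, 0, 1, 1, 2, 2])
y_pred_demo = np.array([0, 0, 1, 2, 2, 1])

n_classes = 3
cm_demo = np.zeros((n_classes, n_classes), dtype=int)
for t, p in zip(y_true_demo, y_pred_demo):
    cm_demo[t, p] += 1  # row = true class, column = predicted class

print(cm_demo)
# [[2 0 0]
#  [0 1 1]
#  [0 1 1]]

# the diagonal counts correct predictions, so accuracy is trace / total
accuracy_demo = np.trace(cm_demo) / cm_demo.sum()
print(round(accuracy_demo, 3))  # 0.667
```

Everything off the diagonal is a mistake: here one true class 1 was predicted as class 2, and one true class 2 was predicted as class 1.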

In [None]:
# KNeighborsClassifier is what we need right now; KNN can also be used to solve regression problems

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import train_test_split  # import for splitting

from sklearn.metrics import confusion_matrix
from mlxtend.plotting import plot_confusion_matrix  # current best option for plotting a confusion matrix

seed = 42  # for convenience

In [None]:
# recall that the sepal data is messier, so we'll use that for this exercise

X = df[['sepal_length', 'sepal_width']]  # matrix of features

y = df['class']  # vector of targets


# you can hit shift+tab on "train_test_split" for more info, but the high-level
# explanation is that we're randomly & reproducibly splitting up our data

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size = 0.2,
                                                    random_state = seed)

len(X_train), len(X_test)

In [None]:
# initialize KNN with different K's

knn_2 = KNeighborsClassifier(n_neighbors=2)
knn_3 = KNeighborsClassifier(n_neighbors=3)
knn_5 = KNeighborsClassifier(n_neighbors=5)
knn_10 = KNeighborsClassifier(n_neighbors=10)
knn_15 = KNeighborsClassifier(n_neighbors=15)

In [None]:
# let's run through knn_2

# start by fitting

knn_2.fit(X_train, y_train);

In [None]:
# now a confusion matrix to see how well the model fits
# create a function so we can run this for the other knn versions as well

def plot_cm(y_true, y_pred):
    '''
    plot a confusion matrix from true labels and a fitted model's predictions
    '''
    cm = confusion_matrix(y_true, y_pred)
    fig, ax = plot_confusion_matrix(conf_mat=cm)                       
    plt.show()

# plot
plot_cm(y_train, knn_2.predict(X_train))

In [None]:
# we can already get a good sense for how well the model is working,
# but a visualization of the decision boundary will really help us out

from matplotlib.colors import ListedColormap  # allows us to use our own color selection

# plotting function - nice to do this when the plot is reusable and the code is messy

def plot_knn(X, y, model, title):
    
    # use color-blind friendly colors - http://mkweb.bcgsc.ca/colorblind/
    # hex color source - https://htmlcolorcodes.com/color-picker/
    point_color = ListedColormap(['#FF0000', '#0000FF', '#FFFF00'])
    back_color = ListedColormap(['#FF9696', '#9696FF', '#FFFF96'])
    
    # create a background mesh of points to show the decision boundary 
    # - if you copypaste this, make sure to change cushion and step to suit your data
    cushion = 0.5
    x_min, x_max = X.iloc[:, 0].min() - cushion, X.iloc[:, 0].max() + cushion
    y_min, y_max = X.iloc[:, 1].min() - cushion, X.iloc[:, 1].max() + cushion
    
    step = 0.02
    xx, yy = np.meshgrid(np.arange(x_min, x_max, step),
                         np.arange(y_min, y_max, step))
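    # wrap the grid in a DataFrame so its column names match the data the model was fit on
    grid = pd.DataFrame(np.c_[xx.ravel(), yy.ravel()], columns=X.columns)
    Z = model.predict(grid)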
    Z = Z.reshape(xx.shape)
    
    # now plot the background mesh
    plt.figure(figsize=(7, 7))
    plt.pcolormesh(xx, yy, Z, cmap=back_color)

    # and scatterplot the points
    plt.scatter(X.iloc[:, 0], X.iloc[:, 1],
                c=y, cmap=point_color,
                edgecolor='k', linewidths=2, s=36)
    plt.title(title, fontsize=24)

    plt.show()
      

plot_knn(X_train, y_train, knn_2, 'K = 2')

In [None]:
# now we can loop through the other 4 K's, plotting as we go
# notice how the decision boundary changes as K goes up

for model in [knn_3, knn_5, knn_10, knn_15]:
    
    model.fit(X_train, y_train);
    plot_cm(y_train, model.predict(X_train))
    plot_knn(X_train, y_train, model, title='K = {}'.format(model.get_params()['n_neighbors']))

In [None]:
# time to use a model to predict our testing data

# choose "optimal_model" based on the matrices and visuals above

optimal_model = knn_15

plot_cm(y_test, optimal_model.predict(X_test))
plot_knn(X_test, y_test, optimal_model, title='K = {}'.format(optimal_model.get_params()['n_neighbors']))