## k-Nearest Neighbours (k-NN)

The task in this notebook is to predict the number of O-rings experiencing thermal distress, given the temperature and pressure at launch. [More context.](https://en.wikipedia.org/wiki/Space_Shuttle_Challenger_disaster)

We've not properly introduced Pandas yet, but it's the easiest way of getting a data set from a CSV file into a nice form for sklearn.

In [None]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt

In [None]:
challenger_df = pd.read_csv("challenger.csv")
challenger_df

For this example, we're only going to use the temperature and pressure as our two inputs. We're going to try to predict `distress_ct` as our target, which is the number of O-rings experiencing thermal distress.

In [None]:
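# Drop the target and the columns we are not using as inputs;
# per the text above, what remains should be temperature and pressure.
X = challenger_df.drop(['distress_ct', 'o_ring_ct', 'launch_id'], axis=1)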
X = np.array(X)
X

In [None]:
y = challenger_df['distress_ct']
y = np.array(y)
y

Let's now split our data set into a training set and test set. We put 50% of the data in the training set and 50% in the test set.

Note that we're using sklearn's built-in `train_test_split` function to perform a random split. Read its documentation to see exactly what it can do.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)
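
To see what the split does in isolation, here is a minimal sketch on made-up data. The arrays `X_toy` and `y_toy` are hypothetical and stand in for the Challenger data; only `train_test_split` and its `test_size` and `random_state` parameters come from the cell above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 samples with 2 features each (not the Challenger data).
X_toy = np.arange(20).reshape(10, 2)
y_toy = np.arange(10)

# test_size=0.5 puts half of the rows in each set; random_state fixes the shuffle.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)
print(X_tr.shape, X_te.shape)

# Repeating the call with the same random_state reproduces the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(X_toy, y_toy, test_size=0.5, random_state=0)
print(np.array_equal(X_tr, X_tr2))
```

Changing `random_state` (or leaving it out) should give a different random split each time, which is worth bearing in mind when comparing results on a data set this small.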

In [None]:
for a, b in zip(X_train[:10], y_train[:10]):
    print(a, b)

## Fitting the model

In [None]:
model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)
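
For intuition about what the fitted model does at prediction time, the sketch below implements k-NN regression by hand and compares it with sklearn. It assumes sklearn's default settings (uniform weights and Euclidean distance); the helper `knn_predict` and the toy arrays are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Hypothetical toy training data (not the Challenger data).
X_toy = np.array([[0.0], [1.0], [2.0], [10.0], [11.0]])
y_toy = np.array([0.0, 1.0, 2.0, 5.0, 6.0])

def knn_predict(X_train, y_train, x_query, k):
    """Average the targets of the k training points closest to x_query."""
    distances = np.linalg.norm(X_train - x_query, axis=1)  # Euclidean distance to each point
    nearest = np.argsort(distances)[:k]                    # indices of the k closest points
    return y_train[nearest].mean()

x_query = np.array([1.2])
manual = knn_predict(X_toy, y_toy, x_query, k=3)

# Compare against sklearn with its default settings.
sk_model = KNeighborsRegressor(n_neighbors=3).fit(X_toy, y_toy)
sk_pred = sk_model.predict(x_query.reshape(1, -1))[0]
print(manual, sk_pred)
```

Both values should agree here, since the three nearest neighbours of the query are unambiguous. Note that "fitting" a k-NN model mostly amounts to storing the training data; the real work happens when predicting.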

## Model evaluation

The nature of the target is such that standard metrics such as MSE and $R^2$ are less insightful than simply looking at the results. But let's output these metrics anyway, just for practice.

In [None]:
y_pred = model.predict(X_test)
y_pred = np.ceil(y_pred)
y_actual = y_test
mean_squared_error(y_actual, y_pred)
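
As a reminder of what these two metrics measure, here is a small sketch that computes MSE and $R^2$ by hand and checks them against sklearn's `mean_squared_error` and `r2_score`. The arrays `y_true_toy` and `y_pred_toy` are made-up values for illustration only.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical values for illustration only.
y_true_toy = np.array([0.0, 1.0, 0.0, 2.0])
y_pred_toy = np.array([0.0, 1.0, 1.0, 1.0])

# MSE: the mean of the squared residuals.
mse_manual = np.mean((y_true_toy - y_pred_toy) ** 2)

# R^2: 1 minus (residual sum of squares / total sum of squares around the mean).
ss_res = np.sum((y_true_toy - y_pred_toy) ** 2)
ss_tot = np.sum((y_true_toy - y_true_toy.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true_toy, y_pred_toy))
print(r2_manual, r2_score(y_true_toy, y_pred_toy))
```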

**Exercise**: Find out what `np.ceil` does. Why do you think we have used it here?

In [None]:
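# Test R^2. Note: model.score makes its own predictions from X_test,
# so the np.ceil rounding applied to y_pred above is not reflected in this value.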
print(model.score(X_test, y_actual))
plt.scatter(y_pred, y_actual, marker='.')
plt.xlabel('Predicted y')
plt.ylabel('Actual y')
plt.show()

Note that in this case, the scatter plot is not very useful because we're dealing with integer outputs, so points tend to appear on top of each other. One way of dealing with this is to add a small amount of random "jitter" to each point (for visualisation purposes only!).
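
A minimal sketch of the jitter idea, assuming uniform noise of at most 0.1 in each direction. The helper `add_jitter` and the toy arrays are hypothetical; in the notebook you would pass `y_pred` and `y_actual` instead.

```python
import numpy as np
import matplotlib.pyplot as plt

def add_jitter(values, scale=0.1, seed=0):
    """Return a copy of values with small uniform noise added, for plotting only."""
    rng = np.random.default_rng(seed)
    return np.asarray(values, dtype=float) + rng.uniform(-scale, scale, size=len(values))

# Hypothetical integer predictions and targets with many overlapping points.
pred_toy = np.array([0, 0, 1, 1, 1, 0, 2, 1])
actual_toy = np.array([0, 0, 0, 1, 1, 0, 2, 1])

# Use different seeds so the x and y offsets are not identical.
plt.scatter(add_jitter(pred_toy, seed=0), add_jitter(actual_toy, seed=1), marker='.')
plt.xlabel('Predicted y (jittered)')
plt.ylabel('Actual y (jittered)')
plt.show()
```

The jittered values should only ever be plotted; any metrics should still be computed from the original arrays.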

However, as the test set is so small it's easier to just print the whole thing instead.

In [None]:
for a, b in zip(y_pred, y_actual):
    print(f"Predicted = {a}, Actual = {b}")

## Exercises

1. Try adjusting $k$ (the number of neighbours) to see how this affects the results.
2. How often do we predict that an O-ring failed when in fact no O-rings failed? These are *false positives*. (Note that a "positive" result here means that an O-ring actually fails!)
3. How often do we predict that no O-ring failed when in fact some O-rings failed? These are *false negatives*.
4. Both of these cases are examples of model mispredictions. In this particular case, which type of misprediction is worse: a high number of false positives or a high number of false negatives?
5. We've only evaluated the performance of the model over the test set. Without changing any code, how would you expect k-NN to perform over the training set? Now try evaluating the performance of the model over the training set to see if you were correct.
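
As a possible starting point for exercises 2 and 3, the sketch below counts false positives and false negatives, treating any distress count above zero as a "positive". The helper `count_false_results` and the toy arrays are hypothetical; substitute `y_pred` and `y_actual` from the notebook.

```python
import numpy as np

def count_false_results(predicted, actual):
    """Count false positives and false negatives, treating any count > 0 as a 'positive'."""
    predicted_pos = np.asarray(predicted) > 0
    actual_pos = np.asarray(actual) > 0
    false_pos = int(np.sum(predicted_pos & ~actual_pos))  # predicted a failure, none occurred
    false_neg = int(np.sum(~predicted_pos & actual_pos))  # predicted no failure, one occurred
    return false_pos, false_neg

# Hypothetical values for illustration only.
pred_toy = np.array([1, 0, 0, 1, 2, 0])
actual_toy = np.array([0, 0, 1, 1, 1, 0])
print(count_false_results(pred_toy, actual_toy))
```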