# Kelleher 2015, Chapter 5, Exercise 3

In this exercise, we're predicting the level of corruption (**continuous variable**) in a country based on macroeconomic and social features.

The data are available here: http://bit.ly/kelleher2015-ch5-ex3

In [8]:
import numpy as np
import pandas as pd
from sklearn import neighbors

# Read in the training data AND the new data
input_file = "ch5ex3.csv"
df = pd.read_csv(input_file)

# Extract and process the new data (Russia), which is stored in the last row
new_data = df.tail(1)
new_data = new_data.drop("CPI", axis=1)
# Use the country name as the row label so only numeric features remain
new_data = new_data.set_index("Country")
df = df.drop(df.tail(1).index)

# Get the training data ready to go
target_colname = "CPI"
X = df.drop(target_colname, axis=1)
y = df[target_colname]

# Use the country name as the row label so only numeric features remain
X = X.set_index("Country")

## 3a) k=3, Euclidean Distance

In [9]:
# Configure the algorithm
k = 3
metric = "euclidean"

# Fit the model
clf_3a = neighbors.KNeighborsRegressor(n_neighbors=k, metric=metric)
clf_3a.fit(X, y)

# What's the predicted CPI for Russia?
clf_3a.predict(new_data)

array([ 4.58913333])

So we predict a CPI of approximately 4.5891 for Russia, which is the unweighted average of the CPIs of its $k=3$ nearest neighbors.
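To make the averaging step concrete, here is a minimal NumPy sketch of unweighted kNN regression. The function name `knn_regress` and the toy values are made up for illustration; this is not the CPI data, and it is not how scikit-learn implements the search internally.

```python
import numpy as np

def knn_regress(X_train, y_train, query, k):
    """Predict the target for `query` as the mean target of its k nearest
    training rows under Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    y_train = np.asarray(y_train, dtype=float)
    query = np.asarray(query, dtype=float)
    # Euclidean distance from the query to every training row
    distances = np.sqrt(((X_train - query) ** 2).sum(axis=1))
    # Indices of the k smallest distances
    nearest = np.argsort(distances)[:k]
    # Unweighted average of the neighbors' targets
    return y_train[nearest].mean()

# Toy values, made up for illustration only
X_toy = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0], [10.0, 10.0]]
y_toy = [1.0, 2.0, 3.0, 10.0]
prediction = knn_regress(X_toy, y_toy, query=[2.0, 2.0], k=3)
print(prediction)  # mean of the targets 1.0, 2.0 and 3.0
```

The far-away fourth row never enters the average, which is the whole point of restricting the prediction to the $k$ nearest neighbors.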

## 3b) k = 16, Euclidean distance, $w_i = \frac{1}{d_i^2}$

In [11]:
# Configure the algorithm
k = 16
metric = "euclidean"

# Fit the model
clf_3b = neighbors.KNeighborsRegressor(n_neighbors=k, metric=metric, weights="distance")
clf_3b.fit(X, y)
clf_3b.predict(new_data)

array([ 6.09378991])

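The weighted kNN prediction model predicts a CPI of approximately 6.0938 for Russia. Note that scikit-learn's `weights="distance"` weights each neighbor by $\frac{1}{d_i}$, not by the $\frac{1}{d_i^2}$ given in the heading, so this figure reflects inverse-distance weighting.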
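For the $\frac{1}{d_i^2}$ weighting named in the heading, one option is to pass a callable as `weights`, since the built-in `"distance"` option uses $\frac{1}{d_i}$. The sketch below uses a hypothetical helper `inverse_squared` and made-up toy data (not the CPI data), and checks the result against a hand computation of the weighted average.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def inverse_squared(distances):
    """Weight each neighbor by 1 / d^2.

    Assumes no neighbor sits at distance 0 from the query; a zero
    distance would cause a division by zero here.
    """
    return 1.0 / distances ** 2

# Toy values, made up for illustration only
X_toy = np.array([[0.0], [1.0], [3.0]])
y_toy = np.array([10.0, 20.0, 40.0])
query = np.array([[2.0]])

model = KNeighborsRegressor(n_neighbors=3, metric="euclidean",
                            weights=inverse_squared)
model.fit(X_toy, y_toy)
pred = model.predict(query)[0]

# Hand computation: sum(w_i * y_i) / sum(w_i) with w_i = 1 / d_i^2
d = np.abs(X_toy[:, 0] - query[0, 0])
w = 1.0 / d ** 2
manual = np.sum(w * y_toy) / np.sum(w)
print(pred, manual)
```

Swapping `weights="distance"` for a callable like this in the 3b cell would change the predicted CPI, so the value reported above should not be read as the inverse-squared result.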

## 3c) k = 3, Euclidean distance, Normalized data

As we learned in the section, the scale of each variable matters a great deal in distance-based methods. A variable with a naturally larger scale can dominate a Euclidean distance calculation; the example on page 205 does (in my opinion) a great job of illustrating this.

To get all of our variables on the same scale, we can normalize them using **range normalization**. After deciding the range of values we want each variable to span (here we'll use low=0, high=1), we normalize as follows:

$$a_i' = low + \frac{a_i - \min(a)}{\max(a) - \min(a)} \times (high - low)$$
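As a quick sanity check on the formula, here is a small NumPy sketch that applies it column by column. The helper name `range_normalize` and the toy matrix are made up for illustration, and the sketch assumes every column has at least two distinct values (otherwise the denominator is zero).

```python
import numpy as np

def range_normalize(a, low=0.0, high=1.0):
    """Range-normalize each column of `a` into [low, high]."""
    a = np.asarray(a, dtype=float)
    a_min = a.min(axis=0)
    a_max = a.max(axis=0)
    return low + (a - a_min) / (a_max - a_min) * (high - low)

# Toy values, made up: column 0 is small-scale, column 1 is large-scale
raw = np.array([[1.0, 1000.0],
                [2.0, 3000.0],
                [3.0, 2000.0]])
scaled = range_normalize(raw)

# Distance between the first two rows before and after normalization
raw_dist = np.linalg.norm(raw[0] - raw[1])
scaled_dist = np.linalg.norm(scaled[0] - scaled[1])
print(scaled)
print(raw_dist, scaled_dist)
```

Before normalization the distance between the first two rows is driven almost entirely by the second column; afterwards both columns contribute on a comparable scale.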

In [6]:
# Normalize the data
from sklearn.preprocessing import MinMaxScaler
mm_scaler = MinMaxScaler()

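# Learn the min and max of each feature from the training data only
X_scaled = pd.DataFrame(mm_scaler.fit_transform(X), columns=X.columns, index=X.index)

# Apply the same scaling to the new observation
new_data_scaled = pd.DataFrame(mm_scaler.transform(new_data), columns=new_data.columns, index=new_data.index)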

# Configure the algorithm
k = 3
metric = "euclidean"

# Fit the model on the normalized training data
clf_3c = neighbors.KNeighborsRegressor(n_neighbors=k, metric=metric)
clf_3c.fit(X_scaled, y)

# What's the predicted CPI for Russia, using its normalized features?
clf_3c.predict(new_data_scaled)

The interesting thing to note here is that every feature now spans the same 0-to-1 range in the training data, so no single variable dominates the Euclidean distance, and the $k=3$ nearest neighbors are chosen on the basis of all the features together.

## 2d) k = 3, Manhattan distance

In [8]:
# Configure the algorithm
k = 3
metric = "manhattan"

# Fit the model
clf_2d = neighbors.KNeighborsClassifier(n_neighbors=k, metric=metric)
clf_2d.fit(X, y)
clf_2d.predict(new_data)

array([False], dtype=bool)

## 2e) k=3, Cosine Similarity

In [9]:
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(X, new_data)

array([[ 0.        ],
       [ 0.53033009],
       [ 0.28867513],
       [ 0.4330127 ],
       [ 0.8660254 ]])

So observations 2, 4, and 5 are most similar in terms of cosines. The majority of these have label "Ham," so that's the prediction we make with $k=3$.
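The "rank by cosine similarity, then take a majority vote" step can be sketched in a few lines of NumPy. The helper names and the toy vectors and labels below are made up for illustration (they are not the data behind the output above), and the sketch assumes no vector is all zeros, since the cosine is undefined in that case.

```python
import numpy as np
from collections import Counter

def cosine_sim(a, b):
    """Cosine of the angle between two non-zero vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def knn_vote_cosine(X_train, labels, query, k):
    """Return the majority label among the k training rows most similar to `query`."""
    sims = np.array([cosine_sim(row, query) for row in X_train])
    # Larger cosine means more similar, so sort in descending order
    top_k = np.argsort(sims)[::-1][:k]
    votes = Counter(labels[i] for i in top_k)
    return votes.most_common(1)[0][0], sims

# Toy values, made up for illustration only
X_toy = [[1, 0, 0], [1, 1, 0], [0, 1, 1], [1, 1, 1]]
labels_toy = ["ham", "ham", "spam", "spam"]
predicted, sims = knn_vote_cosine(X_toy, labels_toy, query=[1, 1, 0], k=3)
print(predicted, sims)
```

Note that with a similarity measure the neighbors are the rows with the largest values, which is the opposite of the ordering used with a distance.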