In [None]:
import mglearn
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from IPython.display import display
from sklearn.model_selection import train_test_split
import scipy
import seaborn as sns

import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

# Exercise

# Getting the data

In [None]:
from sklearn.datasets import fetch_california_housing
data = fetch_california_housing(as_frame=True)

The keys of the object are:

In [None]:
data.keys()

Let's get a bit acquainted with the data:

In [None]:
print(data['DESCR'][:1300])

In this exercise we will stick to the original target variable. Thus, the first task is to divide the data into a training set and a test set. Remember that the latter stands in for our future, unseen examples.

In [None]:
X_train, X_test, y_train, y_test = train_test_split(data['data'], data['target'], random_state=0)
print("Number of data points in training set and test set, respectively: {} and {}".format(X_train.shape[0], 
                                                                                          X_test.shape[0]))

# A first k-NN attempt at a model

In [None]:
from sklearn.neighbors import KNeighborsRegressor
knn = KNeighborsRegressor(n_neighbors=5)

Learn the model, which in this case simply means storing the training data:

In [None]:
knn.fit(X_train, y_train)
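
Since fitting only stores the training data, all the work happens at prediction time. The following is a minimal sketch, on made-up toy data, of what a prediction amounts to with the default settings (uniform weights, Euclidean distance): the mean target of the k nearest training points. The names `X_toy`, `y_toy`, `query` and `model` are hypothetical and not part of the exercise.

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Toy data, made up for illustration only
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(50, 2))
y_toy = rng.normal(size=50)
query = np.array([[0.1, -0.2]])

model = KNeighborsRegressor(n_neighbors=5).fit(X_toy, y_toy)

# Manual prediction: find the 5 closest training points and average their targets
dists = np.linalg.norm(X_toy - query, axis=1)
nearest = np.argsort(dists)[:5]
manual_pred = y_toy[nearest].mean()

sk_pred = model.predict(query)[0]
print(manual_pred, sk_pred)
```

The two printed values should agree, which is why the scale of each feature matters so much: it directly determines which points count as "nearest".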

In [None]:
print("Model score on test set: {}".format(knn.score(X_test, y_test)))

*Note again:* Accuracy is measured by the R^2 coefficient, defined as (1 - u/v), where 
 * u is the residual sum of squares ((y_true - y_pred)^2).sum() 
 * v is the total sum of squares ((y_true - y_true.mean())^2).sum().
 
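 The best possible score is 1, and higher is better. A model that always predicts the mean of y_true scores 0, and a model that does worse than that gets a negative score.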
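
The definition can be checked by hand. Below is a small sketch, using made-up toy values, that computes u and v as described and compares the result with `sklearn.metrics.r2_score`. The arrays `y_true_toy` and `y_pred_toy` are illustrative assumptions, not data from the exercise.

```python
import numpy as np
from sklearn.metrics import r2_score

# Toy values, made up for illustration only
y_true_toy = np.array([3.0, 0.5, 2.0, 7.0])
y_pred_toy = np.array([2.5, 0.0, 2.0, 8.0])

u = ((y_true_toy - y_pred_toy) ** 2).sum()          # residual sum of squares
v = ((y_true_toy - y_true_toy.mean()) ** 2).sum()   # total sum of squares
r2_manual = 1 - u / v

r2_sklearn = r2_score(y_true_toy, y_pred_toy)
print(r2_manual, r2_sklearn)
```

This is the same quantity that `knn.score` reports for a regressor.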

### *Exercise:* 
* Calculate the performance of the model on the training data
* Try adjusting the number of neighbors and see what impact it has on the two scores

# A second k-NN attempt at a model

Let's first take a slightly closer look at our data.

In [None]:
data_df = pd.DataFrame(X_train, columns=data['feature_names'])
data_df.describe()

A box plot can also provide a quick overview:

In [None]:
sns.boxplot(data=data_df, palette="vlag", orient='h',fliersize=1)

From the table and the plot we can see some potential issues with the data. Specifically, the scales of some of the features vary quite a lot. For example, the mean of 'Population' is 1426.28 but for 'AveBedrms' it is 1.096. As we saw on the slides, standard distance measures can have a hard time dealing with this. Thus, we resort to normalization (in this case Z-score): 

$$ \mathit{normalized(F)} = \frac{F-\mathit{mean}(F)}{\mathit{std}(F)}$$

Note that below:
* we only use data from the training set when performing the normalization
* the suffix '_n' added to the variables indicates that the features have been normalized.

In [None]:
X_train_n = (X_train-X_train.mean(axis=0))/X_train.std(axis=0)
X_test_n = (X_test-X_train.mean(axis=0))/X_train.std(axis=0)
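
As an alternative to the manual computation, scikit-learn's `StandardScaler` follows the same pattern of learning the statistics on the training set and reusing them on the test set. The sketch below uses made-up toy frames (`train_toy`, `test_toy`, with hypothetical column names). One detail to be aware of: `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas' `.std()` defaults to the sample standard deviation (`ddof=1`), so the two approaches differ by a small constant factor per feature.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration; the column names are hypothetical
train_toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 60.0]})
test_toy = pd.DataFrame({"a": [2.5, 5.0], "b": [15.0, 40.0]})

scaler = StandardScaler().fit(train_toy)   # statistics come from the training set only
train_s = scaler.transform(train_toy)
test_s = scaler.transform(test_toy)        # the test set reuses the training statistics

# Manual version with ddof=0, matching what StandardScaler computes
manual = (test_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0, ddof=0)
print(np.allclose(test_s, manual.to_numpy()))
```

For a data set of this size the `ddof` difference is negligible, and it rescales all distances along a feature by the same factor.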

The result of the normalization:

In [None]:
sns.boxplot(data=pd.DataFrame(X_train_n, columns=data['feature_names']), palette="vlag", orient='h',fliersize=1)

Now fit the model to the transformed dataset and score the test set

In [None]:
knn.fit(X_train_n, y_train)
print("Model score on Z-score normalized test set: {}".format(knn.score(X_test_n, y_test)))

### *Exercise*: 
* Try making a min-max normalization of the data:
$$ \mathit{normalized(F)} = \frac{F-\mathit{min}(F)}{\mathit{max}(F)-\mathit{min}(F)}$$
* Make a boxplot of the normalized data and compare with the plot obtained from Z-score normalization
* Fit a kNN model to the new data. Is there a difference in score compared to what was achieved using Z-score normalization? Why could that be the case?

# A third kNN model

The data analysis so far has only focused on the individual variables. Let's now look at the interaction between the variables.

In [None]:
sns.pairplot(data_df)

The covariance matrix captures some of the variability in the data. We can see this by plotting it as a heatmap:

In [None]:
train_cov = np.cov(X_train_n, rowvar=False)
sns.heatmap(train_cov, 
        xticklabels=data['feature_names'],
        yticklabels=data['feature_names'])
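
One thing worth noting when reading the heatmap: because `train_cov` is computed from Z-score normalized data, it coincides with the correlation matrix of the original features, so the diagonal is 1 and the off-diagonal entries lie between -1 and 1. The sketch below illustrates this on made-up toy data; the frame `df_toy` and its column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data with very different scales, made up for illustration only
rng = np.random.default_rng(0)
df_toy = pd.DataFrame(rng.normal(size=(200, 3)) * [1.0, 10.0, 100.0],
                      columns=["f1", "f2", "f3"])
df_toy["f2"] += 5 * df_toy["f1"]   # introduce some correlation between f1 and f2

# Z-score normalization, as done for X_train_n (pandas .std() uses ddof=1)
df_toy_n = (df_toy - df_toy.mean(axis=0)) / df_toy.std(axis=0)

cov_n = np.cov(df_toy_n.to_numpy(), rowvar=False)   # covariance of normalized data
corr = np.corrcoef(df_toy.to_numpy(), rowvar=False) # correlation of original data
print(np.allclose(cov_n, corr))
```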


The joint variability of the variables is not reflected in the Euclidean distance measure used so far (cf. Slide 7). We may try to account for this variability using the Mahalanobis distance measure.

In [None]:
# We need to supply the Mahalanobis distance with the data covariance matrix  
knn = KNeighborsRegressor(n_neighbors=5, metric="mahalanobis", metric_params={'V': train_cov})
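
To see what the Mahalanobis distance does with the covariance matrix, here is a minimal sketch on made-up, correlated 2-D data. It computes the distance between two points by hand as sqrt((a - b)^T C^{-1} (a - b)), where C is the covariance matrix, and compares it with `scipy.spatial.distance.mahalanobis`, which expects the inverse covariance matrix. All names (`X_toy2`, `cov_toy`, `a`, `b`) are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.distance import mahalanobis

# Toy correlated 2-D data, made up for illustration only
rng = np.random.default_rng(0)
X_toy2 = rng.multivariate_normal(mean=[0, 0], cov=[[1.0, 0.8], [0.8, 1.0]], size=500)

cov_toy = np.cov(X_toy2, rowvar=False)
cov_inv = np.linalg.inv(cov_toy)

a, b = X_toy2[0], X_toy2[1]
diff = a - b

d_manual = np.sqrt(diff @ cov_inv @ diff)   # Mahalanobis distance by hand
d_scipy = mahalanobis(a, b, cov_inv)        # SciPy takes the inverse covariance
d_euclid = np.linalg.norm(diff)             # Euclidean distance for comparison
print(d_manual, d_scipy, d_euclid)
```

When the covariance matrix is the identity, the Mahalanobis distance reduces to the Euclidean distance; the two differ exactly when the features are scaled differently or correlated.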

### *Exercise:*
* Fit the newly specified model (that relies on the Mahalanobis distance) using both the original data and the normalized data
* Evaluate the models with different numbers of neighbors and compare with the results previously obtained