$\def\*#1{\mathbf{#1}}$
$\DeclareMathOperator*{\argmax}{arg\,max}$

# Distance Methods

In [None]:
import numpy as np
import pandas

import matplotlib.pyplot as plt
%matplotlib notebook

from scipy.spatial.distance import pdist, squareform, cosine, correlation
from scipy import stats

from sklearn.preprocessing import StandardScaler

from sklearn.neighbors import KNeighborsClassifier

from sklearn.model_selection import cross_val_score
from sklearn.model_selection import cross_val_predict
from sklearn.model_selection import GridSearchCV

from sklearn import metrics

## Distance

Consider the following dataset:

| $\*x_i$      |   Age ($X_1$)     |   Income ($X_2$) | 
|------------|-------------------|------------------| 
| $\*x_1$      |     12            |     300          | 
| $\*x_2$      |     14            |     500          | 
| $\*x_3$      |     18            |     1000         | 
| $\*x_4$      |     23            |     2000         | 
| $\*x_5$      |     27            |     3500         | 
| $\*x_6$      |     28            |     4000         | 
| $\*x_7$      |     34            |     4300         | 
| $\*x_8$      |     37            |     6000         | 
| $\*x_9$      |     39            |     2500         | 
| $\*x_{10}$   |     40            |     2700         | 


In methods such as classification and clustering, we have to compute the similarity (or dissimilarity) between pairs of observations. For example, we could use the Euclidean distance to measure the dissimilarity between each pair of instances in this dataset. Computing it for every pair yields the so-called **distance matrix**.

Use the variable `x` to declare this data set as a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.html). Using [pdist](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) and [squareform](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html), compute the corresponding distance matrix and display its content.

In [None]:
x  = [[12, 300], [14, 500], [18, 1000], [23, 2000], [27, 3500], [28, 4000],
[34, 4300], [37, 6000], [39, 2500], [40, 2700]]

# ...
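
One possible way to approach this step is sketched below. It is a minimal sketch rather than a reference solution: the column names `age` and `income` and the variable names `condensed` and `dist_matrix` are assumptions introduced here for illustration.

```python
import pandas as pd
from scipy.spatial.distance import pdist, squareform

# Age/income values copied from the table above
# (the column names are an assumption, not given in the exercise)
x = pd.DataFrame(
    [[12, 300], [14, 500], [18, 1000], [23, 2000], [27, 3500],
     [28, 4000], [34, 4300], [37, 6000], [39, 2500], [40, 2700]],
    columns=['age', 'income'],
)

# pdist returns the condensed form: n * (n - 1) / 2 pairwise distances
condensed = pdist(x.to_numpy(), metric='euclidean')

# squareform expands it into the symmetric n x n matrix with a zero diagonal
dist_matrix = pd.DataFrame(squareform(condensed), index=x.index, columns=x.index)
print(dist_matrix.round(2))
```

The condensed vector stores each pair only once, which is why `squareform` is needed to recover the full symmetric matrix.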

## Normalization

The two attributes in this data set have very different scales. The sample range (*i.e.*, the difference between the maximal and minimal value) for $X_1$ is $\hat{r}_1 = 40 - 12 = 28$ and the sample range for $X_2$ is $\hat{r}_2 = 6000 - 300 = 5700$. For example, the Euclidean distance between $\*x_1$ and $\*x_2$ is $\sqrt{2^2 + 200^2} \approx 200.01$, which is almost entirely due to the income difference. The contribution of each variable to the dissimilarity measure therefore depends on its scale, and the contribution of $X_1$ is overshadowed by that of $X_2$.

Apply the [standard score normalisation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) to this data set. With the help of [pandas.DataFrame.describe](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.describe.html), compare the statistics of each attribute before and after applying this preprocessing. What do you conclude?

In [None]:
# ...
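
The normalisation step could be sketched as follows, assuming the same data as in the table above. The column names and the variable `x_scaled` are hypothetical names chosen for this example.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Same values as the table above (column names are an assumption)
x = pd.DataFrame(
    [[12, 300], [14, 500], [18, 1000], [23, 2000], [27, 3500],
     [28, 4000], [34, 4300], [37, 6000], [39, 2500], [40, 2700]],
    columns=['age', 'income'],
)

# Standard score: subtract the column mean, divide by the column standard deviation
scaler = StandardScaler()
x_scaled = pd.DataFrame(scaler.fit_transform(x), columns=x.columns)

print(x.describe())
print(x_scaled.describe())
```

Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas `describe` reports the sample standard deviation (`ddof=1`), so the `std` row of the scaled frame is slightly above 1 rather than exactly 1.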

Compute the distance matrix for the resulting data frame. Compare the two distance matrices visually with the help of [matshow](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.matshow.html) in the same figure (two columns).

In [None]:
# ... 

Show by an example that [cosine distance](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.cosine.html) is not a true distance metric.

In [None]:
# ...
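
A small counterexample may help here. The sketch below uses three hand-picked 2-D vectors (`a`, `b` and `c` are illustrative choices, not part of the exercise data) to check two metric axioms: the triangle inequality and the requirement that a zero distance implies identical points.

```python
import numpy as np
from scipy.spatial.distance import cosine

# Three illustrative vectors: b lies "between" a and c in angle
a = np.array([1.0, 0.0])
b = np.array([1.0, 1.0])
c = np.array([0.0, 1.0])

d_ac = cosine(a, c)  # orthogonal vectors
d_ab = cosine(a, b)  # 45 degrees apart
d_bc = cosine(b, c)  # 45 degrees apart

# Triangle inequality would require d_ac <= d_ab + d_bc
print(d_ac, d_ab + d_bc)

# Two distinct vectors pointing in the same direction
d_scaled = cosine(b, 2 * b)
print(d_scaled)
```

Here the direct distance from `a` to `c` exceeds the detour through `b`, and `b` and `2 * b` are different points at (numerically) zero distance, so cosine distance fails both axioms.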

## Working in higher dimensions

Execute the following code. What do you conclude from it?

In [None]:
# see "On the Surprising Behavior of Distance Metrics in High Dimensional Space"
# by Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim

p_norms = [np.inf, 1, 2, 10]
p_norms_symbols = list(p_norms)
p_norms_symbols[0] = '\\infty'

dimensions = [10, 20, 30, 40, 50, 100, 200, 500, 1000]

relative_contrasts = np.zeros((len(dimensions), len(p_norms)))

def relative_contrast(p_norm, points):
    dists = np.linalg.norm(points, axis=1, ord=p_norm)
    return (max(dists) - min(dists))/min(dists)

vect_relative_contrast = np.vectorize(relative_contrast, excluded=['points'])

for i, d in enumerate(dimensions):
    relative_contrasts_d = np.zeros((30, len(p_norms)))
    for j in range(30):
        points = np.random.rand(100, d)
        relative_contrasts_d[j, :] = vect_relative_contrast(p_norms, points=points)
            
    relative_contrasts[i, :] = np.mean(relative_contrasts_d, axis=0)

colors = ['r', 'g', 'b', 'black']
fig, ax = plt.subplots()
for i, c in enumerate(colors):
    ax.plot(dimensions, relative_contrasts[:,i], color=c, label=f'$l_{{{p_norms_symbols[i]}}}$')
    ax.plot(dimensions, relative_contrasts[:,i], color=c, marker='.')

ax.legend()
ax.set_ylabel('Relative contrast')
ax.set_xlabel('Data dimensionality')
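
For reference, the quantity computed by `relative_contrast` can be read directly from the code above. Writing $D_{\max}$ and $D_{\min}$ (notation introduced here, not in the original) for the largest and smallest $l_p$ norms among the sampled points $\*x_i$, that is, their distances to the origin:

$$\text{relative contrast} = \frac{D_{\max} - D_{\min}}{D_{\min}}, \qquad D_{\max} = \max_i \lVert \*x_i \rVert_p, \quad D_{\min} = \min_i \lVert \*x_i \rVert_p$$

Each plotted value is the mean of this ratio over 30 random samples of 100 points drawn uniformly from $[0, 1)^d$.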

## Distances between Probability distributions

Consider the following dataset composed of two numerical attributes. How would you describe the distributions of these two variables?

In [None]:
data = {'x' : [ 0.43517632, 0.47641718, 0.60575377, 0.46470922, 0.51737833, 0.47655526,
                0.51258292, 0.60819913, 0.5356796,  0.33825441, 0.4247407,  0.53410331,
                0.45459079, 0.52349406, 0.52076398, 0.59078815, 0.61926123, 0.54406685,
                0.60063367, 0.66453623, 0.5336054,  0.59196106, 0.53964875, 0.51487448,
                0.62458555, 0.64588703, 0.31222375, 0.66876766, 0.56189382, 0.61496229,
                0.73412465, 0.52608607, 0.47879383, 0.68361014, 0.42019189, 0.35498482,
                0.44017255, 0.43093393, 0.38340041, 0.56175196, 0.37891924, 0.37068189,
                0.570468,   0.43602521, 0.3214148,  0.51539918, 0.40666925, 0.47961649,
                0.54179767, 0.37537615, 0.33022805, 0.33208193, 0.46983545, 0.63879754,
                0.46640275, 0.50345714, 0.30398865, 0.42006949, 0.53104415, 0.48294933,
                0.44317104, 0.43342254, 0.28071584, 0.63962758, 0.61057127, 0.55960305,
                0.50020467, 0.46448242, 0.37259858, 0.65686464, 0.65203484, 0.52880382,
                0.28391691, 0.35041947, 0.51878248, 0.45755531, 0.64389836, 0.62504769,
                0.50432199, 0.43340175, 0.48876527, 0.56172746, 0.36887719, 0.65098322,
                0.54340335, 0.63703311, 0.46468021, 0.5337375,  0.5074945,  0.58994249,
                0.54475363, 0.64021255, 0.59169501, 0.38793481, 0.64764372, 0.57943197,
                0.53000465, 0.48004527, 0.53519401, 0.46445173],
        'y' : [ 0.04161744, 0.85724693, 0.88538545, 0.49085248, 0.76853788, 0.37776892,
                0.83086266, 0.88349682, 0.5046776,  0.28957081, 0.34705905, 0.36218001,
                0.02318062, 0.86876835, 0.88041603, 0.64209743, 0.75309349, 0.63985043,
                0.99383177, 0.58665476, 0.95352042, 0.5491464,  0.34533553, 0.41832789,
                0.87301048, 0.51067468, 0.91975204, 0.28539023, 0.19475197, 0.96762586,
                0.87386643, 0.70725661, 0.27803115, 0.78599879, 0.33253974, 0.06730955,
                0.8579904,  0.70883276, 0.41198892, 0.07861203, 0.5781772,  0.86368116,
                0.50329431, 0.58198719, 0.73229438, 0.62457685, 0.33725423, 0.69671389,
                0.96264407, 0.06124825, 0.21348643, 0.90350953, 0.21741805, 0.83571623,
                0.71779197, 0.13182516, 0.94561299, 0.86705673, 0.63058087, 0.67500915,
                0.73819059, 0.52762448, 0.58441263, 0.11107409, 0.62566132, 0.52100321,
                0.85615609, 0.56518927, 0.27547012, 0.11970483, 0.14742836, 0.97487006,
                0.80213574, 0.23882089, 0.68710164, 0.8203038,  0.41653959, 0.67386978,
                0.12651408, 0.53003848, 0.11002693, 0.54582815, 0.30474073, 0.46919815,
                0.24471064, 0.00883416, 0.12311192, 0.53539533, 0.47142311, 0.09704699,
                0.86414417, 0.9913629,  0.50587921, 0.8392211,  0.16903465, 0.94725847,
                0.14263359, 0.92539322, 0.96124385, 0.00939541]}

df = pandas.DataFrame(data)

x_relfreq = stats.relfreq(df.x, numbins=10)
y_relfreq = stats.relfreq(df.y, numbins=10)


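def relfreq_hist(ax, relfreq):
    # Left edge of bin k is lowerlimit + k * binsize, for k = 0 .. n - 1
    x = relfreq.lowerlimit + relfreq.binsize * np.arange(relfreq.frequency.size)
    # align='edge' makes each bar span exactly its bin
    ax.bar(x, relfreq.frequency, width=relfreq.binsize, align='edge')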

fig, (ax1, ax2) = plt.subplots(1, 2)

fig.suptitle('Relative frequency histograms')

relfreq_hist(ax1, x_relfreq)
ax1.set_title('x')

relfreq_hist(ax2, y_relfreq)
ax2.set_title('y')

Based on the use of distance metrics and the distribution of these attributes, determine which attribute follows a [normal distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html).

In [None]:
# ...
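
One way to turn this question into a distance computation is sketched below. It is only one possible reading of the exercise: the helper `distance_to_normal` is a hypothetical name, it uses `np.histogram` rather than `stats.relfreq`, and it is demonstrated on synthetic stand-in samples rather than on `df`. The idea is to compare the observed relative frequencies with the bin probabilities of a normal distribution fitted to the sample, using the Euclidean distance.

```python
import numpy as np
from scipy import stats

def distance_to_normal(sample, bins=10):
    """Euclidean distance between observed relative frequencies and the
    bin probabilities of a normal distribution fitted to the sample."""
    sample = np.asarray(sample, dtype=float)
    counts, edges = np.histogram(sample, bins=bins)
    observed = counts / sample.size
    # Fit the normal distribution with the sample mean and standard deviation
    mu, sigma = sample.mean(), sample.std(ddof=1)
    # Probability mass of the fitted normal inside each bin
    expected = np.diff(stats.norm.cdf(edges, loc=mu, scale=sigma))
    return np.linalg.norm(observed - expected)

# Synthetic stand-ins (assumption): one normal sample, one uniform sample
rng = np.random.default_rng(0)
normal_like = rng.normal(0.5, 0.1, size=10_000)
uniform_like = rng.uniform(0.0, 1.0, size=10_000)

d_normal = distance_to_normal(normal_like)
d_uniform = distance_to_normal(uniform_like)
print(d_normal, d_uniform)
```

In the notebook the same helper could be applied to `df.x` and `df.y`; the attribute with the smaller distance would be the better candidate for a normal distribution, keeping in mind that 100 observations give a much noisier estimate than the large synthetic samples used here.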

## $k$-Nearest Neighbors

Build a [$k$-nearest neighbors classifier](http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) for the [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) with $k = 3$ (see `iris.data` in `/datasets`).

In [None]:
# ...
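
A minimal sketch of this step follows. As an assumption, it loads the copy of the iris data bundled with scikit-learn through `load_iris` instead of reading the `iris.data` file mentioned above, so that the example runs on its own.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

# Assumption: use scikit-learn's bundled iris data rather than `iris.data`
X, y = load_iris(return_X_y=True)

# k = 3 neighbors, default Minkowski metric with p = 2 (Euclidean distance)
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X, y)

# Predicted class labels for the first five observations
print(knn.predict(X[:5]))
```

Predictions on the training data itself are optimistic, which is why the next steps rely on cross-validation.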

Evaluate the performance of this classifier with a [K-Folds cross-validator](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_validate.html) by using 3 splits.

In [None]:
# ...
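
The evaluation could look like the sketch below, again assuming the bundled `load_iris` data. The choice to shuffle the folds and the `random_state` value are assumptions made for this example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)

# The bundled iris samples are sorted by class, so the folds are shuffled;
# without shuffling, each of the 3 test folds would hold a single class
# that is absent from the corresponding training set.
cv = KFold(n_splits=3, shuffle=True, random_state=0)

scores = cross_val_score(knn, X, y, cv=cv, scoring='accuracy')
print(scores, scores.mean())
```

Passing `cv=3` directly is an alternative: for classifiers, scikit-learn then uses stratified folds, which also avoids the single-class fold problem.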

Use [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict), which returns the prediction made for each item when it was part of the test set, to compute the [accuracy score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

In [None]:
# ...

Compute the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for this model.

In [None]:
# ...
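
The accuracy score and the confusion matrix can both be derived from the same out-of-fold predictions. The sketch below assumes the bundled `load_iris` data and shuffled 3-fold splitting; the variable names are illustrative.

```python
from sklearn import metrics
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_predict
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=3)
cv = KFold(n_splits=3, shuffle=True, random_state=0)

# One prediction per observation, made while it was in the test fold
y_pred = cross_val_predict(knn, X, y, cv=cv)

accuracy = metrics.accuracy_score(y, y_pred)
# Rows are true classes, columns are predicted classes
cm = metrics.confusion_matrix(y, y_pred)

print(accuracy)
print(cm)
```

The diagonal of the confusion matrix counts the correct predictions, so its trace divided by the total number of observations equals the accuracy.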

The k-NN classifier depends on several parameters, such as the value of $p$ in the Minkowski metric and the number of neighbors. These parameters are the so-called **hyper-parameters**, which are not directly learnt within estimators (see [tuning the hyper-parameters of an estimator](http://scikit-learn.org/stable/modules/grid_search.html)). Using [sklearn.model_selection.GridSearchCV](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV), determine the best values of $k \in \{1,2,\ldots,30\}$ and $p \in \{1, 2, 3, \infty\}$ for this data set. The best values are in the `best_params_` attribute.

In [None]:
# ...
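
A possible grid search is sketched below. Two choices are assumptions of this example: the data come from `load_iris`, and the case $p = \infty$ is expressed with `metric='chebyshev'` (the limit of the Minkowski distance as $p \to \infty$) instead of passing an infinite `p`. The number of folds (`cv=5`) is also an arbitrary choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

k_values = list(range(1, 31))

# Two sub-grids: finite p with the Minkowski metric, and Chebyshev for p = infinity
param_grid = [
    {'n_neighbors': k_values, 'metric': ['minkowski'], 'p': [1, 2, 3]},
    {'n_neighbors': k_values, 'metric': ['chebyshev']},
]

search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
search.fit(X, y)

print(search.best_params_, search.best_score_)
```

Splitting the grid into two dictionaries keeps `p` out of the Chebyshev candidates, where it has no effect.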

Based on the `best_estimator_` attribute of `GridSearchCV`, evaluate the accuracy score of the best k-NN classifier with cross validation using 2 splits. 

In [None]:
# ...