$\def\*#1{\mathbf{#1}}$
$\DeclareMathOperator*{\argmax}{arg\,max}$

# Distance Methods

In [None]:
import numpy as np

import matplotlib.pyplot as plt
%matplotlib notebook

from scipy.spatial.distance import pdist, squareform

## Distance

Let us consider the following dataset:

| $\*x_i$      |   Age ($X_1$)     |   Income ($X_2$) | 
|------------|-------------------|------------------| 
| $\*x_1$      |     12            |     300          | 
| $\*x_2$      |     14            |     500          | 
| $\*x_3$      |     18            |     1000         | 
| $\*x_4$      |     23            |     2000         | 
| $\*x_5$      |     27            |     3500         | 
| $\*x_6$      |     28            |     4000         | 
| $\*x_7$      |     34            |     4300         | 
| $\*x_8$      |     37            |     6000         | 
| $\*x_9$      |     39            |     2500         | 
| $\*x_{10}$   |     40            |     2700         | 


In methods like classification and clustering, we have to compute the similarity (or dissimilarity) between pairs of observations. For example, we could use the euclidean distance to measure the dissimilarity between each pair of instances in this dataset. Computing it for every pair yields the so-called **distance matrix**.
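
To make the notion of a distance matrix concrete, here is a minimal sketch on three hypothetical points (`toy_points` is an illustration, not the dataset above). It assumes the euclidean metric and shows how the condensed output of `pdist` relates to the square matrix returned by `squareform`.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Three hypothetical 2-D points (not the dataset above).
toy_points = np.array([[0.0, 0.0],
                       [3.0, 4.0],
                       [6.0, 8.0]])

# pdist returns the condensed form: one distance per unordered pair,
# in the order (0, 1), (0, 2), (1, 2).
condensed = pdist(toy_points, metric="euclidean")

# squareform expands it into a symmetric matrix with a zero diagonal.
distance_matrix = squareform(condensed)
print(distance_matrix)
```

Entry `[i, j]` of the square matrix is the distance between point `i` and point `j`, so the matrix is symmetric and its diagonal is zero.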

**Exercise** - Declare this data set as a Pandas DataFrame. Based on [pdist](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html) and [squareform](http://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.squareform.html) compute the corresponding distance matrix.

In [None]:
# ...

## Normalization

The two attributes in this data set have very different scales. The sample range for $X_1$ is $\hat{r}_1 = 40 - 12 = 28$ and the sample range for $X_2$ is $\hat{r}_2 = 6000 - 300 = 5700$. For example, the euclidean distance between $\*x_1$ and $\*x_2$ is $\sqrt{2^2 + 200^2} \approx 200.01$, which is almost exactly the income difference alone. The contribution of each variable to the dissimilarity measure depends on its scale, so the contribution of $X_1$ is overshadowed by that of $X_2$.
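
The arithmetic above can be checked with a short sketch. It uses only the values of $\*x_1$ and $\*x_2$ from the table; the variable names (`squared_diffs`, `shares`) are illustrative.

```python
import numpy as np

x1 = np.array([12.0, 300.0])   # (age, income) of x_1 from the table
x2 = np.array([14.0, 500.0])   # (age, income) of x_2 from the table

# Squared difference per attribute: age first, income second.
squared_diffs = (x1 - x2) ** 2

# Euclidean distance between the two observations.
distance = np.sqrt(squared_diffs.sum())

# Fraction of the squared distance contributed by each attribute.
shares = squared_diffs / squared_diffs.sum()
print(distance, shares)
```

The second entry of `shares` is the part of the squared distance that comes from income; comparing it with the first entry shows how little age weighs before any rescaling.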

**Exercise** - Apply the [standard score normalisation](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html) on this data set and compute the resulting mean and standard deviation.
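
For reference, the standard score replaces each value $x_{ij}$ of attribute $X_j$ by

$$z_{ij} = \frac{x_{ij} - \hat{\mu}_j}{\hat{\sigma}_j},$$

where $\hat{\mu}_j$ and $\hat{\sigma}_j$ denote the sample mean and standard deviation of $X_j$ (these symbols are introduced here for illustration). One detail worth keeping in mind when checking the result: `StandardScaler` divides by the population standard deviation (denominator $n$), whereas the pandas `std` method uses $n - 1$ by default. With $n = 10$ observations, pandas would therefore report $\sqrt{10/9} \approx 1.054$ rather than exactly $1$ unless `ddof=0` is passed.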

In [None]:
# ... 

**Exercise** - Compute the distance matrix for the resulting data frame. Compare the two distance matrices visually with the help of [pcolor](https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.pcolor.html).

In [None]:
# ... 

**Exercise** - Show by an example that *cosine distance* is not a true distance metric. This function is defined as follows:
$$d(p, q) = 1 - |\cos(p, q)| = 1 - \left|\frac{p \cdot q}{\|p\|\|q\|}\right|.$$
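
As a starting point, the definition above can be written as a small helper. This is only a sketch: the name `abs_cosine_distance` is hypothetical, and the function assumes both vectors are non-zero. Note that `scipy.spatial.distance.cosine` computes $1 - \cos(p, q)$ without the absolute value, so it is not a drop-in replacement for this definition.

```python
import numpy as np

def abs_cosine_distance(p, q):
    """Return 1 - |cos(p, q)| for two non-zero vectors."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    cos_pq = np.dot(p, q) / (np.linalg.norm(p) * np.linalg.norm(q))
    return 1.0 - abs(cos_pq)

print(abs_cosine_distance([1, 0], [0, 1]))  # orthogonal vectors
print(abs_cosine_distance([1, 0], [1, 1]))  # vectors at 45 degrees
```

One way to approach the exercise is to evaluate this helper on well-chosen vectors and test each metric axiom in turn: identity of indiscernibles, symmetry, and the triangle inequality.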

In [None]:
# ...

## Working in higher dimensions

Execute the following code. What do you conclude from it?

In [None]:
# see "On the Surprising Behavior of Distance Metrics in High Dimensional Space"
# by Charu C. Aggarwal, Alexander Hinneburg, and Daniel A. Keim

dimensions = [10, 20, 30, 40, 50, 100, 200, 500, 1000]
p_norms = [1, 2, 10]

relative_contrasts = np.zeros((len(dimensions), len(p_norms)))

for i, d in enumerate(dimensions):
    relative_contrasts_d = np.zeros((30, len(p_norms)))
    for j in range(30):
        points = np.random.rand(100, d)
        for k, p in enumerate(p_norms):
            # L_p norm of each point, i.e. its distance to the origin
            dists = np.linalg.norm(points, axis=1, ord=p)
            relative_contrasts_d[j, k] = (dists.max() - dists.min()) / dists.min()
    for k, p in enumerate(p_norms):
        relative_contrasts[i, k] = np.mean(relative_contrasts_d[:,k])

colors = ['r', 'g', 'b']
for k, (p, color) in enumerate(zip(p_norms, colors)):
    # one curve per norm, drawn as a line with point markers
    plt.plot(dimensions, relative_contrasts[:, k], color + '.-', label='p = %d' % p)

plt.ylabel('Relative contrast')
plt.xlabel('Data dimensionality')
plt.legend()
plt.show()

## Distances between Probability distributions

Consider the following dataset composed of two numerical attributes. Based on the use of distance metrics and the distribution of these attributes, determine which attribute follows a [normal distribution](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.norm.html).

In [None]:
# ...

## $k$-Nearest Neighbors

Build a $k$-nearest neighbors classifier for the [iris dataset](https://archive.ics.uci.edu/ml/datasets/iris) with $k = 3$.

In [None]:
# ...

Evaluate the performance of this classifier with a [K-Folds cross-validator](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html#sklearn.model_selection.KFold) using 10 splits. Do you observe different results when setting the `shuffle` parameter to `True`?
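
To see what the `shuffle` parameter changes, the following sketch applies `KFold` to six hypothetical, ordered samples (`toy` is an illustration, not the iris data). The value `random_state=0` is an arbitrary choice made only for reproducibility.

```python
import numpy as np
from sklearn.model_selection import KFold

toy = np.arange(6)  # six hypothetical, ordered samples

# Without shuffling, each test fold is a block of consecutive indices.
ordered_folds = [test.tolist() for _, test in KFold(n_splits=3).split(toy)]

# With shuffling, the indices are permuted before being cut into folds.
shuffled_cv = KFold(n_splits=3, shuffle=True, random_state=0)
shuffled_folds = [test.tolist() for _, test in shuffled_cv.split(toy)]

print(ordered_folds)
print(shuffled_folds)
```

This matters when the samples are stored sorted by class, as is the case for `sklearn.datasets.load_iris`: consecutive blocks then contain few classes, which can affect the score of each fold.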

In [None]:
# ...

Use [cross_val_predict](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_predict.html#sklearn.model_selection.cross_val_predict) to compute the [accuracy score](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html).

In [None]:
# ...

Compute the [confusion matrix](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html) for this model.

In [None]:
# ...

Create a figure that presents the evolution of the *accuracy score* with respect to the value of $k \in [1, 10]$.

In [None]:
# ...