# 4.3 The Distance Matrix

In many applications, we need the distance between every pair of observations ${\bf x}_i$ and ${\bf x}_j$ in a data set. How do we represent this information? The most common way is to use an $n \times n$ matrix, where the $(i, j)$th entry is the distance between ${\bf x}_i$ and ${\bf x}_j$. That is,

$$ D = \begin{pmatrix} 
d({\bf x}_1, {\bf x}_1) & d({\bf x}_1, {\bf x}_2) & \cdots & d({\bf x}_1, {\bf x}_n) \\ 
d({\bf x}_2, {\bf x}_1) & d({\bf x}_2, {\bf x}_2) & \cdots & d({\bf x}_2, {\bf x}_n) \\ 
\vdots & \vdots & \ddots & \vdots \\
d({\bf x}_n, {\bf x}_1) & d({\bf x}_n, {\bf x}_2) & \cdots & d({\bf x}_n, {\bf x}_n)
\end{pmatrix}. $$

There are a few things we can say about the $n\times n$ distance matrix $D$.

1. All of the entries of $D$ are non-negative.
2. Because the distance between any observation and itself, $d({\bf x}_i, {\bf x}_i)$, is always zero, the _diagonal_ elements of this matrix, $D_{ii}$, are all equal to 0.
3. For many distance metrics, including Euclidean and Manhattan distance, $d$ is symmetric, meaning that $d({\bf x}_i, {\bf x}_j) = d({\bf x}_j, {\bf x}_i)$. Therefore, the matrix $D$ will also be symmetric; that is, the values in the upper triangle will match their reflection in the lower triangle.
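
To make these three properties concrete, here is a minimal sketch that builds a distance matrix for three made-up points with an explicit double loop and then checks each property. The array `X` and the name `D_toy` are illustrative assumptions, not part of the housing example.

```python
import numpy as np

# Three toy observations in two dimensions (made-up values for illustration).
X = np.array([[0.0, 0.0],
              [3.0, 4.0],
              [6.0, 8.0]])

n = len(X)
D_toy = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        # Euclidean distance between observation i and observation j
        D_toy[i, j] = np.sqrt(((X[i] - X[j]) ** 2).sum())

# Distances: 5 between rows 0 and 1, 10 between rows 0 and 2, 5 between rows 1 and 2.
print(D_toy)

print((D_toy >= 0).all())            # property 1: non-negative entries
print((np.diag(D_toy) == 0).all())   # property 2: zero diagonal
print(np.allclose(D_toy, D_toy.T))   # property 3: symmetric
```

The double loop is only meant to mirror the definition of $D$ entry by entry; it would be slow on a data set of realistic size.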

How do we calculate the distance matrix for a `DataFrame` consisting of all quantitative variables? For example, suppose we want to calculate the matrix of distances between each of the houses in the Ames housing data set, based on the number of bedrooms, number of bathrooms, and the living area (in square feet).

In [5]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 6
pd.options.display.max_columns = 6

housing_df = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/AmesHousing.txt",
                         sep="\t")

# extract 3 quantitative variables
housing_df_quant = housing_df[["Bedroom AbvGr", "Gr Liv Area"]].copy()
housing_df_quant["Bathrooms"] = (
    housing_df["Full Bath"] + 
    0.5 * housing_df["Half Bath"]
)
housing_df_quant

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 3 | 1656 | 1.0 |
| 1 | 2 | 896 | 1.0 |
| 2 | 3 | 1329 | 1.5 |
| ... | ... | ... | ... |
| 2927 | 3 | 970 | 1.0 |
| 2928 | 2 | 1389 | 1.0 |
| 2929 | 3 | 2000 | 2.5 |


_The Long Way:_ It is possible to create the distance matrix entirely in `pandas`. The idea is to first define a function that calculates the distances between a given observation and all of the other observations:

In [6]:
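def get_euclidean_dists_from_obs(obs):
    # subtract obs from every row, square the differences,
    # sum across the columns, and take the square root
    return np.sqrt(
        ((housing_df_quant - obs) ** 2).sum(axis=1)
    )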

get_euclidean_dists_from_obs(housing_df_quant.loc[0])

0         0.000000
1       760.000658
2       327.000382
           ...    
2927    686.000000
2928    267.001873
2929    344.003270
Length: 2930, dtype: float64

The code for this function is very similar to the code that we wrote for Exercise 5 at the end of Section 4.1.

Now, to get a matrix of distances $D$, we simply need to apply this function to every row of the `DataFrame`. To achieve this, we use the `.apply()` method with `axis=1`:

In [7]:
D = housing_df_quant.apply(
    get_euclidean_dists_from_obs,
    axis=1
)
D

| | 0 | 1 | 2 | ... | 2927 | 2928 | 2929 |
|---|---|---|---|---|---|---|---|
| 0 | 0.000000 | 760.000658 | 327.000382 | ... | 686.000000 | 267.001873 | 344.003270 |
| 1 | 760.000658 | 0.000000 | 433.001443 | ... | 74.006756 | 493.000000 | 1104.001472 |
| 2 | 327.000382 | 433.001443 | 0.000000 | ... | 359.000348 | 60.010416 | 671.000745 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2927 | 686.000000 | 74.006756 | 359.000348 | ... | 0.000000 | 419.001193 | 1030.001092 |
| 2928 | 267.001873 | 493.000000 | 60.010416 | ... | 419.001193 | 0.000000 | 611.002660 |
| 2929 | 344.003270 | 1104.001472 | 671.000745 | ... | 1030.001092 | 611.002660 | 0.000000 |


Notice that this is a $2930 \times 2930$ symmetric matrix of non-negative numbers, with zeroes along the diagonal, just as we predicted.
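
The same computation can also be written with `numpy` broadcasting, which avoids calling a Python function once per row. The sketch below is one possible way to do it, not the approach used in this section. It uses only the first three houses from the table above, and the names `toy_df`, `dists_from_obs`, `D_apply`, and `D_broadcast` are made up for illustration.

```python
import numpy as np
import pandas as pd

# First three rows of housing_df_quant, typed in by hand.
toy_df = pd.DataFrame({
    "Bedroom AbvGr": [3, 2, 3],
    "Gr Liv Area": [1656, 896, 1329],
    "Bathrooms": [1.0, 1.0, 1.5],
})

# The "long way" from the text, applied to the toy data.
def dists_from_obs(obs):
    return np.sqrt(((toy_df - obs) ** 2).sum(axis=1))

D_apply = toy_df.apply(dists_from_obs, axis=1)

# Broadcasting: pair every row with every other row in one array operation.
X = toy_df.to_numpy(dtype=float)                     # shape (n, p)
diffs = X[:, np.newaxis, :] - X[np.newaxis, :, :]    # shape (n, n, p)
D_broadcast = np.sqrt((diffs ** 2).sum(axis=2))      # shape (n, n)

# Both approaches should agree up to floating-point error.
print(np.allclose(D_apply.to_numpy(), D_broadcast))
```

One trade-off to keep in mind: the intermediate array `diffs` has $n \times n \times p$ entries. For the full data set of 2930 houses and 3 variables that is on the order of 200 MB of 64-bit floats, so this version trades memory for speed.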

_The Short Way:_ Many Python packages can calculate distance matrices. One of them is scikit-learn, a machine learning package. Machine learning will be discussed in depth in Chapters 5-8, where we will explore the features of scikit-learn extensively. Because distance matrices are important in machine learning, scikit-learn provides functions for calculating them.

For example, the following code calculates the (Euclidean) distance matrix between all of the houses in the Ames housing data set:

In [8]:
from sklearn.metrics import pairwise_distances

D_ = pairwise_distances(housing_df_quant, metric="euclidean")
D_

array([[    0.        ,   760.00065789,   327.00038226, ...,
          686.        ,   267.00187265,   344.00327033],
       [  760.00065789,     0.        ,   433.00144342, ...,
           74.00675645,   493.        ,  1104.00147192],
       [  327.00038226,   433.00144342,     0.        , ...,
          359.00034819,    60.01041576,   671.00074516],
       ..., 
       [  686.        ,    74.00675645,   359.00034819, ...,
            0.        ,   419.00119332,  1030.00109223],
       [  267.00187265,   493.        ,    60.01041576, ...,
          419.00119332,     0.        ,   611.00265957],
       [  344.00327033,  1104.00147192,   671.00074516, ...,
         1030.00109223,   611.00265957,     0.        ]])

Notice that the return type is a `numpy` array, instead of a `pandas` `DataFrame`. That is because scikit-learn was designed to work with `numpy` arrays. Although it will accept `pandas` `DataFrame`s as arguments, scikit-learn will convert them to `numpy` arrays under the hood and return `numpy` arrays.
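
If the row and column labels matter, the array can be wrapped back into a `DataFrame`. Here is a minimal sketch, assuming a three-row toy `DataFrame` whose index labels (`house_a`, `house_b`, `house_c`) are hypothetical and whose values are the first three houses from the table above:

```python
import pandas as pd
from sklearn.metrics import pairwise_distances

toy_df = pd.DataFrame(
    {"Bedroom AbvGr": [3, 2, 3],
     "Gr Liv Area": [1656, 896, 1329],
     "Bathrooms": [1.0, 1.0, 1.5]},
    index=["house_a", "house_b", "house_c"],   # hypothetical labels
)

# pairwise_distances returns a plain numpy array; the index labels are lost.
D_arr = pairwise_distances(toy_df, metric="euclidean")
print(type(D_arr).__name__)

# Reattach the labels to both the rows and the columns.
D_labeled = pd.DataFrame(D_arr, index=toy_df.index, columns=toy_df.index)
print(D_labeled.loc["house_a", "house_b"])   # about 760.0007, matching the table above
```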

Fortunately, many of the usual `pandas` methods have `numpy` counterparts that work the same way. For example, to get the maximum value in each row, we can use the `.max()` method with `axis=1`.

In [9]:
D_.max(axis=1)

array([ 3986.00028224,  4746.00034239,  4313.00011593, ...,  4672.0002408 ,
        4253.00038208,  3642.        ])
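
Not every `pandas` method carries over under the same name, however. A `numpy` array has no `.idxmax()` method; the closest counterpart is `.argmax()`, which returns integer positions rather than index labels. A small sketch on a made-up $3 \times 3$ distance matrix (`D_small` is an illustrative name):

```python
import numpy as np

# A made-up symmetric distance matrix with a zero diagonal.
D_small = np.array([[0.0, 5.0, 10.0],
                    [5.0, 0.0, 5.0],
                    [10.0, 5.0, 0.0]])

row_max = D_small.max(axis=1)        # largest distance in each row
row_argmax = D_small.argmax(axis=1)  # column position of that largest distance

print(row_max.tolist())     # [10.0, 5.0, 10.0]
print(row_argmax.tolist())  # [2, 0, 0]
```

In the second row the largest value, 5.0, appears twice; `.argmax()` reports the first position where it occurs.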

# Exercises

Exercises 1-3 ask you to work with a data set that describes the chemical composition of 1599 red wines (`https://raw.githubusercontent.com/dlsun/data-science-book/master/data/wines/reds.csv`). All 12 variables in this data set are quantitative.

**Exercise 1.** Calculate the distance between every pair of wines in this data set.

In [142]:
wine_df = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/wines/reds.csv", sep=";")

wine_df_dists = pairwise_distances(wine_df, metric="euclidean")
wine_df_dists

array([[  6.74349576e-07,   3.58601922e+01,   2.04097050e+01, ...,
          1.91056851e+01,   2.33225970e+01,   1.10366429e+01],
       [  3.58601922e+01,   0.00000000e+00,   1.64045889e+01, ...,
          2.73859012e+01,   2.41316795e+01,   2.61010203e+01],
       [  2.04097050e+01,   1.64045889e+01,   0.00000000e+00, ...,
          1.99197504e+01,   1.98237135e+01,   1.26797093e+01],
       ..., 
       [  1.91056851e+01,   2.73859012e+01,   1.99197504e+01, ...,
          0.00000000e+00,   5.18964605e+00,   1.12669730e+01],
       [  2.33225970e+01,   2.41316795e+01,   1.98237135e+01, ...,
          5.18964605e+00,   0.00000000e+00,   1.42996395e+01],
       [  1.10366429e+01,   2.61010203e+01,   1.26797093e+01, ...,
          1.12669730e+01,   1.42996395e+01,   0.00000000e+00]])

**Exercise 2.** Using the distance matrix that you calculated in Exercise 1, calculate the distance of the wine that is most similar to each wine.

In [144]:
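# replace the zeroes on the diagonal with NaN (in place) so that
# a wine is never counted as its own nearest neighbor
np.fill_diagonal(wine_df_dists, np.nan)
# .min() skips NaN; it works column by column, which is fine for a symmetric matrix
wine_min_dists = pd.DataFrame(wine_df_dists).min()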
wine_min_dists

0       6.743496e-07
1       1.543859e+00
2       1.250569e+00
            ...     
1596    0.000000e+00
1597    4.638430e-01
1598    1.893854e+00
Length: 1599, dtype: float64

**Exercise 3.** Using the distance matrix that you calculated in Exercise 1, determine the identity of the wine that is most similar to each wine.

In [146]:
# relies on the diagonal of wine_df_dists having been set to NaN in Exercise 2
pd.DataFrame(wine_df_dists).idxmin()

0          4
1        752
2        196
        ... 
1596    1592
1597    1594
1598     569
Length: 1599, dtype: int64
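
The solutions to Exercises 2 and 3 both depend on masking the diagonal first. The following sketch shows the same idea in pure `numpy` on a made-up $3 \times 3$ distance matrix. The names `D_small`, `D_masked`, `nearest_dist`, and `nearest_idx` are illustrative, and the copy is an added precaution that the solutions above do not take.

```python
import numpy as np

# A made-up symmetric distance matrix with a zero diagonal.
D_small = np.array([[0.0, 5.0, 8.0],
                    [5.0, 0.0, 4.0],
                    [8.0, 4.0, 0.0]])

# np.fill_diagonal modifies its argument in place, so work on a copy
# to keep the original matrix intact.
D_masked = D_small.copy()
np.fill_diagonal(D_masked, np.nan)

# NaN-aware reductions ignore the masked diagonal.
nearest_dist = np.nanmin(D_masked, axis=1)    # distance to the nearest other observation
nearest_idx = np.nanargmin(D_masked, axis=1)  # position of that observation

print(nearest_dist.tolist())  # [5.0, 4.0, 4.0]
print(nearest_idx.tolist())   # [1, 2, 1]
```

Without the mask, every row's minimum would be the zero on the diagonal, and each observation would be reported as its own nearest neighbor.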