<a href="https://colab.research.google.com/github/dlsun/pods/blob/master/03-Quantitative-Data/3.7%20Distance%20Metrics.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 3.7 Distance Metrics

So far, this chapter has been about ways to measure relationships between variables, or the _columns_ of a `DataFrame`. This lesson is about how to measure relationships between observations, or the _rows_ of a `DataFrame`.

How do we quantify how "similar" two observations are? We will use the Ames housing data set, but to keep things simple, we will work with just three quantitative variables from that data set: the number of bedrooms, the number of bathrooms, and the living area (in square feet).

In [0]:
import numpy as np
import pandas as pd

data_dir = "https://dlsun.github.io/pods/data/"
df_housing = pd.read_csv(data_dir + "AmesHousing.txt", sep="\t")

# extract 3 quantitative variables
df_housing_quant = df_housing[["Bedroom AbvGr", "Gr Liv Area"]].copy()
df_housing_quant["Bathrooms"] = (
    df_housing["Full Bath"] + 
    0.5 * df_housing["Half Bath"]
)
df_housing_quant

Shown below is a (three-dimensional) scatterplot of these variables. Consider the two observations connected by a red line. (The label next to each point is its index in the `DataFrame`.) To measure how similar they are, we can calculate the distance between the two points.

![](https://github.com/dlsun/pods/blob/master/03-Quantitative-Data/distance.png?raw=1)

Calculating the distance between two points is not as straightforward as it might seem because there is more than one way to define distance. The most familiar distance metric is probably _Euclidean distance_, which is the straight-line distance ("as the crow flies") between the two points. The formula for calculating this distance is a generalization of the Pythagorean theorem:

$$ d({\bf x}, {\bf x'}) = \sqrt{\sum_{j=1}^D (x_j - x'_j)^2} $$

In [0]:
x = df_housing_quant.loc[2927]
x1 = df_housing_quant.loc[2928]

x - x1

In [0]:
(x - x1) ** 2

In [0]:
np.sqrt(((x - x1) ** 2).sum())

The beauty of this definition is that it generalizes to more than three dimensions. Even though it is difficult to visualize points in 100-dimensional space, we can calculate distances between them in exactly the same way.
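
As a minimal sketch of that claim, the same three operations (subtract, square and sum, take the square root) work unchanged on 100-dimensional vectors. The vectors `u` and `v` below are randomly generated for illustration and have nothing to do with the Ames data.

```python
import numpy as np

# Two made-up points in 100-dimensional space (illustrative only)
rng = np.random.default_rng(0)
u = rng.normal(size=100)
v = rng.normal(size=100)

# Same recipe as before: differences, squared, summed, square-rooted
dist = np.sqrt(((u - v) ** 2).sum())

# np.linalg.norm computes the same Euclidean length of the difference vector
dist_check = np.linalg.norm(u - v)
print(dist, dist_check)
```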

However, Euclidean distance is not the only way to measure how far apart two points are. There is also [**Manhattan distance**](https://en.wikipedia.org/wiki/Taxicab_geometry) (also called _taxicab distance_), which measures the distance a taxicab in Manhattan would have to drive to travel from A to B. In Manhattan, taxicabs cannot travel in a straight line (i.e., the green path below) because they have to follow the street grid. But there are multiple paths along the street grid that all have exactly the same length (i.e., the red, yellow, and blue paths below); the Manhattan distance is the length of any one of these shortest paths.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Manhattan_distance.svg/283px-Manhattan_distance.svg.png)

The formula for Manhattan distance is actually quite similar to the formula for Euclidean distance. Instead of squaring the differences and taking the square root at the end (as in Euclidean distance), we simply take absolute values:
$$ d({\bf x}, {\bf x'}) = \sum_{j=1}^D |x_j - x'_j|. $$

The following code calculates Manhattan distance:

In [0]:
((x - x1).abs()).sum()

In general, we can raise the absolute differences to any power $p \geq 1$, sum them, and take the $p$th root.
$$ d({\bf x}, {\bf x'}) = \left(\sum_{j=1}^D |x_j - x'_j|^p\right)^{1/p}. $$
This is called _Minkowski distance_. Manhattan distance and Euclidean distance are special cases of Minkowski distance for $p=1$ and $p=2$, respectively.
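
The formula can be sketched as a short function. The name `minkowski_distance` and the two points `a` and `b` are made up for this illustration; the points are chosen so that the $p=1$ and $p=2$ cases are easy to check by hand.

```python
import numpy as np

def minkowski_distance(x, y, p):
    """Minkowski distance between two equal-length vectors, for p >= 1."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return (np.abs(x - y) ** p).sum() ** (1 / p)

# Two made-up points (not rows of the Ames data)
a = [0.0, 0.0]
b = [3.0, 4.0]

manhattan = minkowski_distance(a, b, p=1)  # |3| + |4|
euclidean = minkowski_distance(a, b, p=2)  # sqrt(3**2 + 4**2)
print(manhattan, euclidean)
```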

### Comparison of Euclidean and Manhattan distance

In the example above, the Euclidean distance was essentially equal to the largest single difference between the two observations. This is because Euclidean distance first _squares_ the differences. The squaring operation has a "rich get richer" effect; larger values get magnified by more than smaller values. As a result, the largest differences tend to dominate the Euclidean distance.

On the other hand, Manhattan distance treats all differences equally. So Manhattan distance is preferred if we are concerned that an outlier in one variable might dominate the distance metric.
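
A small numerical sketch may make the comparison concrete. The differences in `diff` are made up (they are not taken from the Ames data); the code computes how much of each distance's underlying sum comes from the single largest difference.

```python
import numpy as np

# Made-up coordinate differences: two smaller ones and one larger one
diff = np.array([3.0, 4.0, 12.0])

euclidean = np.sqrt((diff ** 2).sum())  # sqrt(9 + 16 + 144)
manhattan = np.abs(diff).sum()          # 3 + 4 + 12

# Share of each sum contributed by the largest difference
share_euclidean = np.abs(diff).max() ** 2 / (diff ** 2).sum()  # 144 / 169
share_manhattan = np.abs(diff).max() / manhattan               # 12 / 19
print(euclidean, manhattan)
print(share_euclidean, share_manhattan)
```

The largest difference accounts for a bigger share of the sum of squares than of the sum of absolute values, which is the "rich get richer" effect in action.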

## The Importance of Scaling

Here's something to ponder. There are two pairs of observations in the figure below, one connected by a red line, the other connected by an orange line. Which pair of observations is more similar (assuming we use Euclidean distance)?

![](https://github.com/dlsun/pods/blob/master/03-Quantitative-Data/closer.png?raw=1)

Let's actually calculate these two distances.

In [0]:
# Distance between two points connected by red line
x = df_housing_quant.loc[2927]
x1 = df_housing_quant.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

In [0]:
# Distance between two points connected by orange line
x = df_housing_quant.loc[2498]
x1 = df_housing_quant.loc[290]

np.sqrt(((x - x1) ** 2).sum())

Surprised by the answer? The scatterplot is deceiving because it automatically scales the variables to make the points fit on the same plot. In reality, the variables are on very different scales. The number of bedrooms and bathrooms range from 0 to 6, while living area is in the thousands. When variables are on such different scales, the variable with the largest variability will dominate the distance metric.

The plot below shows the same data, but drawn to scale. We can see that differences in the number of bedrooms and the number of bathrooms hardly matter at all; only the variability in the living area matters.

![](https://github.com/dlsun/pods/blob/master/03-Quantitative-Data/closer_rescaled.png?raw=1)

To obtain distances that agree more with our intuition---and that do not give too much weight to any one variable---we transform the variables to be on the same scale. There are several ways to _scale_ a variable ${\bf x} = (x_1, \ldots, x_n)$:

- _standardizing_: subtract the mean from each value, then divide by the standard deviation,
$$ x_i \leftarrow \frac{x_i - \bar {\bf x}}{\text{SD}({\bf x})} $$
- _normalizing_: scale each value so that the variable has length (or "norm") 1, 
$$ x_i \leftarrow \frac{x_i}{\sqrt{\sum_{i=1}^n x_i^2}} $$
- _min-max scaling_: scale each value so that all values are between 0 and 1,
$$x_i \leftarrow \frac{x_i - \min({\bf x})}{\max({\bf x}) - \min({\bf x})}.$$
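
The three formulas above can be sketched directly in `pandas`. The small `DataFrame` below is made up for illustration (its column names and values are assumptions, not the Ames data), and each scaling is applied column by column.

```python
import numpy as np
import pandas as pd

# A small made-up data set (not the Ames data)
df = pd.DataFrame({
    "bedrooms": [2.0, 3.0, 4.0],
    "area": [1000.0, 1500.0, 2600.0],
})

# Standardizing: note that .std() uses the sample SD (ddof=1) by default
standardized = (df - df.mean()) / df.std()

# Normalizing: divide each column by its Euclidean length
normalized = df / np.sqrt((df ** 2).sum())

# Min-max scaling: map each column onto the interval [0, 1]
min_max = (df - df.min()) / (df.max() - df.min())

print(standardized)
print(normalized)
print(min_max)
```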

The figure below illustrates what each of these scaling methods does to a synthetic data set with two variables. All three methods scale the variables in similar (but slightly different) ways, resulting in figure-eights with different aspect ratios. Standardizing also moves the data to be centered around the origin, while min-max scaling moves the data to be in a box whose corners are $(0, 0)$ and $(1, 1)$.

![](https://github.com/dlsun/pods/blob/master/03-Quantitative-Data/scaling.png?raw=1)

Let's standardize the Ames housing data, and see how it affects the distance metric.

In [0]:
df_housing_st = (
    (df_housing_quant - df_housing_quant.mean()) / 
    df_housing_quant.std()
)
df_housing_st

Notice that the resulting `DataFrame` contains negative values. This makes sense because standardizing makes the mean of every variable equal to 0. If the mean is 0, then some values must be negative.

The above command is deceptively simple. We actually subtracted a `Series` from a `DataFrame`, then divided the resulting `DataFrame` by another `Series`. We relied on `pandas` to broadcast each `Series` over the right dimension of the `DataFrame`. To be more explicit about the broadcasting, we could have also used the `.sub()` and `.divide()` methods (instead of `-` and `/`) and been explicit about the axis:

In [0]:
df_housing_st = (df_housing_quant.
                  sub(df_housing_quant.mean(), axis=1).
                  divide(df_housing_quant.std(), axis=1))
df_housing_st
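
To see what the `axis` argument controls, here is a minimal sketch on a made-up `DataFrame` (the names `df`, `col_means`, and `row_means` are illustrative assumptions). With `axis=1`, the index of the `Series` is matched against the column names; with `axis=0`, it is matched against the row labels.

```python
import pandas as pd

# A small made-up DataFrame
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

col_means = df.mean()        # indexed by column name: a, b
row_means = df.mean(axis=1)  # indexed by row label: 0, 1, 2

# axis=1 aligns the Series with the columns (what `df - col_means` does)
centered_cols = df.sub(col_means, axis=1)

# axis=0 aligns the Series with the rows, subtracting each row's own mean
centered_rows = df.sub(row_means, axis=0)

print(centered_cols)
print(centered_rows)
```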

Now let's recalculate the distances using this standardized data and see if our conclusions change.

In [0]:
# Distance between two points connected by red line
x = df_housing_st.loc[2927]
x1 = df_housing_st.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

In [0]:
# Distance between two points connected by orange line
x = df_housing_st.loc[2498]
x1 = df_housing_st.loc[290]

np.sqrt(((x - x1) ** 2).sum())

So, if we first standardize the data, then the pair of observations connected by the red line is more similar than the pair connected by the orange line, which matches our intuition. It is (almost) always a good idea to scale the variables before calculating distances.

## The Scikit-Learn API

Scikit-Learn is a machine learning library in Python that we will use extensively in Part II of this book. Since scaling data and calculating distances are essential tasks in machine learning, scikit-learn has built-in functions for carrying out these common tasks.

To scale a variable in scikit-learn, there are three steps:

1. First, we declare the scaler that we want to use.
2. Next, we "fit" the scaler to data. For example, in the case of standardization, this simply calculates and stores the mean and standard deviation to use for standardization.
3. Finally, we transform the data. This actually applies the scaling to the data.

To standardize data, we use the `StandardScaler`, and there is also a `MinMaxScaler` for min-max scaling. (Unfortunately, the `Normalizer` scaler normalizes the *rows* to be length 1, rather than the columns, so we use the `normalize` function with parameter `axis=0` instead.) See [here](https://scikit-learn.org/stable/modules/classes.html#module-sklearn.preprocessing) for a complete list of scalers and other preprocessing functions.
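
As a sketch of the two alternatives just mentioned, the code below applies `MinMaxScaler` and the `normalize` function with `axis=0` to a small made-up array `X` (the values are illustrative, not the Ames data).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, normalize

# A small made-up data set: rows are observations, columns are variables
X = np.array([[2.0, 1000.0],
              [3.0, 1500.0],
              [4.0, 2600.0]])

# Min-max scaling follows the same declare / fit / transform pattern
scaler = MinMaxScaler()
scaler.fit(X)
X_min_max = scaler.transform(X)

# normalize is a plain function; axis=0 scales each column to length 1
X_normalized = normalize(X, axis=0)

print(X_min_max)
print(X_normalized)
```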

In [0]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(df_housing_quant)
df_housing_st = scaler.transform(df_housing_quant)

df_housing_st

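Notice that scikit-learn returns the standardized data as a plain `numpy` array, rather than a `pandas` `DataFrame`. If we want a `DataFrame` instead, we can call the scaler's `set_output(transform="pandas")` method before transforming (available in scikit-learn 1.2 and later). We keep the array here and index it by position in the code that follows.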

You might wonder why scikit-learn divides scaling into three separate steps. For example, why is it necessary to separate the fitting step from the transformation step? The reason is that a scaler can be fit to one data set and then used to transform many different data sets, not just the original data set to which it was fit. Since the scaler is fit only once, this guarantees that all subsequent data sets will be scaled in exactly the same way (i.e., with respect to the same mean and standard deviation if using the `StandardScaler`).
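
The following sketch illustrates this separation on made-up data: the scaler is fit to `X_original` once and then used to transform a different array, `X_new`. Both arrays are assumptions introduced for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up data: the scaler is fit to X_original only
X_original = np.array([[1.0], [2.0], [3.0]])
X_new = np.array([[2.0], [5.0]])

scaler = StandardScaler()
scaler.fit(X_original)  # stores the mean and SD of X_original

# X_new is scaled with the stored mean and SD, not with its own
X_new_st = scaler.transform(X_new)

print(scaler.mean_, scaler.scale_)
print(X_new_st)
```

Because 2.0 is the mean of `X_original`, the first new value is mapped to 0 even though it is not the mean of `X_new`.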

Scikit-Learn also has built-in functions for calculating distances. For example, to calculate all pairwise distances between observations (2927, 2498) and (2928, 290), we can use the `euclidean_distances` function. There are also other distance metrics available, such as `manhattan_distances`.

In [0]:
from sklearn.metrics.pairwise import euclidean_distances

x = df_housing_st[[2927, 2498], :]
x1 = df_housing_st[[2928, 290], :]

euclidean_distances(x, x1)

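The upper left entry of this matrix represents the distance between observations 2927 and 2928, while the lower right entry represents the distance between observations 2498 and 290. Check that they match the distances we calculated earlier using `pandas`. They will agree only approximately: `StandardScaler` divides by the population standard deviation (`ddof=0`), while the `.std()` method in `pandas` uses the sample standard deviation (`ddof=1`) by default. For a data set of this size, the difference is small.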
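
When comparing against the `pandas` results, it helps to know that the two approaches use slightly different standard deviations. The sketch below, on a made-up column `x`, compares the default `pandas` standard deviation with the one stored by `StandardScaler`.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# A small made-up column
df = pd.DataFrame({"x": [1.0, 2.0, 3.0, 4.0]})

pandas_sd = df["x"].std()                 # sample SD, divides by n - 1
pandas_sd_ddof0 = df["x"].std(ddof=0)     # population SD, divides by n
sklearn_sd = StandardScaler().fit(df).scale_[0]  # SD stored by the scaler

print(pandas_sd, pandas_sd_ddof0, sklearn_sd)
```

Passing `ddof=0` to `.std()` reproduces the value that `StandardScaler` uses; with only four rows the gap is visible, but it shrinks as the number of rows grows.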

## Exercises

1\. Instead of standardizing the three variables from the Ames housing data set, normalize them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

2\. Instead of standardizing the three variables from the Ames housing data set, apply a min-max scaling to them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

3\. Suppose that you really like house 0 in the data set, but it is too expensive. Find cheaper homes that are similar to it, by calculating distances after encoding categorical variables as dummy variables. Be sure to actually look at the profiles of the homes that your algorithm picked out as most similar. Do they make sense?

Try different distance metrics and different scaling methods. How sensitive are your results to these choices?

_Think:_ If the goal is to find a "good deal" on a similar house, should sale price be included as a variable in your distance metric? 

_Hint:_ There are too many variables in the data set. Do not attempt to call `pd.get_dummies()` on the entire `DataFrame`! You will want to pare down the number of variables, but be sure to include a mixture of categorical and quantitative variables. Refer to the [data documentation](https://ww2.amstat.org/publications/jse/v19n3/decock/DataDocumentation.txt) for information about the variables.