# Chapter 4. Relationships between Observations

The previous chapter discussed ways to measure relationships between variables, or the _columns_ of a `DataFrame`. This chapter is about how to measure relationships between observations, or the _rows_ of a `DataFrame`.

# Chapter 4.1 Distance Metrics

How do we quantify how "similar" two observations are? We will use the Ames housing data set, but to keep things simple, we will work with just three quantitative variables from that data set: the number of bedrooms, the number of bathrooms, and the living area (in square feet).

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
pd.options.display.max_rows = 5

housing_df = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/AmesHousing.txt",
                         sep="\t")

# extract 3 quantitative variables
housing_df_quant = housing_df[["Bedroom AbvGr", "Gr Liv Area"]].copy()
housing_df_quant["Bathrooms"] = (
    housing_df["Full Bath"] + 
    0.5 * housing_df["Half Bath"]
)
housing_df_quant

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 3 | 1656 | 1.0 |
| 1 | 2 | 896 | 1.0 |
| ... | ... | ... | ... |
| 2928 | 2 | 1389 | 1.0 |
| 2929 | 3 | 2000 | 2.5 |


Shown below is a (three-dimensional) scatterplot of these variables. Consider the two observations connected by a red line. (The label next to each point is its index in the `DataFrame`.) To measure how similar they are, we can calculate the distance between the two points.

<img src="distance.png">

Calculating the distance between two points is not as straightforward as it might seem because there is more than one way to define distance. The one most familiar to you is probably **Euclidean distance**, which is the straight-line distance ("as the crow flies") between the two points. The formula for calculating this distance is a generalization of the Pythagorean theorem:

$$ d({\bf x}, {\bf x'}) = \sqrt{\sum_{j=1}^D (x_j - x'_j)^2} $$

In [2]:
x = housing_df_quant.loc[2927]
x1 = housing_df_quant.loc[2928]

x - x1

Bedroom AbvGr      1.0
Gr Liv Area     -419.0
Bathrooms          0.0
dtype: float64

In [3]:
(x - x1) ** 2

Bedroom AbvGr         1.0
Gr Liv Area      175561.0
Bathrooms             0.0
dtype: float64

In [4]:
np.sqrt(((x - x1) ** 2).sum())

419.00119331572313

The beauty of this definition is that it generalizes to more than three dimensions. Even though it is difficult to visualize points in 100-dimensional space, we can calculate distances between them in exactly the same way.
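As a minimal sketch of this point, the same expression works unchanged on points with 100 coordinates. The two vectors below are synthetic (randomly generated for illustration, not taken from the Ames data), and `np.linalg.norm` is used only as a cross-check on the hand-written formula.

```python
import numpy as np

# Two synthetic points in 100-dimensional space (illustrative, not from the Ames data).
rng = np.random.default_rng(0)
u = rng.normal(size=100)
v = rng.normal(size=100)

# Same recipe as in three dimensions: difference, square, sum, square root.
dist_by_hand = np.sqrt(((u - v) ** 2).sum())

# NumPy's built-in vector norm computes the same quantity.
dist_builtin = np.linalg.norm(u - v)

print(dist_by_hand, dist_builtin)
```

The two printed numbers agree, which confirms that nothing in the formula depends on the number of dimensions.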

However, Euclidean distance is not the only way to measure how far apart two points are. There is also [**Manhattan distance**](https://en.wikipedia.org/wiki/Taxicab_geometry) (also called _taxicab distance_), which measures the distance a taxicab in Manhattan would have to drive to travel from A to B. Taxicabs are not able to travel in a straight line (i.e., the green path below) because they have to follow the street grid. But there are multiple paths along the street grid that all have exactly the same length (i.e., the red, yellow, and blue paths below); the Manhattan distance is the length of any one of these shortest paths.

![](https://upload.wikimedia.org/wikipedia/commons/thumb/0/08/Manhattan_distance.svg/283px-Manhattan_distance.svg.png)

The formula for Manhattan distance is actually quite similar to the formula for Euclidean distance. Instead of squaring the differences and taking the square root at the end (as in Euclidean distance), we simply take absolute values:
$$ d({\bf x}, {\bf x'}) = \sum_{j=1}^D |x_j - x'_j|. $$

The following code calculates Manhattan distance:

In [5]:
((x - x1).abs()).sum()

420.0

### Comparison of Euclidean and Manhattan distance

In the example above, the Euclidean distance (about 419.001) was essentially just the largest difference (419, in living area). This is because Euclidean distance first _squares_ the differences. The squaring operation has a "rich get richer" effect; larger values get magnified by more than smaller values. As a result, the largest differences tend to dominate the Euclidean distance.

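On the other hand, Manhattan distance adds up the absolute differences without magnifying the larger ones, so each difference contributes in proportion to its size. For this reason, Manhattan distance is preferred if you are concerned that an outlier in one variable might dominate the distance metric.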
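To make the comparison concrete, here is a small sketch that computes how much of each distance comes from each variable. The difference vector is copied from the output of `x - x1` for observations 2927 and 2928; the names `euclid_share` and `manhattan_share` are just illustrative.

```python
import numpy as np

# Differences between observations 2927 and 2928
# (bedrooms, living area, bathrooms), copied from the output above.
diff = np.array([1.0, -419.0, 0.0])

# Fraction of the squared Euclidean distance contributed by each variable.
euclid_share = diff ** 2 / (diff ** 2).sum()

# Fraction of the Manhattan distance contributed by each variable.
manhattan_share = np.abs(diff) / np.abs(diff).sum()

print(euclid_share)
print(manhattan_share)
```

Living area dominates under both metrics for this pair, but its share is larger under Euclidean distance, because squaring shrinks the relative contribution of the one-bedroom difference.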

## The Importance of Scaling

Here's a quiz. There are two pairs of observations in the figure below, one connected by a red line, the other connected by an orange line. Which pair of observations is more similar (assuming we use Euclidean distance)?

![](closer.png)

Let's actually calculate these two distances.

In [None]:
# Distance between two points connected by red line
x = housing_df_quant.loc[2927]
x1 = housing_df_quant.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

In [None]:
# Distance between two points connected by orange line
x = housing_df_quant.loc[2498]
x1 = housing_df_quant.loc[290]

np.sqrt(((x - x1) ** 2).sum())

Surprised by the answer? The scatterplot is deceiving because it automatically scales the variables to make the points fit on the same plot. In reality, the variables are on very different scales. The numbers of bedrooms and bathrooms are single-digit values, while living area is in the hundreds or thousands of square feet. When variables are on such different scales, the variable with the largest variability will dominate the distance metric.

The plot below shows the same data, but drawn to scale. You can see that differences in the number of bedrooms and the number of bathrooms hardly matter at all; only the variability in the living area matters.

![](closer_rescaled.png)
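One way to see how much the units matter is to change them. The sketch below reuses the differences between observations 2927 and 2928 from earlier in the chapter and recomputes the Euclidean distance with living area expressed in thousands of square feet instead of square feet. The conversion factor is purely illustrative.

```python
import numpy as np

# Differences between observations 2927 and 2928
# (bedrooms, living area in sq ft, bathrooms), from earlier in the chapter.
diff_sqft = np.array([1.0, -419.0, 0.0])

# The same two houses, with living area measured in thousands of square feet.
diff_ksqft = diff_sqft * np.array([1.0, 1 / 1000, 1.0])

dist_sqft = np.sqrt((diff_sqft ** 2).sum())
dist_ksqft = np.sqrt((diff_ksqft ** 2).sum())

print(dist_sqft)   # dominated by living area
print(dist_ksqft)  # now dominated by the one-bedroom difference
```

Nothing about the houses changed, yet the distance and the variable that drives it are completely different. A distance that depends on an arbitrary choice of units is hard to interpret.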

To obtain distances that agree more with our intuition---and that do not give too much weight to any one variable---we transform the variables to be on the same scale. There are a few ways to **scale** a variable:

- **standardizing**: subtract the mean from each variable, then divide by its standard deviation, 
$$ x_i \leftarrow \frac{x_i - \text{mean}[X]}{\text{SD}[X]} $$
- **normalizing**: scale each variable to have length (or "norm") 1, 
$$ x_i \leftarrow \frac{x_i}{\sqrt{\sum_{k=1}^n x_k^2}} $$
- **min-max scaling**: scale each variable so that all values are between 0 and 1, 
$$x_i \leftarrow \frac{x_i - \min[X]}{\max[X] - \min[X]}.$$
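The three formulas above can be sketched as small `pandas` helpers. The function names (`standardize`, `normalize`, `min_max_scale`) and the toy data are assumptions made for illustration; each function simply applies the corresponding formula column by column.

```python
import numpy as np
import pandas as pd

def standardize(df):
    # Subtract each column's mean, then divide by its standard deviation.
    return (df - df.mean()) / df.std()

def normalize(df):
    # Divide each column by its length (Euclidean norm).
    return df / np.sqrt((df ** 2).sum())

def min_max_scale(df):
    # Map each column's minimum to 0 and its maximum to 1.
    return (df - df.min()) / (df.max() - df.min())

# Small made-up data set (hypothetical values) to try the functions on.
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0],
                    "b": [100.0, 300.0, 200.0, 400.0]})

print(standardize(toy).mean())                 # approximately 0 for each column
print((normalize(toy) ** 2).sum())             # 1 for each column
print(min_max_scale(toy).agg(["min", "max"]))  # 0 and 1 for each column
```

Note that min-max scaling divides by zero for a column whose values are all equal, and standardizing does the same for a column with zero standard deviation, so constant columns need separate handling.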

The figure below illustrates what each of these scaling methods does to a synthetic data set with two variables. All three methods scale the variables in similar (but slightly different) ways, resulting in figure-eights with different aspect ratios. Standardizing also moves the data to be centered around the origin, while min-max scaling moves the data to be in a box whose corners are $(0, 0)$ and $(1, 1)$.

![](scaling.png)

Let's standardize the Ames housing data, and see how it affects the distance metric.

In [6]:
housing_df_std = (
    (housing_df_quant - housing_df_quant.mean()) / 
    housing_df_quant.std()
)
housing_df_std

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 0.176064 | 0.309212 | -1.176462 |
| 1 | -1.032058 | -1.194223 | -1.176462 |
| ... | ... | ... | ... |
| 2928 | -1.032058 | -0.218968 | -1.176462 |
| 2929 | 0.176064 | 0.989715 | 1.156819 |


Notice that the resulting `DataFrame` contains negative values. This makes sense because standardizing makes the mean of every variable equal to 0. If the mean is 0, then some values must be negative.
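A quick sanity check after standardizing is to confirm that every column has mean 0 and standard deviation 1. The sketch below does this on a small made-up `DataFrame` (hypothetical values, not the Ames data). It also shows one detail worth knowing: `pandas` computes the sample standard deviation (dividing by $n - 1$) by default, whereas `np.std` divides by $n$ by default, so the two disagree slightly.

```python
import numpy as np
import pandas as pd

# Small made-up data set (hypothetical values, not the Ames data).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 6.0],
                    "b": [100.0, 300.0, 200.0, 400.0]})
toy_std = (toy - toy.mean()) / toy.std()

# Column means are 0 (up to floating-point error), so each column has negative values.
print(toy_std.mean())
print((toy_std < 0).any())

# pandas uses the sample SD (dividing by n - 1), so this is 1 for each column ...
print(toy_std.std())

# ... while NumPy's default divides by n, giving a slightly smaller value.
print(np.std(toy_std.to_numpy(), axis=0))
```

The discrepancy shrinks as the number of rows grows, so it rarely matters for a data set the size of the Ames housing data.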

The above command is deceptively simple. We actually subtracted a `Series` from a `DataFrame`, then divided the resulting `DataFrame` by another `Series`. We relied on `pandas` to broadcast each `Series` over the right dimension of the `DataFrame`. To be more explicit about the broadcasting, we could have also used the `.sub()` and `.divide()` methods (instead of `-` and `/`) and been explicit about the axis:

In [7]:
housing_df_std = (housing_df_quant.
                  sub(housing_df_quant.mean(), axis=1).
                  divide(housing_df_quant.std(), axis=1))
housing_df_std

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 0.176064 | 0.309212 | -1.176462 |
| 1 | -1.032058 | -1.194223 | -1.176462 |
| ... | ... | ... | ... |
| 2928 | -1.032058 | -0.218968 | -1.176462 |
| 2929 | 0.176064 | 0.989715 | 1.156819 |
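To see what `axis=1` does, it helps to try it on a tiny example. In the sketch below (made-up values), the `Series` of column means is indexed by column name, and `axis=1` (equivalently `axis="columns"`) tells `pandas` to match that index against the columns of the `DataFrame`. This is also what the `-` operator does by default.

```python
import pandas as pd

# Tiny made-up data set (hypothetical values).
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 60.0]})

# Series indexed by column name: a -> 2.0, b -> 30.0
col_means = toy.mean()

implicit = toy - col_means                   # operator form
explicit = toy.sub(col_means, axis=1)        # same thing, with the axis spelled out
named = toy.sub(col_means, axis="columns")   # axis given by name instead of number

print(explicit)
```

All three results are identical: each column has its own mean subtracted from every row.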


Now let's recalculate the distances using this standardized data and see if our conclusions change.

In [16]:
# Distance between two points connected by red line
x = housing_df_std.loc[2927]
x1 = housing_df_std.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

1.4651211129695825

In [17]:
# Distance between two points connected by orange line
x = housing_df_std.loc[2498]
x1 = housing_df_std.loc[290]

np.sqrt(((x - x1) ** 2).sum())

3.9440754446060033

So, if we first standardize the data, then the pair of observations connected by the red line is more similar than the pair connected by the orange line, which matches our intuition. It is (almost) always a good idea to scale your variables before calculating distances.
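The same reversal can be reproduced on a data set small enough to check by hand. The three houses below are made up for illustration (they are not rows of the Ames data): house B differs from house A only by 100 square feet, while house C has twice as many bedrooms and bathrooms as A but nearly the same living area.

```python
import numpy as np
import pandas as pd

# Three made-up houses (hypothetical values, not rows of the Ames data).
toy = pd.DataFrame(
    {"Bedroom AbvGr": [3, 3, 6],
     "Gr Liv Area": [1500, 1600, 1510],
     "Bathrooms": [2.0, 2.0, 4.0]},
    index=["A", "B", "C"],
)

def euclidean(p, q):
    # Straight-line distance between two rows.
    return np.sqrt(((p - q) ** 2).sum())

# Distances on the raw variables.
raw_ab = euclidean(toy.loc["A"], toy.loc["B"])
raw_ac = euclidean(toy.loc["A"], toy.loc["C"])

# Distances after standardizing each variable.
toy_std = (toy - toy.mean()) / toy.std()
std_ab = euclidean(toy_std.loc["A"], toy_std.loc["B"])
std_ac = euclidean(toy_std.loc["A"], toy_std.loc["C"])

print("raw:         ", raw_ab, raw_ac)
print("standardized:", std_ab, std_ac)
```

On the raw variables, C appears closer to A than B does, because only living area matters. After standardizing, B is the closer one.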

Now that you've seen how to implement one scaling method (standardization), you will implement two more (normalization and min-max scaling) in Exercises 1 and 2 below.

# Exercises

**Exercise 1.** Instead of standardizing the three variables from the Ames housing data set, normalize them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

In [137]:
housing_df_norm = (housing_df_quant.divide(
    (np.sqrt((housing_df_quant ** 2).sum())), axis=1))
housing_df_norm

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 0.018649 | 0.019331 | 0.009878 |
| 1 | 0.012433 | 0.010460 | 0.009878 |
| ... | ... | ... | ... |
| 2928 | 0.012433 | 0.016215 | 0.009878 |
| 2929 | 0.018649 | 0.023347 | 0.024695 |


In [138]:
x = housing_df_norm.loc[2927]
x1 = housing_df_norm.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

0.0079100215088419978

In [139]:
x = housing_df_norm.loc[2498]
x1 = housing_df_norm.loc[290]

np.sqrt(((x - x1) ** 2).sum())

0.021103948426701397

**Exercise 2.** Instead of standardizing the three variables from the Ames housing data set, apply a min-max scaling to them. Then, recompute the distances between the two pairs of points above. Does your conclusion change?

In [124]:
housing_df_mm = (housing_df_quant.
                  sub(housing_df_quant.min(), axis=1).
                  divide(housing_df_quant.max() - housing_df_quant.min(), axis=1))
housing_df_mm

| | Bedroom AbvGr | Gr Liv Area | Bathrooms |
|---|---|---|---|
| 0 | 0.375 | 0.249058 | 0.2 |
| 1 | 0.250 | 0.105878 | 0.2 |
| ... | ... | ... | ... |
| 2928 | 0.250 | 0.198757 | 0.2 |
| 2929 | 0.375 | 0.313866 | 0.5 |


In [125]:
x = housing_df_mm.loc[2927]
x1 = housing_df_mm.loc[2928]

np.sqrt(((x - x1) ** 2).sum())

0.14783815972387498

In [126]:
x = housing_df_mm.loc[2498]
x1 = housing_df_mm.loc[290]

np.sqrt(((x - x1) ** 2).sum())

0.42500066809602399

Exercises 3-5 ask you to work with a data set that describes the chemical composition of 1599 red wines (`https://raw.githubusercontent.com/dlsun/data-science-book/master/data/wines/reds.csv`). There are 12 variables in this data set, all of which are quantitative (so each observation is a point in 12-dimensional space).

**Exercise 3.** Which red wine is more similar to wine 0 in the `DataFrame`: wine 6 or wine 36? (Do not scale the variables.) Does your answer depend on which distance metric you use to measure "similarity"?

In [131]:
wine_df = pd.read_csv("https://raw.githubusercontent.com/dlsun/data-science-book/master/data/wines/reds.csv", sep=';')

x = wine_df.loc[0]
x1 = wine_df.loc[6]
x2 = wine_df.loc[36]

print(np.sqrt(((x - x1) ** 2).sum()))
print(np.sqrt(((x - x2) ** 2).sum()))


25.3260291195
20.6980530507
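Exercise 3 also asks whether the answer depends on the distance metric. One way to check is sketched below with a hypothetical helper, `closer_of`, that reports which of two rows is nearer to a target row under a given metric. The toy `DataFrame` is made up so that the example runs without downloading anything; with the wine data, one would call it as `closer_of(wine_df, 0, 6, 36, euclidean)` and `closer_of(wine_df, 0, 6, 36, manhattan)`.

```python
import numpy as np
import pandas as pd

def euclidean(p, q):
    # Straight-line distance between two rows.
    return np.sqrt(((p - q) ** 2).sum())

def manhattan(p, q):
    # Sum of absolute differences between two rows.
    return (p - q).abs().sum()

def closer_of(df, target, a, b, metric):
    # Return whichever of rows a and b is nearer to the target row
    # under the given metric (ties go to a).
    d_a = metric(df.loc[target], df.loc[a])
    d_b = metric(df.loc[target], df.loc[b])
    return a if d_a <= d_b else b

# Made-up data: row 1 differs moderately from row 0 in both variables,
# while row 2 differs a lot in just one variable.
toy = pd.DataFrame({"x": [0.0, 3.0, 0.0], "y": [0.0, 3.0, 5.0]})

print(closer_of(toy, 0, 1, 2, euclidean))
print(closer_of(toy, 0, 1, 2, manhattan))
```

In this toy example the two metrics disagree: Euclidean distance penalizes the single large difference in row 2 more heavily, while Manhattan distance penalizes the two moderate differences in row 1.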


**Exercise 4.** Now suppose we agree to measure similarity using Euclidean distance, and we wish to investigate the effect of scaling the variables. Which red wine is more similar to wine 0: wine 6 or wine 36? Does the answer depend on whether the variables are scaled or not? Does it depend on the choice of scaling?

In [132]:
wine_df_std = (wine_df.
                  sub(wine_df.mean(), axis=1).
                  divide(wine_df.std(), axis=1))
x = wine_df_std.loc[0]
x1 = wine_df_std.loc[6]
x2 = wine_df_std.loc[36]

print(np.sqrt(((x - x1) ** 2).sum()))
print(np.sqrt(((x - x2) ** 2).sum()))

2.00721866628
2.37710757754


**Exercise 5.** Which wine is most similar to wine 267? Try different distance metrics and different scaling methods. How sensitive is your conclusion to the choice of distance metric and scaling method?

_Hint:_ You can do this without a `for` loop. Take advantage of broadcasting!

In [133]:
wine_df_norm = (wine_df / (np.sqrt((wine_df ** 2).sum())))

In [134]:
wine_df_mm = (wine_df.
                  sub(wine_df.min(), axis=1).
                  divide(wine_df.max() - wine_df.min(), axis=1))

In [147]:
def eu_dist(df, x):
    # Euclidean distance from row x to every row (df.loc[x] broadcasts across rows)
    return np.sqrt(((df - df.loc[x]) ** 2).sum(axis=1))

def manh_dist(df, x):
    # Manhattan distance from row x to every row
    return (df - df.loc[x]).abs().sum(axis=1)

for dist_n, dist_f in {"Euclidean": eu_dist, "Manhattan": manh_dist}.items():
    for name, df in {"STD": wine_df_std, "NORM": wine_df_norm, "MM": wine_df_mm}.items():
        # drop wine 267 itself (distance 0) before taking the closest wine
        print("(", dist_n, name, "):", dist_f(df, 267).drop(index=267).idxmin())

( Euclidean STD ): 606
( Euclidean NORM ): 944
( Euclidean MM ): 606
( Manhattan STD ): 606
( Manhattan NORM ): 896
( Manhattan MM ): 606
