All instructions are provided for R. I am going to reproduce them in Python as best as I can.

# Preface

From the textbook, p. 416:
> In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let $r_{ij}$ denote the correlation between the i-th and j-th observations, then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the i-th and j-th observations. On the `USArrests` data, show that this proportionality holds. <br><br>*Hint: The Euclidean distance can be calculated using the `dist()` function, and correlations can be calculated using the `cor()` function.*

In [28]:
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
import seaborn as sns
from sklearn.preprocessing import StandardScaler


sns.set()
%matplotlib inline

In [29]:
usarrests = pd.read_csv(
                        'https://raw.githubusercontent.com'
                        '/dsnair/ISLR/master/data/csv/USArrests.csv'
                       ).set_index('State')
usarrests.head(3)

| State | Murder | Assault | UrbanPop | Rape |
|---|---|---|---|---|
| Alabama | 13.2 | 236 | 58 | 21.2 |
| Alaska | 10.0 | 263 | 48 | 44.5 |
| Arizona | 8.1 | 294 | 80 | 31.0 |


It took me a while to get it, but when you use SciPy's `pdist()` with `metric='correlation'`, the distances are **already** $1 - r_{ij}$. [Documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.spatial.distance.pdist.html).
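To make that concrete, here is a minimal sketch on made-up data (the `toy` array and the random seed are assumptions for illustration, not part of the exercise). It compares the output of `pdist(..., metric='correlation')` with $1 - r_{ij}$ computed from `np.corrcoef`, which treats each row as one variable.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Hypothetical toy data: 5 observations (rows), 4 features (columns).
rng = np.random.default_rng(0)
toy = rng.normal(size=(5, 4))

# pdist returns a condensed vector of pairwise distances;
# squareform expands it into a symmetric 5x5 matrix with zeros on the diagonal.
corr_dist = squareform(pdist(toy, metric='correlation'))

# np.corrcoef correlates rows with each other, so r[i, j] is the
# correlation between observation i and observation j.
r = np.corrcoef(toy)

# The correlation "distance" should coincide with 1 - r everywhere.
matches = np.allclose(corr_dist, 1 - r)
print(matches)
```

If the two matrices agree, there is no need to subtract the distances from one again before comparing them with the squared Euclidean distances.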

In [30]:
# StandardScaler standardizes columns, so transpose to standardize
# each observation (state), then transpose back.
scaled = StandardScaler().fit_transform(usarrests.T).T
euclidean_squares = pdist(scaled, metric='euclidean')**2
correlation_dist = pdist(scaled, metric='correlation')

coefs = euclidean_squares / correlation_dist
print(coefs)
(abs(coefs - coefs[0]) < 1e-10).all()  # all equal up to numeric precision

[8. 8. 8. ... 8. 8. 8.]


True

The explanation of where those 8's come from is borrowed from [this thread](https://stackoverflow.com/questions/40823581/squared-euclidean-distance-and-correlation-between-two-normalized-variables-a-p) on StackOverflow.

1. All **observations** are standardized. This means that $ \bar{x}_i = 0 $ and $\text{var} (x_i) = 1$, where $x_i$ is the i-th observation and $p$ is the number of predictors. For a centered vector, the (population) variance is $\text{var} (x_i) = x_i^T x_i / p$, so $ x_i^T x_i = p $.
1. Covariance:
$$ \text{cov}(x_i, x_j) = \frac{(x_i - \bar{x}_i)^T (x_j - \bar{x}_j)}{p} = \frac{x_i^T x_j}{p}$$
1. Correlation:
$$ r_{ij} = \frac{\text{cov}(x_i, x_j)}{\sqrt{\text{var} (x_i) \cdot \text{var} (x_j)}} = \text{cov}(x_i, x_j) = \frac{x_i^T x_j}{p} $$
1. Square of the Euclidean distance is:
$$
\begin{align}
  ||x_i - x_j||^2 &= (x_i - x_j)^T (x_i - x_j) \\
                  &= x_i^T x_i + x_j^T x_j - 2 x_i^T x_j \\
                  &= p + p - 2p \cdot r_{ij} \\
                  &= 2p \cdot (1 - r_{ij})
\end{align}
$$

This means that the squared Euclidean distance is proportional to $1 - r_{ij}$. The proportionality coefficient is $2p$, twice the number of predictors. There are four predictors in the `USArrests` dataset, so $2p = 8$, and that's what we've seen above: an 8 for each of the 1225 pairs of states.
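The derivation does not depend on $p = 4$. As a rough check, the sketch below repeats it on random data with a different number of predictors (the sizes `n = 6`, `p = 10` and the seed are arbitrary assumptions). Rows are standardized with the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical data: n observations with p predictors each.
rng = np.random.default_rng(42)
n, p = 6, 10
X = rng.normal(size=(n, p))

# Standardize each row: zero mean and unit population std (ddof=0 is NumPy's default).
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# For standardized rows, the correlation is r_ij = x_i^T x_j / p.
r = Z @ Z.T / p

# Squared Euclidean distance between every pair of rows, via broadcasting.
diff = Z[:, None, :] - Z[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# The identity ||x_i - x_j||^2 = 2p (1 - r_ij) should hold for all pairs.
holds = np.allclose(sq_dist, 2 * p * (1 - r))
print(holds)
```

With `p = 10`, the ratio of squared distance to $1 - r_{ij}$ should come out as 20 rather than 8, in line with the $2p$ coefficient.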
