##### 7. In this chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and scaled to have standard deviation one, and if we let $r_{ij}$ denote the correlation between the $i$th and $j$th observations, then the quantity $1 - r_{ij}$ is proportional to the squared Euclidean distance between the $i$th and $j$th observations. On the `USArrests` data, show that this proportionality holds. Hint: The Euclidean distance can be calculated using the `pairwise_distances()` function from the `sklearn.metrics` module, and correlations can be calculated using `np.corrcoef()`.
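
A short sketch of why the proportionality should hold. The notation is an assumption introduced here, not taken from the exercise: $z_i \in \mathbb{R}^p$ is the $i$th observation after standardizing across its $p$ features with the population standard deviation, so that $\sum_k z_{ik} = 0$ and $\sum_k z_{ik}^2 = p$. Under that scaling, $r_{ij} = \frac{1}{p} \sum_k z_{ik} z_{jk}$, and

$$
\lVert z_i - z_j \rVert^2
= \sum_{k=1}^{p} z_{ik}^2 + \sum_{k=1}^{p} z_{jk}^2 - 2 \sum_{k=1}^{p} z_{ik} z_{jk}
= 2p - 2p\, r_{ij}
= 2p\,(1 - r_{ij}).
$$

For `USArrests`, $p = 4$, so the constant of proportionality would be $8$.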

In [86]:
import numpy as np
from statsmodels.datasets import get_rdataset
from sklearn.metrics import pairwise_distances
from sklearn.preprocessing import StandardScaler

In [87]:
df = get_rdataset('USArrests').data
df.head()

rownames,Murder,Assault,UrbanPop,Rape
Alabama,13.2,236,58,21.2
Alaska,10.0,263,48,44.5
Arizona,8.1,294,80,31.0
Arkansas,8.8,190,50,19.5
California,9.0,276,91,40.6


In [88]:
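# Standardize each observation (row), not each column: StandardScaler works
# column-wise, so transpose before scaling and transpose back afterwards.
df_stand = StandardScaler().fit_transform(df.T).T

In [89]:
# np.corrcoef treats each row as a variable, giving state-by-state correlations.
r = np.corrcoef(df_stand)

In [90]:
ed = pairwise_distances(df_stand, metric='euclidean')

In [91]:
r.shape, ed.shape

((50, 50), (50, 50))

In [92]:
ed2 = ed**2
r2 = 1-r

In [94]:
# Each row now has mean 0 and (population) variance 1 over p = 4 features,
# so the squared distance should equal 2 * p * (1 - r) up to rounding error.
# Compare with a tolerance; exact == on floats is unreliable.
p = df_stand.shape[1]
print(f'Proportionality holds: {np.allclose(ed2, 2 * p * r2)}')

Proportionality holds: True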
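
As a sanity check that does not depend on `USArrests`, the same relationship can be sketched on synthetic data using NumPy alone. The random matrix, the seed, and the names `X`, `Z`, `sq_dist`, `corr`, and `holds` are all illustrative assumptions, not part of the exercise.

```python
import numpy as np

# Hypothetical synthetic data standing in for any (n_obs, n_features) matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(6, 5))
p = X.shape[1]

# Standardize each observation (row): mean 0, population std 1 (ddof=0).
Z = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# Squared Euclidean distance between every pair of rows, via broadcasting.
diff = Z[:, None, :] - Z[None, :, :]
sq_dist = (diff ** 2).sum(axis=-1)

# Correlation between every pair of rows (np.corrcoef treats rows as variables).
corr = np.corrcoef(Z)

# Compare with a tolerance rather than exact equality.
holds = np.allclose(sq_dist, 2 * p * (1 - corr))
print(holds)
```

The constant `2 * p` relies on dividing by the population standard deviation; if rows were scaled with the sample standard deviation (`ddof=1`) instead, the relationship would still be proportional, but the constant would be `2 * (p - 1)`.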
