# Applied exercises

### Exercise 7

In the chapter, we mentioned the use of correlation-based distance and Euclidean distance as dissimilarity measures for hierarchical clustering. It turns out that these two measures are almost equivalent: if each observation has been centered to have mean zero and standard deviation one, and if we let rij denote the correlation between the ith and jth observations, then the quantity 1−rij is proportional to the squared Euclidean distance between the ith and jth observations. On the USArrests data, show that this proportionality holds. Hint: The Euclidean distance can be calculated using the dist() function, and correlations can be calculated using the cor() function.

#### Answer

In [178]:
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist
from sklearn.preprocessing import StandardScaler

In [179]:
# Load USA arrests data
data_usarrests = pd.read_csv('../data/usa_arrest.csv')

# Define X
features = ['Murder', 'Assault', 'UrbanPop', 'Rape']
X = data_usarrests.loc[:, features].values

# Standardize data
scaler = StandardScaler()
X = scaler.fit_transform(X)

Let's make sure we have transformed the data to have mean zero and std 1:

In [180]:
# Take mean and std of each column (feature)
mean = np.mean(X,axis=0)
print("These are the columns means: \n" + str(mean))
std = np.std(X, axis=0)
print("\nThese are the columns std: \n" + str(std))

These are the columns means: 
[-7.10542736e-17  1.38777878e-16 -4.39648318e-16  8.59312621e-16]

These are the columns std: 
[1. 1. 1. 1.]


Now we calculate the squared Euclidean distance and the 1 - correlation values for the pairwise comparisons of each row in X:

In [181]:
euclidean = pdist(X, metric='sqeuclidean')  # squared Euclidean distance ('seuclidean' would be the standardized Euclidean distance)
correlation = pdist(X, metric='correlation')  # SciPy's correlation distance is already 1 - r

Then we take the ...

In [182]:
constant = correlation/euclidean
median_constant = np.median(constant)

proportion = correlation - median_constant*euclidean
median_proportion = np.median(proportion)
print('The median is: ' + str(median_proportion))

The median is: 0.0
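The identity in the exercise assumes that each observation (row), not each feature, has mean zero and standard deviation one. The sketch below checks the proportionality under that assumption on a small synthetic matrix (random data, not USArrests), using NumPy only. The names `X_demo`, `Z` and `sq_dist` are illustrative. With `p` features per observation and the population standard deviation (`ddof=0`), the squared Euclidean distance works out to `2 * p * (1 - r)`.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(6, 4))  # synthetic data: 6 observations, 4 features

# Standardize each ROW (observation) to mean 0 and population std 1
row_mean = X_demo.mean(axis=1, keepdims=True)
row_std = X_demo.std(axis=1, keepdims=True)  # ddof=0
Z = (X_demo - row_mean) / row_std

p = Z.shape[1]

# Correlation between observations (np.corrcoef treats rows as variables)
r = np.corrcoef(Z)

# Squared Euclidean distance between every pair of observations
diff = Z[:, None, :] - Z[None, :, :]
sq_dist = (diff ** 2).sum(axis=2)

# Compare the two quantities on the distinct pairs i < j
iu = np.triu_indices(Z.shape[0], k=1)
ratio_ok = np.allclose(sq_dist[iu], 2 * p * (1 - r[iu]))
print("squared distance == 2p(1 - r) for all pairs:", ratio_ok)
```

Checking the ratio pair by pair, as here, is a stricter test than looking at the median of the residuals, which can be zero even when the ratio varies between pairs.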


### Exercise 8

In Section 10.2.3, a formula for calculating PVE was given in Equation 10.8. We also saw that the PVE can be obtained using the sdev output of the prcomp() function.

On the USArrests data, calculate PVE in two ways:
- (a) Using the sdev output of the prcomp() function, as was done in Section 10.2.3.
- (b) By applying Equation 10.8 directly. That is, use the prcomp() function to compute the principal component loadings. Then, use those loadings in Equation 10.8 to obtain the PVE.

These two approaches should give the same results.

Hint: You will only obtain the same results in (a) and (b) if the same data is used in both cases. For instance, if in (a) you performed prcomp() using centered and scaled variables, then you must center and scale the variables before applying Equation 10.3 in (b).

#### Answer

In [183]:
from sklearn.decomposition import PCA

Let's compute PCA over scaled data:

In [184]:
# Compute PCA
pca = PCA()
pca_data = pca.fit_transform(X)

Let's compute the proportion of variance explained by each component using scikit-learn's explained_variance_ratio_ attribute:

In [185]:
# Print explained variance by each component
variance_ratio = pca.explained_variance_ratio_

columns_pca = ['PCA_1', 'PCA_2', 'PCA_3', 'PCA_4']
variance_df = pd.DataFrame([variance_ratio], index = ['Variance_ratio'], columns=columns_pca)
print(variance_df)

                  PCA_1     PCA_2     PCA_3     PCA_4
Variance_ratio  0.62006  0.247441  0.089141  0.043358


Let's compute the PVE using Equation 10.8. 
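For reference, this is the formula being applied, written for centered variables, with $x_{ij}$ the value of feature $j$ for observation $i$ and $\phi_{jm}$ the loading of feature $j$ on component $m$:

```latex
\mathrm{PVE}_m
  = \frac{\sum_{i=1}^{n} \left( \sum_{j=1}^{p} \phi_{jm}\, x_{ij} \right)^{2}}
         {\sum_{j=1}^{p} \sum_{i=1}^{n} x_{ij}^{2}}
```

The numerator is the sum of squared scores of component $m$. The denominator is the total sum of squares of the data, which is the same for every component.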

In [186]:
# Define PCA loadings
pca_loadings = (pca.components_).T
pca_loadings_df = pd.DataFrame(pca_loadings, index=features, columns=columns_pca)

In [189]:
# Compute denominator (this is going to be the same for every pca)
matrix_sqr = X**2 
sum_denominator = np.sum(matrix_sqr)

# Compute numerator for each pca
for i in range(len(columns_pca)):

    # select the loadings of the i-th component
    loadings = pca_loadings[:, i]

    # sanity check: a loading vector has unit length, so its squared entries sum to 1
    assert np.sum(loadings**2) > 0.99

    # numerator of Equation 10.8: sum of squared component scores
    matrix = X * loadings
    sum_pred_n = np.sum(matrix, axis=1)
    sum_pred_n_sqr = sum_pred_n ** 2
    sum_numerator = np.sum(sum_pred_n_sqr)

    answer = sum_numerator/sum_denominator
    print('PCA_' + str(i) + ' PVE :' + str(answer))


PCA_0 PVE :0.6200603947873732
PCA_1 PVE :0.24744128813496016
PCA_2 PVE :0.0891407951452075
PCA_3 PVE :0.0433575219324588


There is an easier way to do this:
   

In [None]:
#######
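One possible shorter route, offered as a sketch and not necessarily what was intended here, is to compute all component scores with a single matrix product and then take column-wise sums of squares. The example uses synthetic data and NumPy's SVD as a stand-in for the fitted PCA (an assumption, so that the block is self-contained). The names `X_demo`, `loadings` and `scores` are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
X_demo = rng.normal(size=(50, 4))  # synthetic data, not USArrests
X_demo = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)  # center and scale columns

# Loadings: right singular vectors of the centered data (one column per component)
_, s, Vt = np.linalg.svd(X_demo, full_matrices=False)
loadings = Vt.T

# Equation 10.8 for all components at once
scores = X_demo @ loadings                        # n x p matrix of component scores
pve = (scores ** 2).sum(axis=0) / (X_demo ** 2).sum()

# Cross-check against the singular values (the analogue of prcomp's sdev)
pve_from_sv = s ** 2 / (s ** 2).sum()
print(pve)
```

With the notebook's objects, the same two lines would read `scores = X @ pca_loadings` followed by the ratio of sums of squares, assuming `X` is the centered and scaled matrix used to fit the PCA.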

### Exercise 9

Consider the USArrests data. We will now perform hierarchical clustering on the states.

- (a) Using hierarchical clustering with complete linkage and Euclidean distance, cluster the states.
- (b) Cut the dendrogram at a height that results in three distinct clusters. Which states belong to which clusters?
- (c) Hierarchically cluster the states using complete linkage and Euclidean distance, after scaling the variables to have standard deviation one.
- (d) What effect does scaling the variables have on the hierarchical clustering obtained? In your opinion, should the variables be scaled before the inter-observation dissimilarities are computed? Provide a justification for your answer.

#### Answer

Let's do hierarchical clustering with complete linkage and Euclidean distance using SciPy:

In [None]:
from scipy.cluster.hierarchy import dendrogram, linkage, cut_tree

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
# Define non-scaled data
X_raw = data_usarrests.loc[:, features].values

# Perform hierarchical clustering
hc_usarrests = linkage(X_raw, method='complete', metric='euclidean')

# Plot dendrogram
states = data_usarrests.index.values  # Index.get_values() was removed in pandas 1.0

plt.figure(figsize=(25, 10))
plt.title('Hierarchical clustering with complete linkage and euclidean distance - NON SCALED DATA', size=25)
plt.xlabel('state', size=20)
plt.ylabel('distance', size=20)
dendrogram(hc_usarrests,leaf_rotation=90., leaf_font_size=15, labels=states)
plt.show()

In [None]:
# Cut tree with three clusters
clusters_3 = cut_tree(hc_usarrests, n_clusters=3)

# Display which states belong to which clusters
clusters_3_df = pd.DataFrame(clusters_3, index=states, columns=['Cluster'])
print(clusters_3_df)
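Before repeating the analysis, it may help to see why scaling can matter for Euclidean distance. The sketch below uses two synthetic columns on very different scales (hypothetical values, not the USArrests columns) and measures how much of the total squared pairwise distance each column contributes, before and after standardization. The helper `pairwise_sq_dist_share` is an illustrative name, not a library function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: one large-scale column and one small-scale column
big = rng.normal(loc=170, scale=80, size=20)
small = rng.normal(loc=8, scale=4, size=20)
X_demo = np.column_stack([big, small])

def pairwise_sq_dist_share(X):
    """Share of the total squared pairwise Euclidean distance contributed by each column."""
    diff = X[:, None, :] - X[None, :, :]      # all pairwise differences
    per_col = (diff ** 2).sum(axis=(0, 1))    # summed over all pairs, per column
    return per_col / per_col.sum()

share_raw = pairwise_sq_dist_share(X_demo)

# Standardize each column to mean 0 and std 1, then recompute
X_demo_scaled = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
share_scaled = pairwise_sq_dist_share(X_demo_scaled)

print("raw shares:   ", share_raw)
print("scaled shares:", share_scaled)
```

On the raw data the large-scale column accounts for nearly all of the distance, so a clustering based on it would largely ignore the other column. After standardization each column contributes equally.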

Now we repeat these steps using scaled data:

In [None]:
# Perform hierarchical clustering on scaled data
hc_usarrests_scaled = linkage(X, method='complete', metric='euclidean')

# Plot dendrogram
states = data_usarrests.index.values  # Index.get_values() was removed in pandas 1.0

plt.figure(figsize=(25, 10))
plt.title('Hierarchical clustering with complete linkage and euclidean distance - SCALED DATA', size=25)
plt.xlabel('state', size=20)
plt.ylabel('distance', size=20)
dendrogram(hc_usarrests_scaled,leaf_rotation=90., leaf_font_size=15, labels=states)  # plot the scaled linkage, not the raw one
plt.show()

# Cut tree with three clusters
clusters_3_scaled = cut_tree(hc_usarrests_scaled, n_clusters=3)

In [202]:
# Display which states belong to which clusters, comparing scaled and non-scaled data
clusters = np.append(clusters_3, clusters_3_scaled, axis=1)

clusters_3_df = pd.DataFrame(clusters,
                             index = states,
                             columns = ['Non-scaled cluster', 'Scaled cluster']
                             )

print(clusters_3_df)

                Non-scaled cluster  Scaled cluster
Alabama                          0               0
Alaska                           0               0
Arizona                          0               1
Arkansas                         1               2
California                       0               1
Colorado                         1               1
Connecticut                      2               2
Delaware                         0               2
Florida                          0               1
Georgia                          1               0
Hawaii                           2               2
Idaho                            2               2
Illinois                         0               1
Indiana                          2               2
Iowa                             2               2
Kansas                           2               2
Kentucky                         2               2
Louisiana                        0               0
Maine                          