# Lab 8: Cluster Analysis

In this lab, you will work with a 24-hour measurement of spatial accessibility, that is, a measurement that incorporates time-variant data. We refer to the spatial accessibility measurement investigated in the previous labs as the **static measurement** and to the 24-hour changes of spatial accessibility as the **dynamic measurement**. 

<img src="./data/dyn_acc.jpg" style="width: 900px;"/>

This lab has two steps. <br>
First, we use **Pearson's correlation** coefficient to explore at which hours the static measurement fails to reflect the 24-hour dynamic measurement (i.e., when the correlation coefficient is low). <br>
Second, we temporally cluster the 24 hourly measurements based on their distribution characteristics (i.e., median and median absolute deviation). Here, we use **K-means clustering** to condense the 24 hourly measurements into a few representative groups. <br>

Data source: https://www.tandfonline.com/doi/full/10.1080/13658816.2021.1978450

### Notes:
**Before you submit your lab, make sure everything runs as expected WITHOUT ANY ERROR.** <br>
**Make sure you fill in any place that says `YOUR CODE HERE` or `YOUR ANSWER HERE`:**

In [None]:
FULL_NAME = ""

In [None]:
# Import necessary packages
import geopandas as gpd
import pandas as pd
from scipy.stats import pearsonr
import matplotlib.pyplot as plt
from sklearn.metrics import silhouette_score
from sklearn.cluster import KMeans, AgglomerativeClustering
from scipy.cluster import hierarchy
import numpy as np
import mapclassify
from matplotlib.colors import LinearSegmentedColormap

## 1. (0.5 point) Import Data

In this lab, you will use the following two datasets. The `dyn_acc` variable is the dynamic measurement, which represents spatial accessibility to EV charging stations at every hour, and the `static_acc` variable is the static measurement, which does not consider the temporal changes of spatial accessibility to EV charging stations. 

**1.1.** Load `dynamic_access.shp` from the data folder with GeoPandas and name it `dyn_acc`.<br>
**1.2.** Load `static_access.shp` from the data folder with GeoPandas and name it `static_acc`.<br>

In [None]:
# Your code here
dyn_acc = gpd.read_file('./data/dynamic_access.shp')
static_acc = gpd.read_file('./data/static_access.shp')

Use the following two cells and investigate the structure of the GeoDataFrames.

In [None]:
dyn_acc.head(3)

In [None]:
static_acc.head(3)

## 2. (1 point) Pearson's correlation coefficient

Use Pearson's correlation coefficient (i.e., the `pearsonr()` function) to investigate the correlation between the accessibility at each hour and the static accessibility. If the correlation coefficient is **high and close to 1**, the static accessibility measurement and the dynamic accessibility at that hour are **similar**, so the static measurement can be used to estimate the accessibility at that hour. <br>
**However**, if the correlation coefficient is **low**, the conventional static measurement **fails to estimate** the accessibility at that hour. 

**2.1.** Create a for loop that iterates through the hours from `0` to `23`. <br>
**2.2.** Calculate the correlation coefficient of each hour in `dyn_acc` and the static measurement in `static_acc`. <br>
**2.3.** If the correlation coefficient is statistically significant (judged by the p-value that `pearsonr()` also returns), append the resulting correlation coefficient to the list `corr_list`. If not, append `0` to `corr_list`. <br>
**2.4.** The populated `corr_list` will hold the correlation coefficients ordered by hour, from 0 to 23. 
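
The significance check in step 2.3 can be illustrated on toy data. The sketch below is not the lab solution: the arrays, the helper name `significant_r`, and the 0.05 threshold are assumptions made for illustration only.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical values, not the lab data
static = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
hour_a = np.array([3.1, 4.9, 7.2, 8.8, 11.1, 12.9])  # almost linear in `static`
hour_b = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 2.0])    # only weakly related


def significant_r(x, y, alpha=0.05):
    """Return Pearson's r if its p-value is below `alpha` (assumed threshold), else 0."""
    r, p_value = pearsonr(x, y)
    return r if p_value < alpha else 0


r_a = significant_r(static, hour_a)  # strong, significant correlation is kept
r_b = significant_r(static, hour_b)  # weak, non-significant correlation becomes 0
print(r_a, r_b)
```

`pearsonr()` returns both the coefficient and its p-value, so a single call is enough to decide whether to keep the coefficient or record `0`.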

In [None]:
# Your code here
corr_list = []


    
corr_list

In [None]:
""" Test code for the previous code. This cell should NOT give any errors when it is run."""

assert round(corr_list[0], 3) == 0.934
assert round(corr_list[6], 3) == 0.981
assert round(corr_list[12], 3) == 0.64
assert round(corr_list[18], 3) == 0.369
assert round(corr_list[21], 3) == 0.727

print('Success!')

**2.5.** Create a line graph that has 24 hours as the x-axis and correlation coefficient as the y-axis. <br>
**2.6.** Use a `print()` function and print the hour of the lowest correlation coefficient and its correlation coefficient. 
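
Because `corr_list` is ordered by hour, the position of the smallest value is also the hour you are looking for. A minimal sketch with a short hypothetical list (`corr_demo` is not the lab result):

```python
import numpy as np

# Hypothetical coefficients; position in the list plays the role of the hour
corr_demo = [0.93, 0.98, 0.64, 0.37, 0.73]

lowest_idx = int(np.argmin(corr_demo))  # position of the smallest coefficient
print(f"Lowest correlation at position {lowest_idx}: {corr_demo[lowest_idx]:.3f}")
```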

In [None]:
# Your code here



## 3. (2.5 points) K-Means clustering for temporal clustering

We now know that the 24-hour spatial accessibility measurement changes substantially over time. Therefore, we want to temporally cluster the hourly measurements into a few temporal groups so that we can see their distinctive temporal changes. 
<br>

BEFORE CLUSTERING | AFTER CLUSTERING
:-: | :-:
![alt](./data/scatter_before_cluster.png) | ![alt](./data/scatter_after_cluster.png)

### 3.1. (1 point) Place the accessibility of each hour into a two-dimensional plane

First, we need to place the accessibility of each hour on a two-dimensional plane defined by two representative values: its median and its median absolute deviation. <br>

**3.1.1.** Create a `for-loop` that iterates through the hours from `0` to `23`. <br>
**3.1.2.** Calculate the median of accessibility at each hour with <a href=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.median.html>`pandas.DataFrame.median()`</a> function, and append the result to `median_acc` list. <br>
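**3.1.3.** Calculate the median absolute deviation of accessibility at each hour with the <a href=https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.mad.html>`pandas.DataFrame.mad()`</a> function and append the result to the `mad_acc` list. Note that, despite how it is used here, `mad()` returns the *mean* absolute deviation, and it was deprecated in pandas 1.5 and removed in pandas 2.0; on newer versions, compute the deviation manually and check it against the test cell below. <br>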
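
The abbreviation MAD can refer to two different statistics, and both can be computed by hand without `mad()`. The sketch below uses a hypothetical Series with one outlier to show how much they can differ; which one reproduces the values in the test cell is something to verify against your own output.

```python
import pandas as pd

# Hypothetical values with one outlier, not the lab data
values = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0])

median = values.median()

# Median absolute deviation: median of |x - median(x)|
median_abs_dev = (values - median).abs().median()

# Mean absolute deviation: mean of |x - mean(x)|
# (this is what the removed pandas `mad()` method returned)
mean_abs_dev = (values - values.mean()).abs().mean()

print(median, median_abs_dev, mean_abs_dev)
```

The median-based version is barely affected by the outlier, which is why it is often preferred as a robust measure of spread.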

In [None]:
median_acc = []  # Median accessibility of each hour
mad_acc = []  # Median absolute deviation of the accessibility of each hour

# Your code here




Now, we will incorporate the two lists into a DataFrame `summary_df`. 

**3.1.4.** Create a column `hours` in the `summary_df` DataFrame, and populate the column with the hours `0` to `23`. <br>
**3.1.5.** Create a column `median` in the `summary_df` DataFrame, and populate the column with `median_acc` list. <br>
**3.1.6.** Create a column `mad` in the `summary_df` DataFrame, and populate the column with `mad_acc` list. <br>

The following is the expected result of the `summary_df` DataFrame. 

<img src="./data/summary_df.jpg" style="width: 200px;"/>

In [None]:
summary_df = pd.DataFrame()

# Your code here



In [None]:
""" Test code for the previous code. This cell should NOT give any errors when it is run."""

assert ('hours' in summary_df.columns) & ('median' in summary_df.columns) & ('mad' in summary_df.columns)
assert summary_df.at[7, 'median'].round(4) == 3.3287
assert summary_df.at[7, 'mad'].round(4) == 1.5143
assert summary_df.at[19, 'mad'].round(4) == 2.4524
assert summary_df.at[19, 'median'].round(4) == 3.3014

print('Success!')

In [None]:
""" Check your answer here. This cell should show a figure that looks the same as `BEFORE CLUSTERING` above."""

fig, ax = plt.subplots(figsize=(7, 7))

# Place the accessibility measures of each hour on a two-dimensional plane 
ax.scatter(summary_df['median'], summary_df['mad'], s = 50)

# Create label of axis
ax.set_ylabel(f'Median Absolute Deviation (MAD) \n of Accessibility', rotation='vertical', fontsize=15)
ax.set_xlabel('Median of Accessibility ', fontsize=15)

# Annotate the representative value of each hour accessibility measurement
for t in range(24):
    ax.annotate(text=f'{t}',
                xy=(summary_df.loc[summary_df['hours'] == t, 'median'], 
                    summary_df.loc[summary_df['hours'] == t, 'mad']
                   ),
                textcoords = 'offset points',
                xytext=(0, 14),
                ha='center',
                size=18
               )
plt.show()


### 3.2. (0.5 point) K-Means: Allocate observations into three clusters

You are about to group the hours in `summary_df` into **three clusters** based on the `median` and `mad` of accessibility at each hour. 

**3.2.1.** Initiate K-Means clustering process with <a href=https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html>`KMeans()`</a> function. Use the number `3` as the number of clusters (`n_clusters`) and assign the result as variable `kmeans_result`. <br>
**3.2.2.** Call <a href=https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit>`fit()`</a> function of the KMeans instance `kmeans_result`. Feed the function with `median` and `mad` columns in `summary_df`. <br>
**3.2.3.** Investigate the clustering result in `.labels_` attribute of the KMeans instance `kmeans_result`. Assign the clustering result as `cluster` column of `summary_df`. <br>
**3.2.4.** Assign the color of each cluster as the `color` column in the `summary_df` DataFrame, based on the `colors` list below.<br>
```python

colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6']

```
**Note**: Because K-means starts from randomly chosen initial centroids, the cluster labels may differ from run to run. The set of observations grouped together under a given label should nevertheless remain the same. <br>

The expected result of this task should look like the following.
<img src="./data/k_means_3.jpg" style="width: 300px;"/>
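
The fit, label, and color steps can be tried on toy data first. This is a sketch under stated assumptions, not the lab solution: `demo_df`, its `x` and `y` columns, and the shortened `palette` are hypothetical, and `random_state=0` is set only to make the toy run repeatable.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming three well-separated groups
demo_df = pd.DataFrame({
    'x': [0.0, 0.1, 0.2, 5.0, 5.1, 5.2, 10.0, 10.1, 10.2],
    'y': [0.0, 0.2, 0.1, 5.0, 5.2, 5.1, 0.0, 0.2, 0.1],
})
palette = ['#a6cee3', '#1f78b4', '#b2df8a']

# Initialize the estimator, then fit it on the two feature columns
km = KMeans(n_clusters=3, n_init=10, random_state=0)
km.fit(demo_df[['x', 'y']])

# One integer label per row, in the same order as the input rows
demo_df['cluster'] = km.labels_

# Translate each label into a color by using it as an index into the palette
demo_df['color'] = demo_df['cluster'].map(lambda c: palette[c])
print(demo_df)
```

The label numbers themselves carry no meaning; only which rows share a label matters, which is why the test cells compare labels between hours instead of checking specific values.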



In [None]:
# Your code here
# Initiate KMeans instance


# Feed the observation into the KMeans instance and conduct clustering


# Assign the clustering result back to `summary_df` DataFrame. 


# Assign the color based on the clustering result
colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6']


summary_df.head()

In [None]:
""" Test code for the previous code. This cell should NOT give any errors when it is run."""

assert len(summary_df['cluster'].unique()) == 3
assert summary_df.loc[summary_df['hours'] == 18, 'cluster'].values[0] == summary_df.loc[summary_df['hours'] == 19, 'cluster'].values[0]
assert summary_df.loc[summary_df['hours'] == 17, 'cluster'].values[0] == summary_df.loc[summary_df['hours'] == 20, 'cluster'].values[0]

print('Success!')

In [None]:
""" Check your answer here. This cell should show three clusters."""
fig, ax = plt.subplots(figsize=(7, 7))

# Place the accessibility measures of each hour on a two-dimensional plane 
ax.scatter(summary_df['median'], summary_df['mad'], s = 50, c=summary_df['color'])

# Place the center of each cluster
ax.scatter(kmeans_result.cluster_centers_.transpose()[0], 
           kmeans_result.cluster_centers_.transpose()[1], 
           color=colors[0: len(kmeans_result.cluster_centers_)], 
           s=500,
           marker='*',
           edgecolors='black'
          )

# Create label of axis
ax.set_ylabel(f'Median Absolute Deviation (MAD) \n of Accessibility', rotation='vertical', fontsize=15)
ax.set_xlabel('Median of Accessibility ', fontsize=15)

# Annotate the representative value of each hour accessibility measurement
for t in range(24):
    ax.annotate(text=f'{t}',
                xy=(summary_df.loc[summary_df['hours'] == t, 'median'], 
                    summary_df.loc[summary_df['hours'] == t, 'mad']
                   ),
                textcoords = 'offset points',
                xytext=(0, 14),
                ha='center',
                size=15
               )


### 3.3. (0.5 point) Find the optimal number of clusters with the Silhouette method

The quality of a K-means clustering depends on the number of clusters (i.e., K), so we need to find the optimal number of clusters. The following function, `determine_number_of_cluster`, takes a DataFrame or an array as input and returns a list of Silhouette scores, one for each number of clusters `i` from 2 to 10. 
```python
def determine_number_of_cluster(array):
    km_silhouette = []

    # The number of clusters 
    for i in range(2, 11):
        KM = KMeans(n_clusters=i, max_iter=999)  # Initiate KMeans instance
        KM.fit(array)  # Feed the observation into the KMeans instance and conduct clustering
        cluster_results = KM.labels_  # Clustering results
        silhouette = silhouette_score(array, cluster_results) # Calculate Silhouette Scores
        km_silhouette.append(silhouette)

    return km_silhouette
```


**3.3.1.** Feed `determine_number_of_cluster()` with `median` and `mad` columns in `summary_df`. Save the result as `k_means_silhouette`. <br>
**3.3.2.** Create a line graph that has the number of clusters as the x-axis and silhouette score as the y-axis.<br>
**3.3.3.** Use a `print()` function and print the number of clusters that provides the highest silhouette score, which indicates the optimal number of clusters. 

**Note**: Silhouette scores are only defined for **2** or more clusters, so the first score in the returned list corresponds to two clusters. 
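
Since the score list starts at two clusters, the list position and the cluster count are offset by two. The sketch below shows one way to keep them aligned on synthetic data; the four blob centers, the sample size, and the range of `k` are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Hypothetical data: four tight, well-separated blobs
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
points = np.vstack([c + rng.normal(scale=0.3, size=(15, 2)) for c in centers])

ks = list(range(2, 8))  # candidate numbers of clusters
scores = []
for k in ks:
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(points)
    scores.append(silhouette_score(points, labels))

# Look the best position up in `ks` instead of adding the offset by hand
best_k = ks[int(np.argmax(scores))]
print(best_k)
```

Keeping the candidate values in a separate list such as `ks` also gives you the x-axis values for the line graph directly.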

In [None]:
def determine_number_of_cluster(array):
    km_silhouette = []

    # The number of clusters 
    for i in range(2, 11):
        KM = KMeans(n_clusters=i, max_iter=999)  # Initiate KMeans instance
        KM.fit(array)  # Feed the observation into the KMeans instance and conduct clustering
        cluster_results = KM.labels_  # Clustering results
        silhouette = silhouette_score(array, cluster_results) # Calculate Silhouette Scores
        km_silhouette.append(silhouette)

    return km_silhouette

In [None]:
# Your code here



In [None]:
""" Test code for the previous code. This cell should NOT give any errors when it is run."""

assert round(k_means_silhouette[0], 3) == 0.569
assert round(k_means_silhouette[2], 3) == 0.613
assert round(k_means_silhouette[7], 3) == 0.542

print('Success!')

### 3.4. (0.5 point) Update K-means clustering result with the optimal number of clusters

Now we know that `4` clusters are optimal for our temporal clustering implementation. Update the `cluster` and `color` columns of `summary_df` using four clusters. 

**3.4.1.** (0.2 point) Initiate K-Means clustering process with <a href=https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html>`KMeans()`</a> function. Use the number `4` as the number of clusters (`n_clusters`) and assign the result as variable `kmeans_optimal`. <br>
**3.4.2.** Call <a href=https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html#sklearn.cluster.KMeans.fit>`fit()`</a> function of the KMeans instance `kmeans_optimal`. Feed the function with `median` and `mad` columns in `summary_df`. <br>
**3.4.3.** Investigate the clustering result in `.labels_` attribute of the KMeans instance `kmeans_optimal`. Assign the clustering result as `cluster` column of `summary_df`. <br>
**3.4.4.** Assign the color of each cluster as the `color` column in the `summary_df` DataFrame, based on the `colors` list below.<br>
```python

colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6']

```
The expected result of this task should look like the following (the contents of the DataFrame may differ from your implementation).
<img src="./data/k_means_3.jpg" style="width: 300px;"/>

In [None]:
# Your code here
# Initiate KMeans instance


# Feed the observation into the KMeans instance and conduct clustering


# Assign the clustering result back to `summary_df` DataFrame. 


# Assign the color based on the clustering result
colors = ['#a6cee3','#1f78b4','#b2df8a','#33a02c','#fb9a99','#e31a1c','#fdbf6f','#ff7f00','#cab2d6']


summary_df

In [None]:
""" Test code for the previous code. This cell should NOT give any errors when it is run."""

assert len(summary_df['cluster'].unique()) == 4
assert summary_df.loc[summary_df['hours'] == 18, 'cluster'].values[0] == summary_df.loc[summary_df['hours'] == 19, 'cluster'].values[0]
assert summary_df.loc[summary_df['hours'] == 17, 'cluster'].values[0] != summary_df.loc[summary_df['hours'] == 20, 'cluster'].values[0]

print('Success!')

In [None]:
""" Check your answer here. This cell should show four clusters."""

fig, ax = plt.subplots(figsize=(7, 7))

ax.scatter(summary_df['median'], summary_df['mad'], s = 50, c=summary_df['color'])

# Plot the centers of the four-cluster model fitted in this section
ax.scatter(kmeans_optimal.cluster_centers_.transpose()[0], 
           kmeans_optimal.cluster_centers_.transpose()[1], 
           color=colors[0: len(kmeans_optimal.cluster_centers_)], 
           s=500,
           marker='*',
           edgecolors='black'
          )

ax.set_ylabel(f'Median Absolute Deviation (MAD) \n of Accessibility', rotation='vertical', fontsize=15)
ax.set_xlabel('Median of Accessibility ', fontsize=15)


for t in range(24):
    ax.annotate(text=f'{t}',
                xy=(summary_df.loc[summary_df['hours'] == t, 'median'], 
                    summary_df.loc[summary_df['hours'] == t, 'mad']
                   ),
                textcoords = 'offset points',
                xytext=(0, 14),
                ha='center',
                size=18
               )

plt.show()

## 4. (1 point) Temporally clustered 24-hour spatial accessibility

Now that we know which hour is associated with which temporal cluster, we move on to making maps of temporally clustered accessibility.

**4.1.** Find which hours are associated with each cluster, based on the `cluster` and `hours` columns of the `summary_df` DataFrame. <br>
**4.2.** Calculate the average accessibility over the hours associated with each temporal cluster. For example, suppose cluster 0 has the hours 0, 1, and 2. You will calculate the average of the accessibility measured at hours 0, 1, and 2 for every location. <br> 
**4.3.** Store the averaged measures of spatial accessibility in the `dyn_acc_plot` GeoDataFrame, which is a copy of the `dyn_acc` GeoDataFrame that only has the `geometry` column. Name the averaged measures `kmeans_c0`, `kmeans_c1`, `kmeans_c2`, and `kmeans_c3`. Again, which cluster receives which number can differ between runs. <br>

The final product should look like the following. 

<img src="./data/cluster_gdf.jpg" style="width: 700px;"/>
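
The per-cluster averaging can be sketched on a small table before applying it to the GeoDataFrame. Everything below is hypothetical except the `hour_{t}` column naming and the `kmeans_c{n}` output names, which follow the conventions used in this lab; `acc`, `assignment`, and `result` are illustrative names.

```python
import pandas as pd

# Hypothetical accessibility table: rows are locations, columns are hours
acc = pd.DataFrame({
    'hour_0': [1.0, 2.0],
    'hour_1': [3.0, 4.0],
    'hour_2': [5.0, 6.0],
    'hour_3': [7.0, 8.0],
})

# Hypothetical hour-to-cluster assignment, shaped like `summary_df`
assignment = pd.DataFrame({'hours': [0, 1, 2, 3], 'cluster': [0, 0, 1, 1]})

result = pd.DataFrame(index=acc.index)
for cluster_id, group in assignment.groupby('cluster'):
    # Columns that belong to this cluster, e.g. ['hour_0', 'hour_1']
    cols = [f'hour_{h}' for h in group['hours']]
    # Row-wise mean: one averaged value per location
    result[f'kmeans_c{cluster_id}'] = acc[cols].mean(axis=1)

print(result)
```

Using `axis=1` averages across the selected hour columns for each row, so every location keeps its own value.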

In [None]:
# Your code here
dyn_acc_plot = dyn_acc[['geometry']].copy()

    
dyn_acc_plot

In [None]:
""" 
Check your answer here. This cell shows what the temporally clustered accessibility looks like. 
If there are four maps titled K-Means: C0, C1, C2, and C3, and 
each map has a different distribution of accessibility, you are good to go.  
"""

color_map = ['#f7f7f7', '#d9d9d9', '#bdbdbd', '#969696', '#636363', '#252525']
cm = LinearSegmentedColormap.from_list('cb_', color_map, N=5)

map_class = mapclassify.FisherJenks(dyn_acc[[f'hour_{t}' for t in range(24)]], k=5)

fig, axes = plt.subplots(1, 4, figsize=(20, 8))

for idx, ax in enumerate(axes):
    dyn_acc_plot.plot(f'kmeans_c{idx}', 
                      scheme='user_defined', # To use different (not predefined) bins, we need to call it as 'user_defined'
                      classification_kwds={'bins':map_class.bins}, # then specify the class bins here. 
                      cmap=cm,
                      ax=ax
                     )
    ax.set_title(f'K-Means: C{idx}', size=20)
    
    gpd.GeoSeries(dyn_acc_plot.unary_union, crs='EPSG:4326').boundary.plot(color='black', ax=ax, linewidth=0.5)
    ax.get_xaxis().set_visible(False)
    ax.get_yaxis().set_visible(False)
    
plt.show()

### *You have finished Lab 8: Cluster Analysis*

Please name your Jupyter notebook `GEOG489_Lab8_[YOUR_NET_ID].ipynb`, and upload it to https://learn.illinois.edu.