# **Lab2A: K-means clustering**
## The Country Risk case from Hull Section 2.5

**WHAT** This optional lab is a slightly expanded version of Hull's `CountryRisk_2019_kmeans_results.ipynb`, with a few exercises and questions added to build insight.

**WHY** It shows an application of the K-means clustering algorithm (Hull Sections 2.2, 2.3, and 2.5) and we expect you to know/learn how this is done, and why.

**HOW** Follow the steps and exercises in this notebook either on your own or with a fellow student.  
**Optional:** if you'd like us to check your work, submit the completed notebook through Brightspace assignments, and you will receive feedback.


$\newcommand{\q}[1]{\rightarrow \textbf{Question #1}}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1}}$

In [None]:
# loading packages

import os

import pandas as pd
import numpy as np

# plotting packages
%matplotlib inline
from mpl_toolkits.mplot3d import Axes3D
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import matplotlib.colors as clrs

# Kmeans algorithm from scikit-learn
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score


## 1. Loading the data and looking at it

In [None]:
# load raw data

raw = pd.read_csv('./Country_Risk_2019_Data.csv')

# check the raw data
print("Size of the dataset (row, col): ", raw.shape)
print("\nFirst 20 rows:\n")
raw.head(20)

### Print summary statistics
Note that the features have quite different variances, and that Corruption and Legal are highly correlated.

In [None]:
# print summary statistics
print("Summary statistics")
raw.describe()

In [None]:
# print correlation matrix
print("Correlation matrix:")
raw.corr(numeric_only=True)

### Plot histograms

Note that the distribution of GDP Growth is quite skewed.

In [None]:
# plot histograms
plt.figure(1)
raw['Corruption'].plot(kind='hist', title='Corruption')
# raw.hist(column='Corruption')
plt.figure(2)
raw['Peace'].plot(kind = 'hist', title = 'Peace')

plt.figure(3)
raw['Legal'].plot(kind = 'hist', title = 'Legal')

plt.figure(4)
raw['GDP Growth'].plot(kind = 'hist', title = 'GDP Growth')

plt.show()

### Scatter plot of scaled legal risk and corruption indices

$\ex{1.1}$ Please reproduce Figure 2.5 (or Figure 2.4 depending on your version of the book). First scale the required data before plotting.

In [None]:
# START ANSWER
# END ANSWER

## 2. K-means cluster
### Pick features & normalization

Since Corruption and Legal are highly correlated, we drop the Corruption variable, i.e., we pick three features for this analysis, Peace, Legal and GDP Growth. Let's standardize all the features, effectively making them equally weighted.

For some extra background, check out [6.3.1 Standardization](https://scikit-learn.org/stable/modules/preprocessing.html) in the Scikit-learn User Guide.
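As a small illustration of what standardization does, the sketch below applies the same z-score formula to a made-up table (the column names `a` and `b` and all values are hypothetical, not the Country Risk data). It also shows one detail worth knowing: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, while scikit-learn's `StandardScaler` uses the population standard deviation (`ddof=0`), so the two results differ by a constant factor.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 20.0, 30.0, 50.0]})

# pandas version: divides by the sample standard deviation (ddof=1)
z_pandas = (toy - toy.mean()) / toy.std()

# scikit-learn version: divides by the population standard deviation (ddof=0)
z_sklearn = StandardScaler().fit_transform(toy)

# After standardizing, every column has mean 0 and (sample) std 1
col_means = z_pandas.mean().to_numpy()
col_stds = z_pandas.std().to_numpy()

# The two versions differ only by the constant factor sqrt(n / (n - 1))
n = len(toy)
ratio = z_sklearn / z_pandas.to_numpy()
print(ratio)
```

Because the factor is the same for every feature, the choice between the two conventions should not change which points K-means groups together; it only rescales all distances uniformly.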

In [None]:
# select the three features and standardize them (z-scores)
X = raw[['Peace', 'Legal', 'GDP Growth']]
X = (X - X.mean()) / X.std()
print(X.head(5))

print("Summary statistics")
X.describe()


### Perform elbow method

The marginal reduction in inertia from adding one more cluster drops noticeably between k=3 and k=4, so we will choose k=3 (although the elbow is not clear-cut).

Ref. [Determining the number of clusters in a dataset.](https://en.wikipedia.org/wiki/Determining_the_number_of_clusters_in_a_data_set)
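To make the quantity on the y-axis of the elbow plot concrete, here is a minimal sketch on synthetic data (two made-up blobs, not the Country Risk data). It recomputes the inertia by hand as the sum of squared distances from each point to its assigned cluster center and compares it with the `inertia_` attribute of a fitted `KMeans` model. The seed and blob parameters are arbitrary assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic 2D points (illustrative only): two well-separated blobs
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(20, 2)),
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Inertia by hand: squared distance from each point to its assigned center
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = ((points - assigned_centers) ** 2).sum()

print(manual_inertia, model.inertia_)
```

Adding clusters can only reduce this sum, which is why the elbow method looks at how quickly it falls rather than at its minimum.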

$\ex{2.1}$ Compute the inertia for each value of k in `Ks` and append it to the list `inertia`.

In [None]:
# https://stackoverflow.com/questions/41540751/sklearn-kmeans-equivalent-of-elbow-method

Ks = range(1, 10)

inertia = []

# START ANSWER
# END ANSWER

plt.figure()
plt.plot(Ks,inertia, '-bo')
plt.title("The Elbow Method")
plt.xlabel('Number of clusters')
plt.ylabel('Inertia (within-cluster sum of squares)')
plt.show()

### K-means with k=3

Now that we have settled on k=3, we will run k-means for this k and predict the labels of the data points. Run the cell below a couple of times and take a close look at the cluster centers.

In [None]:
k = 3
kmeans = KMeans(n_clusters=k,  n_init=10)
kmeans.fit(X)

# print inertia & cluster center
print("inertia for k=3 is", kmeans.inertia_)

print("cluster centers: \n", kmeans.cluster_centers_)

# take a quick look at the result
y = kmeans.labels_
print("cluster labels: \n", y)
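If you later need the same labels on every run (for instance, so that the colours in the plots further down stay fixed), `KMeans` accepts a `random_state` seed. The sketch below uses synthetic blobs and an arbitrary seed of 42, both of which are assumptions for illustration; the cell above deliberately leaves the seed unset.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data (illustrative only): three well-separated blobs
rng = np.random.default_rng(1)
points = np.vstack([
    rng.normal(loc=center, scale=0.3, size=(15, 2))
    for center in [(0.0, 0.0), (4.0, 0.0), (2.0, 4.0)]
])

# Two independent fits that share the same seed
fit_a = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)
fit_b = KMeans(n_clusters=3, n_init=10, random_state=42).fit(points)

# Compare the labels and the centers of the two fits
same_labels = np.array_equal(fit_a.labels_, fit_b.labels_)
same_centers = np.allclose(fit_a.cluster_centers_, fit_b.cluster_centers_)
print(same_labels, same_centers)
```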

$\ex{2.2}$ What happens to the cluster centers across the runs? Why do you think this happens?

<div style="background-color:#c2eafa">

Write your answer here:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

</div>

## 3. Visualization of the results
### One 3D plot

We'd like to view the plot from a position a bit more to the right and a bit lower (so looking down, but at a less steep angle).   

$\ex{3.1}$ Find out how this can be done and then change the viewing angle as desired.  

*Hint: Take a look at the matplotlib documentation for 3D axes (`Axes3D` in `mpl_toolkits.mplot3d`)*

In [None]:
# set up the color
norm = clrs.Normalize(vmin=0.,vmax=y.max() + 0.8)
cmap = cm.viridis

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')

ax.scatter(X.iloc[:,0], X.iloc[:,1], X.iloc[:,2], c=cmap(norm(y)), marker='o')

centers = kmeans.cluster_centers_
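# plot the centers at their full 3D coordinates (including GDP Growth on the z-axis)
ax.scatter(centers[:, 0], centers[:, 1], centers[:, 2], c='black', s=100, alpha=0.5)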

ax.set_xlabel('Peace')
ax.set_ylabel('Legal')
ax.set_zlabel('GDP Growth')

# START ANSWER
# END ANSWER

plt.show()

### Three 2D plots
The plots below don't look right: we standardized, so the units on the x-axis and the y-axis are the same and should be the same in the plot. Check out `set_aspect` and make this right.



In [None]:
figs = [(0, 1), (0, 2), (1, 2)]
labels = ['Peace', 'Legal', 'GDP Growth']

for i in range(3):
    fig = plt.figure(i)
    plt.scatter(X.iloc[:,figs[i][0]], X.iloc[:,figs[i][1]], c=cmap(norm(y)), s=50)
    plt.scatter(centers[:, figs[i][0]], centers[:, figs[i][1]], c='black', s=200, alpha=0.5,label="cluster center")
    plt.xlabel(labels[figs[i][0]])
    plt.ylabel(labels[figs[i][1]])
    plt.legend()
    # START ANSWER
    # plt.axis('scaled')
    ax = plt.gca()
    ax.set_aspect('equal', adjustable='box')
    # END ANSWER

plt.show()

### Three 2D plots with country labels
Now, we plot the country abbreviations instead of dots.

In [None]:

figs = [(0, 1), (0, 2), (1, 2)]
labels = ['Peace', 'Legal', 'GDP Growth']
colors = ['blue','green', 'red']

for i in range(3):
    fig = plt.figure(i, figsize=(6, 6))
    x_1 = figs[i][0]
    x_2 = figs[i][1]
    # invisible scatter: draws nothing, but sets the axis limits for the text labels
    plt.scatter(X.iloc[:, x_1], X.iloc[:, x_2], c=y, s=0, alpha=0)
    plt.scatter(centers[:, x_1], centers[:, x_2], c='black', s=200, alpha=0.5, label="cluster center")
    for j in range(X.shape[0]):
        plt.text(X.iloc[j, x_1], X.iloc[j, x_2], raw['Abbrev'].iloc[j], 
                 color=colors[y[j]], weight='semibold', horizontalalignment = 'center', verticalalignment = 'center')
    plt.xlabel(labels[x_1])
    plt.ylabel(labels[x_2])
    plt.legend()

plt.show()

### List the results

In [None]:
result = pd.DataFrame({'Country':raw['Country'], 'Abbrev':raw['Abbrev'], 'Label':y})
result.sort_values('Label')
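A sorted list shows which country landed in which cluster, but not what the clusters mean. One possible way to interpret them is to compute the per-cluster mean of each feature and the cluster sizes. The sketch below does this on a small hypothetical table (the values and labels are made up); with the notebook's variables, something like `raw[['Peace', 'Legal', 'GDP Growth']].assign(Label=y).groupby('Label').mean()` would presumably play the same role.

```python
import pandas as pd

# Hypothetical toy table (not the real Country Risk data)
toy = pd.DataFrame({
    "Peace": [1.2, 1.4, 2.9, 3.1, 2.0, 2.1],
    "Legal": [8.0, 7.5, 3.0, 2.5, 5.0, 5.5],
    "Label": [0, 0, 1, 1, 2, 2],
})

# Mean of each feature per cluster, plus the number of rows in each cluster
profile = toy.groupby("Label").mean()
profile["Size"] = toy.groupby("Label").size()
print(profile)
```

Keep in mind that the numbering of the clusters is arbitrary, so a profile like this is needed before calling any cluster "low risk" or "high risk".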


## 4. Plotting the silhouette score
$\ex{4.1}$ Please make a plot of the average silhouette scores versus the number of clusters.
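As background for this exercise, the sketch below shows how the two imported functions relate, using synthetic blobs rather than the Country Risk data (the seed and blob parameters are arbitrary assumptions). `silhouette_samples` returns one coefficient per data point, each between -1 and 1, and `silhouette_score` is the mean of those coefficients.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

# Synthetic 2D points (illustrative only): two well-separated blobs
rng = np.random.default_rng(2)
points = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(20, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(20, 2)),
])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(points)

# One silhouette coefficient per point, and the average over all points
per_point = silhouette_samples(points, labels)
average = silhouette_score(points, labels)

print(per_point.mean(), average)
```

Note that the silhouette coefficient is only defined for two or more clusters, which is why `range_n_clusters` in the next cell starts at 2.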

In [None]:
# Silhouette Analysis
range_n_clusters=[2,3,4,5,6,7,8,9,10]
scores = []

# START ANSWER
# END ANSWER

$\ex{4.2}$ What are your conclusions from the plot?

<div style="background-color:#c2eafa">

Write your answer here:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

</div>