# **Lab2B: The Country Risk case: Principal Components Analysis**

**WHAT** This nonmandatory lab consists of several programming and insight exercises/questions.

**WHY** The exercises are meant to familiarize you with the application of Principal Components Analysis.

**HOW** Follow the exercises in this notebook either on your own or with a fellow student. If you want to skip right to questions and exercises, find the $\rightarrow$ symbol. 

$\newcommand{\q}[1]{\rightarrow \textbf{Question #1. }}$
$\newcommand{\ex}[1]{\rightarrow \textbf{Exercise #1. }}$

In [None]:
# loading packages

import pandas as pd
import numpy as np

# plotting packages
%matplotlib inline
import matplotlib.pyplot as plt

# K-means, clustering metrics, and PCA from scikit-learn
from sklearn.cluster import KMeans
from sklearn import metrics 
from sklearn.decomposition import PCA

# np.set_printoptions(precision=3)  # print results with 3 decimals behind the decimal point

## 1. Loading data, scaling, and K-means 

In [None]:
# load raw data
raw = pd.read_csv('./Country_Risk_2019_Data.csv')
X = raw[['Corruption', 'Peace', 'Legal', 'GDP Growth']]
X = (X - X.mean()) / X.std()
X.head(5)
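
A side note on the scaling step: pandas' `std` uses the sample standard deviation (`ddof=1`) by default, whereas `sklearn.preprocessing.StandardScaler` divides by the population standard deviation (`ddof=0`). The sketch below uses a small made-up DataFrame (`toy`, not the country risk data) to show that the two scalings differ only by the constant factor $\sqrt{n/(n-1)}$, so correlations are unaffected.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, for illustration only
toy = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [10.0, 0.0, 5.0, 7.0]})

z_pandas = (toy - toy.mean()) / toy.std()        # sample SD (ddof=1), as in the cell above
z_sklearn = StandardScaler().fit_transform(toy)  # population SD (ddof=0)

n = len(toy)
factor = np.sqrt(n / (n - 1))
# The two results agree once the constant factor is accounted for
print(np.allclose(z_sklearn, z_pandas.to_numpy() * factor))
```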

In [None]:
print("Correlation matrix:")
X.corr()

### K means with k=3
For comparison with the results later, we repeat the K-means clustering on Peace, Legal, and GDP Growth.

In [None]:
kmeans = KMeans(n_clusters=3, random_state=1, n_init=10)
kmeans.fit(X.iloc[:,1:])  # skipping column Corruption
km_labels = kmeans.labels_
print("cluster labels: \n", km_labels)

## 2. Principal Components Analysis by Sklearn
### First: how you might find what you need in the Sklearn documentation
**N.B.** If you have difficulty finding your way, you should read this carefully and follow the steps.

You might start by having a peek at [PCA in the Sklearn User Guide](https://scikit-learn.org/stable/modules/decomposition.html#principal-component-analysis-pca). The User Guide has lots of useful information (including background), though it is not always very accessible for novices. In this case, we read after a few lines that "`PCA` is implemented as a *transformer* object that learns components in its `fit` method." After the PCA plot of the Iris data, the discussion quickly leaves our scope. Time to move on. 

By clicking on `PCA` we get to the [API page for `sklearn.decomposition.PCA`](https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html#sklearn.decomposition.PCA). (Apparently, PCA is part of the `sklearn.decomposition` module on Matrix Decomposition; good to know.)

Of the **parameters** of (objects in the class) `PCA` you probably only need (to know) `n_components`, but you should scroll down to the **attributes**, some of which you will recognize from the PCA terminology. Scroll even further down to **Methods** and scan them. You will probably only use the methods **fit**, **fit_transform**, and **transform**. Look over their descriptions and you might (like me) be a bit confused about whether the methods return a new object or modify the object passed as an argument. 
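
To make the behaviour of these three methods concrete, here is a minimal sketch on random data (the array `data` is hypothetical and is not the lab's `X`). It illustrates that `fit` returns the fitted estimator itself, while `transform` and `fit_transform` return a new array of scores and leave the input untouched.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
data = rng.normal(size=(50, 4))   # hypothetical data, for illustration only

pca = PCA(n_components=2)
returned = pca.fit(data)          # learns the components; returns the estimator itself
print(returned is pca)            # True

scores = pca.transform(data)      # returns a new array of scores; `data` is unchanged
print(scores.shape)               # (50, 2)

# fit_transform does both steps in one call and returns the scores
scores_again = PCA(n_components=2).fit_transform(data)
print(np.allclose(np.abs(scores), np.abs(scores_again)))

# Indexing with a list keeps the 2-D shape that estimators such as KMeans expect
pc1 = scores[:, [0]]
print(pc1.shape)                  # (50, 1)
```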

You need to invest in reading these pages to become more familiar with them. The key is not to be overwhelmed and to note what rings a bell; there are many options and details that are outside our scope and can therefore be ignored. Sklearn is very structured and organized, which means that if you invest in understanding that structure and organization, you will learn to find your way. One way to do that is to go to the [`Getting Started` pages](https://scikit-learn.org/stable/getting_started.html). Very highly recommended!

$\ex{1}$ Let's first reproduce the Excel results reported in **Table 2.11** and **Table 2.12**. In the latter table, also show a row with the explained variances. An easy way to produce a nice table is to use a Pandas DataFrame.

In [None]:
# Table 2.11
pca_full = PCA().fit(X)
# START ANSWER
# END ANSWER

In [None]:
# Table 2.12 including explained variance

# START ANSWER
# END ANSWER

$\ex{2}$ Also make a plot of the cumulative explained variance against the number of components.

In [None]:
# START ANSWER
# END ANSWER

## 3. Clustering based on PC1 versus K-means with k=3

$\ex{3}$ Determine what percentage of the variance is explained by the first principal component.

In [None]:
# START ANSWER
# END ANSWER

Since this is quite a lot, we are going to do K-means clustering on the factor scores of the first principal component.

$\ex{4}$ Generate the factor scores for PC1 and use K-means clustering to obtain 3 clusters.

In [None]:
PC_labels = None  # Use this name for the cluster labels that you find

# START ANSWER
# END ANSWER

### Comparing the clusterings
We are going to make a few comparisons and a contingency table. The Rand index and the adjusted Rand index measure how well two clusterings agree (regardless of how the clusters are labeled); for identical clusterings both indices are 1.0. The contingency table shows the frequencies of the label combinations.
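
As a small illustration on made-up label vectors (`a`, `b`, and `c` are hypothetical and unrelated to the country data): `a` and `b` describe the same partition with different label names, while `c` is a genuinely different partition.

```python
import numpy as np
from sklearn import metrics

a = np.array([0, 0, 1, 1, 2, 2])
b = np.array([2, 2, 0, 0, 1, 1])   # same partition as a, different label names
c = np.array([0, 0, 0, 1, 1, 2])   # a different partition

# Identical partitions: both indices equal 1.0 despite the different names
print(metrics.rand_score(a, b), metrics.adjusted_rand_score(a, b))

# Different partitions: both indices drop below 1.0
print(metrics.rand_score(a, c), metrics.adjusted_rand_score(a, c))

# Rows correspond to the labels in a, columns to the labels in b
print(metrics.cluster.contingency_matrix(a, b))
```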

In [None]:
rindex = metrics.rand_score(km_labels, PC_labels)
print(f"The Rand index is: {rindex:4.3}")
arindex = metrics.adjusted_rand_score(km_labels, PC_labels)
print(f"The adjusted Rand index is: {arindex:4.3}")

cm = metrics.cluster.contingency_matrix(km_labels, PC_labels)
con_tab = pd.DataFrame(cm, columns= [0,1,2], index=[0,1,2])
con_tab

It looks like we should interchange labels 1 and 2 for the PCA-based clustering. The results below show that this does not change the Rand indices, because both indices are invariant under permutations of the labels.

In [None]:
# interchange labels 1 <-> 2 and repeat (to check that nothing really changes):
rel = np.array([0, 2, 1]) 
PC_labels = rel[PC_labels]
rindex = metrics.rand_score(km_labels, PC_labels)
print(f"The Rand index is: {rindex:4.3}")
arindex = metrics.adjusted_rand_score(km_labels, PC_labels)
print(f"The adjusted Rand index is: {arindex:4.3}")

cm = metrics.cluster.contingency_matrix(km_labels, PC_labels)
con_tab = pd.DataFrame(cm, columns= [0,1,2], index=[0,1,2])
con_tab
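
The relabeling can also be found automatically instead of by inspecting the table. The sketch below is one possible approach, not part of the original lab: the helper `align_labels` is a hypothetical name, and it assumes both clusterings use the labels 0, ..., k-1 with the same number of clusters. It uses `scipy.optimize.linear_sum_assignment` to choose the permutation that puts as many observations as possible on the diagonal of the contingency table.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment
from sklearn import metrics

def align_labels(reference, labels):
    """Rename `labels` so that they agree with `reference` as often as possible.

    Assumes both inputs use labels 0..k-1 and have the same number of clusters.
    """
    cm = metrics.cluster.contingency_matrix(reference, labels)
    # Rows are reference clusters, columns are clusters in `labels`;
    # pick the one-to-one matching with the largest total count.
    row_ind, col_ind = linear_sum_assignment(cm, maximize=True)
    mapping = np.empty(cm.shape[1], dtype=int)
    mapping[col_ind] = row_ind      # old label (column) -> new label (row)
    return mapping[labels]

# Toy example with made-up labels
reference = np.array([0, 0, 1, 1, 2, 2, 2])
labels = np.array([0, 0, 2, 2, 1, 1, 2])
aligned = align_labels(reference, labels)
print(aligned)
print(metrics.cluster.contingency_matrix(reference, aligned))
```

In the notebook this could be called as `align_labels(km_labels, PC_labels)`; the Rand indices are unaffected by the renaming.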

In [None]:
PC_labels[:10]

$\ex{5}$ Repeat this comparison, but now using 2 components.

In [None]:
# START ANSWER
# END ANSWER

$\ex{6}$ Relabel again, if necessary, to get the contingency table "right."

In [None]:
# START ANSWER
# END ANSWER

In [None]:
# np.vstack([km_labels, PC_labels]).T

<div style="background-color:#c2eafa">
$\q{1}$ Comment on the results. It seems that the PCA does not reproduce the K-means results (even if you use more components). Is this surprising?

<div style="background-color:#f1be3e">

Write your answer here:
    
[//]: # (START ANSWER)
[//]: # (END ANSWER)

## Appendix (**optional**). Principal Components Analysis using Numpy.linalg.eig

**Note:** The linear algebra behind PCA was briefly mentioned in class. The rest of this section shows how it can be implemented with tools from Numpy.linalg. You are not required to know this, or even to read it. On the other hand, if you prefer to use this approach, that's fine, as long as you get the right answers.

### Computing the principal components
First compute the eigenvalues and eigenvectors of the correlation matrix. They might need to be reordered: the eigenvalues should run from largest to smallest, with the eigenvectors in the matching order.

In [None]:
eigenvalues, eigenvectors = np.linalg.eig(X.corr())
print("Eigenvalues:", eigenvalues)
eigenvectors

In [None]:
incr = eigenvalues.argsort() # this function returns an index list that would order the eigenvalues
decr = incr[::-1]    # reverse the order: we need largest FIRST
eigenvalues_ord = eigenvalues[decr]
eigenvectors_ord = eigenvectors[:,decr]  # reorder columns

In [None]:
# let's check:
print("Eigenvalues in decreasing order:", eigenvalues_ord)
print("Eigenvectors in the same order:\n", eigenvectors_ord)

In [None]:
columns = ["PC1", "PC2", "PC3", "PC4"]
#convert ordered eigenvectors into dataframe 
df_eigenvec = pd.DataFrame(data=eigenvectors_ord, columns=columns, 
                            index=["Corruption Index","Peace Index", "Legal risk Index", "GDP Growth rate"])
# SD of factor scores
SD_factor_scores = np.sqrt(eigenvalues_ord)
# % of variance 
variance = 100 * eigenvalues_ord / (np.sum(eigenvalues_ord)) 

# Collect the SDs, variances, and percentages of variance of the factor scores in a dataframe
scores_variance = pd.DataFrame(data = [SD_factor_scores, eigenvalues_ord, variance] , columns = columns, 
                               index = ["SD of factor scores", "variance of scores", "% of total variance"]) 

In [None]:
eigenvalues_ord

In [None]:
np.around(df_eigenvec,3)

In [None]:
np.around(scores_variance,3)
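
As a sanity check, the eigen-decomposition route can be compared with Sklearn's `PCA`. The sketch below uses synthetic data (`raw_toy` and `Z` are hypothetical stand-ins for the lab's data) and `np.linalg.eigh`, a variant of `eig` intended for symmetric matrices that returns real eigenvalues in ascending order. It relies on the data being standardized with the sample standard deviation, as in Section 1, so that the correlation matrix equals the sample covariance matrix.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
# Hypothetical correlated data, for illustration only
raw_toy = rng.normal(size=(100, 4)) @ rng.normal(size=(4, 4))
Z = (raw_toy - raw_toy.mean(axis=0)) / raw_toy.std(axis=0, ddof=1)

# Eigen-decomposition of the correlation matrix, largest eigenvalue first
vals, vecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]

pca = PCA().fit(Z)

# Eigenvalues match the explained variances
print(np.allclose(vals, pca.explained_variance_))
# Eigenvectors are only determined up to sign, so compare absolute values;
# the rows of components_ correspond to the columns of vecs
print(np.allclose(np.abs(vecs.T), np.abs(pca.components_)))
```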