[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/nkeriven/ensta-mt12/blob/main/notebooks/02_PCA/N1_pca_olympic_data.ipynb)

# Olympic decathlon data

This example is a short introduction to principal component analysis (PCA). The data are performance marks on the ten [decathlon events](https://en.wikipedia.org/wiki/Decathlon) for 33 athletes at the 1988 Olympic Games.

The code cell below defines some helper functions to display summary statistics and plots of a PCA representation.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display
plt.rcParams.update({'font.size': 18})

def scree_plot(pca):
    """bar plot of decreasing explained variance
    """
    PC_values = np.arange(pca.n_components_) + 1
    PC_labels = ['PC' + str(nb+1) for nb in range(pca.n_components_)] 
    plt.figure(figsize=(8,6))
    # Unlike in the course, scikit-learn's explained_variance_ directly holds the eigenvalues.
    # The normalized "explained variance" of the course is the cumulative sum of explained_variance_ratio_.
    plt.bar(PC_values, pca.explained_variance_, linewidth=2, edgecolor='k')
    plt.xticks(ticks=PC_values, labels=PC_labels)
    plt.title('Scree Plot')
    plt.xlabel('Principal Components')
    plt.ylabel('Eigenvalues')
    plt.show()
        
def pca_summary(pca, X, out=True):
    """Display a table of the standard deviation, proportion of variance,
    and cumulative proportion of variance for each component
    """
    names = ["PC"+str(i) for i in range(1, len(pca.explained_variance_ratio_)+1)]
    a = np.std(pca.transform(X), axis=0, ddof=1)
    b = pca.explained_variance_ratio_
    c = np.cumsum(pca.explained_variance_ratio_)
    columns = pd.MultiIndex.from_tuples([("sdev", "Standard deviation"), ("varprop", "Proportion of Variance"),
                                         ("cumprop", "Cumulative Proportion")])
    summary = pd.DataFrame(list(zip(a, b, c)), index=names, columns=columns)
    if out:
        print("Importance of components:")
        display(summary)
    return summary

def biplot2D(score,coeff,labels=None):
    """Generate biplot for the first two principal components 
    to display both scores and variables
    """
    
    xs = score[:,0] # projection on PC1
    ys = score[:,1] # projection on PC2
    p = coeff.shape[1]
    n = score.shape[0]
    
    fig, ax = plt.subplots(figsize=(10,8))
    # plot the scores with the index of the sample
    ax.scatter(xs, ys, marker=".", color = 'k')
    for i in range(n):
        ax.text(xs[i], ys[i], str(i), color = 'k')
    ax.set_xlabel("PC{}".format(1))
    ax.set_ylabel("PC{}".format(2))
    
    # plot the variable vectors (arrow) in the PC plane (loadings)  
    arrow_sc = 1.15 
    color = 'tab:red'
    ax2 = ax.twinx() # second y-axis (shares the x-axis) for the loadings
    ax2.set_ylim(-1.2,1.2)
    ax2.tick_params(axis='y', labelcolor=color)
    ax2 = ax2.twiny() # second x-axis (shares the y-axis) for the loadings
    ax2.set_xlim(-1.2,1.2)
    ax2.tick_params(axis='x', labelcolor=color)
    for i in range(p):
        ax2.arrow(0, 0, coeff[0, i], coeff[1, i], color =  color ,alpha = 0.5, 
                  linestyle = '-',linewidth = 1.5, head_width=0.02, head_length=0.02)
        if labels is None:
            ax2.text(coeff[0, i]* arrow_sc, coeff[1,i] * arrow_sc, "Var"+str(i+1), 
                     color = color, ha = 'center', va = 'center')
        else:
            ax2.text(coeff[0, i]* arrow_sc, coeff[1, i] * arrow_sc, labels[i], 
                     color = color, ha = 'center', va = 'center')

## Dataset

Load the olympic dataset contained in the text file `olympic.csv`. If you run the notebook locally, you can also copy the file into the same directory as the notebook and load `olympic.csv` directly.

We use **Pandas**, a multi-purpose library for handling datasets in Python. Pandas has *many* features, of which we will only use a fraction; see more at https://pandas.pydata.org/docs/getting_started/intro_tutorials/

In [None]:
import pandas as pd
import numpy as np


#load data set
olympic = pd.read_csv('https://raw.githubusercontent.com/nkeriven/ensta-mt12/main/notebooks/data/olympic.csv',
                      sep=',', header=0)
olympic.head() #data overview: variable names and first rows

#### Display some descriptive statistics for this dataset

In [None]:
olympic.describe()

From the table above, we can guess that the *running* event performances are measured in seconds, while the *jumping* and *throwing* ones are in meters.

## PCA

Perform *PCA* on the decathlon event data $X \in \mathbb{R}^{n \times p}$, with $n=33$ samples (athletes) and $p=10$ variables/features (decathlon events).

In [None]:
from sklearn.decomposition import PCA
pca = PCA()
olympic_pc = pca.fit_transform(olympic) # scores: projection of the athletes on the principal components

How are the component variances (eigenvalues) $\lambda_i^2$, $1 \le i \le p$, distributed? Let's visualize the **scree plot**.

In [None]:
scree_plot(pca)
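
The bars of the scree plot are the values in `explained_variance_`. As a minimal sketch, the check below confirms on synthetic data that these values match the eigenvalues of the sample covariance matrix. The array `X_demo` and the name `pca_demo` are hypothetical stand-ins and are not part of the olympic data.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in: 33 samples, 4 variables with very different scales
X_demo = rng.normal(size=(33, 4)) * np.array([10.0, 1.0, 0.5, 0.1])

pca_demo = PCA().fit(X_demo)

# Eigenvalues of the sample covariance matrix (ddof=1), in decreasing order
cov = np.cov(X_demo, rowvar=False)
eigvals = np.linalg.eigvalsh(cov)[::-1]

same_eigenvalues = np.allclose(pca_demo.explained_variance_, eigvals)
same_ratios = np.allclose(pca_demo.explained_variance_ratio_, eigvals / eigvals.sum())
print(same_eigenvalues, same_ratios)
```

The same comparison can be run on the fitted `pca` object by replacing `X_demo` with the `olympic` values.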

#### Display a summary of PCA representation

In [None]:
pca_summary(pca, olympic)
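
The cumulative proportion column can be used to choose how many components to keep. The sketch below returns the smallest number of components whose cumulative proportion reaches a given threshold. The function name `n_components_for`, the 90% threshold and the example ratios are all illustrative assumptions, not values from the olympic data.

```python
import numpy as np

def n_components_for(ratios, threshold=0.9):
    """Smallest number of leading components whose cumulative
    proportion of variance is at least `threshold`."""
    cumulative = np.cumsum(ratios)
    # index of the first cumulative value >= threshold, converted to a count
    k = int(np.searchsorted(cumulative, threshold)) + 1
    return min(k, len(cumulative))

# Hypothetical explained-variance ratios, for illustration only
example_ratios = [0.6, 0.25, 0.1, 0.05]
k90 = n_components_for(example_ratios, threshold=0.9)
print(k90)
```

With a fitted model, the input would be `pca.explained_variance_ratio_`.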

#### Display the biplot

The *biplot* gives a graphical summary of both the samples (athletes), in terms of scores, and the
variables/features, in terms of loadings.


In [None]:
# Call the function, using only the first 2 PCs.
biplot2D(olympic_pc[:,0:2], pca.components_[0:2, :], olympic.columns)
plt.show()

From the plot above, we see that the first principal component is positively associated with longer times on the 1500.

We can compare each athlete's mark on the `1500` event with their score on the first component, to check that slower runners have a higher value on this component, and vice versa.

In [None]:
print('Average 1500 event mark (seconds) == {:.2f}'.format(olympic['1500'].mean()) )
pd.DataFrame(list(zip(olympic['1500'],olympic_pc[:,0])), columns=['1500', 'score PC1'])

In [None]:
plt.figure(figsize=(12,10))
plt.plot(olympic['1500'],olympic_pc[:,0],'k.')
for i in range(len(olympic_pc[:,0])):
    plt.text(olympic['1500'][i], olympic_pc[i,0], str(i), color = 'k')
plt.xlabel('1500 time (s)')
plt.ylabel('PC1')

So the correlation between the `1500` event and the first principal component is almost perfect!
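
This visual impression can be quantified with a Pearson correlation coefficient. The sketch below does so on synthetic marks, where one column has a much larger variance than the other. The column values are hypothetical, and the first component is computed with a plain SVD of the centered data rather than with `sklearn`. The sign of a principal component is arbitrary, so only the absolute value of the correlation is meaningful.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 33
# Hypothetical marks: a high-variance "1500"-like column and a low-variance "100"-like column (seconds)
t1500 = rng.normal(280.0, 12.0, size=n)
t100 = rng.normal(11.0, 0.25, size=n)
X_demo = np.column_stack([t1500, t100])

# First principal component scores via SVD of the centered data
Xc = X_demo - X_demo.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
pc1 = Xc @ Vt[0]

# Pearson correlation between the dominant column and the PC1 scores
r = np.corrcoef(t1500, pc1)[0, 1]
print(f"|correlation| between the dominant column and PC1: {abs(r):.3f}")
```

On the notebook's data, the analogous call would be `np.corrcoef(olympic['1500'], olympic_pc[:, 0])`.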

Moreover, the previous biplot shows that the *second principal component* is correlated with strength, in the form of a long *javelin* throw.

In [None]:
plt.figure(figsize=(12,10))
plt.plot(olympic['jave'],olympic_pc[:,1],'k.')
for i in range(len(olympic_pc[:,0])):
    plt.text(olympic['jave'][i], olympic_pc[i,1], str(i), color = 'k')
plt.xlabel('Javelin throw (m)')
plt.ylabel('PC2')

We can check in the plot above that stronger throwers have a higher value on this second component.

## Standardizing: scale matters!

In the previous example, we saw that the first two components were based somewhat on speed and strength. However,
**we did not scale the variables**, so the 1500 (whose marks vary by tens of seconds) has much more weight than the 400, for instance!
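
To see why scale matters, one can look at how much each variable contributes to the total variance, since PCA on unscaled data is driven by the variables with the largest variance. The sketch below uses hypothetical marks whose means and spreads are illustrative only and are not taken from `olympic.csv`.

```python
import numpy as np

rng = np.random.default_rng(3)
# Hypothetical marks for 33 athletes (illustrative means and spreads)
marks = {
    "1500 (s)": rng.normal(280.0, 12.0, size=33),
    "400 (s)": rng.normal(49.0, 1.0, size=33),
    "long (m)": rng.normal(7.2, 0.3, size=33),
}

# Sample variance of each column and its share of the total variance
variances = {name: np.var(values, ddof=1) for name, values in marks.items()}
total = sum(variances.values())
shares = {name: v / total for name, v in variances.items()}

for name, share in shares.items():
    print(f"{name:>9}: {share:6.1%} of total variance")
```

On the real data, `olympic.var() / olympic.var().sum()` gives the corresponding shares.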

We correct this by standardizing the variables with the `sklearn` preprocessing tools.

In [None]:
from sklearn.preprocessing import StandardScaler

# Center and reduce the variables
scaler = StandardScaler()
Xs = scaler.fit_transform(olympic)

# Perform PCA on the standardized variables
pca_s = PCA() # keep all the PCs (only the first 2 are plotted below)
Xs_pc = pca_s.fit_transform(Xs) # project the standardized data into the PCA space
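
As a minimal sketch of what standardization does, the check below verifies on synthetic data that `StandardScaler` subtracts the column mean and divides by the population standard deviation (`ddof=0`), and that PCA on the standardized data has the same explained variance ratios as the eigenvalues of the correlation matrix. The array `X_demo` is a hypothetical stand-in, not the olympic data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
# Synthetic stand-in: 33 samples, 3 variables with different means and scales
X_demo = rng.normal(size=(33, 3)) * np.array([12.0, 0.25, 1.5]) + np.array([280.0, 11.0, 7.0])

# StandardScaler centers each column and divides by its population std (ddof=0)
Xs_demo = StandardScaler().fit_transform(X_demo)
manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)
same_scaling = np.allclose(Xs_demo, manual)

# Eigenvalues of the correlation matrix, in decreasing order
corr_eigvals = np.linalg.eigvalsh(np.corrcoef(X_demo, rowvar=False))[::-1]
pca_demo = PCA().fit(Xs_demo)
same_ratios = np.allclose(pca_demo.explained_variance_ratio_,
                          corr_eigvals / corr_eigvals.sum())
print(same_scaling, same_ratios)
```

After standardization every variable has unit variance, so no single event can dominate the decomposition through its units alone.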

Show the new biplot for the standardized variables

In [None]:
# Call the function, using only the first 2 PCs.
biplot2D(Xs_pc[:,0:2], pca_s.components_[0:2, :], olympic.columns)
plt.show()

After standardizing, the plot above reinforces our earlier interpretation: the sprint-related events (such as *100m*,
*110m*, *400m*, *long*) are grouped along a common axis aligned with the first principal component.

Likewise, the strength and throwing events (in French: *javelot*, *disque*, *poids*) lie on a separate axis, roughly aligned with the second component (and thus largely decorrelated from the first).

### Display the loadings

In [None]:
# Loadings of the first two PCs, one row per event
pd.DataFrame(pca_s.components_[0:2, :].T, columns=['PC1', 'PC2'], index=olympic.columns)
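
When reading the loadings, it helps to remember two properties of `components_`: its rows are orthonormal vectors, and the scores are the centered data projected onto them. The sketch below checks both on synthetic data. The names `X_demo`, `pca_demo` and `W` are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(4)
X_demo = rng.normal(size=(33, 5))  # synthetic stand-in for standardized data

pca_demo = PCA().fit(X_demo)
W = pca_demo.components_  # shape (n_components, n_features), one loading vector per row

# Loading vectors are orthonormal: W @ W.T is the identity matrix
orthonormal = np.allclose(W @ W.T, np.eye(W.shape[0]))

# Scores are the centered data projected onto the loading vectors
scores_manual = (X_demo - pca_demo.mean_) @ W.T
same_scores = np.allclose(scores_manual, pca_demo.transform(X_demo))
print(orthonormal, same_scores)
```

Since each loading vector has unit norm, a coefficient close to 1 in absolute value means the component is almost entirely carried by that single variable.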

## Exercise
- For the *non-standardized* olympic data, explain why the `1500` event is the most important for explaining the variance. Is this still true after standardization?
- How many components do you think are sufficient to explain the *non-standardized* olympic data? Do you think the same is true for the standardized data?
- From the biplot analysis in the *standardized* case what are the global meanings of the first two principal components?  Are the loadings consistent with these conclusions?
- In your opinion, is it better (i.e. more useful) to perform PCA on *standardized* or *non-standardized* data for this example?