# Python Lab 05: Fifa '19 and PCA

## Francesco Della Santa, Computational Linear Algebra for Large Scale Problems, Politecnico di Torino

In this laboratory we apply PCA and the $k$-Means algorithm to the FIFA19 dataset (available online, e.g., here: https://www.kaggle.com/karangadiya/fifa19).

The version presented here has been cleaned of NaNs.



In [None]:
# ***** ATTENTION! *****
# For "%matplotlib widget" to work, you need the ipympl package (pip install ipympl)
#
#
# MATPLOTLIB INTERACTIVE VISUALIZATION. REMOVE (OR COMMENT OUT) THE NEXT LINE IF YOU NEED TO PRINT THE NOTEBOOK AS A PDF: IT DOES NOT ALWAYS WORK WELL.
%matplotlib widget
#
#

import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import cm
from matplotlib.lines import Line2D
import yaml
from IPython.display import display  # to display variables in a "nice" way

try:
    from yaml import CLoader as Loader, CDumper as Dumper
except ImportError:
    from yaml import Loader, Dumper

# pd.options.display.max_rows = 9999
pd.options.display.max_columns = 200

## Loading and Preparing the Dataset

### Loading as DataFrame

We start by loading the dataset.

**EXERCISE:** Load and display the dataset as a pandas DataFrame following these operations:
1. Download the file *fifa19datastats.csv* from the webpage of the course;
1. Save the file into a path of subfolders "data/Fifa19" inside your working folder (i.e., the one where this notebook is saved);

    **ATTENTION:** use the "backslash" for paths if you are working with Windows.
    
    **N.B.:** You can actually save the file wherever you prefer. However, we suggest using this path, so that the published solutions are easier to follow.
1. Load the file as a pandas DataFrame using the *read_csv* function;

    **ATTENTION:** the values in the first column of the .csv file are the indexes!
1. Display the DataFrame.
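
As a hedged illustration of the two points above (building a path and telling pandas that the first column holds the indexes), here is a minimal sketch on a toy in-memory CSV. The CSV content and the names `toy_path`, `csv_text`, and `toy_df` are made up for this example; `pathlib` is used only as one possible way to build a path that works on every operating system.

```python
import io
from pathlib import Path

import pandas as pd

# One OS-independent way to build the suggested path (the file is not opened here).
toy_path = Path('data') / 'Fifa19' / 'fifa19datastats.csv'

# Tiny in-memory CSV standing in for a real file (hypothetical content).
# Note the empty header of the first column: it holds the row indexes.
csv_text = """,ID,Overall
0,101,88
1,102,75
2,103,91
"""

# index_col=0 tells pandas to use the first column as the DataFrame index.
toy_df = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(toy_path)
print(toy_df)
```

Without `index_col=0`, pandas would keep the first column as an extra data column (typically named `Unnamed: 0`).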

In [None]:
# PATH TO THE fifa19datastats.csv FILE
myfifa_path = ...  # <-- TODO!!

# LOADING THE DATASET AS DATAFRAME
myfifa_df = pd.read_csv(...)  # <-- TODO!!

# DISPLAY OF THE DATAFRAME
...  # <-- TODO!!

### Selecting the Features

In the dataset above, we see many columns. In particular, the features we are interested in are the skills; i.e., all the columns *after* the *GeneralPosition* column.
Each skill column rates, with a value between $1$ and $99$, the level of a football player in that skill. Concerning the other columns:
1. **ID:** is the identity number of the football player in the dataset;
1. **Position:** favourite position of a player in the field during a game;
1. **GeneralPosition:** favourite general position of a player in the field during a game. In particular, we identify four main general positions:
    - *Goal-Keeper* (GK);
    - *Defender* (DF);
    - *Mid-Fielder* (MF);
    - *Forward* (FW).
1. **Overall:** a value between $1$ and $99$ that describes the overall score of a player w.r.t. their general position. In particular, it is computed as a weighted mean of the most important skills associated with the player's general position.

For our purposes, we want to separate the skill features from the other columns.

**EXERCISE:** Create a variable *Xfifa_df* and a variable *Xfifa* such that
1. *Xfifa_df* is a DataFrame obtained from *myfifa_df* keeping only the skill columns;
1. *Xfifa* is the 2D-array of data contained in *Xfifa_df*.

Then, display the DataFrame *Xfifa_df*.
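
A minimal sketch of this kind of column selection, on a toy DataFrame with made-up values (the names `toy_df`, `toy_skill_cols`, `toy_X_df`, and `toy_X` are hypothetical). It assumes, as stated above, that the skill columns are exactly those to the right of *GeneralPosition*.

```python
import pandas as pd

# Toy DataFrame with two non-skill columns followed by two skill columns.
toy_df = pd.DataFrame({
    'ID': [101, 102],
    'GeneralPosition': ['GK', 'FW'],
    'Finishing': [12, 90],
    'GKDiving': [88, 10],
})

# Integer position of the marker column, then every column to its right.
marker_pos = toy_df.columns.get_loc('GeneralPosition')
toy_skill_cols = list(toy_df.columns[marker_pos + 1:])

# DataFrame restricted to the skill columns, and its values as a 2D NumPy array.
toy_X_df = toy_df[toy_skill_cols]
toy_X = toy_X_df.to_numpy()

print(toy_skill_cols)
print(toy_X)
```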

In [None]:
# LIST OF COLUMN NAMES CORRESPONDING TO THE SKILLS
skill_cols = ...  # <-- TODO!!

# DATAFRAME WITH COLUMNS CORRESPONDING ONLY TO SKILLS
Xfifa_df = ...  # <-- TODO!!
# MATRIX OF DATA CONTAINED IN Xfifa_df
Xfifa = ...  # <-- TODO!!

# DISPLAY THE DATAFRAME
...  # <-- TODO!!

### Skill Types and Categories

The authors of FIFA '19 divide the skills into groups according to two criteria:
1. **Type:**
    1. *Physical skills:* Acceleration, SprintSpeed, Agility, Balance, Reactions, Jumping, Stamina, Strength;
    1. *Mental skills:* Positioning, Vision, Composure, Interceptions, Aggression;
    1. *Technical skills:* Finishing, LongShots, Penalties, ShotPower, Volleys, Crossing, Curve, FKAccuracy, LongPassing, ShortPassing, BallControl, Dribbling, HeadingAccuracy, Marking, SlidingTackle, StandingTackle;
    1. *Goalkeeper skills:* GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes.
1. **Category:**
    1. *Pace:* Acceleration, SprintSpeed;
    1. *Shooting:* Finishing, LongShots, Penalties, Positioning, ShotPower, Volleys;
    1. *Passing:* Crossing, Curve, FKAccuracy, LongPassing, ShortPassing, Vision;
    1. *Dribbling:* Agility, Balance, BallControl, Composure, Dribbling, Reactions;
    1. *Defending:* HeadingAccuracy, Interceptions, Marking, SlidingTackle, StandingTackle;
    1. *Power:* Aggression, Jumping, Stamina, Strength;
    1. *Goalkeeping:* GKDiving, GKHandling, GKKicking, GKPositioning, GKReflexes.


**EXERCISE:** Load the two .csv files (*skilltypes.csv* and *skillcategories.csv*) that describe these two groupings as DataFrames. The files are available on the course web page; we suggest saving them in the same folder as the file *fifa19datastats.csv*.

**EXERCISE:** Add a column to these DataFrames containing a different color for each type/category of skills; then display the result. Use the **colormap Set3** of matplotlib (cm.Set3.colors); in particular:
1. Use the first 4 colors for the types;
1. Use the 5th to 11th colors for the categories.

*Suggestion:* use the given dictionaries.
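
One possible way to turn a `{group: color}` dictionary into a column is the `map` method of a pandas Series. The sketch below uses a toy DataFrame; its column names `skill` and `type` are assumptions made for the example, since the layout of the real .csv files is not shown here.

```python
import pandas as pd
from matplotlib import cm

# Colors of Set3 as a sequence of (red, green, blue) tuples.
set3 = cm.Set3.colors

# Toy stand-in for a skill-type table (hypothetical column names and rows).
toy_types_df = pd.DataFrame({
    'skill': ['Acceleration', 'GKDiving', 'Stamina'],
    'type': ['Physical', 'Goalkeeper', 'Physical'],
})

# Dictionary {type: color}, restricted to the types present in the toy table.
toy_type_colors = {'Physical': set3[0], 'Goalkeeper': set3[3]}

# Each value of 'type' is replaced by the corresponding color tuple.
toy_types_df['color'] = toy_types_df['type'].map(toy_type_colors)

print(toy_types_df)
```

A value of `type` that is missing from the dictionary would produce a `NaN` in the new column, which is a quick way to spot spelling mismatches.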


In [None]:
# SAVE THE COLORS OF Set3 AS A LIST OF 3-TUPLES "(red_value, green_value, blue_value)"
set3 = cm.Set3.colors

######### SHOWING THE COLORS OF THE CHOSEN COLORMAP #########
display(cm.Set3)
#############################################################

# LOADING THE DATASETS AS DATAFRAMES
skill_types_df = pd.read_csv(...)  # <-- TODO!!
skill_cats_df = pd.read_csv(...)  # <-- TODO!!

type_colors = {
    'Physical': set3[0],
    'Mental': set3[1],
    'Technical': set3[2],
    'Goalkeeper': set3[3]
}

cat_colors = {
    'Pace': set3[4],
    'Shooting': set3[5],
    'Passing': set3[6],
    'Dribbling': set3[7],
    'Defending': set3[8],
    'Power': set3[9],
    'Goalkeeping': set3[10]
}

# ADDING THE 'color' COLUMNS
skill_types_df['color'] = ...  # <-- TODO!!
skill_cats_df['color'] = ...  # <-- TODO!!

# DISPLAY THE DATAFRAMES
...  # <-- TODO!!
...  # <-- TODO!!


## PCA

Now, we apply the PCA to our dataset. 

**ATTENTION:** since all the features share the same "unit of measure", with a minimum value of $1$ and a maximum value of $99$, we do not need to rescale the data before the PCA (the centering of the data is performed automatically by the PCA class of scikit-learn).

**EXERCISE:** In the first one of the two following cells, run the PCA on the dataset keeping *all* the Principal Components (PCs). Then, plot the cumulative explained variance (as percentage of the total variance) w.r.t. the number of principal components.

**EXERCISE:** In the second of the two following cells, run the PCA on the dataset keeping *the first three* PCs. Then, draw a barplot with three columns representing the percentage of explained variance of the three PCs. In the title, add the value (rounded to 2 decimals) of the percentage of total explained variance of the three PCs.
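
As a hedged sketch of the quantities involved, the following example computes the cumulative explained variance (in percent) on synthetic data. The data and all the `toy_*` names are made up; only the attribute `explained_variance_ratio_` of a fitted scikit-learn `PCA` object is relied upon.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)

# Synthetic data: 200 samples, 5 features with decreasing spread.
toy_X = rng.normal(size=(200, 5)) * np.array([5.0, 3.0, 1.0, 0.5, 0.1])

# With the default n_components=None, all the PCs are kept.
toy_pca = PCA()
toy_pca.fit(toy_X)

# Percentage of variance explained by each PC, and its running sum.
ratios_pct = 100 * toy_pca.explained_variance_ratio_
cumulative_pct = np.cumsum(ratios_pct)

# Percentage of total variance explained by the first three PCs (2 decimals).
first_three_pct = np.round(cumulative_pct[2], 2)

print(cumulative_pct)
print(first_three_pct)
```

When all the PCs are kept, the last entry of the cumulative sum is $100$ (up to rounding errors), so the curve always ends at the top of the plot.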



In [None]:
# INITIALIZE THE PCA
pca_full = ...  # <-- TODO!!

# FIT THE PCA
...  # <-- TODO!!

# MAKE THE CUMULATIVE EXPLAINED VARIANCE PLOT
plt.figure(figsize=(12, 6))
plt.plot(...)  # <-- TODO!!
plt.title('CUMULATIVE EXPLAINED VARIANCE (ALL PCs)')
plt.xticks(ticks=...,  # <-- TODO!! 
           labels=...,  # <-- TODO!!
           rotation=45)
plt.xlabel('Principal Components')
plt.ylabel('Cumulative explained variance (%)')
plt.grid()
plt.show()

In [None]:
m = 3

# INITIALIZE THE PCA
pca = ...  # <-- TODO!!

# FIT THE PCA
...  # <-- TODO!!

# COMPUTE THE PERCENTAGE OF TOT. EXPL. VARIANCE (ROUNDED TO 2 DECIMALS)
round_expl_var_ratio = np.round(...)  # <-- TODO!!

# MAKE THE BARPLOT
plt.figure(figsize=(6, 6))
plt.bar(...)  # <-- TODO!!
plt.title(f"PCs' EXPLAINED VARIANCE ({round_expl_var_ratio}% OF TOT. EXPL. VAR.)")
plt.xticks(ticks=...,  # <-- TODO!! 
           labels=...,  # <-- TODO!!
           rotation=45)
plt.xlabel('Principal Components')
plt.ylabel('Percentage of Explained variance')
plt.grid()
plt.show()

### Interpretation of the PCs

Now, looking at the contributions of the original features to the PCs, try to give an interpretation of the vectors of the new basis.

**EXERCISE:** Completing the code in the first cell, draw two barplots for each PC representing its vector: the first barplot using the colors of the *skill types*, the second barplot using the colors of the *skill categories*. Moreover, for each PC, print the names of the skills whose contribution exceeds, in absolute value, the threshold $\epsilon = \sqrt{1/n}$ ($n$ is the number of skills).

**EXERCISE:** In the second of the two following cells, write the names that you have decided to assign to the three PCs, respectively.
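
A note on the threshold: each PC is a unit vector, and a unit vector of $\mathbb{R}^n$ whose components all have the same magnitude has entries equal to $\pm\sqrt{1/n}$. Components beyond the threshold therefore weigh more than they would under a uniform contribution. The sketch below applies the threshold to a hypothetical unit vector with $n = 4$ (the vector and the `toy_*` names are made up for illustration).

```python
import numpy as np

# Hypothetical skill names and a hypothetical unit-norm PC vector.
toy_skills = np.array(['Finishing', 'Dribbling', 'Marking', 'GKDiving'])
toy_pc = np.array([0.1, -0.7, 0.7, 0.1])

# Threshold: magnitude of the entries of a "uniform" unit vector.
n = len(toy_skills)
toy_eps = np.sqrt(1 / n)

# Boolean masks select the skills beyond the threshold on each side.
high_pos = toy_skills[toy_pc > toy_eps]
high_neg = toy_skills[toy_pc < -toy_eps]

print(f'eps = {toy_eps}')
print(f'High-valued positive components: {list(high_pos)}')
print(f'High-valued negative components: {list(high_neg)}')
```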

In [None]:
# DEFINE EPSILON
eps = ...  # <-- TODO!!

# DEFINE THE LIST OF SKILL COLORS W.R.T. THE SKILL TYPES AND THE SKILL CATEGORIES
skill_colors_type = ...  # <-- TODO!!
skill_colors_cat = ...  # <-- TODO!!

# MAKE A CUSTOM LEGEND
type_colors_legend = [Line2D([0], [0], color=type_colors[k]) for k in type_colors.keys()]
cat_colors_legend = [Line2D([0], [0], color=cat_colors[k]) for k in cat_colors.keys()]

# FOR LOOP TO GENERALIZE THE PLOT COMMANDS
for ii in range(m):
    # MAKE THE BARPLOT WITH SKILL TYPE COLORS
    plt.figure(figsize=(12, 6))
    plt.bar(...)  # <-- TODO!!
    # --- RED LINE DENOTING THE THRESHOLD [-eps, +eps] -----------------
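    # (n_features_in_ replaces the attribute n_features_, removed in recent scikit-learn versions)
    plt.plot([-0.5, pca.n_features_in_ - 0.5], [eps, eps], 'red')
    plt.plot([-0.5, pca.n_features_in_ - 0.5], [-eps, -eps], 'red')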
    # ------------------------------------------------------------------
    plt.xticks(ticks=...,  # <-- TODO!! 
               labels=...,  # <-- TODO!!
               rotation=75)
    plt.title(f'SKILLS (COLORED BY TYPE) - PC{ii + 1}')
    plt.legend(type_colors_legend, [k for k in type_colors.keys()])
    plt.grid()
    plt.tight_layout()
    plt.show()
    
    # MAKE THE BARPLOT WITH SKILL CATEGORY COLORS
    plt.figure(figsize=(12, 6))
    plt.bar(...)  # <-- TODO!!
    # --- RED LINE DENOTING THE THRESHOLD [-eps, +eps] -----------------
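    # (n_features_in_ replaces the attribute n_features_, removed in recent scikit-learn versions)
    plt.plot([-0.5, pca.n_features_in_ - 0.5], [eps, eps], 'red')
    plt.plot([-0.5, pca.n_features_in_ - 0.5], [-eps, -eps], 'red')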
    # ------------------------------------------------------------------
    plt.xticks(ticks=...,  # <-- TODO!! 
               labels=...,  # <-- TODO!!
               rotation=75)
    plt.title(f'SKILLS (COLORED BY CATEGORY) - PC{ii + 1}')
    plt.legend(cat_colors_legend, [k for k in cat_colors.keys()])
    plt.grid()
    plt.tight_layout()
    plt.show()
    
    # SELECT THE SKILLS WHOSE CONTRIBUTION EXCEEDS THE THRESHOLD
    # ... some lines of code ...  <-- TODO!!
    
    print('')
    print(f'****************** PC{ii+1} **********************')
    print(f'HIGH-VALUED POSITIVE COMPONENTS: {...}')  # <-- TODO!!
    print('')
    print(f'HIGH-VALUED NEGATIVE COMPONENTS: {...}')  # <-- TODO!!
    print('*********************************************')
    print('')

In [None]:
# LIST OF THE NAMES ASSIGNED TO THE THREE PCs
pc_names = ['PC1_NAME',  # <-- TODO!!
           'PC2_NAME',  # <-- TODO!!
           'PC3_NAME'  # <-- TODO!!
           ]

### Score Graphs

**EXERCISE:** plot two score graphs of the football players, colored w.r.t. their general position using the **Set1 colormap** (*cm.Set1.colors*); if you want, you can add a 'color' column to *myfifa_df* (as we did for the DataFrames of skill types/categories).
In particular:
1. plot the score graph in the PC space (i.e., $\mathbb{R}^3$);
1. plot the score graph w.r.t. the projection on the plane of the first two PCs.
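
The coordinates used in a score graph are the projections of the centered data onto the PCs. The following sketch, on synthetic data with made-up `toy_*` names, checks that the `transform` method of a fitted scikit-learn `PCA` object (with the default `whiten=False`) matches this projection computed by hand.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)

# Synthetic data: 100 samples with 6 features.
toy_X = rng.normal(size=(100, 6))

# PCA keeping the first three PCs.
toy_pca = PCA(n_components=3).fit(toy_X)

# Scores computed by scikit-learn ...
toy_Y = toy_pca.transform(toy_X)

# ... and by hand: center the data, then project onto the PCs
# (the rows of components_ are the PC vectors).
manual_Y = (toy_X - toy_pca.mean_) @ toy_pca.components_.T

print(toy_Y.shape)
print(np.allclose(toy_Y, manual_Y))
```

The first two columns of the score matrix are the coordinates of the projection on the plane of the first two PCs.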


In [None]:
# SAVE THE COLORS OF Set1 AS A LIST OF 3-TUPLES "(red_value, green_value, blue_value)"
set1 = cm.Set1.colors

######### SHOWING THE COLORS OF THE CHOSEN COLORMAP #########
display(cm.Set1)
#############################################################

# EXTRACT THE GENERAL POSITIONS FROM THE DATASET
genpos = myfifa_df['GeneralPosition'].unique()

# VISUALIZE THE GENERAL POSITIONS
print('')
print('*************************')
print(f'GENERAL POSITIONS: {list(genpos)}')
print('*************************')
print('')

# DICTIONARY OF THE TYPE {general_pos: color}
genpos_colors = {...}  # <-- TODO!!

# ADDING THE 'color' COLUMN
myfifa_df['color'] = ...  # <-- TODO!!

# -------------------------------------------------------------

# COMPUTE THE DATA TRANSFORMATION INTO THE PC-SPACE
Yfifa = ...  # <-- TODO!!

# MAKE A CUSTOM LEGEND FOR COLORS
genpos_colors_legend = [Line2D([0], [0], color=genpos_colors[k]) for k in genpos_colors.keys()]

# MAKE THE 3D SCORE GRAPH
sg_3d = plt.figure(figsize=(8, 8))
ax_sg_3d = sg_3d.add_subplot(111, projection='3d')
ax_sg_3d.scatter(...)  # <-- TODO!!
plt.title('FOOTBALL PLAYERS - SCORE GRAPH')
ax_sg_3d.set_xlabel(pc_names[0])
ax_sg_3d.set_ylabel(pc_names[1])
ax_sg_3d.set_zlabel(pc_names[2])
plt.legend(genpos_colors_legend, [k for k in genpos_colors.keys()])
plt.grid()
plt.show()

# MAKE THE 2D SCORE GRAPH
plt.figure()
plt.scatter(...)  # <-- TODO!!
plt.title('FOOTBALL PLAYERS - SCORE GRAPH (PROJ. ON R^2)')
plt.xlabel(pc_names[0])
plt.ylabel(pc_names[1])
plt.legend(genpos_colors_legend, [k for k in genpos_colors.keys()])
plt.grid()
plt.show()

#### Very Interesting Observation:

Looking at the 3D score graph from a suitable viewpoint, we can observe that the point cloud of the "field players" lies, approximately, on a plane. Therefore, it could be very interesting to **repeat the PCA on two separate datasets: the Goal-Keepers' dataset and the Field-Players' dataset**.

**OPTIONAL EXERCISE:** create a new notebook where you perform these analyses on separated datasets.

### Analyses of the quality of the PC-Based Representation and Interpretation

If we interpreted the PCs correctly, the "best" players, according to their general positions, are characterized by:
- Goal-Keepers: high positive values of PC1 (G.K. Player) and high positive values of PC3 (Acc. & Ref.);
- Defenders: high negative values of PC1 (Field Player), high positive values of PC2 (Defense), and high positive values of PC3 (Acc. & Ref.);
- Mid-Fielders: high negative values of PC1 (Field Player) and high positive values of PC3 (Acc. & Ref.);
- Forwards: high negative values of PC1 (Field Player), high negative values of PC2 (Attack), and high positive values of PC3 (Acc. & Ref.).

Then, plotting the score graph with colors according to the "Overall" score of the players, the best players should follow these observations.

**Remember:** the authors of FIFA '19 compute the overall score of a player as a weighted mean of the most important skills associated with the player's general position.

**EXERCISE:** plot another score graph in $\mathbb{R}^3$ where the color describes the overall score of the players.

In [None]:
# MAKE THE 3D SCORE GRAPH
sg_overall = plt.figure(figsize=(8, 8))
ax_sg_overall = sg_overall.add_subplot(111, projection='3d')
ovrll_sc = ax_sg_overall.scatter(...)  # <-- TODO!!
plt.title('FOOTBALL PLAYERS - SCORE GRAPH')
ax_sg_overall.set_xlabel(pc_names[0])
ax_sg_overall.set_ylabel(pc_names[1])
ax_sg_overall.set_zlabel(pc_names[2])
plt.colorbar(ovrll_sc)
plt.grid()
plt.show()

## $k$-Means

Now, let's make a cluster analysis of the football players in the PC-space.

**EXERCISE:** Run the $k$-Means algorithm, finding the optimal number of clusters between $3$ and $10$ according to the silhouette coefficient. Then
1. Plot the centroids in the PC-space (if possible, together with the football players);
2. Plot the centroids as barplots and try to give them an interpretation, according to the PC names.

**N.B.:** scikit-learn provides tailored *grid-search* tools that could be used in this case, but they are beyond the scope of this course. Therefore, a simple *for loop* is adopted to find the best clustering w.r.t. the silhouette coefficient.
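
A minimal sketch of such a search, run on synthetic 2D data made of four well-separated blobs. The data, the range of tested values, the seeds, and the names `toy_Y`, `toy_k_list`, `toy_scores`, and `best_k` are all assumptions made for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(2)

# Synthetic data: four tight blobs of 50 points each around distant centers.
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]], dtype=float)
toy_Y = np.vstack([c + 0.5 * rng.normal(size=(50, 2)) for c in centers])

# Candidate numbers of clusters and the corresponding silhouette coefficients.
toy_k_list = list(range(3, 8))
toy_scores = []
for k_try in toy_k_list:
    # n_init is set explicitly because its default value depends on the scikit-learn version.
    km_try = KMeans(n_clusters=k_try, n_init=10, random_state=0).fit(toy_Y)
    toy_scores.append(silhouette_score(toy_Y, km_try.labels_))

# The best k is the one with the highest silhouette coefficient.
best_k = toy_k_list[int(np.argmax(toy_scores))]

print(dict(zip(toy_k_list, np.round(toy_scores, 3))))
print(f'Best k: {best_k}')
```

On a large dataset the silhouette coefficient can be slow to compute; `silhouette_score` accepts a `sample_size` argument to evaluate it on a random subset.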

In [None]:
# SET THE RANDOM STATE (THE LABORATORY DAY)
random_state = 20211220

# INITIALIZE SOME LISTS TO STORE THE TEMPORARY RESULTS AND, THEN, MAKE COMPARISONS
km_list = []
silcoeff_list = []
k_list = list(range(3, 11))

# START THE FOR LOOP TO RUN THE k-MEANS AND MEASURE THE SILHOUETTE COEFFICIENT
for i in range(len(k_list)):
    print(f'****************** START k-MEANS WITH k={k_list[i]} ******************')
    print('Computing...')
    km_list.append(KMeans(...))  # <-- TODO!!
    km = km_list[i]
    ...  # fit the k-means object  <-- TODO!!
    silcoeff_list.append(silhouette_score(...))  # <-- TODO!!
    print(f'****************** END k-MEANS WITH k={k_list[i]} ******************')
    print('')

# FIND THE BEST VALUE OF k AND THE BEST KMeans OBJECT
i_best = np.argmax(...)  # <-- TODO!!
k = k_list[i_best]
km = km_list[i_best]

# VISUALIZE THE RESULT
print('')
print('')
print('****************** RESULTS OF THE SEARCH... ******************')
print(f'BEST SILHOUETTE SCORE: {np.max(...)} --> k = {k}')  # <-- TODO!!
print('**************************************************************')

In [None]:
# MAKE THE 3D SCORE GRAPH WITH THE CENTROIDS
sg_3d_km = plt.figure(figsize=(8, 8))
ax_sg_3d_km = sg_3d_km.add_subplot(111, projection='3d')
ax_sg_3d_km.scatter(...)  # <-- TODO!!
ax_sg_3d_km.scatter(...)  # <-- TODO!!
plt.title('FOOTBALL PLAYERS - SCORE GRAPH')
ax_sg_3d_km.set_xlabel(pc_names[0])
ax_sg_3d_km.set_ylabel(pc_names[1])
ax_sg_3d_km.set_zlabel(pc_names[2])
plt.legend(genpos_colors_legend, [k for k in genpos_colors.keys()])
plt.grid()
plt.show()

In [None]:
# COMPUTE THE MAX/MIN VALUES IN THE PC-SPACE
maxs_y = ...  # <-- TODO!!
mins_y = ...  # <-- TODO!! 

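# MAKE THE BARPLOTS OF THE CENTROIDS
# (THE GRID HAS 2 COLUMNS AND AS MANY ROWS AS NEEDED FOR k CENTROIDS; IT IS 2x2 WHEN k=4)
n_rows = int(np.ceil(k / 2))
fig_centroids, ax_centroids = plt.subplots(n_rows, 2, figsize=(10, 5 * n_rows), squeeze=False)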
for ii in range(k):
    ir = ii // 2
    ic = ii % 2
    ax_centroids[ir, ic].bar(np.arange(km.cluster_centers_.shape[1]), maxs_y, color='blue', alpha=0.15)
    ax_centroids[ir, ic].bar(np.arange(km.cluster_centers_.shape[1]), mins_y, color='blue', alpha=0.15)
    ax_centroids[ir, ic].bar(...)  # <-- TODO!!
    ax_centroids[ir, ic].set_xticks(ticks=...)  # <-- TODO!!
    ax_centroids[ir, ic].set_xticklabels(labels=..., rotation=45)  # <-- TODO!!
    ax_centroids[ir, ic].grid(visible=True, which='both')
    plt.tight_layout()
    ax_centroids[ir, ic].set_title(f'FOOTBALL PLAYERS - CENTROID {ii+1}')

### Interpretation of the Centroids

Looking at the PC values of the centroids in the barplots above, we clearly understand that:
1. Centroid 1 is the representative of a cluster of goal-keepers (GK);
1. Centroid 2 is the representative of a cluster of forwards (FW) and mid-fielders (MF) who are good in attack;
1. Centroid 3 is the representative of a cluster of defenders (DF);
1. Centroid 4 is the representative of a cluster of "full" mid-fielders (MF), partially good also in defense.

### (Fast) Evaluation of the Clustering Results

Looking at the score graph with centroids and at the centroids' barplots, we can do a quick *external evaluation* without quantifying exactly the percentages of general positions in each cluster. Indeed, we clearly see that the clusters characterize the goal-keepers, the defenders, and the forwards very well; only the mid-fielders are scattered among the clusters of centroids 2, 3, and 4. Nonetheless, this fact is consistent with the great variety of mid-fielder types, since some of them are better in attack while others are better in defense and/or in the center of the field.
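
If one does want to quantify the percentages of general positions in each cluster, a contingency table is one possible tool. The sketch below uses `pd.crosstab` on made-up labels; the eight toy players and the names `toy_positions`, `toy_labels`, and `composition` are purely illustrative and are not results of this laboratory.

```python
import pandas as pd

# Hypothetical general positions and cluster labels of eight players.
toy_positions = pd.Series(['GK', 'GK', 'DF', 'DF', 'MF', 'MF', 'FW', 'FW'],
                          name='GeneralPosition')
toy_labels = pd.Series([0, 0, 2, 2, 1, 3, 1, 1], name='cluster')

# Rows: clusters; columns: general positions.
# normalize='index' turns the counts of each row into fractions of that cluster.
composition = 100 * pd.crosstab(toy_labels, toy_positions, normalize='index')

print(composition.round(1))
```

Each row of the table sums to $100$, so it reads as the composition of one cluster in terms of general positions.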

Concerning the *internal evaluation*, in this laboratory we skip a detailed analysis and limit ourselves to using the silhouette coefficient to identify the optimal value of $k$.