<a href="https://colab.research.google.com/github/datascience-uniandes/data-analysis-tutorial/blob/master/fifa/dim-reduction.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Dimensionality Reduction for high-dimensional data visualization

MINE-4101: Applied Data Science  
Universidad de los Andes  
  
Dataset: FIFA
  
Last update: September, 2022

In [None]:
#!pip install umap-learn

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

#from umap import UMAP

In [None]:
pd.set_option('display.max_rows', 100)
pd.set_option('display.max_columns', 100)

### Loading the data

In [None]:
fifa_df = pd.read_csv('./data/fifa_clean.csv')

In [None]:
fifa_df.shape

In [None]:
fifa_df.dtypes

In [None]:
fifa_df.head()

### Choosing features for dimensionality reduction

In [None]:
# Filtering the column list by index
player_attributes = fifa_df.columns[12:46]

In [None]:
player_attributes

In [None]:
# Convert the selected features from string to int.
# Why is this required? Some values carry a modifier, like '80+9' or '70-3'.
# Note: '-' is replaced by '+', so both parts are added ('70-3' becomes 73).

attribute2int = lambda x: sum(int(i) for i in x.replace('-', '+').split('+')) if isinstance(x, str) else x

for attribute in player_attributes:
    print('Transforming', attribute)
    fifa_df[attribute] = fifa_df[attribute].apply(attribute2int)
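
As a quick check of the parsing rule above, the sketch below applies the same logic to a few hand-written values. The sample strings are illustrative, not rows from the dataset, and `parse_attribute` is a hypothetical named equivalent of `attribute2int`.

```python
def parse_attribute(value):
    """Named equivalent of the notebook's attribute2int lambda (hypothetical helper)."""
    if isinstance(value, str):
        # '-' is treated like '+', so every part of the string is added
        return sum(int(part) for part in value.replace('-', '+').split('+'))
    # Non-string values (already numeric, or missing) pass through unchanged
    return value

samples = ['80+9', '70-3', '65', 72]  # illustrative values, not real rows
parsed = [parse_attribute(v) for v in samples]
print(parsed)  # [89, 73, 65, 72]
```

If the number after `-` is meant to be subtracted rather than added, this rule would need to change; the notebook as written adds it.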

### Dimensionality reduction using PCA

In [None]:
pca = PCA(n_components = 2, random_state = 0)

In [None]:
pca_dimensions = pca.fit_transform(fifa_df[player_attributes])

In [None]:
pca_dimensions

In [None]:
pca.explained_variance_ratio_
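
`explained_variance_ratio_` holds the fraction of the total variance captured by each component, so its sum indicates how much of the original spread the 2D projection retains. The sketch below illustrates this on synthetic data; the random matrix `X` is only a stand-in for the player attributes, and `demo_pca` is a hypothetical name used to avoid overwriting `pca`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the attribute matrix: 200 rows, 5 correlated columns
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 2))
noise = 0.1 * rng.normal(size=(200, 3))
X = np.hstack([base, base @ rng.normal(size=(2, 3)) + noise])

demo_pca = PCA(n_components=2, random_state=0)
demo_pca.fit(X)

ratios = demo_pca.explained_variance_ratio_  # one value per component
retained = ratios.sum()  # share of total variance kept by the two components
print(ratios, retained)
```

A low retained share would suggest that the scatter plot hides much of the structure in the original attributes.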

In [None]:
plt.figure(figsize = (10, 7))
# s sets a fixed marker size; size is reserved for a semantic grouping variable
sns.scatterplot(x = pca_dimensions[:,0], y = pca_dimensions[:,1], hue = fifa_df['Preferred Position'], s = 10)
plt.legend(loc = 'upper right')
plt.show()

### Dimensionality reduction using t-SNE

In [None]:
tsne = TSNE(perplexity = 30, random_state = 1)

In [None]:
tsne_dimensions = tsne.fit_transform(fifa_df[player_attributes])

In [None]:
tsne_dimensions

In [None]:
plt.figure(figsize = (10, 7))
# s sets a fixed marker size; size is reserved for a semantic grouping variable
sns.scatterplot(x = tsne_dimensions[:,0], y = tsne_dimensions[:,1], hue = fifa_df['Preferred Position'], s = 10)
plt.legend(loc = 'upper right')
plt.show()

### Dimensionality reduction using UMAP

Clusters are difficult to see because of the high cardinality of the *Preferred Position* attribute used to encode color.

Next step:
- Look for a strategy to better group positions
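
One possible strategy is to collapse individual positions into a few broad roles before coloring the plot. The sketch below assumes the usual FIFA position abbreviations and a hypothetical four-group mapping (`POSITION_GROUPS` and `group_position` are not part of the original notebook); the actual values in *Preferred Position* should be checked first.

```python
# Hypothetical mapping from position codes to broad roles (an assumption)
POSITION_GROUPS = {
    'GK': 'Goalkeeper',
    'CB': 'Defender', 'LB': 'Defender', 'RB': 'Defender',
    'LWB': 'Defender', 'RWB': 'Defender',
    'CDM': 'Midfielder', 'CM': 'Midfielder', 'CAM': 'Midfielder',
    'LM': 'Midfielder', 'RM': 'Midfielder',
    'ST': 'Forward', 'CF': 'Forward', 'LW': 'Forward', 'RW': 'Forward',
}

def group_position(position):
    """Return the broad role for a position code, or 'Other' if unknown."""
    if not isinstance(position, str):
        return 'Other'  # covers missing values such as NaN
    return POSITION_GROUPS.get(position.strip().upper(), 'Other')

print([group_position(p) for p in ['ST', 'cb ', 'GK', 'XX']])
# ['Forward', 'Defender', 'Goalkeeper', 'Other']
```

With a mapping like this, `fifa_df['Preferred Position'].map(group_position)` could be passed as `hue`, reducing the legend to a handful of colors.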