# Challenge 5

In this challenge, we practice dimensionality reduction with PCA and feature selection with RFE. We use the [Fifa 2019](https://www.kaggle.com/karangadiya/fifa19) data set, which originally contains 89 variables for more than 18 thousand players from the game FIFA 2019.

> Note: Please do not rename the answer functions.

## General setup

In [1]:
from math import sqrt

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats as st
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

from loguru import logger

In [2]:
# Some matplotlib settings.
#%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [3]:
fifa = pd.read_csv("fifa.csv")

In [4]:
columns_to_drop = ["Unnamed: 0", "ID", "Name", "Photo", "Nationality", "Flag",
                   "Club", "Club Logo", "Value", "Wage", "Special", "Preferred Foot",
                   "International Reputation", "Weak Foot", "Skill Moves", "Work Rate",
                   "Body Type", "Real Face", "Position", "Jersey Number", "Joined",
                   "Loaned From", "Contract Valid Until", "Height", "Weight", "LS",
                   "ST", "RS", "LW", "LF", "CF", "RF", "RW", "LAM", "CAM", "RAM", "LM",
                   "LCM", "CM", "RCM", "RM", "LWB", "LDM", "CDM", "RDM", "RWB", "LB", "LCB",
                   "CB", "RCB", "RB", "Release Clause"
]

try:
    fifa.drop(columns_to_drop, axis=1, inplace=True)
except KeyError:
    logger.warning("Columns already dropped")

## Start your analysis here

In [5]:
# Drop every row that still has a missing value; PCA and RFE cannot handle NaN.
fifa.dropna(inplace=True)

## Question 1

What fraction of the variance is explained by the first principal component of `fifa`? Answer as a single float (between 0 and 1) rounded to three decimal places.

In [6]:
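def q1():
    # Fit PCA with all components and read each component's share of the variance.
    pca = PCA().fit(fifa)
    evr = pca.explained_variance_ratio_
    # Components are sorted by explained variance, so index 0 is the first one.
    return float(np.round(evr[0], 3))
q1()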

0.565
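To make the meaning of `explained_variance_ratio_` more concrete, here is a minimal NumPy-only sketch of how such ratios can be derived from the singular values of the centered data. The data is synthetic (an assumption for illustration, not the FIFA table), and the variable names are my own.

```python
import numpy as np

# Synthetic data (assumption): 200 observations of 4 features
# with deliberately different spreads.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4)) * np.array([10.0, 5.0, 1.0, 0.5])

# PCA works on centered data.
X_centered = X - X.mean(axis=0)

# Singular values of the centered matrix, returned in descending order.
singular_values = np.linalg.svd(X_centered, compute_uv=False)

# Variance captured by each component and its share of the total variance.
explained_variance = singular_values ** 2 / (X.shape[0] - 1)
explained_variance_ratio = explained_variance / explained_variance.sum()

# Share of the first principal component, rounded as in the question.
first_component_share = float(np.round(explained_variance_ratio[0], 3))
print(first_component_share)
```

The ratios always sum to 1, and the total explained variance matches the sum of the per-feature sample variances, which is why the first ratio can be read as a fraction of the total.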

## Question 2

How many principal components do we need to explain 95% of the total variance? Answer as a single integer scalar.

In [7]:
def q2():
    # A float n_components between 0 and 1 is read as the minimum fraction
    # of variance that the retained components must explain.
    reduced = PCA(n_components=0.95).fit_transform(fifa)
    # The transformed data has one column per retained component.
    return int(reduced.shape[1])
q2()

15
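The same count can be read off the cumulative sum of the explained variance ratios. The sketch below assumes a hypothetical helper, `components_for_variance`, and uses made-up ratios rather than the ones from `fifa`.

```python
import numpy as np

def components_for_variance(ratios, threshold=0.95):
    """Smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(ratios)
    reached = cumulative >= threshold
    if not reached.any():
        # Threshold never reached (e.g. rounding error): keep every component.
        return len(cumulative)
    # argmax returns the index of the first True value.
    return int(np.argmax(reached) + 1)

# Made-up ratios (assumption), already sorted in descending order.
ratios = [0.60, 0.25, 0.08, 0.04, 0.03]
print(components_for_variance(ratios))  # 4
```

With these ratios the cumulative sums are roughly 0.60, 0.85, 0.93, 0.97 and 1.00, so the fourth component is the first to push the total past 0.95.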

## Question 3

What are the coordinates (first and second principal components) of the point `x` below? The vector below is already centered. Be careful __not__ to center the vector again (for example, by calling `PCA.transform()` on it). Answer as a tuple of floats rounded to three decimal places.

In [8]:
x = [0.87747123,  -1.24990363,  -1.3191255, -36.7341814,
     -35.55091139, -37.29814417, -28.68671182, -30.90902583,
     -42.37100061, -32.17082438, -28.86315326, -22.71193348,
     -38.36945867, -20.61407566, -22.72696734, -25.50360703,
     2.16339005, -27.96657305, -33.46004736,  -5.08943224,
     -30.21994603,   3.68803348, -36.10997302, -30.86899058,
     -22.69827634, -37.95847789, -22.40090313, -30.54859849,
     -26.64827358, -19.28162344, -34.69783578, -34.6614351,
     48.38377664,  47.60840355,  45.76793876,  44.61110193,
     49.28911284
]

`pca.components_` holds one vector of coefficients per principal component: the first vector contains the coefficients of $PC1$ and the second those of $PC2$. Each principal component is a linear function that maps the original (centered) feature values to a single coordinate: every coefficient is multiplied by its feature value and the products are summed.

For instance, for an observation with features $feat_1$, $feat_2$ and $feat_3$ and a PCA with two components, I get two functions with the following structure: $PC1 = coef_1 \times feat_1 + coef_2 \times feat_2 + coef_3 \times feat_3$ and $PC2 = coef_4 \times feat_1 + coef_5 \times feat_2 + coef_6 \times feat_3$. The vectors in `pca.components_` are then $[coef_1, coef_2, coef_3]$ and $[coef_4, coef_5, coef_6]$.

So, to get the $PC1$ and $PC2$ coordinates, all I have to do is multiply each coefficient vector element-wise by the feature vector $[feat_1, feat_2, feat_3]$ and sum the elements of the resulting vector, which is the dot product of the two vectors :) And that's what I did! ⤵

In [9]:
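def q3():
    pca = PCA(n_components=2).fit(fifa)
    # Multiply each component's coefficients by x and sum per component.
    # x is already centered, so pca.transform() is deliberately not used.
    coordinates = (pca.components_ * x).sum(axis=1)
    return tuple(float(c) for c in np.round(coordinates, 3))
q3()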

(186.556, -6.592)
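The weighted-sum view of the projection can be checked on a small synthetic example. This is a NumPy-only sketch under my own assumptions: the data, the offset and the names `components` and `x_point` are illustrative, and the principal directions come from an SVD, so their signs may differ from what scikit-learn reports.

```python
import numpy as np

# Synthetic correlated data with a non-zero mean (assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3)) @ rng.normal(size=(3, 3)) + np.array([5.0, -3.0, 2.0])
mean = X.mean(axis=0)
X_centered = X - mean

# Rows of vt are the principal directions (the analogue of pca.components_).
_, _, vt = np.linalg.svd(X_centered, full_matrices=False)
components = vt[:2]

# A point that is already centered, like x in the question.
x_point = X_centered[0]

# Weighted sum: multiply coefficients by features, then sum per component.
by_sum = (components * x_point).sum(axis=1)
# The same computation written as a matrix-vector product.
by_dot = components @ x_point

# Centering a second time shifts the point and gives different coordinates.
recentered = components @ (x_point - mean)

print(by_sum, by_dot, recentered)
```

Both `by_sum` and `by_dot` give the same two coordinates, while `recentered` shows why the question warns against centering the vector again.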

In [10]:
0.87747123 * 0.00616389

In [11]:
pca = PCA(n_components = 2).fit(fifa)
result = pca.components_ * x
result[0]

array([-5.40863395e-03,  4.63244113e-02,  2.99971444e-02,  7.84785462e+00,
        7.07076388e+00,  6.37160281e+00,  5.17247632e+00,  5.78118598e+00,
        9.87835839e+00,  6.68158928e+00,  5.33287706e+00,  3.79713779e+00,
        8.17164425e+00,  2.86000958e+00,  3.06591968e+00,  3.53055412e+00,
       -1.06354637e-01,  3.25561525e+00,  6.55282089e+00,  2.07232449e-01,
        5.21933738e+00, -8.61472496e-02,  7.82123110e+00,  4.73344251e+00,
        3.42367940e+00,  8.52437663e+00,  2.90286098e+00,  4.96562222e+00,
        2.69248910e+00,  2.95884441e+00,  5.53838158e+00,  5.12836276e+00,
        9.97417970e+00,  9.36198318e+00,  8.74762106e+00,  8.80628195e+00,
        1.03011330e+01])

In [12]:
fifa.columns

Index(['Age', 'Overall', 'Potential', 'Crossing', 'Finishing',
       'HeadingAccuracy', 'ShortPassing', 'Volleys', 'Dribbling', 'Curve',
       'FKAccuracy', 'LongPassing', 'BallControl', 'Acceleration',
       'SprintSpeed', 'Agility', 'Reactions', 'Balance', 'ShotPower',
       'Jumping', 'Stamina', 'Strength', 'LongShots', 'Aggression',
       'Interceptions', 'Positioning', 'Vision', 'Penalties', 'Composure',
       'Marking', 'StandingTackle', 'SlidingTackle', 'GKDiving', 'GKHandling',
       'GKKicking', 'GKPositioning', 'GKReflexes'],
      dtype='object')

## Question 4

Perform RFE with a linear regression estimator to select five variables, eliminating them one at a time. Which variables are selected? Answer as a list of variable names.

In [13]:
def q4():
    x_features_train = fifa.drop(columns="Overall")  # drop the target column
    y_target_train = fifa["Overall"]
    # step=1 removes one feature per iteration until five remain.
    rfe = RFE(LinearRegression(), n_features_to_select=5, step=1)
    rfe.fit(x_features_train, y_target_train)
    # rfe.support_ is a boolean mask over the feature columns.
    return list(x_features_train.columns[rfe.support_])
q4()

['Age', 'Potential', 'BallControl', 'Reactions', 'GKReflexes']

*rfe.support_* returns a boolean mask indicating the selected features. Indexing the columns of the feature DataFrame with that mask gives the names of the selected features directly, so there is no need to transpose the DataFrame, filter its rows and transpose it back.
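To illustrate what "eliminating one at a time" means, here is a hand-rolled sketch of recursive feature elimination with ordinary least squares. It is a simplified stand-in written with NumPy only, not scikit-learn's implementation; the function `rfe_linear`, the synthetic data and the feature names are all assumptions for illustration.

```python
import numpy as np

def rfe_linear(X, y, n_features_to_select):
    """Repeatedly fit least squares and drop the feature with the smallest coefficient."""
    remaining = list(range(X.shape[1]))
    while len(remaining) > n_features_to_select:
        # Fit y ~ intercept + remaining features.
        design = np.column_stack([np.ones(len(X)), X[:, remaining]])
        coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
        # Ignore the intercept and remove the weakest feature.
        weakest = int(np.argmin(np.abs(coefs[1:])))
        del remaining[weakest]
    return remaining

# Synthetic data (assumption): only features 0 and 2 drive the target.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + 0.01 * rng.normal(size=200)

selected = rfe_linear(X, y, n_features_to_select=2)
feature_names = np.array(["a", "b", "c", "d", "e"])  # hypothetical names
print(feature_names[selected].tolist())  # ['a', 'c']
```

Because the ranking relies on coefficient magnitude, it depends on the scale of each feature, which is worth keeping in mind when reading the selection above.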