# Challenge 5

In this challenge, we practice dimensionality reduction with PCA and feature selection with RFE. We use the [Fifa 2019](https://www.kaggle.com/karangadiya/fifa19) _data set_, which originally contains 89 variables for more than 18,000 players from the FIFA 2019 _game_.

> Note: Please do not rename the answer functions.

## General _setup_

In [2]:
from math import sqrt

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats as st
from sklearn.decomposition import PCA
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

from loguru import logger

In [3]:
fifa = pd.read_csv("data.csv")

In [10]:
columns_to_drop = ["Unnamed: 0", "ID", "Name", "Photo", "Nationality", "Flag",
                   "Club", "Club Logo", "Value", "Wage", "Special", "Preferred Foot",
                   "International Reputation", "Weak Foot", "Skill Moves", "Work Rate",
                   "Body Type", "Real Face", "Position", "Jersey Number", "Joined",
                   "Loaned From", "Contract Valid Until", "Height", "Weight", "LS",
                   "ST", "RS", "LW", "LF", "CF", "RF", "RW", "LAM", "CAM", "RAM", "LM",
                   "LCM", "CM", "RCM", "RM", "LWB", "LDM", "CDM", "RDM", "RWB", "LB", "LCB",
                   "CB", "RCB", "RB", "Release Clause"
]

try:
    fifa.drop(columns_to_drop, axis=1, inplace=True)
except KeyError:
    # drop() raises KeyError if any listed column is missing (e.g. on a rerun)
    logger.warning("Columns already dropped")
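
If the only goal is to make the column-dropping cell safe to rerun, `DataFrame.drop` also accepts `errors="ignore"`. The sketch below uses a hypothetical two-column frame (`toy`), not the FIFA data. Unlike the `try`/`except` version, it still drops the listed columns that do exist when others are missing, and it logs nothing.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real data set
toy = pd.DataFrame({"ID": [1, 2], "Age": [31, 33]})

# "Photo" does not exist in toy; errors="ignore" skips it instead of raising
trimmed = toy.drop(columns=["ID", "Photo"], errors="ignore")

# Running the same call again on the result is a no-op
trimmed_again = trimmed.drop(columns=["ID", "Photo"], errors="ignore")
```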

## Start your analysis from here

In [5]:
fifa.head()

| | Age | Overall | Potential | Crossing | Finishing | HeadingAccuracy | ShortPassing | Volleys | Dribbling | Curve | ... | Penalties | Composure | Marking | StandingTackle | SlidingTackle | GKDiving | GKHandling | GKKicking | GKPositioning | GKReflexes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 31 | 94 | 94 | 84.0 | 95.0 | 70.0 | 90.0 | 86.0 | 97.0 | 93.0 | ... | 75.0 | 96.0 | 33.0 | 28.0 | 26.0 | 6.0 | 11.0 | 15.0 | 14.0 | 8.0 |
| 1 | 33 | 94 | 94 | 84.0 | 94.0 | 89.0 | 81.0 | 87.0 | 88.0 | 81.0 | ... | 85.0 | 95.0 | 28.0 | 31.0 | 23.0 | 7.0 | 11.0 | 15.0 | 14.0 | 11.0 |
| 2 | 26 | 92 | 93 | 79.0 | 87.0 | 62.0 | 84.0 | 84.0 | 96.0 | 88.0 | ... | 81.0 | 94.0 | 27.0 | 24.0 | 33.0 | 9.0 | 9.0 | 15.0 | 15.0 | 11.0 |
| 3 | 27 | 91 | 93 | 17.0 | 13.0 | 21.0 | 50.0 | 13.0 | 18.0 | 21.0 | ... | 40.0 | 68.0 | 15.0 | 21.0 | 13.0 | 90.0 | 85.0 | 87.0 | 88.0 | 94.0 |
| 4 | 27 | 91 | 92 | 93.0 | 82.0 | 55.0 | 92.0 | 82.0 | 86.0 | 85.0 | ... | 79.0 | 88.0 | 68.0 | 58.0 | 51.0 | 15.0 | 13.0 | 5.0 | 10.0 | 13.0 |


In [6]:
fifa.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 18207 entries, 0 to 18206
Data columns (total 37 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   Age              18207 non-null  int64  
 1   Overall          18207 non-null  int64  
 2   Potential        18207 non-null  int64  
 3   Crossing         18159 non-null  float64
 4   Finishing        18159 non-null  float64
 5   HeadingAccuracy  18159 non-null  float64
 6   ShortPassing     18159 non-null  float64
 7   Volleys          18159 non-null  float64
 8   Dribbling        18159 non-null  float64
 9   Curve            18159 non-null  float64
 10  FKAccuracy       18159 non-null  float64
 11  LongPassing      18159 non-null  float64
 12  BallControl      18159 non-null  float64
 13  Acceleration     18159 non-null  float64
 14  SprintSpeed      18159 non-null  float64
 15  Agility          18159 non-null  float64
 16  Reactions        18159 non-null  float64
 17  Balance     

In [7]:
# Count missing values per column, alongside the total number of rows
cons = pd.DataFrame({'colunas': fifa.columns,
                     'missing': fifa.isna().sum(),
                     'size': fifa.shape[0]})
cons

| | colunas | missing | size |
|---|---|---|---|
| Age | Age | 0 | 18207 |
| Overall | Overall | 0 | 18207 |
| Potential | Potential | 0 | 18207 |
| Crossing | Crossing | 48 | 18207 |
| Finishing | Finishing | 48 | 18207 |
| HeadingAccuracy | HeadingAccuracy | 48 | 18207 |
| ShortPassing | ShortPassing | 48 | 18207 |
| Volleys | Volleys | 48 | 18207 |
| Dribbling | Dribbling | 48 | 18207 |
| Curve | Curve | 48 | 18207 |


In [11]:
# Remove every row that has at least one missing value
fifa.dropna(inplace = True)
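
Before dropping rows, it can help to check how many rows are affected, since per-column counts alone do not say whether the gaps sit in the same rows. A minimal sketch on a hypothetical three-row frame (`toy`), not the FIFA data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: the second row is missing two attributes
toy = pd.DataFrame({
    "Age": [31, 33, 26],
    "Crossing": [84.0, np.nan, 79.0],
    "Finishing": [95.0, np.nan, 87.0],
})

# Missing values per column (what the summary table reports)
missing_per_column = toy.isna().sum()

# Number of rows with at least one missing value (what dropna() removes)
rows_with_missing = int(toy.isna().any(axis=1).sum())

# dropna() returns a new frame unless inplace=True is passed
cleaned = toy.dropna()
```

Here two columns each report one missing value, but only one row is removed because both gaps fall in the same row.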

In [10]:
#Instantiating the pca
pca = PCA(n_components = 2)

In [11]:
#Fit the model with fifa and apply the dimensionality reduction on fifa
principal_components = pca.fit_transform(fifa.to_numpy())
principal_components

array([[-126.71792515, -105.58000764],
       [-123.36568604,  -88.98416141],
       [-115.11013638,  -94.77505594],
       ...,
       [  49.83476757,  -44.19006161],
       [  42.8077879 ,  -39.12437286],
       [  37.17937197,   15.60717604]])

In [12]:
#Directions of maximum variance in the data
pca.components_

array([[-6.16388751e-03, -3.70623864e-02, -2.27401748e-02,
        -2.13639023e-01, -1.98891213e-01, -1.70828950e-01,
        -1.80309140e-01, -1.87038764e-01, -2.33139606e-01,
        -2.07690956e-01, -1.84764187e-01, -1.67186902e-01,
        -2.12972623e-01, -1.38740617e-01, -1.34902279e-01,
        -1.38433521e-01, -4.91611013e-02, -1.16410947e-01,
        -1.95840156e-01, -4.07181861e-02, -1.72711671e-01,
        -2.33585866e-02, -2.16594765e-01, -1.53339724e-01,
        -1.50834334e-01, -2.24571087e-01, -1.29586783e-01,
        -1.62548283e-01, -1.01038031e-01, -1.53454113e-01,
        -1.59617493e-01, -1.47955869e-01,  2.06147192e-01,
         1.96645602e-01,  1.91129889e-01,  1.97401130e-01,
         2.08994083e-01],
       [ 8.87203494e-03,  1.58367355e-04, -7.78142440e-03,
        -4.43084573e-02, -2.57629630e-01,  1.18911964e-01,
         1.21869793e-02, -1.91182282e-01, -1.18898465e-01,
        -1.27744634e-01, -1.00178915e-01,  4.89136910e-02,
        -5.12678591e-02, -9.84

In [13]:
# Fraction of the total variance explained by each selected component
pca.explained_variance_ratio_

array([0.56528056, 0.18102522])

## Question 1

What fraction of the variance is explained by the first principal component of `fifa`? Answer with a single float (between 0 and 1) rounded to three decimal places.

In [14]:
def q1():
    
    # Fraction of variance attributed to the first principal component,
    # cast to a plain Python float as the question asks
    return float(round(pca.explained_variance_ratio_[0], 3))

In [15]:
q1()

0.565
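
To see where `explained_variance_ratio_` comes from, the sketch below recomputes it by hand as each covariance eigenvalue divided by the sum of all eigenvalues. It runs on synthetic data (the names `data`, `pca_demo`, `manual_ratio` are illustrative), so the numbers differ from the answer above.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (not the FIFA data set): 200 samples, 3 features of different scale
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

# Eigenvalues of the sample covariance matrix, sorted from largest to smallest
eigenvalues = np.linalg.eigvalsh(np.cov(data, rowvar=False))[::-1]
manual_ratio = eigenvalues / eigenvalues.sum()

# scikit-learn reports the same fractions when all components are kept
pca_demo = PCA().fit(data)
sklearn_ratio = pca_demo.explained_variance_ratio_
```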

## Question 2

How many principal components do we need to explain 95% of the total variance? Answer with a single integer scalar.

In [16]:
def q2():
    
    # A float in (0, 1) asks PCA to keep enough components to explain that fraction of the variance
    pca_095 = PCA(n_components = 0.95)
    
    # Reduce the data set while keeping 95% of the variance
    components = pca_095.fit_transform(fifa.to_numpy())
    
    return components.shape[1]

In [17]:
q2()

15
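
The same count can be read off the cumulative explained variance of a full PCA fit. The sketch below, on synthetic data with illustrative names, compares that manual count against the `n_components_` that scikit-learn picks when given `n_components=0.95`.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (not the FIFA data set): 300 samples, 6 features of decreasing scale
rng = np.random.default_rng(42)
data = rng.normal(size=(300, 6)) * np.array([10.0, 6.0, 3.0, 1.0, 0.5, 0.1])

# Fit with all components and accumulate the explained variance ratios
full_pca = PCA().fit(data)
cumulative = np.cumsum(full_pca.explained_variance_ratio_)

# Index of the first component where the running total exceeds 0.95, plus one
n_needed = int(np.argmax(cumulative > 0.95) + 1)

# Let scikit-learn choose the number of components for the same threshold
n_sklearn = int(PCA(n_components=0.95).fit(data).n_components_)
```

Plotting `cumulative` against the component index is a common way to inspect how quickly the variance accumulates.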

## Question 3

What are the coordinates (first and second principal components) of the point `x` below? The vector below is already centered. Be careful __not__ to center it again (for example, by calling `PCA.transform()` on it). Answer with a tuple of floats rounded to three decimal places.

In [18]:
x = [0.87747123,  -1.24990363,  -1.3191255, -36.7341814,
     -35.55091139, -37.29814417, -28.68671182, -30.90902583,
     -42.37100061, -32.17082438, -28.86315326, -22.71193348,
     -38.36945867, -20.61407566, -22.72696734, -25.50360703,
     2.16339005, -27.96657305, -33.46004736,  -5.08943224,
     -30.21994603,   3.68803348, -36.10997302, -30.86899058,
     -22.69827634, -37.95847789, -22.40090313, -30.54859849,
     -26.64827358, -19.28162344, -34.69783578, -34.6614351,
     48.38377664,  47.60840355,  45.76793876,  44.61110193,
     49.28911284
]

In [19]:
def q3():
    
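    # x is already centered, so project it directly onto the two components
    # instead of calling pca.transform(), which would subtract the mean again
    coor = np.dot(pca.components_, x)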
    
    return tuple(coor.round(3))

In [20]:
q3()

(186.556, -6.592)
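
The warning about centering can be checked on a small example. With the default `whiten=False`, `transform()` subtracts the fitted `mean_` and then projects onto `components_`, so a point that is already centered should be projected with a plain dot product. The data and names below are synthetic and illustrative.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (not the FIFA data set) with a mean far from zero
rng = np.random.default_rng(1)
data = rng.normal(loc=50.0, scale=10.0, size=(100, 4))

pca_demo = PCA(n_components=2).fit(data)

raw_point = data[0]
centered_point = raw_point - pca_demo.mean_

# transform() centers internally, so it expects the raw point
via_transform = pca_demo.transform(raw_point.reshape(1, -1))[0]

# For a point that is already centered, project directly onto the components
via_dot = pca_demo.components_ @ centered_point

# Passing the centered point to transform() subtracts the mean a second time
double_centered = pca_demo.transform(centered_point.reshape(1, -1))[0]
```

`via_transform` and `via_dot` agree, while `double_centered` is shifted by the projection of the mean.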

## Question 4

Run RFE with a linear regression estimator to select five variables, eliminating them one at a time. Which variables are selected? Answer with a list of variable names.

In [12]:
# Separate the features (Index.difference returns the remaining column names sorted)
X = fifa[fifa.columns.difference(['Overall'])]

#Separating out the target
y = fifa['Overall']

In [13]:
# Define RFE with linear regression as the estimator; step=1 (the default)
# eliminates one feature per iteration
rfe = RFE(LinearRegression(), n_features_to_select=5, step=1)

#Fit RFE
rfe = rfe.fit(X, y)

ValueError: Found array with 0 sample(s) (shape=(0, 36)) while a minimum of 1 is required.

In [7]:
# Create a DataFrame to summarize the results
df = pd.DataFrame({'Columns': X.columns, 'Selected': rfe.support_, 'Rank': rfe.ranking_})

NameError: name 'rfe' is not defined

In [24]:
def q4():
    
    # Keep only the five features selected by RFE
    return list(df.loc[df['Selected'], 'Columns'])

In [26]:
q4()

['Age', 'BallControl', 'GKReflexes', 'Potential', 'Reactions']

Just another way to select the *features* :)

In [25]:
selected = []

# Keep each column whose RFE support flag is set
for position, column in enumerate(X.columns):
    if rfe.support_[position]:
        selected.append(column)
        
selected

['Age', 'BallControl', 'GKReflexes', 'Potential', 'Reactions']
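
Because `rfe.support_` is a boolean mask aligned with the columns passed to `fit`, it can also index the column names directly. The sketch below shows this on synthetic data in which, by construction, only the hypothetical columns `a` and `d` drive the target; all names are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (not the FIFA data set): six standard-normal features
rng = np.random.default_rng(7)
features = pd.DataFrame(rng.normal(size=(200, 6)), columns=list("abcdef"))

# The target depends only on "a" and "d", plus a little noise
target = 3.0 * features["a"] - 2.0 * features["d"] + rng.normal(scale=0.1, size=200)

# Eliminate one feature per iteration until two remain
rfe_demo = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(features, target)

# support_ is a boolean mask over the columns; ranking_ is 1 for selected features
selected_demo = list(features.columns[rfe_demo.support_])
ranking_demo = dict(zip(features.columns, rfe_demo.ranking_))
```

With a linear regression estimator, RFE ranks features by coefficient magnitude, so features on very different scales may be worth standardizing first; that is left out of this sketch.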

### References:
1. [rfe-feature-selection-in-python](https://machinelearningmastery.com/rfe-feature-selection-in-python/)
2. [step-forward-feature-selection-python](https://www.kdnuggets.com/2018/06/step-forward-feature-selection-python.html)
3. [pca-using-python-scikit-learn](https://towardsdatascience.com/pca-using-python-scikit-learn-e653f8989e60)