# Challenge 5

In this challenge, we practice dimensionality reduction with PCA and feature selection with RFE. We use the [Fifa 2019](https://www.kaggle.com/karangadiya/fifa19) data set, which originally contains 89 variables for more than 18 thousand players from the game FIFA 2019.

> Note: Please do not rename the answer functions.

## General setup

In [1]:
from math import sqrt

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as sct
import seaborn as sns
import statsmodels.api as sm
import statsmodels.stats as st
from sklearn.decomposition import PCA

from loguru import logger


pd.set_option('display.max_columns', None)

In [2]:
# Some matplotlib settings.
#%matplotlib inline

from IPython.core.pylabtools import figsize


figsize(12, 8)

sns.set()

In [3]:
fifa = pd.read_csv("fifa.csv")

In [4]:
columns_to_drop = ["Unnamed: 0", "ID", "Name", "Photo", "Nationality", "Flag",
                   "Club", "Club Logo", "Value", "Wage", "Special", "Preferred Foot",
                   "International Reputation", "Weak Foot", "Skill Moves", "Work Rate",
                   "Body Type", "Real Face", "Position", "Jersey Number", "Joined",
                   "Loaned From", "Contract Valid Until", "Height", "Weight", "LS",
                   "ST", "RS", "LW", "LF", "CF", "RF", "RW", "LAM", "CAM", "RAM", "LM",
                   "LCM", "CM", "RCM", "RM", "LWB", "LDM", "CDM", "RDM", "RWB", "LB", "LCB",
                   "CB", "RCB", "RB", "Release Clause"
]

try:
    fifa.drop(columns_to_drop, axis=1, inplace=True)
except KeyError:
    logger.warning("Columns already dropped")

## Start your analysis here
### Exploring the dataset

In [5]:
# Helper functions for exploring the dataset
from IPython.display import display

def missing_data(df):
    # Note: despite the "(%)" label, the second column holds a fraction in [0, 1].
    return pd.DataFrame({'Tipo': df.dtypes, 'Dados faltantes (%)': df.isna().sum() / df.shape[0]})

def sumary_df(df):
    print("Informações básicas do dataset")
    print("\nFormato:", df.shape)
    display(df.head(5))

    print("\nPercentual de dados faltantes:")
    display(missing_data(df).T)

    print("\nEstatísticas das features:")
    display(df.describe())
    
def converte_float(num, casas_decimais):
    # Round to the requested number of decimal places and return a plain Python float.
    return float(round(num, casas_decimais))

In [6]:
# Load the dataframe and display its summary
df = fifa
sumary_df(df)

Informações básicas do dataset

Formato: (18207, 37)


Unnamed: 0,Age,Overall,Potential,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
0,31,94,94,84.0,95.0,70.0,90.0,86.0,97.0,93.0,94.0,87.0,96.0,91.0,86.0,91.0,95.0,95.0,85.0,68.0,72.0,59.0,94.0,48.0,22.0,94.0,94.0,75.0,96.0,33.0,28.0,26.0,6.0,11.0,15.0,14.0,8.0
1,33,94,94,84.0,94.0,89.0,81.0,87.0,88.0,81.0,76.0,77.0,94.0,89.0,91.0,87.0,96.0,70.0,95.0,95.0,88.0,79.0,93.0,63.0,29.0,95.0,82.0,85.0,95.0,28.0,31.0,23.0,7.0,11.0,15.0,14.0,11.0
2,26,92,93,79.0,87.0,62.0,84.0,84.0,96.0,88.0,87.0,78.0,95.0,94.0,90.0,96.0,94.0,84.0,80.0,61.0,81.0,49.0,82.0,56.0,36.0,89.0,87.0,81.0,94.0,27.0,24.0,33.0,9.0,9.0,15.0,15.0,11.0
3,27,91,93,17.0,13.0,21.0,50.0,13.0,18.0,21.0,19.0,51.0,42.0,57.0,58.0,60.0,90.0,43.0,31.0,67.0,43.0,64.0,12.0,38.0,30.0,12.0,68.0,40.0,68.0,15.0,21.0,13.0,90.0,85.0,87.0,88.0,94.0
4,27,91,92,93.0,82.0,55.0,92.0,82.0,86.0,85.0,83.0,91.0,91.0,78.0,76.0,79.0,91.0,77.0,91.0,63.0,90.0,75.0,91.0,76.0,61.0,87.0,94.0,79.0,88.0,68.0,58.0,51.0,15.0,13.0,5.0,10.0,13.0



Percentual de dados faltantes:


Unnamed: 0,Age,Overall,Potential,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
Tipo,int64,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
Dados faltantes (%),0,0,0,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635,0.00263635



Estatísticas das features:


Unnamed: 0,Age,Overall,Potential,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
count,18207.0,18207.0,18207.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0,18159.0
mean,25.122206,66.238699,71.307299,49.734181,45.550911,52.298144,58.686712,42.909026,55.371001,47.170824,42.863153,52.711933,58.369459,64.614076,64.726967,63.503607,61.83661,63.966573,55.460047,65.089432,63.219946,65.311967,47.109973,55.868991,46.698276,49.958478,53.400903,48.548598,58.648274,47.281623,47.697836,45.661435,16.616223,16.391596,16.232061,16.388898,16.710887
std,4.669943,6.90893,6.136496,18.364524,19.52582,17.379909,14.699495,17.694408,18.910371,18.395264,17.478763,15.32787,16.686595,14.92778,14.649953,14.766049,9.010464,14.136166,17.237958,11.820044,15.894741,12.557,19.260524,17.367967,20.696909,19.529036,14.146881,15.704053,11.436133,19.904397,21.664004,21.289135,17.695349,16.9069,16.502864,17.034669,17.955119
min,16.0,46.0,48.0,5.0,2.0,4.0,7.0,4.0,4.0,6.0,3.0,9.0,5.0,12.0,12.0,14.0,21.0,16.0,2.0,15.0,12.0,17.0,3.0,11.0,3.0,2.0,10.0,5.0,3.0,3.0,2.0,3.0,1.0,1.0,1.0,1.0,1.0
25%,21.0,62.0,67.0,38.0,30.0,44.0,54.0,30.0,49.0,34.0,31.0,43.0,54.0,57.0,57.0,55.0,56.0,56.0,45.0,58.0,56.0,58.0,33.0,44.0,26.0,38.0,44.0,39.0,51.0,30.0,27.0,24.0,8.0,8.0,8.0,8.0,8.0
50%,25.0,66.0,71.0,54.0,49.0,56.0,62.0,44.0,61.0,48.0,41.0,56.0,63.0,67.0,67.0,66.0,62.0,66.0,59.0,66.0,66.0,67.0,51.0,59.0,52.0,55.0,55.0,49.0,60.0,53.0,55.0,52.0,11.0,11.0,11.0,11.0,11.0
75%,28.0,71.0,75.0,64.0,62.0,64.0,68.0,57.0,68.0,62.0,57.0,64.0,69.0,75.0,75.0,74.0,68.0,74.0,68.0,73.0,74.0,74.0,62.0,69.0,64.0,64.0,64.0,60.0,67.0,64.0,66.0,64.0,14.0,14.0,14.0,14.0,14.0
max,45.0,94.0,95.0,93.0,95.0,94.0,93.0,90.0,97.0,94.0,94.0,93.0,96.0,97.0,96.0,96.0,96.0,96.0,95.0,95.0,96.0,97.0,94.0,95.0,92.0,95.0,94.0,92.0,96.0,94.0,93.0,91.0,90.0,92.0,91.0,90.0,94.0


Because the share of missing values is low (a fraction of about 0.0026, that is, roughly 0.26% of the rows), rows containing missing values are dropped.

In [7]:
df.dropna(inplace=True)
print("Formato pós limpeza:", df.shape)
display(missing_data(df).T)

Formato pós limpeza: (18159, 37)


Unnamed: 0,Age,Overall,Potential,Crossing,Finishing,HeadingAccuracy,ShortPassing,Volleys,Dribbling,Curve,FKAccuracy,LongPassing,BallControl,Acceleration,SprintSpeed,Agility,Reactions,Balance,ShotPower,Jumping,Stamina,Strength,LongShots,Aggression,Interceptions,Positioning,Vision,Penalties,Composure,Marking,StandingTackle,SlidingTackle,GKDiving,GKHandling,GKKicking,GKPositioning,GKReflexes
Tipo,int64,int64,int64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64,float64
Dados faltantes (%),0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
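
The table above reports missing data as a fraction, which is easy to misread as a percentage. A minimal sketch of the difference, using a tiny hypothetical frame (`demo` and its values are made up for illustration and are not part of `fifa`):

```python
import numpy as np
import pandas as pd

# Tiny hypothetical frame: 1 of its 4 rows has missing values.
demo = pd.DataFrame({"a": [1.0, 2.0, np.nan, 4.0],
                     "b": [1.0, 2.0, np.nan, 4.0]})

# Share of missing entries per column, as a fraction in [0, 1].
missing_fraction = demo.isna().mean()
# The same quantity expressed in percent.
missing_percent = missing_fraction * 100

# dropna() removes every row that has at least one missing value.
cleaned = demo.dropna()

print(missing_percent.to_dict(), cleaned.shape)
```

Here one row in four is missing, so the fraction is 0.25 and the percentage is 25.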


## Question 1

What fraction of the variance is explained by the first principal component of `fifa`? Answer with a single float (between 0 and 1) rounded to three decimal places.

In [8]:
def q1():
    pca = PCA().fit(df)
    evr = pca.explained_variance_ratio_
    resp = converte_float(evr[0], 3)
    return resp

In [9]:
print("Resposta da questão 1: ", q1())

Resposta da questão 1:  0.565
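
For intuition, the explained variance ratio is each eigenvalue of the sample covariance matrix divided by the sum of all eigenvalues. The sketch below reproduces that calculation with NumPy alone on synthetic data; the array `X_demo` and its column scales are assumptions chosen for illustration, not values from `fifa`.

```python
import numpy as np

# Synthetic data (illustrative assumption): three independent columns with
# standard deviations of roughly 10, 2 and 1.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(1000, 3)) * np.array([10.0, 2.0, 1.0])

# The eigenvalues of the sample covariance matrix are the variances along the
# principal axes. eigvalsh returns them in ascending order, so reverse them.
eigenvalues = np.linalg.eigvalsh(np.cov(X_demo, rowvar=False))[::-1]

# Fraction of the total variance carried by each principal component.
explained_ratio = eigenvalues / eigenvalues.sum()
print(explained_ratio.round(3))
```

Because the first column has by far the largest variance, the first ratio dominates, which is the same effect that makes one component account for more than half of the variance in the answer above.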


## Question 2

How many principal components are needed to explain 95% of the total variance? Answer with a single integer scalar.

In [10]:
def q2():
    pca_095 = PCA(n_components=0.95)
    X_reduced = pca_095.fit_transform(df)
    return X_reduced.shape[1]

In [11]:
print("Resposta da questão 2: ", q2())

Resposta da questão 2:  15
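
Passing a float to `n_components` lets scikit-learn pick the count automatically. The same idea can be sketched by hand with a cumulative sum over the explained variance ratios. The `ratios` array below is hypothetical and only serves to show the mechanics.

```python
import numpy as np

# Hypothetical explained-variance ratios, sorted in descending order.
ratios = np.array([0.60, 0.20, 0.10, 0.06, 0.04])

# Running total of explained variance as components are added.
cumulative = np.cumsum(ratios)

# Index of the first running total that exceeds 0.95, plus one to turn the
# zero-based index into a number of components.
n_components = int(np.argmax(cumulative > 0.95) + 1)
print(n_components)  # 4
```

With real data, `ratios` would be `PCA().fit(df).explained_variance_ratio_`.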


## Question 3

What are the coordinates (first and second principal components) of the point `x` below? The vector is already centered. Be careful __not__ to center it again (for example, by calling `PCA.transform()` on it). Answer with a tuple of floats rounded to three decimal places.

In [12]:
x = [0.87747123,  -1.24990363,  -1.3191255, -36.7341814,
     -35.55091139, -37.29814417, -28.68671182, -30.90902583,
     -42.37100061, -32.17082438, -28.86315326, -22.71193348,
     -38.36945867, -20.61407566, -22.72696734, -25.50360703,
     2.16339005, -27.96657305, -33.46004736,  -5.08943224,
     -30.21994603,   3.68803348, -36.10997302, -30.86899058,
     -22.69827634, -37.95847789, -22.40090313, -30.54859849,
     -26.64827358, -19.28162344, -34.69783578, -34.6614351,
     48.38377664,  47.60840355,  45.76793876,  44.61110193,
     49.28911284
]

In [17]:
def q3():
    pca = PCA().fit(df)
    result = pca.components_.dot(x).round(3)
    return (result[0], result[1])

In [18]:
print("Resposta da questão 3: ", q3())

Resposta da questão 3:  (186.556, -6.592)
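
The warning about centering can be checked on synthetic data: projecting an already-centered vector onto `components_` gives the same coordinates as calling `PCA.transform()` on the raw, uncentered point. The data and names below (`X_demo`, `point`) are illustrative assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a non-zero mean (illustrative only).
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=5.0, size=(200, 4))
pca = PCA(n_components=2).fit(X_demo)

point = X_demo[0]
centered = point - pca.mean_

# An already-centered vector only needs a dot product with the principal axes.
manual = pca.components_.dot(centered)

# PCA.transform subtracts the fitted mean internally, so it expects the raw point.
via_transform = pca.transform(point.reshape(1, -1))[0]

print(np.allclose(manual, via_transform))
```

Calling `pca.transform` on `centered` instead would subtract the mean a second time and shift the coordinates.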


## Question 4

Run RFE with a linear regression estimator to select five variables, eliminating them one at a time. Which variables are selected? Answer with a list of variable names.

In [24]:
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
    
def q4():
    # Use Overall as the target variable
    X = df.drop(columns='Overall')
    y = df['Overall']

    # Build the RFE selector, removing one feature per iteration, and fit it
    rfe = RFE(LinearRegression(), n_features_to_select=5, step=1)
    rfe.fit(X, y)

    # Collect the names of the selected variables
    mask = rfe.support_
    result = X.columns[mask]

    return list(result)

In [25]:
print("Resposta da questão 4: ", q4())

Resposta da questão 4:  ['Age', 'Potential', 'BallControl', 'Reactions', 'GKReflexes']
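
To see how the elimination proceeds, here is a small sketch of RFE on synthetic data in which only two columns drive the target. The data, coefficients and names (`X_demo`, `y_demo`) are made up for illustration.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic regression problem (illustrative only): columns 0 and 3 drive
# the target, while the remaining four columns are pure noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(300, 6))
y_demo = 4.0 * X_demo[:, 0] - 3.0 * X_demo[:, 3] + rng.normal(scale=0.1, size=300)

# step=1 removes the single weakest feature at each iteration.
rfe = RFE(LinearRegression(), n_features_to_select=2, step=1).fit(X_demo, y_demo)

print(rfe.support_)  # boolean mask of the columns that were kept
print(rfe.ranking_)  # 1 = selected; larger values were eliminated earlier
```

With a linear model, RFE ranks features by the magnitude of their coefficients, so features measured on different scales can influence which ones survive. Standardizing the inputs first is a common precaution, although the original solution does not do so.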
