# Porto Seguro Data Challenge
---

## Problem Definition

**Objective**: "In this competition you will be challenged to build a model that predicts the probability that a product is acquired."

Note: According to the competition description, submissions are evaluated with the F1 score, and sensitivity (recall) is used as the tie-breaker.
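
As a minimal sketch of how the two metrics relate, the snippet below computes F1 and sensitivity (recall) with scikit-learn. The labels, probabilities, and the 0.5 cut-off are made up for illustration; they are not taken from the competition data.

```python
from sklearn.metrics import f1_score, recall_score

# Hypothetical labels and predicted probabilities, for illustration only
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.9, 0.2, 0.4, 0.7, 0.6, 0.1, 0.8, 0.3]

# F1 needs hard labels, so the probabilities are cut at an assumed threshold
threshold = 0.5
y_pred = [int(p >= threshold) for p in y_prob]

f1 = f1_score(y_true, y_pred)          # harmonic mean of precision and recall
recall = recall_score(y_true, y_pred)  # sensitivity: TP / (TP + FN)
print(f"F1: {f1:.3f}  Recall: {recall:.3f}")
```

Because F1 depends on the chosen threshold, tuning that cut-off on validation data is usually worth trying when the model outputs probabilities.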

<p style="color:red">If you liked this notebook, don't forget to upvote! 🤘</p>

<div class="alert alert-warning"> 
<h3><strong>⚠️ Warning! <br></strong> </h3>
    
<p style="color: rgb(0, 0, 0);">This notebook may change in the future, because neither the <b>categorical</b> features nor the <b>missing</b> values received any special treatment in these analyses!</p>
</div>
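
For reference, a possible treatment could look like the sketch below. It assumes that `-999` marks a missing value (the null-proportion analysis in this notebook uses the same sentinel) and uses a toy frame with hypothetical column names. The imputation and encoding choices are illustrative, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the competition data (column names are hypothetical)
df = pd.DataFrame({
    "cat_a": [1, 2, -999, 2],
    "num_b": [0.5, -999, 1.5, 2.5],
})

# Treat the -999 sentinel as missing
df = df.replace(-999, np.nan)

# Impute: most frequent level for the categorical column, median for the numeric one
df["cat_a"] = df["cat_a"].fillna(df["cat_a"].mode().iloc[0])
df["num_b"] = df["num_b"].fillna(df["num_b"].median())

# One-hot encode the nominal column so linear models do not read its codes as ordered
encoded = pd.get_dummies(df, columns=["cat_a"], prefix="cat_a")
print(encoded)
```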

# Load Dependencies
---

In [None]:
!pip install sweetviz

In [None]:
import sweetviz as sv

import pandas as pd
import numpy as np

from random import uniform

from sklearn.feature_selection import mutual_info_classif
from sklearn.feature_selection import VarianceThreshold
from mlxtend.feature_selection import ExhaustiveFeatureSelector
from mlxtend.feature_selection import SequentialFeatureSelector
from sklearn.feature_selection import SelectFromModel
from sklearn.feature_selection import RFECV
from boruta import BorutaPy

from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

In [None]:
train = pd.read_csv('../input/porto-seguro-data-challenge/train.csv').drop(['id'], axis=1)
test = pd.read_csv('../input/porto-seguro-data-challenge/test.csv').drop(['id'], axis=1)
sample_submission = pd.read_csv('../input/porto-seguro-data-challenge/submission_sample.csv')
meta = pd.read_csv('../input/porto-seguro-data-challenge/metadata.csv')

In [None]:
cat_nom = [x for x in meta.iloc[1:-1, :].loc[(meta.iloc[:,1]=="Qualitativo nominal")].iloc[:,0]] # 0.66 / 0.56
cat_ord = [x for x in meta.iloc[1:-1, :].loc[(meta.iloc[:,1]=="Qualitativo ordinal")].iloc[:,0]] # 0.36 / 0.36
num_dis = [x for x in meta.iloc[1:-1, :].loc[(meta.iloc[:,1]=="Quantitativo discreto")].iloc[:,0]] # 0.40
num_con = [x for x in meta.iloc[1:-1, :].loc[(meta.iloc[:,1]=="Quantitativo continua")].iloc[:,0]] # 0.38

# SweetViz Report
---

In [None]:
my_report = sv.analyze(train, target_feat='y')

In [None]:
my_report.show_notebook()

# Feature Selection
---

Methods used for feature selection:

- Variable Importance
- Information Gain
- Zero Proportion
- Null Proportion
- Backward Feature Elimination
- Exhaustive Feature Selection (off)
- Lasso Regularization (L1)
- Recursive Feature Elimination (RFE)
- Boruta
- Random Feature

Prepare the data for the analyses:

In [None]:
X_test = test[cat_nom+cat_ord+num_dis+num_con]
X = train[cat_nom+cat_ord+num_dis+num_con]
y = train.y

SEED=314

# Variable Importance

In [None]:
clf = RandomForestClassifier(max_depth=2, random_state=SEED)

clf.fit(X, y);

res_var_imp = pd.DataFrame({
    "feature": X.columns,
    "var_imp": clf.feature_importances_
})

res_var_imp.sort_values('var_imp', ascending=False)

# Information Gain

In [None]:
%%time
mutual_info = mutual_info_classif(X, y)

In [None]:
res_mutual_info = pd.DataFrame({
    "feature": X.columns,
    "mutual_info": mutual_info
})

res_mutual_info.sort_values('mutual_info', ascending=False)
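
By default, `mutual_info_classif` treats dense input columns as continuous and uses a randomized nearest-neighbor estimator. The sketch below, on synthetic data with hypothetical column names, shows how the `discrete_features` and `random_state` parameters can flag categorical columns and make the estimate reproducible. It is an illustration under those assumptions, not the analysis run above.

```python
import numpy as np
import pandas as pd
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(314)
n = 500

# Synthetic stand-in data: one informative categorical column, one noise numeric column
toy = pd.DataFrame({
    "cat_feature": rng.integers(0, 3, size=n),
    "num_feature": rng.normal(size=n),
})
target = (toy["cat_feature"] == 2).astype(int)

# Boolean mask flagging which columns hold discrete codes
discrete_mask = toy.columns.isin(["cat_feature"])

mi = mutual_info_classif(toy, target, discrete_features=discrete_mask, random_state=314)
print(pd.Series(mi, index=toy.columns).sort_values(ascending=False))
```

In this notebook, a similar mask could presumably be built from the metadata lists, for example `X.columns.isin(cat_nom + cat_ord + num_dis)`.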

# Zero Proportion

In [None]:
res_pzeros = pd.DataFrame({
    "feature": X.columns,
    # percentage of rows equal to zero in each column
    "pzeros": (X == 0).mean() * 100
})

res_pzeros.sort_values('pzeros', ascending=False)

# Null Proportion

In [None]:
res_pnull = pd.DataFrame({
    "feature": X.columns,
    # percentage of missing rows per column; -999 is treated as the missing-value marker
    "pnull": X.replace(-999, np.nan).isnull().mean() * 100
})

res_pnull.sort_values('pnull', ascending=False)

# Backward Feature Elimination

In [None]:
%%time
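# Note: this estimator is an L2-penalized logistic regression fitted with the SAG solver
# (not a lasso), using a loose convergence tolerance (tol=0.1).
logreg_l2 = LogisticRegression(C=1, penalty="l2", solver='sag', tol=0.1, random_state=SEED)
# forward=False makes the search backward; cv=0 disables cross-validation,
# so each candidate subset is scored on the training data itself.
bfs = SequentialFeatureSelector(logreg_l2,
                                k_features='best',
                                forward=False,
                                floating=False,
                                scoring='neg_log_loss',
                                cv=0,
                                verbose=2,
                                n_jobs=1)
bfs.fit(X, y);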

In [None]:
res_bfs = pd.DataFrame({
    "feature": X.columns,
    "bfs": np.where(X.columns.isin(bfs.k_feature_names_), "to_keep", "to_remove")
})
res_bfs.sort_values('bfs', ascending=True)

# Exhaustive Feature Selection

In [None]:
# %%time
# 
# efs = ExhaustiveFeatureSelector(LGBMClassifier(),
#                                 min_features=10,
#                                 max_features=75,
#                                 scoring='neg_log_loss',
#                                 print_progress=True,
#                                 cv=5)
# 
# efs.fit(X, y);

In [None]:
# res_efs = pd.DataFrame({
#     "feature": X.columns,
#     "efs": np.where(X.columns.isin(efs.k_feature_names_), "to_keep", "to_remove")
# })
# res_efs.sort_values('efs', ascending=True)

# Lasso Regularization (L1)

In [None]:
%%time

lasso = LogisticRegression(C=1, penalty="l1", solver="liblinear", random_state=314).fit(X, y)
lasso_selector = SelectFromModel(lasso, prefit=True, threshold="median")

In [None]:
res_lasso = pd.DataFrame({
    "feature": X.columns,
    "lasso": np.where(lasso_selector.get_support(), "to_keep", "to_remove")
})
res_lasso.sort_values('lasso', ascending=True)

# RFE

In [None]:
%%time

rf = RandomForestClassifier(n_jobs=-1, max_depth=4)
rfe_selector = RFECV(rf, min_features_to_select=20, step=1, n_jobs=1, verbose=1)
#rfe_selector.fit(X_sample.values, y[X_sample.index]) #dev
rfe_selector.fit(X.values, y)

In [None]:
res_rfe = pd.DataFrame({
    "feature": X.columns,
    "rfe": np.where(rfe_selector.support_, "to_keep", "to_remove")
})
res_rfe.sort_values('rfe', ascending=True)

# Boruta

In [None]:
%%time

rf = RandomForestClassifier(n_jobs=-1, max_depth=4)
boruta_selector = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=314)
boruta_selector.fit(X.values, y)
#boruta_selector.fit(X_sample.values, y[X_sample.index]) #dev

In [None]:
res_boruta = pd.DataFrame({
    "feature": X.columns,
    "boruta": np.where(boruta_selector.support_, "to_keep", "to_remove")
})
res_boruta.sort_values('boruta', ascending=True)

# Random Column

In [None]:
X_random = pd.concat([X, pd.DataFrame({'random':[uniform(0.0, 100.0) for i in range(X.shape[0])]})], axis=1)

In [None]:
%%time
rf = RandomForestClassifier(n_jobs=-1, max_depth=3)
rf.fit(X_random, y);

In [None]:
# np.float was removed in NumPy 1.24; use the built-in float on the single matching element
varip_random = float(rf.feature_importances_[X_random.columns == "random"][0])
print("Random VarImp:", varip_random)

res_rand_var_imp = pd.DataFrame({
    "feature": X_random.columns,
    "rand_var_imp": rf.feature_importances_,
    "rand_var": np.where(rf.feature_importances_ > varip_random, "to_keep", "to_remove")
})
res_rand_var_imp.sort_values('rand_var_imp', ascending=False)

# Combine Results
---

In [None]:
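# Join every per-method result on the shared "feature" column
feature_selection = (
    res_var_imp
    .merge(res_mutual_info, on="feature")
    .merge(res_pzeros, on="feature")
    .merge(res_pnull, on="feature")
    .merge(res_bfs, on="feature")
    .merge(res_lasso, on="feature")
    .merge(res_boruta, on="feature")
    .merge(res_rfe, on="feature")
    .merge(res_rand_var_imp.drop('rand_var_imp', axis=1), on="feature")
)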

feature_selection.to_csv('feature_selection.csv', index=False)

In [None]:
feature_selection.style.\
    bar(subset=['var_imp'],color='#205ff2').\
    bar(subset=['mutual_info'],color='#205ff2').\
    background_gradient(subset=['pzeros'],cmap='coolwarm').\
    background_gradient(subset=['pnull'],cmap='coolwarm').\
    apply(lambda x: ["background: red" if v == "to_remove" else "" for v in x], axis = 1)
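
One hedged way to turn the combined table into a decision is to count how many methods flagged each feature as `to_remove`. The sketch below uses a toy table whose flag columns mirror the ones built above. The feature names and the majority rule are assumptions for illustration only.

```python
import pandas as pd

# Toy stand-in for the combined table; the flag columns mirror the notebook's names
toy_selection = pd.DataFrame({
    "feature":  ["var1", "var2", "var3"],
    "bfs":      ["to_keep", "to_remove", "to_remove"],
    "lasso":    ["to_keep", "to_keep",   "to_remove"],
    "boruta":   ["to_keep", "to_remove", "to_remove"],
    "rfe":      ["to_keep", "to_keep",   "to_remove"],
    "rand_var": ["to_keep", "to_remove", "to_remove"],
})

flag_cols = ["bfs", "lasso", "boruta", "rfe", "rand_var"]

# Number of methods that voted to drop each feature
toy_selection["n_remove"] = (toy_selection[flag_cols] == "to_remove").sum(axis=1)

# Assumed rule: drop a feature only when a majority of the methods agree
majority = len(flag_cols) // 2 + 1
to_drop = toy_selection.loc[toy_selection["n_remove"] >= majority, "feature"].tolist()
print(to_drop)
```

The same counting could be applied to `feature_selection` directly, and any candidate drop list is best confirmed by comparing validation F1 with and without those features.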

# Conclusion

The goal of this notebook was simply to explore the data further from the perspective of automated feature selection.

Note that neither the categorical features nor the missing values have been treated in this notebook (for now), so use these results wisely!