<h1>Supervised classification with Titanic</h1>

<p>This tutorial is a translation and adaptation to Python 3 of the one found on the blog of <a href="http://ahmedbesbes.com/how-to-score-08134-in-titanic-kaggle-challenge.html">ahmedbesbes</a>.</p>

The dataset can be found at https://www.kaggle.com/competitions/titanic

### Tips to improve the score

- Try other kinds of normalization
- Compare the models after tuning them (this takes quite a while)
- Avoid the dummy variable trap
- Use neural networks (future work)
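As a minimal sketch of the dummy variable trap: with one column per category, the dummy columns of each row always sum to 1, which makes them perfectly collinear with the intercept of a linear model. `pd.get_dummies` accepts `drop_first=True` to drop one level. The toy `Embarked` values below are invented for illustration and are not read from the Kaggle file.

```python
import pandas as pd

# Toy data, invented for illustration (not the Kaggle file).
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# Full one-hot encoding: one column per category.
full = pd.get_dummies(toy["Embarked"], prefix="Embarked")

# drop_first=True drops the first level ("C"), which becomes the implicit baseline.
reduced = pd.get_dummies(toy["Embarked"], prefix="Embarked", drop_first=True)

print(list(full.columns))     # ['Embarked_C', 'Embarked_Q', 'Embarked_S']
print(list(reduced.columns))  # ['Embarked_Q', 'Embarked_S']
```

This mostly matters for linear models such as `LogisticRegression`; tree-based models are usually much less sensitive to the redundant column.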

## Libraries

In [49]:
# data structures and plotting
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# statistics
from scipy.stats import skew

# metrics
from sklearn.metrics import accuracy_score # accuracy
# https://en.wikipedia.org/wiki/Accuracy_and_precision
# https://en.wikipedia.org/wiki/Confusion_matrix

# scaling
from sklearn.preprocessing import scale

# train/test split
from sklearn.model_selection import train_test_split

# models
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC
from lightgbm import LGBMClassifier
from xgboost import XGBClassifier
from catboost import CatBoostClassifier

# tuning
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV

# stacking (classifier variant, since this is a classification task)
from mlxtend.classifier import StackingCVClassifier

## Suppress warnings (optional and not recommended)

In [50]:
import warnings
warnings.filterwarnings('ignore')
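Since silencing everything globally can hide real problems, a narrower option is to ignore one warning category inside a `with` block only. The sketch below uses a hypothetical `noisy()` function as a stand-in for a library call that emits a `DeprecationWarning`.

```python
import warnings

def noisy():
    # Hypothetical function standing in for a library call that warns.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Ignore only DeprecationWarning, and only inside this block.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    value = noisy()  # no warning is shown here

print(value)  # 42
```

Outside the `with` block the previous warning filters are restored, so other warnings remain visible.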

<h3>Loading train and test</h3>

In [51]:
train = pd.read_csv('../datasets/titanic/train.csv')
test = pd.read_csv('../datasets/titanic/test.csv')
all_data = pd.concat((
    train.loc[:,'Pclass':],
     test.loc[:,'Pclass':]))
y = train['Survived']

# EDA

In [52]:
all_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 10 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Pclass    1309 non-null   int64  
 1   Name      1309 non-null   object 
 2   Sex       1309 non-null   object 
 3   Age       1046 non-null   float64
 4   SibSp     1309 non-null   int64  
 5   Parch     1309 non-null   int64  
 6   Ticket    1309 non-null   object 
 7   Fare      1308 non-null   float64
 8   Cabin     295 non-null    object 
 9   Embarked  1307 non-null   object 
dtypes: float64(2), int64(3), object(5)
memory usage: 112.5+ KB


This time we will process each variable separately to better prepare the data for modeling.

In [53]:
all_data.columns

Index(['Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch', 'Ticket', 'Fare',
       'Cabin', 'Embarked'],
      dtype='object')

In [54]:
# Pclass (although numeric) -> dummies
# Name -> *convert to titles* and dummies
# Age -> fill with the mean per group
# Sex -> dummies
# SibSp and Parch -> combine into family size
# Ticket -> keep only the letters
# Fare -> fill nulls with the mean
# Cabin -> convert to deck, fill nulls with U, then dummies
# Embarked -> fill nulls with the mode, then dummies

### Name (part 1)

In [55]:
all_data['Title'] = all_data['Name'].map(lambda name:name.split(',')[1].split('.')[0].strip())
title_map = {
    "Capt":       "Officer",
    "Col":        "Officer",
    "Major":      "Officer",
    "Jonkheer":   "Royalty",
    "Don":        "Royalty",
    "Sir" :       "Royalty",
    "Dr":         "Officer",
    "Rev":        "Officer",
    "the Countess":"Royalty",
    "Dona":       "Royalty",
    "Mme":        "Mrs",
    "Mlle":       "Miss",
    "Ms":         "Mrs",
    "Mr" :        "Mr",
    "Mrs" :       "Mrs",
    "Miss" :      "Miss",
    "Master" :    "Master",
    "Lady" :      "Royalty"
}
all_data['Title'] = all_data['Title'].map(title_map)
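The title extraction above can also be written with a vectorized regular expression. The sketch below runs on three sample names in the dataset's `Surname, Title. Given names` format and also shows that `map` with an incomplete dictionary yields `NaN` for titles it does not cover; `partial_map` is a deliberately incomplete, hypothetical mapping.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)",
])

# Capture the text between the comma and the first period.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()
print(titles.tolist())  # ['Mr', 'Mrs', 'the Countess']

# A title missing from the mapping becomes NaN after map().
partial_map = {"Mr": "Mr", "Mrs": "Mrs"}  # deliberately incomplete
mapped = titles.map(partial_map)
print(mapped.isna().tolist())  # [False, False, True]
```

Checking `all_data['Title'].isna().sum()` after the mapping is a quick way to confirm that every title in the data is covered by `title_map`.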

### Age

In [56]:
def fillAges(row):
    if row['Sex']=='female' and row['Pclass'] == 1:
        if row['Title'] == 'Miss':
            return 30
        elif row['Title'] == 'Mrs':
            return 45
        elif row['Title'] == 'Officer':
            return 49
        elif row['Title'] == 'Royalty':
            return 39

    elif row['Sex']=='female' and row['Pclass'] == 2:
        if row['Title'] == 'Miss':
            return 20
        elif row['Title'] == 'Mrs':
            return 30

    elif row['Sex']=='female' and row['Pclass'] == 3:
        if row['Title'] == 'Miss':
            return 18
        elif row['Title'] == 'Mrs':
            return 31

    elif row['Sex']=='male' and row['Pclass'] == 1:
        if row['Title'] == 'Master':
            return 6
        elif row['Title'] == 'Mr':
            return 41.5
        elif row['Title'] == 'Officer':
            return 52
        elif row['Title'] == 'Royalty':
            return 40

    elif row['Sex']=='male' and row['Pclass'] == 2:
        if row['Title'] == 'Master':
            return 2
        elif row['Title'] == 'Mr':
            return 30
        elif row['Title'] == 'Officer':
            return 41.5

    elif row['Sex']=='male' and row['Pclass'] == 3:
        if row['Title'] == 'Master':
            return 6
        elif row['Title'] == 'Mr':
            return 26
all_data["Age"] = all_data.apply(lambda r : fillAges(r) if np.isnan(r['Age']) else r['Age'], axis=1)
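The hard-coded values in `fillAges` appear to be one typical age per (`Sex`, `Pclass`, `Title`) group. Assuming that is the intent, the same idea can be sketched by computing a group statistic directly from the data. The median used here is an assumption (the plan above mentions the mean), and the toy rows are invented.

```python
import numpy as np
import pandas as pd

# Toy rows, invented for illustration.
toy = pd.DataFrame({
    "Sex":    ["male", "male", "male", "female", "female", "female"],
    "Pclass": [3, 3, 3, 1, 1, 1],
    "Title":  ["Mr", "Mr", "Mr", "Mrs", "Mrs", "Mrs"],
    "Age":    [20.0, 30.0, np.nan, 40.0, 50.0, np.nan],
})

# Median age of each group, broadcast back to every row of that group.
group_median = toy.groupby(["Sex", "Pclass", "Title"])["Age"].transform("median")

# Fill only the missing ages with their group's median.
toy["Age"] = toy["Age"].fillna(group_median)
print(toy["Age"].tolist())  # [20.0, 30.0, 25.0, 40.0, 50.0, 45.0]
```

A group with no known ages at all would still be left as `NaN`, so a final fallback (for example the overall median) may be needed on real data.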

### Name (part 2)

In [57]:
all_data.drop('Name',axis=1,inplace=True)
titles_dummies = pd.get_dummies(all_data['Title'], prefix='Title')
all_data = pd.concat([all_data, titles_dummies],axis=1)
all_data.drop('Title',axis=1,inplace=True)

### Sex

In [58]:
all_data["Sex"] = all_data["Sex"].map(lambda x: 1 if x == 'male' else 0)

### SibSp and Parch

In [59]:
all_data['FamilySize'] = all_data['Parch'] + all_data['SibSp'] + 1 # +1 for the passenger themselves
# introducing other features based on the family size
all_data['Singleton'] = all_data['FamilySize'].map(lambda s : 1 if s == 1 else 0)
all_data['SmallFamily'] = all_data['FamilySize'].map(lambda s : 1 if 2<=s<=4 else 0)
all_data['LargeFamily'] = all_data['FamilySize'].map(lambda s : 1 if 5<=s else 0)

all_data.drop("SibSp", axis=1, inplace=True)
all_data.drop("Parch", axis=1, inplace=True)
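The three family-size indicators use the ranges 1, 2 to 4, and 5 or more. As a sketch, the same binning can be expressed with `pd.cut`; the sample sizes below are invented.

```python
import pandas as pd

family_size = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins: (0, 1], (1, 4], (4, inf)
family_bin = pd.cut(
    family_size,
    bins=[0, 1, 4, float("inf")],
    labels=["Singleton", "SmallFamily", "LargeFamily"],
)
print(family_bin.tolist())
# ['Singleton', 'SmallFamily', 'SmallFamily', 'LargeFamily', 'LargeFamily']
```

Passing `family_bin` to `pd.get_dummies` would then produce one indicator column per label.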

### Ticket

In [60]:
all_data['Ticket'] = all_data['Ticket'].map(lambda x: ''.join(filter(str.isalpha, x)))
all_data['Ticket'] = all_data['Ticket'].map(lambda x: x if x else "XXX")
tickets_dummies = pd.get_dummies(all_data['Ticket'],prefix='Ticket')
all_data = pd.concat([all_data, tickets_dummies],axis=1)
all_data.drop('Ticket',inplace=True,axis=1)
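To make the ticket transformation concrete, the sketch below applies the same two steps (keep only letters, fall back to `"XXX"`) to a few sample ticket strings. The helper name `ticket_prefix` is hypothetical.

```python
def ticket_prefix(ticket: str) -> str:
    # Keep only alphabetic characters; digits, spaces, "/" and "." are dropped.
    letters = "".join(filter(str.isalpha, ticket))
    # Purely numeric tickets have no letters left, so use a placeholder.
    return letters if letters else "XXX"

samples = ["A/5 21171", "PC 17599", "STON/O2. 3101282", "113803"]
prefixes = [ticket_prefix(t) for t in samples]
print(prefixes)  # ['A', 'PC', 'STONO', 'XXX']
```

Each distinct prefix becomes its own dummy column, so rare prefixes produce columns that are almost entirely zero.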

### Fare

In [61]:
fare_mean = all_data["Fare"].mean()
all_data["Fare"] = all_data["Fare"].fillna(fare_mean)

### Cabin

In [62]:
all_data["Cabin"] = all_data["Cabin"].fillna("U")
all_data["Cabin"] = all_data["Cabin"].map(lambda x: x[0])
cabin_dummies = pd.get_dummies(all_data['Cabin'],prefix='Cabin')
all_data = pd.concat([all_data, cabin_dummies],axis=1)
all_data.drop('Cabin',inplace=True,axis=1)

### Embarked

In [63]:
embarked_mode = all_data["Embarked"].mode()[0] # mode() returns a Series; take the most frequent value
all_data["Embarked"] = all_data["Embarked"].fillna(embarked_mode)
embarked_dummies = pd.get_dummies(all_data['Embarked'],prefix='Embarked')
all_data = pd.concat([all_data, embarked_dummies],axis=1)
all_data.drop('Embarked',inplace=True,axis=1)
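One detail worth checking when filling with the mode: `Series.mode()` returns a Series (there can be ties), and `fillna` given a Series aligns on the index instead of broadcasting a single value. A minimal sketch on invented values:

```python
import numpy as np
import pandas as pd

embarked = pd.Series(["S", "C", "S", np.nan, "Q"])

mode_series = embarked.mode()     # a Series, even when there is a single mode
mode_value = mode_series.iloc[0]  # the scalar "S"

filled = embarked.fillna(mode_value)
print(filled.tolist())  # ['S', 'C', 'S', 'S', 'Q']
```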

## Removing skewness

In [28]:
# apply log(x+1) to the numeric features to bring their frequency distribution closer to normal

# select the numeric features
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
# compute the skew (asymmetry)
skewed_feats = all_data[numeric_feats].apply(lambda x: skew(x.dropna()))
# keep features with skew above 0.75 (a normal distribution has skew near zero)
skewed_feats = skewed_feats[skewed_feats > 0.75]
# take the names of the selected features
skewed_feats = skewed_feats.index
# transform with log(x + 1)
all_data[skewed_feats] = np.log1p(all_data[skewed_feats])
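To illustrate the effect of `log1p` on a right-skewed feature, the sketch below uses synthetic log-normal values as a stand-in for something like `Fare`; the data and parameters are assumptions chosen only for the demonstration.

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed values (not the real Fare column).
rng = np.random.default_rng(0)
fares = rng.lognormal(mean=3.0, sigma=1.0, size=1000)

skew_before = skew(fares)
skew_after = skew(np.log1p(fares))
print(f"skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")
```

The skew after the transform should be much closer to zero, which is the behavior the 0.75 threshold is meant to detect and correct.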

## Normalizing

In [64]:
features = list(all_data.columns)
all_data[features] = all_data[features].apply(lambda x: x/x.max(), axis=0)
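The cell above divides each column by its maximum, computed over train and test together. As one way to try other normalizations, the sketch below fits scikit-learn's `MinMaxScaler` and `StandardScaler` on training rows only and then applies them to held-out rows. The toy arrays are invented.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X_train_toy = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
X_test_toy = np.array([[2.0, 40.0]])

# Learn the scaling parameters from the training rows only.
minmax = MinMaxScaler().fit(X_train_toy)
standard = StandardScaler().fit(X_train_toy)

# Apply the learned parameters to unseen rows.
test_minmax = minmax.transform(X_test_toy)
test_standard = standard.transform(X_test_toy)

print(test_minmax)    # [[0.5 1.5]]
print(test_standard)
```

Note that a held-out value above the training maximum maps outside the range 0 to 1 (here 1.5), which is expected when the scaler has not seen the test data.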

## Train and test

In [65]:
X = all_data[:train.shape[0]] # train
test = all_data[train.shape[0]:] # test
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.25,random_state=42)
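`train_test_split` also accepts a `stratify` argument, which keeps the class proportions the same in both parts. A small sketch on invented labels with 40% positives:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_toy = np.arange(40).reshape(20, 2)
y_toy = np.array([0] * 12 + [1] * 8)  # 40% positives

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=42, stratify=y_toy)

# Both parts keep the 40% positive rate.
print(y_tr.mean(), y_te.mean())  # 0.4 0.4
```

With the Titanic data this would correspond to passing `stratify=y` in the call above.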

### Models

In [32]:
model_instances = [
    (RandomForestClassifier(), "RandomForestClassifier"),
    (ExtraTreesClassifier(), "ExtraTreesClassifier"),
    (GradientBoostingClassifier(), "GradientBoostingClassifier"),
    (LogisticRegression(), "LogisticRegression"),
    (DecisionTreeClassifier(), "DecisionTreeClassifier"),
    (KNeighborsClassifier(), "KNeighborsClassifier"),
    (GaussianNB(), "GaussianNB"),
    (Perceptron(), "Perceptron"),
    (SGDClassifier(), "SGDClassifier"),
    (SVC(), "SVC"),
    (LinearSVC(), "LinearSVC"),
    (LGBMClassifier(verbose=0), "LGBMClassifier"),
    (XGBClassifier(), "XGBClassifier"),
    (CatBoostClassifier(verbose=False), "CatBoostClassifier"), 
] 

## Results

In [33]:
results = {
    "Model":[],
    "ACC":[]
}

In [50]:
for model, model_name in model_instances:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    results['Model'].append(model_name)
    results['ACC'].append(accuracy_score(y_test, y_pred))



In [51]:
results = pd.DataFrame(results)
results

                        Model       ACC
0      RandomForestClassifier  0.834081
1        ExtraTreesClassifier  0.820628
2  GradientBoostingClassifier  0.793722
3          LogisticRegression  0.820628
4      DecisionTreeClassifier  0.807175
5        KNeighborsClassifier  0.820628
6                  GaussianNB  0.461883
7                  Perceptron  0.811659
8               SGDClassifier  0.780269
9                         SVC  0.829596


In [52]:
results_temp = results.sort_values("ACC", ascending=False)
results_temp.iloc[:5]["Model"]

0     RandomForestClassifier
13        CatBoostClassifier
9                        SVC
10                 LinearSVC
1       ExtraTreesClassifier
Name: Model, dtype: object
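The ranking above comes from a single hold-out split, so small differences between models may be noise. A hedged alternative is to compare models with cross-validation and look at the mean and spread of the accuracy. The sketch below uses synthetic data from `make_classification` in place of the Titanic features, and only two of the models.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic stand-in for the prepared Titanic features.
X_syn, y_syn = make_classification(n_samples=300, n_features=10, random_state=0)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

candidates = [
    ("LogisticRegression", LogisticRegression(max_iter=1000)),
    ("RandomForestClassifier", RandomForestClassifier(n_estimators=50, random_state=0)),
]

cv_results = {}
for name, model in candidates:
    # One accuracy value per fold.
    scores = cross_val_score(model, X_syn, y_syn, cv=cv, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```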

### Hyperparameter tuning

In [61]:
# Search for the best parameters
randomForest = RandomForestClassifier()
cross_validation = StratifiedKFold(n_splits=5) # n_splits should be chosen carefully
parameter_grid = {
     'max_depth' : [15,16],
     'n_estimators': [10000,20000,30000],
     'criterion': ['gini','entropy',],
     'min_samples_split': [27,28], # tuned
     'max_features':[6,7,8]
}
grid_search = GridSearchCV(
    randomForest,
    param_grid=parameter_grid,
    cv=cross_validation)

grid_search.fit(X_train, y_train)

print('Best score: {}'.format(grid_search.best_score_))
print('Best parameters: {}'.format(grid_search.best_params_))

Best score: 0.8353832342049152
Best parameters: {'criterion': 'gini', 'max_depth': 18, 'n_estimators': 1000}
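The grid in the cell above has 2 × 3 × 2 × 2 × 3 = 72 combinations, which means 360 fits with 5 folds, each with at least 10,000 trees. When that is too slow, `RandomizedSearchCV` samples a fixed number of combinations instead. The sketch below uses synthetic data and a smaller, hypothetical search space, not the values from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in for the prepared Titanic features.
X_syn, y_syn = make_classification(n_samples=200, n_features=10, random_state=0)

# Hypothetical search space, chosen only to keep the example fast.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [4, 8, 16, None],
    "min_samples_split": [2, 10, 30],
    "max_features": ["sqrt", 0.5],
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5,                        # evaluate 5 sampled combinations
    cv=StratifiedKFold(n_splits=3),
    scoring="accuracy",
    random_state=0,
)
search.fit(X_syn, y_syn)

print("Best score:", search.best_score_)
print("Best parameters:", search.best_params_)
```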


## Kaggle submission

In [62]:
random_forest = RandomForestClassifier(
    n_estimators=10000, # tuned
    criterion='entropy', # tuned
    max_depth=9, # tuned
    min_samples_split=27, # tuned
    max_features=8)

random_forest.fit(X, y)
y_pred = random_forest.predict(test)

In [66]:
random_forest = RandomForestClassifier(
    n_estimators=20, # tuned
    max_depth=9)

random_forest.fit(X, y)
y_pred = random_forest.predict(test)

In [67]:
sample_submission = pd.read_csv("../datasets/titanic/gender_submission.csv",index_col=0)
sample_submission['Survived'] = y_pred
sample_submission.to_csv("random_forest_tunado_x.csv")
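The submission file only needs the two columns `PassengerId` and `Survived`, so it can also be built directly instead of overwriting the sample file. In the sketch below, `passenger_ids` and `predictions` are hypothetical stand-ins for the test-set ids and the model output.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for the test-set ids and the model's predictions.
passenger_ids = np.array([892, 893, 894])
predictions = np.array([0, 1, 0])

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})

# index=False keeps the row index out of the file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```

Passing a file name to `to_csv` instead of leaving it empty writes the same content to disk.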

<b>Kaggle score:</b> Your submission scored 0.78299.

To tune other models, see <a href='http://scikit-learn.org/stable/modules/grid_search.html'>Tuning the hyper-parameters of an estimator</a>.

<p>Summary of Kaggle scores with tuned models:</p>
<table cellspacing="2" cellpadding="4" style="border:solid 2px; margin:auto;" >
    <tr style="border:solid 2px;">
          <td height='50' bgcolor='#D4D0C8' style="border:solid 2px;"><b>Submission</b></td>
          <td height='50' bgcolor='#D4D0C8' style="border:solid 2px;"><b>Score</b></td>
    </tr>
    <tr style="border:solid 2px;">
        <td style="border:solid 2px;"> random_forest_tunado.csv </td>
        <td style="border:solid 2px;">0.80383</td>
    </tr>
    <tr style="border:solid 2px;">
        <td style="border:solid 2px;"> extra_trees_tunado.csv </td>
        <td style="border:solid 2px;">0.79426</td>
    </tr>
    <tr style="border:solid 2px;">
        <td style="border:solid 2px;"> log_reg_tunado.csv </td>
        <td style="border:solid 2px;"> 0.77990</td>
    </tr>
    <tr style="border:solid 2px;">
        <td style="border:solid 2px;"> decision_tree_tunado.csv </td>
        <td style="border:solid 2px;"> 0.76555</td>
    </tr>
    <tr style="border:solid 2px;">
        <td style="border:solid 2px;"> grad_boost_tunado.csv </td>
        <td style="border:solid 2px;"> 0.62201</td>
    </tr>
</table>

## Exercises

1. Tune the LightGBM, XGBoost and CatBoost models. Did the scores improve?

2. Build a stack of the 3 best models you tuned. Did the scores improve?

3. Mock exam! Our next graded assignment will be on supervised classification. We will run a simulation using Titanic. What would your grade be?

Suppose your score is S:

- If S < 0.67, your points would be (1 - (0.67 - S)/0.67) * 4.
- If 0.67 <= S <= 0.73, your points would scale proportionally from 4 to 7.
- If 0.73 < S <= 0.83, your points would scale proportionally from 7 to 10.
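The grading rule can be sketched as a piecewise linear function. Two assumptions are made here: the first branch is read as falling linearly from 4 points at S = 0.67 to 0 points at S = 0 (that is, 4 · S / 0.67), and scores above 0.83, which the rule does not cover, are capped at 10. The function name `simulated_grade` is hypothetical.

```python
def simulated_grade(score: float) -> float:
    """Piecewise linear grade for a Kaggle accuracy score between 0 and 1."""
    if score < 0.67:
        # Assumption: linear from 0 points at S=0 to 4 points at S=0.67.
        return 4.0 * score / 0.67
    if score <= 0.73:
        # Linear from 4 to 7 points.
        return 4.0 + 3.0 * (score - 0.67) / (0.73 - 0.67)
    if score <= 0.83:
        # Linear from 7 to 10 points.
        return 7.0 + 3.0 * (score - 0.73) / (0.83 - 0.73)
    # Assumption: the rule is silent above 0.83, so cap at 10.
    return 10.0

print(f"{simulated_grade(0.78299):.2f}")  # 8.59
```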

Good luck on the mock exam!