# Titanic - XGBoost
This notebook builds a model on the Titanic dataset using XGBoost.

Let's start by importing the basic libraries we will use.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

Next step: load the data from the CSV files provided on Kaggle. We use the pandas library for this.

In [2]:
# Start by loading the training and test datasets
titanic_df = pd.read_csv("../input/train.csv")
test_df    = pd.read_csv("../input/test.csv")

# Take a look at the first rows of the training set.
titanic_df.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


Let's start with some basic preprocessing of this dataset. Important: every transformation we apply to the training dataset must also be applied to the test dataset.

## Handling Age - Imputation

The `Age` column has missing values, so we need to fill them in somehow. A common approach is to use the mean or the median. Here we use the median of the training dataset, but we could also group by sex, for example. It is up to you whether to do something fancier. ;)

In [3]:
age_median = titanic_df['Age'].median()
print(age_median)

28.0


In [4]:
titanic_df['Age'] = titanic_df['Age'].fillna(age_median)
test_df['Age'] = test_df['Age'].fillna(age_median)
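
The grouped variant mentioned above could look roughly like the sketch below. It is an illustration only, not what this notebook does: `toy_train`, `toy_test`, and `medians_by_sex` are hypothetical names, and the data is made up. The medians are computed on the training frame and then reused for the test frame.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for titanic_df / test_df (made-up values, for illustration only).
toy_train = pd.DataFrame({
    "Sex": ["male", "female", "male", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 26.0, 30.0],
})
toy_test = pd.DataFrame({
    "Sex": ["female", "male"],
    "Age": [np.nan, 40.0],
})

# One median per sex, computed on the training data only.
medians_by_sex = toy_train.groupby("Sex")["Age"].median()

# Fill each missing age with the median of that row's sex.
for frame in (toy_train, toy_test):
    frame["Age"] = frame["Age"].fillna(frame["Sex"].map(medians_by_sex))

print(toy_train)
print(toy_test)
```

In the notebook itself this would have to run before the `Sex` column is label-encoded, or use the encoded values as group keys.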

## Handling Gender - LabelEncoding

In [5]:
from sklearn.preprocessing import LabelEncoder
sex_encoder = LabelEncoder()

sex_encoder.fit(list(titanic_df['Sex'].values) + list(test_df['Sex'].values))

LabelEncoder()

In [6]:
sex_encoder.classes_

array(['female', 'male'], 
      dtype='<U6')

In [7]:
titanic_df['Sex'] = sex_encoder.transform(titanic_df['Sex'].values)
test_df['Sex'] = sex_encoder.transform(test_df['Sex'].values)
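
To make the encoding step concrete, here is a minimal sketch of what `LabelEncoder` does on a toy list (the variable names `toy_encoder` and `codes` are hypothetical). The classes are sorted, so each label is replaced by its position in `classes_`.

```python
from sklearn.preprocessing import LabelEncoder

toy_encoder = LabelEncoder()
toy_encoder.fit(["male", "female", "male"])

# Classes are stored in sorted order; the integer code is the index into classes_.
codes = toy_encoder.transform(["female", "male", "male"])
print(list(toy_encoder.classes_))
print(codes)

# inverse_transform maps the integer codes back to the original labels.
print(toy_encoder.inverse_transform(codes))
```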

## Feature Engineering - Title

Feature engineering is a technique that involves creating new features, usually derived from existing ones. We will use it to extract each passenger's title from the name.

In [8]:
titanic_df.head()['Name']

0                              Braund, Mr. Owen Harris
1    Cumings, Mrs. John Bradley (Florence Briggs Th...
2                               Heikkinen, Miss. Laina
3         Futrelle, Mrs. Jacques Heath (Lily May Peel)
4                             Allen, Mr. William Henry
Name: Name, dtype: object

In [15]:
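import re

def extract_title(name):
    # Capture the text between ", " and a period, e.g. "Mr" in "Braund, Mr. Owen Harris".
    # Note: `.+` is greedy, so the match runs up to the LAST period in the name.
    x = re.search(r', (.+)\.', name)
    if x:
        return x.group(1)
    else:
        return ''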

In [16]:
titanic_df['Title'] = titanic_df['Name'].apply(extract_title)
test_df['Title'] = test_df['Name'].apply(extract_title)
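
Because the pattern in `extract_title` is greedy, a name containing more than one period yields an overly long "title"; this is where the `Title=Mrs. Martin (Elizabeth L` entry in the `DictVectorizer` feature list later in the notebook comes from. A possible alternative, shown only as a sketch, stops at the first period. The function name `extract_title_first_period` and the example name string are hypothetical; switching to it would change the feature count reported further down.

```python
import re

def extract_title_greedy(name):
    # Same pattern as the notebook: matches up to the last period.
    match = re.search(r", (.+)\.", name)
    return match.group(1) if match else ""

def extract_title_first_period(name):
    # [^.]+ cannot cross a period, so the match ends at the first one.
    match = re.search(r", ([^.]+)\.", name)
    return match.group(1) if match else ""

# Illustrative name with two periods (made up for this example).
tricky_name = "Doe, Mrs. Martin (Elizabeth L. Roe)"
print(extract_title_greedy(tricky_name))
print(extract_title_first_period(tricky_name))
print(extract_title_first_period("Braund, Mr. Owen Harris"))
```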

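## OneHotEncoding

Now let's handle features with multiple categories, such as `Title`. `DictVectorizer` one-hot encodes string-valued fields and passes numeric fields through unchanged.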

In [17]:
from sklearn.feature_extraction import DictVectorizer

feature_names = ['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Title']
dv = DictVectorizer()
# Fit on train and test together so both get the same set of columns.
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0.
all_features = pd.concat([titanic_df[feature_names], test_df[feature_names]])
dv.fit(all_features.to_dict(orient='records'))
dv.feature_names_

['Age',
 'Fare',
 'Parch',
 'Pclass',
 'Sex',
 'SibSp',
 'Title=Capt',
 'Title=Col',
 'Title=Don',
 'Title=Dona',
 'Title=Dr',
 'Title=Jonkheer',
 'Title=Lady',
 'Title=Major',
 'Title=Master',
 'Title=Miss',
 'Title=Mlle',
 'Title=Mme',
 'Title=Mr',
 'Title=Mrs',
 'Title=Mrs. Martin (Elizabeth L',
 'Title=Ms',
 'Title=Rev',
 'Title=Sir',
 'Title=the Countess']
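
As a small illustration of how the feature list above is produced, the sketch below runs `DictVectorizer` on two made-up records (`toy_records`, `toy_dv`, `encoded`, and `unseen` are hypothetical names). It also shows why the notebook fits on train and test together: a category not seen during `fit` is silently dropped at `transform` time.

```python
from sklearn.feature_extraction import DictVectorizer

toy_records = [
    {"Age": 22.0, "Title": "Mr"},
    {"Age": 38.0, "Title": "Mrs"},
]

# sparse=False returns a dense NumPy array, which is easier to inspect.
toy_dv = DictVectorizer(sparse=False)
encoded = toy_dv.fit_transform(toy_records)

# Numeric fields keep their name; string fields become "name=value" columns.
print(toy_dv.feature_names_)
print(encoded)

# "Dr" was never seen during fit, so no Title column is set for this record.
unseen = toy_dv.transform([{"Age": 30.0, "Title": "Dr"}])
print(unseen)
```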

In [20]:
from sklearn.model_selection import train_test_split
train_X, valid_X, train_y, valid_y = train_test_split(dv.transform(titanic_df[feature_names].to_dict(orient='records')),
                                                     titanic_df['Survived'],
                                                     test_size=0.2,
                                                     random_state=42)

In [21]:
import xgboost as xgb



In [22]:
train_X.todense()

matrix([[  45.5   ,   28.5   ,    0.    , ...,    0.    ,    0.    ,    0.    ],
        [  23.    ,   13.    ,    0.    , ...,    0.    ,    0.    ,    0.    ],
        [  32.    ,    7.925 ,    0.    , ...,    0.    ,    0.    ,    0.    ],
        ..., 
        [  41.    ,   14.1083,    0.    , ...,    0.    ,    0.    ,    0.    ],
        [  14.    ,  120.    ,    2.    , ...,    0.    ,    0.    ,    0.    ],
        [  21.    ,   77.2875,    1.    , ...,    0.    ,    0.    ,    0.    ]])

In [23]:
dtrain = xgb.DMatrix(data=train_X.todense(), feature_names=dv.feature_names_, label=train_y)
dvalid = xgb.DMatrix(data=valid_X.todense(), feature_names=dv.feature_names_, label=valid_y)

In [24]:
xgb_clf = xgb.train({'max_depth':20, 'eta':0.1, 'objective':'binary:logistic', 'eval_metric': 'error'}, 
                    num_boost_round=3000,
                    dtrain=dtrain,
                    verbose_eval=True, 
                    early_stopping_rounds=30,
                    evals=[(dtrain, 'train'), (dvalid, 'valid')])

[0]	train-error:0.116573	valid-error:0.184358
Multiple eval metrics have been passed: 'valid-error' will be used for early stopping.

Will train until valid-error hasn't improved in 30 rounds.
[1]	train-error:0.109551	valid-error:0.162011
[2]	train-error:0.11236	valid-error:0.173184
[3]	train-error:0.106742	valid-error:0.156425
[4]	train-error:0.105337	valid-error:0.167598
[5]	train-error:0.102528	valid-error:0.173184
[6]	train-error:0.102528	valid-error:0.178771
[7]	train-error:0.102528	valid-error:0.184358
[8]	train-error:0.099719	valid-error:0.178771
[9]	train-error:0.099719	valid-error:0.173184
[10]	train-error:0.098315	valid-error:0.167598
[11]	train-error:0.09691	valid-error:0.167598
[12]	train-error:0.099719	valid-error:0.167598
[13]	train-error:0.101124	valid-error:0.173184
[14]	train-error:0.102528	valid-error:0.156425
[15]	train-error:0.098315	valid-error:0.167598
[16]	train-error:0.09691	valid-error:0.167598
[17]	train-error:0.09691	valid-error:0.167598
[18]	train-error:0.09

## Submitting the File

In [25]:
test_df['Fare'] = test_df['Fare'].fillna(0)

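Remember that sklearn works with numeric matrices (NumPy arrays or SciPy sparse matrices), right? So we transform the test set with the same `dv` we fitted earlier.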

In [26]:
test_X = dv.transform(test_df[feature_names].to_dict(orient='records'))
print(test_X.shape)

(418, 25)


In [27]:
dtest = xgb.DMatrix(data=test_X.todense(), feature_names=dv.feature_names_)

In [28]:
y_pred = np.round(xgb_clf.predict(dtest)).astype(int)

Great! We now have what we needed. The next step is to package the predictions into a CSV file and submit it to Kaggle.

In [29]:
submission_df = pd.DataFrame()

In [30]:
submission_df['PassengerId'] = test_df['PassengerId']
submission_df['Survived'] = y_pred
submission_df

| | PassengerId | Survived |
|---|---|---|
| 0 | 892 | 0 |
| 1 | 893 | 0 |
| 2 | 894 | 0 |
| 3 | 895 | 0 |
| 4 | 896 | 1 |
| 5 | 897 | 0 |
| 6 | 898 | 1 |
| 7 | 899 | 0 |
| 8 | 900 | 1 |
| 9 | 901 | 0 |


In [31]:
submission_df.to_csv('xgboost_model.csv', index=False)

Please write down here for reference: what was your model's score on the training set? And on the validation set? What was your score on the Kaggle submission?
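
One way to get the training and validation scores is to threshold the predicted probabilities at 0.5, as done for `y_pred`, and compare them with the labels. The helper below is a minimal sketch using toy arrays; `accuracy_from_probabilities` is a hypothetical name, not part of XGBoost. The `error` metric printed in the training log is one minus this accuracy.

```python
import numpy as np

def accuracy_from_probabilities(probabilities, labels, threshold=0.5):
    # Probabilities above the threshold count as class 1 (survived).
    predictions = (np.asarray(probabilities) > threshold).astype(int)
    return float(np.mean(predictions == np.asarray(labels)))

# Toy values for illustration only.
toy_probabilities = [0.9, 0.2, 0.6, 0.4]
toy_labels = [1, 0, 0, 0]
toy_accuracy = accuracy_from_probabilities(toy_probabilities, toy_labels)
print(toy_accuracy)

# With the notebook's objects this would be, for example:
# accuracy_from_probabilities(xgb_clf.predict(dtrain), train_y)
# accuracy_from_probabilities(xgb_clf.predict(dvalid), valid_y)
```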