# Intelligent Systems Mini Project
## Decision Trees - Solving the Titanic Problem
### Group:
- FNAP - Felipe Nunes de Almeida Pereira
- GME - Gabriel de Melo Evangelista
- JPSPM - João Pedro de Souza Pereira Moura
- MLLL - Maria Luísa Leandro de Lima
- WISS - Washington Igor Santos Silva

https://www.kaggle.com/code/shimjh/titanic-assignment

## Imports


First, we import the libraries and tools the code needs.

In [None]:
import numpy as np
import pandas as pd

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay
from sklearn.metrics import recall_score, precision_score, f1_score, classification_report

import matplotlib.pyplot as plt


In [None]:
from google.colab import drive
drive.mount('/content/drive', force_remount=True)

Mounted at /content/drive


## Reading the data

The data is read from a mounted Google Drive.

In [None]:
# Load the Kaggle Titanic CSVs and keep untouched copies for the feature engineering done later
data_train = pd.read_csv('/content/drive/MyDrive/train.csv')
data_test = pd.read_csv('/content/drive/MyDrive/test.csv')
data_train_FA = data_train.copy()
data_test_FA = data_test.copy()

In [None]:
data_train.shape

(891, 12)

In [None]:
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
data_test.shape

(418, 11)

In [None]:
data_test.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


## Data preparation

Some values are missing in both datasets (`Age`, `Cabin`, and `Embarked` in train; `Age`, `Fare`, and `Cabin` in test). The missing numeric values are filled with the column mean.

In [None]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


Replacing the NaN entries in the age columns (and in the test-set fare) with the column mean.

In [None]:
data_train['Age'] = data_train['Age'].fillna(data_train['Age'].mean())
data_test['Age'] = data_test['Age'].fillna(data_test['Age'].mean())
data_test['Fare'] = data_test['Fare'].fillna(data_test['Fare'].mean())

In [None]:
data_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          891 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [None]:
data_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          418 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         418 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB


In addition, the categorical variables are converted from strings to integers.

In [None]:
# Encode the categorical columns as integers: passenger sex becomes 0/1 and the
# port of embarkation becomes 0, 1, or 2, which makes the DataFrames easier to use
sex_codes = {'male': 0, 'female': 1}
embarked_codes = {'Q': 0, 'S': 1, 'C': 2}
for df in (data_train, data_test):
  df['Sex'] = df['Sex'].map(sex_codes)
  df['Embarked'] = df['Embarked'].map(embarked_codes)

In [None]:
# Drop the columns that will not be used, then drop any row that still has an invalid (NaN) value
data_train.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)
data_train.dropna(axis=0, inplace=True)
data_test.drop(['Cabin', 'Ticket', 'Name'], axis=1, inplace=True)
data_test.dropna(axis=0, inplace=True)

In [None]:
data_train.head()

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,1,0,3,0,22.0,1,0,7.25,1.0
1,2,1,1,1,38.0,1,0,71.2833,2.0
2,3,1,3,1,26.0,0,0,7.925,1.0
3,4,1,1,1,35.0,1,0,53.1,1.0
4,5,0,3,0,35.0,0,0,8.05,1.0


In [None]:
data_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,892,3,0,34.5,0,0,7.8292,0
1,893,3,1,47.0,1,0,7.0,1
2,894,2,0,62.0,0,0,9.6875,0
3,895,3,0,27.0,0,0,8.6625,1
4,896,3,1,22.0,1,1,12.2875,1


### Normalization

The datasets are split into X and y, and the X features are scaled to [0, 1] with `MinMaxScaler` to help training.

In [None]:
scaler = MinMaxScaler()

In [None]:
data_test.columns

Index(['PassengerId', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare',
       'Embarked'],
      dtype='object')

In [None]:
# Build new DataFrames with the scaled features, dropping PassengerId and Survived:
# the test set has no Survived column, since that is the target the decision tree will predict.
# Note: fit_transform refits the scaler on the test set; scaler.transform would reuse the training range instead.
data_X_train = pd.DataFrame(scaler.fit_transform(data_train.drop(['Survived', 'PassengerId'], axis = 1)))
data_X_test = pd.DataFrame(scaler.fit_transform(data_test.drop(['PassengerId'], axis = 1)))
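To make the scaling step concrete, the following is a minimal, library-free sketch of min-max scaling, $x' = (x - \min) / (\max - \min)$. The helpers `fit_min_max` and `apply_min_max` are hypothetical names introduced only for this illustration (they are not scikit-learn functions), and the sample ages are the first values shown in the tables above. The sketch learns the range from the training column and reuses it for the test column, which is one common convention and differs from the cell above, where the scaler is refit on the test set.

```python
# Library-free sketch of min-max scaling. `fit_min_max` and `apply_min_max`
# are hypothetical helpers for illustration, not scikit-learn functions.

def fit_min_max(column):
    """Learn the scaling range (min, max) from one column."""
    return min(column), max(column)

def apply_min_max(column, lo, hi):
    """Map each value to (x - lo) / (hi - lo); constant columns map to 0.0."""
    span = hi - lo
    return [(x - lo) / span if span else 0.0 for x in column]

train_age = [22.0, 38.0, 26.0, 35.0]  # first ages in train.csv, as shown above
test_age = [34.5, 47.0]               # first ages in test.csv, as shown above

lo, hi = fit_min_max(train_age)                  # range learned from train only
train_scaled = apply_min_max(train_age, lo, hi)
test_scaled = apply_min_max(test_age, lo, hi)    # may fall outside [0, 1]

print(train_scaled)
print(test_scaled)
```

With a range learned only from training data, a test value above the training maximum scales to more than 1, which is expected and keeps both sets on the same scale.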

In [None]:
data_X_train.describe()

Unnamed: 0,0,1,2,3,4,5,6
count,889.0,889.0,889.0,889.0,889.0,889.0,889.0
mean,0.655793,0.350956,0.367347,0.065523,0.063742,0.062649,0.551181
std,0.41735,0.477538,0.16296,0.137963,0.13446,0.097003,0.25759
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.5,0.0,0.271174,0.0,0.0,0.015412,0.5
50%,1.0,0.0,0.367921,0.0,0.0,0.028213,0.5
75%,1.0,1.0,0.434531,0.125,0.0,0.060508,0.5
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
data_X_test.describe()

Unnamed: 0,0,1,2,3,4,5,6
count,418.0,418.0,418.0,418.0,418.0,418.0,418.0
mean,0.632775,0.363636,0.396975,0.055921,0.043594,0.06954,0.566986
std,0.420919,0.481622,0.166617,0.112095,0.109048,0.108993,0.290226
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,0.0,0.0,0.301068,0.0,0.0,0.015412,0.5
50%,1.0,0.0,0.396975,0.0,0.0,0.028213,0.5
75%,1.0,1.0,0.469207,0.125,0.0,0.061484,0.5
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [None]:
data_Y_train = data_train.drop('PassengerId', axis = 1)['Survived']

In [None]:
data_Y_train.head()

0    0
1    1
2    1
3    1
4    0
Name: Survived, dtype: int64

Since the test dataset does not include the outputs (y_test), a partial test set is carved out of the training dataset. It is used to evaluate model performance in the experiments.

In [None]:
data_X_train, data_X_testT, data_Y_train, data_Y_testT = train_test_split(data_X_train, data_Y_train, test_size = 0.2)
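As a rough illustration of what a hold-out split does, here is a small standard-library sketch. `holdout_split` and its `seed` parameter are hypothetical names for this example only. It rounds the test size up, which is assumed here because it reproduces the 711 training rows reported below for 889 samples. Note that the call above passes no `random_state`, so its split changes between runs; fixing a seed, as this sketch does, makes the experiments repeatable.

```python
import math
import random

def holdout_split(rows, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(rows) * test_size)  # assumption: test size rounds up
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]

rows = list(range(889))  # stand-in for the 889 rows left after dropna
train_rows, test_rows = holdout_split(rows)

print(len(train_rows), len(test_rows))
```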

The classes are slightly imbalanced, so the variable `weight` is created to store the ratio between the majority and minority class counts. It is used later.

In [None]:
data_Y_train.value_counts()

0    451
1    260
Name: Survived, dtype: int64

In [None]:
weight = data_Y_train.value_counts()[0]/data_Y_train.value_counts()[1]
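The ratio can be checked by hand with the class counts printed above (451 and 260). The sketch below uses only the standard library; the `class_weight` dictionary at the end is one plausible way such a ratio could later be passed to a classifier, shown here as an assumption rather than as the notebook's exact usage.

```python
from collections import Counter

# Rebuild the label distribution from the counts printed above.
labels = [0] * 451 + [1] * 260
counts = Counter(labels)

# most_common sorts by frequency, so the majority class comes first.
(majority, n_major), (minority, n_minor) = counts.most_common(2)
weight = n_major / n_minor

# Assumed usage: weigh the minority class by the imbalance ratio.
class_weight = {majority: 1.0, minority: weight}

print(round(weight, 3))
```

Because the hold-out split is random, the counts (and therefore the ratio) vary slightly from run to run.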

In [None]:
results = pd.DataFrame()

## Basic model with the original columns

A basic model using the original dataset columns is created and trained as a baseline for the analyses that follow.

For every model we use `cross_val_score`, which splits the data into `cv` partitions and, for each one, trains the model on the other `cv - 1` partitions and validates on the remaining one. We report the mean and standard deviation of the resulting scores.

In addition, we fit the model on the training split and score it on the partial test set held out from the training data.

In [None]:
model = DecisionTreeClassifier()
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))

Mean: 0.7848787167449139|| Std: 0.03938877777976616
Score do modelo: 0.7808988764044944
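The partitioning behind cross-validation can be sketched without any library. `k_fold_indices` and `mean_and_std` below are hypothetical helpers written for this illustration. The sketch uses plain contiguous folds, whereas `cross_val_score` with a classifier and an integer `cv` uses stratified folds by default, so this shows the idea rather than the exact splits.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; every sample is validated exactly once."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, val_idx
        start += size

def mean_and_std(scores):
    """Mean and population standard deviation, as np.average and np.std report."""
    mean = sum(scores) / len(scores)
    variance = sum((s - mean) ** 2 for s in scores) / len(scores)
    return mean, variance ** 0.5

folds = list(k_fold_indices(711, 10))  # 711 training rows, cv=10
print([len(val) for _, val in folds])
```

Each of the 10 scores comes from a model that never saw its validation fold, which is why the mean is a less optimistic estimate than the score on the data used for fitting.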


For the best models of each experiment, a *Kaggle* submission is generated from the predictions on the test set, and the passenger names are displayed next to the predicted `Survived` value.

In [None]:
y_pred = model.predict(data_X_test)
output = pd.DataFrame({"PassengerId": data_test["PassengerId"], "Survived": y_pred})
output.to_csv("Submission.csv", index=False)
test = pd.read_csv('/content/drive/MyDrive/test.csv')
nomes = test[['PassengerId', 'Name']]
nomes.set_index('PassengerId').join(output.set_index('PassengerId'), how = 'right')

Unnamed: 0_level_0,Name,Survived
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
892,"Kelly, Mr. James",0
893,"Wilkes, Mrs. James (Ellen Needs)",0
894,"Myles, Mr. Thomas Francis",1
895,"Wirz, Mr. Albert",0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1
...,...,...
1305,"Spector, Mr. Woolf",0
1306,"Oliva y Ocana, Dona. Fermina",1
1307,"Saether, Mr. Simon Sivertsen",0
1308,"Ware, Mr. Frederick",0


### Submitting the first test

Installing the Kaggle API:

In [None]:
!pip install kaggle



First, go to your Kaggle profile and create a new API key. This generates a JSON file that must be saved locally. Select that file when running the cell below, which authorizes the current Colab session to use the Kaggle API.

In [None]:
from google.colab import files

uploaded = files.upload()

for fn in uploaded.keys():
  print('User uploaded file "{name}" with length {length} bytes'.format(
      name=fn, length=len(uploaded[fn])))
  
# Then move kaggle.json into the folder where the API expects to find it.
!mkdir -p ~/.kaggle/ && mv kaggle.json ~/.kaggle/ && chmod 600 ~/.kaggle/kaggle.json

Saving kaggle.json to kaggle.json
User uploaded file "kaggle.json" with length 74 bytes


Submitting the test predictions with the Kaggle API

In [None]:
!kaggle competitions submit -c titanic -f Submission.csv -m "Basic Decision Tree Model"

100% 2.77k/2.77k [00:04<00:00, 581B/s]
Successfully submitted to Titanic - Machine Learning from Disaster

In [None]:
!kaggle competitions submissions -v -c titanic > "submissions.txt"

In [None]:
fileRead = pd.read_csv('submissions.txt')

In [None]:
results = pd.concat([results, fileRead[fileRead['description'] == "Basic Decision Tree Model"][['description', 'publicScore']]])

In [None]:
print(fileRead[fileRead['description'] == "Basic Decision Tree Model"][['description', 'publicScore']])

                 description  publicScore
0  Basic Decision Tree Model      0.70334


Retrieving the submission score, we see that the basic model reached a public score of 0.70334 (70.3%).

## Data extraction and creation of new columns

In this part the data is processed to create new columns, so that performance can be compared against the original columns.

In [None]:
data = pd.concat([data_train_FA, data_test_FA])

First, a new column with the cabin type is created from the first letter of each passenger's cabin. The letters are then mapped to integers; missing cabins, which become the string `nan` and therefore the letter `n`, are mapped to 0.

In [None]:
data['Cabin_type'] = data["Cabin"].astype(str).str[0]

In [None]:
data['Cabin_type'].value_counts()

n    1014
C      94
B      65
D      46
E      41
A      22
F      21
G       5
T       1
Name: Cabin_type, dtype: int64

In [None]:
data['Cabin_type'] = data['Cabin_type'].map({'A':1,'B':2,'C':3,'D':4,'E':5,'F':6,'G':7,'T':8, 'n':0})

In [None]:
data['Cabin_type']

0      0
1      3
2      0
3      3
4      0
      ..
413    0
414    3
415    0
416    0
417    0
Name: Cabin_type, Length: 1309, dtype: int64

A passenger's social status may influence survival. To analyze this, a `status` column is created from the title in the passenger's name.

In [None]:
data['status'] = data['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)
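To see what the pattern captures, the same regular expression can be tried on a few names from the table above using only the standard library. `extract_title` is a hypothetical helper for this illustration. It returns the first run of letters that is immediately followed by a period, which in these names is the title.

```python
import re

# Same pattern as in the cell above: letters immediately followed by a period.
TITLE_PATTERN = re.compile(r'([A-Za-z]+)\.')

def extract_title(name):
    """Return the first word that ends with a period, or None if there is none."""
    match = TITLE_PATTERN.search(name)
    return match.group(1) if match else None

names = [
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Oliva y Ocana, Dona. Fermina",
]
titles = [extract_title(n) for n in names]
print(titles)
```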

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_type,status
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0,Mr
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3,Mrs
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,Miss
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,3,Mrs
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0,Mr
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,0,Mr
414,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,3,Dona
415,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,Mr
416,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,0,Mr


In [None]:
# Check which titles are present
data['status'].unique()

array(['Mr', 'Mrs', 'Miss', 'Master', 'Don', 'Rev', 'Dr', 'Mme', 'Ms',
       'Major', 'Lady', 'Sir', 'Mlle', 'Col', 'Capt', 'Countess',
       'Jonkheer', 'Dona'], dtype=object)

The variability of the `status` column is reduced by replacing the titles with one of three groups, 'important', 'Miss', or 'Mrs', depending on the passenger's social standing. Note that `Mr` and `Mrs` end up in the same group.

In [None]:
data['status'] = data['status'].replace(['Master', 'Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir', 'Col', 'Capt', 'Countess', 'Jonkheer'], 'important')

In [None]:
data['status'] = data['status'].replace(['Mlle', 'Ms', 'Miss', 'Mme'], 'Miss')
data['status'] = data['status'].replace(['Mr', 'Mrs'], 'Mrs')

In [None]:
data['status'] = data['status'].map({'important': 1, 'Miss': 2, 'Mrs': 3})
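The three replacement steps above amount to a lookup from title to integer code. A compact sketch of that lookup follows; `encode_status` and the set names are hypothetical and exist only for this illustration, while the titles and codes are copied from the cells above.

```python
# Title groups copied from the cells above; the names of these sets are hypothetical.
IMPORTANT = {'Master', 'Don', 'Dona', 'Rev', 'Dr', 'Major', 'Lady', 'Sir',
             'Col', 'Capt', 'Countess', 'Jonkheer'}
MISS = {'Mlle', 'Ms', 'Miss', 'Mme'}
MRS = {'Mr', 'Mrs'}  # both titles share one group in the notebook

def encode_status(title):
    """Map a title to the integer code used above (1, 2, or 3); None if unknown."""
    if title in IMPORTANT:
        return 1
    if title in MISS:
        return 2
    if title in MRS:
        return 3
    return None

codes = [encode_status(t) for t in ['Mr', 'Mrs', 'Miss', 'Dona']]
print(codes)
```

Returning `None` for an unseen title makes gaps visible instead of silently assigning a group.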

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_type,status
0,1,0.0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S,0,3
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C,3,3
2,3,1.0,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,2
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S,3,3
4,5,0.0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",male,,0,0,A.5. 3236,8.0500,,S,0,3
414,1306,,1,"Oliva y Ocana, Dona. Fermina",female,39.0,0,0,PC 17758,108.9000,C105,C,3,1
415,1307,,3,"Saether, Mr. Simon Sivertsen",male,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,3
416,1308,,3,"Ware, Mr. Frederick",male,,0,0,359309,8.0500,,S,0,3


The sex column is also encoded as integers.

In [None]:
# Encode sex as 0 (female) and 1 (male)
data['Sex'] = data['Sex'].map({'female':0, 'male':1})
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_type,status
0,1,0.0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,,S,0,3
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,3,3
2,3,1.0,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,2
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,C123,S,3,3
4,5,0.0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,,S,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",1,,0,0,A.5. 3236,8.0500,,S,0,3
414,1306,,1,"Oliva y Ocana, Dona. Fermina",0,39.0,0,0,PC 17758,108.9000,C105,C,3,1
415,1307,,3,"Saether, Mr. Simon Sivertsen",1,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,3
416,1308,,3,"Ware, Mr. Frederick",1,,0,0,359309,8.0500,,S,0,3


A new column with each passenger's family size is created: siblings/spouses plus parents/children, plus the passenger.

In [None]:
data['familiaT'] = data['SibSp'] + data['Parch'] + 1
data

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Cabin_type,status,familiaT
0,1,0.0,3,"Braund, Mr. Owen Harris",1,22.0,1,0,A/5 21171,7.2500,,S,0,3,2
1,2,1.0,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",0,38.0,1,0,PC 17599,71.2833,C85,C,3,3,2
2,3,1.0,3,"Heikkinen, Miss. Laina",0,26.0,0,0,STON/O2. 3101282,7.9250,,S,0,2,1
3,4,1.0,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",0,35.0,1,0,113803,53.1000,C123,S,3,3,2
4,5,0.0,3,"Allen, Mr. William Henry",1,35.0,0,0,373450,8.0500,,S,0,3,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
413,1305,,3,"Spector, Mr. Woolf",1,,0,0,A.5. 3236,8.0500,,S,0,3,1
414,1306,,1,"Oliva y Ocana, Dona. Fermina",0,39.0,0,0,PC 17758,108.9000,C105,C,3,1,1
415,1307,,3,"Saether, Mr. Simon Sivertsen",1,38.5,0,0,SOTON/O.Q. 3101262,7.2500,,S,0,3,1
416,1308,,3,"Ware, Mr. Frederick",1,,0,0,359309,8.0500,,S,0,3,1


Missing values in the `Fare` column are filled with the column median, and the column is then binned into four classes.

In [None]:
data['Fare'] = data['Fare'].fillna(data['Fare'].median())
data.loc[ data['Fare'] <= 7.91, 'Fare'] = 0
data.loc[(data['Fare'] > 7.91) & (data['Fare'] <= 14.454), 'Fare'] = 1
data.loc[(data['Fare'] > 14.454) & (data['Fare'] <= 31), 'Fare'] = 2
data.loc[ data['Fare'] > 31, 'Fare'] = 3
data['Fare'] = data['Fare'].astype(int)
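The four `loc` assignments above implement a threshold lookup. As a sketch, the same classes can be obtained with `bisect_left`, which counts how many bin edges lie strictly below a value. `fare_class` and `FARE_EDGES` are hypothetical names, and the sample fares are the first five from the training table.

```python
from bisect import bisect_left

# Upper edges of classes 0, 1, and 2; anything above the last edge is class 3.
FARE_EDGES = [7.91, 14.454, 31]

def fare_class(fare):
    """Return the fare class 0-3: the number of edges strictly below `fare`."""
    return bisect_left(FARE_EDGES, fare)

fares = [7.25, 71.2833, 7.925, 53.1, 8.05]  # first five fares in train.csv
fare_classes = [fare_class(f) for f in fares]
print(fare_classes)
```

A value equal to an edge stays in the lower class, matching the `<=` comparisons in the cell above.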

Columns that are no longer needed are removed.

In [None]:
# Drop the columns that will not be used
data.drop(['Name', 'SibSp', 'Parch', 'Ticket', 'Cabin', 'Embarked'], axis = 1, inplace = True)

In [None]:
data

Unnamed: 0,PassengerId,Survived,Pclass,Sex,Age,Fare,Cabin_type,status,familiaT
0,1,0.0,3,1,22.0,0,0,3,2
1,2,1.0,1,0,38.0,3,3,3,2
2,3,1.0,3,0,26.0,1,0,2,1
3,4,1.0,1,0,35.0,3,3,3,2
4,5,0.0,3,1,35.0,1,0,3,1
...,...,...,...,...,...,...,...,...,...
413,1305,,3,1,,1,0,3,1
414,1306,,1,0,39.0,3,3,1,1
415,1307,,3,1,38.5,0,0,3,1
416,1308,,3,1,,1,0,3,1


Missing ages are filled with random integers drawn from [mean - standard deviation, mean + standard deviation); the upper bound is excluded.

In [None]:
age_mean, age_std = data['Age'].mean(), data['Age'].std()
data.loc[data['Age'].isna(), 'Age'] = np.random.randint(int(age_mean - age_std), int(age_mean + age_std), size = data['Age'].isna().sum())
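The imputation rule can be sketched with the standard library alone. `impute_ages` and its `seed` parameter are hypothetical and introduced only for this example. It uses the sample standard deviation, which is also the pandas default, and a seeded generator so the result is repeatable, unlike the unseeded call above.

```python
import random
import statistics

def impute_ages(ages, seed=0):
    """Replace None with random integers in [mean - std, mean + std)."""
    observed = [a for a in ages if a is not None]
    mean = statistics.mean(observed)
    std = statistics.stdev(observed)  # sample standard deviation (ddof=1)
    low, high = int(mean - std), int(mean + std)
    rng = random.Random(seed)
    filled = [a if a is not None else rng.randrange(low, high) for a in ages]
    return filled, (low, high)

# A few ages taken from the tables above; None marks a missing value.
ages = [22.0, 38.0, 26.0, 35.0, 35.0, None, 39.0, 38.5, None]
filled, (low, high) = impute_ages(ages)
print(filled)
```

Drawing from a band around the mean keeps some spread in the imputed ages, whereas filling with a single value would pile every missing passenger into one age class.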

The age column is then binned into classes.

In [None]:
data.loc[ data['Age'] <= 16, 'Age'] = 0
data.loc[(data['Age'] > 16) & (data['Age'] <= 32), 'Age'] = 1
data.loc[(data['Age'] > 32) & (data['Age'] <= 48), 'Age'] = 2
data.loc[(data['Age'] > 48) & (data['Age'] <= 64), 'Age'] = 3
data.loc[ data['Age'] > 64, 'Age'] = 4
data['Age'] = data['Age'].astype(int)

In [None]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Sex          1309 non-null   int64  
 4   Age          1309 non-null   int64  
 5   Fare         1309 non-null   int64  
 6   Cabin_type   1309 non-null   int64  
 7   status       1309 non-null   int64  
 8   familiaT     1309 non-null   int64  
dtypes: float64(1), int64(8)
memory usage: 102.3 KB


In [None]:
# Build the train DataFrame by dropping every row of data that contains a NaN
train = data.dropna()
train.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    float64
 2   Pclass       891 non-null    int64  
 3   Sex          891 non-null    int64  
 4   Age          891 non-null    int64  
 5   Fare         891 non-null    int64  
 6   Cabin_type   891 non-null    int64  
 7   status       891 non-null    int64  
 8   familiaT     891 non-null    int64  
dtypes: float64(1), int64(8)
memory usage: 69.6 KB


In [None]:
# Build the test DataFrame by keeping only the rows whose Survived column is NaN
test = data[data.Survived.isna()]
test.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 418 entries, 0 to 417
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Survived     0 non-null      float64
 2   Pclass       418 non-null    int64  
 3   Sex          418 non-null    int64  
 4   Age          418 non-null    int64  
 5   Fare         418 non-null    int64  
 6   Cabin_type   418 non-null    int64  
 7   status       418 non-null    int64  
 8   familiaT     418 non-null    int64  
dtypes: float64(1), int64(8)
memory usage: 32.7 KB


The data is split into X and y for training, and X only for the test set.

In [None]:
X_train = train.drop(['PassengerId', 'Survived'], axis=1)
y_train = train['Survived']
X_test = test.drop(['PassengerId', 'Survived'], axis=1)

In [None]:
X_train, X_testT, y_train, y_testT = train_test_split(X_train, y_train, test_size = 0.2)

In [None]:
X_train

Unnamed: 0,Pclass,Sex,Age,Fare,Cabin_type,status,familiaT
9,2,0,0,2,0,3,2
576,2,0,2,1,0,2,1
167,3,0,2,2,0,3,6
22,3,0,0,1,0,2,1
152,3,1,3,1,0,3,1
...,...,...,...,...,...,...,...
872,1,1,2,0,2,3,1
401,3,1,1,1,0,3,1
214,3,1,1,0,0,3,2
883,2,1,1,1,0,3,1


In [None]:
y_train

9      1.0
576    1.0
167    0.0
22     1.0
152    0.0
      ... 
872    0.0
401    0.0
214    0.0
883    0.0
426    1.0
Name: Survived, Length: 712, dtype: float64

## New model with the new columns

With the new columns in place, a basic model is created.

In [None]:
modelFA = DecisionTreeClassifier()
modelFA.fit(X_train, y_train)
score = cross_val_score(modelFA, X_train, y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
modelFA.fit(X_train, y_train)
print("Score do modelo: " + str(modelFA.score(X_testT, y_testT)))

Mean: 0.7822769953051644|| Std: 0.04329458870454461
Score do modelo: 0.8659217877094972


The model with the new columns reached a `cross_val_score` mean (0.782) close to that of the original model (0.785) and a higher partial test score (0.866 versus 0.781), which suggests that the new columns may have helped the model.

A submission is made with this new model.

In [None]:
y_predFA = modelFA.predict(X_test)
output = pd.DataFrame({"PassengerId": data_test["PassengerId"], "Survived": y_predFA.astype(int)})
output.to_csv("Submission.csv", index=False)
test = pd.read_csv('/content/drive/MyDrive/test.csv')
nomes = test[['PassengerId', 'Name']]
nomes.set_index('PassengerId').join(output.set_index('PassengerId'), how = 'right')

Unnamed: 0_level_0,Name,Survived
PassengerId,Unnamed: 1_level_1,Unnamed: 2_level_1
892,"Kelly, Mr. James",0
893,"Wilkes, Mrs. James (Ellen Needs)",1
894,"Myles, Mr. Thomas Francis",0
895,"Wirz, Mr. Albert",0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0
...,...,...
1305,"Spector, Mr. Woolf",0
1306,"Oliva y Ocana, Dona. Fermina",1
1307,"Saether, Mr. Simon Sivertsen",0
1308,"Ware, Mr. Frederick",0


Submitting the model to Kaggle

In [None]:
!kaggle competitions submit -c titanic -f Submission.csv -m "Basic Decision Tree Model with new Columns"

100% 2.77k/2.77k [00:05<00:00, 537B/s]
Successfully submitted to Titanic - Machine Learning from Disaster

Retrieving the submission score

In [None]:
!kaggle competitions submissions -v -c titanic > "submissions.txt"

In [None]:
fileRead = pd.read_csv('submissions.txt')

In [None]:
results = pd.concat([results, fileRead[fileRead['description'] == "Basic Decision Tree Model with new Columns"][['description', 'publicScore']]])

In [None]:
print(fileRead[(fileRead['description'] == "Basic Decision Tree Model with new Columns") | (fileRead['description'] == "Basic Decision Tree Model")][['description', 'publicScore']])

                                  description  publicScore
3  Basic Decision Tree Model with new Columns      0.74401
4                   Basic Decision Tree Model      0.70334


The behavior repeats on Kaggle: this model scored higher (0.74401) than the model with the original columns (0.70334), which indicates that the information extracted from the other columns was valuable to the model.

## Experimentation

- criterion: $gini$ or $entropy$

- splitter: $best$ or $random$

- min_samples_split: $[2, 4, ..., 10]$

- max_features: $[1, 2, ..., 7, auto, sqrt, log2]$

- min_impurity_decrease: $[0, 0.0001, ..., 0.0009]$

- class_weight: $\{0: 1, 1: 1.6\}$

- ccp_alpha: values defined by the model


The 3 models with the highest score will be submitted to Kaggle.

### Criterion
Gini or Entropy

The measure used to decide whether a branch split is good.
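For reference, both criteria measure how mixed the class labels in a node are: Gini impurity is $1 - \sum_k p_k^2$ and entropy is $-\sum_k p_k \log_2 p_k$, where $p_k$ is the fraction of samples of class $k$. The following is a minimal sketch of the two formulas; the function names are chosen for this illustration and are not scikit-learn's internals.

```python
from collections import Counter
from math import log2

def class_probabilities(labels):
    """Fraction of samples belonging to each class present in `labels`."""
    total = len(labels)
    return [count / total for count in Counter(labels).values()]

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2). Zero for a pure node."""
    return 1.0 - sum(p * p for p in class_probabilities(labels))

def entropy(labels):
    """Shannon entropy in bits: -sum(p_k * log2(p_k)). Zero for a pure node."""
    return -sum(p * log2(p) for p in class_probabilities(labels) if p > 0)

mixed = [0, 0, 1, 1]  # perfectly mixed binary node
pure = [1, 1, 1]      # pure node

print(gini(mixed), entropy(mixed))
print(gini(pure), entropy(pure))
```

Both measures are zero for a pure node and largest for an evenly mixed one, so they usually rank candidate splits similarly; small differences like those reported below are common.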

In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='gini')
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))


Mean: 0.7722222222222223|| Std: 0.04444065590324207
Score do modelo: 0.7808988764044944


In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='entropy')
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))

Mean: 0.7849374021909233|| Std: 0.045440728605228596
Score do modelo: 0.7921348314606742


The Entropy criterion gave a higher cross-validation mean (0.785 versus 0.772, with a comparable standard deviation) and a higher partial test score (0.792 versus 0.781), so we use it as a parameter in the next experiments.

### Splitter
Best or Random

The strategy used to actually choose the split at each node, following the chosen criterion.

In [None]:
model = DecisionTreeClassifier(random_state=11, criterion='entropy', splitter='best')
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))


Mean: 0.7792840375586855|| Std: 0.042038159102613264
Score do modelo: 0.797752808988764


In [None]:
model = DecisionTreeClassifier(random_state=11, criterion='entropy', splitter='random')
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))


Mean: 0.7496870109546165|| Std: 0.03944849474687885
Score do modelo: 0.8089887640449438


With the Random splitter we obtained a higher partial test score than with Best (0.809 versus 0.798), although its cross-validation mean was lower (0.750 versus 0.779). We use Random as the splitter parameter from here on.

### Maximum number of features used
[1, 2, 3, 4, 5, 6, 7]

Maximum number of features considered when looking for the best split at each node.


In [None]:
for i in range(1, 8):
  model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=i)
  score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
  print("Current max features: " + str(i))
  print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
  model.fit(data_X_train, data_Y_train)
  print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))
  print("-----------------------------------------------------------")

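# Note: 'auto' (equivalent to 'sqrt' for this classifier) was removed in scikit-learn 1.3; drop it on newer versions
for i in ['auto', 'sqrt', 'log2']: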
  model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=i)
  score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
  print("Current max features: " + str(i))
  print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
  model.fit(data_X_train, data_Y_train)
  print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))
  print("-----------------------------------------------------------")

Current max features: 1
Mean: 0.7454029733959311|| Std: 0.03920192186047971
Score do modelo: 0.7415730337078652
-----------------------------------------------------------
Current max features: 2
Mean: 0.7636737089201878|| Std: 0.03098241417477986
Score do modelo: 0.7752808988764045
-----------------------------------------------------------
Current max features: 3
Mean: 0.766510172143975|| Std: 0.032336479545018906
Score do modelo: 0.7696629213483146
-----------------------------------------------------------
Current max features: 4
Mean: 0.7566705790297339|| Std: 0.05044767653126208
Score do modelo: 0.7584269662921348
-----------------------------------------------------------
Current max features: 5
Mean: 0.7637910798122066|| Std: 0.032494200877335254
Score do modelo: 0.7696629213483146
-----------------------------------------------------------
Current max features: 6
Mean: 0.7652190923317683|| Std: 0.040563658764910866
Score do modelo: 0.7808988764044944
--------------------------

The best partial test score was obtained for the value 7, which equals the number of original features of the model.

### Minimum samples required to split
[2, 4, ..., 10]

Minimum number of samples a node must hold before it can be split into branches.

In [None]:
for i in range(2, 11, 2):
  model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=i)
  score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
  print("Current min samples for split: " + str(i))
  print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
  model.fit(data_X_train, data_Y_train)
  print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))
  print("-----------------------------------------------------------")

Current min samples for split: 2
Mean: 0.7455203442879499|| Std: 0.04268820649848359
Score do modelo: 0.7921348314606742
-----------------------------------------------------------
Current min samples for split: 4
Mean: 0.782003129890454|| Std: 0.05123397407607084
Score do modelo: 0.7865168539325843
-----------------------------------------------------------
Current min samples for split: 6
Mean: 0.801643192488263|| Std: 0.01644030638018137
Score do modelo: 0.7752808988764045
-----------------------------------------------------------
Current min samples for split: 8
Mean: 0.7946400625978092|| Std: 0.032323057188467626
Score do modelo: 0.7696629213483146
-----------------------------------------------------------
Current min samples for split: 10
Mean: 0.8002347417840376|| Std: 0.041866379803516325
Score do modelo: 0.8089887640449438
-----------------------------------------------------------


With a minimum of 10 samples required to split, the partial-test score rose from 0.7921 (with the value 2) to 0.8090, the highest in this experiment.
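To make the effect of `min_samples_split` more concrete, the sketch below compares the size of two trees. It is only an illustration: the data comes from `make_classification` as a stand-in for the Titanic features, and the names `X_demo`, `y_demo` and `node_counts` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=500, n_features=7, n_informative=4, random_state=13
)

node_counts = {}
for min_split in (2, 10):
    tree = DecisionTreeClassifier(
        random_state=13, criterion='entropy', min_samples_split=min_split
    )
    tree.fit(X_demo, y_demo)
    # tree_.node_count is the total number of nodes (internal nodes + leaves).
    node_counts[min_split] = tree.tree_.node_count
    print("min_samples_split=" + str(min_split) + " -> nodes: " + str(node_counts[min_split]))
```

A larger threshold stops the tree from splitting small nodes, so it usually yields a smaller tree that is less prone to memorizing the training set.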

### Minimum Impurity Decrease

Sets a threshold: a node is split only if the split reduces impurity (that is, makes the resulting branches more homogeneous) by at least this value.

[0, 0.0001, 0.0002, ..., 0.0009]

In [None]:
for i in np.arange(0, 0.001, 0.0001):
  model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=i)
  score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
  print("Current min impurity decrease: " + str(i))
  print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
  model.fit(data_X_train, data_Y_train)
  print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))
  print("-----------------------------------------------------------")

Current min impurity decrease: 0.0
Mean: 0.8002347417840376|| Std: 0.041866379803516325
Score do modelo: 0.8089887640449438
-----------------------------------------------------------
Current min impurity decrease: 0.0001
Mean: 0.7931924882629108|| Std: 0.05059161987510371
Score do modelo: 0.8089887640449438
-----------------------------------------------------------
Current min impurity decrease: 0.0002
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Current min impurity decrease: 0.00030000000000000003
Mean: 0.8016431924882628|| Std: 0.03261559486453357
Score do modelo: 0.797752808988764
-----------------------------------------------------------
Current min impurity decrease: 0.0004
Mean: 0.7918427230046949|| Std: 0.028012143895943205
Score do modelo: 0.8146067415730337
-----------------------------------------------------------
Current min impurity decrease: 0.0005
Mean: 0.796048513

The highest partial-test score (0.8202) was obtained with the value 0.0002, which also kept a high KFold mean (0.7932).
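As a rough illustration of what `min_impurity_decrease` compares against, the sketch below computes the weighted impurity decrease of a single split, following the formula given in the scikit-learn documentation: `N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)`. The helper functions and the toy node are hypothetical, and base-2 entropy is assumed as the impurity measure.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy (base 2) of an array of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def weighted_impurity_decrease(parent, left, right, n_total):
    """Impurity decrease of a split, weighted by the node's share of all samples."""
    n_t = len(parent)
    return (n_t / n_total) * (
        entropy(parent)
        - len(right) / n_t * entropy(right)
        - len(left) / n_t * entropy(left)
    )

# Hypothetical node holding 10 of 100 training samples, split into two pure children.
parent = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
left, right = parent[:6], parent[6:]
decrease = weighted_impurity_decrease(parent, left, right, n_total=100)

# The split is only performed when the decrease reaches the threshold.
print("Decrease: " + str(decrease) + " || Split allowed: " + str(decrease >= 0.0002))
```

Because the decrease is scaled by the fraction of samples in the node, small thresholds such as the ones tested here mostly block splits in nodes that hold very few samples.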

### Data balancing
The classes in the data are imbalanced. A dictionary that compensates for this imbalance during training is passed to the model.

Let's test whether the class balancing provided by the model itself improves its performance.

In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=0.0002, class_weight={0:1, 1:weight})
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
model.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))


Mean: 0.7608763693270736|| Std: 0.03576713043143833
Score do modelo: 0.7471910112359551


As can be seen, balancing caused a large drop in both the KFold mean (from 0.7932 to 0.7609) and the partial-test score (from 0.8202 to 0.7472). It will therefore be discarded.
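The `weight` variable used in the cell above is not defined in this section, so how it was computed is not shown here. One common way to obtain such a value, sketched below under that assumption, is scikit-learn's `'balanced'` heuristic, `n_samples / (n_classes * np.bincount(y))`. The label vector `y_demo` is hypothetical.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical label vector: 60 non-survivors (0) and 40 survivors (1).
y_demo = np.array([0] * 60 + [1] * 40)

# 'balanced' weights follow n_samples / (n_classes * np.bincount(y)).
balanced = compute_class_weight('balanced', classes=np.array([0, 1]), y=y_demo)
manual = len(y_demo) / (2 * np.bincount(y_demo))

# Express the result relative to class 0, in the {0: 1, 1: weight} form.
weight = balanced[1] / balanced[0]
class_weight = {0: 1, 1: weight}
print(class_weight)
```

Passing `class_weight='balanced'` directly to `DecisionTreeClassifier` applies the same heuristic without building the dictionary by hand.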

### Pruning

Pruning is used to reduce the size of the trees and make the model generalize better.

In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=0.0002)
model.fit(data_X_train, data_Y_train)
path = model.cost_complexity_pruning_path(data_X_train, data_Y_train)
alphas = path['ccp_alphas']
alphas

array([0.        , 0.00023298, 0.00027221, 0.0003013 , 0.00039034,
       0.00049449, 0.00057953, 0.00074023, 0.00085229, 0.0009094 ,
       0.0009094 , 0.00108922, 0.00111086, 0.00111086, 0.00117453,
       0.00122471, 0.0012533 , 0.00128209, 0.00136907, 0.00139681,
       0.00141587, 0.00145457, 0.00150324, 0.00161898, 0.00165603,
       0.00167057, 0.001673  , 0.0019935 , 0.00201784, 0.0021251 ,
       0.00220672, 0.00223024, 0.00229687, 0.00242026, 0.0025378 ,
       0.00256881, 0.00315174, 0.0032726 , 0.00334939, 0.00342282,
       0.00370767, 0.00399266, 0.0044673 , 0.00502782, 0.00588176,
       0.0068181 , 0.00712459, 0.00750815, 0.00983498, 0.01223428,
       0.01931044, 0.03331233, 0.03405011, 0.21392471])

In [None]:
for i in alphas:
  model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=0.0002, ccp_alpha=i)
  score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
  print("Alpha value: " + str(i))
  print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
  model.fit(data_X_train, data_Y_train)
  print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))
  print("-----------------------------------------------------------")

Alpha value: 0.0
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Alpha value: 0.00023298036318347573
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Alpha value: 0.0002722143354297853
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Alpha value: 0.0003012991535035125
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Alpha value: 0.000390338987351033
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236
-----------------------------------------------------------
Alpha value: 0.0004944904418357002
Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do m

*Many* alpha values produced similar results, and none improved the partial-test score over the previous model. Pruning will therefore not be used.
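Rather than reading the best alpha off the printed log, the search can also be collected programmatically. The sketch below is one possible way to do this, assuming synthetic data from `make_classification` instead of the Titanic features; `X_demo`, `y_demo`, `cv_means` and `best_alpha` are hypothetical names.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=13
)

base = DecisionTreeClassifier(random_state=13, criterion='entropy')
path = base.cost_complexity_pruning_path(X_demo, y_demo)

# Clip guards against tiny negative values from floating-point rounding,
# unique drops repeated alphas, and the last alpha is skipped because it
# prunes the tree down to a single node.
alphas = np.unique(np.clip(path.ccp_alphas, 0.0, None))[:-1]

cv_means = []
for alpha in alphas:
    tree = DecisionTreeClassifier(random_state=13, criterion='entropy', ccp_alpha=alpha)
    cv_means.append(cross_val_score(tree, X_demo, y_demo, cv=10).mean())

best_alpha = alphas[int(np.argmax(cv_means))]
print("Best alpha: " + str(best_alpha) + " || Mean: " + str(max(cv_means)))
```

Selecting alpha by the cross-validation mean, instead of the held-out score, keeps the partial test set untouched for the final comparison.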

## Random Forest
Comparing the performance of a single tree with that of a forest.

In [None]:
from sklearn.ensemble import RandomForestClassifier

In [None]:
rf = RandomForestClassifier(n_estimators=100, min_samples_split=10, criterion='entropy', random_state=13, min_impurity_decrease=0.)
score = cross_val_score(rf, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
rf.fit(data_X_train, data_Y_train)
print("Score do modelo: " + str(rf.score(data_X_testT, data_Y_testT)))

Mean: 0.8269757433489827|| Std: 0.04284057656393786
Score do modelo: 0.8314606741573034


The basic Random Forest model, with no fine-tuning of its parameters, performed similarly to the best decision tree model (partial-test score of 0.8315 versus 0.8202).

## Final Submission

This section gathers a summary of the improvements obtained in the experiments above, together with the final submission of the predictions of each model (best decision tree, Random Forest, and best decision tree with the new columns).

### Best decision tree model

In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=0.0002)
model.fit(data_X_train, data_Y_train)
score = cross_val_score(model, data_X_train, data_Y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
print("Score do modelo: " + str(model.score(data_X_testT, data_Y_testT)))

Mean: 0.7931924882629109|| Std: 0.035358845663755474
Score do modelo: 0.8202247191011236


In [None]:
y_pred = model.predict(data_X_test)
output = pd.DataFrame({"PassengerId": data_test["PassengerId"], "Survived": y_pred.astype(int)})
output.to_csv("Submission.csv", index=False)
test = pd.read_csv('/content/drive/MyDrive/test.csv')
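# Select the columns with a list: the order is deterministic, and newer pandas versions reject set indexers.
nomes = test[['PassengerId', 'Name']]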
nomes.set_index('PassengerId').join(output.set_index('PassengerId'), how = 'right')

PassengerId,Name,Survived
892,"Kelly, Mr. James",0
893,"Wilkes, Mrs. James (Ellen Needs)",0
894,"Myles, Mr. Thomas Francis",0
895,"Wirz, Mr. Albert",0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0
...,...,...
1305,"Spector, Mr. Woolf",0
1306,"Oliva y Ocana, Dona. Fermina",1
1307,"Saether, Mr. Simon Sivertsen",0
1308,"Ware, Mr. Frederick",0


In [None]:
!kaggle competitions submit -c titanic -f Submission.csv -m "Best Decision Tree Model"

100% 2.77k/2.77k [00:04<00:00, 588B/s]
Successfully submitted to Titanic - Machine Learning from Disaster

In [None]:
!kaggle competitions submissions -v -c titanic > "submissions.txt"

In [None]:
fileRead = pd.read_csv('submissions.txt')

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
results = pd.concat([results, fileRead[fileRead['description'] == "Best Decision Tree Model"][['description', 'publicScore']]])

In [None]:
print(fileRead[fileRead['description'] == "Best Decision Tree Model"][['description', 'publicScore']])

                description  publicScore
0  Best Decision Tree Model       0.7488


### Random Forest model

In [None]:
y_pred = rf.predict(data_X_test)
output = pd.DataFrame({"PassengerId": data_test["PassengerId"], "Survived": y_pred.astype(int)})
output.to_csv("Submission.csv", index=False)
test = pd.read_csv('/content/drive/MyDrive/test.csv')
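# Select the columns with a list: the order is deterministic, and newer pandas versions reject set indexers.
nomes = test[['PassengerId', 'Name']]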
nomes.set_index('PassengerId').join(output.set_index('PassengerId'), how = 'right')

PassengerId,Name,Survived
892,"Kelly, Mr. James",0
893,"Wilkes, Mrs. James (Ellen Needs)",0
894,"Myles, Mr. Thomas Francis",0
895,"Wirz, Mr. Albert",0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",0
...,...,...
1305,"Spector, Mr. Woolf",0
1306,"Oliva y Ocana, Dona. Fermina",1
1307,"Saether, Mr. Simon Sivertsen",0
1308,"Ware, Mr. Frederick",0


In [None]:
!kaggle competitions submit -c titanic -f Submission.csv -m "Random Forest Model"

100% 2.77k/2.77k [00:05<00:00, 521B/s]
Successfully submitted to Titanic - Machine Learning from Disaster

In [None]:
!kaggle competitions submissions -v -c titanic > "submissions.txt"

In [None]:
fileRead = pd.read_csv('submissions.txt')

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
results = pd.concat([results, fileRead[fileRead['description'] == "Random Forest Model"][['description', 'publicScore']]])

In [None]:
print(fileRead[fileRead['description'] == "Random Forest Model"][['description', 'publicScore']])

           description  publicScore
0  Random Forest Model      0.76076
2  Random Forest Model      0.74162


### Best model with new columns

The best parameters found for the model with the original columns are reused when running with the new columns obtained in the data-extraction step, except for ```max_features```, which depends on the number of columns.

In [None]:
model = DecisionTreeClassifier(random_state=13, criterion='entropy', splitter='random', max_features=7, min_samples_split=10, min_impurity_decrease=0.0002)
model.fit(X_train, y_train)
score = cross_val_score(model, X_train, y_train, cv=10)
print("Mean: " + str(np.average(score)) + "|| Std: " + str(np.std(score)))
print("Score do modelo: " + str(model.score(X_testT, y_testT)))

Mean: 0.7850547730829421|| Std: 0.050609542893433314
Score do modelo: 0.8603351955307262


In [None]:
y_pred = model.predict(X_test)
output = pd.DataFrame({"PassengerId": data_test["PassengerId"], "Survived": y_pred.astype(int)})
output.to_csv("Submission.csv", index=False)
test = pd.read_csv('/content/drive/MyDrive/test.csv')
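# Select the columns with a list: the order is deterministic, and newer pandas versions reject set indexers.
nomes = test[['PassengerId', 'Name']]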
nomes.set_index('PassengerId').join(output.set_index('PassengerId'), how = 'right')

PassengerId,Name,Survived
892,"Kelly, Mr. James",0
893,"Wilkes, Mrs. James (Ellen Needs)",1
894,"Myles, Mr. Thomas Francis",0
895,"Wirz, Mr. Albert",0
896,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",1
...,...,...
1305,"Spector, Mr. Woolf",0
1306,"Oliva y Ocana, Dona. Fermina",1
1307,"Saether, Mr. Simon Sivertsen",0
1308,"Ware, Mr. Frederick",0


In [None]:
!kaggle competitions submit -c titanic -f Submission.csv -m "Best Decision Tree Model with new columns"

100% 2.77k/2.77k [00:05<00:00, 566B/s]
Successfully submitted to Titanic - Machine Learning from Disaster

In [None]:
!kaggle competitions submissions -v -c titanic > "submissions.txt"

In [None]:
fileRead = pd.read_csv('submissions.txt')

In [None]:
# DataFrame.append was removed in pandas 2.0; pd.concat is its replacement.
results = pd.concat([results, fileRead[fileRead['description'] == "Best Decision Tree Model with new columns"][['description', 'publicScore']]])

In [None]:
print(fileRead[fileRead['description'] == "Best Decision Tree Model with new columns"][['description', 'publicScore']])

                                 description  publicScore
0  Best Decision Tree Model with new columns      0.75358


### Final Results

In [None]:
print(results.sort_values('publicScore', ascending=False))

                                  description  publicScore
2                    Best Decision Tree Model      0.77033
5                         Random Forest Model      0.76076
4   Best Decision Tree Model with new columns      0.75358
1  Basic Decision Tree Model with new Columns      0.74401
0                   Basic Decision Tree Model      0.70334


The Random Forest performed strongly: with little tuning it beat almost every other model, scoring 0.76076 and trailing only the best decision tree (0.77033). This result is expected, because a Random Forest classifier is made up of many trees that each produce a classification and vote on the final result, which makes it better at generalizing.
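The combination of trees can be inspected directly. In scikit-learn, the forest combines its trees by averaging their predicted class probabilities rather than counting one hard vote per tree; the sketch below checks this on synthetic data. The dataset and the names `X_demo`, `y_demo`, `forest` and `proba_manual` are assumptions for illustration, not part of the notebook.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the Titanic features (assumption, not the real data).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=7, n_informative=4, random_state=13
)

forest = RandomForestClassifier(
    n_estimators=100, min_samples_split=10, criterion='entropy', random_state=13
)
forest.fit(X_demo, y_demo)

# Average the class probabilities predicted by each individual tree.
proba_manual = np.mean(
    [tree.predict_proba(X_demo) for tree in forest.estimators_], axis=0
)
proba_forest = forest.predict_proba(X_demo)

print("Matches forest output: " + str(np.allclose(proba_manual, proba_forest)))
```

Averaging over many trees, each trained on a different bootstrap sample, smooths out the errors of any single tree, which is the intuition behind the better generalization.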

Both models whose parameters were tuned also improved, most notably the model with the original columns, whose score rose by about 7 percentage points (from 0.70334 to 0.77033). The model with the new columns might have performed better if the experiments had been run on it directly, but even with the best parameters of the original model its score rose by about 1 percentage point (from 0.74401 to 0.75358).
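The gains quoted above can be reproduced from the public scores in the final results table. This is a small arithmetic check, with the scores copied by hand into a hypothetical `scores` dictionary.

```python
# Public scores copied from the final results table.
scores = {
    "Basic Decision Tree Model": 0.70334,
    "Best Decision Tree Model": 0.77033,
    "Basic Decision Tree Model with new Columns": 0.74401,
    "Best Decision Tree Model with new columns": 0.75358,
}

# Gain of each tuned model over its untuned counterpart.
gain_original = scores["Best Decision Tree Model"] - scores["Basic Decision Tree Model"]
gain_new = (
    scores["Best Decision Tree Model with new columns"]
    - scores["Basic Decision Tree Model with new Columns"]
)

print("Original columns: +" + str(round(gain_original * 100, 2)) + " percentage points")
print("New columns: +" + str(round(gain_new * 100, 2)) + " percentage points")
```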