# Titanic - Machine Learning from Disaster

<img src="https://storage.googleapis.com/kaggle-competitions/kaggle/3136/logos/header.png" width=900>

- We will use the [data available on Kaggle](https://www.kaggle.com/competitions/titanic)
    - It is a **competition** dataset
    - The result is evaluated through **accuracy**:
        - _"Your score is the percentage of passengers you correctly predict. This is known as accuracy."_
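
As a minimal illustration of the metric, the sketch below computes accuracy by hand on a handful of hypothetical labels (they are made up for the example and are not competition data).

```python
# Hypothetical labels, for illustration only (not competition data)
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

# Accuracy = number of correct predictions / total number of predictions
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

print(accuracy)
```

Four of the five predictions match, so the score is 4/5.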

### Importing the datasets again and performing data preprocessing
- We will only replicate what we did in the **[first](https://github.com/italoeliastorresdacruz-maker/Kaggle-Titanic-competition/blob/main/Titanic_Machine_Learning_Part_1_EN.ipynb)**, **[second](https://github.com/italoeliastorresdacruz-maker/Kaggle-Titanic-competition/blob/main/Titanic_Analysis_Part_2_Final_File_EN.ipynb)** and **[third](https://github.com/italoeliastorresdacruz-maker/Kaggle-Titanic-competition/blob/main/Titanic_Analysis_Part_3_Final_File_EN.ipynb)** files of this analysis (follow the links to view the complete files)

In [None]:
# Importing pandas
import pandas as pd

In [None]:
# Viewing the training set
treino = pd.read_csv('train.csv')
treino.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [None]:
# Viewing the test set
teste = pd.read_csv('test.csv')
teste.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


- Applying the same initial treatment to the high-cardinality columns and the missing values

In [None]:
# Dropping the high-cardinality columns
treino = treino.drop(['Name','Ticket','Cabin'],axis=1)
teste = teste.drop(['Name','Ticket','Cabin'],axis=1)

In [None]:
# Using the mean to replace null values in the Age column
treino.loc[treino.Age.isnull(),'Age'] = treino.Age.mean()
teste.loc[teste.Age.isnull(),'Age'] = teste.Age.mean()

In [None]:
# Filling the Embarked column of the training set with the mode
treino.loc[treino.Embarked.isnull(),'Embarked'] = treino.Embarked.mode()[0]

In [None]:
# And the Fare column of the test set with the mean
teste.loc[teste.Fare.isnull(),'Fare'] = teste.Fare.mean()
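
The same two imputation strategies (mean for numeric columns, mode for categorical ones) can also be written with `fillna`. The sketch below uses a hypothetical toy frame, not the Titanic data, just to show the mechanics.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, for illustration only
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 40.0],
    'Embarked': ['S', None, 'S'],
})

# Numeric column: replace missing values with the column mean (30.0 here)
toy['Age'] = toy['Age'].fillna(toy['Age'].mean())

# Categorical column: replace missing values with the most frequent value
toy['Embarked'] = toy['Embarked'].fillna(toy['Embarked'].mode()[0])

print(toy)
```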

- And performing feature engineering on our data

In [None]:
# Using a lambda function to encode the "Sex" column
treino['MaleCheck'] = treino.Sex.apply(lambda x: 1 if x == 'male' else 0)
teste['MaleCheck'] = teste.Sex.apply(lambda x: 1 if x == 'male' else 0)

In [None]:
# Applying RobustScaler to the Age and Fare columns
from sklearn.preprocessing import RobustScaler
transformer = RobustScaler().fit(treino[['Age','Fare']])
treino[['Age','Fare']] = transformer.transform(treino[['Age','Fare']])

# and for the test set
transformer = RobustScaler().fit(teste[['Age','Fare']])
teste[['Age','Fare']] = transformer.transform(teste[['Age','Fare']])
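
The cell above fits one scaler on the training set and a second one on the test set. A common alternative, sketched here on hypothetical toy data and not the approach used in this notebook, is to fit once on the training data and reuse those statistics for the test data so that both sets share the same scale. With default settings, `RobustScaler` subtracts the median and divides by the interquartile range, which the sketch also checks by hand.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import RobustScaler

# Hypothetical toy frames standing in for the training and test sets
train_df = pd.DataFrame({'Age': [20.0, 30.0, 40.0, 50.0, 60.0]})
test_df = pd.DataFrame({'Age': [40.0, 80.0]})

# Fit on the training data only
scaler = RobustScaler().fit(train_df[['Age']])
median = scaler.center_[0]  # median of the training ages
iqr = scaler.scale_[0]      # 75th percentile minus 25th percentile

# Reuse the training statistics on the test data
scaled_test = scaler.transform(test_df[['Age']])

# Manual computation of the same transformation
manual = (test_df[['Age']].to_numpy() - median) / iqr

print(median, iqr)
print(scaled_test.ravel())
```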

In [None]:
# Adding the Sozinho ("alone") column
def sozinho(a,b):
    if (a == 0 and b == 0):
        return 1
    else:
        return 0

treino['Sozinho'] = treino.apply(lambda x: sozinho(x.SibSp,x.Parch),axis=1)
teste['Sozinho'] = teste.apply(lambda x: sozinho(x.SibSp,x.Parch),axis=1)

In [None]:
# And creating the Familiares ("family members") column
# Each set uses its own SibSp and Parch columns
treino['Familiares'] = treino.SibSp + treino.Parch
teste['Familiares'] = teste.SibSp + teste.Parch
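
The two family features are related: a passenger is alone exactly when the family count is zero. The sketch below shows a vectorized way to build both on a hypothetical toy frame, with each column computed from the frame's own `SibSp` and `Parch` values.

```python
import pandas as pd

# Hypothetical toy frame, for illustration only
df = pd.DataFrame({'SibSp': [1, 0, 0], 'Parch': [0, 0, 2]})

# Family members on board = siblings/spouses + parents/children
df['Familiares'] = df['SibSp'] + df['Parch']

# Alone flag: 1 when there are no family members, 0 otherwise
df['Sozinho'] = (df['Familiares'] == 0).astype(int)

print(df)
```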

In [None]:
# Applying OrdinalEncoder to the Embarked column
from sklearn.preprocessing import OrdinalEncoder
categorias = ['S','C','Q']

enc = OrdinalEncoder(categories=[categorias],dtype='int32')
enc = enc.fit(treino[['Embarked']])
treino['Embarked'] = enc.transform(treino[['Embarked']])

teste['Embarked'] = enc.transform(teste[['Embarked']])
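
To make the resulting codes explicit: when `categories` is passed, the encoder assigns integers in the order of that list, so `S` becomes 0, `C` becomes 1 and `Q` becomes 2. A minimal sketch on a hypothetical toy column:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy column, for illustration only
ports = pd.DataFrame({'Embarked': ['S', 'C', 'Q', 'S']})

# The position of each category in the list defines its integer code
enc_demo = OrdinalEncoder(categories=[['S', 'C', 'Q']], dtype=np.int32)
codes = enc_demo.fit_transform(ports[['Embarked']])

print(codes.ravel().tolist())
```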

In [None]:
# Dropping the text columns
treino = treino.drop('Sex',axis=1)
teste = teste.drop('Sex',axis=1)

- Visualizing the resulting dataset

In [None]:
# Viewing the training set
treino.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Age,SibSp,Parch,Fare,Embarked,MaleCheck,Sozinho,Familiares
0,1,0,3,-0.59224,1,0,-0.312011,0,1,0,1
1,2,1,1,0.638529,1,0,2.461242,1,0,0,1
2,3,1,3,-0.284548,0,0,-0.282777,0,0,1,0


### We can use other models to make the prediction

- We can select algorithms different from those we saw in the previous parts (see the **[part 1](https://github.com/lucaslealx/Titanic/blob/main/Parte1.ipynb)** file) considering the [other algorithms available in scikit-learn](https://scikit-learn.org/stable/supervised_learning.html#supervised-learning)
    - **Logistic Regression**
        - https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html#sklearn.linear_model.LogisticRegression
    - **Random Forest**
        - https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier
    - **MLPClassifier (Neural Networks)**
        - https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html#sklearn.neural_network.MLPClassifier


- Now, **in addition to train_test_split**:
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html
- We will also use **GridSearchCV** to search for the best hyperparameters
    - https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html
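
Before applying this to the Titanic data, here is a minimal sketch of how the two tools fit together, using synthetic data from `make_classification` (an assumption made only for the example). The grid search runs cross-validation on the training split, and the held-out split is used afterwards to check the selected model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

# Synthetic data, for illustration only
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)
X_tr, X_va, y_tr, y_va = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

# Cross-validation happens inside the training split only
cv = KFold(n_splits=4, shuffle=True, random_state=42)
search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    {'C': [0.1, 1, 10]},
    scoring='accuracy',
    cv=cv,
)
search.fit(X_tr, y_tr)

# Best cross-validated configuration, then a check on held-out data
print(search.best_params_, search.best_score_)
val_acc = search.score(X_va, y_va)
print(val_acc)
```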

In [None]:
# Importing train_test_split
from sklearn.model_selection import train_test_split

In [None]:
# Splitting the training set into X and y
X = treino.drop(['PassengerId','Survived'],axis=1)
y = treino.Survived

In [None]:
# Splitting into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)
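
As a quick check of the resulting sizes: the Kaggle training file has 891 passengers, so `test_size=0.2` leaves 712 rows for training and 179 for validation. The sketch below splits a plain index array instead of the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Index array standing in for the 891 rows of the training file
rows = np.arange(891)
train_idx, val_idx = train_test_split(rows, test_size=0.2, random_state=42)

print(len(train_idx), len(val_idx))
```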

- For **Logistic Regression**

In [None]:
# Importing
from sklearn.linear_model import LogisticRegression

In [None]:
# Creating the classifier
clf_rl = LogisticRegression(random_state=42)

In [None]:
# Defining the parameters
parametros_rl = {
    'penalty': ['l1','l2'],
    'C': [0.01,0.1,1,10],
    'solver': ['lbfgs','liblinear','saga'],
    'max_iter': [100,1000,5000,10000]
}
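
Not every combination in this grid is valid in scikit-learn: the `lbfgs` solver does not support the `l1` penalty, while `liblinear` and `saga` support both `l1` and `l2`. With the default `error_score` of `GridSearchCV`, the failing combinations are scored as NaN instead of stopping the search. One possible alternative, shown here only as a sketch (`valid_grid` is a hypothetical name, not part of the notebook), is to pass a list of dictionaries so that only compatible pairs are tried.

```python
from sklearn.model_selection import ParameterGrid

# The grid as defined in the notebook
full_grid = {
    'penalty': ['l1', 'l2'],
    'C': [0.01, 0.1, 1, 10],
    'solver': ['lbfgs', 'liblinear', 'saga'],
    'max_iter': [100, 1000, 5000, 10000],
}

# Hypothetical alternative: one sub-grid per group of compatible solvers
valid_grid = [
    {'penalty': ['l2'], 'solver': ['lbfgs'],
     'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 1000, 5000, 10000]},
    {'penalty': ['l1', 'l2'], 'solver': ['liblinear', 'saga'],
     'C': [0.01, 0.1, 1, 10], 'max_iter': [100, 1000, 5000, 10000]},
]

n_full = len(ParameterGrid(full_grid))
n_valid = len(ParameterGrid(valid_grid))
print(n_full, n_valid)
```

A list of dictionaries is accepted directly as the parameter grid of `GridSearchCV`, so it could be passed in place of the single dictionary.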

- For **Random Forest**

In [None]:
# Importing
from sklearn.ensemble import RandomForestClassifier

In [None]:
# Creating the classifier
clf_rf = RandomForestClassifier(random_state=42)

In [None]:
# Defining the parameters
parametros_rf = {
    'n_estimators': [100,200,500,1000],
    'criterion': ['gini','entropy','log_loss'],
    'max_depth': [2,4,6,8,None],
    'max_features': ['sqrt','log2',None]
}

- And for **MLPClassifier (Neural Networks)**

In [None]:
# Importing
from sklearn.neural_network import MLPClassifier

In [None]:
# Creating the classifier
clf_mlp = MLPClassifier(random_state=42)

In [None]:
# Defining the parameters
parametros_mlp = {
    'solver':  ['lbfgs','sgd','adam'],
    'alpha': [10.0**(-1),10.0**(-5),10.0**(-7),10.0**(-10)],
    'max_iter': [200,500,1000,5000]
}

- **Running the grid search**

In [None]:
# Ignoring warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# Importing datetime to display the current time
from datetime import datetime
def hora_atual():
    agora = datetime.now()
    print(str(agora.hour)+':'+str(agora.minute)+":"+str(agora.second))

In [None]:
# Importing KFold and GridSearchCV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold

In [None]:
# For Logistic Regression
hora_atual()
kfold_rl = KFold(shuffle=True,random_state=42,n_splits=8)
grid_search_rl = GridSearchCV(clf_rl, parametros_rl,scoring='accuracy',cv=kfold_rl)
grid_search_rl = grid_search_rl.fit(X_train,y_train)
hora_atual()

18:19:36
18:19:41


In [None]:
# For Random Forest
hora_atual()
kfold_rf = KFold(shuffle=True,random_state=42,n_splits=8)
grid_search_rf = GridSearchCV(clf_rf, parametros_rf,scoring='accuracy',cv=kfold_rf)
grid_search_rf = grid_search_rf.fit(X_train,y_train)
hora_atual()

18:20:16
18:31:5


In [None]:
# For MLPClassifier
hora_atual()
kfold_mlp = KFold(shuffle=True,random_state=42,n_splits=8)
grid_search_mlp = GridSearchCV(clf_mlp, parametros_mlp,scoring='accuracy',cv=kfold_mlp)
grid_search_mlp = grid_search_mlp.fit(X_train,y_train)
hora_atual()

18:31:5
18:37:36
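
The running times above are easier to interpret next to the number of model fits each search requires: the number of parameter combinations multiplied by the number of folds (plus one final refit on the full training split). A minimal sketch that counts them with `ParameterGrid`, using the three grids defined earlier:

```python
from sklearn.model_selection import ParameterGrid

# The three grids defined earlier in the notebook
grids = {
    'LogisticRegression': {
        'penalty': ['l1', 'l2'],
        'C': [0.01, 0.1, 1, 10],
        'solver': ['lbfgs', 'liblinear', 'saga'],
        'max_iter': [100, 1000, 5000, 10000],
    },
    'RandomForest': {
        'n_estimators': [100, 200, 500, 1000],
        'criterion': ['gini', 'entropy', 'log_loss'],
        'max_depth': [2, 4, 6, 8, None],
        'max_features': ['sqrt', 'log2', None],
    },
    'MLPClassifier': {
        'solver': ['lbfgs', 'sgd', 'adam'],
        'alpha': [10.0**(-1), 10.0**(-5), 10.0**(-7), 10.0**(-10)],
        'max_iter': [200, 500, 1000, 5000],
    },
}

n_splits = 8  # same value used in the KFold objects

# Cross-validation fits per search = combinations x folds
fits = {name: len(ParameterGrid(grid)) * n_splits for name, grid in grids.items()}
for name, n_fits in fits.items():
    print(name, n_fits)
```

The Random Forest search has the most fits, and each fit trains up to 1000 trees, which is consistent with it taking the longest.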


- **Checking the best scores**

In [None]:
# Checking the best score for Logistic Regression
grid_search_rl.best_score_

0.8089887640449438

In [None]:
# For Random Forest
grid_search_rf.best_score_

0.8314606741573034

In [None]:
# and for MLPClassifier
grid_search_mlp.best_score_

0.8174157303370786

- **And the best parameters**

In [None]:
# Checking the best parameters for Logistic Regression
grid_search_rl.best_params_

{'C': 0.1, 'max_iter': 100, 'penalty': 'l2', 'solver': 'lbfgs'}

In [None]:
# For Random Forest
grid_search_rf.best_params_

{'criterion': 'entropy',
 'max_depth': 6,
 'max_features': 'sqrt',
 'n_estimators': 100}

In [None]:
# and for MLPClassifier
grid_search_mlp.best_params_

{'alpha': 0.1, 'max_iter': 200, 'solver': 'adam'}

- **Making the prediction on the validation data with each of the best models**

In [None]:
# For Logistic Regression
clf_best_rl = grid_search_rl.best_estimator_
y_pred_rl = clf_best_rl.predict(X_val)

In [None]:
# For Random Forest
clf_best_rf = grid_search_rf.best_estimator_
y_pred_rf = clf_best_rf.predict(X_val)

In [None]:
# and for MLPClassifier
clf_best_mlp = grid_search_mlp.best_estimator_
y_pred_mlp = clf_best_mlp.predict(X_val)

- We will again **evaluate the models**
    - Accuracy (evaluation method used in the competition):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.accuracy_score.html
    - Confusion matrix (helps visualize the distribution of errors):
        - https://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html

- Evaluating the **accuracy**

In [None]:
# Importing
from sklearn.metrics import accuracy_score

In [None]:
# For Logistic Regression
accuracy_score(y_val, y_pred_rl)

0.8044692737430168

In [None]:
# For Random Forest
accuracy_score(y_val, y_pred_rf)

0.8100558659217877

In [None]:
# For MLPClassifier (Neural Networks)
accuracy_score(y_val, y_pred_mlp)

0.8100558659217877

- Evaluating the **confusion matrix**

In [None]:
# Importing
from sklearn.metrics import confusion_matrix

In [None]:
# For Logistic Regression
confusion_matrix(y_val, y_pred_rl)

array([[91, 14],
       [21, 53]], dtype=int64)

In [None]:
# For Random Forest
confusion_matrix(y_val, y_pred_rf)

array([[93, 12],
       [22, 52]], dtype=int64)

In [None]:
# For MLPClassifier (Neural Networks)
confusion_matrix(y_val, y_pred_mlp)

array([[92, 13],
       [21, 53]], dtype=int64)
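
The accuracy values reported earlier can be recovered from these matrices: in scikit-learn's layout the rows are the actual classes and the columns the predicted ones, so the diagonal holds the correct predictions. The sketch below recomputes accuracy and, as an extra illustration, the recall for survivors, using the three matrices copied from the outputs above.

```python
import numpy as np

# Confusion matrices copied from the outputs above
# (rows: actual class, columns: predicted class)
matrices = {
    'LogisticRegression': np.array([[91, 14], [21, 53]]),
    'RandomForest': np.array([[93, 12], [22, 52]]),
    'MLPClassifier': np.array([[92, 13], [21, 53]]),
}

results = {}
for name, cm in matrices.items():
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tn + tp) / cm.sum()  # correct predictions / all predictions
    recall = tp / (tp + fn)          # share of actual survivors that were found
    results[name] = (accuracy, recall)
    print(f'{name}: accuracy={accuracy:.4f}, survivor recall={recall:.4f}')
```

Random Forest and MLPClassifier both get 145 of the 179 validation passengers right, which is why their accuracy is identical even though their errors are distributed differently.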

### Making predictions for the test data
- We will use the Random Forest, which had the best cross-validation score and tied for the best accuracy on the validation data, to make the prediction on the test dataset

In [None]:
# Viewing X_train
X_train.head(3)

Unnamed: 0,Pclass,Age,SibSp,Parch,Fare,Embarked,MaleCheck,Sozinho,Familiares
331,1,1.215452,0,0,0.608317,0,1,1,0
733,2,-0.515317,0,0,-0.062981,0,1,1,0
382,3,0.176991,0,0,-0.282777,0,1,1,0


In [None]:
# Viewing the test set
teste.head(3)

Unnamed: 0,PassengerId,Pclass,Age,SibSp,Parch,Fare,Embarked,MaleCheck,Sozinho,Familiares
0,892,3,0.331562,0,0,-0.28067,2,1,1,1
1,893,3,1.311954,1,0,-0.3158,0,0,0,1
2,894,2,2.488424,0,0,-0.201943,2,1,1,0


In [None]:
# For the test set to have the same columns as the training features, we need to drop the id column
X_teste = teste.drop('PassengerId',axis=1)

In [None]:
# Using the best model on the test set
y_pred = clf_best_rf.predict(X_teste)

In [None]:
# Creating a new column with the prediction in the test set
teste['Survived'] = y_pred

In [None]:
# Selecting only the PassengerId and Survived columns for the submission
base_envio = teste[['PassengerId','Survived']]

In [None]:
# Exporting to a csv file
base_envio.to_csv('resultados10.csv',index=False)
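
Before uploading, it can help to run a few structural checks on the submission frame. The helper below is a hypothetical sketch (`check_submission` is not part of the notebook or of any library), and the toy frame reuses the first three test ids shown earlier with made-up predictions.

```python
import pandas as pd

def check_submission(df: pd.DataFrame) -> bool:
    """Hypothetical helper: basic structural checks on a submission frame."""
    return bool(
        list(df.columns) == ['PassengerId', 'Survived']  # expected columns, in order
        and df['PassengerId'].is_unique                  # one row per passenger
        and df['Survived'].isin([0, 1]).all()            # binary predictions only
        and not df.isnull().any().any()                  # no missing values
    )

# Toy frame: ids from the test preview, predictions made up for the example
demo = pd.DataFrame({'PassengerId': [892, 893, 894], 'Survived': [0, 1, 0]})
bad = pd.DataFrame({'PassengerId': [892, 892], 'Survived': [0, 2]})

print(check_submission(demo), check_submission(bad))
```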