### Dataset Description
#### Overview
The data has been split into two groups:

- training set (train.csv)
- test set (test.csv)

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

We also include gender_submission.csv, a set of predictions that assume all and only female passengers survive, as an example of what a submission file should look like.
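The baseline behind gender_submission.csv can be reproduced in a few lines. The sketch below uses a tiny hand-made frame in place of test.csv (the three rows are illustrative, not real data) and assumes the submission consists of the `PassengerId` and `Survived` columns.

```python
import pandas as pd

# Illustrative stand-in for test.csv; these rows are made up for the example.
test = pd.DataFrame({
    "PassengerId": [892, 893, 894],
    "Sex": ["male", "female", "male"],
})

# Predict survival for all and only female passengers.
submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": (test["Sex"] == "female").astype(int),
})
print(submission)
```

With the real file, `submission.to_csv(..., index=False)` would write it in the same two-column layout.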

### Data Dictionary

Variable|Definition|Key 
-|-|-
survival|Survival| 0 = No, 1 = Yes     
pclass|Ticket class|1 = 1st, 2 = 2nd, 3 = 3rd                
sex|Sex|          
Age|Age in years|
sibsp|# of siblings / spouses aboard the Titanic |
parch|# of parents / children aboard the Titanic |
ticket|Ticket number|
fare|Passenger fare|
cabin|Cabin number|
embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton
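
As one example of the feature engineering mentioned in the overview, `sibsp` and `parch` can be combined into a family-size feature. This is only a sketch: `FamilySize` and `IsAlone` are hypothetical column names, not part of the dataset, and the rows are toy values.

```python
import pandas as pd

# Toy rows standing in for the SibSp and Parch columns of train.csv.
df = pd.DataFrame({"SibSp": [1, 0, 3], "Parch": [0, 0, 2]})

# Relatives aboard plus the passenger themselves.
df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
# Flag passengers travelling without any relatives.
df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
print(df)
```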

# Importing the libraries used

In [1]:
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split, KFold
from sklearn import svm
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import confusion_matrix, accuracy_score
from imblearn.over_sampling import RandomOverSampler

# Loading, inspecting, and cleaning the data

In [2]:
dados = pd.read_csv('train.csv', sep = ',')
dados.head()

| |PassengerId|Survived|Pclass|Name|Sex|Age|SibSp|Parch|Ticket|Fare|Cabin|Embarked|
|-|-|-|-|-|-|-|-|-|-|-|-|-|
|0|1|0|3|Braund, Mr. Owen Harris|male|22.0|1|0|A/5 21171|7.25|NaN|S|
|1|2|1|1|Cumings, Mrs. John Bradley (Florence Briggs Th...|female|38.0|1|0|PC 17599|71.2833|C85|C|
|2|3|1|3|Heikkinen, Miss. Laina|female|26.0|0|0|STON/O2. 3101282|7.925|NaN|S|
|3|4|1|1|Futrelle, Mrs. Jacques Heath (Lily May Peel)|female|35.0|1|0|113803|53.1|C123|S|
|4|5|0|3|Allen, Mr. William Henry|male|35.0|0|0|373450|8.05|NaN|S|


In [3]:
dados = dados[['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked','Survived']]

In [4]:
dados.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Pclass       891 non-null    int64  
 2   Name         891 non-null    object 
 3   Sex          891 non-null    object 
 4   Age          714 non-null    float64
 5   SibSp        891 non-null    int64  
 6   Parch        891 non-null    int64  
 7   Ticket       891 non-null    object 
 8   Fare         891 non-null    float64
 9   Cabin        204 non-null    object 
 10  Embarked     889 non-null    object 
 11  Survived     891 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
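
The non-null counts above translate directly into missing-value shares. The sketch below only redoes that arithmetic with the numbers printed by `dados.info()`; on the real frame, `dados.isna().mean()` should give the same fractions.

```python
# Counts copied from the info() output above.
total_rows = 891
non_null = {"Age": 714, "Cabin": 204, "Embarked": 889}

# Share of rows where each column is missing.
missing_share = {col: (total_rows - n) / total_rows for col, n in non_null.items()}

for col, share in missing_share.items():
    print(f"{col}: {share:.1%} missing")
```

Roughly a fifth of the ages and more than three quarters of the cabin values are missing, which is why `Age` is imputed further down and why the encoded `Cabin` column is dominated by a single "missing" code.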


In [5]:
dados['Fare'].describe()

count    891.000000
mean      32.204208
std       49.693429
min        0.000000
25%        7.910400
50%       14.454200
75%       31.000000
max      512.329200
Name: Fare, dtype: float64

In [6]:
dados['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [26]:
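media = round(dados['Age'].mean(), 2)
# Assign back instead of using inplace=True on a column selection,
# which does not update the frame under pandas Copy-on-Write.
dados['Age'] = dados['Age'].fillna(media)
dados['Age'].info()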

<class 'pandas.core.series.Series'>
RangeIndex: 891 entries, 0 to 890
Series name: Age
Non-Null Count  Dtype  
--------------  -----  
891 non-null    float64
dtypes: float64(1)
memory usage: 7.1 KB
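
The cell above fills missing ages with the mean of the whole training file, before the train/test split in In [10], so the held-out rows contribute to the fill value. A possible alternative, sketched here on toy arrays (the values are made up), is to learn the mean on the training split only with scikit-learn's `SimpleImputer` and reuse it on the held-out split.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy Age columns; NaN marks a missing age.
age_train = np.array([[22.0], [np.nan], [26.0], [36.0]])
age_test = np.array([[np.nan], [40.0]])

imputer = SimpleImputer(strategy="mean")
# The mean (28.0 here) is learned from the training rows only...
age_train_filled = imputer.fit_transform(age_train)
# ...and then applied unchanged to the held-out rows.
age_test_filled = imputer.transform(age_test)

print(age_train_filled.ravel(), age_test_filled.ravel())
```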


# Converting categorical variables to numeric

In [8]:
label = LabelEncoder()
dados_transf = pd.DataFrame()
dados_transf['Sex'] = label.fit_transform(dados['Sex'])
dados_transf['Embarked'] = label.fit_transform(dados['Embarked'])
dados_transf['Cabin'] = label.fit_transform(dados['Cabin'])
dados_transf = pd.concat([dados_transf,
                          dados['Pclass'],
                          dados['Age'], 
                          dados['SibSp'],
                          dados['Parch'],
                          dados['Fare'],
                          dados['Survived']],
                         axis = 1)
dados_transf.head()

| |Sex|Embarked|Cabin|Pclass|Age|SibSp|Parch|Fare|Survived|
|-|-|-|-|-|-|-|-|-|-|
|0|1|2|147|3|22.0|1|0|7.25|0|
|1|0|0|81|1|38.0|1|0|71.2833|1|
|2|0|2|147|3|26.0|0|0|7.925|1|
|3|0|2|55|1|35.0|1|0|53.1|1|
|4|1|2|147|3|35.0|0|0|8.05|0|
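
In [8] reuses a single `LabelEncoder`, so after the third `fit_transform` only the `Cabin` mapping is left in `label.classes_`. If the same codes need to be applied to test.csv later, one option is to keep one fitted encoder per column. The sketch below assumes toy data, a hypothetical `encoders` dictionary, and an explicit `"Unknown"` placeholder for missing values.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy rows; None stands for a missing port of embarkation.
df = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", None],
})

encoders = {}  # hypothetical name: one fitted encoder per column
encoded = pd.DataFrame(index=df.index)
for col in ["Sex", "Embarked"]:
    enc = LabelEncoder()
    # Make the missing category explicit instead of relying on NaN handling.
    encoded[col] = enc.fit_transform(df[col].fillna("Unknown"))
    encoders[col] = enc

# Classes are sorted, so codes follow alphabetical order.
print(list(encoders["Sex"].classes_))
print(encoded)
```

Because the classes are sorted, `female` maps to 0 and `male` to 1, which matches the `Sex` column shown above.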


# Preparing the model

In [9]:
previsor = dados_transf.iloc[:,0:8].values
classe = dados_transf.iloc[:,8].values
seed = [0,1,2,3,4,5,6,7,8,9,10]

In [10]:
x_treino, x_teste, y_treino, y_teste = train_test_split(previsor, classe, 
                                                        test_size = 0.3,
                                                        random_state = 0)
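
The split above is purely random, so the share of survivors can differ between the two parts. A minimal sketch of a stratified split on toy labels with a 60/40 class ratio follows; passing `stratify` is a suggestion here, not something the notebook does.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 100 rows, 60 of class 0 and 40 of class 1.
X = np.arange(200).reshape(100, 2)
y = np.array([0] * 60 + [1] * 40)

# stratify=y keeps the class ratio the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

print(np.bincount(y_train), np.bincount(y_test))
```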

# Training the models without balancing

#### SVC model, UNBALANCED

In [11]:
for x in seed:
    svm_class = svm.SVC(random_state = x)
    svm_class.fit(x_treino, y_treino)
    
    svm_class_sb = svm_class.predict(x_teste)
    taxa_acerto_svm_sb = accuracy_score(y_teste, 
                                    svm_class_sb)
    print(taxa_acerto_svm_sb * 100)

71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
71.64179104477611
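
All eleven runs print the same accuracy because `SVC` only uses `random_state` when shuffling data for probability estimates; with the default `probability=False` it is ignored. An RBF-kernel `SVC` is also sensitive to feature scale, and `Fare` spans a much wider range than the other columns. The sketch below, on synthetic data (an assumption, not the Titanic features), shows a scaled pipeline and confirms that the seed does not change the predictions; it makes no claim about how much scaling would help here.

```python
from sklearn.datasets import make_classification
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the eight predictor columns.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

preds = []
for seed in (0, 1, 2):
    # Standardize features before the SVC; the seed is passed only to show it has no effect.
    model = make_pipeline(StandardScaler(), SVC(random_state=seed))
    model.fit(X, y)
    preds.append(model.predict(X))

same = all((p == preds[0]).all() for p in preds)
print(same)
```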


In [12]:
svm_desbalanceado = confusion_matrix(y_teste, 
                                     svm_class_sb)
svm_desbalanceado

array([[157,  11],
       [ 65,  35]], dtype=int64)
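
scikit-learn's `confusion_matrix` puts the true labels on the rows and the predictions on the columns, so the matrix above can be unpacked into the usual metrics. The sketch below only redoes the arithmetic on the printed values.

```python
import numpy as np

# Confusion matrix printed above (rows: true class, columns: predicted class).
cm = np.array([[157, 11],
               [65, 35]])
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()   # share of all passengers classified correctly
recall = tp / (tp + fn)           # share of actual survivors that were found
precision = tp / (tp + fp)        # share of predicted survivors that did survive

print(f"accuracy={accuracy:.4f} recall={recall:.4f} precision={precision:.4f}")
```

The accuracy matches the 71.64% printed by the loop, but the recall shows that only 35 of the 100 survivors in the held-out split were identified.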

#### ExtraTreesClassifier model, UNBALANCED

In [13]:
for x in seed:
    et_class = ExtraTreesClassifier(random_state = x)
    et_class.fit(x_treino, y_treino)
    et_class_sb = et_class.predict(x_teste)
    taxa_acerto_et_sb = accuracy_score(y_teste, 
                                   et_class_sb)
    print(taxa_acerto_et_sb * 100)

80.97014925373134
79.1044776119403
79.47761194029852
80.59701492537313
77.98507462686567
80.97014925373134
79.1044776119403
80.59701492537313
79.8507462686567
80.22388059701493
79.47761194029852


In [14]:
et_desbalanceado = confusion_matrix(y_teste, 
                                    et_class_sb)
et_desbalanceado

array([[142,  26],
       [ 29,  71]], dtype=int64)

#### RandomForestClassifier model, UNBALANCED

In [15]:
for x in seed:
    rf_class = RandomForestClassifier(random_state = x)
    rf_class.fit(x_treino, y_treino)
    rf_class_sb = rf_class.predict(x_teste)
    taxa_acerto_rf_sb = accuracy_score(y_teste,
                                   rf_class_sb)
    print(taxa_acerto_rf_sb * 100)

82.46268656716418
82.46268656716418
82.83582089552239
82.46268656716418
82.83582089552239
82.08955223880598
83.2089552238806
83.95522388059702
82.08955223880598
82.46268656716418
83.95522388059702


In [16]:
rf_desbalanceado = confusion_matrix(y_teste,
                                    rf_class_sb)
rf_desbalanceado

array([[149,  19],
       [ 24,  76]], dtype=int64)
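
`KFold` is imported in In [1] but never used; every score so far comes from the single split made in In [10]. A possible way to use it, sketched on synthetic data (an assumption, as are the fold count and the smaller forest), is to average accuracy over several folds with `cross_val_score`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for previsor / classe.
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# Five shuffled folds; each row is used for testing exactly once.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
model = RandomForestClassifier(n_estimators=50, random_state=0)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print(scores.mean(), scores.std())
```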

### Balancing with random oversampling

In [17]:
os = RandomOverSampler(random_state = 0)
x_res, y_res = os.fit_resample(x_treino,
                               y_treino)
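
To make the idea behind random oversampling concrete, the sketch below duplicates randomly chosen minority-class rows until both classes have the same count. It is a hand-rolled NumPy illustration on toy data, not imblearn's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training data: 7 rows of class 0 and 3 rows of class 1.
X = np.arange(20).reshape(10, 2)
y = np.array([0] * 7 + [1] * 3)

classes, counts = np.unique(y, return_counts=True)
target = counts.max()

idx_parts = []
for cls, count in zip(classes, counts):
    idx = np.flatnonzero(y == cls)
    # Draw extra rows with replacement until the class reaches `target`.
    extra = rng.choice(idx, size=target - count, replace=True)
    idx_parts.append(np.concatenate([idx, extra]))

idx_all = np.concatenate(idx_parts)
X_res, y_res = X[idx_all], y[idx_all]

print(np.bincount(y_res))
```

As in In [17], only the training split is resampled; the held-out rows keep their original class ratio.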

# Training the models with balancing

#### SVC model, BALANCED

In [18]:
for x in seed:
    svm_class = svm.SVC(random_state = x)
    svm_class.fit(x_res, y_res)
    svm_class_b = svm_class.predict(x_teste)
    taxa_acerto_svm_b = accuracy_score(y_teste,
                                   svm_class_b)
    print(taxa_acerto_svm_b * 100)

72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254
72.38805970149254


In [19]:
svm_balanceado = confusion_matrix(y_teste,
                                  svm_class_b)
svm_balanceado

array([[149,  19],
       [ 55,  45]], dtype=int64)

#### ExtraTreesClassifier model, BALANCED

In [20]:
for x in seed:
    et_class = ExtraTreesClassifier(random_state = x)
    et_class.fit(x_res, y_res)
    et_class_b = et_class.predict(x_teste)
    taxa_acerto_et_b = accuracy_score(y_teste,
                                  et_class_b)
    print(taxa_acerto_et_b*100)

81.71641791044776
77.61194029850746
81.34328358208955
80.22388059701493
79.8507462686567
79.8507462686567
80.97014925373134
80.22388059701493
80.97014925373134
79.47761194029852
80.22388059701493


In [21]:
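# Use the balanced-model predictions (et_class_b). The matrix shown below was
# produced with et_class_sb, so it repeats the unbalanced result; rerun to refresh it.
et_balanceado = confusion_matrix(y_teste,
                                 et_class_b)
et_balanceado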

array([[142,  26],
       [ 29,  71]], dtype=int64)

#### RandomForestClassifier model, BALANCED

In [22]:
for x in seed:
    rf_class = RandomForestClassifier(random_state = x)
    rf_class.fit(x_res, y_res)
    rf_class_b = rf_class.predict(x_teste)
    taxa_acerto_rf_b = accuracy_score(y_teste, 
                                  rf_class_b)
    print(taxa_acerto_rf_b*100)

82.46268656716418
81.34328358208955
80.97014925373134
81.34328358208955
80.97014925373134
81.71641791044776
82.83582089552239
83.2089552238806
82.83582089552239
81.71641791044776
82.46268656716418


In [23]:
rf_balanceado = confusion_matrix(y_teste,
                                 rf_class_b)
rf_balanceado

array([[146,  22],
       [ 25,  75]], dtype=int64)
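
Since each loop prints eleven accuracies, a mean and standard deviation make the runs easier to compare. The sketch below uses the random forest values printed by In [15] and In [22]; the variable names are illustrative.

```python
import numpy as np

# Accuracies (%) printed by In [15] (unbalanced) and In [22] (balanced).
rf_unbalanced = np.array([
    82.46268656716418, 82.46268656716418, 82.83582089552239, 82.46268656716418,
    82.83582089552239, 82.08955223880598, 83.2089552238806, 83.95522388059702,
    82.08955223880598, 82.46268656716418, 83.95522388059702,
])
rf_balanced = np.array([
    82.46268656716418, 81.34328358208955, 80.97014925373134, 81.34328358208955,
    80.97014925373134, 81.71641791044776, 82.83582089552239, 83.2089552238806,
    82.83582089552239, 81.71641791044776, 82.46268656716418,
])

for name, acc in [("unbalanced", rf_unbalanced), ("balanced", rf_balanced)]:
    print(f"{name}: mean={acc.mean():.2f} std={acc.std():.2f}")
```

All of these scores come from the same 268-row held-out split, so a difference of a fraction of a percentage point corresponds to only a couple of passengers.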

##### Choosing the balanced Random Forest model for its higher accuracy compared with the other balanced models
(a personal choice, since the split between survivors and non-survivors is roughly 40-60)

In [24]:
importancia = rf_class.feature_importances_
importancia

array([0.23912664, 0.03562187, 0.09075223, 0.07537389, 0.23753904,
       0.05427876, 0.03344236, 0.23386521])

In [25]:
print('Columns', dados_transf.columns.to_list())
print()
print('Importance', importancia)

Columns ['Sex', 'Embarked', 'Cabin', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare', 'Survived']

Importance [0.23912664 0.03562187 0.09075223 0.07537389 0.23753904 0.05427876
 0.03344236 0.23386521]
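
The column list above has nine entries because it includes the target `Survived`, while the importance array has eight, one per predictor. Pairing the two makes the ranking easier to read. The sketch below uses the printed values and drops `Survived`; note that `rf_class` at this point is the last forest fitted in In [22] (`random_state = 10`).

```python
import pandas as pd

# Predictor columns in the order used to build previsor (Survived excluded).
features = ["Sex", "Embarked", "Cabin", "Pclass", "Age", "SibSp", "Parch", "Fare"]
# Importances printed above.
importances = [0.23912664, 0.03562187, 0.09075223, 0.07537389,
               0.23753904, 0.05427876, 0.03344236, 0.23386521]

# Label each importance with its feature and sort from most to least important.
ranking = pd.Series(importances, index=features).sort_values(ascending=False)
print(ranking)
```

`Sex`, `Age`, and `Fare` each account for a little under a quarter of the total importance, with the remaining five features sharing the rest.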
