### IESB
___
### Postgraduate Program in Artificial Intelligence
#### Course: Supervised Learning
#### Student: Henrique Brandão

We will use the data on the sinking of the *Titanic* (1912).

|Attribute|Type|Description|
|--|--|--|
|`PassengerId`|`int`|Passenger identifier (primary key)|
|`Pclass`|`int`|Passenger's accommodation class (`1`: 1st; `2`: 2nd; `3`: 3rd)|
|`Name`|`str`|Passenger's name|
|`Sex`|`str`|Passenger's gender|
|`Age`|`float`|Passenger's age|
|`SibSp`|`int`|Number of the passenger's siblings and spouses aboard|
|`Parch`|`int`|Number of the passenger's parents and children aboard|
|`Ticket`|`str`|Ticket *id*|
|`Fare`|`float`|Ticket price|
|`Cabin`|`str`|Passenger's cabin|
|`Embarked`|`str`|Port of embarkation (`C`: Cherbourg; `Q`: Queenstown; `S`: Southampton)|
|`Survived`|`int`|Survival status (`1`: yes; `0`: no)|

The *target* of our classification is the `Survived` attribute.

Source: [Kaggle - Titanic](https://www.kaggle.com/c/titanic/data)

In [1]:
import numpy as np
import pandas as pd

from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

from sklearn.ensemble import RandomForestClassifier

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

from datass.dataframe.inspection import _isnull

In [2]:
!ls

atividade4.ipynb  atividade-4.png  gender_submission.csv  test.csv  train.csv


In [3]:
df_train = pd.read_csv('train.csv')
df_test = pd.read_csv('test.csv')

df_train.shape, df_test.shape

((891, 12), (418, 11))
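
The shapes above differ by one column: the Kaggle test file does not include the `Survived` target. A minimal sketch of how that difference could be checked, using two small hypothetical frames (`train` and `test` here are toy stand-ins, not the real files):

```python
import pandas as pd

# Toy stand-ins for train.csv and test.csv (values are illustrative only).
train = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1], 'Pclass': [3, 1]})
test = pd.DataFrame({'PassengerId': [892, 893], 'Pclass': [3, 3]})

# Columns present in the training frame but absent from the test frame.
missing_in_test = train.columns.difference(test.columns).tolist()
print(missing_in_test)
```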

In [4]:
df_train.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [5]:
df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


In [6]:
df_train['Age'].describe()

count    714.000000
mean      29.699118
std       14.526497
min        0.420000
25%       20.125000
50%       28.000000
75%       38.000000
max       80.000000
Name: Age, dtype: float64

In [7]:
TRAIN_AGE_MEAN = df_train['Age'].mean()
TRAIN_AGE_MEAN

29.69911764705882

In [8]:
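# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which recent pandas versions warn about and may not apply to the DataFrame.
df_train['Age'] = df_train['Age'].fillna(TRAIN_AGE_MEAN)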
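
The cell above fills missing ages with the training mean. The sketch below shows the same idea on toy data. It assumes, as the constant name `TRAIN_AGE_MEAN` suggests but the notebook does not show at this point, that the training mean would also be reused for the test set. The names `train_age`, `test_age`, and `age_mean` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data standing in for the real `Age` columns (values are illustrative only).
train_age = pd.Series([22.0, np.nan, 26.0, 36.0])
test_age = pd.Series([np.nan, 40.0])

# The mean is computed on the training data only; NaN values are skipped.
age_mean = train_age.mean()

# fillna returns a new Series, so the result is assigned to a new name.
train_age_filled = train_age.fillna(age_mean)

# Assumption: the same training statistic is reused for the test data,
# so that nothing computed from the test set influences the preprocessing.
test_age_filled = test_age.fillna(age_mean)
```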

In [9]:
df_train['Embarked'].value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [10]:
# Most frequent port of embarkation, used to fill the missing values.
TRAIN_EMBARKED_FILLNA = df_train['Embarked'].value_counts().idxmax()

In [11]:
# Assign the result back rather than relying on inplace=True on a column selection.
df_train['Embarked'] = df_train['Embarked'].fillna(TRAIN_EMBARKED_FILLNA)

In [12]:
_isnull(df_train)

>> Null registers:

# PassengerId: 0 null rows
# Survived: 0 null rows
# Pclass: 0 null rows
# Name: 0 null rows
# Sex: 0 null rows
# Age: 0 null rows
# SibSp: 0 null rows
# Parch: 0 null rows
# Ticket: 0 null rows
# Fare: 0 null rows
# Cabin: 687 null rows
# Embarked: 0 null rows
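
The `_isnull` helper comes from a separate package (`datass`). For readers without it, a per-column null count can be approximated with plain pandas. This is a minimal sketch on a hypothetical toy frame, not a reimplementation of that helper:

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing values (illustrative only).
df = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0],
    'Cabin': [np.nan, 'C85', np.nan],
    'Fare': [7.25, 71.2833, 7.925],
})

# isnull() marks each missing cell as True; sum() counts them per column.
null_counts = df.isnull().sum()
print(null_counts)
```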


In [13]:
TRAIN_DROP_COLS = ['PassengerId', 'Name', 'Ticket', 'Cabin']

In [14]:
df_train.drop(columns=TRAIN_DROP_COLS, inplace=True)

In [15]:
df_train.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [16]:
ENCODE_SEX = {'female': 0, 'male': 1}

def encode_sex(sex: str):
    return ENCODE_SEX[sex]

In [17]:
df_train['Sex'] = df_train['Sex'].apply(encode_sex)
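
As an alternative to `apply` with a lookup function, a dictionary can be passed directly to `Series.map`. The sketch below uses toy data. One difference is worth noting: a value missing from the dictionary becomes `NaN` with `map`, whereas the `encode_sex` function above would raise a `KeyError`.

```python
import pandas as pd

ENCODE_SEX = {'female': 0, 'male': 1}

# Toy column standing in for df_train['Sex'].
sex = pd.Series(['male', 'female', 'female'])
sex_encoded = sex.map(ENCODE_SEX)

# A category that is not a key of the dictionary is mapped to NaN.
unknown = pd.Series(['male', 'other']).map(ENCODE_SEX)
```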

In [19]:
ENCODE_EMBARKED_AT = {
    'C': [1, 0, 0], 'Q': [0, 1, 0], 'S': [0, 0, 1]
}

def encode_embarked_at(city: str):
    return ENCODE_EMBARKED_AT[city]

In [20]:
df_train['Embarked'].apply(encode_embarked_at)

0      [0, 0, 1]
1      [1, 0, 0]
2      [0, 0, 1]
3      [0, 0, 1]
4      [0, 0, 1]
         ...    
886    [0, 0, 1]
887    [0, 0, 1]
888    [0, 0, 1]
889    [1, 0, 0]
890    [0, 1, 0]
Name: Embarked, Length: 891, dtype: object
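
The lists above are a one-hot encoding of the port of embarkation. One possible way to obtain the same encoding as separate columns is `pd.get_dummies`. This is a sketch on toy data and not necessarily how the notebook proceeds; the `@` separator is chosen here only to mirror the column names used in the next cell.

```python
import pandas as pd

# Toy column standing in for df_train['Embarked'].
embarked = pd.Series(['S', 'C', 'Q', 'S'], name='Embarked')

# One indicator column per category, named Embarked@C, Embarked@Q, Embarked@S.
one_hot = pd.get_dummies(embarked, prefix='Embarked', prefix_sep='@', dtype=int)
print(one_hot)
```

The resulting frame could then be joined to the original one with `pd.concat([df, one_hot], axis=1)`, after which the original `Embarked` column would no longer be needed.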

In [21]:
df_train[['Embarked@C', 'Embarked@Q', 'Embarked@S']] = 