# TITANIC CLASSIFIER

The well-known Titanic dataset poses a binary classification task: predicting whether a given passenger survived the disaster.

## Importing the dataset

In [1]:
import os
import urllib.request

TITANIC_PATH = os.path.join("datasets", "titanic")
DOWNLOAD_URL = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/titanic/"

def fetch_titanic_data(url=DOWNLOAD_URL, path=TITANIC_PATH):
    if not os.path.isdir(path):
        os.makedirs(path)
    for filename in ("train.csv", "test.csv"):
        filepath = os.path.join(path, filename)
        if not os.path.isfile(filepath):
            print("Downloading", filename)
            urllib.request.urlretrieve(url + filename, filepath)

fetch_titanic_data() 

In [2]:
import pandas as pd

def load_titanic_data(filename, titanic_path=TITANIC_PATH):
    csv_path = os.path.join(titanic_path, filename)
    return pd.read_csv(csv_path)

In [3]:
train_data = load_titanic_data("train.csv")
test_data = load_titanic_data("test.csv")

### Inspecting the dataset

In [4]:
train_data.head()

| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.25 | NaN | S |
| 1 | 2 | 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.925 | NaN | S |
| 3 | 4 | 1 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1 | C123 | S |
| 4 | 5 | 0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.05 | NaN | S |


In [5]:
train_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


The dataset has 12 columns. One of them is the target attribute (`Survived`), and another is just the passenger ID (`PassengerId`), which will not be used as a predictor.

There are also missing values in `Age` (177 of 891 rows), `Cabin` (687 rows), and `Embarked` (2 rows).
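
The missing-value counts can also be read directly with `isna().sum()` instead of subtracting the non-null counts from the row total. The sketch below uses a small made-up DataFrame (the values are illustrative, not taken from the Titanic files) so that it runs on its own; on the real data the equivalent call would be `train_data.isna().sum()`.

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values, mimicking three Titanic columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan],
})

# Count the missing entries in each column.
missing_counts = toy.isna().sum()
print(missing_counts)
```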

Finally, we check how balanced the classes are in the training data.

In [6]:
train_data['Survived'].value_counts()

0    549
1    342
Name: Survived, dtype: int64
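
Raw counts are easier to compare as proportions. As a minimal sketch, the snippet below rebuilds a label column with the same class counts as the output above (549 and 342) and asks `value_counts` for fractions; on the notebook's data the equivalent call would be `train_data['Survived'].value_counts(normalize=True)`.

```python
import pandas as pd

# Rebuild a label column with the same class counts as the output above.
survived = pd.Series([0] * 549 + [1] * 342, name="Survived")

# normalize=True returns fractions instead of raw counts.
class_share = survived.value_counts(normalize=True)
print(class_share.round(3))
```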

The classes are moderately imbalanced: 549 passengers did not survive and 342 did, roughly 62% versus 38%. Now that we have seen how the data is organized, we can start preprocessing.

#### Pipeline for numerical attributes

First, missing values are filled with the column median; then the numerical features are standardized with `StandardScaler` (zero mean, unit variance).

In [7]:
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])
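
To make the two steps concrete, here is a small sketch of the same imputer-plus-scaler combination applied to a made-up single-column array (the values are hypothetical). It only illustrates what the pipeline does to a gap in the data.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical single-feature column with one missing value.
toy_ages = np.array([[1.0], [np.nan], [3.0], [5.0]])

demo_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])

scaled = demo_pipeline.fit_transform(toy_ages)

# The median of the observed values (1, 3, 5) is 3, so the gap is filled with 3.
# After imputation the column mean is also 3, so the filled entry maps to 0.
print(scaled.ravel())
```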

#### Pipeline for categorical attributes

First, missing values are filled with the most frequent category; then `OneHotEncoder` is applied.

In [8]:
from sklearn.preprocessing import OneHotEncoder

cat_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
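    # `sparse_output` replaces the `sparse` argument (renamed in scikit-learn 1.2, removed in 1.4);
    # on older versions, use OneHotEncoder(sparse=False) instead.
    ('encoder', OneHotEncoder(sparse_output=False))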
])
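
The following sketch shows the effect of the categorical steps on a made-up `Embarked` column (hypothetical values). It leaves the encoder at its default sparse output and densifies the result with `toarray()`, which sidesteps the version-specific dense-output argument.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical column with one missing port of embarkation.
toy_ports = pd.DataFrame({"Embarked": ["S", "C", np.nan, "S"]})

demo_cat_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),  # default output is a sparse matrix
])

# Convert the sparse result to a dense array for display.
encoded = demo_cat_pipeline.fit_transform(toy_ports).toarray()
categories = demo_cat_pipeline.named_steps["encoder"].categories_[0]

print(categories)  # one output column per category, in sorted order
print(encoded)     # the missing row is encoded as the most frequent value, "S"
```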

#### ColumnTransformer to apply the pipelines

In [9]:
from sklearn.compose import ColumnTransformer

num_attribs = ['Age', 'SibSp', 'Parch', 'Fare']
cat_attribs = ['Pclass', 'Sex', 'Embarked']

preprocess_pipeline = ColumnTransformer([
    ('num', num_pipeline, num_attribs),
    ('cat', cat_pipeline, cat_attribs)
])

#### Applying the full preprocessing

In [10]:
X_train = preprocess_pipeline.fit_transform(train_data[num_attribs+cat_attribs])
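
A quick way to sanity-check a `ColumnTransformer` is to look at the shape of its output: each numerical attribute contributes one column and each categorical attribute contributes one column per category. The sketch below assumes a toy frame with hypothetical values and only three attributes; `sparse_threshold=0` is set so the result is always a dense array.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy frame with hypothetical values and gaps in both column types.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, 7.93, 53.10],
    "Sex": ["male", "female", "female", np.nan],
})

demo_num = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler()),
])
demo_cat = Pipeline([
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder()),
])

demo_preprocess = ColumnTransformer(
    [
        ("num", demo_num, ["Age", "Fare"]),
        ("cat", demo_cat, ["Sex"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

X_toy = demo_preprocess.fit_transform(toy)

# 2 scaled numerical columns + 2 one-hot columns (female, male).
print(X_toy.shape)
```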

#### Getting the target labels

In [15]:
y_train = train_data['Survived']

## Classificador

To choose a classifier, we test three models (Random Forest, SVM, and KNN), compare their results, and then try to optimize the most promising one for this task.

In [31]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score  # required by the cross-validation calls below

#RANDOM FOREST
rndf_clf = RandomForestClassifier(random_state=42)
rndf_clf.fit(X_train, y_train)
forest_scores = cross_val_score(rndf_clf, X_train, y_train, cv=10)

#SVM
svm_clf = SVC()
svm_clf.fit(X_train, y_train)
svm_scores = cross_val_score(svm_clf, X_train, y_train, cv=10)

#KNN
knn_clf = KNeighborsClassifier()
knn_clf.fit(X_train, y_train)
knn_scores = cross_val_score(knn_clf, X_train, y_train, cv=10)

scores = [forest_scores.mean(), svm_scores.mean(), knn_scores.mean()]
classificadores = ['RandForest', 'SVM', 'KNN']

result = pd.DataFrame(scores, index=classificadores, columns=['Accuracy'])

In [32]:
result

| | Accuracy |
|---|---|
| RandForest | 0.813758 |
| SVM | 0.824944 |
| KNN | 0.805868 |
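
Because the mean accuracies are close to one another, it can help to look at the spread across folds as well. The sketch below is one possible way to do that; it runs on synthetic data from `make_classification` (not the Titanic features), and the `Std` column and the reduced `n_estimators` are choices made for this illustration only.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, y_train).
X_demo, y_demo = make_classification(n_samples=200, n_features=8, random_state=42)

models = {
    "RandForest": RandomForestClassifier(n_estimators=50, random_state=42),
    "SVM": SVC(),
    "KNN": KNeighborsClassifier(),
}

names, means, stds = [], [], []
for name, model in models.items():
    # cross_val_score clones and fits the model on each fold internally.
    fold_scores = cross_val_score(model, X_demo, y_demo, cv=10)
    names.append(name)
    means.append(fold_scores.mean())
    stds.append(fold_scores.std())

summary = pd.DataFrame({"Accuracy": means, "Std": stds}, index=names)
print(summary)
```

Since `cross_val_score` fits a fresh clone of the estimator on every fold, a separate call to `fit` beforehand is not needed for the scores themselves.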


SVM achieved the best mean accuracy (about 0.825 under 10-fold cross-validation), so we choose this model and try to optimize it with a grid search.

### Grid Search

In [36]:
from sklearn.model_selection import GridSearchCV

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
              'kernel': ['rbf', 'poly']
             }
grid_search = GridSearchCV(svm_clf, param_grid, refit=True, cv=5, verbose=3)

grid_search.fit(X_train, y_train)

Fitting 5 folds for each of 40 candidates, totalling 200 fits
[CV 1/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.821 total time=   0.0s
[CV 2/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.787 total time=   0.0s
[CV 3/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.753 total time=   0.0s
[CV 4/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.725 total time=   0.0s
[CV 5/5] END ........C=0.1, gamma=1, kernel=rbf;, score=0.697 total time=   0.0s
[CV 1/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.760 total time=   0.0s
[CV 2/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.787 total time=   0.0s
[CV 3/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.815 total time=   0.0s
[CV 4/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.775 total time=   0.0s
[CV 5/5] END .......C=0.1, gamma=1, kernel=poly;, score=0.848 total time=   0.0s
[CV 1/5] END ......C=0.1, gamma=0.1, kernel=rbf;, score=0.804 total time=   0.0s
[CV 2/5] END ......C=0.1, gamma=0.1, kernel=rbf

GridSearchCV(cv=5, estimator=SVC(),
             param_grid={'C': [0.1, 1, 10, 100],
                         'gamma': [1, 0.1, 0.01, 0.001, 0.0001],
                         'kernel': ['rbf', 'poly']},
             verbose=3)
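
The "40 candidates, totalling 200 fits" line in the log follows from the size of the grid: every combination of `C`, `gamma`, and `kernel` is one candidate, and each candidate is fitted once per fold. A small check with `ParameterGrid`, using the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1, 0.1, 0.01, 0.001, 0.0001],
    "kernel": ["rbf", "poly"],
}

n_candidates = len(ParameterGrid(param_grid))  # 4 * 5 * 2 combinations
n_fits = n_candidates * 5                      # one fit per candidate per fold (cv=5)
print(n_candidates, n_fits)
```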

In [37]:
best_svm = grid_search.best_estimator_
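
Besides `best_estimator_`, a fitted `GridSearchCV` exposes the winning hyperparameters and their mean cross-validated score through `best_params_` and `best_score_`. The sketch below shows how they might be inspected; it uses synthetic data and a deliberately small, hypothetical grid so that it runs quickly, and its numbers say nothing about the Titanic search above.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for (X_train, y_train).
X_demo, y_demo = make_classification(n_samples=150, n_features=6, random_state=0)

# Hypothetical reduced grid, for illustration only.
small_grid = {"C": [0.1, 1, 10], "gamma": [0.1, 0.01]}

demo_search = GridSearchCV(SVC(kernel="rbf"), small_grid, cv=5)
demo_search.fit(X_demo, y_demo)

print(demo_search.best_params_)           # hyperparameters of the best candidate
print(round(demo_search.best_score_, 3))  # its mean accuracy across the 5 folds

# With refit=True (the default), best_estimator_ has been retrained on all the data.
best_demo_svm = demo_search.best_estimator_
```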

#### Using the best model found to predict on the test data

Applying the preprocessing to the test data

In [42]:
X_test = preprocess_pipeline.transform(test_data[num_attribs + cat_attribs])
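
Note that the test data goes through `transform`, not `fit_transform`: the medians, most frequent categories, and scaling statistics learned from the training data are reused as they are. A minimal sketch of that behaviour with a lone `StandardScaler` and made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_col = np.array([[0.0], [2.0], [4.0]])  # hypothetical training feature
test_col = np.array([[2.0], [10.0]])         # hypothetical test feature

scaler = StandardScaler().fit(train_col)     # statistics come from the training data only
test_scaled = scaler.transform(test_col)     # reuses the training mean and standard deviation

# The training mean is 2, so a test value of 2 maps to 0 regardless of the test set's own mean.
print(scaler.mean_, test_scaled.ravel())
```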

Using the test data to make predictions

In [43]:
y_pred = best_svm.predict(X_test)
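
The predictions come back as a plain array, in the same row order as the test data. One possible way to keep them usable is to pair each prediction with its passenger ID. The sketch below assumes a two-column layout (`PassengerId`, `Survived`) and uses hypothetical IDs and predictions in place of `test_data['PassengerId']` and `y_pred`; it writes to an in-memory buffer rather than to a file.

```python
import io

import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_data["PassengerId"] and y_pred.
passenger_ids = pd.Series([892, 893, 894], name="PassengerId")
y_pred_demo = np.array([0, 1, 0])

# Pair each prediction with the ID of the row it was computed from.
submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": y_pred_demo})

# Serialize without the DataFrame index so only the two columns are written.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```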