# Data

### Objective
Predict whether each passenger survives the sinking of the Titanic.

### Metric
*Accuracy*: the percentage of passengers whose outcome is predicted correctly.
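
As a quick illustration of the metric, here is a minimal sketch in plain Python. The `accuracy` helper and the five labels are made up for the example; they are not part of the competition data.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Hypothetical labels for five passengers (1 = survived, 0 = did not survive).
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]

acc = accuracy(y_true, y_pred)
print(acc)  # 4 of the 5 predictions match
```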

### Submission format

A *.csv* file with 418 entries (plus a header row) in the following format:

~~~
PassengerId,  Survived
892,          0
893,          1
894,          0
etc.
~~~
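
A small sanity check for this format could look like the sketch below. `check_submission` is a hypothetical helper, and it assumes the padding spaces in the listing above are only there for readability, so the real file has no spaces after the comma.

```python
import csv
import io

def check_submission(text, expected_rows=418):
    """Return a list of problems found in a submission CSV given as text."""
    rows = list(csv.reader(io.StringIO(text)))
    problems = []
    if not rows or rows[0] != ["PassengerId", "Survived"]:
        problems.append("header must be PassengerId,Survived")
    body = rows[1:]
    if len(body) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(body)}")
    for r in body:
        # Each row needs exactly two fields and a 0/1 prediction.
        if len(r) != 2 or r[1] not in ("0", "1"):
            problems.append(f"bad row: {r}")
            break
    return problems

# Hypothetical three-row file, checked against expected_rows=3.
sample = "PassengerId,Survived\n892,0\n893,1\n894,0\n"
print(check_submission(sample, expected_rows=3))
```

For the real file, the text would come from reading `submission.csv` and the default `expected_rows=418` would apply.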

---------

## Preparation
- Imports
- Checking the data
- Defining useful functions

In [86]:
%matplotlib inline

In [87]:
from fastai.imports import *
from fastai.structured import *

from pandas_summary import DataFrameSummary
from sklearn.ensemble import RandomForestClassifier
from IPython.display import display

from sklearn import metrics

In [88]:
PATH = 'data/titanic/'

In [89]:
!ls {PATH}

submission.csv	test.csv  train.csv


In [90]:
!head -n 3 {PATH}train.csv

PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
1,0,3,"Braund, Mr. Owen Harris",male,22,1,0,A/5 21171,7.25,,S
2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Thayer)",female,38,1,0,PC 17599,71.2833,C85,C


## Looking at the data

In [91]:
df_raw = pd.read_csv(f'{PATH}train.csv', low_memory=False)

In [92]:
df_raw.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


There are no date columns to convert to datetime, and all the columns display without trouble.  
The metric does not call for logarithms or anything special, so I can move on.

In [93]:
df_raw.shape

(891, 12)

## Initial processing

In [94]:
m = RandomForestClassifier(n_jobs=-1)
m.fit(df_raw.drop('Survived', axis=1), df_raw.Survived)

ValueError: could not convert string to float: 'Braund, Mr. Owen Harris'

The first attempt at RandomForestClassifier fails on the Name column, the first string column it encounters. I will try keeping it in the DataFrame, but I will also try dropping it.
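
The error appears because the classifier expects numeric input. As a rough idea of what encoding a string column means, here is a sketch in plain pandas. It is not fastai's code; the `toy` frame reuses values from the two rows of `train.csv` printed earlier.

```python
import pandas as pd

# A few columns from the first two rows of train.csv.
toy = pd.DataFrame({
    "Sex": ["male", "female"],
    "Embarked": ["S", "C"],
    "Fare": [7.25, 71.2833],
})

# Turn every non-numeric column into a pandas category, then into integer codes.
encoded = toy.copy()
for col in list(encoded.columns):
    if not pd.api.types.is_numeric_dtype(encoded[col]):
        encoded[col] = encoded[col].astype("category").cat.codes

print(encoded)
```

After this step every column is numeric, which is the form a random forest can be fitted on.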

Handle the categorical variables

In [95]:
train_cats(df_raw)

In [96]:
print(df_raw.isnull().sum().sort_index()/len(df_raw))

Age            0.198653
Cabin          0.771044
Embarked       0.002245
Fare           0.000000
Name           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.000000
Ticket         0.000000
dtype: float64


There are several columns with nulls: a lot in *Cabin* (about 77%), a fair number in *Age* (about 20%), and a handful in *Embarked*.

Save to feather

In [97]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/titanic-raw')

## Pre-processing

Read df_raw back from the feather file

In [98]:
df_raw = pd.read_feather('tmp/titanic-raw')

Process the DataFrame (categorical variables, NAs, and splitting into df and y)

In [99]:
df, y, nas = proc_df(df_raw, 'Survived')
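
The `X_test.head()` output further down shows an `Age_na` column, which suggests that missing numeric values are filled in and flagged. The sketch below illustrates that idea in plain pandas. `fill_numeric_nas` is a hypothetical, simplified stand-in and should not be read as fastai's implementation; filling with the median is an assumption.

```python
import pandas as pd

def fill_numeric_nas(df, na_dict=None):
    """Fill numeric NaNs with a per-column value and flag them in '<col>_na'.

    Hypothetical helper. The fill value is the column median unless
    na_dict already provides one (e.g. taken from the training set).
    """
    df = df.copy()
    na_dict = dict(na_dict or {})
    for col in list(df.columns):
        is_num = pd.api.types.is_numeric_dtype(df[col])
        if is_num and (df[col].isnull().any() or col in na_dict):
            df[col + "_na"] = df[col].isnull()
            filler = na_dict.get(col, df[col].median())
            df[col] = df[col].fillna(filler)
            na_dict[col] = filler
    return df, na_dict

# Toy training ages with one missing value.
train = pd.DataFrame({"Age": [22.0, None, 38.0, 26.0]})
filled, nas_toy = fill_numeric_nas(train)

# Reuse the training fill value on toy test data.
test = pd.DataFrame({"Age": [None, 30.0]})
filled_test, _ = fill_numeric_nas(test, nas_toy)
print(filled, filled_test, sep="\n")
```

Passing the training fill values to the test data mirrors the `na_dict=nas` argument used later in the notebook.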

Second attempt at the random forest

In [100]:
m = RandomForestClassifier(n_jobs=-1)
m.fit(df, y)
m.score(df, y)

1.0

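Clear overfitting: a 1.0 on the very rows the model was trained on says nothing about how it generalizes.  
Could it be because of the names?

Just in case, I will split the data into *train* and *valid*.  
The test set has 418 entries, roughly half the size of the train set (891). I cannot afford to hold out that much, so I will set aside about 20%, i.e. 180 rows.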

In [101]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 180  # about 20% of the 891 training rows (Kaggle's test set has 418)
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

((711, 12), (711,), (180, 12))
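
To make the arithmetic of the split explicit, here is a sketch that applies the same `split_vals` to a plain list standing in for the DataFrame rows. Note that the split is positional: the validation set is simply the last 180 rows, with no shuffling.

```python
def split_vals(a, n):
    """Return the first n items and the remaining items as separate copies."""
    return a[:n].copy(), a[n:].copy()

n_rows = 891                # rows in train.csv, from df_raw.shape
n_valid = 180               # validation size chosen above
n_trn = n_rows - n_valid    # rows left for training

rows = list(range(n_rows))  # stand-in for the DataFrame rows
train_rows, valid_rows = split_vals(rows, n_trn)

print(len(train_rows), len(valid_rows))  # 711 180
print(round(n_valid / n_rows, 3))        # 0.202
```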

## Baseline model

In [102]:
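def print_score(m):
    # Training and validation accuracy, computed two equivalent ways:
    # metrics.accuracy_score(y_true, y_pred) and the classifier's own m.score.
    res = [metrics.accuracy_score(y_train, m.predict(X_train)), metrics.accuracy_score(y_valid, m.predict(X_valid)),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    # The out-of-bag score exists only if the forest was built with oob_score=True.
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)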

In [103]:
m = RandomForestClassifier(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

CPU times: user 450 ms, sys: 109 ms, total: 558 ms
Wall time: 528 ms
[1.0, 0.8611111111111112, 1.0, 0.8611111111111112]
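
`print_score` appends `oob_score_` when the model has it, but the forests in this notebook are built without `oob_score=True`, so only four numbers are printed. The sketch below shows how the out-of-bag score could be switched on. It uses synthetic data from `make_classification` rather than the Titanic features, and `m_oob` is a name chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the Titanic features (not the real dataset).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# oob_score=True asks the forest to score each row using only the trees
# whose bootstrap sample did not include that row.
m_oob = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
m_oob.fit(X, y)

train_acc = m_oob.score(X, y)  # accuracy on rows the forest has already seen
oob_acc = m_oob.oob_score_     # out-of-bag accuracy estimate
print(train_acc, oob_acc)
```

The out-of-bag figure gives a second estimate of generalization without giving up any training rows, which may be useful with a dataset this small.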


## Test submission

This submission scores 0.63.

In [104]:
df_test = pd.read_csv(f'{PATH}test.csv', low_memory=False)

In [105]:
apply_cats(df=df_test, trn=df_raw)

In [106]:
X_test, _, nas = proc_df(df_test, na_dict=nas)

In [107]:
X_test.shape

(418, 12)

In [108]:
prediction = m.predict(X_test)

In [109]:
submission = pd.DataFrame()
submission['PassengerId'] = df_test.PassengerId
submission['Survived'] = prediction
submission.to_csv(f'{PATH}submission.csv', index=False)

-----------
## SECOND ATTEMPT
This time I will drop the Name column

In [110]:
df_raw = pd.read_csv(f'{PATH}train.csv', low_memory=False)

In [111]:
df_raw.shape

(891, 12)

In [112]:
df_raw.drop('Name', axis=1, inplace=True)

In [113]:
df_raw.shape

(891, 11)

Handle the categorical variables

In [114]:
train_cats(df_raw)

In [115]:
print(df_raw.isnull().sum().sort_index()/len(df_raw))

Age            0.198653
Cabin          0.771044
Embarked       0.002245
Fare           0.000000
Parch          0.000000
PassengerId    0.000000
Pclass         0.000000
Sex            0.000000
SibSp          0.000000
Survived       0.000000
Ticket         0.000000
dtype: float64


In [116]:
os.makedirs('tmp', exist_ok=True)
df_raw.to_feather('tmp/titanic-raw')

In [117]:
df_raw = pd.read_feather('tmp/titanic-raw')

Process the DataFrame (categorical variables, NAs, and splitting into df and y)

In [118]:
df, y, nas = proc_df(df_raw, 'Survived')

Second attempt at the random forest

In [119]:
m = RandomForestClassifier(n_jobs=-1)
m.fit(df, y)
m.score(df, y)

1.0

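Clear overfitting: a 1.0 on the very rows the model was trained on says nothing about how it generalizes.  

Just in case, I will split the data into *train* and *valid*.  
The test set has 418 entries, roughly half the size of the train set (891). I cannot afford to hold out that much, so I will set aside about 20%, i.e. 180 rows.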

In [120]:
def split_vals(a,n): return a[:n].copy(), a[n:].copy()

n_valid = 180  # about 20% of the 891 training rows (Kaggle's test set has 418)
n_trn = len(df)-n_valid
raw_train, raw_valid = split_vals(df_raw, n_trn)
X_train, X_valid = split_vals(df, n_trn)
y_train, y_valid = split_vals(y, n_trn)

X_train.shape, y_train.shape, X_valid.shape

((711, 11), (711,), (180, 11))

In [121]:
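def print_score(m):
    # Training and validation accuracy, computed two equivalent ways:
    # metrics.accuracy_score(y_true, y_pred) and the classifier's own m.score.
    res = [metrics.accuracy_score(y_train, m.predict(X_train)), metrics.accuracy_score(y_valid, m.predict(X_valid)),
                m.score(X_train, y_train), m.score(X_valid, y_valid)]
    # The out-of-bag score exists only if the forest was built with oob_score=True.
    if hasattr(m, 'oob_score_'): res.append(m.oob_score_)
    print(res)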

In [122]:
m = RandomForestClassifier(n_jobs=-1)
%time m.fit(X_train, y_train)
print_score(m)

CPU times: user 405 ms, sys: 140 ms, total: 545 ms
Wall time: 527 ms
[1.0, 0.8666666666666667, 1.0, 0.8666666666666667]


The result is very similar; I don't think it improves.
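
To put the two validation scores in perspective, the sketch below converts them back into counts of correctly predicted passengers out of the 180 validation rows. The variable names are chosen for the example.

```python
n_valid = 180
acc_with_name = 0.8611111111111112     # validation accuracy, first attempt
acc_without_name = 0.8666666666666667  # validation accuracy, second attempt

# Accuracy times the number of rows gives the count of correct predictions.
correct_with = round(acc_with_name * n_valid)
correct_without = round(acc_without_name * n_valid)

print(correct_with, correct_without, correct_without - correct_with)
```

The gap is a single passenger. Since the forest is built without a fixed `random_state`, a difference of that size could plausibly come from run-to-run variation alone.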

In [123]:
df_test.drop('Name', axis=1, inplace=True)

In [124]:
df_test.shape

(418, 10)

In [125]:
apply_cats(df=df_test, trn=df_raw)

In [126]:
X_test, _, nas = proc_df(df_test, na_dict=nas)

In [127]:
X_test.shape

(418, 11)

In [128]:
X_test.head()

Unnamed: 0,PassengerId,Pclass,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked,Age_na
0,892,3,2,34.5,0,0,0,7.8292,0,2,False
1,893,3,1,47.0,1,0,0,7.0,0,3,False
2,894,2,2,62.0,0,0,0,9.6875,0,2,False
3,895,3,2,27.0,0,0,0,8.6625,0,3,False
4,896,3,1,22.0,1,1,252,12.2875,0,3,False


In [129]:
prediction = m.predict(X_test)

In [130]:
submission = pd.DataFrame()
submission['PassengerId'] = df_test.PassengerId
submission['Survived'] = prediction
submission.to_csv(f'{PATH}submission.csv', index=False)

This submission scores about 0.67, a slight improvement.