# Building a model for the Kaggle Titanic Competition: summary (a 1st approach)

You can find the competition here: https://www.kaggle.com/c/titanic

The procedure could be the following.

## Load datasets

Load the train and test datasets. Then you will
- Train and validate on the train dataset
- Make predictions on the test dataset and submit them to Kaggle

In [None]:
import pandas as pd

train_df = pd.read_csv("data/titanic_train.csv")
test_df = pd.read_csv("data/titanic_test.csv")

Take a glimpse at the data 

In [None]:
train_df.head()  

In [None]:
test_df.head()  

Keep the test set's _PassengerId_ for later (submission phase)

In [None]:
test_passenger_id = test_df["PassengerId"]

Extract the target from `train_df` (the test set does not have it) and drop that column

In [None]:
target = train_df["Survived"].values
train_df = train_df.drop(["Survived"], axis=1)

Concatenate the train and test dataframes, since both will undergo the same transformations, but remember their dimensions as you'll need them later to split them again

In [None]:
m_train, n_train = train_df.shape
m_test, n_test = test_df.shape

df = pd.concat([train_df, test_df], axis=0)
df.info()
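
The concatenate-then-split pattern can be checked on toy data. The sketch below uses two small made-up frames (`toy_train` and `toy_test` are hypothetical, not the Titanic files) and shows that slicing by position with `iloc` recovers both parts, even though the index labels repeat after `pd.concat`.

```python
import pandas as pd

# Hypothetical toy frames standing in for the train and test sets
toy_train = pd.DataFrame({"Age": [22.0, 38.0, 26.0], "Sex": ["male", "female", "female"]})
toy_test = pd.DataFrame({"Age": [34.0, 47.0], "Sex": ["male", "female"]})

# Remember how many rows belong to the train part
m_toy_train = toy_train.shape[0]

# Stack the rows; the original index labels are kept, so they repeat (0, 1, 2, 0, 1)
combined = pd.concat([toy_train, toy_test], axis=0)

# iloc slices by position, not by label, so the repeated labels do not matter
recovered_train = combined.iloc[:m_toy_train, :]
recovered_test = combined.iloc[m_toy_train:, :]
```

Because the split relies on row position, any step between concatenation and split that drops or reorders rows would break it.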

## Remove non-informative (?) columns

As a first approach, we might suppose that _PassengerId_, _Name_ and _Ticket_ are irrelevant and can be dropped

In [None]:
df = df.drop(["PassengerId", "Name", "Ticket"], axis=1)

## Replace `nan` in numerical columns with the median/mean of the remaining values

Replace the missing _Age_ values with the median and the missing _Fare_ values with the mean

In [None]:
age_median_value = df["Age"].median()
df["Age"] = df["Age"].fillna(age_median_value)

fare_mean_value = df["Fare"].mean()
df["Fare"] = df["Fare"].fillna(fare_mean_value)
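
To see what this imputation does, here is a minimal sketch on a made-up frame (the values are illustrative, not taken from the Titanic data). It fills one column with its median and the other with its mean, both computed from the non-missing entries only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data with one missing value per column
toy = pd.DataFrame(
    {
        "Age": [20.0, 30.0, np.nan, 70.0],
        "Fare": [10.0, np.nan, 20.0, 300.0],
    }
)

# median() and mean() skip NaN by default
age_median = toy["Age"].median()  # median of 20, 30, 70
fare_mean = toy["Fare"].mean()    # mean of 10, 20, 300

filled = toy.assign(
    Age=toy["Age"].fillna(age_median),
    Fare=toy["Fare"].fillna(fare_mean),
)
print(filled)
```

Note how the single large fare pulls the mean well above the typical value; the median is less sensitive to such outliers, which is one reason to prefer it for skewed columns.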

## Apply one-hot encoding to categorical columns

In our case, it is perhaps better to replace the _Cabin_ with the corresponding deck, i.e. the first letter of the cabin code (see 19-Missing-Data.ipynb), and then do the one-hot encoding

In [None]:
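def transf(x):
    # Return the deck letter (first character) of a cabin string;
    # missing values (NaN is a float) are returned unchanged
    try:
        return x[0]
    except (TypeError, IndexError):
        return x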

df["Cabin"] = df["Cabin"].apply(transf)
df = pd.get_dummies(df)
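
The effect of these two steps is easier to see on a small made-up frame. In the sketch below, `first_letter` is a hypothetical stand-in for `transf` and the rows are illustrative. By default, `pd.get_dummies` encodes only the non-numeric columns and creates no indicator column for missing values, so a row without a cabin gets zeros in every `Cabin_*` column.

```python
import numpy as np
import pandas as pd


def first_letter(value):
    # Keep the deck letter for strings, leave missing values untouched
    return value[0] if isinstance(value, str) else value


# Hypothetical toy rows
toy = pd.DataFrame(
    {
        "Pclass": [1, 3, 1, 1],
        "Sex": ["female", "male", "male", "female"],
        "Cabin": ["C85", np.nan, "E46", "C123"],
    }
)

toy["Cabin"] = toy["Cabin"].apply(first_letter)

# Numeric columns pass through; string columns become indicator columns
encoded = pd.get_dummies(toy)
print(encoded)
```

If a separate "unknown deck" indicator is wanted, `pd.get_dummies(toy, dummy_na=True)` adds one.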

In [None]:
df.head()

## Transformed dataset

The transformed dataset now has the following structure

In [None]:
df.info()

In [None]:
df.describe().T

## Recover the train and test datasets

In [None]:
train_df = df.iloc[:m_train, :]
test_df = df.iloc[m_train:, :]

## Model 

To build the model, get the data from the dataframe as a NumPy array

In [None]:
data = train_df.values

Find the "best" `RandomForestClassifier` model using grid search with cross-validation

In [None]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

parameters = {"n_estimators": [10, 50, 100], "max_depth": [5, 10, 20, None]}

gscv = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid=parameters,
    cv=5,  # number of folds (StratifiedKFold is used for classifiers)
)

gscv.fit(data, target)

## Get the best achieved model and predict for test data

In [None]:
model = gscv.best_estimator_
prediction = model.predict(test_df.values)
prediction

## Submit to Kaggle

Build the CSV file to be submitted to Kaggle

In [None]:
df_pred = pd.DataFrame(
    {
        "PassengerId": test_passenger_id, # retrieved after loading
        "Survived": prediction
    })

df_pred

Save the predictions to a CSV file

In [None]:
df_pred.to_csv("my_titanic_survival_predictions.csv", index=False)
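
Before uploading, it can be worth checking the file layout. The sketch below is a minimal check, assuming the expected layout is exactly the two columns built above with 0/1 labels. It uses a made-up three-row submission and an in-memory buffer instead of a file on disk.

```python
import io

import pandas as pd

# Hypothetical toy submission (ids and labels are illustrative)
toy_submission = pd.DataFrame(
    {
        "PassengerId": [892, 893, 894],
        "Survived": [0, 1, 0],
    }
)

# Write to an in-memory buffer, then read it back as the CSV would be read
buffer = io.StringIO()
toy_submission.to_csv(buffer, index=False)
buffer.seek(0)
reloaded = pd.read_csv(buffer)

# With index=False the header holds only the two expected columns
header_ok = list(reloaded.columns) == ["PassengerId", "Survived"]
# Every prediction should be a 0/1 label
labels_ok = bool(reloaded["Survived"].isin([0, 1]).all())
print(header_ok, labels_ok)
```

Without `index=False`, `to_csv` would also write the row index as an extra unnamed column.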

## Not good enough...?

If you don't like your results, you can inspect the training details and decide what to try next, starting with the best cross-validation score

In [None]:
gscv.best_score_

The best parameters

In [None]:
gscv.best_params_

and the CV results

In [None]:
gscv.cv_results_
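
The raw `cv_results_` dictionary is hard to read. One option is to load it into a dataframe and sort by rank, as sketched below. This is a minimal example on synthetic data from `make_classification` with a deliberately small grid; `toy_search` and the data are hypothetical stand-ins for `gscv` and the Titanic features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data (not the Titanic dataset)
X_toy, y_toy = make_classification(n_samples=80, n_features=6, random_state=0)

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [5, 10], "max_depth": [2, None]},
    cv=3,
)
toy_search.fit(X_toy, y_toy)

# One row per parameter combination, best-ranked first
results = pd.DataFrame(toy_search.cv_results_)
columns = [
    "param_n_estimators",
    "param_max_depth",
    "mean_test_score",
    "std_test_score",
    "rank_test_score",
]
summary = results[columns].sort_values("rank_test_score").reset_index(drop=True)
print(summary)
```

Comparing `mean_test_score` with `std_test_score` across rows gives a rough idea of whether the differences between parameter combinations are larger than the fold-to-fold noise.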