
# <center> Getting started in ML with the Titanic dataset! <br>
![RMS Titanic](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fd/RMS_Titanic_3.jpg/300px-RMS_Titanic_3.jpg "Titanic photo")
<br>
Hello there and welcome to my kernel. Here I will use the Titanic dataset from the Titanic competition.<br>
Before we start I must say that I'm no data scientist or ML engineer (yet ;-) ), but a student interested in DS and ML.<br>
Here I'll try to cover topics such as:<br>
* basic exploratory data analysis
* some simple ML models such as LogisticRegression and DecisionTree
* handling missing values, scaling, feature selection and feature engineering
* ensembling with RandomForest etc.
* hyperparameter tuning with cross_val_score and GridSearchCV

In [None]:
%env JOBLIBTEMPFOLDER=/tmp

In [None]:
# From now on I will try to do all the imports
# right before the section where I use them
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import os
# you can comment the following 2 lines if you'd like to see warnings
import warnings
warnings.filterwarnings('ignore')

In [None]:
# If you run this notebook on your own computer, you will probably want to change these
INPUT_DATA_DIR = '../input'
TRAIN_FILE_NAME = 'train.csv'
TEST_FILE_NAME = 'test.csv'
TRAIN_DATA_PATH = os.path.join(INPUT_DATA_DIR, TRAIN_FILE_NAME)
TEST_DATA_PATH = os.path.join(INPUT_DATA_DIR, TEST_FILE_NAME)

In [None]:
# helper function to write submissions
def write_submission_file(prediction, filename,
    path_to_sample=os.path.join(INPUT_DATA_DIR, 'gender_submission.csv')):
    submission = pd.read_csv(path_to_sample, index_col='PassengerId')
    submission['Survived'] = prediction
    submission.to_csv(filename)

## 1 - The very beginnings
### Let's see what we got
It's always a good idea to look at data before you start doing anything.

In [None]:
train_data = pd.read_csv(TRAIN_DATA_PATH)
test_data = pd.read_csv(TEST_DATA_PATH)

In [None]:
train_data.head()

In [None]:
print(f'So we got {train_data.columns.values} columns.')

In [None]:
train_data.info()

In [None]:
test_data.info()

As you can see it's a fairly small dataset (11 features + target and only 891 rows).<br>
Also, a lot of values are missing in the `Cabin` column, and some `Age`, `Fare` and `Embarked` values are missing as well across the two datasets.<br>
Now let's see what pandas can tell us about the features.

In [None]:
train_data.describe()

So as we can see, only around **38%** of the people in the train dataset actually survived.

In [None]:
train_data.drop(['Name', 'Ticket'], axis=1).describe(include=['object', 'bool'])
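
The missing values mentioned above can also be counted per column instead of being read off `info()`. Below is a minimal sketch on a made-up toy frame (the `toy` data and the `missing` table are illustrative assumptions, not part of the competition data); the same two calls should work on `train_data` and `test_data`.

```python
import numpy as np
import pandas as pd

# toy frame standing in for train_data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare': [7.25, 71.28, 7.92, 8.05],
})

# number and share of missing values per column, worst column first
missing = pd.DataFrame({
    'n_missing': toy.isnull().sum(),
    'share': toy.isnull().mean(),
}).sort_values('n_missing', ascending=False)
print(missing)
```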

### Now let's do some initial preprocessing <br>
It would be a good idea to exclude `PassengerId` and `Ticket`.<br>
Transforming `Sex` to an integer (1 for men, 0 for women) will help our classifier as well.<br>
Finally, we fill the NaN values for `Age`, `Embarked` and `Fare`.

In [None]:
train_test = [train_data, test_data]
# mean age per (Pclass, Sex) group, computed on the train set only
p_s_age_means = (train_data.groupby(['Pclass', 'Sex'])['Age']
                 .mean().rename('GroupAge'))

for dataset in train_test:
    # excluding some features
    dataset.drop(['PassengerId', 'Ticket'], axis=1, inplace=True)
    # fill missing age with the train mean of the passenger's own (Pclass, Sex) group
    group_age = dataset[['Pclass', 'Sex']].join(p_s_age_means,
                                                on=['Pclass', 'Sex'])['GroupAge']
    dataset['Age'] = dataset['Age'].fillna(group_age)
    dataset['Sex'] = dataset['Sex'].map({'male': 1, 'female': 0})
#     #fill missing age with median
#     dataset['Age'] = dataset['Age'].fillna(train_data['Age'].median())

    #fill missing embarked with mode
    dataset['Embarked'] = dataset['Embarked'].fillna(train_data['Embarked'].mode()[0])

    #fill missing fare with median
    dataset['Fare'] = dataset['Fare'].fillna(train_data['Fare'].median())
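
When only a single frame is involved, the group-wise `Age` fill (mean per `Pclass` and `Sex`) can also be written with `groupby(...).transform`. This is a minimal sketch on made-up data (the `toy` frame is an assumption for illustration). Note that it computes the means on the frame it fills, so for the test set you would still want means taken from the train set.

```python
import numpy as np
import pandas as pd

# made-up passengers: two (Pclass, Sex) groups, one missing age in each
toy = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Sex': ['male', 'male', 'male', 'female', 'female', 'female'],
    'Age': [40.0, 50.0, np.nan, 20.0, np.nan, 30.0],
})

# transform('mean') returns one value per row: the mean age of that row's group
group_mean = toy.groupby(['Pclass', 'Sex'])['Age'].transform('mean')
toy['Age'] = toy['Age'].fillna(group_mean)
print(toy)
```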

## 2 - A little bit of visualization and feature engineering
Let's look into our data with the help of visualization. For this purpose we will use `matplotlib` and `seaborn`.

In [None]:
# import necessary modules
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

Let's see how `Sex` affects the survival odds.

In [None]:
sns.countplot(x='Sex', hue='Survived', data=train_data)
plt.ylabel('Survived count')
plt.show()

It is clear that the survival ratio for women is much higher. <br>
So `Sex` is a very valuable feature for us.

In [None]:
is_male = (train_data['Sex'] == 1)
male_survived_perc = train_data[is_male]['Survived'].mean()
female_survived_perc = train_data[~is_male]['Survived'].mean()
print(f'% of men who survived: {male_survived_perc*100:.2f}')
print(f'% of women who survived: {female_survived_perc*100:.2f}')

In [None]:
sns.countplot(x='Pclass', hue='Survived', data=train_data)
plt.show()

We see much the same situation for `Pclass`: the survival ratio for 3rd class is quite low. <br>
And it's quite logical, since the breach in the ship was on the lowest levels.

In [None]:
sns.catplot(x='Sex', hue='Survived', col='Pclass',
            data=train_data, saturation=1,
            kind='count', ci=None, aspect=0.5, height=5)
plt.show()

Well, almost all women from 1st class survived, and ~90% of women from 2nd class as well, but 50% for the third class is quite low.<br>
For men the situation was tragic in every class, but a little better in 1st class.

At this point we have seen that `Pclass` and `Sex` give us a lot of information. <br>
I am curious to build a baseline model on just these two features, so that we can see how adding new features affects the model. <br>
So let's pause and fit a basic `DecisionTreeClassifier` with the default parameters and see what we get with `cross_val_score`.

In [None]:
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

In [None]:
train_target = train_data['Survived'] # saving target feature
base_feats = ['Sex', 'Pclass']
X_train_base = train_data[base_feats]
X_test_base = test_data[base_feats]
scores = cross_val_score(estimator=DecisionTreeClassifier(random_state=17),
                         X=X_train_base, y=train_target,
                         # I know that it's a big cv value,
                         # but the dataset is small,
                         # so we can afford it
                         cv=10, scoring='accuracy', n_jobs=-1)
print(f'The mean accuracy of our baseline is {np.mean(scores)*100:.2f} %')
base_est = DecisionTreeClassifier(random_state=17)
base_est.fit(X_train_base, train_target)
preds = base_est.predict(X_test_base)
write_submission_file(preds, 'baseline_preds.csv')

About 78% in cross-validation and around 70% on the test data is quite good for a baseline with only two features. <br>
Let's look further into our data!
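
For context on a baseline score like the one above, it can help to compare it with the accuracy of always predicting the majority class. The sketch below uses scikit-learn's `DummyClassifier` on made-up labels (the 60/40 split in `y_toy` is an assumption for illustration, not the real class balance); on the real data you would pass `X_train_base` and `train_target` instead.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import cross_val_score

# made-up data: 60 non-survivors, 40 survivors, one uninformative feature
X_toy = np.zeros((100, 1))
y_toy = np.array([0] * 60 + [1] * 40)

# 'most_frequent' always predicts the majority class seen during fit
dummy_scores = cross_val_score(DummyClassifier(strategy='most_frequent'),
                               X_toy, y_toy, cv=5, scoring='accuracy')
print(f'Majority-class accuracy: {dummy_scores.mean():.2f}')
```

Any real model should beat this number by a clear margin before its score is taken seriously.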

In [None]:
sns.catplot(x='Pclass', y='Survived', hue='Sex', col='Embarked',
            data=train_data, kind='point', ci=None, aspect=0.6)
plt.show()

Nothing special here. Women have much higher survival probability than men, but we see that survival rate differs from port to port.

From now on we can try to create new features. <br>
Let's try and see what we can get.

In [None]:
for dataset in train_test:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

In [None]:
sns.pointplot(x='FamilySize',y='Survived', hue='Sex',
              data=train_data, ci=None, scale=0.8)
plt.show()

Looks like there is a relation between `FamilySize` and `Survived` for both men and women.

In [None]:
for dataset in train_test:
    dataset['IsAlone'] = (dataset['FamilySize'] == 1).astype('int')

In [None]:
sns.catplot(x='IsAlone', hue='Survived', col='Sex',
            data=train_data, saturation=1,
            kind='count', ci=None, aspect=.7)
plt.show()

As we can see, women travelling alone survived slightly more often, while for men the situation is the opposite.

In [None]:
for dataset in train_test:
    dataset['Title'] = dataset['Name']\
                       .str.split(",", expand=True)[1]\
                       .str.split(".", expand=True)[0]\
                       .str.strip()
    
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')
    
    rare_count = 10
    title_names = (dataset['Title'].value_counts() < rare_count)
    dataset['Title'] = dataset['Title']\
                       .apply(lambda x: 'Rare' if title_names.loc[x] == True else x)
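
The title extraction above can also be sketched with a single regular expression. This is only an illustration on a few sample names; the fixed `common` set is a hypothetical choice for this sketch, whereas the cell above decides what is rare from the value counts.

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Cumings, Mrs. John Bradley (Florence Briggs Thayer)',
    'Rothes, the Countess. of (Lucy Noel Martha Dyer-Edwards)',
])

# capture everything between the comma and the first full stop
titles = names.str.extract(r',\s*([^.]+)\.', expand=False).str.strip()

# hypothetical fixed list of frequent titles; everything else becomes 'Rare'
common = {'Mr', 'Miss', 'Mrs', 'Master'}
titles = titles.where(titles.isin(common), 'Rare')
print(titles.tolist())
```

A fixed list like this has one practical advantage: train and test are guaranteed to end up with the same set of categories.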

In [None]:
sns.countplot(x='Title', hue='Survived',
              data=train_data, saturation=1)
plt.show()

In [None]:
for dataset in train_test:
    dataset['FareBin'] = pd.qcut(dataset['Fare'], 4)

In [None]:
sns.catplot(x='FareBin', row='Sex', hue='Survived',
            data=train_data, saturation=1, kind='count',
            ci=None, aspect=1.5, height=4)
plt.show()

In [None]:
for dataset in train_test:
    dataset['AgeBin'] = pd.cut(dataset['Age'].astype(int), 4)

In [None]:
sns.catplot(x='AgeBin', hue='Survived', row='Sex',
            data=train_data, kind='count', saturation=1,
            ci=None, aspect=1.5, height=4)
plt.show()

* The higher the `AgeBin`, the lower the chances for men
* The higher the `AgeBin`, the higher the chances for women
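
Both `FareBin` and `AgeBin` above are cut separately for the train and test sets, so the bin edges can differ between the two. One possible way to keep them aligned is to learn the edges on the train set and reuse them on the test set. This is a sketch on made-up fares (`train_fare` and `test_fare` are assumptions for illustration), not what the notebook does.

```python
import numpy as np
import pandas as pd

# made-up fares standing in for the real columns
train_fare = pd.Series([7.25, 8.05, 13.0, 26.0, 31.0, 71.28, 151.55, 512.33])
test_fare = pd.Series([7.9, 30.0, 262.4])

# learn quartile edges on the train set only; labels=False returns integer codes
train_bins, edges = pd.qcut(train_fare, 4, retbins=True, labels=False)

# widen the outer edges so unseen extreme test values still fall into a bin
edges[0], edges[-1] = -np.inf, np.inf

# apply the same edges to the test set
test_bins = pd.cut(test_fare, bins=edges, labels=False)
print(train_bins.tolist(), test_bins.tolist())
```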

In [None]:
# get_deck = lambda x: (ord(x[0]) - ord('A') +1) if x[0] != 'T' else 1
# train_data['Deck'] = train_data['Cabin'].map(get_deck, na_action='ignore')

In [None]:
# # now let's fill missing with modes
# modes = train_data.groupby(by='Pclass').agg({'Deck': pd.Series.mode}).values
# pclass_deck_modes = dict(enumerate(modes[:, 0], 1))
# train_data['Deck'].fillna(train_data['Pclass'].map(pclass_deck_modes, na_action=None),
#                           inplace=True)
# # # or you can fill all NaN with special value
# # train_data['Deck'].fillna(0, inplace=True) 

In [None]:
# sns.catplot(x='Sex', y='Survived', col='Deck',
#             data=train_data.sort_values(by='Deck'), kind='bar',
#             ci=None, aspect=.35)
# plt.show()

In [None]:
# test_data['Deck'] = test_data['Cabin'].map(get_deck, na_action='ignore')
# test_data['Deck'].fillna(test_data['Pclass'].map(pclass_deck_modes, na_action=None),
#                           inplace=True)
# # test_data['Deck'].fillna(0, inplace=True)

In [None]:
for dataset in (train_test):
    dataset['Name_Length'] = dataset['Name'].apply(len)
    dataset['Has_Cabin'] = (~dataset['Cabin'].isnull()).astype('int')

In [None]:
sns.catplot(x='Has_Cabin', y='Survived', hue='Sex',
            data=train_data, saturation=1, kind='bar',
            ci=None, aspect=1.5, height=4)
plt.show()

In [None]:
from sklearn.preprocessing import LabelEncoder

label = LabelEncoder()
for dataset in train_test:
    dataset['AgeBin_Code'] = label.fit_transform(dataset['AgeBin'])
    dataset['FareBin_Code'] = label.fit_transform(dataset['FareBin'])

In [None]:
feats_to_exclude = ['Name', 'Age', 'Fare', 'Cabin', #'IsAlone', 'SibSp', 'Parch',
                    'FareBin', 'AgeBin']
X_train = train_data.drop(['Survived']+feats_to_exclude, axis=1)
X_test = test_data.drop(feats_to_exclude, axis=1)

In [None]:
X_train = pd.get_dummies(X_train)
X_test = pd.get_dummies(X_test)
X_train.head(4)
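
`pd.get_dummies` builds columns from the categories it sees, so train and test can end up with different columns if a category is missing from one of them. A minimal sketch of guarding against that with `reindex` follows; the toy frames are assumptions for illustration.

```python
import pandas as pd

# made-up frames: the test set happens to contain only one port
train_toy = pd.DataFrame({'Embarked': ['S', 'C', 'Q'], 'Pclass': [3, 1, 2]})
test_toy = pd.DataFrame({'Embarked': ['S', 'S'], 'Pclass': [3, 2]})

train_d = pd.get_dummies(train_toy)
# force the test frame to have exactly the train columns, filling gaps with 0
test_d = pd.get_dummies(test_toy).reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns))
```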

In [None]:
from sklearn.preprocessing import MinMaxScaler

features_to_scale = ['Name_Length',#'Pclass', 'FamilySize', 'Parch', 'SibSp',
                     'AgeBin_Code', 'FareBin_Code']
feats_scaler = MinMaxScaler()
X_train[features_to_scale] = feats_scaler.fit_transform(X_train[features_to_scale])
X_test[features_to_scale] = feats_scaler.transform(X_test[features_to_scale])

In [None]:
X_train.head(4)

## 3 - Selecting and tuning our classifier
Now we can try different classifiers. <br>

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (RandomForestClassifier, 
                              AdaBoostClassifier,
                              ExtraTreesClassifier)
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

In [None]:
def get_score(estimator, X, y):
    scores = cross_val_score(estimator, X, y,
                             cv=10, n_jobs=-1, scoring='accuracy')
    return scores.mean()

In [None]:
clfs = [RandomForestClassifier(random_state=17), 
        ExtraTreesClassifier(random_state=17),
#         DecisionTreeClassifier(random_state=17),
        AdaBoostClassifier(random_state=17),
        KNeighborsClassifier(),
        SVC(random_state=17),
        LogisticRegression(random_state=17),
        GaussianNB()]
clfs_names = ['RandomForestClassifier', 
              'ExtraTreesClassifier',
#               'DecisionTreeClassifier',
              'AdaBoostClassifier',
              'KNeighborsClassifier',
              'SVC',
              'LogisticRegression',
              'GaussianNB']
clf_scores = dict(zip(clfs_names,(get_score(clf,
                                            X_train,
                                            train_target) for clf in clfs)))

In [None]:
pd.DataFrame.from_dict(clf_scores,
                       orient='index',
                       columns=['Score']).sort_values(by='Score', ascending=False)

In [None]:
nb = GaussianNB()
nb.fit(X_train, train_target)
nb_preds = nb.predict(X_test)
write_submission_file(nb_preds, 'nb_submission.csv')

I would also like to use RandomForest for a moment, just to show the feature importances.

In [None]:
rf = RandomForestClassifier(random_state=17).fit(X_train, train_target)
importances = rf.feature_importances_
sorted_idx = np.argsort(importances)
plt.barh(X_train.columns[sorted_idx], importances[sorted_idx])
plt.title('Feature importances: RandomForest')
plt.show()

Looks like some features are almost useless for RandomForest, but I will not drop them right now.

Why don't we try to stack some models? For example, let's take our tuned SVC, random forest and logistic regression.<br>
Then we use their predictions as features and train a new classifier on them.<br>
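
The stacking idea can be sketched with scikit-learn's `cross_val_predict`, which returns out-of-fold predictions in the original row order. This is a minimal illustration on synthetic data, with two arbitrary base models chosen for this sketch (not the tuned ones), and it is not the helper this notebook defines.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_predict
from sklearn.tree import DecisionTreeClassifier

# synthetic data standing in for X_train / train_target / X_test
X, y = make_classification(n_samples=200, n_features=8, random_state=17)
X_tr, y_tr, X_te = X[:150], y[:150], X[150:]

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=17)
base_models = {
    'tree': DecisionTreeClassifier(max_depth=3, random_state=17),
    'logreg': LogisticRegression(max_iter=1000),
}

# each train row is predicted by a model that never saw it during fitting
oof = np.column_stack([cross_val_predict(m, X_tr, y_tr, cv=cv)
                       for m in base_models.values()])
# for the test set, refit every base model on the full train data
test_meta = np.column_stack([m.fit(X_tr, y_tr).predict(X_te)
                             for m in base_models.values()])

# the second-level model learns from the base models' predictions only
meta = LogisticRegression().fit(oof, y_tr)
stacked = meta.predict(test_meta)
print(oof.shape, stacked.shape)
```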

In [None]:
from sklearn.model_selection import StratifiedKFold, GridSearchCV

In [None]:
def get_stacking_preds(estimator, X_train, train_target,
                       X_test, n_splits=10, random_state=17):
    # shuffle=True is required for random_state to have any effect
    folds = StratifiedKFold(n_splits=n_splits, shuffle=True,
                            random_state=random_state)
    # out-of-fold predictions, stored at the original row positions
    train_pred = np.empty(len(X_train))
    # for each fold
    for train_indices, val_indices in folds.split(X_train, train_target):
        # splitting the train set
        x_train, x_val = X_train.iloc[train_indices], X_train.iloc[val_indices]
        y_train = train_target.iloc[train_indices]
        # training the model on the training part of the train set
        estimator.fit(X=x_train, y=y_train)
        # predict only the held-out fold to avoid data leakage
        train_pred[val_indices] = estimator.predict(x_val)
    # fit on the full train data
    estimator.fit(X_train, train_target)
    # and predict the test data
    test_pred = estimator.predict(X_test)
    return train_pred, test_pred

In [None]:
%%time
svc = SVC(random_state=17)
Cs = np.linspace(0.01, 10, 15)
gamma = ['auto', 'scale']+ list(np.logspace(-5, 1, num=15))
kernel = ['linear', 'rbf', 'sigmoid']
svc_params = {'C': Cs, 'gamma' : gamma, 'kernel': kernel}
svc_grid = GridSearchCV(estimator=svc, param_grid=svc_params, cv=10,
                        n_jobs=-1, scoring='accuracy', verbose=False)
svc_grid.fit(X_train, train_target)

print(svc_grid.best_score_, svc_grid.best_params_)
best_svc = svc_grid.best_estimator_

best_svc.fit(X_train, train_target)
best_svc_preds = best_svc.predict(X_test)
write_submission_file(best_svc_preds, 'best_svc_preds_submission.csv')
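
A side note on `GridSearchCV`: with the default `refit=True`, the best parameter combination is refit on the whole training set once the search finishes, so `best_estimator_` can be used for prediction directly. The sketch below shows this on synthetic data with a deliberately tiny grid (both are assumptions for illustration).

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# synthetic data standing in for X_train / train_target
X_toy, y_toy = make_classification(n_samples=120, n_features=6, random_state=17)

grid = GridSearchCV(SVC(),
                    param_grid={'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']},
                    cv=5, scoring='accuracy')
grid.fit(X_toy, y_toy)

# both calls use the same model, already refit on all of X_toy
preds_via_grid = grid.predict(X_toy)
preds_via_best = grid.best_estimator_.predict(X_toy)
print(grid.best_params_, np.array_equal(preds_via_grid, preds_via_best))
```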

In [None]:
%%time
rf = RandomForestClassifier(random_state=17)
rf_params = {'n_estimators': [100, 250, 500, 750], 
             'max_depth': [None, 3, 5, 8],
             'min_samples_leaf': [1, 2, 5],
             'min_samples_split': [2, 4, 10],
             'max_features': [None, 'sqrt', 'log2', 0.6]}
rf_grid = GridSearchCV(estimator=rf, param_grid=rf_params,
                       cv=5, n_jobs=-1, verbose=False)
rf_grid.fit(X_train, train_target)

print(rf_grid.best_score_, rf_grid.best_params_)
best_rf = rf_grid.best_estimator_

best_rf.fit(X_train, train_target)
best_rf_preds = best_rf.predict(X_test)
write_submission_file(best_rf_preds, 'best_rf_preds_submission.csv')

In [None]:
%%time
et = ExtraTreesClassifier(random_state=17)
et_params = rf_params
et_grid = GridSearchCV(estimator=et, param_grid=et_params,
                       cv=5, n_jobs=-1, verbose=False)
et_grid.fit(X_train, train_target)

print(et_grid.best_score_, et_grid.best_params_)
best_et = et_grid.best_estimator_

In [None]:
%%time
ada = AdaBoostClassifier(random_state=17)
ada_params = {'n_estimators': [100, 200, 300, 400, 500, 600],
              'learning_rate': [0.01, 0.025, 0.05, 0.075, 0.1, 0.5]}
ada_grid = GridSearchCV(estimator=ada, param_grid=ada_params,
                        cv=5, n_jobs=-1, verbose=False)
ada_grid.fit(X_train, train_target)

print(ada_grid.best_score_, ada_grid.best_params_)
best_ada = ada_grid.best_estimator_

In [None]:
%%time
knn = KNeighborsClassifier()
params = {'n_neighbors': range(1, 10),
          'weights': ['uniform', 'distance'],
          'p': [1, 2, 3]}
knn_grid = GridSearchCV(estimator=knn,
                        param_grid=params,
                        cv=10, scoring='accuracy',
                        verbose=False, n_jobs=-1)
knn_grid.fit(X_train, train_target)

print(knn_grid.best_score_, knn_grid.best_params_)
best_knn = knn_grid.best_estimator_

In [None]:
%%time
# liblinear supports both the 'l1' and 'l2' penalties used in the grid below
lr = LogisticRegression(random_state=17, solver='liblinear')
lr_params = {'C': np.linspace(0.001, 10, 40),
            'class_weight': [None, 'balanced'],
            'penalty': ['l1', 'l2']}
lr_grid = GridSearchCV(estimator=lr,
                        param_grid=lr_params,
                        cv=10, scoring='accuracy',
                        verbose=False, n_jobs=-1)
lr_grid.fit(X_train, train_target)

print(lr_grid.best_score_, lr_grid.best_params_)
best_lr = lr_grid.best_estimator_
best_lr.fit(X_train, train_target)
best_lr_preds = best_lr.predict(X_test)
write_submission_file(best_lr_preds, 'best_lr_preds_submission.csv')


In [None]:
%%time
clfs = [
    best_svc,
    best_rf, 
    best_et,
    best_ada,
    best_knn,
    best_lr,
    GaussianNB()
]
# seed = 17
# clfs = [
#     SVC(random_state=seed),
#     RandomForestClassifier(random_state=seed), 
#     ExtraTreesClassifier(random_state=seed),
#     AdaBoostClassifier(random_state=seed),
#     KNeighborsClassifier(),
#     LogisticRegression(random_state=seed),
#     GaussianNB()
# ]
clfs_names = [
    'SVC', 
    'RandomForest',
    'ExtraTrees',
    'AdaBoost',
    'KNN',
    'LogisticRegression',
    'GaussianNB'
    ]
base_models_train_preds = pd.DataFrame()
base_models_test_preds = pd.DataFrame()
for clf, name in zip(clfs, clfs_names):
    clf_train_preds, clf_test_preds = get_stacking_preds(clf, X_train,
                                                         train_target, X_test)
    base_models_train_preds[name+'_pred'] = clf_train_preds
    base_models_test_preds[name+'_pred'] = clf_test_preds    
# X_train_stack = pd.concat([X_train, base_models_train_preds], axis=1)
# X_test_stack = pd.concat([X_train, base_models_test_preds], axis=1)
X_train_stack = base_models_train_preds
X_test_stack = base_models_test_preds

In [None]:
# import xgboost as xgb
# aggregator = xgb.XGBClassifier(
#     learning_rate = 0.02,
#     n_estimators= 1000,
#     max_depth= 3,
#     min_child_weight= 2,
#     gamma=0.8,
#     subsample=1,
#     colsample_bytree=0.7,
#     objective= 'binary:logistic',
#     n_jobs=-1,
#     scale_pos_weight=1
# )
# aggregator = SVC()
# aggregator = RandomForestClassifier()
# liblinear so that the 'l1' penalty in lr_params can be used when tuning
aggregator = LogisticRegression(solver='liblinear')
print(f'Score: {get_score(aggregator, X_train_stack, train_target)*100:.2f}%')

In [None]:
aggregator.fit(X_train_stack, train_target)
stacked_preds = aggregator.predict(X_test_stack)
write_submission_file(stacked_preds, 'stacked_untuned_submission.csv')
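
For comparison, scikit-learn (0.22 and later) ships a `StackingClassifier` that wraps the same out-of-fold idea in a single estimator. The sketch below runs it on synthetic data with two arbitrary base models picked for illustration; it is an alternative to the manual approach above, not a drop-in replacement for it.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# synthetic data standing in for the Titanic features
X_toy, y_toy = make_classification(n_samples=200, n_features=8, random_state=17)

stack = StackingClassifier(
    estimators=[('rf', RandomForestClassifier(n_estimators=50, random_state=17)),
                ('nb', GaussianNB())],
    final_estimator=LogisticRegression(),
    cv=5,                     # folds used to build the out-of-fold predictions
    stack_method='predict',   # feed hard class predictions to the final model
)
stack.fit(X_toy[:150], y_toy[:150])
stack_preds = stack.predict(X_toy[150:])
print(stack_preds[:10])
```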

In [None]:
%%time
lr_agg_grid = GridSearchCV(estimator=aggregator,
                        param_grid=lr_params,
                        cv=10, scoring='accuracy',
                        verbose=False, n_jobs=-1)
lr_agg_grid.fit(X_train_stack, train_target)

In [None]:
# %%time
# rf_agg_grid = GridSearchCV(estimator=aggregator,
#                         param_grid=rf_params,
#                         cv=10, scoring='accuracy',
#                         verbose=False, n_jobs=-1)
# rf_agg_grid.fit(X_train_stack, train_target)

In [None]:
# %%time
# svc_agg_grid = GridSearchCV(estimator=aggregator,
#                         param_grid=svc_params,
#                         cv=10, scoring='accuracy',
#                         verbose=False, n_jobs=-1)
# svc_agg_grid.fit(X_train_stack, train_target)

In [None]:
# aggregator = svc_agg_grid.best_estimator_
# aggregator = rf_agg_grid.best_estimator_
aggregator = lr_agg_grid.best_estimator_
print(lr_agg_grid.best_params_)
print(f'Score: {get_score(aggregator, X_train_stack, train_target)*100:.2f}%')
aggregator.fit(X_train_stack, train_target)
stacked_preds = aggregator.predict(X_test_stack)
write_submission_file(stacked_preds, 'stacked_tuned_submission.csv')