> **Problem overview**

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history.  On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew. Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

In [None]:
# import python standard library
import re

# import data manipulation library
import numpy as np
import pandas as pd

# import data visualization library
import matplotlib.pyplot as plt
import seaborn as sns

# import sklearn model class
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier

# import sklearn model selection
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

# import sklearn model evaluation classification metrics
from sklearn.metrics import accuracy_score, auc, classification_report, confusion_matrix, f1_score, fbeta_score, precision_recall_curve, precision_score, recall_score, roc_auc_score, roc_curve

> **Acquiring training and testing data**

We start by loading the training and testing datasets into pandas DataFrames.

In [None]:
# acquiring training and testing data
df_train = pd.read_csv('../input/train.csv')
df_test = pd.read_csv('../input/test.csv')

In [None]:
# visualize head of the training data
df_train.head(n=5)

In [None]:
# visualize tail of the testing data
df_test.tail(n=5)

In [None]:
# combine training and testing dataframe
df_train['DataType'], df_test['DataType'] = 'training', 'testing'
df_test.insert(1, 'Survived', np.nan)
df_data = pd.concat([df_train, df_test], ignore_index=True)

> **Feature exploration, engineering and cleansing**

Here we generate descriptive statistics that summarize the central tendency, dispersion, and shape of the dataset's distribution, and we explore the individual features.
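
As a minimal sketch of what such a summary looks like, the block below calls `describe(include='all')` on a tiny hand-made frame. The frame `toy` and its values are hypothetical and are not taken from the Titanic data; numeric columns get count, mean, standard deviation, and quartiles, while object columns get count, number of unique values, and the most frequent value.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Age': [22.0, 38.0, 26.0, 35.0],
    'Sex': ['male', 'female', 'female', 'male'],
})

# include='all' summarizes numeric and object columns in one table;
# statistics that do not apply to a column are reported as NaN
summary = toy.describe(include='all')
print(summary)
```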

In [None]:
# countplot function plot - counts of a categorical variable (x-axis), split by a second categorical variable (hue)
def countplot(x = None, y = None, data = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    for i, v in enumerate(x): sns.countplot(x=v, hue=y, data=data, ax=axes[i])

In [None]:
# boxplot function plot - categorical variable (x-axis) vs. numerical variable (y-axis)
def boxplot(cat = None, num = None, data = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    if type(cat) == list:
        for i, v in enumerate(cat): sns.boxplot(x=v, y=num, data=data, ax=axes[i])
    else:
        for i, v in enumerate(num): sns.boxplot(x=cat, y=v, data=data, ax=axes[i])

In [None]:
# swarmplot function plot - categorical variable (x-axis) vs. numerical variable (y-axis)
def swarmplot(cat = None, num = None, data = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    if type(cat) == list:
        for i, v in enumerate(cat): sns.swarmplot(x=v, y=num, data=data, ax=axes[i])
    else:
        for i, v in enumerate(num): sns.swarmplot(x=cat, y=v, data=data, ax=axes[i])

In [None]:
# violinplot function plot - categorical variable (x-axis) vs. numerical variable (y-axis)
def violinplot(cat = None, num = None, data = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    if type(cat) == list:
        for i, v in enumerate(cat): sns.violinplot(x=v, y=num, data=data, ax=axes[i])
    else:
        for i, v in enumerate(num): sns.violinplot(x=cat, y=v, data=data, ax=axes[i])

In [None]:
# scatterplot function plot - numerical variable (x-axis) vs. numerical variable (y-axis)
def scatterplot(x = None, y = None, data = None, ncols = 5, nrows = 3):
    fig, axes = plt.subplots(figsize=(4*ncols , 3*nrows), ncols=ncols, nrows=nrows)
    axes = axes.flatten()
    for i, xi in enumerate(x): sns.scatterplot(x=xi, y=y, data=data, ax=axes[i])

In [None]:
# describe training and testing data
df_data.describe(include='all')

In [None]:
# convert dtypes numeric to object
col_convert = ['Survived', 'Pclass', 'SibSp', 'Parch']
df_data[col_convert] = df_data[col_convert].astype('object')

In [None]:
# list all features type number
col_number = df_data.select_dtypes(include=['number']).columns.tolist()
print('features type number:\n items %s\n length %d' %(col_number, len(col_number)))

# list all features type object
col_object = df_data.select_dtypes(include=['object']).columns.tolist()
print('features type object:\n items %s\n length %d' %(col_object, len(col_object)))

In [None]:
# feature extraction: surname
df_data['Surname'] = df_data['Name'].str.extract(r'([A-Za-z]+),', expand=False)

In [None]:
# feature extraction: title
df_data['Title'] = df_data['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
df_data['Title'] = df_data['Title'].replace(['Capt', 'Rev'], 'Crew')
df_data['Title'] = df_data['Title'].replace('Ms', 'Miss')
df_data['Title'] = df_data['Title'].replace(['Col', 'Countess', 'Don', 'Dona', 'Jonkheer', 'Lady', 'Major', 'Mlle', 'Mme', 'Sir'], 'Royal')
df_data['Title'].value_counts()

In [None]:
# feature exploration: sex
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
countplot(x=col_object, y='Sex', data=df_data, nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature exploration: age
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
swarmplot(cat=col_object, num='Age', data=df_data, nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature extraction: age
df_data['Age'] = df_data['Age'].fillna(df_data.groupby(['Title'], as_index=True)['Age'].transform('mean'))

In [None]:
# feature extraction: family size
df_data['FamilySize'] = df_data['SibSp'] + df_data['Parch'] + 1

In [None]:
# feature extraction: ticket string
df_data['TicketString'] = df_data['Ticket'].apply(lambda x: ''.join(re.findall(r'[a-zA-Z]+', x)))
df_data['TicketString'] = df_data['TicketString'].replace(['CASOTON', 'SOTONO', 'STONO', 'STONOQ'], 'SOTONOQ')
df_data['TicketString'] = df_data['TicketString'].replace(['SC', 'SCParis'], 'SCPARIS')
df_data['TicketString'] = df_data['TicketString'].replace('FCC', 'FC')
df_data['TicketString'] = df_data['TicketString'].replace(df_data['TicketString'].value_counts()[df_data['TicketString'].value_counts() < 10].index.tolist(), 'OTHER')
df_data['TicketString'].value_counts()

In [None]:
# feature extraction: has ticket string
df_data['HasTicketString'] = df_data['TicketString'].apply(lambda x: 1 if x else 0).astype('object')

In [None]:
# feature exploration: fare
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
swarmplot(cat=col_object, num='Fare', data=df_data, nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature extraction: fare
df_data['Fare'] = df_data['Fare'].fillna(df_data.groupby(['Pclass'], as_index=True)['Fare'].transform('mean'))

In [None]:
# feature extraction: cabin
df_data['Cabin'] = df_data['Cabin'].fillna(0)

In [None]:
# feature extraction: cabin string
df_data['CabinString'] = df_data['Cabin'].str.extract(r'([A-Za-z]+)', expand=False)

In [None]:
# feature extraction: has cabin
df_data['HasCabin'] = df_data['CabinString'].apply(lambda x: 0 if pd.isnull(x) else 1).astype('object')

In [None]:
# feature exploration: embarked
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
countplot(x=col_object, y='Embarked', data=df_data, nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature extraction: embarked
df_data['Embarked'] = df_data['Embarked'].fillna(df_data['Embarked'].value_counts().idxmax())

In [None]:
# list all features type number
col_number = df_data.select_dtypes(include=['number']).columns.tolist()
print('features type number:\n items %s\n length %d' %(col_number, len(col_number)))

# list all features type object
col_object = df_data.select_dtypes(include=['object']).columns.tolist()
print('features type object:\n items %s\n length %d' %(col_object, len(col_object)))

In [None]:
# feature exploration: survived
col_number = df_data.select_dtypes(include=['number']).columns.drop(['PassengerId']).tolist()
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
swarmplot(cat='Survived', num=col_number, data=df_data[df_data['DataType'] == 'training'], nrows=(len(col_number) - 1) // 5 + 1)
countplot(x=col_object, y='Survived', data=df_data[df_data['DataType'] == 'training'], nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature exploration: survived where family size equal to 1
col_number = df_data.select_dtypes(include=['number']).columns.drop(['PassengerId']).tolist()
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
swarmplot(cat='Survived', num=col_number, data=df_data[(df_data['DataType'] == 'training') & (df_data['FamilySize'] == 1)], nrows=(len(col_number) - 1) // 5 + 1)
countplot(x=col_object, y='Survived', data=df_data[(df_data['DataType'] == 'training') & (df_data['FamilySize'] == 1)], nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature exploration: survived where family size more than 1
col_number = df_data.select_dtypes(include=['number']).columns.drop(['PassengerId']).tolist()
col_object = df_data.select_dtypes(include=['object']).columns.drop(['Name', 'Ticket', 'Cabin', 'Surname']).tolist()
swarmplot(cat='Survived', num=col_number, data=df_data[(df_data['DataType'] == 'training') & (df_data['FamilySize'] > 1)], nrows=(len(col_number) - 1) // 5 + 1)
countplot(x=col_object, y='Survived', data=df_data[(df_data['DataType'] == 'training') & (df_data['FamilySize'] > 1)], nrows=(len(col_object) - 1) // 5 + 1)

The exploratory data analysis shows that:
* **Pclass:** 1st-class passengers tended to survive more often than 2nd-class passengers, who in turn survived more often than 3rd-class passengers.
* **Sex:** Females tended to survive more often than males.
* **Title:** Passengers with the Master or Royal title tended to survive more often than those with other male titles.
* **FamilySize:** Passengers travelling with family tended to survive more often than those travelling alone.
* **CabinString:** Passengers assigned to cabins A to F, except C, tended to survive more often, followed by those in cabin C and then cabin G.
* **HasCabin:** Passengers with an assigned cabin tended to survive more often than those without one.
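
The tendencies listed above can also be checked numerically by averaging `Survived` within each group. The sketch below uses a tiny hand-made frame (`toy`, with hypothetical values that are not taken from the Titanic data) only to show the pattern; on the notebook's data the same call would presumably be applied to the training rows of `df_data`.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 3, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'male', 'female'],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# the mean of a 0/1 column is the share of survivors in each group
rate_by_class = toy.groupby('Pclass')['Survived'].mean()
rate_by_sex = toy.groupby('Sex')['Survived'].mean()

print(rate_by_class)
print(rate_by_sex)
```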

In [None]:
# feature extraction: ticket dataframe
df_ticket = pd.get_dummies(df_data[df_data['FamilySize'] > 1], columns=['Pclass', 'Sex', 'Embarked', 'DataType', 'Title', 'CabinString', 'HasCabin'], drop_first=False)
df_ticket['Survived'] = df_ticket['Survived'].astype(float)
df_ticket = df_ticket.groupby(['Ticket'], as_index=False).agg({
    'Survived': 'mean',
    'Pclass_1': 'sum', 'Pclass_2': 'sum', 'Pclass_3': 'sum',
    'Sex_male': 'sum', 'Sex_female': 'sum',
    'Embarked_C': 'sum', 'Embarked_Q': 'sum', 'Embarked_S': 'sum',
    'DataType_training': 'sum', 'DataType_testing': 'sum',
    'Title_Crew': 'sum', 'Title_Dr': 'sum', 'Title_Master': 'sum', 'Title_Miss': 'sum', 'Title_Mr': 'sum', 'Title_Mrs': 'sum', 'Title_Royal': 'sum',
    'CabinString_A': 'sum', 'CabinString_B': 'sum', 'CabinString_C': 'sum', 'CabinString_D': 'sum', 'CabinString_E': 'sum', 'CabinString_F': 'sum', 'CabinString_G': 'sum',
    'HasCabin_0': 'sum', 'HasCabin_1': 'sum'
})

In [None]:
# describe ticket dataframe
df_ticket.describe(include='all')

In [None]:
# convert dtypes numeric to object
col_convert = df_ticket.columns.drop('Ticket').tolist()
df_ticket[col_convert] = df_ticket[col_convert].astype('object')

In [None]:
# convert dtypes object to numeric
col_convert = ['Survived']
df_ticket[col_convert] = df_ticket[col_convert].astype(float)

In [None]:
# feature extraction: together
df_ticket['Together'] = df_ticket['Survived'].apply(lambda x: 1 if x == 0 or x == 1 else 0).astype('object')

In [None]:
# feature exploration: survived
col_object = df_ticket.select_dtypes(include=['object']).columns.drop('Ticket').tolist()
swarmplot(cat=col_object, num='Survived', data=df_ticket, nrows=(len(col_object) - 1) // 5 + 1)
violinplot(cat=col_object, num='Survived', data=df_ticket, nrows=(len(col_object) - 1) // 5 + 1)

In [None]:
# feature extraction: with sex and title
df_data = pd.merge(df_data, df_ticket[['Ticket', 'Sex_male', 'Sex_female', 'Title_Crew', 'Title_Dr', 'Title_Master', 'Title_Miss', 'Title_Mr', 'Title_Mrs', 'Title_Royal']], how='left', left_on='Ticket', right_on='Ticket').rename(columns={
    'Sex_male': 'WithSexMale', 'Sex_female': 'WithSexFemale',
    'Title_Crew': 'WithTitleCrew', 'Title_Dr': 'WithTitleDr', 'Title_Master': 'WithTitleMaster', 'Title_Miss': 'WithTitleMiss', 'Title_Mr': 'WithTitleMr', 'Title_Mrs': 'WithTitleMrs', 'Title_Royal': 'WithTitleRoyal'
})
col_fillnas = ['WithSexMale', 'WithSexFemale', 'WithTitleCrew', 'WithTitleDr', 'WithTitleMaster', 'WithTitleMiss', 'WithTitleMr', 'WithTitleMrs', 'WithTitleRoyal']
for col_fillna in col_fillnas: df_data[col_fillna] = df_data[col_fillna].fillna(0)

In [None]:
# feature extraction: ticket_self dataframe
df_temp = df_data.copy(deep=True)
df_temp['Survived'] = df_temp['Survived'].astype(float)
df_ticket_self = df_temp.groupby(['Ticket'], as_index=True)

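# feature extraction: survived peer (survival rate of the other passengers on the same ticket)
own = df_temp['Survived']
count = df_ticket_self['Survived'].transform('count')
total = df_ticket_self['Survived'].transform('sum')
# subtract the passenger's own label only when it is known, so testing rows also receive a peer rate
df_data['SurvivedPeer'] = (total - own.fillna(0)) / (count - own.notna().astype(int))
df_data['SurvivedPeer'] = df_data['SurvivedPeer'].astype(float).fillna(-1)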

In [None]:
# feature extraction: ticket_title dataframe
df_temp = df_data.copy(deep=True)
df_temp['Survived'] = df_temp['Survived'].astype(float)
col_revises = ['Crew', 'Dr', 'Master', 'Miss', 'Mr', 'Mrs', 'Royal']
for col_revise in col_revises:
    df_temp['Survived' + col_revise] = df_temp['Survived']
    df_temp.loc[df_temp['Title'] != col_revise, 'Survived' + col_revise] = np.nan
df_ticket_title = df_temp.groupby(['Ticket'], as_index=True)

# feature extraction: survived peer title
for col_revise in col_revises:
    count = df_ticket_title['Survived' + col_revise].transform('count')
    mean = df_ticket_title['Survived' + col_revise].transform('mean')
    df_data['SurvivedPeer' + col_revise] = (mean * count - df_temp['Survived' + col_revise].astype(float)) / (count - 1)
    df_data['SurvivedPeer' + col_revise] = df_data['SurvivedPeer' + col_revise].astype(float).fillna(-1)

In [None]:
# feature exploration: survived peer and with sex and title
col_number = df_data.select_dtypes(include=['number']).columns.drop(['PassengerId']).tolist()
swarmplot(cat='Survived', num=col_number, data=df_data[(df_data['DataType'] == 'training') & (df_data['WithTitleMaster'] >= 1)], nrows=(len(col_number) - 1) // 5 + 1)
violinplot(cat='Survived', num=col_number, data=df_data[(df_data['DataType'] == 'training') & (df_data['WithTitleMaster'] >= 1)], nrows=(len(col_number) - 1) // 5 + 1)

The exploratory data analysis shows that:
* **SurvivedPeer:** The survival rate of a passenger's ticket peers tends to correlate with the passenger's own survival.
* **SurvivedPeerMaster:** The peer survival rate tends to correlate with survival for passengers with the Master title.
* **SurvivedPeerMiss:** The peer survival rate tends to correlate with survival for passengers with the Miss title.
* **SurvivedPeerMr:** The peer survival rate tends to correlate with survival for passengers with the Mr title.
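
The peer features are leave-one-out means: for each passenger, the survival rate of the other passengers sharing the same ticket. A minimal sketch on hypothetical toy data (the frame `toy` is an illustration, not the Titanic data) is given below; it uses the group sum directly, which equals the mean multiplied by the count.

```python
import pandas as pd

# hypothetical toy data: two shared tickets and one solo ticket
toy = pd.DataFrame({
    'Ticket':   ['A', 'A', 'A', 'B', 'B', 'C'],
    'Survived': [1.0, 0.0, 1.0, 0.0, 0.0, 1.0],
})

grouped = toy.groupby('Ticket')['Survived']
count = grouped.transform('count')  # passengers with a known label on the ticket
total = grouped.transform('sum')    # survivors on the ticket

# leave-one-out mean: remove the passenger's own label before averaging
peer = (total - toy['Survived']) / (count - 1)

# a solo traveller has no peers (0 / 0 gives NaN), so that case is marked with -1
toy['SurvivedPeer'] = peer.fillna(-1)
print(toy)
```

Excluding the passenger's own label keeps the feature from simply echoing the value it is meant to help predict.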

After extracting all features, we need to convert the categorical features to numeric features, a format suitable for feeding into our machine learning models.
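
As a small illustration of this conversion, the sketch below one-hot encodes two categorical columns of a hypothetical toy frame with `pd.get_dummies`. With `drop_first=True`, the first level of each category is dropped, since it is implied when all remaining indicator columns are zero.

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Sex':      ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
    'Age':      [22.0, 38.0, 26.0],
})

# each listed column is replaced by one indicator column per remaining level
encoded = pd.get_dummies(toy, columns=['Sex', 'Embarked'], drop_first=True)
print(encoded.columns.tolist())
```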

In [None]:
# feature extraction: survived
df_data['Survived'] = df_data['Survived'].fillna(-1)

In [None]:
# convert category codes for data dataframe
df_data = pd.get_dummies(df_data, columns=['Pclass', 'Sex', 'Embarked', 'DataType', 'Title', 'TicketString', 'HasTicketString', 'CabinString', 'HasCabin'], drop_first=True)

In [None]:
# convert dtypes object to numeric for data dataframe
col_convert = ['Survived', 'SibSp', 'Parch', 'FamilySize']
df_data[col_convert] = df_data[col_convert].astype(int)

In [None]:
# describe data dataframe
df_data.describe(include='all')

In [None]:
# verify dtypes object
df_data.info()

> **Analyze and identify patterns by visualizations**

Let us generate some correlation plots of the features to see how related one feature is to the next. To do so, we will utilize the Seaborn plotting package which allows us to plot very conveniently as follows.

The Pearson correlation plot shows how the features correlate with one another. If no features are strongly correlated with each other, there is little redundant or superfluous data in our training set. The plot is also useful for determining which features are correlated with the target value.

Pair plots are also useful for observing how the training data is distributed across pairs of features.

A pivot table is another useful way to observe the interaction between features.
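
To make the last two ideas concrete, the sketch below computes a pivot table of survival rates and a single Pearson correlation on a hypothetical toy frame (`toy` is made up for illustration and is not the Titanic data).

```python
import pandas as pd

# hypothetical toy data, for illustration only
toy = pd.DataFrame({
    'Pclass':   [1, 1, 2, 2, 3, 3],
    'Sex':      ['female', 'male', 'female', 'male', 'female', 'male'],
    'Survived': [1, 1, 1, 0, 0, 0],
})

# pivot table: mean survival for every (Pclass, Sex) combination
pivot = toy.pivot_table(index='Pclass', columns='Sex', values='Survived', aggfunc='mean')
print(pivot)

# Pearson correlation between one feature and the target
r = toy['Pclass'].corr(toy['Survived'])
print(round(r, 4))
```

A negative value of `r` here means that a higher class number goes together with lower survival in the toy data.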

In [None]:
# compute pairwise correlation of the numeric columns, excluding NA/null values, and present it as a heat map
corr = df_data[df_data['DataType_training'] == 1].select_dtypes(include=['number', 'bool']).corr()
fig, axes = plt.subplots(figsize=(20, 15))
heatmap = sns.heatmap(corr, annot=True, cmap=plt.cm.RdBu, fmt='.1f', square=True, vmin=-0.8, vmax=0.8)

> **Model, predict and solve the problem**

Now, it is time to feed the features to Machine Learning models.

In [None]:
# select all features to evaluate the feature importances
x = df_data[df_data['DataType_training'] == 1].drop(['PassengerId', 'Survived', 'Name', 'Ticket', 'Cabin', 'Surname', 'FamilySize', 'DataType_training', 'SurvivedPeerCrew', 'SurvivedPeerDr', 'SurvivedPeerMaster', 'SurvivedPeerMiss', 'SurvivedPeerMr', 'SurvivedPeerMrs', 'SurvivedPeerRoyal'], axis=1)
y = df_data.loc[df_data['DataType_training'] == 1, 'Survived']

In [None]:
# set up random forest classifier to find the feature importances
forestclf = RandomForestClassifier(n_estimators=100, random_state=58).fit(x, y)
feat = pd.DataFrame(data=forestclf.feature_importances_, index=x.columns, columns=['FeatureImportances']).sort_values(['FeatureImportances'], ascending=False)

In [None]:
# plot the feature importances
feat.plot(y='FeatureImportances', figsize=(20, 5), kind='bar', logy=True)
plt.axhline(0.005, color="grey")

In [None]:
# list feature importances
model_feat = feat[feat['FeatureImportances'] > 0.005].index

In [None]:
# select the important features
x = df_data.loc[df_data['DataType_training'] == 1, model_feat]
y = df_data.loc[df_data['DataType_training'] == 1, 'Survived']

In [None]:
# perform train-test (validate) split
x_train, x_validate, y_train, y_validate = train_test_split(x, y, random_state=58, test_size=0.25)

In [None]:
# logistic regression model setup
model_logreg = LogisticRegression()

# logistic regression model fit
model_logreg.fit(x_train, y_train)

# logistic regression model prediction
model_logreg_ypredict = model_logreg.predict(x_validate)

# logistic regression model metrics
model_logreg_f1score = f1_score(y_validate, model_logreg_ypredict)
model_logreg_accuracyscore = accuracy_score(y_validate, model_logreg_ypredict)
model_logreg_cvscores = cross_val_score(model_logreg, x, y, cv=20, scoring='accuracy')
print('logistic regression\n  f1 score: %0.4f, accuracy score: %0.4f, cross validation score: %0.4f (+/- %0.4f)' %(model_logreg_f1score, model_logreg_accuracyscore, model_logreg_cvscores.mean(), 2 * model_logreg_cvscores.std()))

In [None]:
# decision tree classifier model setup
model_treeclf = DecisionTreeClassifier(splitter='best', min_samples_split=5)

# decision tree classifier model fit
model_treeclf.fit(x_train, y_train)

# decision tree classifier model prediction
model_treeclf_ypredict = model_treeclf.predict(x_validate)

# decision tree classifier model metrics
model_treeclf_f1score = f1_score(y_validate, model_treeclf_ypredict)
model_treeclf_accuracyscore = accuracy_score(y_validate, model_treeclf_ypredict)
model_treeclf_cvscores = cross_val_score(model_treeclf, x, y, cv=20, scoring='accuracy')
print('decision tree classifier\n  f1 score: %0.4f, accuracy score: %0.4f, cross validation score: %0.4f (+/- %0.4f)' %(model_treeclf_f1score, model_treeclf_accuracyscore, model_treeclf_cvscores.mean(), 2 * model_treeclf_cvscores.std()))

In [None]:
# random forest classifier model setup
model_forestclf = RandomForestClassifier(n_estimators=100, min_samples_split=5, random_state=58)

# random forest classifier model fit
model_forestclf.fit(x_train, y_train)

# random forest classifier model prediction
model_forestclf_ypredict = model_forestclf.predict(x_validate)

# random forest classifier model metrics
model_forestclf_f1score = f1_score(y_validate, model_forestclf_ypredict)
model_forestclf_accuracyscore = accuracy_score(y_validate, model_forestclf_ypredict)
model_forestclf_cvscores = cross_val_score(model_forestclf, x, y, cv=20, scoring='accuracy')
print('random forest classifier\n  f1 score: %0.4f, accuracy score: %0.4f, cross validation score: %0.4f (+/- %0.4f)' %(model_forestclf_f1score, model_forestclf_accuracyscore, model_forestclf_cvscores.mean(), 2 * model_forestclf_cvscores.std()))

In [None]:
# specify the hyperparameter space
params = {'n_estimators': [100],
          'max_depth': [10, 20, None],
          'min_samples_split': [3, 5, 7, 9],
          'random_state': [58],
}

# random forest classifier grid search model setup
model_forestclf_cv = GridSearchCV(model_forestclf, params, cv=5)

# random forest classifier grid search model fit
model_forestclf_cv.fit(x_train, y_train)

# random forest classifier grid search model prediction
model_forestclf_cv_ypredict = model_forestclf_cv.predict(x_validate)

# random forest classifier grid search model metrics
model_forestclf_cv_f1score = f1_score(y_validate, model_forestclf_cv_ypredict)
model_forestclf_cv_accuracyscore = accuracy_score(y_validate, model_forestclf_cv_ypredict)
model_forestclf_cv_cvscores = cross_val_score(model_forestclf_cv, x, y, cv=20, scoring='accuracy')
print('random forest classifier grid search\n  f1 score: %0.4f, accuracy score: %0.4f, cross validation score: %0.4f (+/- %0.4f)' %(model_forestclf_cv_f1score, model_forestclf_cv_accuracyscore, model_forestclf_cv_cvscores.mean(), 2 * model_forestclf_cv_cvscores.std()))
print('  best parameters: %s' %model_forestclf_cv.best_params_)

> **Supply or submit the results**

Our submission to the competition site Kaggle is ready. Any suggestions to improve our score are welcome.

In [None]:
# model selection
final_model = model_forestclf

# prepare testing data and compute the predicted values
x_test = df_data.loc[df_data['DataType_training'] == 0, model_feat]
y_test = pd.DataFrame(final_model.predict(x_test),
                      columns=['Survived'], index=df_data.loc[df_data['DataType_training'] == 0, 'PassengerId'])

In [None]:
# submit the results
out = pd.DataFrame({'PassengerId': y_test.index, 'Survived': y_test['Survived']})
out.to_csv('submission.csv', index=False)