# Titanic: Machine Learning from Disaster - Optimized solution

## Data definition

| Variable | Definition | Key |
| --- | --- | --- |
| survival | Survival | 0 = No, 1 = Yes |
| pclass | Ticket class | 1 = 1st, 2 = 2nd, 3 = 3rd |
| sex | Sex | |
| Age | Age in years | |
| sibsp | # of siblings / spouses aboard the Titanic | |
| parch | # of parents / children aboard the Titanic | |
| ticket | Ticket number | |
| fare | Passenger fare | |
| cabin | Cabin number | |
| embarked | Port of Embarkation | C = Cherbourg, Q = Queenstown, S = Southampton |

Not all of these columns are equally useful for prediction, so we will drop some of them while preprocessing the data.

In [39]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import timeit

from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn import linear_model
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.metrics import accuracy_score
from tpot import TPOTClassifier

In [43]:
# Read the training data from the given CSV file
df = pd.read_csv("./data/train.csv")
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
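
The non-null counts above already hint at which columns are hard to use as they are. As a rough sketch, the snippet below turns the counts reported by `df.info()` into missing-value fractions. It only re-uses numbers printed above; the variable names are illustrative, and a high missing fraction is just one possible reason to drop a column.

```python
import pandas as pd

# Non-null counts copied from the df.info() output above (891 rows in total).
non_null = pd.Series({"Age": 714, "Cabin": 204, "Embarked": 889})
total_rows = 891

# Fraction of rows in which each column is missing.
missing_fraction = (1 - non_null / total_rows).round(3)
print(missing_fraction)
# Cabin is missing in roughly 77% of the rows, Age in roughly 20%.
```

With so few `Cabin` values present, dropping that column is easy to justify, while `Age` is complete enough to be worth imputing.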


In [45]:
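# Keep only the columns used for modelling plus the Survived target;
# copy() avoids chained-assignment warnings when columns are modified later
df = df[['Pclass','Sex','Age','SibSp','Parch','Survived']].copy()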
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Survived    891 non-null int64
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB


In [46]:
# Count survivors and non-survivors by sex
sex_survived = df.groupby(by=['Sex','Survived'])['Survived'].agg(['count']).reset_index()
sex_survived

| | Sex | Survived | count |
| --- | --- | --- | --- |
| 0 | female | 0 | 81 |
| 1 | female | 1 | 233 |
| 2 | male | 0 | 468 |
| 3 | male | 1 | 109 |
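
The counts are easier to compare as rates. The sketch below rebuilds the small table above by hand (so it runs without `train.csv`) and derives the survival rate per sex; the names `counts`, `table` and `survival_rate` are illustrative only.

```python
import pandas as pd

# Counts copied from the groupby output above.
counts = pd.DataFrame({
    "Sex": ["female", "female", "male", "male"],
    "Survived": [0, 1, 0, 1],
    "count": [81, 233, 468, 109],
})

# One row per sex, one column per outcome (0 = died, 1 = survived).
table = counts.pivot(index="Sex", columns="Survived", values="count")

# Survival rate = survivors / all passengers of that sex.
survival_rate = (table[1] / table.sum(axis=1)).round(3)
print(survival_rate)
# female is about 0.742, male about 0.189
```

On the full DataFrame, `df.groupby('Sex')['Survived'].mean()` should give the same rates directly, since `Survived` is coded 0/1.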


In [47]:
# Count survivors and non-survivors by ticket class and sex
Pclass_survived = df.groupby(by=['Pclass','Sex','Survived'])['Pclass'].agg(['count']).reset_index()
Pclass_survived

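| | Pclass | Sex | Survived | count |
| --- | --- | --- | --- | --- |
| 0 | 1 | female | 0 | 3 |
| 1 | 1 | female | 1 | 91 |
| 2 | 1 | male | 0 | 77 |
| 3 | 1 | male | 1 | 45 |
| 4 | 2 | female | 0 | 6 |
| 5 | 2 | female | 1 | 70 |
| 6 | 2 | male | 0 | 91 |
| 7 | 2 | male | 1 | 17 |
| 8 | 3 | female | 0 | 72 |
| 9 | 3 | female | 1 | 72 |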


In [48]:
# Fill the missing Age values by linear interpolation; Age is the only
# remaining column with missing data (714 of 891 values present)
df['Age'] = df['Age'].interpolate()
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Survived    891 non-null int64
dtypes: float64(1), int64(4), object(1)
memory usage: 41.9+ KB
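
Linear interpolation fills each gap from the neighbouring rows, so the result depends on row order. The toy series below (made-up ages, not taken from the dataset) contrasts it with a median fill, which is a common alternative when row order carries no information about age. This is a sketch for comparison, not a claim about which choice scores better here.

```python
import numpy as np
import pandas as pd

# Hypothetical ages with gaps, standing in for the Age column.
ages = pd.Series([22.0, np.nan, 26.0, np.nan, np.nan, 54.0])

# Linear interpolation: each gap is filled on a straight line between neighbours.
linear = ages.interpolate()

# Median fill: every gap receives the same value, independent of row order.
median_filled = ages.fillna(ages.median())

print(pd.DataFrame({"original": ages, "linear": linear, "median": median_filled}))
```

In the linear column the second value becomes 24.0 (halfway between 22 and 26), while the median column uses 26.0 for every gap.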


In [49]:
# Label-encode the Sex column (female -> 0, male -> 1);
# a plain dictionary mapping would work as well
number = LabelEncoder()
df['Sex'] = number.fit_transform(df['Sex'].astype('str'))
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 6 columns):
Pclass      891 non-null int64
Sex         891 non-null int64
Age         891 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Survived    891 non-null int64
dtypes: float64(1), int64(5)
memory usage: 41.9 KB
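
The mapping alternative mentioned in the cell comment can be sketched as follows. `LabelEncoder` assigns integer codes in sorted order of the class labels, so `female` becomes 0 and `male` becomes 1; an explicit dictionary makes that choice visible. The four-element series is a stand-in for the real column.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Small stand-in for the Sex column.
sex = pd.Series(["male", "female", "female", "male"])

# LabelEncoder sorts the labels, so female -> 0 and male -> 1.
encoded = LabelEncoder().fit_transform(sex)

# The same encoding written as an explicit mapping.
mapped = sex.map({"female": 0, "male": 1})

print(list(encoded), list(mapped))
```

A value missing from the dictionary would become `NaN` with `map`, which can be a useful early warning about unexpected categories.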


In [51]:
# Separate the features from the target (the last column, Survived)
features = df.iloc[:, :-1].values
target = df.iloc[:, -1].values

# Split the data into training (70%) and test (30%) sets
X_train, X_test, y_train, y_test = train_test_split(features, target, test_size=0.3, random_state=0)
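
To make the split sizes concrete, the sketch below applies the same `test_size=0.3` to placeholder arrays with the dataset's 891 rows and 5 feature columns. The arrays hold dummy values, so only the shapes are meaningful.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as the prepared Titanic features.
features = np.zeros((891, 5))
labels = np.arange(891) % 2

X_tr, X_te, y_tr, y_te = train_test_split(
    features, labels, test_size=0.3, random_state=0
)

# The test set size is rounded up: ceil(0.3 * 891) = 268, leaving 623 for training.
print(X_tr.shape, X_te.shape)
```

Passing `stratify=labels` is an option worth considering when the classes are imbalanced, since it keeps the class proportions equal in both parts.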

In [52]:
# Configure TPOT's genetic search, which looks for a suitable pipeline for the problem
tpot = TPOTClassifier(verbosity=3, 
                      scoring="balanced_accuracy", 
                      random_state=23, 
                      periodic_checkpoint_folder="tpot_mnst1.txt", 
                      n_jobs=-1, 
                      generations=10, 
                      population_size=100)
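
These settings determine how much work the search does. As I understand the TPOT documentation, it evaluates `population_size + generations * offspring_size` pipelines, where `offspring_size` defaults to `population_size`, and each pipeline is scored with 5-fold cross-validation by default. Treating those defaults as assumptions, the budget works out as follows.

```python
population_size = 100
generations = 10

# Assumption: offspring_size is left at its default, which equals population_size.
offspring_size = population_size

# Assumption: cv is left at its default of 5 folds.
cv_folds = 5

total_pipelines = population_size + generations * offspring_size
total_fits = total_pipelines * cv_folds

print(total_pipelines, total_fits)  # 1100 5500
```

This is an upper estimate of the model fits per call to `fit`; pipelines that TPOT rejects or has already evaluated do not cost a full cross-validation.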

In [53]:
# Initialize variables for comparison
times = []
winning_pipes = []
scores = []

# Run three iterations and time each one
for _ in range(3):
    start_time = timeit.default_timer()
    tpot.fit(X_train, y_train)
    elapsed = timeit.default_timer() - start_time
    times.append(elapsed)  # elapsed time in seconds
    winning_pipes.append(tpot.fitted_pipeline_)
    # Scored with the configured metric (balanced accuracy)
    scores.append(tpot.score(X_test, y_test))
    # Overwritten on every iteration, so the file holds the last run's pipeline
    tpot.export('tpot_titanic_pipeline.py')

30 operators have been imported by TPOT.



_pre_test decorator: _mate_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=1 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=2 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=3 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_op

Generation 5 - Current Pareto front scores:
-1	0.7956758780067051	DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=5, DecisionTreeClassifier__min_samples_leaf=1, DecisionTreeClassifier__min_samples_split=2)
-2	0.7991763092796926	RandomForestClassifier(Normalizer(input_matrix, Normalizer__norm=max), RandomForestClassifier__bootstrap=False, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.45, RandomForestClassifier__min_samples_leaf=6, RandomForestClassifier__min_samples_split=13, RandomForestClassifier__n_estimators=100)

_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 78.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination

_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 77.
Generation 10 - Current Pareto front scores:
-1	0.7975565944080982	GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.1, GradientBoostingClassifier__max_depth=5, GradientBoostingClassifier__max_features=0.750000000000



30 operators have been imported by TPOT.



_pre_test decorator: _mate_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=1 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=2 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=3 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_op

Generation 5 - Current Pareto front scores:
-1	0.7956758780067051	DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=5, DecisionTreeClassifier__min_samples_leaf=1, DecisionTreeClassifier__min_samples_split=2)
-2	0.7991763092796926	RandomForestClassifier(Normalizer(input_matrix, Normalizer__norm=max), RandomForestClassifier__bootstrap=False, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.45, RandomForestClassifier__min_samples_leaf=6, RandomForestClassifier__min_samples_split=13, RandomForestClassifier__n_estimators=100)

_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 78.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination

_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 77.
Generation 10 - Current Pareto front scores:
-1	0.7975565944080982	GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.1, GradientBoostingClassifier__max_depth=5, GradientBoostingClassifier__max_features=0.750000000000



30 operators have been imported by TPOT.



_pre_test decorator: _mate_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=1 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=2 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _mate_operator: num_test=3 Unsupported set of arguments: The combination of penalty='l1' and loss='logistic_regression' are not supported when dual=True, Parameters: penalty='l1', loss='logistic_regression', dual=True.
_pre_test decorator: _random_mutation_op

Generation 5 - Current Pareto front scores:
-1	0.7956758780067051	DecisionTreeClassifier(input_matrix, DecisionTreeClassifier__criterion=gini, DecisionTreeClassifier__max_depth=5, DecisionTreeClassifier__min_samples_leaf=1, DecisionTreeClassifier__min_samples_split=2)
-2	0.7991763092796926	RandomForestClassifier(Normalizer(input_matrix, Normalizer__norm=max), RandomForestClassifier__bootstrap=False, RandomForestClassifier__criterion=entropy, RandomForestClassifier__max_features=0.45, RandomForestClassifier__min_samples_leaf=6, RandomForestClassifier__min_samples_split=13, RandomForestClassifier__n_estimators=100)

_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 78.
_pre_test decorator: _random_mutation_operator: num_test=0 manhattan was provided as affinity. Ward can only work with euclidean distances..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination

_pre_test decorator: _random_mutation_operator: num_test=0 Found array with 0 feature(s) (shape=(50, 0)) while a minimum of 1 is required..
_pre_test decorator: _random_mutation_operator: num_test=0 Unsupported set of arguments: The combination of penalty='l2' and loss='hinge' are not supported when dual=False, Parameters: penalty='l2', loss='hinge', dual=False.
_pre_test decorator: _random_mutation_operator: num_test=0 Expected n_neighbors <= n_samples,  but n_samples = 50, n_neighbors = 77.
Generation 10 - Current Pareto front scores:
-1	0.7975565944080982	GradientBoostingClassifier(input_matrix, GradientBoostingClassifier__learning_rate=0.1, GradientBoostingClassifier__max_depth=5, GradientBoostingClassifier__max_features=0.7500000000000001, GradientBoostingClassifier__min_samples_leaf=18, GradientBoostingClassifier__min_samples_split=4, GradientBoostingClassifier__n_estimators=100, GradientBoostingClassifier__subsample=0.8)
-2	0.7997233343098005	GradientBoostingClassifier(MinMaxSca



In [54]:
# Convert the elapsed times from seconds to minutes
times = [seconds / 60 for seconds in times]
print('Times:', times)
print('Scores:', scores)
print('Winning pipelines:', winning_pipes)

Times: [6.766525390766644, 6.819817945249997, 6.930600284583306]
Scores: [0.7904761904761904, 0.7904761904761904, 0.7904761904761904]
Winning pipelines: [Pipeline(memory=None,
         steps=[('featureunion',
                 FeatureUnion(n_jobs=None,
                              transformer_list=[('functiontransformer',
                                                 FunctionTransformer(accept_sparse=False,
                                                                     check_inverse=True,
                                                                     func=<function copy at 0x7fa22bb95158>,
                                                                     inv_kw_args=None,
                                                                     inverse_func=None,
                                                                     kw_args=None,
                                                                     pass_y='deprecated',
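
The three scores printed above are identical. A plausible explanation is that the same `TPOTClassifier`, with its fixed `random_state=23`, is refitted on the same data each time, so every run repeats the same search. The sketch below shows the effect with a much cheaper stand-in (a small random forest on synthetic data, both assumptions made purely for illustration): a fixed seed reproduces the score exactly, and only a changing seed can produce run-to-run variation.

```python
import timeit

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data; not the Titanic dataset.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

times, same_seed_scores, varied_seed_scores = [], [], []
for seed in range(3):
    # Fixed seed: every iteration trains exactly the same model.
    start = timeit.default_timer()
    fixed = RandomForestClassifier(n_estimators=20, random_state=23).fit(X_tr, y_tr)
    times.append(timeit.default_timer() - start)
    same_seed_scores.append(fixed.score(X_te, y_te))

    # Varying seed: each iteration is a genuinely different run.
    varied = RandomForestClassifier(n_estimators=20, random_state=seed).fit(X_tr, y_tr)
    varied_seed_scores.append(varied.score(X_te, y_te))

print("same seed:  ", same_seed_scores)
print("varied seed:", varied_seed_scores)
```

If the goal of repeating the search is to see how stable the result is, creating a new optimizer with a different `random_state` in each iteration would be one way to get independent runs.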
                                        

### Here we can observe the best-fitted machine learning pipeline for the given dataset

Let's implement this pipeline. The code above writes the best pipeline's implementation to the file `tpot_titanic_pipeline.py`. We use that pipeline directly in the section below and evaluate it on the held-out test data.

In [57]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import make_pipeline, make_union
from tpot.builtins import StackingEstimator
from xgboost import XGBClassifier
from sklearn.preprocessing import FunctionTransformer
from copy import copy

In [75]:
exported_pipeline = make_pipeline(
    make_union(
        FunctionTransformer(copy, validate=False),
        StackingEstimator(estimator=make_pipeline(
            StackingEstimator(estimator=XGBClassifier(
                learning_rate=0.1, max_depth=9, min_child_weight=15,
                n_estimators=100, nthread=1, subsample=0.35000000000000003)),
            BernoulliNB(alpha=0.001, fit_prior=False)
        ))
    ),
    RandomForestClassifier(bootstrap=True, criterion="entropy", max_features=0.1, min_samples_leaf=6, min_samples_split=18, n_estimators=100)
)
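
The pipeline relies on TPOT's `StackingEstimator`, which feeds the predictions of an inner model to the next step as extra features. The sketch below imitates that idea with plain scikit-learn and NumPy on synthetic data. It does not use TPOT itself, and the exact column order is an assumption on my part.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in data with 5 features, like the prepared Titanic frame.
X, y = make_classification(n_samples=100, n_features=5, random_state=0)

# Inner model whose outputs become additional features.
base = BernoulliNB().fit(X, y)

# Stack: predicted label, class probabilities, then the original features.
stacked = np.hstack([
    base.predict(X).reshape(-1, 1),
    base.predict_proba(X),
    X,
])

print(X.shape, stacked.shape)  # (100, 5) (100, 8)
```

In the exported pipeline, `make_union` with `FunctionTransformer(copy)` additionally places an unchanged copy of the input next to the stacked output, so the final random forest sees both.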

In [76]:
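# Fit the exported pipeline on the training split
exported_pipeline.fit(X_train, y_train)
# Predict labels for the held-out test split
results = exported_pipeline.predict(X_test)
# Mean accuracy on the test split
exported_pipeline.score(X_test, y_test)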

0.8059701492537313
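
This value is not directly comparable with the TPOT scores printed earlier. `tpot.score` uses the metric the optimizer was configured with (`balanced_accuracy`), whereas a scikit-learn pipeline's `score` reports the plain accuracy of its final classifier. The toy labels below (invented for illustration) show how the two metrics can differ on the same predictions.

```python
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Invented labels: six negatives, four positives.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
# Every negative is predicted correctly, but only half of the positives are.
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_pred)           # 8 correct out of 10
bal = balanced_accuracy_score(y_true, y_pred)  # mean of per-class recall: (1.0 + 0.5) / 2

print(acc, bal)  # 0.8 0.75
```

For a like-for-like comparison, `balanced_accuracy_score(y_test, results)` could be computed for the exported pipeline as well.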