## 4. Prepare the data

We will now prepare the data for machine learning modelling.

Let's start by separating the predictors and labels since we do not want to apply the same transformations to both.

In [20]:
titanic_pred = titanic_train.drop("Survived", axis=1)
titanic_labels = titanic_train["Survived"].copy()

In [21]:
titanic_pred.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 10 columns):
Pclass      891 non-null int64
Name        891 non-null object
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Ticket      891 non-null object
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(3), object(5)
memory usage: 76.6+ KB


Two attributes seem meaningless for a machine learning model: `PassengerId` and `Name`. `PassengerId` does not appear among the columns listed above (the index runs from 1 to 891), so only `Name` remains to be dropped. As we saw earlier when analysing the unique values of the text attributes, `Ticket` also seems meaningless, so we drop it as well.

In [22]:
titanic_pred = titanic_pred.drop(["Name", "Ticket"], axis=1)

In [23]:
titanic_pred.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 891 entries, 1 to 891
Data columns (total 8 columns):
Pclass      891 non-null int64
Sex         891 non-null object
Age         714 non-null float64
SibSp       891 non-null int64
Parch       891 non-null int64
Fare        891 non-null float64
Cabin       204 non-null object
Embarked    889 non-null object
dtypes: float64(2), int64(3), object(3)
memory usage: 62.6+ KB


Now we need to deal with the missing values we found in the dataset.

Again, most of the missing values are in the `Cabin` column: only 204 of 891 entries are present. The column does not seem very informative and the data is rather scarce, which suggests dropping this feature from the training set.


In [24]:
titanic_pred = titanic_pred.drop("Cabin", axis=1)

The `Embarked` column is missing only two values, so we can simply drop these two rows from both the predictors and the labels.

In [25]:
indexes_to_drop = titanic_pred[titanic_pred["Embarked"].isnull()].index.values
titanic_pred = titanic_pred.drop(indexes_to_drop, axis=0)
titanic_labels = titanic_labels.drop(indexes_to_drop, axis=0)
titanic_pred = titanic_pred.reset_index(drop=True)
titanic_labels = titanic_labels.reset_index(drop=True)
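
The same row removal can also be written with a boolean mask, which keeps predictors and labels aligned without collecting index values first. A minimal sketch on a made-up four-row frame (the toy data and the names `pred`, `labels` and `keep` are illustrative assumptions, not part of the notebook):

```python
import numpy as np
import pandas as pd

# Toy stand-ins for titanic_pred / titanic_labels (made-up values).
pred = pd.DataFrame({
    "Age": [22.0, 38.0, 26.0, 35.0],
    "Embarked": ["S", np.nan, "Q", "C"],
})
labels = pd.Series([0, 1, 1, 0], name="Survived")

# True where Embarked is present; the same mask filters both objects.
keep = pred["Embarked"].notnull()
pred = pred[keep].reset_index(drop=True)
labels = labels[keep].reset_index(drop=True)

print(pred.shape, labels.tolist())
```

Because one mask is applied to both objects, the predictors and labels cannot drift out of step.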

We also have many missing values in the `Age` column, but this column seems crucial for our analysis. As a start, we fill the missing values with the median. The other option is to discard passengers with a missing `Age`, but that would remove 177 records (891 minus the 714 non-null entries in the `info()` output above), which we would rather not do.

The median can only be computed on numerical attributes, so we split the data into two parts: one containing only the numerical attributes and one containing only the categorical ones.

In [26]:
titanic_pred_num = titanic_pred.drop(["Sex", "Embarked"], axis=1)
titanic_pred_cat = titanic_pred[["Sex", "Embarked"]]

We now have only 2 categorical attributes, since we dropped `Name` (meaningless), `Cabin` (too many missing values) and `Ticket` (meaningless).

Let's now use Scikit-Learn's `Imputer` class to fill the missing `Age` values with the median. The `StandardScaler` step is left commented out in the pipeline below, so the values are not standardized at this stage.

In [27]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import Imputer 

num_pipeline = Pipeline([
    ('imputer', Imputer(strategy="median")),
#     ('std_scaler', StandardScaler()),
])
titanic_pred_num_prepared = num_pipeline.fit_transform(titanic_pred_num)

In [28]:
titanic_pred_num_prepared

array([[  3.    ,  22.    ,   1.    ,   0.    ,   7.25  ],
       [  1.    ,  38.    ,   1.    ,   0.    ,  71.2833],
       [  3.    ,  26.    ,   0.    ,   0.    ,   7.925 ],
       ..., 
       [  3.    ,  28.    ,   1.    ,   2.    ,  23.45  ],
       [  1.    ,  26.    ,   0.    ,   0.    ,  30.    ],
       [  3.    ,  32.    ,   0.    ,   0.    ,   7.75  ]])

We have all our numerical attributes ready to go.
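
Note that `sklearn.preprocessing.Imputer` was removed in scikit-learn 0.22; `sklearn.impute.SimpleImputer` offers the same median strategy. A minimal sketch of the equivalent pipeline on a made-up array, assuming a recent scikit-learn version (the array `X` is illustrative, not the Titanic data):

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Made-up numeric data with one missing Age-like value in column 1.
X = np.array([
    [3.0, 22.0, 7.25],
    [1.0, np.nan, 71.28],
    [3.0, 26.0, 7.93],
    [1.0, 35.0, 53.10],
])

num_pipeline = Pipeline([
    ("imputer", SimpleImputer(strategy="median")),
])
X_prepared = num_pipeline.fit_transform(X)

# The missing entry is replaced by the median of the observed values
# in its column: median(22, 26, 35) = 26.
print(X_prepared[1, 1])
```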

Let's now take care of our categorical attribute values.

In [29]:
titanic_pred_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 2 columns):
Sex         889 non-null object
Embarked    889 non-null object
dtypes: object(2)
memory usage: 14.0+ KB


Since most machine learning algorithms do not handle text data well, let's convert our text labels to numbers using one-hot encoding. Scikit-Learn's `LabelBinarizer` class goes from text categories to binary indicator columns in a single step, so we fit one binarizer per attribute. Note that for a two-valued attribute such as `Sex` it produces a single 0/1 column rather than two.

In [30]:
from sklearn.preprocessing import LabelBinarizer
encoder = LabelBinarizer()
titanic_pred_cat_sex_1hot = encoder.fit_transform(titanic_pred_cat["Sex"])
encoder = LabelBinarizer()
titanic_pred_cat_embarked_1hot = encoder.fit_transform(titanic_pred_cat["Embarked"])
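
To make the output shapes concrete, here is a small sketch of how `LabelBinarizer` behaves on two-class and three-class inputs. The category values below are made up to mirror `Sex` and `Embarked`:

```python
from sklearn.preprocessing import LabelBinarizer

# Made-up category values mirroring the two attributes.
sex = ["male", "female", "female", "male"]
embarked = ["S", "C", "Q", "S"]

sex_1hot = LabelBinarizer().fit_transform(sex)
embarked_encoder = LabelBinarizer()
embarked_1hot = embarked_encoder.fit_transform(embarked)

print(sex_1hot.shape)             # two classes -> one 0/1 column
print(embarked_1hot.shape)        # three classes -> three columns
print(embarked_encoder.classes_)  # column order follows the sorted labels
```

One column for `Sex` plus three for `Embarked`, added to the five numerical attributes, gives nine features in total.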

We now have all our data prepared. The last thing is to combine it again into one data set.

In [31]:
import numpy as np

titanic_pred_prepared = np.concatenate([titanic_pred_num_prepared, titanic_pred_cat_sex_1hot, titanic_pred_cat_embarked_1hot], axis=1)

## 5. Short-list promising models

Our data is prepared so we are ready to fit multiple quick models to check how they behave. Our measure of success is classification accuracy.

In [32]:
from sklearn.linear_model import LogisticRegression

In [33]:
log_clf = LogisticRegression()
log_clf.fit(titanic_pred_prepared, titanic_labels)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
          intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
          penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
          verbose=0, warm_start=False)

We will test our model using cross-validation.

In [34]:
from sklearn.model_selection import cross_val_score
cross_val_score(log_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.79124579,  0.79054054,  0.78378378])
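
Each score is the fraction of correctly classified passengers in one held-out fold. As a sanity check, the three values above are consistent with folds of 297, 296 and 296 passengers (889 in total). The fold sizes and the counts of correct predictions below are inferred from the printed scores, not taken from the notebook:

```python
# Hypothetical (correct, fold_size) pairs inferred from the scores.
folds = [(235, 297), (234, 296), (232, 296)]

scores = [round(correct / size, 8) for correct, size in folds]
print(scores)

# The fold sizes add up to the number of training rows.
total = sum(size for _, size in folds)
print(total)
```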

Let's try a Stochastic Gradient Descent classifier using Scikit-Learn's `SGDClassifier` class.

In [35]:
from sklearn.linear_model import SGDClassifier

sgd_clf = SGDClassifier(random_state=42)
sgd_clf.fit(titanic_pred_prepared, titanic_labels)

SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', n_iter=5, n_jobs=1,
       penalty='l2', power_t=0.5, random_state=42, shuffle=True, verbose=0,
       warm_start=False)

Let's now use cross-validation to validate our model.

In [36]:
from sklearn.model_selection import cross_val_score
cross_val_score(sgd_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.62626263,  0.7027027 ,  0.72635135])

The next model is a K-Nearest Neighbors classifier.

In [37]:
from sklearn.neighbors import KNeighborsClassifier
knn_clf = KNeighborsClassifier()
knn_clf.fit(titanic_pred_prepared, titanic_labels)

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
           metric_params=None, n_jobs=1, n_neighbors=5, p=2,
           weights='uniform')

In [38]:
cross_val_score(knn_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.68686869,  0.71283784,  0.71621622])

So far the best results come from the Logistic Regression model. Let's try it with polynomial features of different degrees.

In [39]:
from sklearn.preprocessing import PolynomialFeatures
polynomial_features_2 = PolynomialFeatures(degree=2, include_bias=False)
polynomial_features_4 = PolynomialFeatures(degree=4, include_bias=False)
polynomial_features_8 = PolynomialFeatures(degree=8, include_bias=False)

In [40]:
titanic_pred_prepared_poly_2 = polynomial_features_2.fit_transform(titanic_pred_prepared)
titanic_pred_prepared_poly_4 = polynomial_features_4.fit_transform(titanic_pred_prepared)
titanic_pred_prepared_poly_8 = polynomial_features_8.fit_transform(titanic_pred_prepared)

In [41]:
titanic_pred_prepared.shape

(889, 9)

In [42]:
titanic_pred_prepared_poly_2.shape

(889, 54)

In [43]:
titanic_pred_prepared_poly_4.shape

(889, 714)

In [44]:
titanic_pred_prepared_poly_8.shape

(889, 24309)
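
The column counts above follow a closed formula: with `include_bias=False`, the number of polynomial terms of degree at most d in n input features is C(n + d, d) - 1. The sketch below reproduces the counts with the standard library only (the helper name `n_poly_features` is an illustrative assumption):

```python
from math import comb

def n_poly_features(n_features: int, degree: int) -> int:
    """Count monomials of total degree 1..degree in n_features variables."""
    # comb(n + d, d) counts all monomials of degree <= d, including the
    # constant term; subtract 1 because include_bias=False drops it.
    return comb(n_features + degree, degree) - 1

for degree in (2, 4, 8):
    print(degree, n_poly_features(9, degree))
```

With 24309 features and only 889 training rows, the degree-8 model has far more parameters than samples, which is one plausible reason for weaker cross-validation accuracy at high degrees.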

In [45]:
log_clf_2 = LogisticRegression()
log_clf_2.fit(titanic_pred_prepared_poly_2, titanic_labels)
cross_val_score(log_clf_2, titanic_pred_prepared_poly_2, titanic_labels, cv=3, scoring="accuracy")

array([ 0.80808081,  0.81418919,  0.81418919])

In [46]:
log_clf_4 = LogisticRegression()
log_clf_4.fit(titanic_pred_prepared_poly_4, titanic_labels)
cross_val_score(log_clf_4, titanic_pred_prepared_poly_4, titanic_labels, cv=3, scoring="accuracy")

array([ 0.7037037 ,  0.72635135,  0.71283784])

In [47]:
log_clf_8 = LogisticRegression()
log_clf_8.fit(titanic_pred_prepared_poly_8, titanic_labels)
cross_val_score(log_clf_8, titanic_pred_prepared_poly_8, titanic_labels, cv=3, scoring="accuracy")

array([ 0.63973064,  0.61486486,  0.67567568])

So using degree-2 polynomial features with the logistic regression model improves the results a bit. Higher-degree polynomials lower the accuracy.

TODO: Check why higher-degree polynomials lower the cross-validation accuracy. Visualize what the decision boundary looks like.

Let's try another - more complex - model. Now we will use Random Forest classifier.

In [48]:
from sklearn.ensemble import RandomForestClassifier

forest_clf = RandomForestClassifier()
forest_clf.fit(titanic_pred_prepared, titanic_labels)
cross_val_score(forest_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.76430976,  0.79391892,  0.80067568])

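The Random Forest does not beat our best result so far: its fold accuracies (0.76 to 0.80) are close to those of plain logistic regression and below those of logistic regression with degree-2 polynomial features.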

In [49]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("linear_svc", LinearSVC(C=1, loss="hinge")),
    ])
svm_clf.fit(titanic_pred_prepared, titanic_labels)


Pipeline(steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('linear_svc', LinearSVC(C=1, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='hinge', max_iter=1000, multi_class='ovr',
     penalty='l2', random_state=None, tol=0.0001, verbose=0))])

In [50]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps are passed as a list, matching the other pipelines in this notebook.
poly_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="poly", degree=6, coef0=1, C=5))
])
poly_kernel_svm_clf.fit(titanic_pred_prepared, titanic_labels)
cross_val_score(poly_kernel_svm_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.75420875,  0.80405405,  0.78716216])

In [51]:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Steps are passed as a list, matching the other pipelines in this notebook.
rbf_kernel_svm_clf = Pipeline([
        ("scaler", StandardScaler()),
        ("svm_clf", SVC(kernel="rbf", gamma=5, C=0.001))
])
rbf_kernel_svm_clf.fit(titanic_pred_prepared, titanic_labels)
cross_val_score(rbf_kernel_svm_clf, titanic_pred_prepared, titanic_labels, cv=3, scoring="accuracy")

array([ 0.61616162,  0.61824324,  0.61824324])
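
To compare the models at a glance, the per-fold accuracies printed above can be averaged. A small sketch using only the standard library (the dictionary keys are descriptive labels chosen here, not names from the notebook):

```python
from statistics import mean

# Per-fold accuracies copied from the cross-validation outputs above.
cv_scores = {
    "logistic regression": [0.79124579, 0.79054054, 0.78378378],
    "SGD": [0.62626263, 0.7027027, 0.72635135],
    "k-nearest neighbors": [0.68686869, 0.71283784, 0.71621622],
    "logistic regression, degree 2": [0.80808081, 0.81418919, 0.81418919],
    "logistic regression, degree 4": [0.7037037, 0.72635135, 0.71283784],
    "logistic regression, degree 8": [0.63973064, 0.61486486, 0.67567568],
    "random forest": [0.76430976, 0.79391892, 0.80067568],
    "SVM, polynomial kernel": [0.75420875, 0.80405405, 0.78716216],
    "SVM, RBF kernel": [0.61616162, 0.61824324, 0.61824324],
}

# Rank models by mean accuracy, best first.
ranking = sorted(cv_scores, key=lambda name: mean(cv_scores[name]), reverse=True)
for name in ranking:
    print(f"{name:32s} {mean(cv_scores[name]):.3f}")
```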

## 6. Fine-tune models

We have two promising models. Let's try to improve their performance by fine-tuning their parameters.
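
The notebook does not record the tuning step itself. One common way to do it is an exhaustive grid search with cross-validation. The sketch below uses synthetic data from `make_classification` and an arbitrary, illustrative parameter grid, so neither the data nor the chosen values come from the original analysis:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for titanic_pred_prepared / titanic_labels.
X, y = make_classification(n_samples=200, n_features=9, random_state=42)

# Illustrative grid; the values are assumptions, not tuned settings.
param_grid = {
    "n_estimators": [10, 30],
    "max_depth": [3, 5],
}

grid_search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    param_grid,
    cv=3,
    scoring="accuracy",
)
grid_search.fit(X, y)

print(grid_search.best_params_)
print(round(grid_search.best_score_, 3))
```

After fitting, `grid_search.best_estimator_` holds the model refitted on the full data with the best parameter combination.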

## 7. Submit solution

In [52]:
titanic_test = pd.read_csv("datasets/test.csv")

In [53]:
titanic_test = titanic_test.drop(["PassengerId", "Name", "Ticket"], axis=1)
titanic_test = titanic_test.drop("Cabin", axis=1)

In [54]:
titanic_test_num = titanic_test.drop(["Sex", "Embarked"], axis=1)
titanic_test_cat = titanic_test[["Sex", "Embarked"]]

In [55]:
titanic_test_num.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 5 columns):
Pclass    418 non-null int64
Age       332 non-null float64
SibSp     418 non-null int64
Parch     418 non-null int64
Fare      417 non-null float64
dtypes: float64(2), int64(3)
memory usage: 16.4 KB


In [56]:
imputer = Imputer(strategy="median")
imputer.fit(titanic_test_num)
titanic_test_num_prepared = imputer.transform(titanic_test_num)
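
The cell above fits a new imputer on the test set, so the fill-in medians differ from those used for training. A common alternative is to reuse the statistics learned from the training data. A minimal sketch with made-up arrays, assuming scikit-learn 0.20 or later where `SimpleImputer` is available:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Made-up (Age, Fare) columns for a training and a test set.
train_num = np.array([[22.0, 7.25], [38.0, 71.28], [26.0, 7.93]])
test_num = np.array([[np.nan, 8.05], [47.0, np.nan]])

imputer = SimpleImputer(strategy="median")
imputer.fit(train_num)  # medians come from the training data only
test_num_prepared = imputer.transform(test_num)

print(imputer.statistics_)  # per-column training medians
print(test_num_prepared)
```

In this notebook, that would correspond to calling `num_pipeline.transform(titanic_test_num)` rather than fitting a second imputer.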

In [57]:
titanic_test_cat.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 2 columns):
Sex         418 non-null object
Embarked    418 non-null object
dtypes: object(2)
memory usage: 6.6+ KB


In [58]:
encoder = LabelBinarizer()
titanic_test_cat_sex_1hot = encoder.fit_transform(titanic_test_cat["Sex"])
encoder = LabelBinarizer()
titanic_test_cat_embarked_1hot = encoder.fit_transform(titanic_test_cat["Embarked"])

In [59]:
titanic_test_prepared = np.concatenate([titanic_test_num_prepared, titanic_test_cat_sex_1hot, titanic_test_cat_embarked_1hot], axis=1)

In [60]:
final_predictions = forest_clf.predict(titanic_test_prepared)

In [61]:
final_df = pd.DataFrame({'PassengerId': range(892, len(final_predictions)+892), 'Survived': final_predictions})
final_df.to_csv("datasets/submission.csv", index=False)
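
The hard-coded starting id of 892 works only if the test file keeps its original row order. A slightly more defensive sketch takes the ids from the file itself before the column is dropped. The two-row CSV text and the placeholder predictions below are made up for illustration:

```python
import io
import pandas as pd

# Made-up two-row stand-in for datasets/test.csv.
csv_text = "PassengerId,Pclass,Sex\n892,3,male\n893,3,female\n"
test_raw = pd.read_csv(io.StringIO(csv_text))

# Keep the ids before dropping the column from the feature frame.
passenger_ids = test_raw["PassengerId"]
predictions = [0, 1]  # placeholder for the classifier's predictions

submission = pd.DataFrame({"PassengerId": passenger_ids, "Survived": predictions})
print(submission.to_csv(index=False))
```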

Your submission scored 0.73205 (position 8732).