# Titanic - Machine Learning from Disaster
* [Data](https://www.kaggle.com/competitions/titanic)
   - Will Cukierski, Titanic - Machine Learning from Disaster, Kaggle, 2012
 
## Problem Statement and Objective
Use machine learning to create a model that predicts which passengers survived the Titanic shipwreck.
This is a binary classification problem.


## Libraries

In [1]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, StandardScaler
from sklearn.metrics import confusion_matrix, accuracy_score
from sklearn.impute import SimpleImputer
from sklearn.svm import SVC
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV

## Data

| Label       | Meaning                                    | Values                                                                                    |
| ----------- | ------------------------------------------ | ----------------------------------------------------------------------------------------- |
| PassengerId | Sequential identifier                      | irrelevant for training                                                                   |
| Survived    | Survival                                   | 0 = No, 1 = Yes                                                                           |
| Pclass      | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd; proxy for social status                                        |
| Name        | Last, Title First Middle                   | quoted string; title can give marital status for women, names can give hints to ethnicity |
| Sex         | Gender                                     | "male" or "female"                                                                        |
| Age         | Age in years                               | may be non-integer; may be blank (cannot assume 0 in that case - use average?)            |
| SibSp       | # of siblings/spouses aboard the Titanic   | integer, may be zero                                                                      |
| Parch       | # of parents/children aboard the Titanic   | integer, may be zero                                                                      |
| Ticket      | Ticket number                              | inconsistently formatted; not very useful unless it can be parsed                         |
| Fare        | Passenger fare                             | may also reflect social status                                                            |
| Cabin       | Cabin number                               | may be useful if parsed into deck, etc.; some passengers have multiple cabins, or none    |
| Embarked    | Port of embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton; may also reflect social status            |

In [2]:
dataset = pd.read_csv('train.csv')
dataset

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.2500,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.9250,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1000,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.0500,,S
...,...,...,...,...,...,...,...,...,...,...,...,...
886,887,0,2,"Montvila, Rev. Juozas",male,27.0,0,0,211536,13.0000,,S
887,888,1,1,"Graham, Miss. Margaret Edith",female,19.0,0,0,112053,30.0000,B42,S
888,889,0,3,"Johnston, Miss. Catherine Helen ""Carrie""",female,,1,2,W./C. 6607,23.4500,,S
889,890,1,1,"Behr, Mr. Karl Howell",male,26.0,0,0,111369,30.0000,C148,C


Clean up the data and extract the useful columns. This step does not impute missing values; that happens next. It is written as a function so that it can later be applied to `test.csv`.

In [3]:
def clean_data(dataset):
    # Select features by position: Pclass, Age, SibSp, Parch, Fare, Sex, Embarked
    X = dataset.iloc[:, [2, 5, 6, 7, 9, 4, 11]].values
    # To do: Cabin
    # One-hot encode Pclass and Embarked, ordinal-encode Sex; the numeric columns pass through unchanged
    ct = ColumnTransformer(
        transformers=[('encodePClass', OneHotEncoder(), [0]),
                      ('encodeSex', OrdinalEncoder(), [5]),
                      ('encodeEmbarked', OneHotEncoder(), [6])],
        remainder='passthrough')
    X = np.array(ct.fit_transform(X))
    y = dataset.iloc[:, 1].values  # Survived
    return (X, y)
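
Because `clean_data` selects columns by position and reads the label from column 1, it would need adjusting before it could be applied to Kaggle's `test.csv`, which has no `Survived` column. One possible alternative is to select columns by name and return no labels when they are absent. The following is a minimal sketch under that assumption; the helper name `select_columns` and the toy rows are hypothetical and not part of the original notebook.

```python
import pandas as pd

FEATURES = ["Pclass", "Age", "SibSp", "Parch", "Fare", "Sex", "Embarked"]

def select_columns(df):
    """Hypothetical helper: pick features by name; return labels only if present."""
    X = df[FEATURES]
    y = df["Survived"].values if "Survived" in df.columns else None
    return X, y

# Toy frames standing in for train.csv and test.csv (values are made up).
train_like = pd.DataFrame({
    "PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1],
    "Name": ["A", "B"], "Sex": ["male", "female"], "Age": [22.0, None],
    "SibSp": [1, 0], "Parch": [0, 0], "Ticket": ["x", "y"],
    "Fare": [7.25, 71.28], "Cabin": [None, "C85"], "Embarked": ["S", "C"],
})
test_like = train_like.drop(columns=["Survived"])

X_tr, y_tr = select_columns(train_like)
X_te, y_te = select_columns(test_like)

# Both frames yield the same feature columns; the test-like frame has no labels.
print(list(X_tr.columns) == list(X_te.columns), y_te)
```

In the same spirit, the encoders fitted on the training data would need to be reused on the test data rather than re-fitted, so that both end up with the same encoded columns.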

In [4]:
(X,y) = clean_data(dataset)

In [5]:
X

array([[0.0, 0.0, 1.0, ..., 1, 0, 7.25],
       [1.0, 0.0, 0.0, ..., 1, 0, 71.2833],
       [0.0, 0.0, 1.0, ..., 0, 0, 7.925],
       ...,
       [0.0, 0.0, 1.0, ..., 1, 2, 23.45],
       [1.0, 0.0, 0.0, ..., 0, 0, 30.0],
       [0.0, 0.0, 1.0, ..., 0, 0, 7.75]], dtype=object)

In [6]:
X[0]

array([0.0, 0.0, 1.0, 1.0, 0.0, 0.0, 1.0, 0.0, 22.0, 1, 0, 7.25],
      dtype=object)

In [7]:
X[1]

array([1.0, 0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 38.0, 1, 0, 71.2833],
      dtype=object)

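Impute the remaining missing values with the column mean. In `train.csv` this affects `Age`; blank `Embarked` entries were already given their own one-hot column by the encoder above, which is why `X[0]` has 12 entries (3 for `Pclass`, 1 for `Sex`, 4 for `Embarked`, then `Age`, `SibSp`, `Parch`, `Fare`).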

In [8]:
imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
X = imputer.fit_transform(X)
print(X)

[[ 0.      0.      1.     ...  1.      0.      7.25  ]
 [ 1.      0.      0.     ...  1.      0.     71.2833]
 [ 0.      0.      1.     ...  0.      0.      7.925 ]
 ...
 [ 0.      0.      1.     ...  1.      2.     23.45  ]
 [ 1.      0.      0.     ...  0.      0.     30.    ]
 [ 0.      0.      1.     ...  0.      0.      7.75  ]]
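
To make the effect of `strategy='mean'` concrete, here is a small sketch on a made-up matrix (the toy values are assumptions chosen for easy arithmetic, not rows from the dataset).

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy matrix: column 0 has one missing value; column 1 is complete.
toy = np.array([[22.0, 7.25],
                [np.nan, 71.28],
                [26.0, 7.925]])

toy_imputer = SimpleImputer(missing_values=np.nan, strategy='mean')
filled = toy_imputer.fit_transform(toy)

# The NaN is replaced by the mean of the observed values in its column: (22 + 26) / 2.
print(filled[1, 0])
# Per-column means learned during fit.
print(toy_imputer.statistics_)
```

Note that the notebook fits the imputer on all rows before the train/test split, so the held-out rows contribute to the imputed mean. Fitting on the training rows only and calling `transform` on the held-out rows would avoid that, at the cost of reordering the steps.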


In [9]:
print(y)

[0 1 1 1 0 0 0 0 1 1 1 1 0 0 0 1 0 1 0 1 0 1 1 1 0 1 0 0 1 0 0 1 1 0 0 0 1
 0 0 1 0 0 0 1 1 0 0 1 0 0 0 0 1 1 0 1 1 0 1 0 0 1 0 0 0 1 1 0 1 0 0 0 0 0
 1 0 0 0 1 1 0 1 1 0 1 1 0 0 1 0 0 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 1 1 0 1 0
 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 1 1 0 0 0 0 1 0 0 1 0 0 0 0 1 1 0 0 0 1 0
 0 0 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 1 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 1
 0 1 1 0 0 1 0 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 1 1 1 0 1 0 0 0 1 1 0 1 0 1 0
 0 0 1 0 1 0 0 0 1 0 0 1 0 0 0 1 0 0 0 1 0 0 0 0 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 1 0 0 0 0 0 1 1 1 0 1 1 0 1 1 0 0 0 1 0 0 0 1 0 0 1 0 1 1 1 1 0 0 0 0
 0 0 1 1 1 1 0 1 0 1 1 1 0 1 1 1 0 0 0 1 1 0 1 1 0 0 1 1 0 1 0 1 1 1 1 0 0
 0 1 0 0 1 1 0 1 1 0 0 0 1 1 1 1 0 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 1 1 1 1
 1 0 0 0 0 1 1 0 0 0 1 1 0 1 0 0 0 1 0 1 1 1 0 1 1 0 0 0 0 1 1 0 0 0 0 0 0
 1 0 0 0 0 1 0 1 0 1 1 0 0 0 0 0 0 0 0 1 1 0 1 1 1 1 0 0 1 0 1 0 0 1 0 0 1
 1 1 1 1 1 1 0 0 0 1 0 1 0 1 1 0 1 0 0 0 0 0 0 0 0 1 0 0 1 1 0 0 0 0 0 1 0
 0 0 1 1 0 1 0 0 1 0 0 0 

## Train/Test Split

In [10]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.20, random_state = 0, stratify = y)
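
The `stratify = y` argument keeps the class proportions approximately the same in both parts of the split. A minimal sketch with made-up labels (60 zeros and 40 ones, chosen as an assumption so the counts divide evenly):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 rows, 40 % positives.
y_demo = np.array([0] * 60 + [1] * 40)
X_demo = np.arange(100).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=0, stratify=y_demo)

# Both parts keep the 40 % positive rate: 32 of 80 and 8 of 20.
print(y_tr.sum(), len(y_tr), y_te.sum(), len(y_te))
```

In this notebook the held-out part contains 179 rows, which is why every confusion matrix below sums to 179 (110 non-survivors and 69 survivors).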

## Feature Scaling

In [11]:
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
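
The scaler is fitted on the training rows only, and the held-out rows are transformed with the training mean and standard deviation. A small sketch on made-up numbers (an illustration, not data from the notebook) shows the difference between `fit_transform` and `transform`:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train = np.array([[1.0], [2.0], [3.0]])
test = np.array([[2.0], [5.0]])

demo_scaler = StandardScaler()
train_scaled = demo_scaler.fit_transform(train)  # learns mean = 2 and variance = 2/3 from train
test_scaled = demo_scaler.transform(test)        # reuses the training mean and variance

print(demo_scaler.mean_, demo_scaler.var_)
print(test_scaled.ravel())  # 2.0 maps to 0 because it equals the training mean
```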

## Kernel SVC

In [12]:
svc_classifier = SVC(C = 1.0, gamma='scale', kernel = 'rbf', random_state = 0)
svc_classifier.fit(X_train, y_train)

### Confusion Matrix, Accuracy, and Cross-Validation
Computed for the SVC trained above with the default hyperparameter values (`C = 1.0`, `gamma = 'scale'`, RBF kernel).

In [13]:
y_pred = svc_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[100  10]
 [ 26  43]]


Accuracy against held-out test data

In [14]:
accuracy_score(y_test, y_pred)

0.7988826815642458
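
The accuracy can be recovered directly from the confusion matrix printed above, whose rows are the actual classes and whose columns are the predicted classes. The sketch below redoes that arithmetic and also derives recall and precision for the "survived" class, which the notebook does not report; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix printed above: rows = actual class, columns = predicted class.
cm_svc = np.array([[100, 10],
                   [26, 43]])
tn, fp, fn, tp = cm_svc.ravel()

accuracy = (tn + tp) / cm_svc.sum()  # (100 + 43) / 179
recall = tp / (tp + fn)              # share of actual survivors that were predicted to survive
precision = tp / (tp + fp)           # share of predicted survivors that actually survived

print(accuracy, recall, precision)
```

Read this way, the default model misses 26 of the 69 survivors in the held-out data, which the single accuracy figure does not show.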

Cross-validation

In [15]:
accuracies = cross_val_score(estimator = svc_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 81.74 %
Standard Deviation: 1.78 %
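
The `X_train` passed to `cross_val_score` above was already scaled using all training rows, so each validation fold contributed to the scaling statistics. The effect is usually small, but one way to avoid it is to wrap the scaler and classifier in a pipeline so that scaling is re-fitted inside every fold. A sketch under that assumption, using synthetic data from `make_classification` as a stand-in for the Titanic features:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)

# The scaler is fitted on the training part of each fold only.
model = make_pipeline(StandardScaler(),
                      SVC(C=1.0, gamma='scale', kernel='rbf', random_state=0))
scores = cross_val_score(model, X_demo, y_demo, cv=5)

print("Accuracy: {:.2f} %".format(scores.mean() * 100))
print("Standard Deviation: {:.2f} %".format(scores.std() * 100))
```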


### Grid Search

In [16]:
cstep = 0.05
gstep = 0.05
parameters = [{'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['linear']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['poly'], 'degree': [2, 3, 4]},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['sigmoid'], 'gamma': ['scale']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['sigmoid'], 'gamma': np.arange(gstep,1+gstep,gstep)},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['rbf'], 'gamma': ['scale']},
              {'C': np.arange(cstep,1+cstep,cstep), 'kernel': ['rbf'], 'gamma': np.arange(gstep,1+gstep,gstep)}]
svc_grid_search = GridSearchCV(estimator = svc_classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           n_jobs = -1)
svc_grid_search.fit(X_train, y_train)
best_accuracy = svc_grid_search.best_score_
best_parameters = svc_grid_search.best_params_
cv_results = pd.DataFrame(svc_grid_search.cv_results_)
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)
print("Cross-Validation Results:\n", cv_results)

Best Accuracy: 82.44 %
Best Parameters: {'C': 0.45, 'degree': 3, 'kernel': 'poly'}
Cross-Validation Results:
      mean_fit_time  std_fit_time  mean_score_time  std_score_time  param_C   
0         0.004099      0.000465         0.001153        0.000063     0.05  \
1         0.004258      0.000807         0.001190        0.000065     0.10   
2         0.004563      0.000924         0.001168        0.000106     0.15   
3         0.004141      0.001269         0.001111        0.000196     0.20   
4         0.004723      0.001048         0.001139        0.000201     0.25   
..             ...           ...              ...             ...      ...   
915       0.005585      0.001349         0.001840        0.000311     1.00   
916       0.006084      0.000847         0.002056        0.000133     1.00   
917       0.005323      0.000694         0.001813        0.000271     1.00   
918       0.005689      0.000907         0.001639        0.000381     1.00   
919       0.005702      0.001005
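
The grid above combines 20 values of `C` with six kernel configurations. The sketch below counts the candidates with scikit-learn's `ParameterGrid`. As an assumption, `np.linspace` is swapped in for `np.arange`, because NumPy's documentation warns that `arange` with a non-integer step can produce an inconsistent number of elements.

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# 20 evenly spaced values from 0.05 to 1.00 (linspace used here instead of arange).
c_values = np.linspace(0.05, 1.0, 20)
g_values = np.linspace(0.05, 1.0, 20)

grid = ParameterGrid([
    {'C': c_values, 'kernel': ['linear']},
    {'C': c_values, 'kernel': ['poly'], 'degree': [2, 3, 4]},
    {'C': c_values, 'kernel': ['sigmoid'], 'gamma': ['scale']},
    {'C': c_values, 'kernel': ['sigmoid'], 'gamma': g_values},
    {'C': c_values, 'kernel': ['rbf'], 'gamma': ['scale']},
    {'C': c_values, 'kernel': ['rbf'], 'gamma': g_values},
])

# 20 + 60 + 20 + 400 + 20 + 400 candidates.
n_candidates = len(grid)
print(n_candidates, n_candidates * 5)  # candidates, and fits with cv = 5
```

The count of 920 matches the row index 0 to 919 in the printed results; with `cv = 5` the search performs 4,600 fits before refitting the best candidate.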

### Confusion Matrix
For the SVC model found via grid search.

In [17]:
y_pred = svc_grid_search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[106   4]
 [ 28  41]]


### Accuracy
Against held-out test data.

In [18]:
accuracy_score(y_test, y_pred)

0.8212290502793296

## XGBoost

Train an XGBoost classifier with default hyperparameters.

In [19]:
xgb_classifier = XGBClassifier()
xgb_classifier.fit(X_train, y_train)

### Confusion Matrix

In [20]:
y_pred = xgb_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[97 13]
 [22 47]]


### Accuracy

In [21]:
accuracy_score(y_test, y_pred)

0.8044692737430168

### Cross-Validation

In [22]:
accuracies = cross_val_score(estimator = xgb_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 81.31 %
Standard Deviation: 2.77 %


## Random Forest Classifier

In [23]:
rf_classifier = RandomForestClassifier(n_estimators = 10, criterion = 'entropy', random_state = 0)
rf_classifier.fit(X_train, y_train)

### Confusion Matrix

In [24]:
y_pred = rf_classifier.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[96 14]
 [25 44]]


### Accuracy

In [25]:
accuracy_score(y_test, y_pred)

0.7821229050279329

### Cross-Validation

In [26]:
accuracies = cross_val_score(estimator = rf_classifier, X = X_train, y = y_train, cv = 5)
print("Accuracy: {:.2f} %".format(accuracies.mean()*100))
print("Standard Deviation: {:.2f} %".format(accuracies.std()*100))

Accuracy: 79.77 %
Standard Deviation: 1.56 %


### Grid Search

In [27]:
parameters = [{'n_estimators': range(10,500,5), 'criterion': ['entropy']}]
rf_grid_search = GridSearchCV(estimator = rf_classifier,
                           param_grid = parameters,
                           scoring = 'accuracy',
                           cv = 5,
                           n_jobs = -1)
rf_grid_search.fit(X_train, y_train)
best_accuracy = rf_grid_search.best_score_
best_parameters = rf_grid_search.best_params_
cv_results = pd.DataFrame(rf_grid_search.cv_results_)
print("Best Accuracy: {:.2f} %".format(best_accuracy*100))
print("Best Parameters:", best_parameters)
print("Cross-Validation Results:\n", cv_results)

Best Accuracy: 81.17 %
Best Parameters: {'criterion': 'entropy', 'n_estimators': 435}
Cross-Validation Results:
     mean_fit_time  std_fit_time  mean_score_time  std_score_time   
0        0.011068      0.000151         0.001146        0.000125  \
1        0.016197      0.000147         0.001694        0.000013   
2        0.019246      0.003356         0.001701        0.000354   
3        0.024976      0.001439         0.001957        0.000246   
4        0.029938      0.001057         0.002329        0.000266   
..            ...           ...              ...             ...   
93       0.677404      0.010794         0.024239        0.002555   
94       0.638109      0.026583         0.023748        0.005168   
95       0.554648      0.043767         0.018761        0.006192   
96       0.508622      0.057114         0.014016        0.000162   
97       0.393082      0.039395         0.014072        0.000168   

   param_criterion  param_n_estimators   
0          entropy          
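
Printing the full `cv_results_` frame gets truncated, as seen above. A more readable option is to keep only the parameter and score columns and sort by rank. The sketch below does this for a deliberately small grid on synthetic data; the data and the three `n_estimators` values are assumptions for illustration, not the notebook's search.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (not the Titanic dataset).
X_demo, y_demo = make_classification(n_samples=200, n_features=12, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(criterion='entropy', random_state=0),
    param_grid={'n_estimators': [10, 20, 40]},
    scoring='accuracy',
    cv=5)
search.fit(X_demo, y_demo)

# Keep the informative columns and list the best-ranked candidates first.
cols = ['param_n_estimators', 'mean_test_score', 'std_test_score', 'rank_test_score']
summary = pd.DataFrame(search.cv_results_)[cols].sort_values('rank_test_score')
print(summary.to_string(index=False))
```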

### Confusion Matrix

In [28]:
y_pred = rf_grid_search.predict(X_test)
cm = confusion_matrix(y_test, y_pred)
print(cm)

[[99 11]
 [25 44]]


### Accuracy

In [29]:
accuracy_score(y_test, y_pred)

0.7988826815642458
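
The figures reported in the sections above can be collected into one table for comparison. The sketch below only transcribes those numbers (cross-validation accuracy and held-out accuracy, both in percent, rounded to two decimals); the column names and model labels are illustrative.

```python
import pandas as pd

# Values copied from the outputs above; grid-search rows use best_score_ as the CV accuracy.
results = pd.DataFrame(
    [("SVC (default)", 81.74, 79.89),
     ("SVC (grid search)", 82.44, 82.12),
     ("XGBoost (default)", 81.31, 80.45),
     ("Random forest (10 trees)", 79.77, 78.21),
     ("Random forest (grid search)", 81.17, 79.89)],
    columns=["model", "cv_accuracy_pct", "held_out_accuracy_pct"])

ranked = results.sort_values("held_out_accuracy_pct", ascending=False)
print(ranked.to_string(index=False))
```

With only 179 held-out rows, the gap between the best and worst model corresponds to seven passengers (147 versus 140 correct predictions), so the ranking should be read with caution.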