`Titanic` dataset from [Kaggle](https://www.kaggle.com/c/titanic/overview)

Created machine learning models to predict which passengers survived the Titanic shipwreck, using `Decision Tree` and `Random Forest` classifiers.

The original dataset contains the following 3 files:
* `train.csv`
    * The original training dataset containing all descriptive variables and the target variable `Survived`, which is a binary variable indicating if the corresponding passenger survived.
* `test.csv`
    * The original test dataset with all descriptive variables; the corresponding target variable `Survived` is missing.
* `test_labels.csv`
    * The missing target variable `Survived`, which can be matched back to the test dataset via the column `PassengerId`.
    
**Data Dictionary**

| Variable | Definition                                 | Key                                            |
|----------|--------------------------------------------|------------------------------------------------|
| survival | Survival                                   | 0 = No, 1 = Yes                                |
| pclass   | Ticket class                               | 1 = 1st, 2 = 2nd, 3 = 3rd                      |
| sex      | Sex                                        |                                                |
| age      | Age in years                               |                                                |
| sibsp    | # of siblings / spouses aboard the Titanic |                                                |
| parch    | # of parents / children aboard the Titanic |                                                |
| ticket   | Ticket number                              |                                                |
| fare     | Passenger fare                             |                                                |
| cabin    | Cabin number                               |                                                |
| embarked | Port of Embarkation                        | C = Cherbourg, Q = Queenstown, S = Southampton |

**Variable Notes**

* `pclass`: A proxy for socio-economic status (SES)
    * 1st = Upper
    * 2nd = Middle
    * 3rd = Lower
* `age`: Age is fractional if less than 1. If the age is estimated, it is in the form of xx.5
* `sibsp`: The dataset defines family relations in this way...
    * Sibling = brother, sister, stepbrother, stepsister
    * Spouse = husband, wife (mistresses and fiancés were ignored)
* `parch`: The dataset defines family relations in this way...
    * Parent = mother, father
    * Child = daughter, son, stepdaughter, stepson
    * Some children travelled only with a nanny, therefore parch=0 for them.

## Import the core libraries

In [146]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

pd.options.display.max_rows=None
pd.options.display.max_columns=None

## Q1. Load train and test set, combine them into one dataset (total: 5 points)

### 1.1 Import train and test set from given csv files (1 point)

In [147]:
train_df = pd.read_csv('titanic_train.csv')
test_df = pd.read_csv('titanic_test.csv')
test_labels = pd.read_csv('titanic_test_labels.csv')

In [148]:
## check the first 3 rows in train_df
train_df.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S


In [149]:
## check the first 3 rows in test_df
test_df.head(3)

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


In [150]:
## check the first 3 rows in test_labels
test_labels.head(3)

Unnamed: 0,PassengerId,Survived
0,892,0
1,893,1
2,894,0


### 1.2 Append the target variable `Survived` back to the test dataset, format the dataframe to make the data structure the same as `train_df` (2 points)

In [151]:
## append the target variable Survived to test_df, matching rows on PassengerId
## (merge returns a new dataframe, so test_df itself is left unchanged)
test_df_wlabel = test_df.merge(test_labels[['PassengerId', 'Survived']],
                               on='PassengerId', how='left', validate='one_to_one')

## change the order of columns in test_df to make the data structure of test_df the same as train_df
test_df2 = test_df_wlabel.reindex(columns=['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked'])
test_df2.head(3)

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,0,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,1,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,0,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q


### 1.3 Create a new dataframe `full_df` by concatenating the data from `train_df` and `test_df2`, such that `full_df` contains records from both `train_df` and `test_df2`. Apply additional data cleaning following the guidelines (2 points)

In [152]:
## create a new dataframe full_df by concatenating data from train_df and test_df2

full = [train_df, test_df2]
full_df = pd.concat(full)

In [153]:
full_df.shape

(1309, 12)

In [154]:
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 12 columns):
PassengerId    1309 non-null int64
Survived       1309 non-null int64
Pclass         1309 non-null int64
Name           1309 non-null object
Sex            1309 non-null object
Age            1046 non-null float64
SibSp          1309 non-null int64
Parch          1309 non-null int64
Ticket         1309 non-null object
Fare           1308 non-null float64
Cabin          295 non-null object
Embarked       1307 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 132.9+ KB
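
The `info()` output above reports an index running from 0 to 417 for 1309 entries, which means the row labels of the two source frames were kept and now repeat. This does not affect the steps that follow, but label-based lookups such as `full_df.loc[0]` would return two rows. A minimal sketch on toy frames (`a` and `b` are made up for illustration) shows what `ignore_index=True` would change:

```python
import pandas as pd

# Two toy frames standing in for train_df and test_df2 (illustrative only).
a = pd.DataFrame({"Survived": [0, 1, 1]})
b = pd.DataFrame({"Survived": [0, 1]})

kept = pd.concat([a, b])                      # keeps the original row labels
fresh = pd.concat([a, b], ignore_index=True)  # relabels the rows from 0 upwards

print(kept.index.tolist())
print(fresh.index.tolist())
print(len(kept.loc[0]))  # number of rows that share the label 0
```

Whether to reset the index is a matter of taste here, since the later cells never look rows up by label.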


In [155]:
## drop the column Cabin since it contains too many missing values, such that full_df contains all columns except Cabin
## drop column Ticket and PassengerId

full_df = full_df.drop(columns = ['Cabin','PassengerId','Ticket'])
full_df.head(3)


Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S


## Q2. Prepare the data in `full_df` for model fitting (total: 5 points)

### 2.1 Fill the missing values in column `Embarked` with the most frequent value (1 point)

In [156]:
## find the most frequent value in column Embarked

mod = full_df['Embarked'].mode()[0]
mod

'S'

In [157]:
## fill the missing values in column Embarked with the most frequent value you found in previous step
# full_df['Embarked'] = full_df['Embarked'].fillna(full_df['Embarked'].mode()) 

full_df['Embarked'] = full_df['Embarked'].fillna(mod)
full_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 9 columns):
Survived    1309 non-null int64
Pclass      1309 non-null int64
Name        1309 non-null object
Sex         1309 non-null object
Age         1046 non-null float64
SibSp       1309 non-null int64
Parch       1309 non-null int64
Fare        1308 non-null float64
Embarked    1309 non-null object
dtypes: float64(2), int64(4), object(3)
memory usage: 102.3+ KB
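
The commented-out line in the cell above passes the result of `mode()` straight to `fillna`. `mode()` returns a Series (there can be ties), and `fillna` with a Series aligns on index labels, so only a missing value sitting at label 0 would be filled. A small sketch with a made-up Series illustrates why the scalar `mode()[0]` is the safer choice:

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries (illustrative values only).
s = pd.Series(["S", np.nan, "C", "S", np.nan])

most_frequent = s.mode()            # a Series with the single label 0
by_series = s.fillna(most_frequent)  # aligned on labels: rows 1 and 4 stay missing
by_scalar = s.fillna(most_frequent[0])  # scalar: every missing row is filled

print(by_series.isna().sum())
print(by_scalar.isna().sum())
```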


### 2.2 Additional data transformations (you don't need to do anything for this step, just run the given code below)

In [158]:
## Add new columns to full_df
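## the pattern is a raw string so that \. is read as a literal dot; expand=False returns a Series
full_df['Title'] = full_df['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)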
full_df['Title'] = full_df['Title'].replace(['Ms', 'Mlle'], 'Miss')
full_df['Title'] = full_df['Title'].replace(['Mme', 'Countess', 'Lady', 'Dona'], 'Mrs')
full_df['Title'] = full_df['Title'].replace(['Dr', 'Major', 'Col', 'Sir', 'Rev', 'Jonkheer', 'Capt', 'Don'], 'Mr')
full_df = full_df.drop(['Name'], axis=1)
full_df['FamilySize'] = full_df['SibSp'] + full_df['Parch'] + 1
full_df['IsAlone'] = full_df['FamilySize'].apply(lambda x: 1 if x == 1 else 0)
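
To make the feature engineering in the cell above concrete, here is a minimal sketch applied to three names taken from the previews earlier in this notebook plus one hypothetical name (`"Doe, Dr. Jane"`, invented for illustration). The pattern captures the word that sits between a space and a period, and rare titles are then folded into the common ones:

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Wilkes, Mrs. James (Ellen Needs)",
    "Doe, Dr. Jane",  # hypothetical passenger, not in the dataset
])

# Capture the letters between a space and the first following period.
raw_titles = names.str.extract(r" ([A-Za-z]+)\.", expand=False)

# Fold a rare title into a common one, as the cell above does for 'Dr'.
titles = raw_titles.replace(["Dr"], "Mr")

# FamilySize counts the passenger plus relatives; IsAlone flags a size of 1.
toy = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 0]})
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1
toy["IsAlone"] = (toy["FamilySize"] == 1).astype(int)

print(raw_titles.tolist())
print(titles.tolist())
print(toy)
```

The vectorised comparison used for `IsAlone` gives the same values as the `apply` with a lambda in the cell above.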

### 2.3 Split X (descriptive features) and y (target variable) in full_df (1 point)

In [159]:
## split X and y from full_df

X = full_df.drop(columns = ['Survived'])
y = full_df['Survived']

X.head(3)

Unnamed: 0,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title,FamilySize,IsAlone
0,3,male,22.0,1,0,7.25,S,Mr,2,0
1,1,female,38.0,1,0,71.2833,C,Mrs,2,0
2,3,female,26.0,0,0,7.925,S,Miss,1,1


### 2.4 Set up a `scikit-learn pipeline` to handle missing values in the numeric features of `X`. Missing values should be filled with the median of the corresponding column (1 point)

In [160]:
## set up the pipeline num_pipeline to fill the missing values in numeric columns with 
## the corresponding median values of the columns

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

num_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy="median"))
    ])

### 2.5 Utilize the `ColumnTransformer` module in `scikit-learn` to handle the data transformations of both numeric and categorical variables in `X`. For categorical variables, please apply `OneHot Encoding` (2 points)

In [161]:
## find out the numeric and categorical variables in full_df

X_num = X.drop(columns = ['Sex','Embarked','Title'])
num_attribs = list(X_num.columns)
cat_attribs = ['Sex','Embarked','Title']

In [162]:
## create a full pipeline with the help of ColumnTransformer model in scikit-learn to handle
## the data transformation of both numeric and categorical variables

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

full_pipeline = ColumnTransformer([
        ('num', num_pipeline, num_attribs),
        ('cat', OneHotEncoder(), cat_attribs)
    ])

In [163]:
## Transform the feature dataframe X with the full_pipeline you just created, assign the transformed value to X_prepared
X_prepared = full_pipeline.fit_transform(X)

In [164]:
X_prepared.shape

(1309, 16)

In [165]:
X.shape

(1309, 10)
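
The jump from 10 to 16 columns comes from one-hot encoding: the seven numeric columns pass through unchanged in number, `Sex` expands to 2 columns and `Embarked` to 3, which leaves 4 for `Title` (consistent with four distinct titles remaining after the replacements in 2.2). A minimal sketch on a made-up four-row frame shows the same mechanics on a smaller scale. `sparse_threshold=0` is an addition here, used only to force a dense array that is easy to inspect:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy frame with one missing value per numeric column (illustrative values).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Fare": [7.25, 71.28, np.nan, 8.05],
    "Sex": ["male", "female", "female", "male"],
    "Embarked": ["S", "C", "S", "Q"],
})

num_cols = ["Age", "Fare"]
cat_cols = ["Sex", "Embarked"]

prep = ColumnTransformer(
    [
        ("num", Pipeline([("imputer", SimpleImputer(strategy="median"))]), num_cols),
        ("cat", OneHotEncoder(), cat_cols),
    ],
    sparse_threshold=0,  # always return a dense array
)
out = prep.fit_transform(toy)

print(out.shape)   # 2 numeric + 2 (Sex) + 3 (Embarked) columns
print(out[1, 0])   # the missing Age, replaced by the column median
```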

## Q3. Fit a Decision Tree model to the prepared data, and evaluate the model accuracy (total: 5 points)

### 3.1 Split the data into training and test sets. The test set should cover 20% of the original data records (1 point)
* Set `random_state` to 123 so that the split can be reproduced

In [166]:
## import the scikit-learn module for train and test split
from sklearn.model_selection import train_test_split

## split X and y into X_train, X_test, y_train, y_test following the 80%-20% ratio
X_train, X_test, y_train, y_test = train_test_split(X_prepared,y,test_size=0.2,random_state=123)
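
In the cells above, `full_pipeline` is fitted on all 1309 rows before the split, so the imputation medians and the encoder categories are also informed by rows that later land in the test set. For median imputation the effect is probably small, but one alternative is to split the raw frame first and chain preprocessing and model in a single `Pipeline`. The sketch below uses made-up data. The names `X_toy`, `y_toy` and `model` are hypothetical, and `handle_unknown='ignore'` is an added assumption so that categories unseen during training do not raise an error:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the Titanic features (not real data).
rng = np.random.default_rng(123)
n = 200
X_toy = pd.DataFrame({
    "Age": rng.uniform(1, 80, n),
    "Fare": rng.uniform(5, 100, n),
    "Sex": rng.choice(["male", "female"], n),
})
X_toy.loc[rng.choice(n, 20, replace=False), "Age"] = np.nan
y_toy = (X_toy["Sex"] == "female").astype(int)  # toy target, fully determined by Sex

# Split the raw frame first, then let the pipeline learn from the training rows only.
X_tr, X_te, y_tr, y_te = train_test_split(X_toy, y_toy, test_size=0.2, random_state=123)

prep = ColumnTransformer([
    ("num", SimpleImputer(strategy="median"), ["Age", "Fare"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["Sex"]),
])
model = Pipeline([("prep", prep), ("tree", DecisionTreeClassifier(random_state=123))])
model.fit(X_tr, y_tr)

# The imputer's median comes from the training rows alone.
learned_median = model.named_steps["prep"].named_transformers_["num"].statistics_[0]
print(learned_median, X_tr["Age"].median())
print(len(X_te), model.score(X_te, y_te))
```

A side benefit of this layout is that the whole pipeline can be handed to cross-validation tools, which then refit the preprocessing inside every fold.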

### 3.2 Fit a decision tree model with the scikit-learn module (2 points)

In [167]:
## import the scikit-learn module for decision tree model
from sklearn.tree import DecisionTreeClassifier

## create a decision tree model object
dtree = DecisionTreeClassifier()

## fit the model on the training data you got from step 3.1
dtree.fit(X_train, y_train)


DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

### 3.3 Evaluate the model, calculate the accuracy of the model prediction on training and test data sets (2 points)

In [168]:
## import the scikit-learn module for model accuracy
from sklearn.metrics import accuracy_score

## calculate the predicted y values in training data
y_train_pred = dtree.predict(X_train)

## compute the model accuracy
accuracy_dtree_train = accuracy_score(y_train, y_train_pred)

accuracy_dtree_train

0.9789875835721108

In [169]:
## calculate the predicted y values in test data
y_test_pred = dtree.predict(X_test)

## compute the model accuracy
accuracy_dtree_test = accuracy_score(y_test, y_test_pred)

accuracy_dtree_test

0.8053435114503816
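
The gap between the training accuracy (about 0.98) and the test accuracy (about 0.81) is the usual sign of an unconstrained tree memorising its training rows. One common way to probe this is to cap the tree depth and compare cross-validated scores. The sketch below runs on synthetic data from `make_classification`, so its numbers say nothing about the Titanic data, and the depth of 3 is an arbitrary choice for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic, slightly noisy classification problem (not the Titanic data).
X_syn, y_syn = make_classification(
    n_samples=500, n_features=10, n_informative=4, flip_y=0.1, random_state=123
)

results = {}
for depth in (None, 3):  # None lets the tree grow until its leaves are pure
    tree = DecisionTreeClassifier(max_depth=depth, random_state=123)
    cv_acc = cross_val_score(tree, X_syn, y_syn, cv=5, scoring="accuracy").mean()
    train_acc = tree.fit(X_syn, y_syn).score(X_syn, y_syn)
    results[depth] = (train_acc, cv_acc, tree.get_depth())
    print(depth, tree.get_depth(), round(train_acc, 3), round(cv_acc, 3))
```

If the shallow tree's cross-validated accuracy is close to or above that of the full tree, the extra depth is mostly fitting noise.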

## Q4. Fit a Random Forest model to the prepared data, and evaluate the model accuracy (total: 5 points)

### 4.1 Fit a Random Forest model (with 200 decision trees and a maximum tree depth of 10) on the training data with the scikit-learn module (1 point)

In [170]:
## import the scikit-learn module for random forest model
from sklearn.ensemble import RandomForestClassifier

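## create a random forest model object
## (as run here it uses the scikit-learn defaults shown in the output below;
##  the setting described in 4.1 would be RandomForestClassifier(n_estimators=200, max_depth=10))
rf = RandomForestClassifier()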

## fit the model on the training data you got from step 3.1
rf.fit(X_train, y_train)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=None,
                       verbose=0, warm_start=False)

### 4.2 Evaluate the model, calculate the accuracy of the model prediction on training and test data sets (2 points)

In [171]:
## import the scikit-learn module for model accuracy
from sklearn.metrics import accuracy_score

## calculate the predicted y values in training data
y_train_pred2 = rf.predict(X_train)

## compute the model accuracy
accuracy_rf_train = accuracy_score(y_train, y_train_pred2)
accuracy_rf_train

0.9789875835721108

In [172]:
## calculate the predicted y values in test data
y_test_pred2 = rf.predict(X_test)

## compute the model accuracy
accuracy_rf_test = accuracy_score(y_test, y_test_pred2)
accuracy_rf_test

0.8587786259541985
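
The configuration named in the 4.1 heading (200 trees, maximum depth 10) maps onto the `n_estimators` and `max_depth` arguments of `RandomForestClassifier`. A minimal sketch on synthetic data shows how to pass them and how to check that the fitted forest respects them. `random_state=123` is an addition for reproducibility, and the scores printed here are unrelated to the Titanic results above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to demonstrate the parameters.
X_syn, y_syn = make_classification(n_samples=400, n_features=10, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=123)

rf_cfg = RandomForestClassifier(n_estimators=200, max_depth=10, random_state=123)
rf_cfg.fit(X_tr, y_tr)

# Each fitted tree is available in estimators_; none may exceed max_depth.
deepest = max(est.get_depth() for est in rf_cfg.estimators_)
print(len(rf_cfg.estimators_), deepest)
print(rf_cfg.score(X_tr, y_tr), rf_cfg.score(X_te, y_te))
```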

### 4.3 Use `GridSearchCV` module in `scikit-learn` to optimize the parameters we used in the random forest model fitting. Use `'accuracy'` as the scoring parameter value (2 points)

In [173]:
%%time

## import the scikit-learn module for grid search with cross-validation
from sklearn.model_selection import GridSearchCV

## define the parameters to try
param_grid = [
    {'n_estimators': [5, 10, 50, 100, 200], 'max_depth': [2, 4, 6, 8, 10, 12, 14]}
  ]

## define a random forest model object
rf2 = RandomForestClassifier(random_state=123)

## create grid search modeling object, scored by accuracy as required in 4.3
grid_search = GridSearchCV(rf2, param_grid, cv=5,
                           scoring='accuracy',
                           return_train_score=True)

## fit the grid search model on X_train and y_train
grid_search.fit(X_train, y_train)

Wall time: 1min 1s


GridSearchCV(cv=5, error_score=nan,
             estimator=RandomForestClassifier(bootstrap=True, ccp_alpha=0.0,
                                              class_weight=None,
                                              criterion='gini', max_depth=None,
                                              max_features='auto',
                                              max_leaf_nodes=None,
                                              max_samples=None,
                                              min_impurity_decrease=0.0,
                                              min_impurity_split=None,
                                              min_samples_leaf=1,
                                              min_samples_split=2,
                                              min_weight_fraction_leaf=0.0,
                                              n_estimators=100, n_jobs=None,
                                              oob_score=False, random_state=123,
                                 

In [174]:
## check the parameter setting of the best model from gridsearch
grid_search.best_params_

{'max_depth': 6, 'n_estimators': 10}
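
Beyond `best_params_`, a fitted search object also exposes the mean cross-validated score of the winning setting (`best_score_`) and a model refitted on the whole training set (`best_estimator_`, available because `refit=True` is the default). The sketch below shows how these could be read. It uses synthetic data and a deliberately small grid (`small_grid` is a hypothetical name) so that it runs in a few seconds, and its output does not reflect the Titanic search above:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic data, used only to demonstrate the API.
X_syn, y_syn = make_classification(n_samples=300, n_features=8, random_state=123)
X_tr, X_te, y_tr, y_te = train_test_split(X_syn, y_syn, test_size=0.2, random_state=123)

small_grid = {"n_estimators": [10, 50], "max_depth": [2, 6]}
search = GridSearchCV(
    RandomForestClassifier(random_state=123), small_grid, cv=5, scoring="accuracy"
)
search.fit(X_tr, y_tr)

print(search.best_params_)                       # winning parameter combination
print(search.best_score_)                        # its mean cross-validated accuracy
print(search.best_estimator_.score(X_te, y_te))  # accuracy on the held-out test split
```

Scoring the refitted `best_estimator_` on the held-out split gives an accuracy that can be compared directly with the untuned forest.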