
## Titanic - Machine Learning from Disaster - Predicting the Survival of Titanic Passengers


### Load the libraries

In [None]:
import re
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns 
import warnings 

# Pre-processing, model selection and tuning techniques 
from sklearn.preprocessing import LabelEncoder
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split 
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn import metrics

# Algorithms - Unsupervised Learning
from sklearn.decomposition import PCA 
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

# Algorithms - Supervised Learning - classifiers
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC 
from sklearn.tree import DecisionTreeClassifier 
from sklearn.ensemble import RandomForestClassifier

# Other classifiers if you would like to explore (optional) 
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import ExtraTreesClassifier, VotingClassifier

pd.set_option('display.max_columns', 100)
pd.set_option("display.precision", 2)

%matplotlib inline
warnings.simplefilter(action='ignore', category=FutureWarning)

### Load the data

Kaggle’s Titanic dataset is a classic open-source introduction to machine learning. The competition and data are available at https://www.kaggle.com/competitions/titanic. The dataset contains information about the passengers on the Titanic and whether they survived. We have the following features in our dataset:

- survival -> Survival -> 0 = No, 1 = Yes
- pclass -> Ticket class -> 1 = 1st, 2 = 2nd, 3 = 3rd
- sex -> Sex
- Age -> Age in years
- sibsp -> # of siblings / spouses aboard the Titanic
- parch -> # of parents / children aboard the Titanic
- ticket -> Ticket number
- fare -> Passenger fare
- cabin -> Cabin number
- embarked -> Port of Embarkation

#### Data Notes 

- As part of the Kaggle competition, based on the patterns you find in titanic_known.csv, you will have to predict whether the other 418 passengers on board in titanic_unknown.csv survived.

**Note that titanic_unknown.csv does not have a "Survived" column - this information is hidden from you, and how well you do at predicting these hidden values will determine how highly you score in the competition!**

In [None]:
# The data in this use case are provided to you as titanic_known.csv and titanic_unknown.csv files. 
# Load both of the files into two new DataFrames titanicData and unknownData respectively


Get the dimensionality and preview the known titanicData:

In [None]:
# Print the dimensionality of titanicData and preview its first rows
# Take a moment to familiarize yourselves with the entries in the dataset (and their data types) 
# Have you detected the class variable that we are trying to predict? 


The training set has 891 examples and 11 features plus the target variable (Survived).

Get the dimensionality and preview the test data:

In [None]:
# Print the dimensionality of unknownData and preview its first rows
# Familiarize yourselves with the entries in the dataset (and their data types) 

# !! Note that titanic_unknown.csv does not have a "Survived" column - this information is hidden from you, 
# and how well you do at predicting these hidden values will determine how highly you score in the actual Kaggle competition! 
# (they screen the results against the hidden classes only the owners possess) 

# !! The unknownData set mimics a real-world scenario in which new, unseen data arrives at a deployed (production) model,
# and the model's task is to predict the labels using everything it has learned and generalized from our known data. 



#### Observations 

Note: From the previews above, we can observe a few things.
- First, many features will need to be converted to numeric ones later on so that the machine learning algorithms can process them.
- Furthermore, the features have widely different ranges, so we will need to bring them onto roughly the same scale.
- We can also spot some features that contain missing values (NaN = not a number), which we will need to deal with.

### Check the data types and missing values of each column 

- Which features are categorical?
- Which features are numerical?

#### 1) Train data

2 of the columns are floats, 5 are integers and 5 are objects.

In [None]:
# Get the data types or relevant info of titanicData 


In [None]:
# Check for missing values per column in titanicData  


In [None]:
# Extra - optional : Plot the null values heatmap for titanicData



We notice from the above that the known set has missing values in the Age, Cabin and Embarked columns.

#### 2) Unknown data

In [None]:
# Get the data types or relevant info of unknownData 


In [None]:
# Check for missing values in unknownData 


In [None]:
# Extra - optional : Plot the null values heatmap for unknownData


**The unknown set has missing values in Age, Cabin and Fare columns**

The Embarked feature has only 2 missing values in the train set, which can easily be filled. The ‘Age’ feature, which has 177 missing values, will be much trickier to deal with. The ‘Cabin’ feature needs further investigation.
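
As a minimal sketch of how such per-column counts can be obtained, the snippet below uses a small made-up DataFrame (the name `toy` and all values are invented for illustration and are not taken from the Titanic files):

```python
import numpy as np
import pandas as pd

# Toy data standing in for titanicData (values are invented for illustration)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Embarked": ["S", "C", None, "S"],
    "Fare": [7.25, 71.28, 8.05, 8.46],
})

# Number and share of missing values per column
missing_counts = toy.isnull().sum()
missing_share = toy.isnull().mean().round(2)

print(pd.DataFrame({"missing": missing_counts, "share": missing_share}))
```

Looking at the share as well as the count helps to decide whether a column is worth imputing or is mostly empty.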

### Feature engineering and feature creation 

#### 1. Drop any unnecessary columns 

##### Handling the passenger ID

In [None]:
# How would you deal with PassengerId in titanicData? Come up with a solution. Ensure your changes have gone through 


In [None]:
# How would you deal with PassengerId in unknownData? Come up with a solution. Ensure your changes have gone through  


#### 2. Optional --- Feature creation 

##### Optional --- Extract the titles 

We could use the Name feature to extract the Titles so that we can build a new feature out of that.

In [None]:
# Extract the title: the word that directly precedes a period in the Name (e.g. "Mr", "Mrs", "Miss")
# A raw string avoids the invalid-escape warning for "\."; expand=False returns a Series
for dataset in [titanicData, unknownData]:
    dataset['Title'] = dataset['Name'].str.extract(r'([A-Za-z]+)\.', expand=False)

titanicData.head(2)

In [None]:
# Optional: execute pd.crosstab() on the combination of titanicData['Title'] and titanicData['Sex'] 



We can replace many titles with a more common name or classify them as a new value 'Other'.

In [None]:
for dataset in [titanicData, unknownData]:
    dataset['Title'] = dataset['Title'].replace(['Lady', 'Countess','Capt', 'Col', 'Don', 'Dr', 'Major', 'Rev', 'Sir', 'Jonkheer', 'Dona'], 'Other')
    dataset['Title'] = dataset['Title'].replace('Mlle', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Ms', 'Miss')
    dataset['Title'] = dataset['Title'].replace('Mme', 'Mrs')

titanicData[['Title', 'Survived']].groupby(['Title'], as_index=False).count()

##### 3. Optional --- Work with the Cabin data

As a first thought, we could delete the ‘Cabin’ feature, but there is something rather interesting about it. A cabin number looks like ‘C123’, and the letter refers to the deck. We can therefore extract the letter and create a new feature that contains a person's deck. Afterwards, we will convert the feature into a numeric variable, with missing values converted to zero. The actual decks of the Titanic range from A to G.

In [None]:
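# Map each deck letter to an integer code
deck = {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6, "G": 7, "U": 8}
deck_pattern = re.compile("([a-zA-Z]+)")  # compile once instead of once per row
data = [titanicData, unknownData]

for dataset in data:
    # Missing cabins become "X", which is not in the mapping and therefore ends up as 0 below
    dataset['Cabin'] = dataset['Cabin'].fillna("X")
    # Keep the leading letter(s) of the cabin number as the deck
    dataset['Deck']  = dataset['Cabin'].map(lambda x: deck_pattern.search(x).group())
    dataset['Deck']  = dataset['Deck'].map(deck)
    dataset['Deck']  = dataset['Deck'].fillna(0)
    dataset['Deck']  = dataset['Deck'].astype(int)

titanicData.Deck.value_counts()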

##### 4. Optional --- Combine SibSp and Parch

SibSp and Parch would make more sense as a combined feature that shows the size of a person's family aboard the Titanic (the passenger plus any siblings, spouses, parents and children). We can create it as follows and also create a feature that shows whether someone is travelling alone.

In [None]:
data = [titanicData, unknownData]

for dataset in data:
    dataset['FamilySize'] = dataset['SibSp'] + dataset['Parch'] + 1

titanicData[['FamilySize', 'Survived']].groupby(['FamilySize'], as_index=False).count().sort_values(by='Survived', ascending=False)

In [None]:
for dataset in data:
    dataset['IsAlone'] = 0
    dataset.loc[dataset['FamilySize'] == 1, 'IsAlone'] = 1

titanicData[['IsAlone', 'Survived']].groupby(['IsAlone'], as_index=False).count()

##### 5. Handling (Dropping) the Name, Ticket and Cabin columns if NOT used

In [None]:
# Drop the features ['Name', 'Ticket','Cabin'] of your titanicData - set the axis accordingly - in place or by assigning the result back. 
# Preview the first 2 rows of titanicData



In [None]:
# Drop the features ['Name', 'Ticket','Cabin'] of your unknownData - set the axis accordingly - in place or by assigning the result back. 
# Preview the first 2 rows of unknownData



### Split to X and y

In [None]:
# The most common step across all Supervised Machine Learning models in Python
# The first thing you have to ask is what is the label (dependent) variable/column? 

# Store the feature data from titanicData into a new variable named "X" - Extract all columns **except** 
# the label column using either indexing (.loc / .iloc) or, easier, .drop()

# Store the target data (label/class column) from titanicData into a new variable named "y"
# Get only the label (class) from titanicData 

# Print the dimensions (using "shape") for both X and y 


In [None]:
# Print the entries of y. Do you need to apply LabelEncoding on y? What do you think?



### Check the Survival rate (target variable y)

In [None]:
# In order to check the counts of the binary class, use the function .value_counts() on y 


In [None]:
# Use the sns.countplot() function from seaborn to plot the feature "Survived" from titanicData
# Optional: set also the hue to "Survived" 


It is evident that most passengers did not survive the disaster.

Out of 891 passengers in the training set, only 342 survived, i.e. only 38.4% of the passengers in the training set survived the sinking. 

### Train/test split

As mentioned previously, the test (unknown) file provided is used for competition submission. 

So, we will split the known data into two datasets: one for training and one for testing. This matters because a model can overfit, meaning it becomes so specific to the subset it was trained on that it cannot generalize accurately to another subset of the same dataset. Evaluating on a subset the algorithm has not seen lets us detect this, because the model cannot "cheat" by memorizing the answers. We will use sklearn's train_test_split function.
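
A minimal sketch of a stratified split on made-up data is shown below. The names `X_toy`, `y_toy`, `X_tr`, `X_te`, `y_tr` and `y_te` are hypothetical and chosen so they do not clash with the notebook's own variables:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy features and an imbalanced binary label (60% zeros, 40% ones); invented data
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 60 + [1] * 40, name="label")

# stratify=y_toy keeps the class proportions (almost) identical in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, stratify=y_toy, random_state=1
)

print(X_tr.shape, X_te.shape)
print("share of class 1 in train:", y_tr.mean())
print("share of class 1 in test: ", y_te.mean())
```

Stratification is mainly useful when the classes are imbalanced, as an unlucky random split could otherwise leave the test set with a noticeably different class balance.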

In [None]:
# Use the train_test_split() function from sklearn. Use 30% of your data for the test set. Use stratification if needed. 
# Set random_state=1 (for reproducibility). Print the dimensionality (shape) of X_train, X_test, y_train, y_test 



### Imputing 

Missing values need to be handled, as most machine learning algorithms cannot run on data that contains them. There are many functions, such as `.fillna()` from `pandas` and `SimpleImputer` from `sklearn`, as well as **many strategies** to help with missing data, and the right choice depends on whether the missing data is **numeric** or **categorical**.  

Which strategy is best for your problem very much depends on the specifics of your dataset. However, generally speaking, it is not worth removing large chunks of data. Common options are:

- simply removing rows where there is missing data (e.g. `dropna()` can achieve this)
- imputing the values with a summary statistic such as mean or median or most frequent value (e.g. `.fillna()` from `pandas` or `SimpleImputer()` from the `sklearn` module)
- replace the values with a reasonable estimate
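
The second option can be sketched as follows on two tiny made-up frames (`train` and `test` are hypothetical names, and the values are invented). The imputers learn their statistics from the training frame only and then reuse them on the other frame:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Invented stand-ins for a training and a test frame
train = pd.DataFrame({"Age": [20.0, np.nan, 40.0, 60.0],
                      "Embarked": ["S", "S", np.nan, "C"]})
test = pd.DataFrame({"Age": [np.nan, 30.0],
                     "Embarked": [np.nan, "Q"]})

imp_num = SimpleImputer(strategy="median")         # numerical columns
imp_cat = SimpleImputer(strategy="most_frequent")  # categorical columns

# fit_transform on the training frame: learns the median (40.0) and the mode ("S")
train[["Age"]] = imp_num.fit_transform(train[["Age"]])
train[["Embarked"]] = imp_cat.fit_transform(train[["Embarked"]])

# transform only on the test frame: reuses the values learned from train
test[["Age"]] = imp_num.transform(test[["Age"]])
test[["Embarked"]] = imp_cat.transform(test[["Embarked"]])

print(train)
print(test)
```

Fitting on the training data alone avoids leaking information from the test data into the preprocessing step.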

In [None]:
# Let's quickly print the NAs per dataset 

print("Nulls in X_train","-"*20)
print(X_train.isnull().sum())

print("Nulls in X_test","-"*20)
print(X_test.isnull().sum())

print("Nulls in unknownData","-"*20)
print(unknownData.isnull().sum())

In [None]:
# We will need one Imputer for the numerical features and one for the categorical features 
# 1. Instantiate the SimpleImputer for the numerical features using as strategy the 'median'. Assign to a variable named 'imp_num'
# 2. Instantiate the SimpleImputer for the categorical features using as strategy the 'most_frequent'. Assign to a variable named 'imp_cat' 



In [None]:
# Impute your numerical data using the imp_num Imputer : 

# 1. .fit_transform() your imp_num Imputer **ONLY** on the numerical columns of X_train that contain NAs (X_train[['Age', 'Fare']]). 
# Remember: we only fit() on the trainData but transform() all the DataFrames after the learning process. 
# Assign back only to X_train[['Age', 'Fare']] 
# 2. .transform() the numerical columns of X_test that contain NAs (X_test[['Age', 'Fare']]). Assign back only to X_test[['Age', 'Fare']] 
# 3. .transform() the numerical columns of unknownData that contain NAs (unknownData[['Age', 'Fare']]). Assign back only to unknownData[['Age', 'Fare']]  



In [None]:
# Impute your categorical data using the imp_cat Imputer: 

# 1. .fit_transform() your imp_cat Imputer ONLY on the categorical columns of X_train that contain NAs (X_train[['Embarked']]). 
# Remember: we only fit() on the trainData but transform() all the DataFrames after the learning process.
# Assign back only to X_train[['Embarked']] 
# 2. .transform() the categorical columns of X_test that contain NAs (X_test[['Embarked']]). Assign back only to X_test[['Embarked']]  
# 3. .transform() the categorical columns of unknownData that contain NAs (unknownData[['Embarked']]). Assign back only to unknownData[['Embarked']]  



In [None]:
# Print once more the NA count as above. Did your changes go through? 



### Convert the categorical features with One-Hot-Encoding (OHE) 
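
One pitfall of encoding each dataset separately with `pd.get_dummies` is that a category missing from one dataset produces a different set of columns. The sketch below illustrates this on invented data and shows one possible way to align the columns with `reindex`. The helper `one_hot` and the frames `train` and `test` are hypothetical and not part of the notebook:

```python
import pandas as pd

# Invented data: the test frame happens to contain only one port
train = pd.DataFrame({"Fare": [7.25, 71.28, 8.05], "Embarked": ["S", "C", "Q"]})
test = pd.DataFrame({"Fare": [8.46, 51.86], "Embarked": ["S", "S"]})

def one_hot(df, categorical_cols):
    """Replace the given categorical columns with their dummy columns."""
    dummies = pd.get_dummies(df[categorical_cols], dtype=int)
    return df.drop(columns=categorical_cols).join(dummies)

train_ohe = one_hot(train, ["Embarked"])
test_ohe = one_hot(test, ["Embarked"])
print(list(test_ohe.columns))  # Embarked_C and Embarked_Q are missing here

# Align the test columns to the training columns; absent dummies are filled with 0
test_ohe = test_ohe.reindex(columns=train_ohe.columns, fill_value=0)
print(test_ohe)
```

After the alignment both frames have the same columns in the same order, which the scaler and the classifiers further down rely on.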

In [None]:
# Check the X_train data types with .dtypes once more to detect the categorical columns 
# Which are the categorical features in this case? You should detect three of them after our feature engineering/creation 



In [None]:
# Apply one-hot-encoding (OHE) to the categorical features of X_train using pd.get_dummies: 

# 1. Create a new variable "X_train_ohe" and assign the subset of X_train that contains only the (three) categorical features that need to be encoded
# 2. As a second step, apply pd.get_dummies(X_train_ohe, dtype=int) and assign back to the variable 'X_train_ohe' to overwrite the entries of the dataframe.  
# 3. Remove (drop) the (three) categorical features from X_train (you may need to set the axis accordingly!). Drop with inplace=True OR with assignment back to X_train
# 4. Use .join() on X_train with X_train_ohe (contains the result of one hot encoding). Assign back to X_train to overwrite the entries. 
# Preview the first few rows of X_train. Did your changes go through? 



In [None]:
# Similarly, apply one-hot-encoding (OHE) to the categorical features of *X_test* using pd.get_dummies: 



In [None]:
# Similarly, apply one-hot-encoding (OHE) to the categorical features of *unknownData* using pd.get_dummies: 




#### Check for correlations in the data - Pearson Correlation Heatmap

Let's look at the correlations among the numerical variables in our dataset. This information is important to know as there are Machine Learning algorithms (for example, linear and logistic regression) that do not handle highly correlated input variables well.

First, we will use the method [`corr()`](https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.corr.html) on a `DataFrame` that calculates the correlation between each pair of features. Then, we pass the resulting *correlation matrix* to [`heatmap()`](https://seaborn.pydata.org/generated/seaborn.heatmap.html) from `seaborn`, which renders a color-coded matrix for the provided values:
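
Besides plotting, the correlation matrix can also be scanned programmatically. The following sketch uses synthetic data, with the 0.9 threshold and all names chosen purely for illustration, and lists the feature pairs whose absolute Pearson correlation exceeds the threshold:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": 2 * a + rng.normal(scale=0.1, size=200),  # almost a linear copy of "a"
    "c": rng.normal(size=200),                      # unrelated noise
})

corr_matrix = toy.corr()  # Pearson correlation by default

# Walk over the upper triangle so that each pair is reported once
high_pairs = [
    (c1, c2, round(corr_matrix.loc[c1, c2], 3))
    for i, c1 in enumerate(corr_matrix.columns)
    for c2 in corr_matrix.columns[i + 1:]
    if abs(corr_matrix.loc[c1, c2]) > 0.9
]
print(high_pairs)
```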

In [None]:
# Calculate the correlation matrix of X_train DataFrame using the .corr() function and 
# save it in a variable called 'corr_matrix'. Then pass the corr_matrix to the sns.heatmap() function for plotting. 
# Optional: you can use f, ax = plt.subplots(figsize=(12, 8))  
# Optional sns.heatmap arguments : annot=True,  annot_kws={'size': 8} and cmap="Spectral_r" 

########## FILL IN YOUR SOLUTION HERE ########## 



### Check the distributions

#### What are the ranges of the various features? 

In [None]:
# Apply the .describe() function on the X_train



In [None]:
# Visualise the X_train data in a seaborn boxplot. Optional: set the plt.figure(figsize=(15,10))



### Scaling

Most Machine Learning algorithms are not scale invariant, so it is highly recommended to scale your data. For example, scale each attribute on the input vector X_train to [0,1] or [-1,+1], or standardize it to have mean 0 and variance 1. Note that the same scaling must be applied to the test vector to obtain meaningful results.
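
A minimal sketch of this fit-on-train, transform-on-test pattern with invented values is shown below. Passing `index=` when rebuilding the DataFrame is an optional addition that keeps the original row labels, which matters if the scaled features are later joined with the labels:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Invented data with non-consecutive row labels, as left behind by a random split
train = pd.DataFrame({"Age": [20.0, 40.0, 60.0], "Fare": [10.0, 20.0, 30.0]},
                     index=[5, 2, 9])
test = pd.DataFrame({"Age": [40.0], "Fare": [40.0]}, index=[7])

scaler = StandardScaler()

# Learn mean and standard deviation from the training data only
train_scaled = pd.DataFrame(scaler.fit_transform(train),
                            columns=train.columns, index=train.index)

# Reuse the training statistics for the test data (no refitting)
test_scaled = pd.DataFrame(scaler.transform(test),
                           columns=test.columns, index=test.index)

print(train_scaled)
print(test_scaled)
```

Note that the scaled test values are not guaranteed to have mean 0 and variance 1, since they are expressed relative to the training statistics.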

In [None]:
# 1. Instantiate the StandardScaler() or MinMaxScaler(). Store into a variable named "scaler" 

# 2. Fit the scaler ONLY on the X_train data - Use the scaler.fit_transform() on the X_train set 
# You will also need to convert the results of scaler.fit_transform() back to a pandas DataFrame by calling pd.DataFrame() with
# columns=X_train.columns. Assign the result back to the variable X_train. 

# 3. Transform (do not fit!) the test dataset using the (fitted) scaler. Use the scaler.transform() on the X_test. 
# You will also need to convert the results of scaler.transform() back to a pandas DataFrame by calling pd.DataFrame() with
# columns=X_test.columns. Assign the result back to the variable X_test.  

# 4. Transform (do not fit!) the unknown dataset using the (fitted) scaler. Use the scaler.transform() on the unknownData. 
# You will also need to convert the results of scaler.transform() back to a pandas DataFrame by calling pd.DataFrame() with
# columns=unknownData.columns. Assign the result back to the variable unknownData. 



In [None]:
# Run once more the boxplot on the scaled X_train data


### K-means clustering on the results of PCA 

#### 1. PCA 

In scikit-learn, PCA is implemented as a transformer object that learns `n` components in its `fit()` method and can then be used to project new data onto these components. More information on how to use the `PCA` class and its parameters can be found at http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
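
As a rough sketch on synthetic data (the array `X_toy` and its construction are invented for illustration), passing a float between 0 and 1 as `n_components` asks PCA to keep as many components as are needed to explain that share of the variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 2))
# Six observed features that are noisy linear mixtures of two latent factors
X_toy = latent @ rng.normal(size=(2, 6)) + 0.05 * rng.normal(size=(200, 6))

# Keep as many components as needed to explain 90% of the variance
pca = PCA(n_components=0.90)
pc_scores = pca.fit_transform(X_toy)

cumulative = np.cumsum(pca.explained_variance_ratio_)
print("components kept:", pca.n_components_)
print("cumulative explained variance:", cumulative.round(3))
```

Because the six features are driven by only two underlying factors, very few components are enough to cross the 90% threshold here.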

In [None]:
# 1) Instantiate the PCA() object by passing as argument the value 0.90 (the cut-off variance value we would like to reach) 
# and store in a new variable "pca"
# 2) Apply pca.fit_transform() on the X_train data and store into a new variable named "pc_scores"
# 3) Print the PCA cumulative variance 



In [None]:
# Create a feature name for each Principal Component (such as PC1, PC2, ...) up to the detected number of current dimensions

PCs = ['PC'+str(i+1) for i in range(pc_scores.shape[1])]

In [None]:
# Convert the pc_scores to a pandas DataFrame using pd.DataFrame() and pass as the columns argument the PCs list that was created in the previous step
# Assign back to pc_scores. Preview the first entries of pc_scores 



In [None]:
# Plot the first 2 Principal Components ("PC1" vs. "PC2") in a seaborn scatterplot 
# Optional/extra: Should you wish, you can join the y_train class label in the pc_scores and use it as hue in the scatterplot 



#### 2. K-means

#### Elbow method 

Elbow method: as the number of clusters k grows, the sum of squared errors (SSE, the sum of squared distances between each point and its assigned cluster centre) always decreases. The elbow point is where the SSE curve starts to bend, i.e. where adding another cluster no longer reduces the SSE substantially. The x-value of this point is considered a reasonable trade-off between error and number of clusters.
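
The idea can be sketched on synthetic data with three well-separated blobs (all names and values below are invented for illustration). The SSE of a fitted model is available in scikit-learn as the `inertia_` attribute of `KMeans`:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10]])
# Three blobs of 50 points each around the centres above
X_toy = np.vstack([c + rng.normal(scale=0.5, size=(50, 2)) for c in centers])

# Fit K-means for k = 1..7 and record the SSE (inertia_) of each model
sse = {}
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    sse[k] = km.inertia_

for k, value in sse.items():
    print(k, round(value, 1))
```

In this toy setting the SSE drops sharply up to k=3 and only marginally afterwards, so the bend in the curve sits at the true number of blobs. On real data the elbow is often less pronounced.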

In [None]:
# Can you run the elbow rule to find the optimal number of K for kmeans that we need to use on our pc_scores?? 


In [None]:
# 1) Instantiate a KMeans object with n_clusters equal to the **OPTIMAL k** as found above and store it in a variable named "kmodel" 
# 2) .fit() the kmodel model on the pc_scores(!!) data 
# 3) Use the .labels_ attribute on kmodel to get the assigned clusters and save the results in a new variable "cluster_assignment".
# 4) Save the output of cluster_assignment to pc_scores["KMCluster"]  
# 5) Preview the first few rows of pc_scores


In [None]:
# Plot the 2 Principal Components (PC scores) in a scatterplot using the "KMCluster" as hue



## Model, predict and solve

### Classifiers - Predict using the *subset* test set (not the unknown yet) 

In [None]:
# Create an empty DataFrame to store and compare all evaluation results (performance metrics) across all classifiers 

clf_results = pd.DataFrame([])

#### 1a. Support Vector Machine (RBF) - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the SVC() classifier using the default parameters (default hyperparameters are C=1.0, kernel='rbf' and gamma='scale') 
# Assign the result into a new variable named "rbf_svm" 
# Step 2 - Fit the rbf_svm model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the rbf_svm model and assign to "y_pred_rbf_svm" 
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results of the benchmark SVM classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Benchmark RBF SVM', 
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_rbf_svm), 3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_rbf_svm), 3)})])

clf_results

#### 1b. Linear SVM - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the SVC classifier with a linear kernel. Assign the result into a new variable named "linear_svm" 
# Step 2 - Fit the linear_svm model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the linear_svm model and assign to "y_pred_linear_svm"
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results of the benchmark linear SVM classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Benchmark Linear SVM', 
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_linear_svm), 3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_linear_svm), 3)})])

clf_results

#### 1c. Optimal SVM - SVM hyperparameter tuning

Proper choice of C and gamma is critical to the SVM’s performance. **One is advised to use GridSearchCV with `C` and `gamma` spaced exponentially far apart to choose good values.** Detailed information on the SVM hyperparameters can be found at https://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html#sphx-glr-auto-examples-svm-plot-rbf-parameters-py 
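
A minimal sketch of such an exponentially spaced grid is given below, using synthetic data from `make_classification`. The specific ranges and the names `X_toy`, `y_toy` and `search` are illustrative assumptions rather than recommended values for the Titanic data:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic binary classification problem, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=5, random_state=0)

# Exponentially spaced candidates: each value is 10x the previous one
param_grid = {
    "C": np.logspace(-2, 2, 5),      # 0.01, 0.1, 1, 10, 100
    "gamma": np.logspace(-3, 1, 5),  # 0.001, 0.01, 0.1, 1, 10
    "kernel": ["rbf"],
}

search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X_toy, y_toy)

print(search.best_params_)
print("best cross-validated accuracy:", round(search.best_score_, 3))
```

A coarse grid like this one is typically followed by a finer grid around the best values found.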

As a first step, create a dictionary of hyperparameter ranges and conduct a grid or random search with cross-validation:

In [None]:
# Use GridSearchCV() with 5-fold or 10-fold cross-validation (cv=5 or cv=10) on the SVC() classifier 
# (more cv folds give a more reliable performance estimate but also increase the run time) 
# 1. Create the dictionary of hyperparameters for the SVM (SVC classifier) 
# 2. Set up the GridSearchCV and assign to a new variable named svm_grid_cv. 
# Optional: You can experiment with the scoring options (you need to check the documentation)
# 3. Fit svm_grid_cv to X_train and y_train 
# 4. Report the optimal parameters using 'svm_grid_cv.best_params_'



In [None]:
# 1. Build the (optimal) classifier using the optimal parameters detected by the tuning process. Assign to a new variable svm_opt. 
# Note: you can use either svm_grid_cv.best_estimator_ to retrieve the model or the svm_grid_cv.best_params_ from above into a new SVC()
# 2 - Fit the svm_opt model to the training set (X_train, y_train)
# 3 - Predict the test data (X_test) using the svm_opt model. Assign to a variable named y_pred_svm_opt
# 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results of the optimal SVM classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Tuned SVM', 
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_svm_opt), 3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_svm_opt), 3)})])

clf_results

#### 2a. KNN - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the KNeighborsClassifier() classifier using the default parameters  
# Assign the result into a new variable named "knn" 
# Step 2 - Fit the knn model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the knn model. Assign to y_pred_knn
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Use GridSearchCV() with 5-fold or 10-fold cross-validation (cv=5 or cv=10) on the KNeighborsClassifier() classifier 
# (more cv folds give a more reliable performance estimate but also increase the run time) 
# 1. Create the dictionary of hyperparameters for KNN
# 2. Set up the GridSearchCV and assign to a new variable named knn_grid_cv
# 3. Fit knn_grid_cv to X_train and y_train 
# 4. Report the optimal parameters



In [None]:
# 1 - Build the classifier using the optimal parameters detected by the tuning process
# Note: you can use knn_grid_cv.best_estimator_ to retrieve the optimal model. Assign to knn_opt.  
# 2 - Fit the knn_opt model to the training set (X_train, y_train)
# 3 - Predict the test data (X_test) using the knn_opt model. Assign to a variable named y_pred_knn_opt
# 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results of the optimal KNN classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Tuned KNN',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_knn_opt),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_knn_opt), 3)})])

clf_results

#### 3a. Random Forest - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the RandomForestClassifier(random_state=0) classifier using the default parameters  
# Assign the result into a new variable named "rf" 
# Step 2 - Fit the rf model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the rf model. Assign to y_pred_rf 
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Benchmark RF',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_rf),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_rf), 3)})])

clf_results

#### 3b. Optimal RF - tune the RF hyperparameters with RandomizedSearchCV( )

In [None]:
# Use RandomizedSearchCV() with 5-fold or 10-fold cross-validation (cv=5 or cv=10)
# (more cv folds give a more reliable performance estimate but also increase the run time) on the RandomForestClassifier(random_state=0) classifier 
# 1. Create the dictionary of hyperparameters for RF
# 2. Set up the **RandomizedSearchCV** and assign to a new variable named cv_rf
# 3. Fit cv_rf to X_train and y_train 
# 4. Report the optimal parameters




In [None]:
# 1 - Build the classifier using the optimal parameters detected by the tuning process.
# Note: you can use cv_rf.best_estimator_ to retrieve the optimal model. Assign to rf_opt.  
# 2 - Fit the rf_opt model to the training set (X_train, y_train)
# 3 - Predict the test data (X_test) using the rf_opt model. Assign to a variable named y_pred_rf_opt
# 4 - Print the final overall accuracy and classification_report for the test set (X_test) 



In [None]:
# Add the results of the optimal Random Forest classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Tuned Random Forest',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_rf_opt),3),  
                                                         'F1': round(metrics.f1_score(y_test, y_pred_rf_opt), 3)})])

clf_results

In [None]:
# Get the feature importance from the rf classifier using rf_opt.feature_importances_
# Cast it into a pd.DataFrame and use sort_values to sort by the importance 
# Plot the rf_opt.feature_importances_ in a barplot



#### 4. Decision Tree - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the DecisionTreeClassifier() classifier using the default parameters  
# Assign the result into a new variable named "dt" 
# Step 2 - Fit the dt model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the dt model. Assign to y_pred_dt
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 




In [None]:
##### (time-dependent): YOU CAN TUNE THE PARAMETERS OF DT TO FIND THE OPTIMAL MODEL #####

In [None]:
# Add the results of the benchmark Decision Tree classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Benchmark Decision Tree',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_dt),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_dt), 3)})])

clf_results

### Compare the models 

In [None]:
clf_results.sort_values(by='F1', ascending=False, inplace=True)
clf_results

#### (OPTIONAL) Other models: Logistic Regression - benchmark model (default parameters) 

Logistic Regression is a useful model to run early in the workflow. It models the relationship between the categorical dependent variable (the target) and one or more independent variables (features) by estimating probabilities using the logistic function, which is the cumulative distribution function of the logistic distribution. 
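
To make the role of the logistic function concrete, the sketch below fits a default `LogisticRegression` on synthetic binary data (all names are invented for illustration) and recomputes the predicted probabilities by hand from the linear scores. It assumes the binary case with default settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem, for illustration only
X_toy, y_toy = make_classification(n_samples=200, n_features=4, random_state=0)
model = LogisticRegression().fit(X_toy, y_toy)

# Linear score z = w.x + b for each sample
z = model.decision_function(X_toy)

# Logistic (sigmoid) function maps the score to a probability in (0, 1)
manual_proba = 1.0 / (1.0 + np.exp(-z))

# Probability of the positive class as reported by scikit-learn
sk_proba = model.predict_proba(X_toy)[:, 1]

print("max absolute difference:", np.abs(manual_proba - sk_proba).max())
```

A sample is assigned to the positive class when this probability exceeds 0.5, which is the same as the linear score being positive.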

In [None]:
# Step 1 - Instantiate the LogisticRegression() classifier using the default parameters  
# Assign the result into a new variable named "logreg" 
# Step 2 - Fit the logreg model to the training set (X_train, y_train)
# Step 3 - Predict the test data (X_test) using the logreg model 
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 

logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_logreg = logreg.predict(X_test)

print('Test set accuracy: ', round(metrics.accuracy_score(y_test, y_pred_logreg), 3))
print('\n', metrics.classification_report(y_test, y_pred_logreg))

In [None]:
##### YOU CAN TUNE THE PARAMETERS OF LOGISTIC REGRESSION TO FIND THE OPTIMAL MODEL #####

In [None]:
# Add the results of the Logistic Regression classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'LogisticRegression',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_logreg),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_logreg), 3)})])

clf_results

#### (OPTIONAL) Other models: ExtraTreesClassifier - benchmark model (default parameters)

In [None]:
# Step 1 - Instantiate the ExtraTreesClassifier() classifier using the default parameters  
# Assign the result into a new variable named "extc" 
# Step 2 - Fit the extc model to the training set (X_train)
# Step 3 - Predict the test data (X_test) using the extc model 
# Step 4 - Print the final overall accuracy and classification_report for the test set (X_test) 

extc = ExtraTreesClassifier()
extc.fit(X_train, y_train)
y_pred_extc = extc.predict(X_test)

print('Test set accuracy: ', round(metrics.accuracy_score(y_test, y_pred_extc), 3))
print('\n', metrics.classification_report(y_test, y_pred_extc))

In [None]:
## Search grid for optimal parameters
ex_param_grid = {"max_depth": [None],
                  "max_features": [1, 3, 10],
                  "min_samples_split": [2, 3, 10],
                  "min_samples_leaf": [1, 3, 10],
                  "bootstrap": [False],
                  "n_estimators" :[100,300],
                  "criterion": ["gini"]}


cv_extc = RandomizedSearchCV(ExtraTreesClassifier(random_state=0), 
                           param_distributions=ex_param_grid, 
                           n_iter = 50,
                           cv = 5,
                           n_jobs = -1,
                           random_state=0)

cv_extc.fit(X_train, y_train)

print('Best parameters found by randomized search: \n', cv_extc.best_params_)
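
It can help to know how large the search space is before choosing `n_iter`. The sketch below counts the combinations in a copy of the grid above using `ParameterGrid`; the name `ex_param_grid_demo` is introduced only for this example.

```python
from sklearn.model_selection import ParameterGrid

# Copy of the search space defined above, repeated so this block runs on its own
ex_param_grid_demo = {"max_depth": [None],
                      "max_features": [1, 3, 10],
                      "min_samples_split": [2, 3, 10],
                      "min_samples_leaf": [1, 3, 10],
                      "bootstrap": [False],
                      "n_estimators": [100, 300],
                      "criterion": ["gini"]}

# ParameterGrid enumerates every combination of the listed values
n_combinations = len(ParameterGrid(ex_param_grid_demo))
print('Number of parameter combinations:', n_combinations)
```

Because every entry is a list, `RandomizedSearchCV` samples without replacement, so `n_iter = 50` evaluates most of this space. An exhaustive `GridSearchCV` would cost only slightly more here.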

In [None]:
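# Build the classifier using the optimal parameters detected by the tuning process
# Note: best_estimator_ has already been refit on the whole training set (refit=True is the default),
# so the explicit fit below is redundant but harmless

extc_opt = cv_extc.best_estimator_
extc_opt.fit(X_train, y_train) 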
y_pred_extc_opt = extc_opt.predict(X_test)

print('Test set accuracy: ', round(metrics.accuracy_score(y_test, y_pred_extc_opt), 3))
print('\n', metrics.classification_report(y_test, y_pred_extc_opt))

In [None]:
# Get the feature importance from the extc_opt classifier using extc_opt.feature_importances_
# Cast it into a pd.DataFrame and use sort_values to sort by the importance 

feature_scores = pd.DataFrame(extc_opt.feature_importances_, 
                              index=X_train.columns, 
                              columns=['Importance'])
feature_scores.sort_values(by='Importance', ascending=False, inplace=True) 

# Plot the extc_opt.feature_importances_ in a barplot 
f, ax = plt.subplots(figsize=(30, 20))
ax = sns.barplot(x='Importance', y=feature_scores.index, data=feature_scores)
ax.set_title("Extra Trees feature importance", size = 20)
ax.set_yticklabels(feature_scores.index, size = 20)
ax.set_xlabel("Feature importance score", size = 20)
ax.set_ylabel("Features", size = 20)
plt.show()

In [None]:
# Add the results of the optimal Extra Trees classifier as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Extra Trees',
                                                         'Accuracy': round(metrics.accuracy_score(y_test, y_pred_extc_opt),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_extc_opt), 3)})])

clf_results

#### (OPTIONAL) Other models: Ensemble with Voting (heterogeneous models)

In [None]:
from sklearn.ensemble import VotingClassifier

# Build a Voting Ensemble model with some heterogeneous models with pre-defined parameters 
# Alternatively, you can use the combination of optimized (tuned) models 

ensemble = VotingClassifier(estimators=[('KNN', KNeighborsClassifier(n_neighbors=10)),
                                        ('RBF', SVC(probability=True, kernel='rbf', C=0.5, gamma=0.1)),
                                        ('RFor', RandomForestClassifier(n_estimators=500, random_state=0)),
                                        ('LR', LogisticRegression(C=0.05)),
                                        ('DT', DecisionTreeClassifier(random_state=0)),
                                        ('NB', GaussianNB()),
                                        ('svm', SVC(kernel='rbf', probability=True))
                                       ],
                            voting='soft')

ensemble.fit(X_train, y_train) 
y_pred_ensemble = ensemble.predict(X_test)

print('The accuracy for ensembled model is:', ensemble.score(X_test, y_test))
print('\n', metrics.classification_report(y_test, y_pred_ensemble))

In [None]:
##### YOU CAN TUNE THE PARAMETERS OF THE VOTING ENSEMBLE TO FIND THE OPTIMAL MODEL 
# OR COMBINE THE OPTIMAL MODELS FROM THE VARIOUS TUNING PROCESSES #####
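
When deciding how to configure the ensemble, it may help to see what `voting='soft'` does compared with `voting='hard'`. The sketch below reproduces both rules by hand with NumPy. The probabilities are made-up numbers for two passengers and three classifiers, and the hard-voting shortcut shown assumes a binary target.

```python
import numpy as np

# Hypothetical class probabilities [P(0), P(1)] per passenger (assumption)
proba_knn = np.array([[0.60, 0.40], [0.30, 0.70]])
proba_svc = np.array([[0.55, 0.45], [0.45, 0.55]])
proba_rf = np.array([[0.10, 0.90], [0.20, 0.80]])

# Shape: (n_classifiers, n_samples, n_classes)
stacked = np.stack([proba_knn, proba_svc, proba_rf])

# Soft voting: average the probabilities, then pick the most probable class
avg_proba = stacked.mean(axis=0)
soft_pred = avg_proba.argmax(axis=1)

# Hard voting: each classifier votes for its own most probable class,
# and the majority wins (this mean-based shortcut only works for 0/1 labels)
votes = stacked.argmax(axis=2)
hard_pred = (votes.mean(axis=0) > 0.5).astype(int)

print('Soft voting predictions:', soft_pred)
print('Hard voting predictions:', hard_pred)
```

For the first passenger the two rules disagree: two classifiers lean weakly toward class 0, but the third is confident about class 1, which dominates the averaged probability. This is why soft voting requires probability estimates from every member, hence `probability=True` on the `SVC` models.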

In [None]:
# Add the results of the Voting Ensemble model (heterogeneous) as a new row into clf_results that collects all the results across all models

clf_results = pd.concat([clf_results, pd.json_normalize({'Model': 'Voting Ensemble model (heterogeneous)',
                                                         'Accuracy': round(ensemble.score(X_test, y_test),3), 
                                                         'F1': round(metrics.f1_score(y_test, y_pred_ensemble), 3)})])

clf_results

#### Compare once more all the models

In [None]:
clf_results.sort_values(by='F1', ascending=False, inplace=True)
clf_results

### Classifiers - Predict the unknown test set

Using the patterns you find in train.csv, you have to predict whether the other 418 passengers on board (in test.csv) survived.

##### Case 1: use a single optimal (tuned) model 

Let's assume we want to use the optimal Random Forest to make the predictions on our final test (unknown) dataset: 

In [None]:
##  Can you use any optimal model to predict the classes of your unknownData? 
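
A minimal sketch of the prediction step, under stated assumptions. The toy frames `train_demo` and `unknown_demo`, the feature list, and the model `rf_demo` are hypothetical stand-ins for the notebook's preprocessed training data, the unknown test data, and whichever tuned model you pick. The two output columns, `PassengerId` and `Survived`, follow the Kaggle submission format.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
feature_cols = ["Pclass", "Sex", "Fare"]  # hypothetical, already-encoded features

# Toy training data standing in for the preprocessed train.csv (assumption)
train_demo = pd.DataFrame({"Pclass": rng.integers(1, 4, size=200),
                           "Sex": rng.integers(0, 2, size=200),
                           "Fare": rng.uniform(5, 100, size=200)})
y_train_demo = (train_demo["Sex"] == 1).astype(int)  # synthetic target

# Toy unknown data standing in for the preprocessed test.csv (assumption)
unknown_demo = pd.DataFrame({"PassengerId": np.arange(892, 902),
                             "Pclass": rng.integers(1, 4, size=10),
                             "Sex": rng.integers(0, 2, size=10),
                             "Fare": rng.uniform(5, 100, size=10)})

# Stand-in for the tuned model chosen in the previous sections
rf_demo = RandomForestClassifier(n_estimators=100, random_state=0)
rf_demo.fit(train_demo[feature_cols], y_train_demo)

# Predict using exactly the same feature columns the model was trained on
y_pred_unknown = rf_demo.predict(unknown_demo[feature_cols])

# Assemble the submission frame: one row per passenger
output = pd.DataFrame({"PassengerId": unknown_demo["PassengerId"],
                       "Survived": y_pred_unknown})
print(output.head())
```

The unknown data has to go through the same preprocessing (imputation, encoding, scaling) as the training data before prediction, with the columns in the same order.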


In [None]:
## Check the distribution of your y_pred to investigate how your model did 
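
One simple check is to count how many passengers were predicted in each class. The sketch below uses a small hard-coded series, `y_pred_demo`, as a stand-in for the real predictions.

```python
import pandas as pd

# Hypothetical predictions standing in for the model output (assumption)
y_pred_demo = pd.Series([0, 1, 0, 0, 1, 0, 0, 1, 0, 0], name="Survived")

# Absolute counts and relative shares of each predicted class
counts = y_pred_demo.value_counts()
shares = y_pred_demo.value_counts(normalize=True).round(2)

print(counts)
print(shares)
```

Roughly 38% of the passengers in the Kaggle training set survived, so a predicted share far from that figure may be worth a second look.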

In [None]:
# Save as csv file 
# output.to_csv('submission.csv', index=False)