# Titanic - Machine Learning from Disaster (ML Project)

## Overview:

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone onboard, resulting in the death of 1502 out of 2224 passengers and crew. While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others. In this challenge, we ask you to build a predictive model that answers the question: “what sorts of people were more likely to survive?” using passenger data (i.e., name, age, gender, socio-economic class, etc.).

## Goal:

Use machine learning to create a model that predicts which passengers in the test.csv dataset survived the Titanic shipwreck.

## ML Models Used:

Random Forest Classifier and SVM (linear kernel), preceded by exploratory data analysis (EDA)



In [133]:
import pandas as pd
import numpy as np

## Helper Functions:

In [134]:
def pc_print(number: float, rounding: int = 2) -> float:
    """Return `number` rounded to `rounding` decimal places (used to format percentages)."""
    return round(number, rounding)

## Datasets:

![Alt text](images/image.png)
![Alt text](images/image-1.png)

In [135]:
df_train = pd.read_csv('titanic/train.csv')
df_test = pd.read_csv('titanic/test.csv')
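df_gender = pd.read_csv('titanic/gender_submission.csv') # <- Kaggle's sample submission (all and only women survive), not the true labels

# Attach the sample-submission labels as a reference column; Kaggle does not publish the real test labels
df_test['Survived'] = df_gender['Survived']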

print("Number of rows in training set: {}".format(len(df_train)))
print("Number of rows in test set: {}".format(len(df_test)))


Number of rows in training set: 891
Number of rows in test set: 418


## EDA:

In [136]:
df_train.head(3)
df_train.describe()
df_train.info()
df_train.dtypes

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

In [137]:
df_test.head(3)
df_test.describe()
df_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  418 non-null    int64  
 1   Pclass       418 non-null    int64  
 2   Name         418 non-null    object 
 3   Sex          418 non-null    object 
 4   Age          332 non-null    float64
 5   SibSp        418 non-null    int64  
 6   Parch        418 non-null    int64  
 7   Ticket       418 non-null    object 
 8   Fare         417 non-null    float64
 9   Cabin        91 non-null     object 
 10  Embarked     418 non-null    object 
 11  Survived     418 non-null    int64  
dtypes: float64(2), int64(5), object(5)
memory usage: 39.3+ KB


In [138]:
# Rates of each gender that survived
women = df_train.loc[df_train['Sex'] == 'female']['Survived']
rate_women = sum(women) / len(women)
print(f"Rate of women who survived: {pc_print(rate_women * 100)}%")

men = df_train.loc[df_train['Sex'] == 'male']['Survived']
rate_men = sum(men) / len(men)
print(f"Rate of men who survived: {pc_print(rate_men * 100)}%")

# Number of records where the age is null
null_ages = df_train.loc[(df_train['Age'].isna())]
print(f"No. of Records with NaN Age: {len(null_ages)}")

# Records where passenger is under 18
under_18 = df_train.loc[df_train['Age'] < 18]
print(f"No. of Passengers under 18 years old: {len(under_18)}")

Rate of women who survived: 74.2%
Rate of men who survived: 18.89%
No. of Records with NaN Age: 177
No. of Passengers under 18 years old: 113
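
The per-group rates above can also be computed in a single step, because the mean of a 0/1 column equals the share of ones. The following is a minimal sketch on a small made-up frame; the rows in `toy` are illustrative assumptions and are not taken from train.csv, only the column names match.

```python
import pandas as pd

# Toy rows for illustration only; column names mirror train.csv.
toy = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Pclass": [1, 3, 1, 3, 3, 1],
    "Survived": [1, 0, 0, 0, 1, 1],
})

# The mean of a 0/1 column is the share of ones, i.e. the survival rate.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()

# Grouping by two keys gives one rate per (Sex, Pclass) combination.
rate_by_sex_class = toy.groupby(["Sex", "Pclass"])["Survived"].mean()

print((rate_by_sex * 100).round(2))
print((rate_by_sex_class * 100).round(2))
```

Applied to `df_train`, the same `groupby` call would likely make it easy to extend the comparison to other columns such as `Embarked`.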


## Preprocessing:

In [139]:
# Features to use for analysis
features = ['Pclass', 'Sex', 'SibSp', 'Parch', 'Embarked']

In [140]:
# Why is there such a wide range in fare prices within each class? It seems unrelated to whether the passenger has children/parents or siblings (grouped ticket). Perhaps it is due to the cabin number and the port of embarkation?
df_train.groupby(by=['Pclass'])['Fare'].describe()

Pclass,count,mean,std,min,25%,50%,75%,max
1,216.0,84.154687,78.380373,0.0,30.92395,60.2875,93.5,512.3292
2,184.0,20.662183,13.417399,0.0,13.0,14.25,26.0,73.5
3,491.0,13.67555,11.778142,0.0,7.75,8.05,15.5,69.55


In [141]:
# Split out surname from Name feature to better identify who is related
df_train['Surname'] = df_train['Name'].str.split(pat=',', expand=True)[0]
# Use the test set's own names here (the original took them from df_train by mistake)
df_test['Surname'] = df_test['Name'].str.split(pat=',', expand=True)[0]
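
As a quick check of the surname split, here is a small sketch on a few names written in the "Surname, Title. Given names" format that the `Name` column uses. The variable names `names` and `surnames` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# A few names in the "Surname, Title. Given names" format of the Name column.
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
])

# Everything before the first comma is the surname; strip stray whitespace.
surnames = names.str.split(",", n=1).str[0].str.strip()

print(surnames.tolist())
```

Splitting with `n=1` keeps any later commas inside the remainder of the name, so only the first comma decides where the surname ends.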

## Models:
Random Forest & SVM

In [142]:
from sklearn.preprocessing import MinMaxScaler

In [143]:
y = df_train['Survived']

# Perform one-hot encoding on the feature columns with output type int
x_train = pd.get_dummies(df_train[features], dtype='int')
x_test = pd.get_dummies(df_test[features], dtype='int')

# Fit the MinMaxScaler on the training data only, then apply the same scaling to the test data
scaler = MinMaxScaler()
x_train = scaler.fit_transform(x_train)
x_test = scaler.transform(x_test)
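
One way to keep the train and test matrices consistent is to align the one-hot columns and to learn the scaling from the training data alone. The sketch below uses two tiny made-up frames (`train`, `test`) as an assumption; it is meant to illustrate the pattern, not to reproduce the notebook's data.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Tiny made-up frames; the test frame lacks the "C" and "Q" categories.
train = pd.DataFrame({"Pclass": [1, 2, 3, 3], "Embarked": ["S", "C", "Q", "S"]})
test = pd.DataFrame({"Pclass": [3, 1], "Embarked": ["S", "S"]})

x_tr = pd.get_dummies(train, dtype="int")
# Reindex so the test frame has exactly the training columns, in the same order;
# categories missing from the test frame become all-zero columns.
x_te = pd.get_dummies(test, dtype="int").reindex(columns=x_tr.columns, fill_value=0)

scaler = MinMaxScaler()
x_tr_scaled = scaler.fit_transform(x_tr)  # learn min/max from the training data only
x_te_scaled = scaler.transform(x_te)      # reuse the training min/max

print(list(x_tr.columns))
print(x_te_scaled)
```

Fitting a second scaler on the test frame would map its constant `Embarked_S` column to zeros, so the same passenger would be encoded differently in the two sets.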

## Model 1: Random Forest Classifier

In [144]:
from sklearn.ensemble import RandomForestClassifier

# Train a random forest classifier using the training data
model = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
model.fit(x_train, y)

## Model 2: SVM

In [145]:
from sklearn.svm import SVC

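# Train a support vector machine using the training data.
# Note: this reassigns `model`, so the predictions below come from the SVM, not the random forest.
# `gamma` is ignored by the linear kernel, so it is omitted here.
model = SVC(kernel="linear", C=10)
model.fit(x_train, y)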
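
Because both model cells assign to `model`, only the last fitted estimator is available afterwards. A possible way to compare the two is to keep them under separate names and score each with cross-validation. The sketch below assumes synthetic data from `make_classification` as a stand-in for the encoded Titanic features; the dictionary keys are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the encoded features (assumption, not the real data).
X, y = make_classification(n_samples=200, n_features=8, random_state=1)

# Keep each estimator under its own name so neither overwrites the other.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1),
    "svm_linear": SVC(kernel="linear", C=10),
}

# Mean 5-fold cross-validated accuracy per model.
scores = {name: cross_val_score(m, X, y, cv=5).mean() for name, m in models.items()}

for name, score in scores.items():
    print(f"{name}: {score:.3f}")
```

With the real data, `X` and `y` would be replaced by `x_train` and `y` from the cells above.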

## Prediction:

In [146]:
# Predict survival for the test set with the most recently fitted model
predictions = model.predict(x_test)

# Parse output to a dataframe
output = pd.DataFrame({
    'PassengerId': df_test['PassengerId'],
    'Survived': predictions
})

In [147]:
# Save output dataframe to a csv
output.to_csv('submission.csv', index=False)
print("Submission Saved")
output.head()

Submission Saved


,PassengerId,Survived
0,892,0
1,893,1
2,894,0
3,895,0
4,896,1


## Metrics:

In [148]:
from sklearn import metrics

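# Measure agreement with the sample submission (gender_submission.csv). This is not true test accuracy:
# a score of 100% would only mean the model reproduces the "all and only women survive" baseline.
print(f"Accuracy: {metrics.accuracy_score(df_test['Survived'], predictions)*100}%")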

Accuracy: 100.0%
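
Since Kaggle does not publish the true test labels, a more informative estimate can be obtained by holding out part of the labelled training data. The following is a minimal sketch of that idea; it assumes synthetic data from `make_classification` in place of the encoded Titanic features, and the names `X_fit`, `X_val`, `clf` are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labelled training data (assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=8, random_state=1)

# Hold out 20% of the labelled rows; stratify keeps the class balance in both parts.
X_fit, X_val, y_fit, y_val = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=1
)

clf = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=1)
clf.fit(X_fit, y_fit)
val_pred = clf.predict(X_val)

# Accuracy and confusion matrix on rows the model never saw during fitting.
val_accuracy = accuracy_score(y_val, val_pred)
val_confusion = confusion_matrix(y_val, val_pred)

print(f"Validation accuracy: {val_accuracy * 100:.2f}%")
print(val_confusion)
```

The hold-out score is still only an estimate, but unlike a comparison with the sample submission it is measured against real labels.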


## Comparison to Example:

In [149]:
# Some additional metrics comparing to the provided example output file

print(f"Shape Match between Prediction and Example: {output.shape == df_gender.shape}")
print(f"PassengerId Match between Prediction and Example: {output['PassengerId'].equals(df_gender['PassengerId'])}")
print(f"Survived Match between Prediction and Example: {output['Survived'].equals(df_gender['Survived'])}")

print(f"Count Survived Prediction: {sum(output['Survived'])}")
print(f"Count Survived Example: {sum(df_gender['Survived'])}")

# Ratio of survivor counts, not accuracy: it ignores which passengers were predicted to survive
print(f"Survivor count ratio: {round(sum(output['Survived']) / sum(df_gender['Survived']), 4) * 100}%")

Shape Match between Prediction and Example: True
PassengerId Match between Prediction and Example: True
Survived Match between Prediction and Example: True
Count Survived Prediction: 152
Count Survived Example: 152
Survivor count ratio: 100.0%
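
A ratio of survivor counts and an element-wise accuracy can disagree completely, which is why the two should not be confused. The sketch below uses two short made-up label arrays (an assumption for illustration) in which every prediction is wrong but the survivor counts still match.

```python
import numpy as np

# Made-up labels: every prediction is wrong, yet both arrays contain two survivors.
example = np.array([1, 1, 0, 0])
prediction = np.array([0, 0, 1, 1])

# Ratio of totals: compares only how many survivors were predicted.
count_ratio = prediction.sum() / example.sum()

# Element-wise accuracy: share of passengers whose label matches.
accuracy = (prediction == example).mean()

print(f"Count ratio: {count_ratio * 100:.1f}%")
print(f"Accuracy: {accuracy * 100:.1f}%")
```

Here the count ratio is 100% while the accuracy is 0%, so only the element-wise comparison says anything about individual predictions.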
