## Introduction

The sinking of the RMS Titanic is one of the most infamous shipwrecks in history. On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1502 out of 2224 passengers and crew. This sensational tragedy shocked the international community and led to better safety regulations for ships.

One reason the shipwreck caused such loss of life was that there were not enough lifeboats for the passengers and crew. Although surviving the sinking involved some luck, certain groups were more likely to survive than others, such as women, children, and the upper class.

In this challenge, we ask you to complete the analysis of what sorts of people were likely to survive. In particular, we ask you to apply the tools of machine learning to predict which passengers survived the tragedy.

## Library Imports

We use pandas for data manipulation and NumPy for numerical arrays.

In [1]:
import pandas as pd
import numpy as np

print('Library Imports Successful!')

Library Imports Successful!


## Loading Data from Google Drive

In [3]:
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


## Dataset Imports

In [4]:
train_set = pd.read_csv('/content/gdrive/My Drive/ML Datasets/titanic/train.csv')
test_set = pd.read_csv('/content/gdrive/My Drive/ML Datasets/titanic/test.csv')
# Both datasets are now stored as pandas DataFrames

print('Data Imports Successful!')

Data Imports Successful!


## Data Concatenation

In [5]:
#checking the columns in training and testing datasets
print(train_set.columns)
print("-"*40)
print(test_set.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
----------------------------------------
Index(['PassengerId', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp', 'Parch',
       'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
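The two column listings differ only in `Survived`, which is the target and exists only in the training data. As a minimal sketch, the difference can be checked programmatically instead of by eye. The frames `train_demo` and `test_demo` below are tiny hypothetical stand-ins, not the real Titanic files.

```python
import pandas as pd

# Hypothetical stand-in frames; the real notebook uses train_set and test_set.
train_demo = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "Pclass": [3, 1]})
test_demo = pd.DataFrame({"PassengerId": [3, 4], "Pclass": [2, 3]})

# Index.difference returns the labels present in one index but not the other.
only_in_train = train_demo.columns.difference(test_demo.columns)
only_in_test = test_demo.columns.difference(train_demo.columns)

print(list(only_in_train))  # ['Survived']
print(list(only_in_test))   # []
```

Applied to the real data, the same two calls would be expected to report `Survived` as the only column missing from the test set.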


Next we store our target column in a variable and drop this target column from the training dataset.

In [0]:
survived = train_set['Survived']
y = train_set.drop(['Survived'], axis = 1)

Now we concatenate the rows of the train and test datasets so that both receive the same preprocessing.

In [0]:
final_data = pd.concat([y, test_set], axis = 0)

Checking the number of rows and columns in our dataset.

In [8]:
final_data.shape

(1309, 11)

## Data Cleaning

In [9]:
final_data.isnull().sum()

PassengerId       0
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

## Missing Value Treatment

In [10]:
final_data['Age'] = final_data['Age'].fillna(final_data['Age'].mean())
final_data['Fare'] = final_data['Fare'].fillna(final_data['Fare'].mean())
final_data['Embarked'] = final_data['Embarked'].fillna('S')

final_data['Cabin'] = final_data['Cabin'].notnull()
final_data['Cabin'].head(10)

0    False
1     True
2    False
3     True
4    False
5    False
6     True
7    False
8    False
9    False
Name: Cabin, dtype: bool

In [11]:
final_data.isnull().sum()

PassengerId    0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Cabin          0
Embarked       0
dtype: int64
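The treatment above fills numeric gaps with the column mean, fills the categorical gap with a fixed value, and turns `Cabin` into a "was a cabin recorded" flag. The sketch below replays those steps on a small hypothetical frame (`demo`). Using `mode()` for `Embarked` is an assumption made for illustration; the notebook itself hard-codes `'S'`.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature frame with the same kinds of gaps as the real data.
demo = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 30.0],
    "Embarked": ["S", None, "S", "C"],
    "Cabin": ["C85", None, None, "E46"],
})

# Numeric column: replace missing values with the mean of the observed ones (26.0 here).
demo["Age"] = demo["Age"].fillna(demo["Age"].mean())

# Categorical column: replace missing values with the most frequent category ('S' here).
demo["Embarked"] = demo["Embarked"].fillna(demo["Embarked"].mode()[0])

# Mostly-missing column: keep only whether a value was recorded at all.
demo["Cabin"] = demo["Cabin"].notnull()

print(demo)
print(demo.isnull().sum())
```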

In [0]:
final_data = final_data.drop(['PassengerId', 'Ticket'], axis = 1)

## Feature Engineering

In [13]:
final_data.head(10)

| | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | 7.25 | False | S |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | 71.2833 | True | C |
| 2 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | 7.925 | False | S |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 53.1 | True | S |
| 4 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 8.05 | False | S |
| 5 | 3 | Moran, Mr. James | male | 29.881138 | 0 | 0 | 8.4583 | False | Q |
| 6 | 1 | McCarthy, Mr. Timothy J | male | 54.0 | 0 | 0 | 51.8625 | True | S |
| 7 | 3 | Palsson, Master. Gosta Leonard | male | 2.0 | 3 | 1 | 21.075 | False | S |
| 8 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 27.0 | 0 | 2 | 11.1333 | False | S |
| 9 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 14.0 | 1 | 0 | 30.0708 | False | C |


In [14]:
final_data['Title'] = final_data['Name'].apply(lambda x : x.split(".")[0].split(",")[1].strip())

np.unique(final_data['Title'])

array(['Capt', 'Col', 'Don', 'Dona', 'Dr', 'Jonkheer', 'Lady', 'Major',
       'Master', 'Miss', 'Mlle', 'Mme', 'Mr', 'Mrs', 'Ms', 'Rev', 'Sir',
       'the Countess'], dtype=object)

In [0]:
final_data['Title'] = final_data['Title'].replace('Mlle','Miss')
final_data['Title'] = final_data['Title'].replace('Mme','Mrs')
final_data['Title'] = final_data['Title'].replace('Miss','Ms')

s = ['Major', 'Lady', 'Sir','the Countess', 'Jonkheer', 'Don','Dona', 'Rev', 'Dr','Capt','Col']

final_data['Title'] =  final_data['Title'].replace(s,'Special')

In [16]:
np.unique(final_data['Title'])

array(['Master', 'Mr', 'Mrs', 'Ms', 'Special'], dtype=object)
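The title sits between the comma and the first period of each name. As an alternative sketch to the `split`-based lambda, the same extraction can be written with a regular expression, and the grouping can be done with one mapping. The pattern and the `title_map` dictionary are assumptions for illustration; the sample names are taken from the table above. Because a dictionary replacement is applied in a single pass, `Mlle` is mapped straight to `Ms` here, which matches the end result of the chained `replace` calls.

```python
import pandas as pd

names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Futrelle, Mrs. Jacques Heath (Lily May Peel)",
    "Palsson, Master. Gosta Leonard",
])

# Capture everything between the comma and the first period, then trim spaces.
titles = names.str.extract(r",\s*([^.]+)\.", expand=False).str.strip()

# Hypothetical single-pass mapping equivalent to the chained replacements.
title_map = {"Mlle": "Ms", "Miss": "Ms", "Mme": "Mrs"}
grouped = titles.replace(title_map)

print(titles.tolist())   # ['Mr', 'Miss', 'Mrs', 'Master']
print(grouped.tolist())  # ['Mr', 'Ms', 'Mrs', 'Master']
```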

## Data Binning & More Cleaning

In [0]:
final_data['Age'] = pd.qcut(final_data['Age'], q=4, labels = False)
final_data['Fare'] = pd.qcut(final_data['Fare'], q=4, labels = False)
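`pd.qcut` with `q=4` splits a column at its quartiles, so each bin receives roughly the same number of rows, and `labels=False` returns the bin number (0 to 3) instead of an interval. The sketch below shows this on eight ages taken from the first rows of the table above; `retbins=True` is added only to expose the computed bin edges.

```python
import pandas as pd

# Eight known ages from the preview table (the imputed value is left out).
ages = pd.Series([22.0, 38.0, 26.0, 35.0, 54.0, 2.0, 27.0, 14.0])

# Quartile binning: integer codes per row plus the edges that were used.
age_bins, edges = pd.qcut(ages, q=4, labels=False, retbins=True)

print(age_bins.tolist())                         # [1, 3, 1, 2, 3, 0, 2, 0]
print(age_bins.value_counts().sort_index().tolist())  # [2, 2, 2, 2]
print(edges)                                     # five edges delimit four bins
```

On the full dataset the edges differ, since they depend on the distribution of all 1309 rows.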

In [18]:
final_data.head(10)

| | Pclass | Name | Sex | Age | SibSp | Parch | Fare | Cabin | Embarked | Title |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | Braund, Mr. Owen Harris | male | 0 | 1 | 0 | 0 | False | S | Mr |
| 1 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 3 | 1 | 0 | 3 | True | C | Mrs |
| 2 | 3 | Heikkinen, Miss. Laina | female | 1 | 0 | 0 | 1 | False | S | Ms |
| 3 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 2 | 1 | 0 | 3 | True | S | Mrs |
| 4 | 3 | Allen, Mr. William Henry | male | 2 | 0 | 0 | 1 | False | S | Mr |
| 5 | 3 | Moran, Mr. James | male | 1 | 0 | 0 | 1 | False | Q | Mr |
| 6 | 1 | McCarthy, Mr. Timothy J | male | 3 | 0 | 0 | 3 | True | S | Mr |
| 7 | 3 | Palsson, Master. Gosta Leonard | male | 0 | 3 | 1 | 2 | False | S | Master |
| 8 | 3 | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) | female | 1 | 0 | 2 | 1 | False | S | Mrs |
| 9 | 2 | Nasser, Mrs. Nicholas (Adele Achem) | female | 0 | 1 | 0 | 2 | False | C | Mrs |


In [0]:
final_data = final_data.drop(['Name'], axis = 1)

In [0]:
final_data['All_People_Traveling'] = final_data['SibSp'] + final_data['Parch']

In [21]:
final_data.columns

Index(['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Cabin', 'Embarked',
       'Title', 'All_People_Traveling'],
      dtype='object')

## Data Encoding

Next we convert the categorical columns into one-hot encoded format.

In [29]:
df_dum = pd.get_dummies(final_data, columns = ['Sex','Age','Cabin','Title','Embarked'])
df_dum.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1309 entries, 0 to 417
Data columns (total 21 columns):
Pclass                  1309 non-null int64
SibSp                   1309 non-null int64
Parch                   1309 non-null int64
Fare                    1309 non-null int64
All_People_Traveling    1309 non-null int64
Sex_female              1309 non-null uint8
Sex_male                1309 non-null uint8
Age_0                   1309 non-null uint8
Age_1                   1309 non-null uint8
Age_2                   1309 non-null uint8
Age_3                   1309 non-null uint8
Cabin_False             1309 non-null uint8
Cabin_True              1309 non-null uint8
Title_Master            1309 non-null uint8
Title_Mr                1309 non-null uint8
Title_Mrs               1309 non-null uint8
Title_Ms                1309 non-null uint8
Title_Special           1309 non-null uint8
Embarked_C              1309 non-null uint8
Embarked_Q              1309 non-null uint8
Embarked_S              1309 non-null uint8

In [30]:
df_dum.head(10)

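| | Pclass | SibSp | Parch | Fare | All_People_Traveling | Sex_female | Sex_male | Age_0 | Age_1 | Age_2 | Age_3 | Cabin_False | Cabin_True | Title_Master | Title_Mr | Title_Mrs | Title_Ms | Title_Special | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 1 | 1 | 1 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |
| 2 | 3 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
| 3 | 1 | 1 | 0 | 3 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 4 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 5 | 3 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 |
| 6 | 1 | 0 | 0 | 3 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| 7 | 3 | 3 | 1 | 2 | 4 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| 8 | 3 | 0 | 2 | 1 | 2 | 1 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 |
| 9 | 2 | 1 | 0 | 2 | 1 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 |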
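As a minimal illustration of what `pd.get_dummies` does with the `columns` argument, the sketch below encodes one categorical column of a small hypothetical frame (`demo`). Columns not listed are left untouched and placed first; each category of a listed column becomes its own indicator column. Note that the indicator dtype depends on the pandas version (`uint8` in the output above, `bool` in pandas 2.x), so the example casts to `int` before printing.

```python
import pandas as pd

# Hypothetical miniature frame; only 'Sex' is one-hot encoded.
demo = pd.DataFrame({"Sex": ["male", "female", "female"], "Pclass": [3, 1, 3]})

encoded = pd.get_dummies(demo, columns=["Sex"])

print(encoded.columns.tolist())               # ['Pclass', 'Sex_female', 'Sex_male']
print(encoded["Sex_male"].astype(int).tolist())  # [1, 0, 0]
```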


## Train-Test Data Split

In [0]:
Train_X = df_dum[:891]
Train_Y = survived[:891]

Test_X = df_dum[891:]
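The split relies on the training rows coming first in the concatenated frame, so slicing at the training length (891 here) recovers both parts. A hedged sketch of the same idea without a hard-coded number is shown below on hypothetical frames. It uses `iloc` because `pd.concat` keeps the original row labels, so labels repeat and only positional slicing is unambiguous.

```python
import pandas as pd

# Hypothetical stand-ins for the training and test features.
train_demo = pd.DataFrame({"x": [1, 2, 3]})
test_demo = pd.DataFrame({"x": [4, 5]})

combined = pd.concat([train_demo, test_demo], axis=0)
print(combined.index.tolist())  # [0, 1, 2, 0, 1] (labels repeat after concat)

# Split by position, using the training length instead of a literal.
n_train = len(train_demo)
train_part = combined.iloc[:n_train]
test_part = combined.iloc[n_train:]

print(train_part["x"].tolist())  # [1, 2, 3]
print(test_part["x"].tolist())   # [4, 5]
```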

## Algorithm

In [0]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

randomforest = RandomForestClassifier(n_estimators = 100)
randomforest.fit(Train_X, Train_Y)

y_pred_rf = randomforest.predict(Test_X)
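# classification_report needs the true labels, but test.csv has no 'Survived'
# column (there is no Test_Y), so the call below stays disabled.
# print(classification_report(Test_Y, y_pred_rf))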
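Since the test set carries no labels, one common way to estimate accuracy is cross-validation on the training data. The sketch below shows the mechanics with scikit-learn's `cross_val_score` on synthetic data. `X_demo` and `y_demo` are made up for illustration, so the scores say nothing about the Titanic model. In the notebook, `Train_X` and `Train_Y` would take their place.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic binary features; the label simply copies the first feature,
# so a forest should learn it almost perfectly.
rng = np.random.default_rng(0)
X_demo = rng.integers(0, 2, size=(200, 5))
y_demo = X_demo[:, 0]

model = RandomForestClassifier(n_estimators=100, random_state=0)

# 5-fold cross-validation: fit on four folds, score on the held-out fold.
scores = cross_val_score(model, X_demo, y_demo, cv=5, scoring="accuracy")

print(scores)
print(scores.mean())
```

Fixing `random_state` is a choice made here only to keep the sketch reproducible; the notebook's forest does not set it.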

In [36]:
y_pred_rf

array([0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1,
       1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 1, 0, 1,
       1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1,
       1, 0, 0, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 0,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0,
       0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 1,
       1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 1,
       0, 0, 1, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0,
       1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, 1,
       1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0,