# **Titanic - Machine Learning from Disaster**

This is part 1 of solving the Titanic competition from Kaggle. In this part we will:

1. Study the dataset provided by the competition: https://www.kaggle.com/competitions/titanic/data
2. Preprocess the data by handling null values and dropping unneeded columns.
3. Engineer a feature by extracting titles from the passengers' names.
4. Finalize the data so that it is ready for training machine learning algorithms.





Importing the **Pandas** library

Pandas is one of the most popular and widely used data science tools for manipulating data. Its official documentation is at https://pandas.pydata.org/

Load the CSV datasets into DataFrames

In [None]:
import pandas as pd

train_data = pd.read_csv('train.csv')
test_data = pd.read_csv('test.csv')

In [None]:
train_data.head()

Unnamed: 0,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [None]:
test_data.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,892,3,"Kelly, Mr. James",male,34.5,0,0,330911,7.8292,,Q
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,363272,7.0,,S
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,240276,9.6875,,Q
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,315154,8.6625,,S
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,3101298,12.2875,,S


We remove columns that add little value for training. `PassengerId` is kept in the test data because the Kaggle submission file is keyed by it.

In [None]:
train_data_1 = train_data.drop(['PassengerId', 'Ticket', 'Cabin'], axis=1)
test_data_1 = test_data.drop(['Ticket', 'Cabin'], axis=1)

In [None]:
train_data_1.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,7.25,S
1,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,71.2833,C
2,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,7.925,S
3,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,53.1,S
4,0,3,"Allen, Mr. William Henry",male,35.0,0,0,8.05,S


In [None]:
train_data_1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 9 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Name      891 non-null    object 
 3   Sex       891 non-null    object 
 4   Age       714 non-null    float64
 5   SibSp     891 non-null    int64  
 6   Parch     891 non-null    int64  
 7   Fare      891 non-null    float64
 8   Embarked  889 non-null    object 
dtypes: float64(2), int64(4), object(3)
memory usage: 62.8+ KB


Check whether any column contains null values

In [None]:
train_data_1.isnull().sum()

Unnamed: 0,0
Survived,0
Pclass,0
Name,0
Sex,0
Age,177
SibSp,0
Parch,0
Fare,0
Embarked,2


In [None]:
test_data_1.isnull().sum()

Unnamed: 0,0
PassengerId,0
Pclass,0
Name,0
Sex,0
Age,86
SibSp,0
Parch,0
Fare,1
Embarked,0
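
Raw null counts can be easier to judge as a share of the rows. The following is a minimal sketch on a small made-up frame (`sample` is hypothetical and only stands in for the competition data):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, not the real Titanic data
sample = pd.DataFrame({
    'Age': [22.0, np.nan, 26.0, np.nan],
    'Fare': [7.25, 71.2833, 7.925, 53.1],
    'Embarked': ['S', 'C', None, 'S'],
})

# isnull() marks missing cells; mean() turns the marks into a fraction per column
missing_share = sample.isnull().mean()
print(missing_share)
```

The same expression should work on `train_data_1` and `test_data_1` unchanged.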


Fill the missing Embarked values in the train dataset with the most frequent value (the mode)

In [None]:
train_data_1['Embarked'] = train_data_1['Embarked'].fillna(train_data_1['Embarked'].mode()[0])

Summary statistics for the train dataset

In [None]:
train_data_1.describe()

Unnamed: 0,Survived,Pclass,Age,SibSp,Parch,Fare
count,891.0,891.0,714.0,891.0,891.0,891.0
mean,0.383838,2.308642,29.699118,0.523008,0.381594,32.204208
std,0.486592,0.836071,14.526497,1.102743,0.806057,49.693429
min,0.0,1.0,0.42,0.0,0.0,0.0
25%,0.0,2.0,20.125,0.0,0.0,7.9104
50%,0.0,3.0,28.0,0.0,0.0,14.4542
75%,1.0,3.0,38.0,1.0,0.0,31.0
max,1.0,3.0,80.0,8.0,6.0,512.3292


Fill the missing Age and Fare values with the median of each column

In [None]:
train_data_1['Age'] = train_data_1['Age'].fillna(train_data_1['Age'].median())
test_data_1['Age'] = test_data_1['Age'].fillna(test_data_1['Age'].median())
test_data_1['Fare'] = test_data_1['Fare'].fillna(test_data_1['Fare'].median())
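
The cell above fills each dataset with its own median. A common alternative is to compute the statistic on the training data only and reuse it for the test data, so that both are imputed with the same value. A minimal sketch on made-up frames (`train_sample` and `test_sample` are hypothetical):

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames standing in for the train and test data
train_sample = pd.DataFrame({'Age': [20.0, 30.0, np.nan, 40.0]})
test_sample = pd.DataFrame({'Age': [np.nan, 50.0]})

# median() skips NaN by default, so this is the median of 20, 30 and 40
age_median = train_sample['Age'].median()

# Reuse the training median for both frames
train_filled = train_sample.assign(Age=train_sample['Age'].fillna(age_median))
test_filled = test_sample.assign(Age=test_sample['Age'].fillna(age_median))
print(test_filled)
```

Whether this changes the final score on this dataset is not something the notebook measures; it is shown only as a design option.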

## Feature Engineering

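Extract the title from each name with a regular expression. The pattern captures the run of letters that follows a space and ends with a period, for example `Mr` in `Braund, Mr. Owen Harris`. A raw string (`r'...'`) is used so that `\.` is passed to the regex engine unchanged.

In [None]:
train_data_1['Title'] = train_data_1['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)

In [None]:
test_data_1['Title'] = test_data_1['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)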
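
To see what the pattern captures, it can be tried on a few names taken from the tables above. This is a small standalone sketch; the `names` series is built by hand rather than read from the CSV files.

```python
import pandas as pd

# Three names copied from the train_data.head() output
names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Futrelle, Mrs. Jacques Heath (Lily May Peel)',
])

# Capture the letters between a space and the first following period
titles = names.str.extract(r' ([A-Za-z]+)\.', expand=False)
print(titles.tolist())
```

With `expand=False` and a single capture group, `str.extract` returns a Series, which is why the result can be assigned directly to a new column.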

In [None]:
test_data_1.head()

Unnamed: 0,PassengerId,Pclass,Name,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,892,3,"Kelly, Mr. James",male,34.5,0,0,7.8292,Q,Mr
1,893,3,"Wilkes, Mrs. James (Ellen Needs)",female,47.0,1,0,7.0,S,Mrs
2,894,2,"Myles, Mr. Thomas Francis",male,62.0,0,0,9.6875,Q,Mr
3,895,3,"Wirz, Mr. Albert",male,27.0,0,0,8.6625,S,Mr
4,896,3,"Hirvonen, Mrs. Alexander (Helga E Lindqvist)",female,22.0,1,1,12.2875,S,Mrs


In [None]:
train_data_2 = train_data_1.drop('Name', axis=1)
test_data_2 = test_data_1.drop('Name', axis=1)

In [None]:
train_data_2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,male,22.0,1,0,7.25,S,Mr
1,1,1,female,38.0,1,0,71.2833,C,Mrs
2,1,3,female,26.0,0,0,7.925,S,Miss
3,1,1,female,35.0,1,0,53.1,S,Mrs
4,0,3,male,35.0,0,0,8.05,S,Mr


Convert the non-numeric values to numbers so that they can be used to train ML models

In [None]:
train_data_2['Sex'] = train_data_2['Sex'].map({'male': 0, 'female': 1})
test_data_2['Sex'] = test_data_2['Sex'].map({'male': 0, 'female': 1})

train_data_2['Embarked'] = train_data_2['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
test_data_2['Embarked'] = test_data_2['Embarked'].map({'S': 0, 'C': 1, 'Q': 2})
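
One thing to keep in mind with `Series.map` and a dictionary is that any value missing from the dictionary becomes `NaN` without raising an error. A minimal sketch of a check for that, using a hand-made series in which `'X'` is a made-up code that does not occur in the competition data:

```python
import pandas as pd

# 'X' is a hypothetical unseen category used only to trigger the NaN case
embarked = pd.Series(['S', 'C', 'Q', 'X'])
mapped = embarked.map({'S': 0, 'C': 1, 'Q': 2})

# Values that had no entry in the mapping end up as NaN
unmapped = embarked[mapped.isna()].unique().tolist()
print(unmapped)
```

Running the same check after each mapping step is a cheap way to catch a category that was left out of the dictionary.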

Group the less common titles into a single 'Rare' category, then count the titles in each dataset

In [None]:
train_data_2['Title'] = train_data_2['Title'].replace(['Dr', 'Rev', 'Col', 'Mlle', 'Major', 'Ms', 'Mme', 'Don', 'Lady', 'Sir', 'Capt', 'Countess', 'Jonkheer'], 'Rare')
test_data_2['Title'] = test_data_2['Title'].replace(['Dr', 'Rev', 'Col', 'Mlle', 'Major', 'Ms', 'Mme', 'Don', 'Lady', 'Sir', 'Capt', 'Countess', 'Jonkheer', 'Dona'], 'Rare')

In [None]:
train_data_2['Title'].value_counts()

Title,count
Mr,517
Miss,182
Mrs,125
Master,40
Rare,27


In [None]:
test_data_2['Title'].value_counts()

Title,count
Mr,240
Miss,78
Mrs,72
Master,21
Rare,7
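
The lists above enumerate every uncommon title by hand, and the test list needs the extra `'Dona'`. An alternative, sketched here on a hand-made series, is to keep the four common titles and label everything else `'Rare'`, so that titles not seen before are covered as well. The names `titles` and `common_titles` are introduced only for this illustration.

```python
import pandas as pd

# Hand-made sample of extracted titles
titles = pd.Series(['Mr', 'Mrs', 'Dr', 'Miss', 'Rev', 'Master', 'Dona'])
common_titles = ['Mr', 'Miss', 'Mrs', 'Master']

# where() keeps entries for which the condition holds and replaces the rest
grouped = titles.where(titles.isin(common_titles), 'Rare')
print(grouped.tolist())
```

This is not identical to the explicit lists: any title outside the four common ones, including an unexpected extraction result, would be labeled `'Rare'`.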

In [None]:
train_data_2.isnull().sum()

Unnamed: 0,0
Survived,0
Pclass,0
Sex,0
Age,0
SibSp,0
Parch,0
Fare,0
Embarked,0
Title,0


In [None]:
train_data_2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,0,22.0,1,0,7.25,0,Mr
1,1,1,1,38.0,1,0,71.2833,1,Mrs
2,1,3,1,26.0,0,0,7.925,0,Miss
3,1,1,1,35.0,1,0,53.1,0,Mrs
4,0,3,0,35.0,0,0,8.05,0,Mr


In [None]:
train_data_2['Title'] = train_data_2['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})
test_data_2['Title'] = test_data_2['Title'].map({'Mr': 0, 'Miss': 1, 'Mrs': 2, 'Master': 3, 'Rare': 4})

### Finally, the data is ready for training

In [None]:
train_data_2.head()

Unnamed: 0,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked,Title
0,0,3,0,22.0,1,0,7.25,0,0
1,1,1,1,38.0,1,0,71.2833,1,2
2,1,3,1,26.0,0,0,7.925,0,1
3,1,1,1,35.0,1,0,53.1,0,2
4,0,3,0,35.0,0,0,8.05,0,0
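
Before training, the target column is usually separated from the features. A minimal sketch of that split, using only the five prepared rows shown above (the names `prepared`, `X` and `y` are assumptions for this example, not part of the notebook):

```python
import pandas as pd

# The five prepared rows copied from the output above
prepared = pd.DataFrame({
    'Survived': [0, 1, 1, 1, 0],
    'Pclass': [3, 1, 3, 1, 3],
    'Sex': [0, 1, 1, 1, 0],
    'Age': [22.0, 38.0, 26.0, 35.0, 35.0],
    'SibSp': [1, 1, 0, 1, 0],
    'Parch': [0, 0, 0, 0, 0],
    'Fare': [7.25, 71.2833, 7.925, 53.1, 8.05],
    'Embarked': [0, 1, 0, 0, 0],
    'Title': [0, 2, 1, 2, 0],
})

# Features are every column except the target
X = prepared.drop(columns='Survived')
# Target is the Survived column
y = prepared['Survived']
print(X.shape, y.shape)
```

On the full data the same two lines would be applied to `train_data_2`; `test_data_2` has no `Survived` column and still carries `PassengerId`, which would need to be set aside before prediction.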
