# `Titanic Dataset`

### `Background`

On April 15, 1912, during her maiden voyage, the widely considered “unsinkable” RMS Titanic sank after colliding with an iceberg. Unfortunately, there weren’t enough lifeboats for everyone on board, resulting in the death of 1502 out of 2224 passengers and crew.

While there was some element of luck involved in surviving, it seems some groups of people were more likely to survive than others.

### `Objective`

To build a predictive model that answers the question, `"what sorts of people were more likely to survive?"` using the passenger data.

### `Data Understanding`

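The Titanic dataset comprises the following fields (the CSV file stores the column names in lowercase, e.g. `pclass`, `home.dest`):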
- `PassengerId:` Id representing unique passenger records
- `Survived:` A binary indicator that shows whether the passenger survived (1) or not (0)
- `Pclass:` Ticket class indicating the socio-economic status of the passenger | 1 = Upper, 2 = Middle, 3 = Lower
- `Name:` The full name of the passenger
- `Sex:` The gender of the passenger | Denoted as either male or female
- `Age:` The age of the passenger in years
- `SibSp:` The number of siblings/spouses aboard the Titanic for the respective passenger
- `Parch:` The number of parents/children aboard the Titanic for the respective passenger
- `Ticket:` The passenger ticket number
- `Fare:` The fare paid by the passenger for the ticket
- `Cabin:` The cabin number assigned to the passenger, if available
- `Home.dest:` The home/destination of the passenger
- `Embarked:` The port of embarkation for the passenger | C = Cherbourg, Q = Queenstown, and S = Southampton
- `Boat:` Identifier for the lifeboat that rescued the survivor
- `Body:` Identification number of the recovered body, if the passenger did not survive

In [1]:
# Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Render plots inline in the notebook
%matplotlib inline

In [2]:
# Load the data
df = pd.read_csv('archive/Titanic Dataset.csv', sep=',')
df.head(10)

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
0,1,1,"Allen, Miss. Elisabeth Walton",female,29.0,0,0,24160,211.3375,B5,S,2,,"St Louis, MO"
1,1,1,"Allison, Master. Hudson Trevor",male,0.92,1,2,113781,151.55,C22 C26,S,11,,"Montreal, PQ / Chesterville, ON"
2,1,0,"Allison, Miss. Helen Loraine",female,2.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
3,1,0,"Allison, Mr. Hudson Joshua Creighton",male,30.0,1,2,113781,151.55,C22 C26,S,,135.0,"Montreal, PQ / Chesterville, ON"
4,1,0,"Allison, Mrs. Hudson J C (Bessie Waldo Daniels)",female,25.0,1,2,113781,151.55,C22 C26,S,,,"Montreal, PQ / Chesterville, ON"
5,1,1,"Anderson, Mr. Harry",male,48.0,0,0,19952,26.55,E12,S,3,,"New York, NY"
6,1,1,"Andrews, Miss. Kornelia Theodosia",female,63.0,1,0,13502,77.9583,D7,S,10,,"Hudson, NY"
7,1,0,"Andrews, Mr. Thomas Jr",male,39.0,0,0,112050,0.0,A36,S,,,"Belfast, NI"
8,1,1,"Appleton, Mrs. Edward Dale (Charlotte Lamson)",female,53.0,2,0,11769,51.4792,C101,S,D,,"Bayside, Queens, NY"
9,1,0,"Artagaveytia, Mr. Ramon",male,71.0,0,0,PC 17609,49.5042,,C,,22.0,"Montevideo, Uruguay"


In [3]:
# Data info
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 14 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   pclass     1309 non-null   int64  
 1   survived   1309 non-null   int64  
 2   name       1309 non-null   object 
 3   sex        1309 non-null   object 
 4   age        1046 non-null   float64
 5   sibsp      1309 non-null   int64  
 6   parch      1309 non-null   int64  
 7   ticket     1309 non-null   object 
 8   fare       1308 non-null   float64
 9   cabin      295 non-null    object 
 10  embarked   1307 non-null   object 
 11  boat       486 non-null    object 
 12  body       121 non-null    float64
 13  home.dest  745 non-null    object 
dtypes: float64(3), int64(4), object(7)
memory usage: 143.3+ KB


In [4]:
# Check missing and duplications
def data_check(df):
    print('Missing values:\n', df.isnull().sum())
    print('\nDuplicated rows:', df.duplicated().sum())

data_check(df)

Missing values:
 pclass          0
survived        0
name            0
sex             0
age           263
sibsp           0
parch           0
ticket          0
fare            1
cabin        1014
embarked        2
boat          823
body         1188
home.dest     564
dtype: int64

Duplicated rows: 0
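
The raw counts above are easier to compare as a share of all rows. The following is a minimal sketch that reuses only the numbers printed by `df.info()` and `data_check(df)`; the names `missing_counts`, `n_rows` and `missing_pct` are introduced here for illustration.

```python
import pandas as pd

# Missing-value counts copied from the data_check(df) output above.
missing_counts = pd.Series({
    'age': 263, 'fare': 1, 'cabin': 1014, 'embarked': 2,
    'boat': 823, 'body': 1188, 'home.dest': 564,
})
n_rows = 1309  # total rows reported by df.info()

# Share of missing values per column, in percent, largest first.
missing_pct = (missing_counts / n_rows * 100).round(1).sort_values(ascending=False)
print(missing_pct)
```

On the loaded DataFrame itself, `df.isnull().mean() * 100` should give the same percentages without copying the counts by hand.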


### `Data Handling`

`1. Age`

`Replace the missing age values with the median age of passengers of the same sex and ticket class (socio-economic status).`

In [5]:
df.groupby(['sex','pclass'])['age'].median()

sex     pclass
female  1         36.0
        2         28.0
        3         22.0
male    1         42.0
        2         29.5
        3         25.0
Name: age, dtype: float64

In [6]:
df['age'] = df['age'].fillna(
    df.groupby(['sex','pclass'])['age'].transform('median')
)
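
To see why `transform('median')` can be passed directly to `fillna`, here is a small sketch on hypothetical toy data (not rows from the Titanic file). The frame `toy` and the column `age_filled` are assumptions made for this illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not taken from the Titanic file.
toy = pd.DataFrame({
    'sex':    ['female', 'female', 'female', 'male', 'male', 'male'],
    'pclass': [1, 1, 1, 3, 3, 3],
    'age':    [30.0, 40.0, np.nan, 20.0, np.nan, 26.0],
})

# transform('median') returns one value per row (the median of that row's group),
# so the result shares the original index and lines up with the 'age' column.
group_median = toy.groupby(['sex', 'pclass'])['age'].transform('median')

# Only the missing ages are replaced; observed ages are left untouched.
toy['age_filled'] = toy['age'].fillna(group_median)
print(toy)
```

The missing female first-class age becomes 35.0 (the median of 30 and 40) and the missing male third-class age becomes 23.0 (the median of 20 and 26).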


`2. Fare`

In [7]:
df[df['fare'].isnull()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,cabin,embarked,boat,body,home.dest
1225,3,0,"Storey, Mr. Thomas",male,60.5,0,0,3701,,,S,,261.0,


In [8]:
df.groupby(['pclass'])['fare'].median()

pclass
1    60.0000
2    15.0458
3     8.0500
Name: fare, dtype: float64

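`Fill the single missing fare with the median fare of the passenger's class (class 3, median 8.05). The per-class transform only changes this one row, since no other fare is missing.`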

In [9]:
df['fare'] = df['fare'].fillna(
    df.groupby(['pclass'])['fare'].transform('median')
)

`3. Cabin`

`Create a new feature "cabin_known" that indicates whether the cabin is known (1) or not (0). Then drop the "cabin" column, since most of its values are missing (1014 of 1309 rows, about 77%).`

In [10]:
df['cabin_known'] = df['cabin'].notnull().astype(int)

In [11]:
# Drop the cabin column
df.drop(columns=['cabin'], inplace=True)
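
The indicator construction can be checked on a small example. This is a sketch on a hypothetical series of cabin values; the variable names are illustrative and not part of the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical cabin values; NaN marks a passenger with no recorded cabin.
cabin = pd.Series(['B5', np.nan, 'C22 C26', np.nan], name='cabin')

# notnull() gives True/False per row; astype(int) turns that into a 1/0 indicator.
cabin_known = cabin.notnull().astype(int)
print(cabin_known.tolist())  # [1, 0, 1, 0]
```

Keeping the indicator preserves the information that a cabin was recorded at all, even though the cabin labels themselves are discarded.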

`4. Embarked`

`Drop the records with a missing "embarked" value (only two rows).`

In [12]:
df[df['embarked'].isnull()]

Unnamed: 0,pclass,survived,name,sex,age,sibsp,parch,ticket,fare,embarked,boat,body,home.dest,cabin_known
168,1,1,"Icard, Miss. Amelie",female,38.0,0,0,113572,80.0,,6,,,1
284,1,1,"Stone, Mrs. George Nelson (Martha Evelyn)",female,62.0,0,0,113572,80.0,,6,,"Cincinatti, OH",1


In [13]:
df.dropna(subset=['embarked'], inplace=True)
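
A short sketch of how `dropna(subset=...)` behaves, using hypothetical toy rows: only rows missing a value in the listed column are removed, and the surviving rows keep their original index labels. The final `reset_index` step is an optional addition, not something the notebook does.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the Titanic file.
toy = pd.DataFrame({
    'name':     ['A', 'B', 'C', 'D'],
    'embarked': ['S', np.nan, 'C', np.nan],
    'fare':     [7.25, 80.0, np.nan, 80.0],
})

# Only rows with a missing 'embarked' are dropped; row C stays despite its missing fare.
kept = toy.dropna(subset=['embarked'])
print(kept.index.tolist())  # [0, 2] -> the original labels are kept, leaving gaps

# Optional: renumber the rows if later steps rely on a contiguous index.
kept_reset = kept.reset_index(drop=True)
print(kept_reset.index.tolist())  # [0, 1]
```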

`5. Boat and Body`

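`The boat feature is recorded for passengers who were rescued, and the body feature only for passengers who did not survive. Both are outcomes of the sinking that would leak the target into the model, so we drop them.`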

In [14]:
df.drop(columns=['boat', 'body'], inplace=True)
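
One way to make the leakage argument concrete is to cross-tabulate "was a boat recorded?" against the target. The sketch below uses hypothetical toy rows, so the counts are illustrative only; `boat_recorded` and `leak` are names introduced here.

```python
import numpy as np
import pandas as pd

# Hypothetical toy rows, not taken from the Titanic file.
toy = pd.DataFrame({
    'survived': [1, 1, 0, 0, 0, 1],
    'boat':     ['2', '11', np.nan, np.nan, np.nan, 'D'],
    'body':     [np.nan, np.nan, np.nan, 135.0, 22.0, np.nan],
})

# Label each row by whether a lifeboat identifier is present.
boat_recorded = toy['boat'].notnull().map({True: 'yes', False: 'no'})

# Rows: boat recorded or not; columns: survived (0/1).
leak = pd.crosstab(boat_recorded, toy['survived'],
                   rownames=['boat_recorded'], colnames=['survived'])
print(leak)
```

In this toy table every passenger with a boat survived and every passenger without one did not. Running the same cross-tabulation on the real `df` before dropping the columns would show how close the actual data comes to that pattern.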

`6. home.dest`

`Drop the home.dest feature, since it holds the home/destination of each passenger, which is not used in this model.`

In [15]:
df.drop(columns=['home.dest'], inplace=True)

`Final Data Check`

In [16]:
data_check(df)

Missing values:
 pclass         0
survived       0
name           0
sex            0
age            0
sibsp          0
parch          0
ticket         0
fare           0
embarked       0
cabin_known    0
dtype: int64

Duplicated rows: 0


`The data is now ready for EDA.`

In [18]:
# Save the cleaned data
df.to_csv('cleaned_titanic_data.csv', index=False)
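
A quick round-trip check can confirm that the saved file reloads with the same columns. This sketch writes a hypothetical toy frame to a temporary directory rather than touching `cleaned_titanic_data.csv`; the file name `cleaned_toy.csv` is an assumption for illustration.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for the cleaned data.
toy = pd.DataFrame({'pclass': [1, 3], 'age': [29.0, 25.0], 'cabin_known': [1, 0]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'cleaned_toy.csv')
    # index=False keeps the row index out of the file, so reloading
    # does not produce an extra "Unnamed: 0" column.
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

print(list(reloaded.columns))  # ['pclass', 'age', 'cabin_known']
```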