In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
data = pd.read_csv("Titanic+Data+Set.csv")
data.head()

,PassengerId,Survived,Pclass,Name,Sex,Age,SibSp,Parch,Ticket,Fare,Cabin,Embarked
0,1,0,3,"Braund, Mr. Owen Harris",male,22.0,1,0,A/5 21171,7.25,,S
1,2,1,1,"Cumings, Mrs. John Bradley (Florence Briggs Th...",female,38.0,1,0,PC 17599,71.2833,C85,C
2,3,1,3,"Heikkinen, Miss. Laina",female,26.0,0,0,STON/O2. 3101282,7.925,,S
3,4,1,1,"Futrelle, Mrs. Jacques Heath (Lily May Peel)",female,35.0,1,0,113803,53.1,C123,S
4,5,0,3,"Allen, Mr. William Henry",male,35.0,0,0,373450,8.05,,S


In [3]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB


## EDA

In [4]:
# PassengerId and Name won't be used to train the model, so mark both columns for dropping
del_cols = ["PassengerId","Name"]

In [5]:
# Percentage of missing values per column
missing = round(100*(data.isnull().sum()/len(data)), 2)
missing.loc[missing > 0]

Age         19.87
Cabin       77.10
Embarked     0.22
dtype: float64
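
The same percentages can be obtained a little more directly, because the mean of a boolean mask is the fraction of `True` entries. A minimal sketch on a toy frame (the values in `toy` are made up for illustration and are not taken from the Titanic file):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with gaps in two of its three columns.
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, np.nan],
    "Fare": [7.25, 71.28, 7.92, 53.10],
})

# isna() yields booleans; their column mean is the share of missing rows.
missing_pct = (toy.isna().mean() * 100).round(2)

# Keep only the columns that actually have missing values.
missing_pct = missing_pct[missing_pct > 0]
print(missing_pct)
```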

In [6]:
data.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

**Insights:**
- Only **38%** of the passengers **survived**.

In [7]:
data.SibSp.value_counts()

0    608
1    209
2     28
4     18
3     16
8      7
5      5
Name: SibSp, dtype: int64

In [8]:
data.Parch.value_counts()

0    678
1    118
2     80
3      5
5      5
4      4
6      1
Name: Parch, dtype: int64

In [9]:
data.groupby(['SibSp','Survived'])[["PassengerId"]].count()

SibSp,Survived,PassengerId
0,0,398
0,1,210
1,0,97
1,1,112
2,0,15
2,1,13
3,0,12
3,1,4
4,0,15
4,1,3


**Insights:**
- People **with siblings/spouses** aboard had roughly a **47%** chance of surviving (132 of 283).
- People **without siblings/spouses** had a **35%** chance of surviving (210 of 608).
- People with siblings/spouses survived more often, possibly because families were given preference; the counts alone cannot confirm the reason.
- A similar trend can be seen for people with parents/children versus those without.
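
The two survival rates can be checked with plain arithmetic on the counts printed above. This is only a sketch: the variable names are illustrative, and the "with siblings/spouses" group is derived by subtraction from the overall totals rather than read from the table.

```python
# Counts copied from the value_counts() and groupby outputs above.
total_passengers = 891
total_survivors = 342
alone_total = 608        # passengers with SibSp == 0
alone_survivors = 210    # of those, Survived == 1

# Everyone else travelled with at least one sibling or spouse.
with_total = total_passengers - alone_total          # 283
with_survivors = total_survivors - alone_survivors   # 132

rate_alone = alone_survivors / alone_total
rate_with = with_survivors / with_total
print(f"without siblings/spouses: {rate_alone:.1%}")
print(f"with siblings/spouses:    {rate_with:.1%}")
```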

In [10]:
data.Sex.value_counts()

male      577
female    314
Name: Sex, dtype: int64

In [11]:
data.Survived.value_counts()

0    549
1    342
Name: Survived, dtype: int64

In [12]:
data.groupby(['Sex','Survived'])[["PassengerId"]].count()

Sex,Survived,PassengerId
female,0,81
female,1,233
male,0,468
male,1,109


**Insights:**
- **Females** had a survival rate of **74%**
- **Males** had a survival rate of **19%** 
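
Because `Survived` is coded 0/1, its group mean is the survival rate, so the percentages do not have to be worked out by hand from the counts. A small sketch that rebuilds one row per passenger from the counts above (the `passengers` frame is a stand-in for the real data, assumed here so the example runs without the CSV file):

```python
import pandas as pd

# (Sex, Survived) counts copied from the groupby output above.
counts = {
    ("female", 0): 81,
    ("female", 1): 233,
    ("male", 0): 468,
    ("male", 1): 109,
}

# Expand the counts into one row per passenger.
rows = [
    {"Sex": sex, "Survived": survived}
    for (sex, survived), n in counts.items()
    for _ in range(n)
]
passengers = pd.DataFrame(rows)

# The mean of a 0/1 column within each group is that group's survival rate.
survival_by_sex = passengers.groupby("Sex")["Survived"].mean()
print((survival_by_sex * 100).round(1))
```

On the real frame the same call would be `data.groupby("Sex")["Survived"].mean()`.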

In [13]:
data.Pclass.value_counts()

3    491
1    216
2    184
Name: Pclass, dtype: int64

- 1 : Upper class
- 2 : Middle class
- 3 : Lower class

In [14]:
data.groupby(['Pclass','Survived'])[["PassengerId"]].count()

Pclass,Survived,PassengerId
1,0,80
1,1,136
2,0,97
2,1,87
3,0,372
3,1,119


**Insights:**
- **Upper class** had a survival rate of **63%** (136 of 216)
- **Middle class** had a survival rate of **47%** (87 of 184)
- **Lower class** had a survival rate of **24%** (119 of 491)
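
A per-class rate is each row of the count table divided by its row total. The sketch below does that with the counts from the groupby output above; the frame and column names (`class_counts`, `died`, `survived`) are chosen for illustration only.

```python
import pandas as pd

# Counts copied from the (Pclass, Survived) groupby output above.
class_counts = pd.DataFrame(
    {"died": [80, 97, 372], "survived": [136, 87, 119]},
    index=pd.Index([1, 2, 3], name="Pclass"),
)

# Divide every row by its own total to turn counts into shares.
class_rates = class_counts.div(class_counts.sum(axis=1), axis=0)
print((class_rates["survived"] * 100).round(1))
```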

In [15]:
data.Ticket.value_counts()

CA. 2343      7
1601          7
347082        7
CA 2144       6
347088        6
             ..
A/5. 13032    1
2624          1
330935        1
349219        1
315094        1
Name: Ticket, Length: 681, dtype: int64

About 76% of the Ticket values are unique (681 of 891), and a ticket number is unlikely to affect the chances of survival directly, so the Ticket column will be dropped.

In [16]:
del_cols.append("Ticket")

In [17]:
data.Cabin.value_counts() 

G6             4
C23 C25 C27    4
B96 B98        4
F33            3
F2             3
              ..
F38            1
B50            1
C62 C64        1
B101           1
E46            1
Name: Cabin, Length: 147, dtype: int64

The Cabin column is 77% missing, and 72% of its non-null values are unique (147 of 204), so this column will be dropped as well.
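
The "share of unique values" used for Ticket and Cabin can be made explicit with a small helper. This is a sketch: `unique_share` is a hypothetical name, not a pandas function, and it measures uniqueness among non-null entries only, which is one reasonable reading of the figures above.

```python
import numpy as np
import pandas as pd

def unique_share(column: pd.Series) -> float:
    """Fraction of non-null entries that are distinct values."""
    non_null = column.dropna()
    if len(non_null) == 0:
        return 0.0  # nothing to measure in an all-missing column
    return non_null.nunique() / len(non_null)

# Toy column (hypothetical values): 4 non-null entries, 3 of them distinct.
toy_cabin = pd.Series(["C85", np.nan, "C85", "E46", "G6", np.nan])
print(unique_share(toy_cabin))

# The same ratio from the lengths reported in the outputs above.
ticket_share = 681 / 891   # distinct tickets / non-null tickets
cabin_share = 147 / 204    # distinct cabins / non-null cabins
print(f"{ticket_share:.0%}, {cabin_share:.0%}")
```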

In [18]:
del_cols.append("Cabin")

In [19]:
data.Embarked.value_counts()

S    644
C    168
Q     77
Name: Embarked, dtype: int64

In [20]:
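# Embarked is categorical, so impute with the mode; Age is numeric, so use the median
Embarked_mode = data["Embarked"].mode()[0]
Age_median = data["Age"].median()
Embarked_mode, Age_median

('S', 28.0)

In [21]:
# Assign the result back instead of calling fillna(inplace=True) on a column selection
data["Embarked"] = data["Embarked"].fillna(Embarked_mode)
data["Age"] = data["Age"].fillna(Age_median)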
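
Both imputations can also be done in one call by passing `DataFrame.fillna` a dictionary that maps column names to fill values. A minimal sketch on a toy frame (the rows are made up; only the mode-for-categorical, median-for-numeric choice mirrors the cells above):

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one gap per column.
toy = pd.DataFrame({
    "Embarked": ["S", "C", None, "S", "Q"],
    "Age": [22.0, np.nan, 26.0, 35.0, 54.0],
})

# One fill value per column: most frequent port, median age.
fill_values = {
    "Embarked": toy["Embarked"].mode()[0],
    "Age": toy["Age"].median(),
}

# fillna returns a new frame; the original is left untouched.
filled = toy.fillna(value=fill_values)
print(filled)
```

Returning a new frame rather than modifying a column in place avoids the chained-assignment pitfalls that recent pandas versions warn about.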

In [22]:
# Dropping unnecessary columns
del_cols

['PassengerId', 'Name', 'Ticket', 'Cabin']

In [23]:
data.drop(del_cols, axis=1, inplace=True)

In [24]:
# Percentage of missing values per column (should now be empty)
missing = round(100*(data.isnull().sum()/len(data)), 2)
missing.loc[missing > 0]

Series([], dtype: float64)

In [25]:
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 8 columns):
 #   Column    Non-Null Count  Dtype  
---  ------    --------------  -----  
 0   Survived  891 non-null    int64  
 1   Pclass    891 non-null    int64  
 2   Sex       891 non-null    object 
 3   Age       891 non-null    float64
 4   SibSp     891 non-null    int64  
 5   Parch     891 non-null    int64  
 6   Fare      891 non-null    float64
 7   Embarked  891 non-null    object 
dtypes: float64(2), int64(4), object(2)
memory usage: 55.8+ KB


#### One-hot encoding

In [26]:
data.head()

,Survived,Pclass,Sex,Age,SibSp,Parch,Fare,Embarked
0,0,3,male,22.0,1,0,7.25,S
1,1,1,female,38.0,1,0,71.2833,C
2,1,3,female,26.0,0,0,7.925,S
3,1,1,female,35.0,1,0,53.1,S
4,0,3,male,35.0,0,0,8.05,S


In [27]:
cat_variables = list(data.columns[ data.dtypes == "object"])
cat_variables

['Sex', 'Embarked']

In [28]:
# Convert the object columns to the pandas categorical dtype
for col in cat_variables:
    data[col] = pd.Categorical(data[col])

In [29]:
processed_data = pd.get_dummies(data, columns=cat_variables, drop_first=True)
processed_data.head()

,Survived,Pclass,Age,SibSp,Parch,Fare,Sex_male,Embarked_Q,Embarked_S
0,0,3,22.0,1,0,7.25,1,0,1
1,1,1,38.0,1,0,71.2833,0,0,0
2,1,3,26.0,0,0,7.925,0,0,1
3,1,1,35.0,1,0,53.1,0,0,1
4,0,3,35.0,0,0,8.05,1,0,1
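
With `drop_first=True`, the first category of each column (in sorted order) gets no indicator column of its own and is represented by all of that column's indicators being 0. A small sketch on a toy frame with made-up rows; `dtype=int` is passed here as an assumption, to get 0/1 columns like the output above, since recent pandas versions may return boolean indicators by default.

```python
import pandas as pd

# Toy frame (hypothetical values) with the same two categorical columns.
toy = pd.DataFrame({
    "Sex": ["male", "female", "female"],
    "Embarked": ["S", "C", "Q"],
    "Fare": [7.25, 71.28, 7.92],
})

# "female" and "C" sort first, so they become the dropped baselines.
encoded = pd.get_dummies(
    toy, columns=["Sex", "Embarked"], drop_first=True, dtype=int
)
print(encoded)
```

In this sketch the second row (female, embarked at C) has zeros in all three indicator columns, which is how the baseline categories are encoded.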


In [30]:
# The data is now ready for modelling, so save it
# (index=False keeps the row index from being written as an extra column)
processed_data.to_csv("titanic_processed_data.csv", index=False)
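
By default `to_csv` also writes the row index, which comes back as an extra `Unnamed: 0` column when the file is read again. The sketch below shows the difference with an in-memory buffer instead of a file; the two-row frame is a made-up stand-in for `processed_data`.

```python
import io
import pandas as pd

# Tiny stand-in (hypothetical values) for the processed frame.
toy = pd.DataFrame({"Survived": [0, 1], "Sex_male": [1, 0]})

# Round-trip through CSV text, with and without the index.
with_index = pd.read_csv(io.StringIO(toy.to_csv()))
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

print(list(with_index.columns))
print(list(without_index.columns))
```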

## EDA Insights


- Only **38%** of the passengers **survived**
- Did any **gender** have a better chance of survival?
    - **Females** had a survival rate of **74%**
    - **Males** had a survival rate of **19%** 
- Did belonging to a particular **socio-economic class** affect the chances of survival?
    - **Upper class** had a survival rate of **63%**
    - **Middle class** had a survival rate of **47%**
    - **Lower class** had a survival rate of **24%**
- Did people with **family** have a better chance of surviving?
    - People **with siblings/spouses** aboard had roughly a **47%** chance of surviving
    - People **without siblings/spouses** had a **35%** chance of surviving
    - People with siblings/spouses survived more often, possibly because families were given preference; the counts alone cannot confirm the reason.
    - A similar trend can be seen for people with parents/children versus those without.