**Project 1: Data Storytelling - Analyzing Survival on the Titanic**

**Workshop:** Geeks for Geeks 21 Projects, 21 Days: ML, Deep Learning & GenAI

**Date:** October 10, 2025

**Author:** Harsh Bhanushali

**Objective:** Perform an Exploratory Data Analysis (EDA) on the Titanic dataset, clean the data, engineer features, visualize insights using Plotly, and generate a ydata-profiling report for submission.

**Step 1: Import Libraries**

Import libraries for data manipulation, visualization, statistical analysis, and profiling.

In [1]:
# Install ydata-profiling before importing it
!pip install ydata-profiling -q

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from scipy import stats
from ydata_profiling import ProfileReport

# Render Plotly figures in an iframe so they display inside the notebook
pio.renderers.default = 'iframe'

**Step 2: Load the Dataset**

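Load the Titanic dataset with Seaborn's load_dataset, which downloads it from Seaborn's online data repository (an internet connection is needed on first use).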

In [2]:
df = sns.load_dataset('titanic')
print("Dataset Loaded Successfully!")
print("\nFirst 5 Rows:")
display(df.head())
print("\nShape:", df.shape)
print("\nColumns:", df.columns.tolist())

Dataset Loaded Successfully!

First 5 Rows:


,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True



Shape: (891, 15)

Columns: ['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'fare', 'embarked', 'class', 'who', 'adult_male', 'deck', 'embark_town', 'alive', 'alone']


**Step 3: Data Cleaning**

Handle missing values and remove redundant columns to prepare the dataset.


**Missing Values:**

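    Fill missing age values with the column median.

    Fill missing embarked and embark_town values with the mode (the most frequent value).

    For deck, a categorical column, add 'Unknown' as a category and use it to fill missing values.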



**Drop Columns:** Remove derived columns (alive, who, adult_male, class, alone).



**Rename Columns:** Standardize to match typical Titanic dataset conventions.

In [3]:
# Handle missing values
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
df['embark_town'] = df['embark_town'].fillna(df['embark_town'].mode()[0])

# Handle categorical 'deck' column
if df['deck'].dtype.name == 'category':
    df['deck'] = df['deck'].cat.add_categories(['Unknown'])
df['deck'] = df['deck'].fillna('Unknown')

# Drop redundant columns
df = df.drop(['alive', 'who', 'adult_male', 'class', 'alone'], axis=1, errors='ignore')

# Rename columns
df.rename(columns={
    'survived': 'Survived',
    'pclass': 'Pclass',
    'sex': 'Sex',
    'age': 'Age',
    'sibsp': 'SibSp',
    'parch': 'Parch',
    'fare': 'Fare',
    'embarked': 'Embarked',
    'deck': 'Cabin',
    'embark_town': 'Embark_town'
}, inplace=True)

print("\nCleaned DataFrame Info:")
df.info()


Cleaned DataFrame Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 10 columns):
 #   Column       Non-Null Count  Dtype   
---  ------       --------------  -----   
 0   Survived     891 non-null    int64   
 1   Pclass       891 non-null    int64   
 2   Sex          891 non-null    object  
 3   Age          891 non-null    float64 
 4   SibSp        891 non-null    int64   
 5   Parch        891 non-null    int64   
 6   Fare         891 non-null    float64 
 7   Embarked     891 non-null    object  
 8   Cabin        891 non-null    category
 9   Embark_town  891 non-null    object  
dtypes: category(1), float64(2), int64(4), object(3)
memory usage: 64.0+ KB
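The extra step for deck in the cleaning cell exists because a pandas categorical column only accepts values that are already registered as categories. A minimal sketch on a hypothetical toy column (the `deck` Series here is made up, not the Titanic data) shows the difference:

```python
import pandas as pd

# Hypothetical toy column standing in for the categorical 'deck' column.
deck = pd.Series(['C', None, 'E', None], dtype='category')

# Filling with a label that is not a registered category is rejected by pandas.
try:
    deck.fillna('Unknown')
    fill_rejected = False
except (TypeError, ValueError):
    fill_rejected = True

# Registering the label first makes the same fill valid.
deck_filled = deck.cat.add_categories(['Unknown']).fillna('Unknown')

print("Direct fill rejected:", fill_rejected)
print(deck_filled.tolist())
```

The exception type caught here covers different pandas versions; only the second approach is the one the notebook relies on.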


**Step 4: Feature Engineering**

Derive new features from the cleaned columns to make survival patterns easier to spot.

    **FamilySize:** Sum of SibSp, Parch, and 1 (self).
    
    **IsAlone:** 1 if FamilySize is 1, else 0.

    **Title:** Use Sex as a proxy (no Name column in Seaborn’s dataset).

    **AgeGroup:** Bin Age into categories.

    **FareBin:** Bin Fare into quartiles.

    **HasCabin:** 1 if Cabin is not 'Unknown'.

In [4]:
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1
df['IsAlone'] = (df['FamilySize'] == 1).astype(int)
df['Title'] = df['Sex']
df['AgeGroup'] = pd.cut(df['Age'], bins=[0, 12, 18, 35, 60, 100], 
                        labels=['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior'])
df['FareBin'] = pd.qcut(df['Fare'], q=4, labels=['Low', 'Medium', 'High', 'Very High'])
df['HasCabin'] = (df['Cabin'] != 'Unknown').astype(int)

print("\nEngineered Features Head:")
display(df[['FamilySize', 'IsAlone', 'Title', 'AgeGroup', 'FareBin', 'HasCabin']].head())


Engineered Features Head:


,FamilySize,IsAlone,Title,AgeGroup,FareBin,HasCabin
0,2,0,male,Adult,Low,0
1,2,0,female,Middle-Aged,Very High,1
2,1,1,female,Adult,Medium,0
3,2,0,female,Adult,Very High,1
4,1,1,male,Adult,Medium,0
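AgeGroup and FareBin use two different binning functions. As a small sketch on made-up values (the `ages` and `fares` Series are hypothetical), pd.cut splits on fixed edges that are closed on the right by default, while pd.qcut picks edges so that each bin holds roughly the same number of rows:

```python
import pandas as pd

# Hypothetical ages chosen to sit on and around the bin edges.
ages = pd.Series([5, 12, 13, 40, 61])
age_group = pd.cut(ages, bins=[0, 12, 18, 35, 60, 100],
                   labels=['Child', 'Teen', 'Adult', 'Middle-Aged', 'Senior'])

# Hypothetical fares; qcut derives the edges from the data's quartiles.
fares = pd.Series([1, 2, 3, 4, 5, 6, 7, 8])
fare_bin = pd.qcut(fares, q=4, labels=['Low', 'Medium', 'High', 'Very High'])

print(age_group.tolist())
print(fare_bin.value_counts().sort_index())
```

Because the intervals are right-closed, an age of exactly 12 falls in Child and 13 falls in Teen. An age of exactly 0 would fall outside the first interval and come back as NaN, which does not occur in this dataset's Age column but is worth keeping in mind when reusing the bins.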


**Step 5: Exploratory Data Analysis (EDA)**

Visualize patterns using interactive Plotly plots.

**5.1 Overall Survival Rate**

Calculate and display the survival percentage.

In [5]:
survival_rate = df['Survived'].mean() * 100
print(f"Overall Survival Rate: {survival_rate:.2f}%")

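# The labels argument only renames columns, so map the 0/1 codes to readable names first
outcome_df = df.assign(Outcome=df['Survived'].map({0: 'Did Not Survive', 1: 'Survived'}))
fig = px.pie(outcome_df, names='Outcome', color='Outcome', title='Survival Distribution',
             hole=0.3,
             color_discrete_map={'Did Not Survive': '#FF6347', 'Survived': '#32CD32'})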
fig.update_layout(title_x=0.5)
fig.show()

Overall Survival Rate: 38.38%
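The rate is computed as the mean of Survived because the column is coded 0/1, so its mean equals the share of ones. A short sketch on hypothetical toy outcomes (not the Titanic data) checks this against an explicit count:

```python
import pandas as pd

# Hypothetical toy outcomes: 1 = survived, 0 = did not survive.
survived = pd.Series([1, 0, 0, 1, 0])

# Mean of a 0/1 column is the proportion of ones.
rate_from_mean = survived.mean() * 100

# Same figure from normalized value counts, looked up by the label 1.
rate_from_counts = survived.value_counts(normalize=True).loc[1] * 100

print(f"{rate_from_mean:.2f}% vs {rate_from_counts:.2f}%")
```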


**5.2 Survival by Passenger Class and Sex**

Analyze survival rates by class and gender.

In [6]:
fig = px.bar(df.groupby(['Pclass', 'Sex'])['Survived'].mean().reset_index(),
             x='Pclass', y='Survived', color='Sex', barmode='group',
             title='Survival Rate by Passenger Class and Sex',
             labels={'Survived': 'Survival Rate', 'Pclass': 'Passenger Class'},
             category_orders={'Pclass': [1, 2, 3]},
             color_discrete_sequence=['#FF69B4', '#4682B4'], height=400)
fig.update_layout(xaxis_title='Passenger Class', yaxis_title='Survival Rate', title_x=0.5)
fig.show()
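The numbers behind a grouped bar chart like this one can also be read as a table. As a sketch on a hypothetical toy frame (the `toy` values are invented for illustration), the same groupby can be pivoted so that each class is a row and each sex a column:

```python
import pandas as pd

# Hypothetical toy passengers, not the Titanic data.
toy = pd.DataFrame({
    'Pclass':   [1, 1, 1, 3, 3, 3, 3],
    'Sex':      ['female', 'female', 'male', 'female', 'male', 'male', 'male'],
    'Survived': [1, 1, 0, 1, 0, 0, 1],
})

# Mean of the 0/1 outcome per (class, sex) group, pivoted so sexes become columns.
rates = toy.groupby(['Pclass', 'Sex'])['Survived'].mean().unstack('Sex')
print(rates.round(2))
```

Applying the same two calls to df would give one survival rate per class and sex combination, which is what the bars display.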

**5.3 Age Distribution by Survival**

Examine age distribution for survivors and non-survivors.

In [7]:
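# The labels argument only renames columns, so map the 0/1 codes to readable names first
outcome_df = df.assign(Outcome=df['Survived'].map({0: 'Did Not Survive', 1: 'Survived'}))
fig = px.histogram(outcome_df, x='Age', color='Outcome', nbins=30, marginal='box',
                   title='Age Distribution by Survival Status',
                   color_discrete_map={'Did Not Survive': '#FF6347', 'Survived': '#32CD32'})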
fig.update_layout(title_x=0.5)
fig.show()

**5.4 Fare vs. Survival**

Investigate the relationship between fare and survival.

In [8]:
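# The labels argument only renames columns, so map the 0/1 codes to readable names first
outcome_df = df.assign(Outcome=df['Survived'].map({0: 'Did Not Survive', 1: 'Survived'}))
fig = px.box(outcome_df, x='Outcome', y='Fare', color='Outcome',
             title='Fare Distribution by Survival Status', points='all',
             color_discrete_map={'Did Not Survive': '#FF6347', 'Survived': '#32CD32'})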
fig.update_layout(title_x=0.5)
fig.show()
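A box plot can be backed up with a numeric summary per group. The sketch below uses a hypothetical toy frame with one extreme fare to show why the median is often a steadier comparison than the mean when a few values are very large:

```python
import pandas as pd

# Hypothetical toy fares; the 300.0 value plays the role of an outlier.
toy = pd.DataFrame({
    'Survived': [0, 0, 0, 1, 1, 1],
    'Fare':     [7.0, 8.0, 9.0, 10.0, 50.0, 300.0],
})

# Median and mean fare for each outcome group.
fare_summary = toy.groupby('Survived')['Fare'].agg(['median', 'mean'])
print(fare_summary)
```

In the toy survivor group the single large fare pulls the mean to 120 while the median stays at 50.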

**5.5 Family Size Impact**

Analyze the effect of family size on survival.

In [9]:
family_survival = df.groupby('FamilySize')['Survived'].mean().reset_index()
fig = px.line(family_survival, x='FamilySize', y='Survived', markers=True,
              title='Survival Rate by Family Size',
              labels={'Survived': 'Survival Rate'})
fig.update_layout(title_x=0.5)
fig.show()

display(family_survival.round(2))

,FamilySize,Survived
0,1,0.3
1,2,0.55
2,3,0.58
3,4,0.72
4,5,0.2
5,6,0.14
6,7,0.33
7,8,0.0
8,11,0.0
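A survival rate per family size says nothing about how many passengers each rate is based on, and a rate of 0.0 or 1.0 from a handful of rows is weak evidence. One possible way to surface this, sketched on a hypothetical toy frame, is to report the group size next to the rate:

```python
import pandas as pd

# Hypothetical toy data: one large family represented by a single row.
toy = pd.DataFrame({
    'FamilySize': [1, 1, 1, 1, 2, 2, 8],
    'Survived':   [0, 1, 0, 0, 1, 1, 0],
})

# Named aggregation: survival rate and number of passengers per family size.
summary = (toy.groupby('FamilySize')['Survived']
              .agg(rate='mean', n='count')
              .reset_index())
print(summary)
```

Running the same aggregation on df would show which points on the line chart rest on only a few passengers.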


**5.6 Statistical Test: Age Difference**

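Run an independent two-sample t-test (scipy's ttest_ind, which assumes equal variances by default) to check whether mean age differs between survivors and non-survivors. A p-value below 0.05 would indicate a significant difference at the 5% level. Missing ages were filled with the median in Step 3, so the test reflects the imputed values.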

In [10]:
survivors_age = df[df['Survived'] == 1]['Age']
non_survivors_age = df[df['Survived'] == 0]['Age']
t_stat, p_value = stats.ttest_ind(survivors_age, non_survivors_age)
print(f"T-statistic: {t_stat:.2f}, P-value: {p_value:.4f}")

T-statistic: -1.94, P-value: 0.0528
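The default ttest_ind pools the two groups' variances. When the groups differ in size and spread, Welch's variant (equal_var=False) is a common alternative. The sketch below compares the two on synthetic data; the group sizes, means, and spreads are arbitrary assumptions and are unrelated to the Titanic result above:

```python
import numpy as np
from scipy import stats

# Synthetic groups with the same mean but different sizes and spreads.
rng = np.random.default_rng(0)
group_a = rng.normal(loc=30, scale=5, size=200)
group_b = rng.normal(loc=30, scale=15, size=50)

# Student's t-test (pooled variance) versus Welch's t-test (separate variances).
student = stats.ttest_ind(group_a, group_b)
welch = stats.ttest_ind(group_a, group_b, equal_var=False)

print(f"Student: t={student.statistic:.2f}, p={student.pvalue:.4f}")
print(f"Welch:   t={welch.statistic:.2f}, p={welch.pvalue:.4f}")
```

In the notebook, the Welch variant would be the same call with survivors_age and non_survivors_age as inputs.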


**Step 6: Correlation Analysis**

Visualize relationships between numerical features.

In [11]:
corr = df.select_dtypes(include=['number']).corr()
fig = go.Figure(data=go.Heatmap(z=corr.values, x=corr.columns, y=corr.columns,
                                colorscale='RdBu', zmin=-1, zmax=1))
fig.update_layout(title='Correlation Heatmap of Numerical Features', title_x=0.5)
fig.show()
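A heatmap shows every pairwise correlation at once. To read off which features relate most strongly to one target column, the matrix can be sliced and sorted by absolute value. This is a sketch on a hypothetical toy frame, not output from the Titanic data:

```python
import pandas as pd

# Hypothetical toy frame with a perfectly inverse Pclass/Survived relationship.
toy = pd.DataFrame({
    'Survived': [0, 0, 1, 1],
    'Fare':     [1.0, 2.0, 3.0, 4.0],
    'Pclass':   [3, 3, 1, 1],
})

corr = toy.corr()  # Pearson correlation by default

# Correlation of each feature with Survived, strongest first regardless of sign.
ranked = corr['Survived'].drop('Survived').sort_values(key=abs, ascending=False)
print(ranked.round(3))
```

Sorting by absolute value keeps strong negative correlations, such as the one for Pclass in this toy frame, at the top of the list.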

**Step 7: Generate ydata-profiling Report**

Generate an HTML report that profiles the cleaned, feature-engineered dataset, display it inline, and save it to a file.

In [12]:
profile = ProfileReport(df, title="Titanic Survival Analysis Report", explorative=True)
profile.to_notebook_iframe()
profile.to_file("titanic_profiling_report.html")
print("Profiling report saved as 'titanic_profiling_report.html'")

Profiling report saved as 'titanic_profiling_report.html'
