# Task 5: Exploratory Data Analysis (Titanic - Synthetic)

This notebook performs EDA on a Titanic-like synthetic dataset generated for practice.  
**Deliverables:** EDA with visuals and observations.

**Contents:**
1. Load data & quick look (`.info()`, `.describe()`)
2. Missing values analysis
3. Univariate analysis (histograms, boxplots, counts)
4. Bivariate analysis (correlation heatmap, cross-tabs)
5. Multivariate glimpses (pairplot on selected features)
6. Insights summary


In [None]:
# Setup
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Inline plotting (if running in Jupyter)
%matplotlib inline

# Load data
df = pd.read_csv(r"/mnt/data/task5_titanic_eda/titanic_train_synthetic.csv")
df.head()

In [None]:
# Quick info and summary
df.info()  # prints its report directly and returns None, so wrapping it in display() would only add a stray "None"
display(df.describe(include='all'))

# Missing values per column, most affected first
display(df.isna().sum().sort_values(ascending=False))

## Missing Values
We'll inspect missingness and decide on simple imputation strategies for numerical columns like `Age`.
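
As a minimal sketch of that strategy, the snippet below reports the share of missing values per column and applies median imputation while keeping a flag for which rows were filled. The toy frame and the helper name `impute_median_with_flag` are assumptions made for illustration; they are not part of the dataset or the notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the notebook's `df` (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 35.0, np.nan, 58.0],
    "Fare": [7.25, 71.28, 8.05, 53.10, 26.55],
})

# Share of missing values per column, in percent.
na_pct = toy.isna().mean().mul(100).round(1)

def impute_median_with_flag(frame: pd.DataFrame, col: str) -> pd.DataFrame:
    """Return a copy with `<col>_missing` and `<col>_imputed` columns added."""
    out = frame.copy()
    out[f"{col}_missing"] = out[col].isna()                     # remember which rows were filled
    out[f"{col}_imputed"] = out[col].fillna(out[col].median())  # median ignores NaN by default
    return out

toy_imputed = impute_median_with_flag(toy, "Age")
print(na_pct)
print(toy_imputed)
```

Keeping the `_missing` flag makes it possible to check later whether missingness itself relates to survival, and to separate real ages from filled ones in plots.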


In [None]:
# Visualize missingness (simple bar)
na_counts = df.isna().sum().sort_values(ascending=False)
na_counts.plot(kind='bar', figsize=(10,4), title='Missing Values per Column')
plt.tight_layout()
plt.show()

# Simple imputation for Age to proceed with plots (median)
df['Age_imputed'] = df['Age'].fillna(df['Age'].median())

## Univariate Analysis

In [None]:
# Histograms for numeric columns
num_cols = ['Age_imputed','Fare','SibSp','Parch']
for col in num_cols:
    plt.figure(figsize=(6,4))
    plt.hist(df[col].dropna(), bins=30)
    plt.title(f'Histogram of {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

In [None]:
# Boxplots to check outliers
for col in ['Age_imputed','Fare']:
    plt.figure(figsize=(6,4))
    plt.boxplot(df[col].dropna())  # vertical by default; recent Matplotlib releases deprecate the explicit `vert` argument
    plt.title(f'Boxplot of {col}')
    plt.ylabel(col)
    plt.tight_layout()
    plt.show()
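
The points a boxplot draws beyond its whiskers follow the 1.5 × IQR rule (Matplotlib's default `whis=1.5`). To count them instead of reading them off the plot, the same rule can be sketched in pandas. The helper name `iqr_outlier_mask` and the fare values are hypothetical, and the quantile interpolation may differ marginally from what the plot uses.

```python
import pandas as pd

def iqr_outlier_mask(values: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR]; NaN entries are never flagged."""
    q1, q3 = values.quantile([0.25, 0.75])
    iqr = q3 - q1
    return (values < q1 - k * iqr) | (values > q3 + k * iqr)

# Made-up fares with one obviously extreme ticket.
fares = pd.Series([7.25, 8.05, 13.0, 26.0, 30.0, 512.33], name="Fare")
mask = iqr_outlier_mask(fares)
print(fares[mask])
```

On the notebook's data, `iqr_outlier_mask(df['Fare']).sum()` would give the number of flagged fares.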

In [None]:
# Categorical counts
cat_cols = ['Survived','Pclass','Sex','Embarked']
for col in cat_cols:
    plt.figure(figsize=(6,4))
    df[col].value_counts(dropna=False).plot(kind='bar')  # dropna=False keeps missing values visible as their own bar
    plt.title(f'Count of {col} categories')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.tight_layout()
    plt.show()

## Bivariate Analysis

In [None]:
# Survival rate by Sex and Pclass
ct1 = pd.crosstab(df['Sex'], df['Survived'], normalize='index') * 100
ct2 = pd.crosstab(df['Pclass'], df['Survived'], normalize='index') * 100
display(ct1.round(2))
display(ct2.round(2))

# Bar charts for survival by groups
for col in ['Sex', 'Pclass', 'Embarked']:
    rate = df.groupby(col)['Survived'].mean() * 100
    plt.figure(figsize=(6,4))
    rate.plot(kind='bar')
    plt.title(f'Survival Rate by {col} (%)')
    plt.ylabel('Survival Rate (%)')
    plt.tight_layout()
    plt.show()
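
The one-factor rates above can hide interactions between groups. As a hedged sketch, a pivot table gives the survival rate for each Sex × Pclass cell, and a matching crosstab shows how many passengers each rate rests on. The toy frame is invented for illustration.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df` (values are made up).
toy = pd.DataFrame({
    "Sex":      ["female", "female", "male", "male", "male", "female", "male", "male"],
    "Pclass":   [1, 3, 1, 3, 3, 3, 1, 3],
    "Survived": [1, 1, 1, 0, 0, 0, 0, 0],
})

# Mean of a 0/1 column is the survival rate; scale to percent.
rate = toy.pivot_table(index="Sex", columns="Pclass", values="Survived", aggfunc="mean") * 100

# Cell counts, so rates based on very few passengers can be read with caution.
counts = pd.crosstab(toy["Sex"], toy["Pclass"])

print(rate.round(1))
print(counts)
```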

In [None]:
# Correlation heatmap (numeric)
corr = df[['Survived','Pclass','Age_imputed','SibSp','Parch','Fare']].corr()
plt.figure(figsize=(6,5))
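# Fix the color scale to [-1, 1] with a diverging palette so sign and strength are readable at a glance
sns.heatmap(corr, annot=True, fmt='.2f', square=True, vmin=-1, vmax=1, center=0, cmap='coolwarm')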
plt.title('Correlation Heatmap')
plt.tight_layout()
plt.show()

## Multivariate Analysis

In [None]:
# Pairplot on a subset to visualize relationships
sns.pairplot(df[['Survived','Pclass','Age_imputed','Fare']], hue='Survived', diag_kind='hist')
plt.show()

## Observations (write-up)
- **Sex vs Survival:** Females show notably higher survival rates than males, mirroring the historical Titanic record.
- **Class vs Survival:** Survival rate falls with passenger class: 1st > 2nd > 3rd.
- **Fare:** Fare is mildly and positively associated with survival. Because fare is a proxy for class and amenities, this overlaps with the class effect and should not be read as an independent one.
- **Age:** Age has only a subtle relationship with survival overall, although the youngest and oldest passengers can show different patterns. Missing ages were imputed with the median to enable plotting, which adds an artificial spike at the median in the `Age_imputed` histogram.
- **Family (SibSp/Parch):** Passengers in small families tend to have slightly better outcomes than those in very large groups.
- **Outliers:** Fare shows a few high outliers, which is expected (a small number of premium tickets/cabins).
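
The family-size observation can be checked numerically rather than by eye. The sketch below derives a `FamilySize` feature (SibSp + Parch + 1 for the passenger) and compares survival across size groups. The feature name, the bin edges, and the toy values are all assumptions chosen for illustration, not results from the dataset.

```python
import pandas as pd

# Toy frame standing in for the notebook's `df` (values are made up).
toy = pd.DataFrame({
    "SibSp":    [0, 1, 0, 4, 1, 0, 5, 1],
    "Parch":    [0, 0, 0, 2, 2, 0, 2, 1],
    "Survived": [0, 1, 1, 0, 1, 0, 0, 1],
})

# Siblings/spouses + parents/children + the passenger themselves.
toy["FamilySize"] = toy["SibSp"] + toy["Parch"] + 1

# Hypothetical grouping: travelling alone, 2-4 people, 5 or more.
toy["FamilyGroup"] = pd.cut(
    toy["FamilySize"],
    bins=[0, 1, 4, float("inf")],
    labels=["alone", "small (2-4)", "large (5+)"],
)

grouped = toy.groupby("FamilyGroup", observed=False)["Survived"]
summary = pd.DataFrame({
    "n": grouped.count(),                  # group size, to judge how reliable each rate is
    "survival_pct": grouped.mean() * 100,  # survival rate in percent
})
print(summary)
```

Reporting `n` next to each rate matters here, since very large families are typically rare and their rates rest on few rows.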