# Colab Setup
1. Upload `Titanic-Dataset.csv` into Colab (left sidebar > Files > Upload).
2. Run the cells in order. Plots will render inline under each cell.


# CA6003 Assignment: Titanic Survival Analysis
## Group [X] - [Names]


## Table of Contents
1. Introduction & Research Question
2. Data Loading & Initial Profiling
3. Missing Value Analysis
4. Bias Detection & Quantification (6 types)
5. Data Quality Assessment
6. Exploratory Data Analysis
7. Analytical Fallacies Discussion
8. Data Preparation - Scenario A (Minimal)
9. Data Preparation - Scenario B (Full)
10. Feature Engineering
11. Machine Learning - Experimental Setup
12. Model Training & Evaluation
13. Fairness Analysis
14. Results Comparison & Key Insights
15. Conclusion & Limitations


---
**Section Author: [Member Name]**  
**Contribution:** Introduction & research question
---

## 1. Introduction
**Research Question:**  
How do data preparation choices affect survival prediction reliability and fairness?

**Dataset:** Titanic passenger data (891 passengers, 12 columns including the `Survived` target)

**Historical Context:** The RMS Titanic sank on April 15, 1912, during its maiden voyage.


In [None]:
# Imports
import os
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

RANDOM_SEED = 42
TEST_SIZE = 0.2


---
**Section Author: [Member Name]**  
**Contribution:** Data loading & initial profiling
---

## 2. Data Loading & Initial Profiling


In [None]:
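# Load dataset (path assumes the CSV was uploaded to the Colab session root)
csv_path = "/content/Titanic-Dataset.csv"
if not os.path.exists(csv_path):
    raise FileNotFoundError(f"{csv_path} not found; upload the CSV first (see Colab Setup).")
df = pd.read_csv(csv_path)

df.head()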


In [None]:
# Basic profiling: dimensions, column types, and missing-value counts
print("Shape (rows, columns):", df.shape)
print(df.dtypes)
print(df.isna().sum())


---
**Section Author: [Member Name]**  
**Contribution:** Missing value analysis
---

## 3. Missing Value Analysis


In [None]:
# Fraction of missing values per column (0 to 1), highest first
missing_pct = df.isna().mean().sort_values(ascending=False)
missing_pct
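Column-level missing rates do not show whether missingness depends on other variables. A minimal sketch of one such check follows: the missing rate of a column within each level of a grouping column. The helper name `missing_rate_by_group` and the toy rows are illustrative assumptions, not part of the assignment data.

```python
import pandas as pd

def missing_rate_by_group(frame, column, by):
    """Share of missing values in `column` within each level of `by`."""
    return frame[column].isna().groupby(frame[by]).mean()

# Toy rows (made up for illustration, NOT the real dataset)
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 3, 3, 3, 3],
    "Cabin": ["C85", "E46", None, "D10", None, None, None, "G6"],
})

rates = missing_rate_by_group(toy, "Cabin", "Pclass")
print(rates)
```

On the real data, the same call would be `missing_rate_by_group(df, "Cabin", "Pclass")`. Large differences between groups would suggest the values are not missing completely at random, although this check alone cannot establish MNAR.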


---
**Section Author: [Member Name]**  
**Contribution:** Bias detection & quantification
---

## 4. Bias Detection & Quantification (6 types)
Add calculations for six bias types: gender, class, age, MNAR (missing not at random) missingness, survivorship bias, and Simpson’s paradox.
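As a starting point for the gender and class calculations, one option is to quantify each bias as the gap between the highest and lowest group survival rate, and to probe Simpson’s paradox by comparing the aggregate rates with rates stratified by a second variable. The sketch below assumes that approach; `survival_rate_gap` is a hypothetical helper and the toy rows are made up.

```python
import pandas as pd

def survival_rate_gap(frame, group_col, target_col="Survived"):
    """Return per-group survival rates and the max-min gap between groups."""
    rates = frame.groupby(group_col)[target_col].mean()
    return rates, rates.max() - rates.min()

# Toy rows (made up for illustration, NOT the real dataset)
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "female", "male", "male", "male", "male"],
    "Pclass": [1, 1, 3, 3, 1, 1, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

rates, gap = survival_rate_gap(toy, "Sex")
print(rates)
print("Gap:", gap)

# Stratified view: survival rate by sex within each class.
# Simpson's paradox would show up as a direction that flips between
# the aggregate rates above and the within-class rates here.
stratified = toy.pivot_table(index="Pclass", columns="Sex", values="Survived", aggfunc="mean")
print(stratified)
```

In the toy rows the direction is the same in every stratum, so there is no reversal; whether the real data shows one has to be checked after running the cell on `df`.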


---
**Section Author: [Member Name]**  
**Contribution:** EDA
---

## 6. Exploratory Data Analysis


In [None]:
# Survival distribution
sns.countplot(x="Survived", data=df)
plt.title("Survival Distribution")
plt.show()


In [None]:
# Survival by sex
sns.countplot(x="Sex", hue="Survived", data=df)
plt.title("Survival by Sex")
plt.show()


In [None]:
# Survival by class
sns.countplot(x="Pclass", hue="Survived", data=df)
plt.title("Survival by Class")
plt.show()


---
**Section Author: [Member Name]**  
**Contribution:** Data preparation - minimal vs full
---

## 8. Data Preparation - Scenario A (Minimal)


In [None]:
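# Minimal preparation: listwise deletion plus integer encoding
# Note: dropna() runs before Cabin is dropped, so every row with a missing Cabin is removed too.
minimal = df.dropna().copy()
minimal["Sex"] = minimal["Sex"].map({"male": 0, "female": 1})
minimal["Embarked"] = minimal["Embarked"].map({"S": 0, "C": 1, "Q": 2})

# Drop identifiers and free-text columns that are not used as features
minimal = minimal.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

X_min = minimal.drop(columns=["Survived"])
y_min = minimal["Survived"].astype(int)

print(minimal.shape)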
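Because Scenario A relies on listwise deletion, it can help to report how many rows survive it, and how that changes if a sparsely populated column is dropped first. The sketch below is one way to do that; `retention_report` is a hypothetical helper and the toy rows are made up.

```python
import numpy as np
import pandas as pd

def retention_report(frame, drop_first=()):
    """Rows kept by listwise deletion, optionally after dropping some columns first."""
    reduced = frame.drop(columns=list(drop_first))
    kept = len(reduced.dropna())
    return {"rows_before": len(frame), "rows_kept": kept, "share_kept": kept / len(frame)}

# Toy rows (made up for illustration, NOT the real dataset)
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 30.0, 40.0, 50.0, 28.0],
    "Cabin": ["C85", None, None, None, "E46", None],
    "Embarked": ["S", "C", "S", "Q", "S", "S"],
})

as_is = retention_report(toy)                              # dropna() on all columns
cabin_dropped = retention_report(toy, drop_first=["Cabin"])  # drop Cabin, then dropna()
print(as_is)
print(cabin_dropped)
```

Running `retention_report(df)` and `retention_report(df, drop_first=["Cabin"])` on the real data would show how much of the sample Scenario A discards and which column drives the loss.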


## 9. Data Preparation - Scenario B (Full)


In [None]:
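# Full preparation: impute, engineer features, then one-hot encode
full = df.copy()

# Impute Embarked with the most frequent port and Age with the median of the passenger's class.
# Note: both statistics are computed on all rows, i.e. before the train/test split.
full["Embarked"] = full["Embarked"].fillna(full["Embarked"].mode()[0])
full["Age"] = full.groupby("Pclass")["Age"].transform(lambda x: x.fillna(x.median()))

# Cabin-derived features: deck letter and a flag for whether a cabin was recorded
full["Deck"] = full["Cabin"].str[0].fillna("Unknown")
full["Has_Cabin"] = full["Cabin"].notna().astype(int)

# Title is the first word in Name that is followed by a period (e.g. "Mr", "Mrs");
# expand=False returns a Series rather than a one-column DataFrame
full["Title"] = full["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

full["FamilySize"] = full["SibSp"] + full["Parch"] + 1
full["IsAlone"] = (full["FamilySize"] == 1).astype(int)

full["AgeGroup"] = pd.cut(full["Age"], bins=[0, 12, 18, 35, 60, 100], labels=["Child", "Teen", "Adult", "Middle", "Senior"])
full["Fare_Log"] = np.log1p(full["Fare"])

# Drop identifiers and free-text columns that are not used as features
full = full.drop(columns=["PassengerId", "Name", "Ticket", "Cabin"])

X_full = full.drop(columns=["Survived"])
y_full = full["Survived"].astype(int)

X_full = pd.get_dummies(X_full, columns=["Sex", "Embarked", "Title", "Deck", "AgeGroup"], drop_first=False)

print(full.shape, X_full.shape)  # before and after one-hot encoding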
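Scenario B computes the class medians for `Age` on all rows, so rows that later land in the test set influence the imputed values. A stricter alternative, sketched below under the assumption that the split happens first, learns the medians on the training rows only and applies them to both parts. `impute_age_by_class` is a hypothetical helper, and the toy frames are made up.

```python
import numpy as np
import pandas as pd

def impute_age_by_class(train, test, group_col="Pclass", age_col="Age"):
    """Fill missing ages with class medians learned from the training rows only."""
    train, test = train.copy(), test.copy()
    medians = train.groupby(group_col)[age_col].median()
    overall = train[age_col].median()  # fallback for classes unseen in training
    for part in (train, test):
        fill = part[group_col].map(medians).fillna(overall)
        part[age_col] = part[age_col].fillna(fill)
    return train, test

# Toy frames (made up for illustration, NOT the real dataset)
toy_train = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3], "Age": [40.0, 50.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"Pclass": [1, 3, 2], "Age": [np.nan, np.nan, np.nan]})

train_filled, test_filled = impute_age_by_class(toy_train, toy_test)
print(train_filled)
print(test_filled)
```

On a dataset this small the difference from the all-rows medians is likely minor, but a split-first version keeps the test set untouched by any fitted statistic.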


---
**Section Author: [Member Name]**  
**Contribution:** ML and evaluation
---

## 11. Machine Learning - Experimental Setup


In [None]:
# Train/test split and model helpers

def train_eval(X, y, model):
    """Fit `model` on a stratified train split and return test-set metrics as a dict."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=TEST_SIZE, random_state=RANDOM_SEED, stratify=y
    )
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    # Positive-class probabilities, when the model exposes them (needed for ROC AUC)
    prob = model.predict_proba(X_test)[:, 1] if hasattr(model, "predict_proba") else None

    metrics = {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }
    if prob is not None:
        metrics["roc_auc"] = roc_auc_score(y_test, prob)
    return metrics
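`train_eval` reports metrics from a single stratified split, which can be noisy on a dataset of this size. If a more stable estimate is wanted, one option is stratified k-fold cross-validation. The sketch below uses scikit-learn's `cross_validate` on synthetic data from `make_classification`; the synthetic `X_demo`/`y_demo` and the choice of 5 folds are assumptions for illustration only.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic stand-in for (X_min, y_min) or (X_full, y_full)
X_demo, y_demo = make_classification(n_samples=300, n_features=6, random_state=42)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_demo,
    y_demo,
    cv=cv,
    scoring=["accuracy", "f1", "roc_auc"],
)

# Mean and standard deviation of each test metric across the folds
summary = {
    name: (values.mean(), values.std())
    for name, values in scores.items()
    if name.startswith("test_")
}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.3f} +/- {std:.3f}")
```

Reporting the spread alongside the mean makes it easier to judge whether a difference between Scenario A and Scenario B exceeds split-to-split variation.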


In [None]:
# Logistic Regression
logreg = LogisticRegression(max_iter=1000, solver="liblinear")

# fit() re-estimates the model from scratch, so the same instance can be reused per scenario
metrics_min_logreg = train_eval(X_min, y_min, logreg)
metrics_full_logreg = train_eval(X_full, y_full, logreg)

metrics_min_logreg, metrics_full_logreg


In [None]:
# Decision Tree

tree = DecisionTreeClassifier(random_state=RANDOM_SEED, max_depth=5, min_samples_leaf=10)

metrics_min_tree = train_eval(X_min, y_min, tree)
metrics_full_tree = train_eval(X_full, y_full, tree)

metrics_min_tree, metrics_full_tree
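The metrics above are aggregates over the whole test set. For the fairness analysis by gender and class, the same metrics can be computed per group. The sketch below assumes the test labels, predictions, and the group label of each test row are available, which would require `train_eval` to return them as well; `group_metrics` is a hypothetical helper and the toy values are made up.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, recall_score

def group_metrics(y_true, y_pred, groups):
    """Per-group size, selection rate (share predicted positive), accuracy, and recall."""
    frame = pd.DataFrame({"y_true": list(y_true), "y_pred": list(y_pred), "group": list(groups)})
    rows = {}
    for name, part in frame.groupby("group"):
        rows[name] = {
            "n": len(part),
            "selection_rate": part["y_pred"].mean(),
            "accuracy": accuracy_score(part["y_true"], part["y_pred"]),
            "recall": recall_score(part["y_true"], part["y_pred"], zero_division=0),
        }
    return pd.DataFrame.from_dict(rows, orient="index")

# Toy labels and predictions (made up for illustration, NOT model output)
y_true = [1, 1, 0, 0, 1, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0]
sex = ["female"] * 4 + ["male"] * 4

fairness = group_metrics(y_true, y_pred, sex)
print(fairness)
```

In the toy example both groups have the same accuracy while recall differs sharply, which illustrates why a single aggregate metric can hide group-level disparities.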


---
**Section Author: [Member Name]**  
**Contribution:** Results and conclusions
---

## 14. Results Comparison & Key Insights
Fill in after running the notebook: report the model metrics for both scenarios and the fairness results by gender and class.

## 15. Conclusion & Limitations
Summarize the main findings and the constraints of the analysis.
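One way to report the metrics is a single table with one row per scenario and model. The sketch below assumes metric dicts shaped like those returned by `train_eval`; `results_table` is a hypothetical helper, and the numbers are placeholders, not results from this notebook.

```python
import pandas as pd

def results_table(named_metrics):
    """Build a table with one row per (scenario, model) and one column per metric."""
    rows = [
        {"scenario": scenario, "model": model, **metrics}
        for (scenario, model), metrics in named_metrics.items()
    ]
    return pd.DataFrame(rows).set_index(["scenario", "model"]).round(3)

# PLACEHOLDER values only; replace with the dicts produced by train_eval
placeholder = {
    ("A (minimal)", "logreg"): {"accuracy": 0.5, "f1": 0.5},
    ("B (full)", "logreg"): {"accuracy": 0.5, "f1": 0.5},
}

table = results_table(placeholder)
print(table)
```

With the real outputs, the input would map keys such as `("A (minimal)", "logreg")` to `metrics_min_logreg`, and likewise for the other three metric dicts.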
