In [None]:
%matplotlib inline

# Classification Analysis: Titanic Dataset
**Name:** Katie  
**Date:** April 6, 2025  

## Introduction  
The objective of this project is to build a machine learning model to predict survival on the Titanic. The dataset contains demographic and travel-related features, and the target variable is whether or not a passenger survived. We will explore the data, engineer features, train classification models, and compare their performance.


In [None]:
# 0. Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (
    accuracy_score,
    classification_report,
    confusion_matrix
)

# Plot style
sns.set_theme(style="whitegrid")

## 1. Import and Inspect the Data

In [None]:
# Load the training dataset
df = pd.read_csv("data/train.csv")

# Display first 10 rows
df.head(10)

In [None]:
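# Count missing values per column
# (printed explicitly: a notebook cell only displays its last expression)
print(df.isnull().sum())

# Summary statistics for numeric and categorical columns
df.describe(include='all')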

### Reflection 1:
The dataset contains a mix of categorical (e.g., `Sex`, `Embarked`) and numerical (e.g., `Age`, `Fare`) features. There are missing values in the `Age`, `Cabin`, and `Embarked` columns. The `Cabin` column has a lot of missing values and may need to be dropped. Overall, the dataset is relatively clean but needs some preprocessing.
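To decide whether a column such as `Cabin` is worth keeping, it can help to look at the share of missing values rather than the raw counts. The sketch below uses a small made-up frame (`toy`) in place of the real data, and the 50% cut-off is an arbitrary assumption for illustration, not a rule.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Titanic training frame (values are made up).
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0, np.nan],
    "Cabin": [np.nan, "C85", np.nan, np.nan, np.nan],
    "Embarked": ["S", "C", "S", np.nan, "Q"],
    "Fare": [7.25, 71.28, 7.92, 53.10, 8.05],
})

# Fraction of missing values per column, largest first.
missing_share = toy.isnull().mean().sort_values(ascending=False)
print(missing_share)

# Columns above a hypothetical 50% threshold are candidates for dropping.
drop_candidates = missing_share[missing_share > 0.5].index.tolist()
print(drop_candidates)
```

On the real data, the same two lines can be applied to `df` directly.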

## 2. Data Exploration and Preparation

In [None]:
# Histogram of Age
plt.figure(figsize=(8,4))
sns.histplot(df['Age'].dropna(), kde=True)
plt.title("Distribution of Age")
plt.show()

# Boxplot of Fare
plt.figure(figsize=(8,4))
sns.boxplot(x='Fare', data=df)
plt.title("Boxplot of Fare")
plt.show()

# Count plot for Survived
sns.countplot(x='Survived', data=df)
plt.title("Survival Count (Target Variable)")
plt.xticks([0, 1], ['Did Not Survive', 'Survived'])
plt.show()

# Count plot for Pclass
sns.countplot(x='Pclass', data=df)
plt.title("Passenger Class Distribution")
plt.show()

# Count plot for Sex
sns.countplot(x='Sex', data=df)
plt.title("Passenger Sex Distribution")
plt.show()

### Reflection 2:
The `Age` column is roughly bell-shaped but contains some missing values. The `Fare` column is right-skewed, with a long tail of high outliers. The target variable is imbalanced: more passengers died than survived. Most passengers travelled in 3rd class, and male passengers outnumber female passengers. These patterns will be important during feature selection and modeling.
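The visual impressions above can be backed by two quick numbers: the class shares of the target and the skewness of `Fare`. This is a minimal sketch on made-up values (`toy` is a hypothetical stand-in for `df`); only the pandas calls carry over to the real data.

```python
import pandas as pd

# Toy stand-in for the Titanic frame (values are made up).
toy = pd.DataFrame({
    "Survived": [0, 0, 0, 1, 1, 0, 1, 0],
    "Fare": [7.25, 8.05, 7.90, 71.30, 53.10, 8.50, 263.00, 13.00],
})

# Share of each class in the target; unequal shares indicate imbalance.
class_share = toy["Survived"].value_counts(normalize=True)
print(class_share)

# Sample skewness; a clearly positive value indicates a long right tail.
fare_skew = toy["Fare"].skew()
print(fare_skew)
```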

In [None]:
# Check again which columns have missing values
print(df.isnull().sum())

# Drop Cabin (too many missing values)
df = df.drop(columns='Cabin')

# Fill missing Age values with the median
# (assigning back avoids pandas' chained-assignment warning with inplace=True)
df['Age'] = df['Age'].fillna(df['Age'].median())

# Fill missing Embarked values with the mode
df['Embarked'] = df['Embarked'].fillna(df['Embarked'].mode()[0])

# Double-check missing values now
df.isnull().sum()

In [None]:
# Convert Sex to numerical (male=0, female=1)
df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

# One-hot encode Embarked (C, Q, S)
df = pd.get_dummies(df, columns=['Embarked'], drop_first=True)

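# Scale Age and Fare
# Note: the scaler is fit on the full dataset here, so test-set statistics
# influence the scaling; fitting on the training split only avoids this leak.
scaler = StandardScaler()
df[['Age', 'Fare']] = scaler.fit_transform(df[['Age', 'Fare']])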

In [None]:
# Create new feature: FamilySize = SibSp + Parch + 1 (self)
df['FamilySize'] = df['SibSp'] + df['Parch'] + 1

# Drop unused columns
df.drop(['Name', 'Ticket', 'PassengerId'], axis=1, inplace=True)

### Reflection 2 (continued):
I dropped the `Cabin` column due to a large number of missing values. `Age` was filled with the median and `Embarked` with the mode. I converted `Sex` to numerical values and applied one-hot encoding to `Embarked`. I standardized `Age` and `Fare` to zero mean and unit variance, created `FamilySize` from `SibSp` and `Parch`, and dropped `Name`, `Ticket`, and `PassengerId`, which are not used as features. These steps clean the dataset and prepare it for model training.
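One alternative worth considering is to wrap imputation and scaling in a scikit-learn `Pipeline`, so the median and the scaling statistics are learned from the training split only. The sketch below is not the approach used in this notebook; it runs on synthetic data (`toy`, with a made-up survival rule) purely to show the wiring.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the Titanic frame (all values are made up).
rng = np.random.default_rng(0)
n = 200
toy = pd.DataFrame({
    "Age": rng.normal(30, 12, n),
    "Fare": rng.exponential(30, n),
    "Sex": rng.integers(0, 2, n),
})
toy.loc[rng.choice(n, 30, replace=False), "Age"] = np.nan
toy["Survived"] = (toy["Sex"] + rng.normal(0, 0.5, n) > 0.5).astype(int)

X = toy.drop(columns="Survived")
y = toy["Survived"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Impute and scale the numeric columns; pass Sex through unchanged.
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])
preprocess = ColumnTransformer(
    [("num", numeric, ["Age", "Fare"])], remainder="passthrough"
)

# fit() learns the median and scaling statistics from X_train only.
clf = Pipeline([
    ("prep", preprocess),
    ("model", LogisticRegression(max_iter=1000)),
])
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)
print("Test accuracy:", test_accuracy)
```

Because the preprocessing lives inside the estimator, the same object can later be passed to cross-validation without refitting the scaler by hand.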

## 3. Feature Selection and Justification

In [None]:
# Define target variable
y = df['Survived']

# Define feature variables
X = df.drop('Survived', axis=1)

### Reflection 3:
I selected all cleaned and engineered features (excluding the target `Survived`) for `X`. These include `Pclass`, `Sex`, `Age`, `Fare`, `SibSp`, `Parch`, `FamilySize`, and the one-hot encoded `Embarked` columns. These features provide information about passenger demographics, socioeconomic status, and group size, all of which may influence survival. Note that `FamilySize` is a linear combination of `SibSp` and `Parch`, so these three columns are redundant for a linear model. I believe this set captures a well-rounded view of the passengers.

## 4. Train a Model: Logistic Regression

In [None]:
# Split data (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Initialize and train logistic regression
model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Confusion matrix, classification report, and accuracy
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))
print("Accuracy:", accuracy_score(y_test, y_pred))

### Reflection 4:
The logistic regression model performed reasonably well, with accuracy around 80%. Precision and recall varied slightly between classes. It seems the model is slightly better at predicting non-survivors than survivors, which might reflect the class imbalance. Overall, it's a solid baseline model for this classification task.
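To make the per-class comparison concrete, the sketch below derives precision and recall for each class directly from a confusion matrix. The labels are hypothetical and chosen only so the arithmetic is easy to follow; they are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Hypothetical labels: 6 non-survivors (0) and 4 survivors (1).
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_hat = np.array([0, 0, 0, 0, 0, 1, 1, 1, 0, 0])

# For binary labels, ravel() yields the cells in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

# Class 1 (survived)
precision_1 = tp / (tp + fp)  # 2 / 3
recall_1 = tp / (tp + fn)     # 2 / 4

# Class 0 (did not survive)
precision_0 = tn / (tn + fn)  # 5 / 7
recall_0 = tn / (tn + fp)     # 5 / 6

print(f"Survived:         precision={precision_1:.2f}, recall={recall_1:.2f}")
print(f"Did not survive:  precision={precision_0:.2f}, recall={recall_0:.2f}")
```

In this toy case, recall is higher for the majority class, which is the same pattern described in the reflection above.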

## 5. Improve the Model: Try Random Forest

In [None]:
# Train Random Forest model
rf_model = RandomForestClassifier(random_state=42)
rf_model.fit(X_train, y_train)

# Predict and evaluate
rf_preds = rf_model.predict(X_test)

print("Confusion Matrix:\n", confusion_matrix(y_test, rf_preds))
print("\nClassification Report:\n", classification_report(y_test, rf_preds))
print("Accuracy:", accuracy_score(y_test, rf_preds))

### Reflection 5:
The Random Forest model performed better than logistic regression on this test split, with improved recall and accuracy. This is likely because it can capture non-linear relationships and feature interactions that logistic regression cannot model without explicit interaction terms.

## 6. Final Thoughts & Insights

### 6.1 Summary of Findings:
The models performed reasonably well, with Random Forest outperforming logistic regression. Key predictors were `Sex`, `Pclass`, and `Fare`.
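One way to check which features a Random Forest relies on is its `feature_importances_` attribute. The sketch below runs on synthetic data with a made-up survival rule, so the printed ranking says nothing about the real Titanic data; on the notebook's objects, the same `pd.Series(...)` line would use `rf_model` and `X.columns`.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for the feature matrix (all values are made up).
rng = np.random.default_rng(42)
n = 300
X_toy = pd.DataFrame({
    "Sex": rng.integers(0, 2, n),
    "Pclass": rng.integers(1, 4, n),
    "Fare": rng.exponential(30, n),
})

# Hypothetical rule: the label follows Sex, with about 10% of labels flipped.
flip = rng.random(n) < 0.1
y_toy = np.where(flip, 1 - X_toy["Sex"], X_toy["Sex"])

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_toy, y_toy)

# Impurity-based importances sum to 1; sort to get a ranking.
importances = pd.Series(
    rf.feature_importances_, index=X_toy.columns
).sort_values(ascending=False)
print(importances)
```

Impurity-based importances can favor continuous or high-cardinality features, so `sklearn.inspection.permutation_importance` on the test split is a useful cross-check.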

### 6.2 Challenges Faced:
Handling missing values, selecting features, and interpreting evaluation metrics were challenging but educational.

### 6.3 What I'd Do With More Time:
I’d try hyperparameter tuning, cross-validation, and more advanced models like XGBoost.
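As a rough idea of what those next steps could look like, the sketch below combines stratified 5-fold cross-validation with a small grid search over two Random Forest hyperparameters. The data comes from `make_classification` as a placeholder, and the grid values are arbitrary assumptions rather than tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score

# Placeholder data; in the notebook this would be X and y.
X_demo, y_demo = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=42
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# Cross-validated accuracy of an untuned forest.
baseline_scores = cross_val_score(
    RandomForestClassifier(n_estimators=50, random_state=42),
    X_demo, y_demo, cv=cv, scoring="accuracy",
)
print("Baseline CV accuracy:", baseline_scores.mean())

# Small, arbitrary grid for illustration.
param_grid = {"max_depth": [3, 5, None], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=42),
    param_grid, cv=cv, scoring="accuracy",
)
search.fit(X_demo, y_demo)
print("Best params:", search.best_params_)
print("Best CV accuracy:", search.best_score_)
```

Averaging over folds gives a steadier estimate than a single 80/20 split, which matters when two models differ by only a few percentage points.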

### Reflection 6:
This project gave me hands-on practice with the full ML pipeline, from raw data to models and reflection. It also reinforced how important data cleaning and thoughtful modeling choices are.