ML Preprocessing Notebook
This notebook walks through data loading, exploration, preprocessing, and feature engineering on a classification dataset (Titanic).

1. Load Required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load Dataset

The Titanic dataset from seaborn contains data about the passengers of the RMS Titanic, which sank after hitting an iceberg in 1912. The dataset includes:

- survived: Did the passenger survive? (0 = No, 1 = Yes)
- pclass: Passenger class (1 = 1st class, 2 = 2nd class, 3 = 3rd class)
- sex: Gender of passenger
- age: Age of passenger
- sibsp: Number of siblings/spouses aboard
- parch: Number of parents/children aboard
- fare: Passenger fare
- embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)
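- class: Passenger class as a label (First, Second, Third); same information as pclass
- who: man, woman, or child
- adult_male: Whether the passenger was an adult male
- deck: Cabin deck
- embark_town: Town of embarkation; same information as embarked, spelled out
- alive: Survival as yes/no; same information as survived
- alone: Whether the passenger had no siblings, spouses, parents, or children aboard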

This dataset is commonly used for binary classification tasks to predict passenger survival.


In [None]:
df = sns.load_dataset('titanic')
df.head()

In [None]:
df.to_csv('../data/raw/titanic_raw.csv', index=False)
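
`to_csv` does not create missing parent directories, so the cell above raises an error if `../data/raw` does not exist yet. A minimal sketch of a guard, assuming a hypothetical helper `save_csv` (not part of the notebook) and a temporary directory standing in for the real path:

```python
import tempfile
from pathlib import Path

import pandas as pd


def save_csv(frame: pd.DataFrame, path) -> Path:
    """Hypothetical helper: create the parent directory, then write the CSV."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)  # no error if it already exists
    frame.to_csv(path, index=False)
    return path


# A toy frame and a temporary directory stand in for the real data and path.
toy = pd.DataFrame({"survived": [0, 1], "age": [22.0, 38.0]})
with tempfile.TemporaryDirectory() as tmp:
    out = save_csv(toy, Path(tmp) / "data" / "raw" / "toy.csv")
    reloaded = pd.read_csv(out)
```

The same guard would apply to the processed file saved at the end of the notebook.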

3. Basic Data Overview

`df.describe()` returns a statistical summary of the numerical columns in the dataset:
- count: number of non-null values
- mean: average value 
- std: standard deviation
- min: minimum value
- 25%: first quartile
- 50%: median
- 75%: third quartile  
- max: maximum value

These statistics help understand the distribution and range of numeric variables in the dataset.


In [None]:
df.describe()
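
By default `describe()` summarizes only numeric columns, so text columns such as `sex` and `embarked` are left out. A small sketch on a made-up toy frame (not the Titanic data) of how `include="all"` brings them in:

```python
import pandas as pd

# Made-up toy data, only to illustrate the call.
toy = pd.DataFrame({
    "age": [22.0, 38.0, 26.0, None],
    "sex": ["male", "female", "female", "male"],
})

numeric_summary = toy.describe()            # numeric columns only
full_summary = toy.describe(include="all")  # adds unique/top/freq rows for text columns
print(full_summary)
```

In the combined summary, statistics that do not apply to a column (for example `mean` for `sex`) show up as NaN.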

`df.info()` provides a concise summary of the DataFrame, showing:

- Total number of entries (rows)
- Column names and their data types
- Number of non-null values in each column
- Memory usage

This is useful for:
- Spotting missing values (a non-null count lower than the total number of rows)
- Checking the data type of each column (numeric, object, category, etc.)
- Gauging the size of the dataset and its memory footprint


In [None]:
df.info()

In [None]:
df.isnull().sum()
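
Raw counts are easier to judge as a share of all rows. A minimal sketch on a made-up toy frame: `isnull().mean()` gives the fraction missing per column because each `True` counts as 1.

```python
import pandas as pd

# Made-up toy data, only to illustrate the calculation.
toy = pd.DataFrame({
    "age": [22.0, None, 26.0, None],
    "deck": [None, "C", None, None],
    "fare": [7.25, 71.28, 7.92, 8.05],
})

missing = pd.DataFrame({
    "n_missing": toy.isnull().sum(),
    "pct_missing": toy.isnull().mean() * 100,
}).sort_values("pct_missing", ascending=False)
print(missing)
```

Sorting by the percentage puts the columns that need the most attention at the top.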

4. Data Visualization

In [None]:
sns.countplot(data=df, x='sex', hue='survived')
plt.title('Survival by Sex')
plt.show()

sns.histplot(df['age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()
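
The count plot can be backed by numbers: since `survived` is coded 0/1, its mean within a group is that group's survival rate. A sketch on a made-up toy frame (the values are illustrative, not taken from the Titanic data):

```python
import pandas as pd

# Made-up toy data, only to illustrate the groupby.
toy = pd.DataFrame({
    "sex": ["male", "male", "female", "female", "male"],
    "survived": [0, 1, 1, 1, 0],
})

# Mean of a 0/1 column per group = share of survivors in that group.
survival_rate = toy.groupby("sex")["survived"].mean()
print(survival_rate)
```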

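5. Data Cleaning

Drop columns that are mostly empty or that duplicate other columns: deck is missing for most passengers, embark_town repeats embarked, and alive repeats the target survived.

In [None]:
df = df.drop(columns=['deck', 'embark_town', 'alive'])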

Fill the remaining missing values:
- age: replace missing values with the median age
- embarked: replace missing values with the most common value (the mode)


In [None]:
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
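
On a made-up toy frame the two fills look like this. `mode()` returns a Series because several values can tie for most frequent, so `[0]` picks the first of them.

```python
import pandas as pd

# Made-up toy data, only to illustrate the two fills.
toy = pd.DataFrame({
    "age": [20.0, None, 30.0, 40.0],
    "embarked": ["S", "C", None, "S"],
})

toy["age"] = toy["age"].fillna(toy["age"].median())  # median of 20, 30, 40 is 30
toy["embarked"] = toy["embarked"].fillna(toy["embarked"].mode()[0])  # most common is "S"
print(toy)
```

If the data is later split into training and test sets, computing the median and mode on the training split only keeps test information out of the fill values.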

6. Feature Engineering

Convert categorical columns to numeric

The cell below converts the categorical variables ('sex', 'embarked', 'class', 'who') into dummy/indicator variables using one-hot encoding and drops the first category of each variable. Dropping one category avoids perfect multicollinearity (the dummy variable trap): with every category kept, the indicator columns of a variable always sum to 1, so any one of them can be derived from the others.

For example, for the 'sex' variable:
- Without dropping: the Female (0/1) and Male (0/1) columns are perfectly negatively correlated (when Female=1, Male=0 and vice versa)
- With dropping: a single column remains (Male = 1, Female = 0), which carries the same information

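The resulting columns:
- sex: sex_male (1 = male, 0 = female)
- embarked: embarked_Q and embarked_S, with C as the reference
- class: class_Second and class_Third, with First as the reference
- who: who_man and who_woman, with child as the reference

The dropped reference category serves as the baseline: a row with all of a variable's indicator columns at 0 belongs to that category. In pandas 2.0 and later the indicator columns are boolean (True/False) unless `dtype=int` is passed to `get_dummies`.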


In [None]:
df = pd.get_dummies(df, columns=['sex', 'embarked', 'class', 'who'], drop_first=True)
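
What `drop_first=True` does is easier to see on a made-up toy frame. `dtype=int` is passed here only so the output shows 0/1; recent pandas versions otherwise return boolean columns.

```python
import pandas as pd

# Made-up toy data, only to compare the two encodings.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
})

full = pd.get_dummies(toy, columns=["sex", "embarked"], dtype=int)
reduced = pd.get_dummies(toy, columns=["sex", "embarked"], drop_first=True, dtype=int)

print(list(full.columns))     # one column per category
print(list(reduced.columns))  # first category of each variable dropped
```

In the full encoding, `sex_female + sex_male` equals 1 in every row, which is the redundancy that dropping the first category removes.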

In [None]:
df.head()

Create new feature: is_child (1 if age is below 16, otherwise 0)

In [None]:
df['is_child'] = (df['age'] < 16).astype(int)
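
Because this flag is built after the median fill, rows whose age was missing take the median age, and whenever that median is 16 or above they are flagged as non-children. A sketch on made-up ages (not the Titanic data) showing both the threshold and the effect of the fill:

```python
import pandas as pd

# Made-up ages; the None stands for a passenger with unknown age.
toy = pd.DataFrame({"age": [4.0, 15.0, 16.0, None, 30.0, 40.0, 50.0]})

toy["age"] = toy["age"].fillna(toy["age"].median())  # median of the six known ages is 23
toy["is_child"] = (toy["age"] < 16).astype(int)      # strict comparison: 16 itself is not a child
print(toy)
```

If unknown ages should stay distinguishable, one option is to record a missing-age indicator before filling.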

7. Final Check

In [None]:
print(df.head())  # printed explicitly: a cell only displays its last expression
df.info()
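
The final check can also be made explicit with assertions, so the notebook fails loudly if a later change reintroduces gaps or text columns. A sketch, assuming a hypothetical helper `check_clean` (not part of the notebook) and a toy frame:

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> None:
    """Hypothetical helper: raise if values are missing or a column is non-numeric."""
    n_missing = int(frame.isnull().sum().sum())
    assert n_missing == 0, f"{n_missing} missing values remain"
    # Boolean indicator columns are accepted alongside numeric ones.
    non_numeric = frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    assert not non_numeric, f"non-numeric columns remain: {non_numeric}"


# Made-up toy frame standing in for the preprocessed data.
toy = pd.DataFrame({"age": [22.0, 38.0], "sex_male": [True, False]})
check_clean(toy)  # passes silently
```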

8. Save Cleaned Dataset

In [None]:
df.to_csv('../data/processed/titanic_preprocessed.csv', index=False)
print('Saved as titanic_preprocessed.csv')