ML Preprocessing Notebook
This notebook walks through data loading, exploration, preprocessing, and feature engineering on a classification dataset (Titanic).

1. Load Required Libraries

In [None]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

2. Load Dataset

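For simplicity, we use the Titanic dataset available through seaborn. `sns.load_dataset` downloads the file on first use and caches it locally, so the first run needs an internet connection.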

In [None]:
df = sns.load_dataset('titanic')
df.head()

In [None]:
df.to_csv('../data/raw/titanic_raw.csv', index=False)
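
`DataFrame.to_csv` does not create missing parent directories, so the cell above raises an `OSError` if `../data/raw` does not exist yet. A minimal sketch of a guard follows; `save_csv` is a hypothetical helper name, not part of the original notebook, and a temporary directory stands in for `../data/raw`:

```python
from pathlib import Path
import tempfile

import pandas as pd


def save_csv(frame: pd.DataFrame, path: str) -> Path:
    """Write `frame` to `path`, creating parent directories first.

    Hypothetical helper for illustration; not part of the notebook.
    """
    target = Path(path)
    # parents=True builds the whole chain; exist_ok=True makes reruns safe
    target.parent.mkdir(parents=True, exist_ok=True)
    frame.to_csv(target, index=False)
    return target


# Demo on a toy frame in a temporary directory (stand-in for ../data/raw).
toy = pd.DataFrame({"survived": [0, 1], "age": [22.0, 38.0]})
demo_dir = tempfile.mkdtemp()
written = save_csv(toy, f"{demo_dir}/data/raw/titanic_raw.csv")
reloaded = pd.read_csv(written)
```

In the notebook itself the call would be `save_csv(df, '../data/raw/titanic_raw.csv')`.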

3. Basic Data Overview

In [None]:
df.describe()

In [None]:
df.info()

In [None]:
df.isnull().sum()
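
Raw counts are easier to act on when shown next to the share of rows they represent. The sketch below builds such a report on a toy frame; `missing_report` is a hypothetical helper name, and the toy values are made up for illustration:

```python
import numpy as np
import pandas as pd


def missing_report(frame: pd.DataFrame) -> pd.DataFrame:
    """Return count and percentage of missing values per column, worst first."""
    counts = frame.isnull().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(frame) * 100).round(1),
    })
    # Keep only the columns that actually have gaps.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Toy data (made up): two missing ages, one missing port, no missing fares.
toy = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, np.nan],
    "embarked": ["S", "C", None, "S"],
    "fare": [7.25, 71.28, 8.05, 53.1],
})
report = missing_report(toy)
print(report)
```

Calling `missing_report(df)` on the Titanic frame should make it easier to judge which columns to impute and which to drop.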

4. Data Visualization

In [None]:
sns.countplot(data=df, x='sex', hue='survived')
plt.title('Survival by Sex')
plt.show()

sns.histplot(df['age'].dropna(), kde=True)
plt.title('Age Distribution')
plt.show()
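
The count plot shows absolute numbers, so a numeric summary can be a useful companion. Because `survived` is coded 0/1, its mean within a group is that group's survival rate. A minimal sketch on made-up toy data:

```python
import pandas as pd

# Toy data (made up): three women, two of whom survived; two men, none survived.
toy = pd.DataFrame({
    "sex": ["male", "female", "female", "male", "female"],
    "survived": [0, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the share of ones in each group.
survival_rate = toy.groupby("sex")["survived"].mean()
print(survival_rate)
```

On the notebook's data the equivalent call would be `df.groupby('sex')['survived'].mean()`.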

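5. Data Cleaning

Drop columns that are mostly empty or duplicate other columns: `deck` is missing for most passengers, `embark_town` repeats `embarked`, and `alive` repeats the target `survived`.

In [None]:
df = df.drop(columns=['deck', 'embark_town', 'alive'])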

Fill missing values in the dataset:
- Age: Replace missing values with median age
- Embarked: Replace missing values with most common (mode) value


In [None]:
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
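
To see why the median is a reasonable choice for `age`, and why `mode()` needs the `[0]`, here is a small sketch on made-up toy data. It mirrors the two `fillna` calls above but is not part of the original notebook:

```python
import numpy as np
import pandas as pd

# Toy data (made up): one missing age, one missing port, one outlier age.
toy = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 35.0, 80.0],
    "embarked": ["S", "C", "S", None, "S"],
})

age_median = toy["age"].median()       # 36.5, NaN is skipped by default
age_mean = toy["age"].mean()           # 43.75, pulled upward by the 80-year-old
port_mode = toy["embarked"].mode()[0]  # 'S'; mode() returns a Series, hence [0]

# assign() returns a new frame, leaving `toy` untouched for comparison.
filled = toy.assign(
    age=toy["age"].fillna(age_median),
    embarked=toy["embarked"].fillna(port_mode),
)
print(filled)
```

If several values tie for most frequent, `mode()` returns them sorted, so `[0]` picks the first in sort order.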

6. Feature Engineering

Convert categorical columns to numeric

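The cell below converts the categorical variables ('sex', 'embarked', 'class', 'who') into dummy/indicator variables using one-hot encoding. The first category of each variable is dropped to avoid perfect multicollinearity among the dummies (the "dummy variable trap"); the dropped category becomes the baseline.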

The transformed categorical columns are:
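- sex: `sex_male` (1 = male, 0 = female)
- embarked: `embarked_Q` and `embarked_S`, with port C as the baseline
- class: `class_Second` and `class_Third`, with First class as the baseline
- who: `who_man` and `who_woman`, with child as the baseline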

In [None]:
# dtype=int gives 0/1 columns; pandas 2.x would otherwise return True/False
df = pd.get_dummies(df, columns=['sex', 'embarked', 'class', 'who'], drop_first=True, dtype=int)

In [None]:
df.head()
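
The effect of `drop_first=True` is easiest to see on a tiny frame. The sketch below uses made-up toy rows. Passing `dtype=int` is a choice made here so the output shows 0/1, since recent pandas versions return booleans by default:

```python
import pandas as pd

# Toy data (made up), with two of the notebook's categorical columns.
toy = pd.DataFrame({
    "sex": ["male", "female", "female"],
    "embarked": ["S", "C", "Q"],
    "fare": [7.25, 71.28, 7.75],
})

# 'female' and 'C' sort first, so they are dropped and become the baselines.
encoded = pd.get_dummies(
    toy, columns=["sex", "embarked"], drop_first=True, dtype=int
)
print(encoded.columns.tolist())
# → ['fare', 'sex_male', 'embarked_Q', 'embarked_S']
```

A row with `embarked_Q = 0` and `embarked_S = 0` therefore means port C, so no information is lost by dropping the first category.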

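Create new feature: is_child (1 if the passenger is younger than 16, otherwise 0). Because missing ages were filled with the median earlier, passengers whose age was unknown are flagged as non-children.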

In [None]:
df['is_child'] = (df['age'] < 16).astype(int)
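
One thing that may be worth checking: in seaborn's copy of the dataset, the `who` column appears to label passengers under 16 as `child`. If so, `is_child` could carry the same information as the `who_man` and `who_woman` dummies, which are both zero for a child. The sketch below runs such a check on made-up toy data, assuming the column names produced by the encoding step. `duplicates_who` is a hypothetical helper, not part of the notebook:

```python
import pandas as pd


def duplicates_who(frame: pd.DataFrame) -> bool:
    """True if is_child equals 'neither man nor woman' on every row."""
    neither = 1 - frame["who_man"].astype(int) - frame["who_woman"].astype(int)
    return bool((frame["is_child"] == neither).all())


# Toy data (made up): two children, one man, one woman.
toy = pd.DataFrame({
    "age": [4.0, 30.0, 28.0, 15.0],
    "who_man": [0, 1, 0, 0],
    "who_woman": [0, 0, 1, 0],
})
toy["is_child"] = (toy["age"] < 16).astype(int)
redundant = duplicates_who(toy)  # True for this toy frame
```

If `duplicates_who(df)` returns True on the real data, the new column adds no information for a linear model. It would also reintroduce the collinearity that `drop_first=True` was meant to avoid, so dropping either `is_child` or the `who` dummies may be preferable.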

7. Final Check

In [None]:
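# Only the last expression in a cell is rendered automatically,
# so display() is needed to show the head before info() prints.
display(df.head())
df.info()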
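
Reading `df.info()` by eye works, but an explicit check can catch leftovers such as a column that still has gaps or is still text. A minimal sketch follows; `check_clean` is a hypothetical helper, and the toy frames are made up:

```python
import pandas as pd


def check_clean(frame: pd.DataFrame) -> list:
    """Return a list of problems found; an empty list means none were detected."""
    problems = []
    # Columns that still contain at least one missing value.
    missing = frame.columns[frame.isnull().any()].tolist()
    if missing:
        problems.append(f"columns with missing values: {missing}")
    # Columns that are neither numeric nor boolean (e.g. leftover strings).
    non_numeric = frame.select_dtypes(exclude=["number", "bool"]).columns.tolist()
    if non_numeric:
        problems.append(f"non-numeric columns: {non_numeric}")
    return problems


clean_toy = pd.DataFrame({"age": [22.0, 38.0], "sex_male": [1, 0], "alone": [True, False]})
dirty_toy = pd.DataFrame({"age": [22.0, None], "sex": ["male", "female"]})

clean_problems = check_clean(clean_toy)  # []
dirty_problems = check_clean(dirty_toy)  # one entry per kind of problem
```

In the notebook, `check_clean(df)` could be run just before saving, for example with `assert not check_clean(df)`.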

8. Save Cleaned Dataset

In [None]:
df.to_csv('../data/processed/titanic_preprocessed.csv', index=False)
print('Saved as titanic_preprocessed.csv')