# 🛠️ Comprehensive Data Preprocessing: Titanic Dataset

In this notebook, we’ll apply **real-world preprocessing steps** to the Titanic dataset, a classic but imperfect dataset ideal for teaching.

---

### 📋 Steps Covered:
1. Data Cleaning
2. Data Transformation
3. Feature Engineering
4. Outlier Detection & Treatment
5. Data Splitting
6. Dataset-Specific Preprocessing
7. Target Variable Preprocessing


In [39]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

In [40]:
# Load Titanic dataset from seaborn
df = sns.load_dataset('titanic')
df.head()

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,,Southampton,no,True


## 1. 🧼 Data Cleaning

We clean the dataset by:
- Checking for missing values
- Removing duplicates
- Standardizing column names


In [41]:
# Check missing values
df.isnull().sum().sort_values(ascending=False)


deck           688
age            177
embarked         2
embark_town      2
sex              0
pclass           0
survived         0
fare             0
parch            0
sibsp            0
class            0
adult_male       0
who              0
alive            0
alone            0
dtype: int64

In [42]:
# Drop columns with too many missing values (e.g., 'deck')
df = df.drop(columns=['deck'])

# Fill 'age' with median and 'embarked' with mode
df['age'] = df['age'].fillna(df['age'].median())
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])

# Drop any remaining nulls (the 2 rows still missing 'embark_town')
df = df.dropna()

# Drop duplicates
df = df.drop_duplicates()

# Clean column names
df.columns = [col.lower().replace(' ', '_') for col in df.columns]
df.head()


Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.25,S,Third,man,True,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.925,S,Third,woman,False,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1,S,First,woman,False,Southampton,yes,False
4,0,3,male,35.0,0,0,8.05,S,Third,man,True,Southampton,no,True


## 2. 📊 Data Transformation

We scale numerical features and encode categorical ones.


### **StandardScaler**

The **StandardScaler** standardizes features by removing the mean and scaling to unit variance.

$$
z = \frac{x - \mu}{\sigma}
$$

Where:
- $ x $: original value  
- $ \mu $: mean of the feature  
- $ \sigma $: standard deviation of the feature  

**Example:**

If we have values $[10, 20, 30]$:

$$
\mu = 20, \quad \sigma = 8.165
$$

$$
[10, 20, 30] \rightarrow [-1.225, 0, 1.225]
$$

**Effect:**
- Mean becomes 0  
- Standard deviation becomes 1  
- Can produce negative values  
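
The arithmetic in the example can be checked by hand. The sketch below uses only the Python standard library and assumes the population standard deviation (dividing by $n$), which is what the $\sigma = 8.165$ figure corresponds to; the variable names are illustrative.

```python
from statistics import fmean, pstdev

values = [10, 20, 30]

# Population standard deviation (divide by n, not n - 1).
mu = fmean(values)
sigma = pstdev(values)

# Apply z = (x - mu) / sigma to every value.
z_scores = [(x - mu) / sigma for x in values]

print(round(sigma, 3))                   # 8.165
print([round(z, 3) for z in z_scores])   # [-1.225, 0.0, 1.225]
```

Note that the sample standard deviation (dividing by $n - 1$) would give 10.0 here instead, so hand calculations can differ slightly depending on which convention is used.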

### **MinMaxScaler**

The **MinMaxScaler** rescales features to a fixed range, usually $ [0, 1] $.

$$
x' = \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}}
$$

Where:
- $ x $: original value  
- $ x_{\text{min}} $: minimum of the feature  
- $ x_{\text{max}} $: maximum of the feature  

**Example:**

If we have values $[10, 20, 30]$:

$$
x_{\text{min}} = 10, \quad x_{\text{max}} = 30
$$

$$
[10, 20, 30] \rightarrow [0, 0.5, 1]
$$

**Effect:**
- Scales all features to a bounded range  
- Sensitive to outliers  
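
A minimal sketch of the same formula in plain Python, followed by a quick look at the outlier sensitivity mentioned above. The helper `min_max_scale` is a hypothetical name for illustration, and it assumes the values are not all identical (otherwise the denominator is zero).

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

print(min_max_scale([10, 20, 30]))   # [0.0, 0.5, 1.0]

# One extreme value becomes the new maximum and squeezes
# the remaining points into a narrow band near 0.
with_outlier = min_max_scale([10, 20, 30, 1000])
print([round(v, 3) for v in with_outlier])
```

In the second call the first three values all land below 0.03, which is why capping or removing outliers before min-max scaling is often worth considering.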

### **Comparison Table**

| Feature | StandardScaler | MinMaxScaler |
|----------|----------------|---------------|
| Formula | $ \frac{x - \mu}{\sigma} $ | $ \frac{x - x_{\text{min}}}{x_{\text{max}} - x_{\text{min}}} $ |
| Output Range | Unbounded, centered on 0 | [0, 1] (default) |
| Affected by Outliers | Yes, but less severely | Yes, strongly (min and max define the range) |
| Keeps Distribution Shape | ✅ Yes | ✅ Yes |

In [43]:
from sklearn.preprocessing import StandardScaler

# Scale 'age' and 'fare'
scaler = StandardScaler()
df[['age_scaled', 'fare_scaled']] = scaler.fit_transform(df[['age', 'fare']])
df[['age_scaled', 'fare_scaled']].head()


Unnamed: 0,age_scaled,fare_scaled
0,-0.548619,-0.525112
1,0.61736,0.697085
2,-0.257124,-0.512228
3,0.398739,0.350022
4,0.398739,-0.509842


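### ❓ What Is One-Hot Encoding?

**Q: What is one-hot encoding, and why use it?**  
**A:** One-hot encoding replaces a categorical column with one binary column per category, so the model can use it as numeric input.

**Q: Why not assign a number to each value?**  
**A:** Integer codes such as C=0, Q=1, S=2 imply an order and a distance between categories that do not exist. This can mislead models that treat inputs as magnitudes (for example, linear models).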
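
To make the difference between integer codes and one-hot columns concrete, here is a small sketch on a made-up `embarked`-style column. The four sample values are illustrative, not taken from the dataset.

```python
import pandas as pd

ports = pd.Series(["S", "C", "Q", "S"], name="embarked")

# Integer codes are compact but imply C < Q < S,
# an ordering the ports do not actually have.
codes = ports.astype("category").cat.codes
print(codes.tolist())             # [2, 0, 1, 2]

# One-hot encoding: one indicator column per category, no implied order.
one_hot = pd.get_dummies(ports, prefix="embarked")
print(one_hot.columns.tolist())   # ['embarked_C', 'embarked_Q', 'embarked_S']

# drop_first=True removes the first column; its value can be
# recovered from the others (all remaining columns False means 'C').
reduced = pd.get_dummies(ports, prefix="embarked", drop_first=True)
print(reduced.columns.tolist())   # ['embarked_Q', 'embarked_S']
```

Dropping the first level is mainly useful for linear models, where a full set of dummy columns is perfectly collinear with the intercept.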

In [44]:
# Keep an un-encoded copy for the drop_first=True comparison in the next cell
df_unencoded = df.copy()
# One-hot encode every categorical column, keeping all dummy columns
# (this includes 'alive', which duplicates the target 'survived')
df = pd.get_dummies(df, drop_first=False)
df.head()


Unnamed: 0,survived,pclass,age,sibsp,parch,fare,adult_male,alone,age_scaled,fare_scaled,...,class_Second,class_Third,who_child,who_man,who_woman,embark_town_Cherbourg,embark_town_Queenstown,embark_town_Southampton,alive_no,alive_yes
0,0,3,22.0,1,0,7.25,True,False,-0.548619,-0.525112,...,False,True,False,True,False,False,False,True,True,False
1,1,1,38.0,1,0,71.2833,False,False,0.61736,0.697085,...,False,False,False,False,True,True,False,False,False,True
2,1,3,26.0,0,0,7.925,False,True,-0.257124,-0.512228,...,False,True,False,False,True,False,False,True,False,True
3,1,1,35.0,1,0,53.1,False,False,0.398739,0.350022,...,False,False,False,False,True,False,False,True,False,True
4,0,3,35.0,0,0,8.05,True,True,0.398739,-0.509842,...,False,True,False,True,False,False,False,True,True,False


In [45]:
# One-hot encode selected columns only, dropping the first level of each to avoid redundant columns
df_drop_first = pd.get_dummies(df_unencoded, columns=['sex', 'embarked', 'class'], drop_first=True)
df_drop_first.head()

Unnamed: 0,survived,pclass,age,sibsp,parch,fare,who,adult_male,embark_town,alive,alone,age_scaled,fare_scaled,sex_male,embarked_Q,embarked_S,class_Second,class_Third
0,0,3,22.0,1,0,7.25,man,True,Southampton,no,False,-0.548619,-0.525112,True,False,True,False,True
1,1,1,38.0,1,0,71.2833,woman,False,Cherbourg,yes,False,0.61736,0.697085,False,False,False,False,False
2,1,3,26.0,0,0,7.925,woman,False,Southampton,yes,True,-0.257124,-0.512228,False,False,True,False,True
3,1,1,35.0,1,0,53.1,woman,False,Southampton,yes,False,0.398739,0.350022,False,False,True,False,False
4,0,3,35.0,0,0,8.05,man,True,Southampton,no,True,0.398739,-0.509842,True,False,True,False,True


## 3. 🧪 Feature Engineering

We create new features that may improve model performance.


In [46]:
# Create family size feature
df['family_size'] = df['sibsp'] + df['parch'] + 1

# Is child feature
df['is_child'] = df['age'] < 16

df[['family_size', 'is_child']].head()


Unnamed: 0,family_size,is_child
0,2,False
1,2,False
2,1,False
3,2,False
4,1,False


## 4. 🕵️ Outlier Detection & Treatment

We detect and handle outliers using the IQR method.


In [47]:
# Use IQR to cap outliers in 'fare'
# (note: 'fare_scaled' was computed earlier from the uncapped values)
Q1 = df['fare'].quantile(0.25)
Q3 = df['fare'].quantile(0.75)
IQR = Q3 - Q1
lower = Q1 - 1.5 * IQR
upper = Q3 + 1.5 * IQR
df['fare'] = np.clip(df['fare'], lower, upper)
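
To see what the capping does, here is a sketch of the same IQR rule on a handful of made-up fares. The values are illustrative; only the last one lies above the upper fence.

```python
import numpy as np

# Illustrative fares: five ordinary values and one extreme value.
fares = np.array([7.25, 8.05, 13.0, 26.0, 30.0, 512.33])

# Quartiles and interquartile range.
q1, q3 = np.percentile(fares, [25, 75])
iqr = q3 - q1

# Fences at 1.5 * IQR beyond the quartiles.
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Values outside the fences are pulled back to the nearest fence.
capped = np.clip(fares, lower, upper)

print(lower, upper)
print(capped)
```

Capping keeps every row, unlike dropping outliers, at the cost of flattening the tail of the distribution.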


## 5. 📁 Data Splitting

We split the dataset into training and testing sets.


### ❓ What is Train Data and Test Data?

**Q: What is train data?**  
**A:** Train data is the portion of the dataset used to **teach** the machine learning model how to make predictions. The model learns patterns from this data.

**Q: What is test data?**  
**A:** Test data is a **separate portion** of the dataset that is **not shown to the model during training**. It is used to check how well the model performs on new, unseen data.

---

### ❓ Why Do We Split the Data?

We split data to **simulate the real world** where we don’t know the answers in advance.  
It helps us answer:

> "Can this model make good predictions on data it hasn’t seen before?"

Without splitting, we might build a model that memorizes the training data but performs poorly on new data — this is called **overfitting**.

---

### ❓ What is a Good Split Value?

A common and effective split is:

- **80% training**
- **20% testing**

Other typical splits:
- **70/30** – when you want more test data
- **60/40** – for small datasets (rare)
- **90/10** – when you have lots of data and want to train more

> ✅ The best split depends on how much data you have and how critical performance evaluation is.


In [48]:
from sklearn.model_selection import train_test_split

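# Define features and target
# 'alive_no' / 'alive_yes' are one-hot copies of the target, so drop them to avoid leakage
X = df.drop(columns=['survived', 'alive_no', 'alive_yes'])
y = df['survived']

# Stratified 70/30 split: both sets keep the same class ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=42)

X_train.shape, X_test.shape


((541, 25), (232, 25))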
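
One refinement worth considering: statistics used for scaling can be learned from the training rows only, so that nothing about the test set influences preprocessing. The sketch below shows that pattern on synthetic data; the arrays are stand-ins, not the Titanic columns.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for two numeric features and a binary target.
rng = np.random.default_rng(42)
X = rng.normal(loc=30, scale=10, size=(100, 2))
y = np.array([0] * 60 + [1] * 40)

# Stratified split keeps the 60/40 class ratio in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn mean/std from training rows only
X_test_scaled = scaler.transform(X_test)        # reuse those statistics on the test rows

print(X_train.shape, X_test.shape)
print(y_train.mean(), y_test.mean())
```

The test features will not have exactly zero mean after this transform, which is expected: they are scaled with the training statistics, just as new data would be.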

## 6. 📦 Dataset-Specific Preprocessing

For the Titanic dataset:
- We derived family size (`family_size`) and an age-based flag (`is_child`).
- Other data types need different steps, such as lag features for time series or tokenization for text; none of these apply here.


## 7. 🧠 Target Variable Preprocessing

In this classification task:
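- The target (`survived`) is already binary, so it needs no encoding
- When classes are imbalanced, resampling or class weights can compensate; resample only the training set so the test set keeps the original class ratio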
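
As a sketch of the class-weight option, the snippet below applies the common "balanced" heuristic, `n_samples / (n_classes * class_count)`, to the training-set class counts this notebook reports (318 vs. 223). The variable names are illustrative.

```python
from collections import Counter

# Training-set class counts reported in this notebook.
counts = Counter({0: 318, 1: 223})

n_samples = sum(counts.values())
n_classes = len(counts)

# Rarer classes receive proportionally larger weights.
class_weights = {
    label: n_samples / (n_classes * count)
    for label, count in counts.items()
}

print({label: round(w, 3) for label, w in class_weights.items()})
# {0: 0.851, 1: 1.213}
```

Many scikit-learn classifiers accept such a dictionary through their `class_weight` parameter, which avoids changing the number of rows the way resampling does.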

In [61]:
from collections import Counter
print("Target distribution:", Counter(y_train))

Target distribution: Counter({0: 318, 1: 223})


In [62]:
from imblearn.over_sampling import RandomOverSampler
oversampler = RandomOverSampler(random_state=42)
X_resampled_over, y_resampled_over = oversampler.fit_resample(X_train, y_train)
print("After Oversampling:", Counter(y_resampled_over))

After Oversampling: Counter({0: 318, 1: 318})


In [63]:
from imblearn.under_sampling import RandomUnderSampler
undersampler = RandomUnderSampler(random_state=42)
X_resampled_under, y_resampled_under = undersampler.fit_resample(X_train, y_train)
print("After Undersampling:", Counter(y_resampled_under))

After Undersampling: Counter({0: 223, 1: 223})


In [64]:
from imblearn.over_sampling import SMOTE
print("before SMOTE:", Counter(y_train))
smote = SMOTE(random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)
print("after SMOTE:", Counter(y_resampled))

before SMOTE: Counter({0: 318, 1: 223})
after SMOTE: Counter({0: 318, 1: 318})


## ✅ Summary

We’ve completed a full preprocessing pipeline:
- Cleaned missing, duplicated, and irrelevant data
- Scaled and encoded features
- Engineered new variables
- Detected and handled outliers
- Prepared a clean dataset for modeling

> You're now ready to build models confidently!
