
# Task 1 · Data Cleaning & Preprocessing (Titanic Example)
**Created:** 2025-08-19 07:01

This notebook walks through a practical, end-to-end preprocessing pipeline:

1. Import dataset and explore basic info (nulls, data types)  
2. Handle missing values (median/mode imputation)  
3. Convert categorical features into numerical (one‑hot encoding)  
4. Normalize/standardize numerical features (StandardScaler)  
5. Visualize outliers using boxplots and remove them with IQR

> **Tip:** If you have a local `titanic.csv` (or `train.csv`), place it alongside this notebook and re-run.


In [None]:

# Core imports
import warnings
warnings.filterwarnings("ignore")

import os
import numpy as np
import pandas as pd

# Charts: use matplotlib only (no seaborn styling)
import matplotlib.pyplot as plt

# Scaling
from sklearn.preprocessing import StandardScaler


## 1) Load a Titanic dataset (robust fallback)

In [None]:

def load_titanic():
    """Load Titanic data from a common local filename, falling back to seaborn's
    built-in dataset and, as a last resort, to a tiny synthetic frame.
    Returns a (DataFrame, source_note) tuple.
    """
    possible_files = ["titanic.csv", "train.csv", "Titanic-Dataset.csv", "titanic_train.csv"]
    for fn in possible_files:
        if os.path.exists(fn):
            df = pd.read_csv(fn)
            return df, f"Loaded local file: {fn}"
    # Fall back to seaborn's sample dataset if available
    try:
        import seaborn as sns  # only for loading the dataset (we won't use seaborn for plotting)
        df = sns.load_dataset("titanic")
        return df, "Loaded seaborn sample dataset: titanic"
    except Exception:
        # seaborn is missing or its dataset download failed: as a last resort,
        # synthesize a tiny mock dataset (structure similar to Titanic)
        data = {
            "survived": [0,1,1,0,0,1,0,1],
            "pclass":   [3,1,3,2,3,1,2,2],
            "sex":      ["male","female","female","male","male","female","male","female"],
            "age":      [22,38,26,35,np.nan,58,14,30],
            "sibsp":    [1,1,0,1,0,0,1,0],
            "parch":    [0,0,0,0,0,0,2,0],
            "fare":     [7.25,71.2833,7.925,8.05,8.4583,512.3292,14.4542,13.0],
            "embarked": ["S","C","S","S","Q","S",np.nan,"C"]
        }
        df = pd.DataFrame(data)
        return df, "Using a small synthetic fallback dataset."

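df_raw, source_note = load_titanic()
# Kaggle-style CSVs capitalize column names ("Age", "Fare"); lower-case them so the
# later steps that look up "age", "fare", "sibsp", "parch" work for every source
df_raw.columns = df_raw.columns.str.lower()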
print(source_note)
display(df_raw.head())
print("Shape:", df_raw.shape)


## 2) Basic info & missing values

In [None]:

# Data types
print("Data types:")
print(df_raw.dtypes)

# Missing values
print("\nMissing values (count):")
print(df_raw.isna().sum())

# Summary stats
display(df_raw.describe(include="all"))


## 3) Handle missing values (median for numeric, mode for categorical)

In [None]:

df = df_raw.copy()

numeric_cols = df.select_dtypes(include=[np.number]).columns.tolist()
categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

print("Numeric columns:", numeric_cols)
print("Categorical columns:", categorical_cols)

# Impute numeric with median
for col in numeric_cols:
    median_val = df[col].median()
    df[col] = df[col].fillna(median_val)

# Impute categorical with mode
for col in categorical_cols:
    if df[col].isna().any():
        mode_val = df[col].mode(dropna=True)
        if len(mode_val) > 0:
            df[col] = df[col].fillna(mode_val.iloc[0])
        else:
            df[col] = df[col].fillna("Unknown")

print("\nMissing values after imputation:")
print(df.isna().sum())
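
The loop above computes medians and modes on the full frame. If the data is later split into training and test sets, one common alternative is to learn the fill values from the training rows only and reuse them on new rows. A minimal sketch with scikit-learn's `SimpleImputer`; the toy frames `train` and `test` are made up for illustration and are not part of the notebook's data.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical toy data (not the Titanic file); object dtype keeps the strings as-is
train = pd.DataFrame({
    "age": [22.0, 38.0, np.nan, 30.0],
    "embarked": pd.Series(["S", "C", "S", np.nan], dtype=object),
})
test = pd.DataFrame({
    "age": [np.nan, 40.0],
    "embarked": pd.Series([np.nan, "Q"], dtype=object),
})

# Learn the fill values from the training rows only
num_imputer = SimpleImputer(strategy="median").fit(train[["age"]])
cat_imputer = SimpleImputer(strategy="most_frequent").fit(train[["embarked"]])

# Apply the learned values to unseen rows
test_filled = test.copy()
test_filled["age"] = num_imputer.transform(test[["age"]]).ravel()
test_filled["embarked"] = cat_imputer.transform(test[["embarked"]]).ravel()
print(test_filled)
```

Here the missing test age is filled with the training median (30.0) and the missing port with the most frequent training value ("S").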


## 4) Encode categorical features (one‑hot)

In [None]:

# Identify categorical columns again (post-imputation)
categorical_cols = df.select_dtypes(exclude=[np.number]).columns.tolist()

# One-hot encode
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print("Shape before encoding:", df.shape)
print("Shape after encoding:", df_encoded.shape)
display(df_encoded.head())
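
One-hot encoding every non-numeric column can inflate the column count sharply when the data contains identifier-like or free-text fields (the Kaggle `train.csv`, for example, has name, ticket and cabin columns). A hedged sketch of one way to keep only low-cardinality columns before encoding; the helper name `low_cardinality_columns` and the threshold `max_levels=10` are illustrative choices, not something the notebook defines.

```python
import pandas as pd

def low_cardinality_columns(frame, max_levels=10):
    """Return non-numeric columns with at most `max_levels` distinct values."""
    non_numeric = frame.select_dtypes(exclude="number").columns
    return [c for c in non_numeric if frame[c].nunique(dropna=True) <= max_levels]

# Hypothetical toy frame: "name" is unique per row, "sex" has two levels
toy = pd.DataFrame({
    "name": [f"Passenger {i}" for i in range(20)],
    "sex": ["male", "female"] * 10,
    "fare": [7.25 + i for i in range(20)],
})

keep = low_cardinality_columns(toy, max_levels=10)
dropped = [c for c in toy.select_dtypes(exclude="number").columns if c not in keep]

# Drop the high-cardinality columns, then one-hot encode the rest
encoded = pd.get_dummies(toy.drop(columns=dropped), columns=keep, drop_first=True)
print(keep, dropped, encoded.shape)
```

Dropping is the bluntest option; extracting a feature from such a column (a title from a name, a deck letter from a cabin) is another, at the cost of extra code.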


## 5) Standardize numerical features

In [None]:

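# Standardize only the original numeric feature columns (that still exist after encoding).
# The target label "survived" is left as 0/1 rather than rescaled.
num_cols_after = [c for c in df_encoded.columns if c in numeric_cols and c.lower() != "survived"]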

scaler = StandardScaler()
df_scaled = df_encoded.copy()
df_scaled[num_cols_after] = scaler.fit_transform(df_scaled[num_cols_after])

display(df_scaled.head())

print("Columns standardized:", num_cols_after)
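
The cell above fits the scaler on every row. When a model is evaluated on held-out data, a common practice is to fit the scaler on the training split and reuse its mean and standard deviation on the test split. A minimal sketch on randomly generated numbers (the array `X` is a stand-in, not the Titanic features):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(loc=30.0, scale=10.0, size=(100, 2))  # hypothetical numeric features

X_train, X_test = train_test_split(X, test_size=0.25, random_state=0)

scaler = StandardScaler().fit(X_train)     # statistics come from the training rows only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test)   # reuses the training mean and std

print(X_train_scaled.mean(axis=0).round(6), X_train_scaled.std(axis=0).round(6))
```

The scaled training columns have mean 0 and standard deviation 1; the scaled test columns generally will not, which is expected.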


## 6) Visualize outliers (boxplots)

In [None]:

# Choose a few numeric columns commonly used in Titanic
candidates = [c for c in ["fare", "age", "sibsp", "parch"] if c in df.columns]

for col in candidates:
    plt.figure()
    plt.boxplot(df[col].dropna())
    plt.title(f"Boxplot for {col}")
    plt.xlabel(col)
    plt.ylabel("Value")
    plt.show()


## 7) Remove outliers by IQR (per selected numeric columns)

In [None]:

def iqr_filter(dataframe, cols, k=1.5):
    """Keep rows whose values in every listed column lie within [Q1 - k*IQR, Q3 + k*IQR]."""
    mask = pd.Series(True, index=dataframe.index)
    for col in cols:
        if col in dataframe.columns:
            q1 = dataframe[col].quantile(0.25)
            q3 = dataframe[col].quantile(0.75)
            iqr = q3 - q1
            lower = q1 - k*iqr
            upper = q3 + k*iqr
            mask &= dataframe[col].between(lower, upper, inclusive="both")
    return dataframe[mask]

print("Rows before IQR filtering:", len(df))
df_no_outliers = iqr_filter(df, candidates, k=1.5)
print("Rows after IQR filtering:", len(df_no_outliers))

# Re-encode & rescale after outlier removal
cat_cols_no = df_no_outliers.select_dtypes(exclude=[np.number]).columns.tolist()
df_no_outliers_encoded = pd.get_dummies(df_no_outliers, columns=cat_cols_no, drop_first=True)
# Leave the target label "survived" as 0/1 rather than rescaling it
num_cols_after_no = [
    c for c in df_no_outliers_encoded.columns
    if c in numeric_cols and c.lower() != "survived"
]
scaler2 = StandardScaler()
df_no_outliers_scaled = df_no_outliers_encoded.copy()
if num_cols_after_no:
    df_no_outliers_scaled[num_cols_after_no] = scaler2.fit_transform(df_no_outliers_scaled[num_cols_after_no])

display(df_no_outliers_scaled.head())
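
One caveat with the IQR rule: when at least three quarters of a column share a single value, Q1 equals Q3, the IQR is 0, and both fences collapse onto that value, so every other row is dropped. Count-like columns such as `parch` can behave this way, so it is worth comparing the row counts printed above. A small sketch on made-up counts:

```python
import pandas as pd

counts = pd.Series([0, 0, 0, 0, 0, 0, 0, 0, 1, 2])  # hypothetical count-like column

q1, q3 = counts.quantile(0.25), counts.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# With IQR == 0 the interval is the single point [0, 0]
kept = counts[counts.between(lower, upper)]
print(f"IQR={iqr}, fences=[{lower}, {upper}], kept {len(kept)} of {len(counts)} rows")
```

In that situation it may be more sensible to leave the column out of the filter or to cap values with a domain rule instead.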


## 8) Save processed datasets

In [None]:

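# Write to /mnt/data when that directory exists, otherwise to the current directory
out_dir = "/mnt/data" if os.path.isdir("/mnt/data") else "."
out_paths = {}
out_paths["clean_imputed_encoded_scaled.csv"] = os.path.join(out_dir, "clean_imputed_encoded_scaled.csv")
out_paths["clean_no_outliers_encoded_scaled.csv"] = os.path.join(out_dir, "clean_no_outliers_encoded_scaled.csv")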

df_scaled.to_csv(out_paths["clean_imputed_encoded_scaled.csv"], index=False)
df_no_outliers_scaled.to_csv(out_paths["clean_no_outliers_encoded_scaled.csv"], index=False)

out_paths



## 9) Interview Quick Answers

**Q1. What are the different types of missing data?**  
- **MCAR** (Missing Completely At Random): missingness unrelated to any variable.  
- **MAR** (Missing At Random): missingness depends on observed variables.  
- **MNAR** (Missing Not At Random): missingness depends on the unobserved value itself.
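
These mechanisms cannot be proven from the data alone, and MNAR in particular involves values that were never observed. A rough first check is still possible: compare the missing rate of one column across groups of another observed column. The sketch below uses a made-up frame; a large gap between groups hints that the missingness is not completely at random.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: age is missing only among third-class rows
toy = pd.DataFrame({
    "pclass": [1, 1, 1, 1, 3, 3, 3, 3],
    "age":    [38.0, 35.0, 54.0, 58.0, np.nan, 22.0, np.nan, np.nan],
})

# Share of missing ages within each passenger class
missing_rate_by_class = toy["age"].isna().groupby(toy["pclass"]).mean()
print(missing_rate_by_class)
```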

**Q2. How do you handle categorical variables?**  
- Encoding: **one‑hot** for nominal variables, **ordinal/label** for ordered categories. Consider target/mean encoding with care to avoid leakage.

**Q3. Normalization vs Standardization?**  
- **Normalization** scales to a range (often 0–1, e.g., Min‑Max).  
- **Standardization** centers to mean 0, std 1 (z‑score).
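
The two transforms can be written out directly with NumPy. This is an illustrative sketch on a made-up array, using the population standard deviation (NumPy's default, which is also what `StandardScaler` uses):

```python
import numpy as np

x = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

# Min-max normalization: rescale to the range [0, 1]
x_minmax = (x - x.min()) / (x.max() - x.min())

# Standardization (z-score): subtract the mean, divide by the standard deviation
x_zscore = (x - x.mean()) / x.std()

print(x_minmax)
print(x_zscore.round(3))
```

Min-max output is bounded by the observed minimum and maximum, so a single extreme value compresses everything else; z-scores are unbounded but keep relative distances in units of standard deviation.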

**Q4. How do you detect outliers?**  
- IQR fences (flag values below Q1 − 1.5·IQR or above Q3 + 1.5·IQR, where IQR = Q3 − Q1), z‑scores, domain rules, model residuals. Visualize via boxplots, histograms, scatterplots.
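
The IQR and z-score rules do not always agree. In the made-up sample below, the IQR fences flag the value 100, while a |z| > 3 rule flags nothing: with the population standard deviation, no |z| in a sample of n points can exceed sqrt(n − 1), so a threshold of 3 cannot fire when n is 10 or fewer. The thresholds 1.5 and 3 are conventional defaults, not values taken from this notebook's data.

```python
import numpy as np

values = np.array([10, 12, 11, 13, 12, 11, 10, 100], dtype=float)

# IQR rule: flag points outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]
q1, q3 = np.percentile(values, [25, 75])
iqr = q3 - q1
iqr_flags = (values < q1 - 1.5 * iqr) | (values > q3 + 1.5 * iqr)

# z-score rule: flag points more than 3 standard deviations from the mean
z = (values - values.mean()) / values.std()
z_flags = np.abs(z) > 3

print("IQR flags:", values[iqr_flags])
print("z-score flags:", values[z_flags])
```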

**Q5. Why is preprocessing important in ML?**  
- Improves data quality, model convergence, and accuracy; makes features comparable; and, when every step is fitted on training data only, avoids leaking test information into the model.

**Q6. One‑hot vs Label encoding?**  
- **One‑hot:** creates binary columns per category (no order implied).  
- **Label:** assigns integer codes (implies order; suitable for tree models or true ordinal data).
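
A short sketch of both encodings on a made-up frame. The column names mirror Titanic-style fields, and the category order `First < Second < Third` is stated explicitly here as an assumption rather than inferred from the data.

```python
import pandas as pd

toy = pd.DataFrame({
    "embarked": ["S", "C", "Q", "S"],                  # nominal: no natural order
    "class": ["Third", "First", "Second", "Third"],    # ordinal: has a natural order
})

# One-hot: one 0/1 column per category
one_hot = pd.get_dummies(toy["embarked"], prefix="embarked", dtype=int)

# Integer codes that follow an explicitly declared order
class_order = ["First", "Second", "Third"]
class_codes = pd.Categorical(toy["class"], categories=class_order, ordered=True).codes

print(one_hot)
print(list(class_codes))
```

Declaring the order explicitly matters: codes assigned alphabetically or by first appearance would impose an arbitrary ranking.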

**Q7. How do you handle data imbalance?**  
- Resampling (over/under, SMOTE), class‑weighted losses, appropriate metrics (ROC‑AUC, PR‑AUC, F1).
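
As one illustration of class weighting, scikit-learn's `compute_class_weight` with the `"balanced"` setting assigns each class the weight `n_samples / (n_classes * class_count)`. The label array below is made up (8 negatives, 2 positives):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0] * 8 + [1] * 2)  # hypothetical imbalanced labels
classes = np.unique(y)

# "balanced": weight = n_samples / (n_classes * count of that class)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y)
class_weight = dict(zip(classes.tolist(), weights.tolist()))
print(class_weight)
```

Many scikit-learn classifiers accept such a dictionary, or the string `"balanced"` directly, through their `class_weight` parameter.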

**Q8. Can preprocessing affect model accuracy?**  
- Yes. Correct imputation, encoding, scaling, and outlier handling often yield measurable gains and more stable models.
