# Data Preprocessing Tutorial

This notebook walks through the key steps of a data preprocessing pipeline:

1. **Data Cleaning** - Handle missing values and duplicates
2. **Exploratory Analysis** - Understand your data types
3. **Feature Engineering** - Encode categorical variables
4. **Normalization** - Scale numeric features
5. **Train/Test Split** - Avoid data leakage

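We'll use a sample patient dataset, `sample_data.csv`, whose columns include `patient_id`, `age`, `income`, `city`, `education`, and `target`.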

## Step 0: Load Libraries

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

print("Libraries loaded!")

## Step 1: Load and Inspect Data

In [None]:
# Load the sample dataset
df = pd.read_csv('sample_data.csv')

print(f"Dataset shape: {df.shape}")
print(f"\nFirst few rows:")
print(df.head())

print(f"\nData types:")
print(df.dtypes)

print(f"\nMissing values:")
print(df.isna().sum())

## Step 2: Clean Missing Values

### Understanding the Problem
Notice that missing values are represented in different ways:
- `NA` (string)
- `N/A` (string)
- `NaN` (pandas null)

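Pandas' `isna()` only flags real nulls (`NaN`, `None`), not placeholder strings. `pd.read_csv` converts some common markers to `NaN` on load by default, but it is safer to replace any remaining string versions explicitly.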
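
To see the difference on a small scale, here is a minimal sketch on a made-up toy frame (the values are illustrative and not taken from `sample_data.csv`). It counts what `isna()` reports before and after the placeholder strings are replaced:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one real null plus three string placeholders
toy = pd.DataFrame({
    "age": ["34", "NA", "51", np.nan],
    "city": ["Toronto", "N/A", "", "Montreal"],
})

missing_before = int(toy.isna().sum().sum())

# Turn the placeholder strings into real nulls
toy_clean = toy.replace(["NA", "N/A", ""], np.nan)
missing_after = int(toy_clean.isna().sum().sum())

print(f"Missing before: {missing_before}")
print(f"Missing after:  {missing_after}")
```

Only the real null is counted before the replacement; afterwards all four missing entries are.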

In [None]:
# BEFORE: Show missing values
print("BEFORE replacing:")
print(f"Age column (raw): {df['age'].head(10).tolist()}")
print(f"City column (raw): {df['city'].head(10).tolist()}")
print(f"\nMissing count: {df[['age', 'income', 'city']].isna().sum().sum()}")

In [None]:
# Step 1: Replace missing value strings with NaN
df = df.replace(["NA", "N/A", "na", "n/a", "NaN", "nan", ""], np.nan)

print("AFTER replacing strings with NaN:")
print(f"Age column: {df['age'].head(10).tolist()}")
print(f"City column: {df['city'].head(10).tolist()}")
print(f"\nMissing count by column:")
print(df[['age', 'income', 'city']].isna().sum())

In [None]:
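# Step 2: Impute missing numeric values with median
# Why median? It's robust to outliers

# Columns that held placeholder strings may still be stored as text,
# so convert them to numbers first (unparseable values become NaN)
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['income'] = pd.to_numeric(df['income'], errors='coerce')

print(f"Age median: {df['age'].median()}")
print(f"Income median: {df['income'].median()}")

df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].median())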

print("\nAFTER imputing numeric columns:")
print(df[['age', 'income']].isna().sum())
print(f"\nAge column (now complete): {df['age'].head(10).tolist()}")
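
The "robust to outliers" claim is easy to check by hand. The sketch below uses made-up ages, where 400 stands in for a data-entry error, and compares the two statistics:

```python
import pandas as pd

# Hypothetical ages; 400 stands in for a data-entry error
ages = pd.Series([25, 30, 35, 40, 400])

print(f"Mean:   {ages.mean()}")    # pulled far upward by the single outlier
print(f"Median: {ages.median()}")  # stays at the middle value
```

Filling gaps with the mean here would assign an age above 100 to every missing row, while the median stays at a plausible value.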

In [None]:
# Step 3: Impute missing categorical values with mode (most frequent)
print(f"City value counts:")
print(df['city'].value_counts())
print(f"\nCity mode: {df['city'].mode()[0]}")

df['city'] = df['city'].fillna(df['city'].mode()[0])

print("\nAFTER imputing city:")
print(f"Missing in city: {df['city'].isna().sum()}")
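
One detail worth knowing: `mode()` returns a Series, because several values can tie for most frequent. Indexing with `[0]` picks the first of the tied values in sorted order. A small sketch with made-up cities:

```python
import pandas as pd

# Hypothetical column with a two-way tie and one missing entry
cities = pd.Series(["Toronto", "Vancouver", "Vancouver", "Toronto", None])

modes = cities.mode()       # all most-frequent values, sorted; nulls are ignored
fill_value = modes[0]       # first of the tied values
filled = cities.fillna(fill_value)

print(f"Modes: {modes.tolist()}")
print(f"Fill value: {fill_value}")
```

If a tie matters for your data, it may be worth choosing the fill value deliberately rather than relying on sort order.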

In [None]:
# Step 4: Handle education (do the same thing)
df['education'] = df['education'].fillna(df['education'].mode()[0])

print("Final missing value check:")
print(df.isna().sum())
print("\n✓ All missing values handled!")

## Step 3: Remove Duplicates

In [None]:
# Check for duplicates
print(f"Dataset shape before: {df.shape}")
print(f"Number of duplicate rows: {df.duplicated().sum()}")

# Remove duplicates (keep first occurrence)
duplicates_removed = df.duplicated().sum()
df = df.drop_duplicates().reset_index(drop=True)

print(f"\nDataset shape after: {df.shape}")
print(f"Duplicates removed: {duplicates_removed}")
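
By default `duplicated()` compares entire rows, so two records that share an ID but differ in any other column are not flagged. The sketch below assumes `patient_id` is meant to be unique, which is an assumption about this kind of data rather than something the notebook states; the toy rows are made up:

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [1, 2, 2, 3],
    "age": [34, 51, 51, 29],
    "city": ["Toronto", "Montreal", "Ottawa", "Toronto"],
})

# Whole-row comparison: the two patient_id=2 rows differ in city
exact_dupes = int(toy.duplicated().sum())

# Key-based comparison: same patient_id counts as a duplicate
id_dupes = int(toy.duplicated(subset=["patient_id"]).sum())

# Keep the first record per patient
deduped = toy.drop_duplicates(subset=["patient_id"], keep="first").reset_index(drop=True)

print(f"Exact duplicates: {exact_dupes}, ID duplicates: {id_dupes}")
print(deduped)
```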

## Step 4: Identify Feature Types

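Pandas stores each column with a dtype:
- **Numeric**: `int64`, `float64` (numbers)
- **Categorical**: `object` (strings; this is how text categories are stored by default)

Each type needs different handling: categorical columns get encoded, numeric columns get scaled.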

In [None]:
# Define which columns are ID, target, and features
id_cols = ['patient_id']
target = 'target'

# Get all feature columns
feature_cols = [c for c in df.columns if c not in id_cols and c != target]

# Split into categorical and numeric
cat_cols = [c for c in feature_cols if df[c].dtype == 'object']
num_cols = [c for c in feature_cols if df[c].dtype in ['int64', 'float64']]

print(f"ID columns: {id_cols}")
print(f"Target: {target}")
print(f"\nNumeric features: {num_cols}")
print(f"Categorical features: {cat_cols}")
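
The dtype comparison above matches only `int64` and `float64`. As a possible alternative, `select_dtypes` matches every numeric dtype (for example `int32` or `float32`) in one call. A minimal sketch on a made-up frame; the names `num_features` and `cat_features` are chosen here only to avoid clashing with the notebook's variables:

```python
import pandas as pd

toy = pd.DataFrame({
    "patient_id": [1, 2, 3],
    "age": [34, 51, 29],
    "income": [52000.0, 61000.0, 48000.0],
    "city": ["Toronto", "Montreal", "Toronto"],
    "target": [0, 1, 0],
})

# Drop ID and target first, then split the remaining columns by dtype
features = toy.drop(columns=["patient_id", "target"])
num_features = features.select_dtypes(include="number").columns.tolist()
cat_features = features.select_dtypes(exclude="number").columns.tolist()

print(f"Numeric: {num_features}")
print(f"Categorical: {cat_features}")
```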

## Step 5: Encode Categorical Features

Most machine learning models require numeric inputs, not text. **One-hot encoding** converts each category into its own binary (0/1) column.

Example:
- `city = 'Toronto'` → `city_Toronto=1, city_Vancouver=0, city_Montreal=0`
- `city = 'Vancouver'` → `city_Toronto=0, city_Vancouver=1, city_Montreal=0`

In [None]:
# Show the original categorical data
print("BEFORE encoding:")
print(df[['city', 'education']].head(10))
print(f"\nShape: {df.shape}")

In [None]:
# One-hot encode categorical columns
encoded_columns = []

for col in cat_cols:
    # Get one-hot encoded version
    encoded = pd.get_dummies(df[col], prefix=col, dtype=int)
    
    # Track column names
    encoded_columns.extend(encoded.columns.tolist())
    
    # Drop original and add encoded
    df = df.drop(col, axis=1)
    df = pd.concat([df, encoded], axis=1)

print("AFTER encoding:")
print(df.head(10))
print(f"\nNew shape: {df.shape}")
print(f"\nEncoded columns: {encoded_columns}")
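
The loop above can also be written as a single call, since `pd.get_dummies` accepts a whole DataFrame plus a `columns` list. A sketch on a made-up frame:

```python
import pandas as pd

toy = pd.DataFrame({
    "age": [34, 51, 29],
    "city": ["Toronto", "Vancouver", "Montreal"],
})

# Encodes the listed columns, drops the originals, and leaves other columns untouched
encoded_toy = pd.get_dummies(toy, columns=["city"], dtype=int)

print(encoded_toy)
```

The new columns are named `city_<value>` and appear after the untouched columns, so each row has exactly one `1` among them.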

## Step 6: Scale Numeric Features

**Standardization** converts numeric features to have:
- Mean = 0
- Standard deviation = 1

Formula: `(x - mean) / std`

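Why? Algorithms that rely on distances or gradient descent (for example k-nearest neighbors, SVMs, and regularized linear models) are sensitive to feature scale, so a large-valued feature like income would otherwise dominate a small-valued one like age. Tree-based models are largely unaffected.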

In [None]:
# Show numeric data BEFORE scaling
print("BEFORE scaling:")
print(df[num_cols].describe())
print(f"\nAge - mean: {df['age'].mean():.1f}, std: {df['age'].std():.1f}")
print(f"Income - mean: {df['income'].mean():.0f}, std: {df['income'].std():.0f}")

In [None]:
# Caution: this cell computes scaling parameters from the FULL data, for illustration only.
# With a train/test split, compute them from TRAIN only (see Step 7).

means = {}
stds = {}

for col in num_cols:
    means[col] = df[col].mean()
    stds[col] = df[col].std()
    
    # Standardize: (x - mean) / std
    df[col] = (df[col] - means[col]) / stds[col]

print("AFTER scaling:")
print(df[num_cols].describe())
print(f"\nAge - mean: {df['age'].mean():.6f}, std: {df['age'].std():.6f}")
print(f"Income - mean: {df['income'].mean():.6f}, std: {df['income'].std():.6f}")
print("\n✓ Now mean ≈ 0 and std ≈ 1!")
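
A small caveat about the `std` in the formula: pandas' `.std()` uses the sample standard deviation (`ddof=1`) by default, whereas NumPy's `np.std` and scikit-learn's `StandardScaler` use the population version (`ddof=0`). The results differ slightly, and the gap shrinks as the dataset grows. A sketch with made-up values:

```python
import numpy as np
import pandas as pd

values = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

sample_std = values.std()             # pandas default: ddof=1
population_std = values.std(ddof=0)   # matches np.std(values)

# Standardize with each version of the standard deviation
z_sample = (values - values.mean()) / sample_std
z_population = (values - values.mean()) / population_std

print(f"Sample std (ddof=1):     {sample_std:.4f}")
print(f"Population std (ddof=0): {population_std:.4f}")
```

Either choice is fine for this tutorial, as long as the same one is applied to both train and test.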

## Step 7: Train/Test Split (No Data Leakage!)

**Data leakage** happens when preprocessing parameters (like scaling) are computed using test data.

**Correct approach:**
1. Split data into train/test
2. Compute scaling parameters from TRAIN
3. Apply those parameters to TEST

**Wrong approach:**
1. Scale using all data
2. Split into train/test (test set was used in scaling!)

In [None]:
# For this tutorial, let's start fresh to show the RIGHT way
# Reload and re-preprocess
df = pd.read_csv('sample_data.csv')

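# Quick cleaning (same steps as above)
# Note: for simplicity the medians/modes below come from the full data.
# Strictly, imputation values should also be computed from TRAIN only.
df = df.replace(["NA", "N/A", "na", "n/a", "NaN", "nan", ""], np.nan)
df['age'] = pd.to_numeric(df['age'], errors='coerce')
df['income'] = pd.to_numeric(df['income'], errors='coerce')
df['age'] = df['age'].fillna(df['age'].median())
df['income'] = df['income'].fillna(df['income'].median())
df['city'] = df['city'].fillna(df['city'].mode()[0])
df['education'] = df['education'].fillna(df['education'].mode()[0])
df = df.drop_duplicates().reset_index(drop=True)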

print(f"Data prepared. Shape: {df.shape}")

In [None]:
# ONE-HOT ENCODE (before split, ok)
# Note: We're doing this for simplicity. In practice, you'd fit encoders on train only.
for col in ['city', 'education']:
    encoded = pd.get_dummies(df[col], prefix=col, dtype=int)
    df = df.drop(col, axis=1)
    df = pd.concat([df, encoded], axis=1)

print(f"After encoding. Shape: {df.shape}")
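
If you do encode after splitting, the test set may contain categories that never appear in train, or lack some that do. One way to handle this, sketched here on made-up frames (`train_toy` and `test_toy` are hypothetical names), is to encode each part and then align the test columns to the train columns:

```python
import pandas as pd

train_toy = pd.DataFrame({"city": ["Toronto", "Vancouver", "Montreal"]})
test_toy = pd.DataFrame({"city": ["Toronto", "Calgary"]})  # Calgary never appears in train

train_enc = pd.get_dummies(train_toy, columns=["city"], dtype=int)
test_enc = pd.get_dummies(test_toy, columns=["city"], dtype=int)

# Keep exactly the train columns: missing ones are filled with 0, unseen ones are dropped
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(test_enc)
```

The unseen category ends up as a row of zeros, which keeps the test matrix the same shape the model was trained on.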

In [None]:
# SPLIT INTO TRAIN/TEST
# Separate features and target
X = df.drop(['patient_id', 'target'], axis=1)
y = df['target']

# Split: 80% train, 20% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, 
    test_size=0.2, 
    random_state=42,
    stratify=y  # Keep same class distribution in train/test
)

print(f"Train set: {X_train.shape}")
print(f"Test set: {X_test.shape}")
print(f"\nTrain target distribution:")
print(y_train.value_counts())
print(f"\nTest target distribution:")
print(y_test.value_counts())
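
To see what `stratify` does, the sketch below splits a made-up imbalanced target (80 negatives, 20 positives) and compares the positive rate in each part:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced target: 80 negatives, 20 positives
X_toy = pd.DataFrame({"feature": range(100)})
y_toy = pd.Series([0] * 80 + [1] * 20)

_, _, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print(f"Positive rate overall: {y_toy.mean():.2f}")
print(f"Positive rate train:   {y_tr.mean():.2f}")
print(f"Positive rate test:    {y_te.mean():.2f}")
```

With these counts all three rates match. Without `stratify`, the test rate can drift from the overall rate by chance, especially on small datasets.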

In [None]:
# SCALE NUMERIC FEATURES (from TRAIN parameters only!)

# Get numeric columns
num_cols = ['age', 'income']

# Compute means and stds from TRAIN SET ONLY
train_means = {}
train_stds = {}

for col in num_cols:
    train_means[col] = X_train[col].mean()
    train_stds[col] = X_train[col].std()
    
    print(f"{col} - Train mean: {train_means[col]:.1f}, std: {train_stds[col]:.1f}")

In [None]:
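# Apply scaling using TRAIN statistics to BOTH train and test

# Work on copies so the assignments below don't trigger
# SettingWithCopyWarning in some pandas versions
X_train = X_train.copy()
X_test = X_test.copy()

# Scale TRAIN
for col in num_cols:
    X_train[col] = (X_train[col] - train_means[col]) / train_stds[col]

# Scale TEST using same parameters
for col in num_cols:
    X_test[col] = (X_test[col] - train_means[col]) / train_stds[col]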

print("TRAIN scaled:")
print(X_train[num_cols].describe())

print("\nTEST scaled (using TRAIN parameters):")
print(X_test[num_cols].describe())

print("\n✓ Train is standardized (mean≈0, std≈1)")
print("✓ Test is scaled with same parameters (no leakage!)")
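
The same fit-on-train, apply-to-test pattern is what `StandardScaler` (imported in Step 0) provides through `fit_transform` and `transform`. A minimal sketch on randomly generated stand-in data; the frame and variable names are hypothetical, and note that `StandardScaler` divides by the population standard deviation (`ddof=0`), so its output differs slightly from the manual version above:

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Randomly generated stand-in for the age and income columns
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "age": rng.normal(45, 12, 200),
    "income": rng.normal(60000, 15000, 200),
})

toy_train, toy_test = train_test_split(toy, test_size=0.2, random_state=42)

scaler = StandardScaler()

# fit_transform learns mean/std from TRAIN and scales it
train_scaled = pd.DataFrame(
    scaler.fit_transform(toy_train), columns=toy.columns, index=toy_train.index
)

# transform reuses the TRAIN statistics on TEST
test_scaled = pd.DataFrame(
    scaler.transform(toy_test), columns=toy.columns, index=toy_test.index
)

print(train_scaled.describe())
print(test_scaled.describe())
```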

## Step 8: Summary

In [None]:
print("=" * 60)
print("PREPROCESSING PIPELINE SUMMARY")
print("=" * 60)

print(f"\n✓ Cleaned missing values (replaced NA/N/A strings, imputed medians and modes)")
print(f"✓ Removed {duplicates_removed} duplicate rows")  # counted in Step 3
print(f"✓ One-hot encoded: city, education")
print(f"✓ Scaled numeric: age, income (using TRAIN parameters only)")
print(f"✓ Split into train/test with stratification (no data leakage)")

print(f"\nFinal shapes:")
print(f"  Train: {X_train.shape}")
print(f"  Test: {X_test.shape}")

print(f"\nTrain set statistics (scaled):")
print(f"  Age - mean: {X_train['age'].mean():.6f}, std: {X_train['age'].std():.6f}")
print(f"  Income - mean: {X_train['income'].mean():.6f}, std: {X_train['income'].std():.6f}")

print(f"\nTest set statistics (using TRAIN parameters):")
print(f"  Age - mean: {X_test['age'].mean():.3f}, std: {X_test['age'].std():.3f}")
print(f"  Income - mean: {X_test['income'].mean():.3f}, std: {X_test['income'].std():.3f}")
print(f"  (Note: Test means/stds won't be exactly 0/1 - that's expected!)")

## Key Takeaways

1. **Replace missing value strings first** - `NA`, `N/A`, etc. aren't recognized as NaN
2. **Impute numeric with median** - Robust to outliers
3. **Impute categorical with mode** - Most frequent value
4. **One-hot encode categories** - Converts text to numbers ML models need
5. **Scale from TRAIN only** - Prevents data leakage
6. **Use same parameters for TEST** - Apply train scaling/encoding to test
7. **Stratify the split** - Preserve the class proportions in both train and test
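
As a possible next step, scikit-learn can bundle imputation, encoding, and scaling so that every parameter is learned from train only. The sketch below is one alternative arrangement and not the method used in this notebook; the column names mirror the tutorial, but the toy rows are made up:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Numeric columns: median imputation, then standardization
numeric = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
])

# Categorical columns: mode imputation, then one-hot encoding
categorical = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# sparse_threshold=0 asks for a dense array as output
preprocess = ColumnTransformer(
    [("num", numeric, ["age", "income"]), ("cat", categorical, ["city"])],
    sparse_threshold=0,
)

# Hypothetical toy data
toy_train = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "income": [40000.0, 52000.0, np.nan, 61000.0],
    "city": ["Toronto", "Vancouver", np.nan, "Toronto"],
})
toy_test = pd.DataFrame({"age": [np.nan], "income": [58000.0], "city": ["Calgary"]})

train_matrix = preprocess.fit_transform(toy_train)  # learns everything from TRAIN
test_matrix = preprocess.transform(toy_test)        # reuses TRAIN parameters

print(train_matrix.shape, test_matrix.shape)
```

Here the missing test age is filled with the train median, and the unseen city becomes a row of zeros because of `handle_unknown="ignore"`.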

Your assignment asks you to implement these steps in `src/preprocess.py`!