# Task 10: Data Preprocessing & Feature Engineering

## Introduction

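Data preprocessing and feature engineering are crucial steps in preparing data for ML models. This notebook covers four common tasks: handling missing values, scaling features, encoding categorical variables, and combining these steps with scikit-learn's `Pipeline` and `ColumnTransformer`.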

## 1. Import Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
import warnings
warnings.filterwarnings('ignore')

%matplotlib inline

## 2. Handling Missing Values

In [None]:
# Create sample dataset with missing values
np.random.seed(42)
data = {
    'Age': [25, 30, np.nan, 45, 35, np.nan, 50, 28, 33, 40],
    'Salary': [50000, 60000, 55000, np.nan, 70000, 45000, np.nan, 52000, 58000, 65000],
    'Department': ['IT', 'HR', 'IT', 'Sales', np.nan, 'HR', 'IT', 'Sales', 'HR', 'IT'],
    'Experience': [2, 5, 3, 10, 7, 1, 15, 4, 6, 9]
}
df = pd.DataFrame(data)

print("Original Data with Missing Values:")
print(df)
print(f"\nMissing values per column:\n{df.isnull().sum()}")

### Why Mean/Median Imputation?

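Mean imputation is suitable when data is **MCAR (Missing Completely at Random)** and the distribution is roughly symmetric with no extreme outliers. The median is preferred for **skewed data or data with outliers**, because a few extreme values shift the mean but barely affect the median.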

In [None]:
# Handle missing values
# For numerical: use median (robust to outliers)
df_cleaned = df.copy()
df_cleaned['Age'] = df_cleaned['Age'].fillna(df_cleaned['Age'].median())
df_cleaned['Salary'] = df_cleaned['Salary'].fillna(df_cleaned['Salary'].median())

# For categorical: use mode
df_cleaned['Department'] = df_cleaned['Department'].fillna(df_cleaned['Department'].mode()[0])

print("Data after Missing Value Imputation:")
print(df_cleaned)
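
The mean-versus-median trade-off can be seen on a small example. The sketch below uses `SimpleImputer` (already imported above) on a hypothetical salary column with one extreme value; the arrays `train` and `test` are made up for illustration. It also fits the imputer on the training rows only and reuses the learned statistic on the test rows, which is one way to avoid leaking test information into preprocessing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical salary column: one missing entry and one extreme outlier.
train = np.array([[50_000.0], [52_000.0], [55_000.0], [np.nan], [1_000_000.0]])
test = np.array([[np.nan], [58_000.0]])

# Fit both strategies on the training rows only.
mean_imputer = SimpleImputer(strategy="mean").fit(train)
median_imputer = SimpleImputer(strategy="median").fit(train)

# The outlier drags the mean far above the typical salary; the median stays put.
print("Mean fill value:  ", mean_imputer.statistics_[0])    # 289250.0
print("Median fill value:", median_imputer.statistics_[0])  # 53500.0

# Reuse the statistic learned from the training rows to fill the test rows.
test_filled = median_imputer.transform(test)
print(test_filled.ravel())
```

Here the median fill value stays close to the bulk of the observed salaries, while the mean fill value is several times larger than any typical entry.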

## 3. Feature Scaling

In [None]:
# Create sample data with different scales
np.random.seed(42)
X = np.random.rand(100, 3) * np.array([1000, 10, 100])
df_scaled = pd.DataFrame(X, columns=['Income', 'Age', 'Score'])

print("Original Data (different scales):")
print(df_scaled.describe())

### Why StandardScaler?

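StandardScaler (z-score standardization) is used when an algorithm is **sensitive to feature scale**: distance-based methods (KNN, SVM) and gradient-based models such as neural networks and regularized linear models. It subtracts each feature's mean and divides by its standard deviation, so every column ends up with mean 0 and standard deviation 1.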

In [None]:
# StandardScaler - transforms to mean=0, std=1
scaler_standard = StandardScaler()
X_standard = scaler_standard.fit_transform(df_scaled)

print("After StandardScaler:")
print(pd.DataFrame(X_standard, columns=df_scaled.columns).describe().round(2))
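
As a sanity check on what the scaler does, the z-score can be computed by hand and compared with `StandardScaler`. This is a minimal sketch on freshly generated data (`X_demo` is a hypothetical array, not the notebook's `df_scaled`). Note that `StandardScaler` divides by the population standard deviation (`ddof=0`), whereas pandas `describe()` reports the sample standard deviation (`ddof=1`), so the `std` row printed by `describe()` can come out slightly above 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical data with three columns on very different scales.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 3)) * np.array([1000, 10, 100])

# z = (x - mean) / std, per column, using the population std (ddof=0).
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0, ddof=0)
z_sklearn = StandardScaler().fit_transform(X_demo)

print("Manual z-score matches StandardScaler:", np.allclose(z_manual, z_sklearn))
print("Column means:", z_sklearn.mean(axis=0).round(6))
print("Column stds (ddof=0):", z_sklearn.std(axis=0).round(6))
```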

### Why MinMaxScaler?

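MinMaxScaler is used when we need **bounded values** (the [0, 1] range by default), for example as inputs to neural networks. It makes **no assumption about the shape of the distribution**, but because it depends on the observed minimum and maximum, it is sensitive to outliers.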

In [None]:
# MinMaxScaler - transforms to range [0, 1]
scaler_minmax = MinMaxScaler()
X_minmax = scaler_minmax.fit_transform(df_scaled)

print("After MinMaxScaler:")
print(pd.DataFrame(X_minmax, columns=df_scaled.columns).describe().round(2))
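
The min-max transform is `(x - min) / (max - min)` per column. The sketch below checks that formula against `MinMaxScaler` on a tiny hypothetical column, then adds a single outlier to show how it squeezes the remaining values toward 0. The arrays `x` and `x_out` are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical single-feature column.
x = np.array([[10.0], [20.0], [30.0], [40.0]])

# Manual min-max scaling: (x - min) / (max - min).
manual = (x - x.min(axis=0)) / (x.max(axis=0) - x.min(axis=0))
scaled = MinMaxScaler().fit_transform(x)
print("Manual formula matches MinMaxScaler:", np.allclose(manual, scaled))

# Same column with one extreme value appended.
x_out = np.array([[10.0], [20.0], [30.0], [40.0], [1000.0]])
scaled_out = MinMaxScaler().fit_transform(x_out)

# The four original values now occupy only a small slice of [0, 1].
print(scaled_out.ravel().round(3))
```

When outliers like this are expected, standardization or a clipping step before scaling may be a better fit than plain min-max scaling.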

## 4. Encoding Categorical Variables

In [None]:
# Sample categorical data
df_cat = pd.DataFrame({
    'Color': ['Red', 'Blue', 'Green', 'Red', 'Blue'],
    'Size': ['S', 'M', 'L', 'S', 'M'],
    'Label': [1, 0, 1, 0, 1]
})

print("Categorical Data:")
print(df_cat)

### Why Label Encoding?
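Label Encoding converts categories to integers (0, 1, 2...). scikit-learn's `LabelEncoder` assigns the codes in **sorted (alphabetical) order** and is designed for target labels, so the codes do not reflect any meaningful ranking. Integer codes are generally acceptable for **tree-based models** (Decision Trees, Random Forest), which split on thresholds; for truly **ordinal variables**, the category order should be specified explicitly (for example with `OrdinalEncoder`).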

In [None]:
# Label Encoding - codes follow sorted (alphabetical) order: Blue=0, Green=1, Red=2
le = LabelEncoder()
df_cat['Color_encoded'] = le.fit_transform(df_cat['Color'])

print("After Label Encoding:")
print(df_cat)
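
One caveat worth illustrating: `LabelEncoder` sorts classes alphabetically, so for a column like `Size` it would not produce the natural S < M < L ranking. The sketch below contrasts it with `OrdinalEncoder` given an explicit category order. The `sizes` frame is a hypothetical copy of the `Size` column above, and the ordering S < M < L is an assumption about what the categories mean.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical ordinal feature (same values as the Size column above).
sizes = pd.DataFrame({"Size": ["S", "M", "L", "S", "M"]})

# LabelEncoder sorts classes alphabetically: L=0, M=1, S=2.
alphabetical = LabelEncoder().fit_transform(sizes["Size"])
print("LabelEncoder:  ", alphabetical.tolist())  # [2, 1, 0, 2, 1]

# OrdinalEncoder with an explicit order: S=0, M=1, L=2.
encoder = OrdinalEncoder(categories=[["S", "M", "L"]])
ordered = encoder.fit_transform(sizes[["Size"]]).ravel()
print("OrdinalEncoder:", ordered.tolist())       # [0.0, 1.0, 2.0, 0.0, 1.0]
```

`OrdinalEncoder` expects a 2-D input (hence `sizes[["Size"]]`) and is meant for feature columns, whereas `LabelEncoder` works on a 1-D target.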

### Why One-Hot Encoding?
One-Hot Encoding creates one binary column per category. It's suitable for **nominal variables** that have no inherent order, and for **linear models** and other algorithms that would otherwise treat integer codes as magnitudes.

In [None]:
# One-Hot Encoding - for nominal variables
df_onehot = pd.get_dummies(df_cat[['Color', 'Size']], drop_first=False)

print("After One-Hot Encoding:")
print(df_onehot)
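
`pd.get_dummies` is convenient for exploration, but it does not remember which categories it saw, so new data can end up with different columns. A possible alternative is scikit-learn's `OneHotEncoder`, which learns the categories in `fit` and reuses them in `transform`. The sketch below uses hypothetical `train` and `test` frames and the `handle_unknown="ignore"` option, which encodes an unseen category as all zeros instead of raising an error.

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical training and test columns; "Purple" never appears in training.
train = pd.DataFrame({"Color": ["Red", "Blue", "Green", "Red"]})
test = pd.DataFrame({"Color": ["Blue", "Purple"]})

# Learn the category set from the training data only.
encoder = OneHotEncoder(handle_unknown="ignore")
encoder.fit(train)

# Columns follow the learned categories: Blue, Green, Red.
print(encoder.categories_[0])

# The encoder returns a sparse matrix by default; convert it for display.
encoded = encoder.transform(test).toarray()
print(encoded)  # "Blue" -> [1, 0, 0]; unseen "Purple" -> [0, 0, 0]
```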

## 5. Using Pipeline & ColumnTransformer

In [None]:
# Create sample dataset
np.random.seed(42)
X = pd.DataFrame({
    'Age': [25, 30, 35, 40, 45, 50, 22, 28],
    'Income': [50000, 60000, 55000, 70000, 80000, 90000, 45000, 52000],
    'City': ['NYC', 'LA', 'NYC', 'Chicago', 'LA', 'Chicago', 'NYC', 'LA']
})
y = [1, 0, 1, 0, 0, 1, 1, 0]

print("Sample Dataset:")
print(X)

### Why Pipeline?
Pipeline chains multiple transformations into a single object. This ensures **consistent preprocessing** for both training and test data and helps prevent **data leakage**, because every step is fitted on the training data only and then applied unchanged to the test data.

In [None]:
# Define preprocessing for numeric and categorical columns
numeric_features = ['Age', 'Income']
categorical_features = ['City']

numeric_transformer = Pipeline(steps=[
    ('scaler', StandardScaler())
])

categorical_transformer = Pipeline(steps=[
    ('onehot', OneHotEncoder(drop='first'))
])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ]
)

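# Apply transformation (fitted on the whole toy dataset for illustration;
# in a real workflow, fit on the training split only)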
X_transformed = preprocessor.fit_transform(X)

print("Transformed Data Shape:", X_transformed.shape)
print("\nTransformed Data (first 5 rows):")
print(X_transformed[:5].round(2))
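
To make the leakage point concrete, here is a minimal end-to-end sketch in which the preprocessing lives inside a `Pipeline` together with a model, and only the training split is passed to `fit`. The choice of `LogisticRegression`, the 75/25 split, and the names `X_demo`, `y_demo`, and `model` are assumptions for illustration; the data mirrors the sample dataset above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Same toy data as the sample dataset above.
X_demo = pd.DataFrame({
    "Age": [25, 30, 35, 40, 45, 50, 22, 28],
    "Income": [50000, 60000, 55000, 70000, 80000, 90000, 45000, 52000],
    "City": ["NYC", "LA", "NYC", "Chicago", "LA", "Chicago", "NYC", "LA"],
})
y_demo = [1, 0, 1, 0, 0, 1, 1, 0]

# Split first, so the test rows never influence the fitted statistics.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=42, stratify=y_demo
)

preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["Age", "Income"]),
    # ignore cities that happen to be absent from the training split
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["City"]),
])

model = Pipeline([
    ("preprocess", preprocess),
    ("classifier", LogisticRegression()),
])

# fit() learns scaling statistics and categories from the training split only.
model.fit(X_train, y_train)

# predict() applies the already-fitted preprocessing to the test rows.
predictions = model.predict(X_test)
print("Test rows:", len(X_test), "| predictions:", predictions.tolist())
```

With only eight rows, the predictions themselves are not meaningful; the point is the order of operations: split, fit on train, then transform and predict on test.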

## 6. Summary

| Technique | When to Use | Why |
|-----------|-------------|-----|
| Mean/Median Imputation | Numerical features with MCAR data | Mean for roughly symmetric data, median when outliers or skew are present |
| StandardScaler | Scale-sensitive algorithms (KNN, SVM, gradient-based models) | Zero mean, unit variance |
| MinMaxScaler | Bounded inputs needed, Neural Networks | Maps to [0, 1] range; sensitive to outliers |
| Label Encoding | Target labels, categorical features for tree algorithms | Compact integer codes (assigned alphabetically, not by meaningful order) |
| One-Hot Encoding | Nominal variables, Linear models | No false ordinal relationship |
| Pipeline | Any ML workflow | Prevents leakage, ensures consistency |