# CSI-6-ARI Week 4 Tutorial: **COMPLETE ANSWERED VERSION**
## Data Preprocessing

This notebook contains **all code cells answered step-by-step**, including the 4 exercises.

---
### Topics covered
1. StandardScaler (standardisation)
2. Categorical encoding (LabelEncoder & OneHotEncoder)
3. Train / Test split
4. End-to-end preprocessing with MinMaxScaler


---
## ✅ Setup

In [None]:
# ✅ Setup: run this first
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("Setup complete, seed =", SEED)

---
## Part 1: Standardisation with `StandardScaler`

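**Key formula:** `z = (x - μ) / σ`  
After standardisation, every feature column has **mean ≈ 0** and **std ≈ 1** (`StandardScaler` uses the population standard deviation).

Standardisation matters for distance-based models (KNN, SVM, KMeans), where a feature on a large scale would otherwise dominate the distance, and for gradient-based models, which tend to converge more slowly on unscaled features.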
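To make the formula concrete, here is a plain-Python sketch that standardises a single column by hand using only the standard library. It is not scikit-learn's implementation; the helper name `standardise` is hypothetical, and the population standard deviation is used to match what `StandardScaler` computes.

```python
import statistics

def standardise(column):
    """Return z-scores for one feature column: z = (x - mean) / std."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)  # population std (divides by n)
    return [(x - mu) / sigma for x in column]

# Same values as the second column of X in the Option 1 cell
column = [10, 100, 55, 25]
z = standardise(column)

print("z-scores:", [round(v, 3) for v in z])
print("mean:", round(statistics.fmean(z), 10))
print("std: ", round(statistics.pstdev(z), 10))
```

The printed mean and standard deviation should come out at approximately 0 and 1, which is the same check Exercise 1 performs on the scikit-learn output.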

In [None]:
from sklearn.preprocessing import StandardScaler

In [None]:
# --- Option 1: quick one-liner ---
# Feature matrix: 4 samples, 2 features with very different scales
X = np.array([
    [1,  10],
    [3, 100],
    [2,  55],
    [4,  25]
])

# fit_transform: learns mean/std from X and applies z = (x-mean)/std
X_scaled = StandardScaler().fit_transform(X)
print("Standardised X (2 features):")
print(X_scaled)

In [None]:
# --- Option 2: using sklearn.preprocessing directly (3 features) ---
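import sklearn.preprocessing  # import the submodule explicitly; a bare `import sklearn` does not guarantee it is loaded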

X3 = np.array([
    [1,  10,  0],
    [3, 100, 10],
    [2,  55,  5],
    [4,  25,  2]
])

X3_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X3)
print("Standardised X (3 features):")
print(X3_scaled)
# Notice each column is now on the same scale

---
### 📝 Exercise 1: Check standardisation

**Task:** Create `X2` (shape 5×2) with very different column scales, standardise it, then verify mean ≈ 0 and std ≈ 1.

In [None]:
# ✅ EXERCISE 1: ANSWER

# Step 1: Create X2 with two features on very different scales
X2 = np.array([
    [1,    10],
    [2,    50],
    [3,   100],
    [4,    25],
    [5,    60],
])
print("Original X2:")
print(X2)

# Step 2: Standardise using StandardScaler
X2_scaled = StandardScaler().fit_transform(X2)
print("\nScaled X2:")
print(X2_scaled)

# Step 3: Compute per-column mean and std
col_means = X2_scaled.mean(axis=0)
col_stds  = X2_scaled.std(axis=0)

print("\nColumn means (should be ~0):", col_means)
print("Column stds  (should be ~1):", col_stds)

# Explanation:
# StandardScaler standardises each column independently.
# Tiny non-zero means (e.g. 1e-16) are just floating-point rounding errors, not actual bias.

---
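## Part 2: Handling Categorical Variables (Encoding)

Machine learning models need numbers. Two strategies:
- **Label Encoding**: assigns integers (0, 1, 2, …). This implies an ordering, so it is only appropriate for ordinal categories whose order matches the codes. (scikit-learn's `LabelEncoder` is intended for target labels; `OrdinalEncoder` is its counterpart for feature columns.)
- **One-Hot Encoding**: creates a binary column per category. No ordering is implied, so it is the better fit for nominal categories.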
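The two strategies can be sketched in plain Python to show what each one produces. This is an illustrative, hand-rolled version (the names `code_of`, `label_encoded` and `one_hot` are made up for this sketch), not the scikit-learn implementation; it assumes codes are assigned in sorted order.

```python
colours = ["Red", "Black", "Blue"]

# Sorted unique categories define the integer codes (alphabetical order).
categories = sorted(set(colours))
code_of = {cat: i for i, cat in enumerate(categories)}

# Label encoding: one integer per value.
label_encoded = [code_of[c] for c in colours]

# One-hot encoding: one binary column per category.
one_hot = [
    [1 if code_of[c] == j else 0 for j in range(len(categories))]
    for c in colours
]

print("categories:   ", categories)
print("label encoded:", label_encoded)
for colour, row in zip(colours, one_hot):
    print(f"{colour:<6}", row)
```

The label-encoded list carries an implicit order (Black < Blue < Red), whereas each one-hot row only marks which category is present.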

In [None]:
# Original categorical feature matrix
X_cat = np.array([
    ["Red",   "Petrol", "Sedan"],
    ["Black", "Diesel", "Sedan"],
    ["Blue",  "Diesel", "Hatchback"]
])
print(X_cat)

In [None]:
from sklearn import preprocessing

# Create a LabelEncoder instance
encoder = preprocessing.LabelEncoder()

# Encode the 'Colour' feature: integers are assigned in sorted (alphabetical) order
encoded_colours = encoder.fit_transform(["Red", "Black", "Blue"])
print("Label encoded ['Red','Black','Blue']:", encoded_colours)
# Black=0, Blue=1, Red=2  (alphabetical order)

In [None]:
# fit_transform refits the encoder, so the new value 'Yellow' receives its own code
encoded2 = encoder.fit_transform(["Red", "Black", "Blue", "Yellow", "Red"])
print("Label encoded with Yellow:", encoded2)
# Black=0, Blue=1, Red=2, Yellow=3
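
Because the cell above calls `fit_transform` again, the encoder is refitted and `'Yellow'` simply receives a new code. The sketch below, assuming scikit-learn is installed, shows the other case: calling `transform` with a label that was not present at fit time raises a `ValueError`. The variable names are chosen for this illustration only.

```python
from sklearn.preprocessing import LabelEncoder

colour_encoder = LabelEncoder()
colour_encoder.fit(["Red", "Black", "Blue"])

# Labels seen during fit are mapped to their integer codes.
known = colour_encoder.transform(["Blue", "Red"])
print("Known labels:", known)

# A label that was not seen during fit cannot be encoded.
try:
    colour_encoder.transform(["Yellow"])
    unseen_rejected = False
except ValueError as err:
    unseen_rejected = True
    print("Unseen label rejected:", err)
```

This matters once an encoder is fitted on training data only: any category that first appears in the test set has to be handled explicitly.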

In [None]:
# Create the DataFrame used in this part
df = pd.DataFrame({
    'Colour':    ['Red', 'Black', 'Blue'],
    'Fuel Type': ['Petrol', 'Diesel', 'Diesel'],
    'Body':      ['Sedan', 'Sedan', 'Hatchback']
})
print(df)

In [None]:
from sklearn.preprocessing import OneHotEncoder

# Fit OneHotEncoder on the 'Colour' column
ohe = OneHotEncoder()
ohe.fit(df[['Colour']])
print("Unique colour categories:", ohe.categories_)

In [None]:
# Transform: produces a sparse matrix; .toarray() makes it dense
ohe_result = ohe.transform(df[['Colour']]).toarray()
print("One-hot encoded colours:")
print(ohe_result)
# Column order: Black, Blue, Red
# Row 0 (Red)   → [0, 0, 1]
# Row 1 (Black) → [1, 0, 0]
# Row 2 (Blue)  → [0, 1, 0]

In [None]:
# Put results into a readable DataFrame
new_df = pd.DataFrame(
    ohe_result.astype(int),
    columns=ohe.categories_[0]  # ['Black', 'Blue', 'Red'], taken from the fitted encoder
)
print(new_df)

---
### 📝 Exercise 2: Label encoding vs One-hot encoding

**Task:** Apply both encodings to the `Fuel Type` column, then explain when one-hot is preferred.

In [None]:
# ✅ EXERCISE 2: ANSWER

# Work on a copy to avoid mutating the original df
df_tmp = df.copy()

# --- Part 1: Label encoding for 'Fuel Type' ---
le = preprocessing.LabelEncoder()
df_tmp["FuelType_LE"] = le.fit_transform(df_tmp["Fuel Type"])
print("DataFrame with Label Encoded Fuel Type:")
print(df_tmp)
print("\nEncoding mapping:", dict(zip(le.classes_, le.transform(le.classes_))))
# Diesel=0, Petrol=1

In [None]:
# --- Part 2: One-hot encoding for 'Fuel Type' using pd.get_dummies ---
# dtype=int gives 0/1 columns (recent pandas versions default to True/False)
fuel_ohe = pd.get_dummies(df_tmp["Fuel Type"], prefix="Fuel", dtype=int)
print("One-hot encoded Fuel Type:")
print(fuel_ohe)

# Combine with the remaining columns (drop the original 'Fuel Type' and its label-encoded copy)
combined = pd.concat([df_tmp.drop(columns=["Fuel Type", "FuelType_LE"]), fuel_ohe], axis=1)
print("\nCombined DataFrame:")
print(combined)

In [None]:
# --- Part 3: Explanation ---
explanation = """
When to prefer One-Hot Encoding over Label Encoding:

1. One-hot encoding should be used when the categorical variable is NOMINAL 
   (no natural ordering), such as fuel type (Diesel, Petrol, Electric). 
   Label encoding would incorrectly imply Diesel < Petrol, which has no meaning.

2. Many ML algorithms (logistic regression, SVM, neural networks) treat
   integer codes as numeric magnitudes, so an arbitrary ordering can distort
   the fitted model. One-hot avoids this by giving each category its own
   independent binary feature.

3. However, if there are MANY unique categories (high cardinality), one-hot 
   encoding creates very wide sparse matrices, so label encoding or target 
   encoding may be more practical in those cases.
"""
print(explanation)

---
## Part 3: Train / Test Split

`train_test_split` randomly divides data into training and test sets.  
- `test_size=0.2` → 80% train, 20% test  
- Always set `random_state` for reproducibility
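
The idea behind a seeded random split can be sketched in a few lines of plain Python. This is a simplified, hypothetical helper (`simple_split`), not how `train_test_split` is implemented, and its rounding of the test size is an assumption of the sketch; it only illustrates why a fixed seed yields the same split on every run.

```python
import random

def simple_split(X, y, test_size=0.2, seed=0):
    """Shuffle row indices with a fixed seed, then cut off the test portion."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_train = [y[i] for i in train_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test

rows = [[i, i * 10] for i in range(10)]
labels = [i % 2 for i in range(10)]

first = simple_split(rows, labels, test_size=0.2, seed=4)
second = simple_split(rows, labels, test_size=0.2, seed=4)

print("train/test sizes:", len(first[0]), len(first[1]))
print("same seed, same split:", first == second)
```

Changing `seed` produces a different shuffle, which is why results reported without a fixed `random_state` are hard to reproduce.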

In [None]:
from sklearn.model_selection import train_test_split

# Dataset with 10 samples: 2 numeric features + 1 categorical feature
X = ([12,300,'Red'],[11,280,'Red'],[15,264,'Black'],[9,230,'Blue'],
     [25,459,'Black'],[12,400,'Red'],[42,355,'Blue'],[32,435,'Red'],
     [22,564,'Black'],[21,231,'Red'])

# Target labels
y = [1,2,1,1,1,2,2,1,2,1]

print("Full X:", X)
print("Full y:", y)

In [None]:
# Split: 80% train, 20% test, fixed seed for reproducibility
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=4
)

print("X_train (8 samples):", X_train)
print("\nX_test  (2 samples):", X_test)
print("\ny_train:", y_train)
print("y_test: ", y_test)

In [None]:
# test_size=0.2 → 2 of the 10 samples go to the test set, 8 to the training set.
# random_state=4 ensures the same split every time you run the notebook.
print("Train size:", len(X_train), "| Test size:", len(X_test))

---
### 📝 Exercise 3: Inspect your split

**Task:** Re-split with `test_size=0.3`, print sizes, count 1s and 2s in each subset.

In [None]:
# ✅ EXERCISE 3: ANSWER

# Step 1: Split with test_size=0.3 (30% test, 70% train)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.3, random_state=4
)

# Step 2: Print sizes
print("=== Sizes ===")
print(f"X_train: {len(X_train2)} samples")
print(f"X_test:  {len(X_test2)} samples")
print(f"y_train: {len(y_train2)} labels")
print(f"y_test:  {len(y_test2)} labels")

# Step 3: Count class labels
n1_train = y_train2.count(1)
n2_train = y_train2.count(2)
n1_test  = y_test2.count(1)
n2_test  = y_test2.count(2)

print("\n=== Class Counts ===")
print(f"y_train  → class 1: {n1_train}, class 2: {n2_train}")
print(f"y_test   → class 1: {n1_test},  class 2: {n2_test}")

# NOTE: Because the dataset is very small (10 samples), class balance
# can shift noticeably between splits. In practice, use stratify=y
# to preserve the original class proportions in both sets.
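
To illustrate what stratification does, here is a hand-rolled sketch that splits each class separately so that both subsets keep roughly the original class proportions. The helper `stratified_split` is hypothetical and its per-class rounding is an assumption of this sketch; in scikit-learn the equivalent is passing `stratify=y` to `train_test_split`, as the note above says.

```python
import random
from collections import defaultdict

def stratified_split(y, test_size=0.3, seed=0):
    """Return (train_idx, test_idx) with roughly equal class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)

    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_size)  # per-class test count
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)

y_demo = [1, 2, 1, 1, 1, 2, 2, 1, 2, 1]  # same labels as y in Part 3
train_idx, test_idx = stratified_split(y_demo, test_size=0.3, seed=4)

test_labels = [y_demo[i] for i in test_idx]
print("test indices:", test_idx)
print("class 1 in test:", test_labels.count(1),
      "| class 2 in test:", test_labels.count(2))
```

With six samples of class 1 and four of class 2, the test set receives two and one respectively, mirroring the 60/40 balance of the full label list.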

---
## Part 4: End-to-End Preprocessing (MinMaxScaler)

**MinMaxScaler** rescales features to a fixed range (default [0, 1]):  
`x_scaled = (x - x_min) / (x_max - x_min)`

**⚠️ Key rule: fit on train only.**  
Compute `x_min` and `x_max` from `X_train`, then apply to both `X_train` and `X_test`.
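
A small worked example may help here. The sketch below applies the formula by hand to a single column, learning `x_min` and `x_max` from the training values only. The helper names `fit_min_max` and `apply_min_max` and the toy numbers are invented for illustration; this is not scikit-learn's code.

```python
def fit_min_max(train_column):
    """Learn the scaling statistics from the training column only."""
    return min(train_column), max(train_column)

def apply_min_max(column, x_min, x_max):
    """Apply x_scaled = (x - x_min) / (x_max - x_min) with given statistics."""
    return [(x - x_min) / (x_max - x_min) for x in column]

train = [2.0, 4.0, 6.0, 10.0]
test = [1.0, 5.0, 12.0]  # 1.0 and 12.0 lie outside the training range

x_min, x_max = fit_min_max(train)
train_scaled = apply_min_max(train, x_min, x_max)
test_scaled = apply_min_max(test, x_min, x_max)

print("train scaled:", train_scaled)
print("test scaled: ", test_scaled)
```

The training values land in [0, 1], while the two test values outside the training range map to -0.125 and 1.25. That is the expected consequence of reusing the training statistics.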

In [None]:
# Create a synthetic fake_reg.csv dataset (a stand-in for gemstone measurements)
rng = np.random.default_rng(SEED)
n = 1000
feature1 = rng.normal(loc=0.0, scale=1.0, size=n)
feature2 = rng.normal(loc=0.0, scale=1.0, size=n)
noise    = rng.normal(loc=0.0, scale=0.15, size=n)
price    = 50_000 + 12_000*feature1 + 8_000*feature2 + 10_000*noise

fake = pd.DataFrame({"feature1": feature1, "feature2": feature2, "price": price})
fake.to_csv("fake_reg.csv", index=False)
print("Created fake_reg.csv with shape", fake.shape)

In [None]:
# Load the data
df = pd.read_csv('fake_reg.csv')
df.head()

In [None]:
# Extract features (X) and target (y)
X = df[['feature1', 'feature2']].values
y = df['price'].values

# Split 70% train / 30% test
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

print("X_train shape:", X_train.shape)   # (700, 2)
print("X_test  shape:", X_test.shape)    # (300, 2)
print("y_train shape:", y_train.shape)   # (700,)
print("y_test  shape:", y_test.shape)    # (300,)

In [None]:
from sklearn.preprocessing import MinMaxScaler

# Create scaler
scaler = MinMaxScaler()

# CRITICAL: fit ONLY on training data (no data leakage)
scaler.fit(X_train)

# Transform both sets using the training statistics
X_train = scaler.transform(X_train)
X_test  = scaler.transform(X_test)

print("X_train after MinMaxScaler (first 3 rows):")
print(X_train[:3])
print("\nX_test after MinMaxScaler (first 3 rows):")
print(X_test[:3])

---
### 📝 Exercise 4: Check for data leakage

**Task:** Verify `X_train` values are in [0,1], check `X_test` range, explain any out-of-range values.

In [None]:
# ✅ EXERCISE 4: ANSWER

# Step 1: Min and max of X_train after scaling
print("=== X_train (scaled) ===")
print("Per-column min:", X_train.min(axis=0))   # Expected: 0.0 in each column
print("Per-column max:", X_train.max(axis=0))   # Expected: 1.0 in each column (up to floating-point rounding)

# Step 2: Min and max of X_test after scaling
print("\n=== X_test (scaled) ===")
print("Per-column min:", X_test.min(axis=0))    # Could be slightly < 0
print("Per-column max:", X_test.max(axis=0))    # Could be slightly > 1

# Step 3: Check if any test values fall outside [0, 1]
outside = ((X_test < 0) | (X_test > 1)).any()
print("\nAny X_test values outside [0, 1]?", outside)

In [None]:
# --- Explanation ---
explanation = """
Why can X_test values fall outside [0, 1]?

1. MinMaxScaler learns x_min and x_max exclusively from X_train.
   It scales values using: (x - train_min) / (train_max - train_min)

2. If X_test contains values BELOW the training minimum, the scaled 
   value will be negative (< 0). If X_test values are ABOVE the 
   training maximum, the scaled value will exceed 1.

3. This is CORRECT BEHAVIOUR and not a bug. The alternative, fitting
   the scaler on the full dataset, would leak test-set information
   into the preprocessing step, making model evaluation unreliable.
   Slight out-of-range values on test data are acceptable.
"""
print(explanation)

---
## ✅ Summary: Key takeaways from Week 4

| Concept | Tool | When to use |
|---|---|---|
| Standardisation | `StandardScaler` | KNN, SVM, PCA, gradient descent models |
| Normalisation | `MinMaxScaler` | Neural networks, when a bounded range is needed |
| Label encoding | `LabelEncoder` | Ordinal categories (when the integer order matches), tree models |
| One-hot encoding | `OneHotEncoder` / `pd.get_dummies` | Nominal categories, linear/distance models |
| Train/test split | `train_test_split` | Always, before any fitting! |

### ⚠️ The golden rule
**Fit preprocessing (scalers, encoders) on the training set only**, then transform both sets. Fitting on the full dataset causes **data leakage** and gives unreliable model evaluation.
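
One way to make this rule hard to break is to wrap the scaler and the model in a scikit-learn pipeline, so that calling `fit` on the training data fits the scaler on that data only. The sketch below assumes scikit-learn and NumPy are installed; the synthetic data and the choice of `LinearRegression` are illustrative assumptions, not part of the tutorial's exercises.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic data in the same spirit as fake_reg.csv (values are illustrative).
rng = np.random.default_rng(42)
X_syn = rng.normal(size=(1000, 2))
y_syn = (50_000 + 12_000 * X_syn[:, 0] + 8_000 * X_syn[:, 1]
         + rng.normal(scale=1_500, size=1000))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.3, random_state=42
)

# fit() fits the scaler on X_tr only, then trains the model on the scaled X_tr.
pipe = make_pipeline(MinMaxScaler(), LinearRegression())
pipe.fit(X_tr, y_tr)

# score() reuses the training min/max to scale X_te before predicting.
r2 = pipe.score(X_te, y_te)
print("Test R^2:", round(r2, 3))

fitted_scaler = pipe.named_steps["minmaxscaler"]
print("Scaler min learned from train:", fitted_scaler.data_min_)
```

Because the scaler lives inside the pipeline, there is no separate `fit` call that could accidentally be given the full dataset.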