# CSI-6-ARI Week 4 Tutorial — **COMPLETE ANSWERED VERSION**
## Data Preprocessing

This notebook contains **all code cells answered step-by-step**, including the 4 exercises.

---
### Topics covered
1. StandardScaler (standardisation)
2. Categorical encoding (LabelEncoder & OneHotEncoder)
3. Train / Test split
4. End-to-end preprocessing with MinMaxScaler


## Setup

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import random

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

print("Setup complete, seed =", SEED)

Setup complete, seed = 42


## Part 1 — Standardisation with `StandardScaler`

**Key formula:** `z = (x - μ) / σ`, where `μ` and `σ` are the mean and standard deviation of the feature column.
After standardisation, every feature column has **mean ≈ 0** and **std ≈ 1**.

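Standardisation matters for distance-based models (KNN, SVM, KMeans), where a feature with a large numeric range would otherwise dominate the distance, and for gradient-based models, which tend to converge faster when all features share a similar scale.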
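
As a rough illustration of the formula and of the distance argument, the sketch below applies `z = (x - μ) / σ` by hand with NumPy and compares how much of the Euclidean distance between two samples comes from the large-scale feature before and after scaling. The names `X_demo`, `share_raw` and `share_scaled` are made up for this example, and it assumes the population standard deviation (`ddof=0`), which is the convention `StandardScaler` follows.

```python
import numpy as np

# Toy matrix: 4 samples, 2 features on very different scales (illustrative name)
X_demo = np.array([[1, 10], [3, 100], [2, 55], [4, 25]], dtype=float)

# z = (x - mu) / sigma, computed column by column
mu = X_demo.mean(axis=0)
sigma = X_demo.std(axis=0)  # population std (ddof=0)
Z = (X_demo - mu) / sigma

print("column means after scaling:", Z.mean(axis=0))
print("column stds after scaling: ", Z.std(axis=0))

# Euclidean distance between sample 0 and sample 1, raw vs. scaled
d_raw = np.linalg.norm(X_demo[0] - X_demo[1])
d_scaled = np.linalg.norm(Z[0] - Z[1])

# Fraction of the squared distance contributed by the second feature
share_raw = (X_demo[0, 1] - X_demo[1, 1]) ** 2 / d_raw ** 2
share_scaled = (Z[0, 1] - Z[1, 1]) ** 2 / d_scaled ** 2
print(f"second feature's share of squared distance: raw={share_raw:.4f}, scaled={share_scaled:.4f}")
```

On the raw data the second feature accounts for almost all of the distance; after scaling, both features contribute on a comparable footing.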

In [6]:
from sklearn.preprocessing import StandardScaler

In [7]:
# --- Option 1: quick one-liner ---
# Feature matrix: 4 samples, 2 features with very different scales
X = np.array([
    [1,  10],
    [3, 100],
    [2,  55],
    [4,  25]
])

# fit_transform: learns mean/std from X and applies z = (x-mean)/std
X_scaled = StandardScaler().fit_transform(X)
print("Standardised X (2 features):")
print(X_scaled)

Standardised X (2 features):
[[-1.34164079 -1.09108945]
 [ 0.4472136   1.52752523]
 [-0.4472136   0.21821789]
 [ 1.34164079 -0.65465367]]
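
The `fit_transform` call above learns the column statistics and applies them in one step. A NumPy-only sketch of the two steps kept separate may make the distinction clearer. Here `X_fit` repeats the matrix from the cell above, `x_new` is a hypothetical unseen sample, and the variables `mean_` and `scale_` are named after the attributes a fitted `StandardScaler` exposes. This is an approximation of the idea, not scikit-learn's actual implementation.

```python
import numpy as np

X_fit = np.array([[1, 10], [3, 100], [2, 55], [4, 25]], dtype=float)
x_new = np.array([[5.0, 47.5]])  # hypothetical sample not seen during fitting

# "fit": learn per-column statistics from the fitting data only
mean_ = X_fit.mean(axis=0)
scale_ = X_fit.std(axis=0)  # population std (ddof=0)

# "transform": reuse the learned statistics on any data, old or new
X_fit_scaled = (X_fit - mean_) / scale_
x_new_scaled = (x_new - mean_) / scale_

print("learned mean: ", mean_)
print("learned scale:", scale_)
print("new sample, scaled:", x_new_scaled)
```

Keeping the two steps apart is what allows statistics learned on one dataset to be reused unchanged on another.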


In [None]:
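# --- Option 2: fully qualified name sklearn.preprocessing.StandardScaler (3 features) ---
# Import the submodule explicitly: a bare `import sklearn` is not guaranteed
# to expose sklearn.preprocessing on every scikit-learn version.
import sklearn.preprocessing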

X3 = np.array([
    [1,  10,  0],
    [3, 100, 10],
    [2,  55,  5],
    [4,  25,  2]
])

X3_scaled = sklearn.preprocessing.StandardScaler().fit_transform(X3)
print("Standardised X (3 features):")
print(X3_scaled)
# Notice each column is now on the same scale

---
### Exercise 1 — Check standardisation

**Task:** Create `X2` (shape 5×2) with very different column scales, standardise it, then verify that each column has mean ≈ 0 and std ≈ 1.