# <font color="#418FDE" size="6.5" uppercase>**Splits And Validation**</font>

>Last update: 20260201.
    
By the end of this lecture, you will be able to:
- Describe the roles of training, validation, and test splits in model development. 
- Propose simple ways to split a small dataset into separate parts. 
- Explain how improper splitting can lead to overly optimistic performance estimates. 


## **1. Why We Split Data**

### **1.1. Separating Learning And Testing**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_01_01.jpg?v=1769963625" width="250">



>* Keep training and testing data strictly separate
>* Use unseen test data to check generalization

>* Models must be tested on unseen future-like data
>* Held-out test sets give realistic performance estimates

>* Using test data during tuning biases decisions
>* Keeping test data untouched gives honest generalization estimate
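
The gap between training and test performance can be sketched with a model that simply memorizes its training rows. The toy data, the 20% label noise, and the helper names below are assumptions made for illustration only.

```python
import random


def make_data(n, rng):
    """Hypothetical toy data: label is 1 when x > 0.5, with 20% label noise."""
    rows = []
    for _ in range(n):
        x = rng.random()
        label = 1 if x > 0.5 else 0
        if rng.random() < 0.2:  # flip some labels to simulate noise
            label = 1 - label
        rows.append((x, label))
    return rows


def predict_nearest(train_rows, x):
    """1-nearest-neighbour: copy the label of the closest training point."""
    nearest = min(train_rows, key=lambda row: abs(row[0] - x))
    return nearest[1]


def accuracy(train_rows, eval_rows):
    """Fraction of eval_rows whose label matches the nearest training label."""
    hits = sum(predict_nearest(train_rows, x) == y for x, y in eval_rows)
    return hits / len(eval_rows)


rng = random.Random(0)
train_rows = make_data(40, rng)
test_rows = make_data(40, rng)

train_acc = accuracy(train_rows, train_rows)  # scored on rows it memorized
test_acc = accuracy(train_rows, test_rows)    # scored on unseen rows

print("Accuracy on training data:", train_acc)
print("Accuracy on unseen test data:", round(test_acc, 2))
```

Each training point is its own nearest neighbour, so the training score says nothing about how well the noisy labels generalize. Only the held-out score does.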



### **1.2. Preventing Data Leakage**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_01_02.jpg?v=1769963637" width="250">



>* Data leakage lets models secretly use information unavailable at prediction time
>* Splitting into train, validation, and test sets before preprocessing blocks these leaks

>* Leakage often happens during preprocessing and feature engineering
>* Fit all transformations only on training data

>* Make all data-driven decisions using training data only
>* Keep validation and test data out of every fitting step to avoid leakage



In [None]:
#@title Python Code - Preventing Data Leakage

# This script shows simple data leakage prevention.
# We use tiny synthetic data for illustration.
# Focus on splitting before preprocessing operations.

# import required built in and numeric libraries.
import numpy as np
import pandas as pd

# set deterministic random seed for reproducibility.
rng = np.random.default_rng(seed=42)

# create tiny dataset with target and one feature.
size = 20
feature = rng.normal(loc=50.0, scale=10.0, size=size)

# create target correlated with feature plus noise.
target = (feature * 0.5) + rng.normal(loc=0.0, scale=5.0, size=size)

# build pandas dataframe from arrays.
data = pd.DataFrame({"feature": feature, "target": target})

# show first few rows to understand structure.
print("Head of full dataset with feature and target:")
print(data.head(5))

# define train size and compute split index.
train_size = 14
split_index = train_size

# perform incorrect scaling using full dataset.
full_mean = data["feature"].mean()
full_std = data["feature"].std(ddof=0)

# scale feature using information from all rows.
data["feature_scaled_leaky"] = (data["feature"] - full_mean) / full_std

# perform correct split before computing statistics.
train_data = data.iloc[:split_index].copy()
test_data = data.iloc[split_index:].copy()

# compute scaling parameters using only training data.
train_mean = train_data["feature"].mean()
train_std = train_data["feature"].std(ddof=0)

# scale train and test using training statistics only.
train_data["feature_scaled_safe"] = (
    (train_data["feature"] - train_mean) / train_std
)

# apply same transformation to test data safely.
test_data["feature_scaled_safe"] = (
    (test_data["feature"] - train_mean) / train_std
)

# print comparison of means for scaled features.
print("\nLeaky scaled feature mean using all data:")
print(round(data["feature_scaled_leaky"].mean(), 4))

# show safe scaled feature mean on training data.
print("Safe scaled training feature mean using train only:")
print(round(train_data["feature_scaled_safe"].mean(), 4))

# show safe scaled feature mean on test data.
print("Safe scaled test feature mean using train only:")
print(round(test_data["feature_scaled_safe"].mean(), 4))

# final print summarizing why leakage is problematic.
print("\nLeaky scaling secretly used test information during preprocessing.")



### **1.3. Fair Model Assessment**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_01_03.jpg?v=1769963662" width="250">



>* Splits simulate how models face real data
>* Held-out test set estimates future model performance

>* Reusing training data for testing inflates performance
>* A separate untouched test set checks real-world accuracy

>* Use one shared, untouched test set
>* Ensures fair, unbiased comparison of model performance
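
A fair comparison can be sketched by fitting every candidate on the same training rows and scoring each on the same held-out rows. The toy regression data and the two candidate models below are hypothetical and were chosen only to keep the example short.

```python
import random

rng = random.Random(1)

# Hypothetical toy regression data: y is roughly 3x + 2 plus noise.
xs = [rng.uniform(0, 10) for _ in range(50)]
rows = [(x, 3.0 * x + 2.0 + rng.gauss(0, 1.0)) for x in xs]

# One shuffle, one split: every candidate sees the same train and test rows.
rng.shuffle(rows)
train_rows, test_rows = rows[:40], rows[40:]


def fit_mean(train):
    """Baseline: always predict the mean training target."""
    mean_y = sum(y for _, y in train) / len(train)
    return lambda x: mean_y


def fit_line(train):
    """Least-squares line fitted on the training rows only."""
    n = len(train)
    mean_x = sum(x for x, _ in train) / n
    mean_y = sum(y for _, y in train) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in train)
    sxx = sum((x - mean_x) ** 2 for x, _ in train)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return lambda x: slope * x + intercept


def mse(model, eval_rows):
    """Mean squared error of a model on the given rows."""
    return sum((model(x) - y) ** 2 for x, y in eval_rows) / len(eval_rows)


candidates = {
    "mean baseline": fit_mean(train_rows),
    "linear fit": fit_line(train_rows),
}
shared_test_mse = {name: mse(model, test_rows) for name, model in candidates.items()}

for name, score in shared_test_mse.items():
    print(f"{name}: test MSE = {score:.3f}")
```

Because both scores come from the same untouched rows, the difference between them reflects the models rather than a lucky or unlucky split.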



## **2. Basic Data Splits**

### **2.1. Simple Holdout Split**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_02_01.jpg?v=1769963677" width="250">



>* Split data into training and holdout groups
>* Holdout set estimates real-world performance and exposes overfitting

>* Randomly hold out a small shuffled portion
>* Evaluate model on held-out data for honesty

>* Simple holdout splits are quick, easy, practical
>* Single splits can be unstable on limited data



In [None]:
#@title Python Code - Simple Holdout Split

# This script demonstrates a simple holdout split.
# We use tiny synthetic data for clear illustration.
# Focus on separating training and holdout evaluation sets.

# Import required standard libraries for randomness and math.
import random
import math

# Set a deterministic random seed for reproducible splitting.
random.seed(42)

# Create a tiny synthetic dataset of labeled examples.
examples = []
for i in range(1, 21):
    label = "positive" if i % 2 == 0 else "negative"
    examples.append({"id": i, "text": f"review_{i}", "label": label})

# Verify dataset size is safely large enough for splitting.
if len(examples) < 4:
    raise ValueError("Dataset too small for meaningful holdout split.")

# Decide the holdout fraction for evaluation set size.
holdout_fraction = 0.25

# Compute holdout size using floor to avoid overshooting.
raw_holdout_size = len(examples) * holdout_fraction
holdout_size = max(1, math.floor(raw_holdout_size))

# Shuffle dataset copy so original order remains unchanged.
shuffled_examples = examples.copy()
random.shuffle(shuffled_examples)

# Split shuffled data into training and holdout subsets.
holdout_set = shuffled_examples[:holdout_size]
training_set = shuffled_examples[holdout_size:]

# Validate that splits cover all examples without overlap.
all_ids = {e["id"] for e in examples}
train_ids = {e["id"] for e in training_set}
holdout_ids = {e["id"] for e in holdout_set}

# Ensure no example appears in both training and holdout sets.
if train_ids.intersection(holdout_ids):
    raise RuntimeError("Overlap detected between training and holdout sets.")

# Ensure combined split sizes exactly match original dataset size.
if len(train_ids) + len(holdout_ids) != len(all_ids):
    raise RuntimeError("Split sizes do not match original dataset size.")

# Print short summary describing the simple holdout split.
print("Total examples in full dataset:", len(examples))
print("Training examples after split:", len(training_set))
print("Holdout examples after split:", len(holdout_set))

# Show a few training examples to illustrate their structure.
print("\nSample training examples (id, text, label):")
for example in training_set[:3]:
    print(example["id"], example["text"], example["label"])

# Show all holdout examples since dataset is intentionally tiny.
print("\nAll holdout examples (id, text, label):")
for example in holdout_set:
    print(example["id"], example["text"], example["label"])
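
The point that a single split can be unstable on limited data can be sketched by repeating the split with different seeds. The 20-label dataset and the helper name below are hypothetical. The positive rate of each holdout set equals the accuracy that an "always predict positive" rule would score on it.

```python
import random
import statistics

# Hypothetical tiny dataset: 20 binary labels, 12 positives and 8 negatives.
labels = [1] * 12 + [0] * 8


def holdout_positive_rate(all_labels, holdout_size, seed):
    """Shuffle with the given seed and return the positive rate of the holdout part."""
    rng = random.Random(seed)
    shuffled = all_labels.copy()
    rng.shuffle(shuffled)
    holdout = shuffled[:holdout_size]
    return sum(holdout) / len(holdout)


# Repeat the same 5-example holdout split with 30 different seeds.
rates = [holdout_positive_rate(labels, holdout_size=5, seed=s) for s in range(30)]

print("Overall positive rate:", sum(labels) / len(labels))
print("Lowest holdout rate:", min(rates))
print("Highest holdout rate:", max(rates))
print("Spread (population std dev):", round(statistics.pstdev(rates), 3))
```

The model and the data never change here, only the shuffle, so any spread in the printed rates comes purely from which five examples happened to be held out.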



### **2.2. Choosing Split Ratios**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_02_02.jpg?v=1769963719" width="250">



>* Balance training size with evaluation reliability
>* Too little training data hurts learning; too little test data makes estimates unreliable

>* Start with simple ratios like 80-20 splits
>* Adjust splits based on size, diversity, purpose

>* Plan split sizes around validation needs
>* Balance dataset size, diversity, and tuning cycles
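
The trade-off behind a split ratio can be sketched with a little arithmetic. The helper names below are hypothetical, and the standard-error formula assumes independent test examples and a binomial model of accuracy, which is a simplification.

```python
import math


def split_sizes(n_rows, train_frac=0.8):
    """Return (train_size, test_size) for a simple two-way split."""
    if not 0.0 < train_frac < 1.0:
        raise ValueError("train_frac must be strictly between 0 and 1")
    train_size = int(n_rows * train_frac)
    return train_size, n_rows - train_size


def accuracy_standard_error(accuracy, test_size):
    """Binomial standard error of an accuracy measured on test_size examples."""
    return math.sqrt(accuracy * (1.0 - accuracy) / test_size)


# Same 80-20 ratio, three dataset sizes: the test estimate tightens as data grows.
for n_rows in (100, 1000, 10000):
    train_size, test_size = split_sizes(n_rows, train_frac=0.8)
    se = accuracy_standard_error(0.9, test_size)
    print(f"n={n_rows}: train={train_size}, test={test_size}, "
          f"std. error at 90% accuracy ~ {se:.3f}")
```

With a fixed ratio, the standard error shrinks with the square root of the test size. That is one reason a small dataset may need a larger test fraction than a large one.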



### **2.3. Random Versus Ordered Splits**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_02_03.jpg?v=1769963731" width="250">



>* Random splits mix examples for balanced representation
>* Ordered splits follow dataset sequence but risk bias

>* Use ordered splits when data have sequence
>* Preserving order better reflects real future performance

>* Random splits mix different subgroups more fairly
>* Choose split method that matches real data use
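
The two strategies can be sketched side by side on a small ordered dataset. The daily records and both helper functions below are hypothetical illustrations, not a prescribed API.

```python
import random

# Hypothetical daily records, already sorted from oldest (day 1) to newest (day 10).
records = [{"day": day, "value": 100 + day} for day in range(1, 11)]


def ordered_split(rows, test_size):
    """Keep the original order: earliest rows train, latest rows test."""
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    return rows[:-test_size], rows[-test_size:]


def random_split(rows, test_size, seed=0):
    """Shuffle a copy first so test rows are drawn from the whole range."""
    if not 0 < test_size < len(rows):
        raise ValueError("test_size must be between 1 and len(rows) - 1")
    rng = random.Random(seed)
    shuffled = rows.copy()
    rng.shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


ordered_train, ordered_test = ordered_split(records, test_size=3)
random_train, random_test = random_split(records, test_size=3)

print("Ordered test days:", [r["day"] for r in ordered_test])
print("Random test days:", sorted(r["day"] for r in random_test))

# In the ordered split every training day precedes every test day.
assert max(r["day"] for r in ordered_train) < min(r["day"] for r in ordered_test)
```

The ordered split mimics predicting the future from the past, while the random split may place later days in training and earlier days in testing. Which one is appropriate depends on how the model will be used.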



## **3. Splitting Mistakes**

### **3.1. Overusing Test Sets**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_03_01.jpg?v=1769963748" width="250">



>* Test set is a one-time final exam
>* Repeated checks overfit to its noise, inflating performance

>* Team keeps tweaking models using same test set
>* Test score inflates while real-world performance lags behind

>* Overusing test sets happens across many fields
>* Turns test data into de facto training data, inflating reported performance
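
The inflation can be sketched with a small simulation. The setup is deliberately artificial and every name in it is hypothetical: the labels are coin flips and each "model" guesses at random, so no candidate can truly beat 50% accuracy.

```python
import random

rng = random.Random(2024)

# Hypothetical setup: labels are pure coin flips, so 50% is the true ceiling.
test_labels = [rng.randint(0, 1) for _ in range(30)]     # small, reused test set
fresh_labels = [rng.randint(0, 1) for _ in range(5000)]  # stands in for future data


def random_guesser(seed):
    """A 'model' that ignores its input and guesses labels from its own seed."""
    def predict(n):
        guess_rng = random.Random(seed)
        return [guess_rng.randint(0, 1) for _ in range(n)]
    return predict


def accuracy(labels, preds):
    """Fraction of positions where the guess matches the label."""
    return sum(t == p for t, p in zip(labels, preds)) / len(labels)


# Try 200 variants and keep whichever scores best on the same test set.
scores = {}
for seed in range(200):
    model = random_guesser(seed)
    scores[seed] = accuracy(test_labels, model(len(test_labels)))

best_seed = max(scores, key=scores.get)
best_test_acc = scores[best_seed]
fresh_acc = accuracy(fresh_labels, random_guesser(best_seed)(len(fresh_labels)))

print("Best accuracy on the reused test set:", round(best_test_acc, 2))
print("Same model on fresh data:", round(fresh_acc, 2))
```

The winning variant was selected for matching the noise in 30 particular labels, so its score on the reused test set overstates what it achieves on fresh data.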



### **3.2. Peeking At Evaluation**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_03_02.jpg?v=1769963762" width="250">



>* Repeatedly tweaking after evaluation quietly uses test data
>* Model overfits quirks, inflating expected real-world performance

>* Repeated tuning on one split encodes quirks
>* Model looks great there, fails on new data

>* Repeated validation checks secretly guide model design
>* This captures noise, inflating performance and trust



In [None]:
#@title Python Code - Peeking at evaluation

# This script shows peeking at evaluation effects.
# It uses a tiny synthetic regression dataset.
# We will overfit by reusing validation feedback.

# import required built in and numerical libraries.
import numpy as np

# set deterministic random seed for reproducibility.
np.random.seed(42)

# generate simple one dimensional input feature values.
X = np.linspace(0.0, 1.0, 60).reshape(-1, 1)

# generate target values from a noisy linear relationship.
y = 2.0 * X[:, 0] + 0.3 + np.random.normal(0.0, 0.05, 60)

# verify shapes are as expected before splitting.
assert X.shape == (60, 1) and y.shape == (60,)

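# shuffle indices so every split covers the same input range.
# (contiguous slices of the sorted X would test extrapolation instead.)
shuffled_idx = np.random.permutation(len(X))

# take the first 30 shuffled indices for training.
train_idx = shuffled_idx[:30]

# take the next 15 shuffled indices for validation.
val_idx = shuffled_idx[30:45]

# take the remaining 15 shuffled indices for testing.
test_idx = shuffled_idx[45:]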

# create split arrays using the index selections.
X_train, y_train = X[train_idx], y[train_idx]

# create validation arrays using the index selections.
X_val, y_val = X[val_idx], y[val_idx]

# create test arrays using the index selections.
X_test, y_test = X[test_idx], y[test_idx]

# define a function computing mean squared error safely.
def mean_squared_error(y_true, y_pred):
    # ensure shapes match before computing error.
    assert y_true.shape == y_pred.shape

    # compute average squared difference between arrays.
    return float(np.mean((y_true - y_pred) ** 2))

# define a function fitting polynomial regression models.
def fit_polynomial_model(x_values, y_values, degree):
    # build Vandermonde matrix for polynomial features.
    X_poly = np.vander(x_values[:, 0], degree + 1, increasing=True)

    # solve least squares problem for polynomial coefficients.
    coeffs, *_ = np.linalg.lstsq(X_poly, y_values, rcond=None)

    # return learned coefficient vector for later predictions.
    return coeffs

# define a function predicting using polynomial coefficients.
def predict_polynomial(x_values, coeffs):
    # build Vandermonde matrix for polynomial features.
    X_poly = np.vander(x_values[:, 0], len(coeffs), increasing=True)

    # compute predictions using matrix multiplication.
    return X_poly @ coeffs

# define candidate polynomial degrees to explore greedily.
candidate_degrees = list(range(1, 13))

# prepare containers for tracking validation and test errors.
val_errors = []

# prepare containers for tracking test errors for each degree.
test_errors = []

# loop over candidate degrees simulating repeated peeking.
for degree in candidate_degrees:
    # fit model on training data using current degree.
    coeffs = fit_polynomial_model(X_train, y_train, degree)

    # compute validation predictions using current model.
    val_pred = predict_polynomial(X_val, coeffs)

    # compute test predictions using current model.
    test_pred = predict_polynomial(X_test, coeffs)

    # compute and store validation mean squared error.
    val_mse = mean_squared_error(y_val, val_pred)

    # compute and store test mean squared error.
    test_mse = mean_squared_error(y_test, test_pred)

    # append errors to tracking lists for later inspection.
    val_errors.append(val_mse)

    # append test errors to tracking lists for later inspection.
    test_errors.append(test_mse)

# convert error lists to numpy arrays for easier indexing.
val_errors = np.array(val_errors)

# convert test error list to numpy array for easier indexing.
test_errors = np.array(test_errors)

# find degree with smallest validation error after peeking.
best_index = int(np.argmin(val_errors))

# extract the degree chosen by validation error.
best_degree = candidate_degrees[best_index]

# extract the validation error for the chosen degree.
best_val_error = float(val_errors[best_index])

# compute test error for a simple baseline degree.
baseline_degree = 1

# find index of baseline degree within candidate list.
baseline_index = candidate_degrees.index(baseline_degree)

# extract baseline validation error from stored array.
baseline_val_error = float(val_errors[baseline_index])

# extract baseline test error from stored array.
baseline_test_error = float(test_errors[baseline_index])

# extract test error for the best validation degree.
best_test_error = float(test_errors[best_index])

# print short header explaining the experiment scenario.
print("Simulating peeking at validation to choose model complexity.")

# print baseline model degree and its validation error.
print("Baseline degree", baseline_degree, "validation MSE", round(baseline_val_error, 4))

# print baseline model degree and its test error.
print("Baseline degree", baseline_degree, "test MSE", round(baseline_test_error, 4))

# print chosen degree after repeated validation peeking.
print("Chosen degree after peeking", best_degree, "validation MSE", round(best_val_error, 4))

# print test error for the chosen degree to show optimism.
print("Chosen degree after peeking", best_degree, "test MSE", round(best_test_error, 4))

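# print an interpretation that matches the numbers actually observed.
if best_test_error > best_val_error:
    print("Degree", best_degree, "scores better on validation than on test: the validation score is optimistic.")
else:
    print("Degree", best_degree, "did not score worse on test this time, but selecting on validation can still be optimistic.")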



### **3.3. Common Leakage Scenarios**

<img src="https://cdn.jsdelivr.net/gh/mhrafiei/contents@main/LFF/Machine Learning for Beginners/Module_07/Lecture_A/image_03_03.jpg?v=1769963789" width="250">



>* Data leakage happens when information unavailable at prediction time reaches the model
>* Same customer in train and test inflates performance

>* Preprocessing must be learned only from training data
>* Using full data for scaling or selection inflates performance

>* Using future-only data secretly boosts evaluation scores
>* Match training information to real prediction-time availability



In [None]:
#@title Python Code - Common Leakage Scenarios

# This script shows simple data leakage examples.
# It compares correct and incorrect dataset splitting.
# Focus on how leakage inflates model performance.

# import the only standard library module this script needs.
# (the data below is fully deterministic, so no random seed is required.)
import statistics

# generate tiny synthetic customer dataset with labels.
customers = [
    {"id": cid, "income": 30000 + cid * 1000,
     "debt": 1000 + (cid % 3) * 500,
     "default": 1 if cid % 4 == 0 else 0}
    for cid in range(1, 13)
]

# verify dataset size is small and safe.
assert len(customers) == 12

# define simple accuracy function for predictions.
def accuracy(y_true, y_pred):
    # count positions where prediction and truth agree.
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    # guard against empty input to avoid division by zero.
    return correct / len(y_true) if y_true else 0.0


# build a naive rule based model from training data.
def train_rule_model(train_rows):
    # collect incomes separately for defaulters and non-defaulters.
    incomes_default = [r["income"] for r in train_rows if r["default"] == 1]
    incomes_ok = [r["income"] for r in train_rows if r["default"] == 0]

    # fall back to the overall mean income if one class is missing.
    if not incomes_default or not incomes_ok:
        threshold = statistics.mean(r["income"] for r in train_rows)
    else:
        # otherwise use the midpoint between the two class means.
        threshold = (statistics.mean(incomes_default) + statistics.mean(incomes_ok)) / 2

    return threshold


# use the learned threshold to make predictions.
def predict_with_threshold(rows, threshold):
    # predict default (1) when income reaches the threshold.
    return [1 if r["income"] >= threshold else 0 for r in rows]


# create a leaky split where same customers appear twice.
leaky_train = []
leaky_test = []
for row in customers:
    duplicate = dict(row)
    leaky_train.append(row)
    leaky_test.append(duplicate)

# train model on leaky training data.
leaky_threshold = train_rule_model(leaky_train)

# evaluate model on leaky test data.
leaky_preds = predict_with_threshold(leaky_test, leaky_threshold)
leaky_true = [r["default"] for r in leaky_test]
leaky_acc = accuracy(leaky_true, leaky_preds)

# create a proper split by separating customer identities.
proper_train = customers[:8]
proper_test = customers[8:]

# ensure no overlapping customer identifiers.
assert not {r["id"] for r in proper_train}.intersection({r["id"] for r in proper_test})

# train model on proper training data only.
proper_threshold = train_rule_model(proper_train)

# evaluate model on truly unseen customers.
proper_preds = predict_with_threshold(proper_test, proper_threshold)
proper_true = [r["default"] for r in proper_test]
proper_acc = accuracy(proper_true, proper_preds)

# print concise comparison of both scenarios.
print("Leaky split accuracy (test rows were seen in training):", round(leaky_acc, 2))
print("Proper split accuracy (unseen customers only):", round(proper_acc, 2))
print("Same rule, different splitting, different trust level.")
print("Leakage happens when test customers influence training.")
print("Always design splits that mimic real deployment.")
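
When one customer contributes several rows, splitting row by row can still place the same customer on both sides. A group-aware split is one way to avoid that. The sketch below assumes a hypothetical transaction log and a hypothetical `split_by_group` helper.

```python
import random

# Hypothetical transaction log: several rows can belong to the same customer.
transactions = [
    {"customer_id": cid, "amount": 10 * cid + visit}
    for cid in range(1, 9)
    for visit in range(3)
]


def split_by_group(rows, group_key, test_fraction=0.25, seed=0):
    """Assign whole groups to train or test so no group appears in both."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test_groups = max(1, int(len(groups) * test_fraction))
    test_groups = set(groups[:n_test_groups])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


group_train, group_test = split_by_group(transactions, "customer_id")
train_customers = {row["customer_id"] for row in group_train}
test_customers = {row["customer_id"] for row in group_test}

print("Customers in train:", sorted(train_customers))
print("Customers in test:", sorted(test_customers))
print("Shared customers:", train_customers & test_customers)
```

Shuffling customer identifiers instead of rows keeps every customer's history on one side of the split, so the test set contains only customers the model has never seen.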





In this lecture, you learned to:
- Describe the roles of training, validation, and test splits in model development. 
- Propose simple ways to split a small dataset into separate parts. 
- Explain how improper splitting can lead to overly optimistic performance estimates. 

In the next lecture (Lecture B), we will go over 'Cleaning And Scaling'.