# Lecture Notes: Preventing Data Leakage and Building Robust Machine Learning Pipelines

In these notes, we cover how to simulate and visualize data, understand the dangers of data leakage, and build robust machine learning pipelines. Our example will focus on forecasting ice cream sales. The following topics are covered:

**Table of Contents**
1. Introduction: Ice Cream Sales Forecasting
2. Setting Up: Importing Required Libraries
3. Data Simulation: Creating Sales Data
4. Data Visualization: Spotting Trends
5. A Naïve Approach: Using the Entire Dataset
6. Demonstrating Data Leakage
7. A Careful Approach: Proper Train-Test Split and Standardization
8. Detailed Illustration: fit() vs. transform()
9. Leveraging Pipelines to Prevent Data Leakage
10. Model Persistence with pickle: Saving the Best Model
11. Extended Concepts: Train, Validation, and Test Datasets & The ML Pipeline
12. Summary and Best Practices

<a name="introduction"></a>
## 1. Introduction: Ice Cream Sales Forecasting

In this lecture, we walk through the steps required to forecast ice cream sales while avoiding data leakage. We simulate data, split and preprocess it, and build a pipeline that runs every step, from data preparation to evaluation, in the correct order. Finally, we explain why and how to save the trained model for future use.

<a name="setting-up"></a>
## 2. Setting Up: Importing Required Libraries

Before starting our analysis, we first import all the necessary Python libraries. These include libraries for numerical computations, data manipulation, visualization, and machine learning.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

import pickle  # For model serialization

print("All libraries have been successfully imported.")

<a name="data-simulation"></a>
## 3. Data Simulation: Creating Sales Data

We simulate 90 days of ice cream sales data. The features include:

- **Temperature:** Simulated using a normal distribution (mean = 25°C, standard deviation = 3).
- **Promotion:** A binary variable indicating whether a promotion is active (1) or not (0), with a 30% chance of promotion.

Sales are computed using the formula:

  sales = 300 + 12 × temperature + 60 × promotion + random noise (mean = 0, std = 20)

This formula is designed to mimic a scenario where higher temperatures and promotions increase sales.

In [None]:
np.random.seed(42)

# 1. Generate dates (90 days)
n_days = 90
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]

# 2. Generate temperature and promotion features
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])

# 3. Compute sales using the given formula
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)

# 4. Create a DataFrame
df = pd.DataFrame({
    'date': dates,
    'temperature': temperatures,
    'promotion': promotions,
    'sales': sales.round().astype(int)
})

print(df.head())
print("\n(Data simulation successful.)")
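
As a quick sanity check on the simulation, one option is to fit an ordinary least-squares model and see whether it roughly recovers the coefficients in the formula above (12 per °C and 60 per promotion). The sketch below is not part of the original lecture: it regenerates the data with the same seed so it can run on its own, and `check_model` and `X_check` are illustrative names. With only 90 noisy observations the estimates should be close to, but not exactly equal to, the true values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Re-create the simulated data with the same seed and call order as the cell above
np.random.seed(42)
n_days = 90
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)

# Fit a plain linear regression on both features (illustrative check only)
X_check = np.column_stack([temperatures, promotions])
check_model = LinearRegression().fit(X_check, sales)

coef_temperature, coef_promotion = check_model.coef_
print(f"Estimated temperature coefficient: {coef_temperature:.2f} (true value: 12)")
print(f"Estimated promotion coefficient:   {coef_promotion:.2f} (true value: 60)")
print(f"Estimated intercept:               {check_model.intercept_:.2f} (true value: 300)")
```

If the estimates were far from the true values, that would suggest a bug in the simulation code rather than a modeling problem.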

<a name="data-visualization"></a>
## 4. Data Visualization: Spotting Trends

Visualizing the data helps us understand patterns. We will create two visualizations:

1. **Scatter Plot (Temperature vs Sales):** This plot shows how temperature and promotions influence sales.
2. **Boxplot (Promotion vs Sales):** This plot compares sales on days with and without promotions.

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(df['temperature'], df['sales'], c=df['promotion'], cmap='coolwarm', alpha=0.7)
plt.title('Temperature vs Sales (Color by Promotion)')
plt.xlabel('Temperature (°C)')
plt.ylabel('Sales')
cbar = plt.colorbar()
cbar.set_label('Promotion', rotation=270, labelpad=15)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(6,4))
groups = df.groupby('promotion')['sales']
labels = ['No Promo (0)', 'Promo (1)']
data_list = [groups.get_group(g) for g in sorted(groups.groups.keys())]
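plt.boxplot(data_list)
# Set the tick labels separately: the `labels` keyword of boxplot is deprecated in recent Matplotlib releases
plt.xticks([1, 2], labels)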
plt.title('Promotion vs Sales - Boxplot')
plt.ylabel('Sales')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

print("The boxplot shows higher sales during promotions.")

<a name="naive-approach"></a>
## 5. A Naïve Approach: Using the Entire Dataset

A naïve approach is to use the entire dataset for both training and evaluation. Although the model may appear to perform well (high R² score), this approach can conceal issues such as overfitting and does not reflect the model's true generalization capability.

In [None]:
print("=== Naïve Approach: Using the entire dataset for training and evaluation ===")

X_all = df[['temperature', 'promotion']]
y_all = df['sales']

model_all = RandomForestRegressor(random_state=42)
model_all.fit(X_all, y_all)

r2_all = model_all.score(X_all, y_all)
print(f"Model R^2 using the entire dataset: {r2_all:.4f}")

<a name="demonstrating-leakage"></a>
## 6. Demonstrating Data Leakage

Data leakage occurs when the model is trained with information that would not be available at prediction time. An extreme example is accidentally including the target variable (or a feature derived from it) among the input features. The example below demonstrates how such leakage leads to unrealistically high performance.

In [None]:
print("=== Demonstrating Data Leakage ===")

X_leak = df[['temperature', 'promotion']].copy()
X_leak['target_leak'] = df['sales']  # Incorrectly adding the target as a feature
y_leak = df['sales']

X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(
    X_leak, y_leak, test_size=0.2, random_state=42
)

leak_model = RandomForestRegressor(random_state=42)
leak_model.fit(X_train_leak, y_train_leak)

train_score_leak = leak_model.score(X_train_leak, y_train_leak)
test_score_leak = leak_model.score(X_test_leak, y_test_leak)

print(f"Training R^2 with leakage: {train_score_leak:.4f}")
print(f"Testing R^2 with leakage: {test_score_leak:.4f}")
print("Notice how the test score is suspiciously high—this indicates leakage.")
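
One simple diagnostic for this kind of leakage is to check how strongly each feature correlates with the target before training. The sketch below is an assumption-laden illustration rather than a general-purpose detector: the 0.99 threshold is arbitrary, the `_demo` names are hypothetical, and the data is regenerated so the block runs on its own. A near-perfect correlation does not prove leakage, but it is a strong hint that the feature deserves a closer look.

```python
import numpy as np
import pandas as pd

# Re-create the simulated data (same seed and call order as the simulation cell)
np.random.seed(42)
n_days = 90
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)
df_demo = pd.DataFrame({
    'temperature': temperatures,
    'promotion': promotions,
    'sales': sales.round().astype(int),
})

# Build the leaky feature matrix exactly as in the demonstration above
X_leak_demo = df_demo[['temperature', 'promotion']].copy()
X_leak_demo['target_leak'] = df_demo['sales']

# Absolute Pearson correlation of every feature with the target
feature_target_corr = (
    X_leak_demo.corrwith(df_demo['sales']).abs().sort_values(ascending=False)
)
print(feature_target_corr)

# Flag features that are almost perfectly correlated with the target
# (0.99 is an arbitrary threshold chosen for this illustration)
suspicious = feature_target_corr[feature_target_corr > 0.99].index.tolist()
print("Features to inspect for leakage:", suspicious)
```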

<a name="careful-approach"></a>
## 7. A Careful Approach: Proper Train-Test Split and Standardization

A better strategy involves:

- Splitting the data into training and test sets.
- Fitting a scaler on the training data.
- Applying the same transformation to the test data.

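This approach prevents data leakage: the scaler's mean and standard deviation are computed from the training data only, so the test data remains unseen during both preprocessing and model training. (Random forests do not require feature scaling; it is included here to illustrate the correct order of operations.)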

In [None]:
print("=== Careful Approach: Manual Data Splitting and Standardization ===")

X = df[['temperature', 'promotion']]
y = df['sales']

# 1. Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# 2. Scale the training data using fit_transform()
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# 3. Transform the test data using the same scaler
X_test_scaled = scaler.transform(X_test)

# 4. Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train_scaled, y_train)

# 5. Evaluate the model
r2_train = model.score(X_train_scaled, y_train)
r2_test = model.score(X_test_scaled, y_test)

print(f"Training R^2: {r2_train:.4f}")
print(f"Testing R^2: {r2_test:.4f}")

<a name="fit-vs-transform"></a>
## 8. Detailed Illustration: fit() vs. transform()

When preprocessing, **fit()** calculates necessary parameters (e.g., mean and standard deviation) from the training data. **transform()** then applies these parameters to new data. It is critical to:

- Use `fit_transform()` only on the training data.
- Use `transform()` on the test data.

Below is an illustration using just the temperature feature.

In [None]:
# Using only the temperature feature for demonstration
X_ice = df[['temperature']]
y_ice = df['sales']

X_train_ice, X_test_ice, y_train_ice, y_test_ice = train_test_split(
    X_ice, y_ice, test_size=0.2, random_state=42
)

print("Training set size:", X_train_ice.shape, "Test set size:", X_test_ice.shape)

In [None]:
# Correct: fit on training set and then transform test set
scaler_correct = StandardScaler()
X_train_ice_scaled_correct = scaler_correct.fit_transform(X_train_ice)
X_test_ice_scaled_correct = scaler_correct.transform(X_test_ice)

print("Transformed training data (first 5 rows):")
print(X_train_ice_scaled_correct[:5])
print("\nTransformed test data (first 5 rows):")
print(X_test_ice_scaled_correct[:5])

In [None]:
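# Incorrect: fit_transform() on the test data re-estimates the mean and std from the test set,
# so the test features are no longer scaled with the parameters learned from the training set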
scaler_wrong = StandardScaler()
X_test_ice_scaled_wrong = scaler_wrong.fit_transform(X_test_ice)

print("\n(Incorrect) Transformed test data using fit_transform (first 5 rows):")
print(X_test_ice_scaled_wrong[:5])
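
To see why the second variant is a problem, it can help to compare the parameters each scaler actually learned. The following sketch uses a simplified, hypothetical re-creation of the temperature feature (the `rng` seed and all `_demo` names are assumptions, not taken from the cells above). It shows that a scaler fitted on the test set learns a different mean and standard deviation, and that it forces the test data to mean 0 and standard deviation 1 regardless of how the training data was scaled.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the temperature column (90 days, mean 25, std 3)
rng = np.random.default_rng(0)
temperature_demo = rng.normal(loc=25, scale=3, size=(90, 1))

X_train_demo, X_test_demo = train_test_split(
    temperature_demo, test_size=0.2, random_state=42
)

# Correct: parameters come from the training set only
scaler_train = StandardScaler().fit(X_train_demo)
# Incorrect: parameters re-estimated from the test set
scaler_test = StandardScaler().fit(X_test_demo)

print(f"Learned from training set: mean={scaler_train.mean_[0]:.3f}, std={scaler_train.scale_[0]:.3f}")
print(f"Learned from test set:     mean={scaler_test.mean_[0]:.3f}, std={scaler_test.scale_[0]:.3f}")

test_scaled_correct = scaler_train.transform(X_test_demo)
test_scaled_wrong = scaler_test.transform(X_test_demo)

# The wrongly scaled test data is centered at 0 with unit variance by construction;
# the correctly scaled test data generally is not, because it reuses the training parameters
print(f"Correct test scaling: mean={test_scaled_correct.mean():.3f}, std={test_scaled_correct.std():.3f}")
print(f"Wrong test scaling:   mean={test_scaled_wrong.mean():.3f}, std={test_scaled_wrong.std():.3f}")
```

A model trained on features scaled with the training parameters would receive systematically shifted inputs if the test set were scaled with its own statistics.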

<a name="pipelines"></a>
## 9. Leveraging Pipelines to Prevent Data Leakage

Pipelines combine multiple preprocessing steps with model training so that every process occurs in a fixed, leak-free order. In our example, we build a pipeline that scales selected features and then trains a Random Forest model.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

print("=== Using a Pipeline to Combine Preprocessing and Modeling ===")

# Split the data
X_pipeline = df[['temperature', 'promotion']]
y_pipeline = df['sales']

X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = train_test_split(
    X_pipeline, y_pipeline, test_size=0.2, random_state=42
)

# Create a preprocessor that scales 'temperature' and leaves 'promotion' unchanged
preprocessor = ColumnTransformer([
    ('scale_temp', StandardScaler(), ['temperature'])
], remainder='passthrough')

# Build the pipeline: preprocessing followed by model training
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

# Fit the pipeline (fit_transform is applied on the training data internally)
pipeline.fit(X_train_pipe, y_train_pipe)

r2_train_pipe = pipeline.score(X_train_pipe, y_train_pipe)
r2_test_pipe = pipeline.score(X_test_pipe, y_test_pipe)

print(f"Pipeline Training R^2: {r2_train_pipe:.4f}")
print(f"Pipeline Testing R^2: {r2_test_pipe:.4f}")
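
One way to convince yourself that the pipeline fits the scaler on the training data only is to inspect the fitted scaler inside it. The sketch below rebuilds the same kind of pipeline on regenerated data so that it runs on its own (the `_demo` names are illustrative), then retrieves the scaler through `named_steps` and `named_transformers_` and compares its learned mean with the training-set and full-dataset means.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Re-create the simulated data (same seed and call order as the simulation cell)
np.random.seed(42)
n_days = 90
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)
df_demo = pd.DataFrame({'temperature': temperatures, 'promotion': promotions, 'sales': sales})

X_demo = df_demo[['temperature', 'promotion']]
y_demo = df_demo['sales']
X_train_demo, X_test_demo, y_train_demo, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

pipeline_demo = Pipeline([
    ('preprocessor', ColumnTransformer(
        [('scale_temp', StandardScaler(), ['temperature'])],
        remainder='passthrough',
    )),
    ('regressor', RandomForestRegressor(random_state=42)),
])
pipeline_demo.fit(X_train_demo, y_train_demo)

# Pull the fitted scaler out of the pipeline and inspect what it learned
fitted_scaler = pipeline_demo.named_steps['preprocessor'].named_transformers_['scale_temp']
print(f"Mean stored in the pipeline's scaler: {fitted_scaler.mean_[0]:.4f}")
print(f"Mean of training temperatures:        {X_train_demo['temperature'].mean():.4f}")
print(f"Mean of all temperatures:             {X_demo['temperature'].mean():.4f}")
```

The first two numbers should agree, which indicates that the test rows played no part in fitting the scaler.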

<a name="model-persistence"></a>
## 10. Model Persistence with pickle: Saving the Best Model

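After training a robust model using a pipeline, it is important to save it for future use. We use `pickle` to serialize the pipeline, so the fitted scaler and the model are stored together. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.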

In [None]:
print("=== Saving and Loading the Pipeline using pickle ===")

model_filename = "model_pipeline.pkl"
with open(model_filename, 'wb') as f:
    pickle.dump(pipeline, f)
print(f"Model has been saved to {model_filename}")

# To load the model later:
with open(model_filename, 'rb') as f:
    loaded_pipeline = pickle.load(f)
print(f"Pipeline loaded from {model_filename} and ready to use.")

# Validate the loaded model by scoring on the test data
loaded_test_score = loaded_pipeline.score(X_test_pipe, y_test_pipe)
print(f"Loaded Pipeline Testing R^2: {loaded_test_score:.4f}")
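
Pickled scikit-learn objects are not guaranteed to load correctly under a different scikit-learn version, so one possible convention is to store the library version next to the model. The sketch below is only one way to do this: the bundle layout, the file name `model_bundle_demo.pkl`, and the small stand-in pipeline are assumptions made for illustration. It also checks that the reloaded pipeline produces the same predictions as the original.

```python
import os
import pickle
import tempfile

import numpy as np
import sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Small stand-in pipeline trained on hypothetical data (not the lecture's DataFrame)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=25, scale=3, size=(90, 1))
y_demo = 300 + 12 * X_demo[:, 0] + rng.normal(0, 20, size=90)
pipeline_demo = Pipeline([
    ('scaler', StandardScaler()),
    ('regressor', RandomForestRegressor(random_state=42)),
]).fit(X_demo, y_demo)

# Store the model together with the library version used to train it
bundle = {'pipeline': pipeline_demo, 'sklearn_version': sklearn.__version__}

with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, 'model_bundle_demo.pkl')
    with open(bundle_path, 'wb') as f:
        pickle.dump(bundle, f)
    with open(bundle_path, 'rb') as f:
        loaded_bundle = pickle.load(f)

# Warn if the model was saved under a different scikit-learn version
if loaded_bundle['sklearn_version'] != sklearn.__version__:
    print("Warning: model was saved with scikit-learn", loaded_bundle['sklearn_version'])

# The reloaded pipeline should reproduce the original predictions
predictions_match = np.allclose(
    pipeline_demo.predict(X_demo), loaded_bundle['pipeline'].predict(X_demo)
)
print("Predictions identical after reload:", predictions_match)
```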

<a name="extended-concepts"></a>
## 11. Extended Concepts: Train, Validation, and Test Datasets & The ML Pipeline

In more advanced settings, data is often divided into three sets:

- **Training Set:** Used for model fitting.
- **Validation Set:** Used to fine-tune the model (e.g., hyperparameter tuning).
- **Test Set:** Used for a final unbiased evaluation of the model.
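
A three-way split like this can be sketched with two successive calls to `train_test_split`. The 60/20/20 proportions and the hypothetical stand-in data below are assumptions chosen for illustration; other ratios are equally valid. The test set is held out first, and the validation set is then carved out of what remains.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data with the same size as the lecture's dataset (90 rows)
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=25, scale=3, size=(90, 1))
y_demo = 300 + 12 * X_demo[:, 0] + rng.normal(0, 20, size=90)

# Step 1: hold out 20% of all rows as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: take 25% of the remaining 80% (i.e. 20% of the total) as the validation set
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print("Train rows:     ", len(X_train_demo))
print("Validation rows:", len(X_val_demo))
print("Test rows:      ", len(X_test_demo))
```

Hyperparameters would then be tuned against the validation set, and the test set would be scored only once, at the very end.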

A comprehensive machine learning pipeline involves:

1. **Data Preparation and Cleaning:** Select and clean your features.
2. **Feature Engineering:** Create and select relevant features.
3. **Data Splitting:** Divide data into training, validation, and test sets.
4. **Model Selection and Training:** Experiment with and tune various models.
5. **Evaluation:** Rigorously assess model performance.
6. **Deployment and Monitoring:** Deploy the model and monitor its performance over time.

Additionally, environments like Anaconda allow easy package and environment management, helping maintain reproducibility.

<a name="summary"></a>
## 12. Summary and Best Practices

**Key Takeaways:**

- **Prevent Data Leakage:** Never allow test data to influence model training. Fit preprocessing steps (such as scalers) on the training set only, then apply the fitted transformation to the test set.
- **Data Splitting:** Properly divide your dataset into training, validation, and test sets for a more accurate evaluation.
- **Use Pipelines:** Encapsulate preprocessing and modeling steps into a pipeline to maintain a consistent and leak-free process.
- **Model Persistence:** Save your trained pipeline using tools like pickle so it can be reused and deployed later without retraining.
- **Development Tools:** Leverage environments and tools (such as Anaconda) for easy management of packages and dependencies.

By adhering to these best practices, you can build robust machine learning models that generalize well and perform reliably in real-world applications.

Happy modeling!