# Lecture Notes: Preventing Data Leakage and Building Robust Machine Learning Pipelines

Using an ice cream sales forecasting example, these notes cover how to simulate and visualize data, prevent data leakage with proper splits and preprocessing, build pipelines, and persist trained models.

## Table of Contents (5 Sections)
1. Introduction & Environment Setup
2. Data Simulation, Visualization & Extended Concepts
3. Preventing Data Leakage
4. Using Pipelines, a Careful Approach & Model Persistence
5. Summary & Best Practices

## 1. Introduction & Environment Setup

In this lecture, we explore how to forecast ice cream sales while avoiding data leakage. We will learn how to simulate data, properly preprocess it, build robust machine learning pipelines, and save our models for future use.

Let's start by importing our required libraries.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import r2_score

import pickle  # For model serialization

print("All libraries have been successfully imported.")

## 2. Data Simulation, Visualization & Extended Concepts

### Data Simulation

We simulate 90 days of ice cream sales data. The features include:

- **Temperature:** Simulated using a normal distribution with a mean of 25°C and a standard deviation of 3.
- **Promotion:** A binary feature representing whether a promotion was active (1) or not (0) with a 30% chance.

Sales are computed using the formula:

  sales = 300 + 12 × temperature + 60 × promotion + random noise (mean = 0, std = 20)

This formula mimics a scenario in which higher temperatures and promotions increase ice cream sales.
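
As a quick sanity check on the formula, the sketch below evaluates its noise-free part for a few hand-picked days. The helper name `expected_sales` is only an illustration and is not used elsewhere in these notes.

```python
def expected_sales(temperature, promotion):
    """Noise-free part of the simulation formula (hypothetical helper)."""
    return 300 + 12 * temperature + 60 * promotion

baseline = expected_sales(25, 0)   # average-temperature day, no promotion
promo_day = expected_sales(25, 1)  # same temperature with a promotion
hot_day = expected_sales(28, 0)    # one standard deviation warmer, no promotion

print(baseline, promo_day, hot_day)  # prints: 600 660 636
```

A promotion adds 60 units, the same lift as a 5°C rise in temperature (12 × 5 = 60). The noise term (std = 20) is small relative to either effect, so both should be clearly visible in the plots.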

In [None]:
np.random.seed(42)

# Generate dates (90 days)
n_days = 90
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]

# Generate temperature and promotion features
temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])

# Compute sales using the given formula
sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)

# Create a DataFrame
df = pd.DataFrame({
    'date': dates,
    'temperature': temperatures,
    'promotion': promotions,
    'sales': sales.round().astype(int)
})

print(df.head())
print("\n(Data simulation successful.)")

### Data Visualization

Visualizing data helps reveal patterns and trends. We will create two plots:

1. **Scatter Plot (Temperature vs Sales):** Shows how temperature (and promotions) affect sales.
2. **Boxplot (Promotion vs Sales):** Compares sales on days with and without promotions.

In [None]:
plt.figure(figsize=(6,4))
plt.scatter(df['temperature'], df['sales'], c=df['promotion'], cmap='coolwarm', alpha=0.7)
plt.title('Temperature vs Sales (Color by Promotion)')
plt.xlabel('Temperature (°C)')
plt.ylabel('Sales')
cbar = plt.colorbar()
cbar.set_label('Promotion', rotation=270, labelpad=15)
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

In [None]:
plt.figure(figsize=(6,4))
groups = df.groupby('promotion')['sales']
labels = ['No Promo (0)', 'Promo (1)']
data_list = [groups.get_group(g) for g in sorted(groups.groups.keys())]
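plt.boxplot(data_list)
# Label the two boxes via xticks (boxes sit at positions 1 and 2 by default);
# this avoids the boxplot `labels` keyword, which newer matplotlib versions deprecate.
plt.xticks([1, 2], labels)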
plt.title('Promotion vs Sales - Boxplot')
plt.ylabel('Sales')
plt.grid(True, linestyle='--', alpha=0.5)
plt.tight_layout()
plt.show()

print("The boxplot shows higher sales during promotions.")

### Extended Concepts: Data Splitting & The Full ML Pipeline

In more advanced workflows, it is common to split data into multiple sets:

- **Training Set:** Used for fitting model parameters.
- **Validation Set:** Used for hyperparameter tuning and model selection.
- **Test Set:** Used for final unbiased evaluation.

A complete machine learning pipeline encompasses:
1. **Data Preparation and Cleaning:** Selecting and cleaning your features.
2. **Feature Engineering:** Creating new features or transforming data.
3. **Data Splitting:** Reserving data for training, validation, and testing.
4. **Model Selection and Training:** Experimenting with different algorithms and tuning parameters.
5. **Evaluation:** Assessing model performance thoroughly.
6. **Deployment and Monitoring:** Deploying the model and tracking its performance over time.

Splitting the data before any fitting step, and leaving the test set untouched until the final evaluation, reduces the risk of data leakage and gives a more trustworthy estimate of how the model will perform on new data.
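
One common way to obtain the three sets is to call `train_test_split` twice. The sketch below assumes a 60/20/20 split on a small synthetic dataset; the proportions and the `*_demo` variable names are illustrative choices, not requirements.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Small synthetic dataset (assumed for illustration only)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "temperature": rng.normal(25, 3, size=100),
    "promotion": rng.integers(0, 2, size=100),
})
y_demo = 300 + 12 * X_demo["temperature"] + 60 * X_demo["promotion"]

# Step 1: hold out 20% of the rows as the final test set
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

# Step 2: split the remaining 80% into training and validation sets
# (0.25 of the remaining 80% equals 20% of the original data)
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

Hyperparameters would then be tuned against the validation set, and the test set would be scored once at the very end.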

## 3. Preventing Data Leakage

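Data leakage happens when information that would not be available at prediction time, such as statistics of the test set or the target itself, inadvertently influences model training. Common causes include fitting preprocessing steps on the full dataset before splitting and including the target variable (or a proxy for it) as a feature.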

### Example: Naïve Approach

A naïve approach uses the entire dataset for both training and evaluation. The model may achieve a high R² score, but that score only measures how well it fits data it has already seen, so it says little about performance on new data.

In [None]:
print("=== Naïve Approach: Using the entire dataset ===")

X_all = df[['temperature', 'promotion']]
y_all = df['sales']

model_all = RandomForestRegressor(random_state=42)
model_all.fit(X_all, y_all)

r2_all = model_all.score(X_all, y_all)
print(f"Model R^2 (all data): {r2_all:.4f}")

### Demonstrating Data Leakage

Including the target variable as part of the feature set (an accidental 'leak') can result in unrealistically high performance. In the code below, the target is mistakenly added as a feature.

In [None]:
print("=== Demonstrating Data Leakage ===")

X_leak = df[['temperature', 'promotion']].copy()
X_leak['target_leak'] = df['sales']  # Incorrectly add the target
y_leak = df['sales']

X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(
    X_leak, y_leak, test_size=0.2, random_state=42
)

leak_model = RandomForestRegressor(random_state=42)
leak_model.fit(X_train_leak, y_train_leak)

train_score_leak = leak_model.score(X_train_leak, y_train_leak)
test_score_leak = leak_model.score(X_test_leak, y_test_leak)

print(f"Training R^2 (with leakage): {train_score_leak:.4f}")
print(f"Testing R^2 (with leakage): {test_score_leak:.4f}")
print("Notice how the test score is suspiciously high – a sign of leakage.")
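
A simple heuristic for spotting this kind of leak is to look for features that are almost perfectly correlated with the target. The sketch below is one possible check, not a complete leakage detector: the helper `suspicious_features` and the 0.99 threshold are assumptions made for illustration, and the data is re-simulated so the block runs on its own.

```python
import numpy as np
import pandas as pd

# Re-simulate a small dataset so this example is self-contained
rng = np.random.default_rng(0)
n = 90
demo = pd.DataFrame({
    "temperature": rng.normal(25, 3, size=n),
    "promotion": rng.choice([0, 1], size=n, p=[0.7, 0.3]),
})
demo["sales"] = (
    300 + 12 * demo["temperature"] + 60 * demo["promotion"]
    + rng.normal(0, 20, size=n)
)
demo["target_leak"] = demo["sales"]  # deliberately leaked copy of the target

def suspicious_features(frame, target, threshold=0.99):
    """Return feature names whose absolute correlation with the target exceeds threshold."""
    corr = frame.drop(columns=[target]).corrwith(frame[target]).abs()
    return corr[corr > threshold].index.tolist()

flagged = suspicious_features(demo, "sales")
print(flagged)
```

Only `target_leak` should be flagged here. Real leaks are often subtler (for example, a feature recorded after the outcome is known), so a check like this complements, rather than replaces, a review of where each feature comes from.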

## 4. Using Pipelines, a Careful Approach & Model Persistence

A robust approach entails splitting the data into training and test sets first, fitting any preprocessing (such as a standard scaler) on the training data only and then applying it unchanged to the test data, and using pipelines to encapsulate these steps. Finally, the trained model is persisted for future use.

In [None]:
print("=== Careful Approach: Data Splitting and Standardization ===")

X = df[['temperature', 'promotion']]
y = df['sales']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Fit the scaler on the training data only, then scale the training data
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)

# Transform the test data with the already-fitted scaler (no refitting)
X_test_scaled = scaler.transform(X_test)

# Train the model
model = RandomForestRegressor(random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate the model
r2_train = model.score(X_train_scaled, y_train)
r2_test = model.score(X_test_scaled, y_test)

print(f"Training R^2: {r2_train:.4f}")
print(f"Testing R^2: {r2_test:.4f}")
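
To make the "fit on training data only" rule concrete, the sketch below checks which statistics a fitted scaler actually stores. It uses a freshly simulated temperature column, and the names `temps` and `demo_scaler` are illustrative.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Simulated temperatures as a single-column feature matrix
rng = np.random.default_rng(0)
temps = rng.normal(25, 3, size=(90, 1))
temps_train, temps_test = train_test_split(temps, test_size=0.2, random_state=42)

# Fit on the training split only
demo_scaler = StandardScaler().fit(temps_train)

print("Mean stored in scaler:", demo_scaler.mean_[0])
print("Training-split mean:  ", temps_train.mean())
print("Full-dataset mean:    ", temps.mean())

# Scaled training data is centered at zero; scaled test data generally is not
train_scaled_mean = demo_scaler.transform(temps_train).mean()
test_scaled_mean = demo_scaler.transform(temps_test).mean()
print("Scaled train mean:", train_scaled_mean)
print("Scaled test mean: ", test_scaled_mean)
```

The stored mean matches the training split rather than the full dataset, which is exactly what keeps test-set statistics out of training. Note that tree-based models such as random forests are largely insensitive to monotonic rescaling of features, so the scaling step here mainly illustrates the workflow; it matters more for linear or distance-based models.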

### Using Pipelines to Prevent Data Leakage

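Pipelines allow you to encapsulate preprocessing and modeling steps in one object. Calling `fit` on a pipeline fits every preprocessing step only on the data passed in, so steps such as scaling are learned from the training data alone and are then applied unchanged, in the same order, at prediction time.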

Below, we create a pipeline to scale the 'temperature' feature and train a Random Forest model.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

print("=== Pipeline Construction and Modeling ===")

# Prepare data for the pipeline
X_pipeline = df[['temperature', 'promotion']]
y_pipeline = df['sales']

X_train_pipe, X_test_pipe, y_train_pipe, y_test_pipe = train_test_split(
    X_pipeline, y_pipeline, test_size=0.2, random_state=42
)

# Define a preprocessor to scale 'temperature'
preprocessor = ColumnTransformer([
    ('scale_temp', StandardScaler(), ['temperature'])
], remainder='passthrough')

# Build the pipeline
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

# Train the pipeline
pipeline.fit(X_train_pipe, y_train_pipe)

r2_train_pipe = pipeline.score(X_train_pipe, y_train_pipe)
r2_test_pipe = pipeline.score(X_test_pipe, y_test_pipe)

print(f"Pipeline Training R^2: {r2_train_pipe:.4f}")
print(f"Pipeline Testing R^2: {r2_test_pipe:.4f}")
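
A further benefit of wrapping preprocessing in a pipeline is that it can be passed directly to cross-validation, where the scaler is refit on the training portion of each fold. The sketch below assumes 5-fold cross-validation on re-simulated data; the names `X_cv`, `y_cv`, and `cv_pipeline` are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Re-simulate data so this example is self-contained
rng = np.random.default_rng(0)
n_days = 90
X_cv = pd.DataFrame({
    "temperature": rng.normal(25, 3, size=n_days).round(1),
    "promotion": rng.choice([0, 1], size=n_days, p=[0.7, 0.3]),
})
y_cv = (
    300 + 12 * X_cv["temperature"] + 60 * X_cv["promotion"]
    + rng.normal(0, 20, size=n_days)
)

cv_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer(
        [("scale_temp", StandardScaler(), ["temperature"])],
        remainder="passthrough",
    )),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])

# Each fold refits the whole pipeline (scaler included) on that fold's training rows
cv_scores = cross_val_score(cv_pipeline, X_cv, y_cv, cv=5, scoring="r2")
print("Fold R^2 scores:", cv_scores.round(3))
print("Mean R^2:", cv_scores.mean().round(3))
```

Scaling the full dataset first and then cross-validating only the regressor would let each validation fold influence the scaler, which is the kind of leak the pipeline avoids.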

### Model Persistence

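Once you have trained a robust model, it is important to save it so that it can be used later without retraining. We use the `pickle` module to serialize and deserialize our pipeline. Because the whole pipeline is saved, the fitted scaler travels with the model. Only unpickle files from sources you trust, and load them with the same scikit-learn version that was used for training.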

In [None]:
print("=== Saving and Loading the Pipeline using pickle ===")

model_filename = "model_pipeline.pkl"
with open(model_filename, 'wb') as f:
    pickle.dump(pipeline, f)
print(f"Model has been saved to {model_filename}")

# Later, load the saved model
with open(model_filename, 'rb') as f:
    loaded_pipeline = pickle.load(f)
print(f"Loaded pipeline from {model_filename} and ready to use.")

# Validate loaded pipeline
loaded_test_score = loaded_pipeline.score(X_test_pipe, y_test_pipe)
print(f"Loaded Pipeline Testing R^2: {loaded_test_score:.4f}")
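
In practice, a restored pipeline is used to predict on new, unseen days. The sketch below round-trips a small pipeline through `pickle` in memory (using `pickle.dumps` and `pickle.loads` instead of a file) and predicts for two hypothetical days; the data and the names `demo_pipeline`, `restored`, and `new_days` are assumptions for illustration.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Re-simulate training data so this example is self-contained
rng = np.random.default_rng(0)
n_days = 90
X_hist = pd.DataFrame({
    "temperature": rng.normal(25, 3, size=n_days).round(1),
    "promotion": rng.choice([0, 1], size=n_days, p=[0.7, 0.3]),
})
y_hist = (
    300 + 12 * X_hist["temperature"] + 60 * X_hist["promotion"]
    + rng.normal(0, 20, size=n_days)
)

demo_pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("regressor", RandomForestRegressor(n_estimators=50, random_state=42)),
])
demo_pipeline.fit(X_hist, y_hist)

# Serialize to bytes and restore, mimicking a save/load cycle
restored = pickle.loads(pickle.dumps(demo_pipeline))

# New days must use the same column names and order as the training data
new_days = pd.DataFrame({"temperature": [22.0, 30.0], "promotion": [0, 1]})
original_pred = demo_pipeline.predict(new_days)
restored_pred = restored.predict(new_days)

print("Predictions:", restored_pred.round(1))
print("Identical after round trip:", np.allclose(original_pred, restored_pred))
```

The raw, unscaled features are passed in; the restored pipeline applies its stored scaler before the regressor sees them.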

## 5. Summary & Best Practices

**Key Recommendations:**

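- **Prevent Data Leakage:** Fit preprocessing steps (such as scalers) on the training data only, then apply them unchanged to the test data, either manually or via pipelines, so that test data never influences model training. Never include the target, or a proxy for it, among the features.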
- **Proper Data Splitting:** Divide your data into training (and optionally validation) and test sets to evaluate how the model will generalize.
- **Leverage Pipelines:** Use pipelines to enforce a consistent, leak-free workflow from preprocessing to modeling.
- **Save Your Work:** Persist trained pipelines (preprocessing and model together) using tools like pickle so they can be reused and deployed without retraining.

### Installing Anaconda (Optional, but Recommended)

For a robust and reliable data science environment, it is recommended to install Anaconda. Anaconda simplifies package management and provides a pre-configured Python distribution with many of the required libraries.

**To install Anaconda:**
1. Visit the official [Anaconda Distribution](https://www.anaconda.com/products/distribution) page.
2. Download the installer for your operating system (Windows, macOS, or Linux).
3. Follow the installation instructions provided on the website.
4. Once installed, open Anaconda Navigator and launch Jupyter Notebook to begin working.

By following these best practices, you can build robust machine learning models that generalize well and are production-ready.

Happy modeling!