# In-Class Exercise: Preventing Data Leakage & Building Robust ML Pipelines

In this exercise, you will apply what you have learned about data splitting, avoiding data leakage, using pipelines for preprocessing and modeling, and saving/loading your models. Follow each task and complete the required code in the provided cells.

## Instructions

1. **Import Libraries & Simulate Data:**
   - Import the required libraries.
   - Simulate 90 days of ice cream sales data. (A template is provided below.)

2. **Task 1 – Data Splitting and Standardization:**
   - Split the data into training and testing sets.
   - Apply standard scaling to the training data using `fit_transform()` and transform the test data using `transform()`.
   - *Hint:* Use the `StandardScaler` from scikit-learn.

3. **Task 2 – Building a Pipeline:**
   - Construct an ML pipeline that scales the `temperature` feature and passes the `promotion` feature through unchanged.
   - Append a RandomForestRegressor to your pipeline.

4. **Task 3 – Demonstrating Data Leakage:**
   - Create a version of your dataset that erroneously includes the target variable as a feature.
   - Train a model on this leaked dataset and observe the difference in performance.

5. **Task 4 – Model Persistence:**
   - Save your trained pipeline to disk using `pickle`.
   - Load the saved model and validate it using the test data.

6. **Reflection:**
   - In your own words, describe why it is important to prevent data leakage and how pipelines help in maintaining reproducible workflows.


## Step 1: Import Libraries & Simulate Data

Use the following template code to import libraries and simulate ice cream sales data.

In [None]:
# Import necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from datetime import datetime, timedelta
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

import pickle

print("Libraries imported.")

# Simulate 90 days of ice cream sales data
np.random.seed(42)
n_days = 90
start_date = datetime(2024, 1, 1)
dates = [start_date + timedelta(days=i) for i in range(n_days)]

temperatures = np.random.normal(loc=25, scale=3, size=n_days).round(1)
promotions = np.random.choice([0, 1], size=n_days, p=[0.7, 0.3])

sales = 300 + 12 * temperatures + 60 * promotions + np.random.normal(0, 20, size=n_days)

df = pd.DataFrame({
    'date': dates,
    'temperature': temperatures,
    'promotion': promotions,
    'sales': sales.round().astype(int)
})

print(df.head())
print("Data simulation complete.")

## Task 1 – Data Splitting and Standardization

Split the simulated data into training and testing sets. Then, apply `StandardScaler` to the training set using `fit_transform()` and to the test set using `transform()`. Complete the code in the cell below.

In [None]:
# TODO: Split the data into features (X) and target (y).
X = df[['temperature', 'promotion']]
y = df['sales']

# TODO: Split the data into training and test sets (80% training, 20% test)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TODO: Initialize the StandardScaler and apply fit_transform on X_train and transform on X_test for the 'temperature' feature only.
scaler = StandardScaler()

# Apply scaling on the 'temperature' column of the training set
X_train_scaled = X_train.copy()
X_train_scaled['temperature'] = scaler.fit_transform(X_train[['temperature']])

# Apply the same transformation on the test set
X_test_scaled = X_test.copy()
X_test_scaled['temperature'] = scaler.transform(X_test[['temperature']])

print("Data splitting and scaling complete.")
print(X_train_scaled.head())
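
To see why the scaler must be fit on the training rows only, the minimal sketch below contrasts the statistics learned from the training split with those learned from all rows. It is an illustration under stated assumptions, not part of the required solution: the `temperature` array is a hypothetical stand-in for the simulated column, and the names `train_only_scaler` and `all_rows_scaler` are made up for this example.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-in for the simulated temperature column (90 days)
rng = np.random.default_rng(0)
temperature = rng.normal(loc=25, scale=3, size=(90, 1))

temp_train, temp_test = train_test_split(temperature, test_size=0.2, random_state=42)

# Correct: the mean and standard deviation come from the training rows only
train_only_scaler = StandardScaler().fit(temp_train)

# Leaky: the statistics are computed over all rows, including the test rows
all_rows_scaler = StandardScaler().fit(temperature)

print("Mean learned from training rows:", train_only_scaler.mean_[0])
print("Mean learned from all rows:     ", all_rows_scaler.mean_[0])

# The training split is centered at zero; the test split generally is not,
# because its statistics were never seen by the scaler
train_scaled = train_only_scaler.transform(temp_train)
test_scaled = train_only_scaler.transform(temp_test)
print("Scaled training mean:", train_scaled.mean())
print("Scaled test mean:    ", test_scaled.mean())
```

A scaled test mean that is not exactly zero is expected here. It shows that the test rows were treated as unseen data.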

## Task 2 – Building a Pipeline

Build a machine learning pipeline that:

- Preprocesses the data by scaling the `temperature` feature while leaving the `promotion` feature unchanged.
- Trains a `RandomForestRegressor` on the preprocessed data.

Complete the code in the cell below. Hint: use `ColumnTransformer` and `Pipeline` from scikit-learn.

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer

# TODO: Create a ColumnTransformer to scale only the 'temperature' feature
preprocessor = ColumnTransformer(
    transformers=[
        ('scale_temp', StandardScaler(), ['temperature'])
    ],
    remainder='passthrough'
)

# TODO: Construct a Pipeline with the preprocessor and a RandomForestRegressor
pipeline = Pipeline([
    ('preprocessor', preprocessor),
    ('regressor', RandomForestRegressor(random_state=42))
])

# TODO: Train the pipeline using the training data
pipeline.fit(X_train, y_train)

print("Pipeline training complete.")
print(f"Pipeline training R^2: {pipeline.score(X_train, y_train):.4f}")
print(f"Pipeline testing R^2: {pipeline.score(X_test, y_test):.4f}")
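
One way to check that a pipeline keeps preprocessing leak-free is to inspect the scaler it fitted internally. The sketch below rebuilds a comparable pipeline on freshly simulated data and compares the scaler's learned mean with the training and full-data means. The names `X_demo`, `y_demo`, and `demo_pipeline` are hypothetical, and the data uses a different random generator than the template, so the numbers will not match the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical re-simulation of the ice cream data (same formula, new generator)
rng = np.random.default_rng(42)
n_days = 90
temperature = rng.normal(25, 3, n_days).round(1)
promotion = rng.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperature + 60 * promotion + rng.normal(0, 20, n_days)

X_demo = pd.DataFrame({"temperature": temperature, "promotion": promotion})
y_demo = pd.Series(sales, name="sales")
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[("scale_temp", StandardScaler(), ["temperature"])],
        remainder="passthrough",
    )),
    ("regressor", RandomForestRegressor(random_state=42)),
])
demo_pipeline.fit(X_tr, y_tr)

# Retrieve the scaler that was fitted inside the pipeline
fitted_scaler = demo_pipeline.named_steps["preprocessor"].named_transformers_["scale_temp"]
train_mean = X_tr["temperature"].mean()
full_mean = X_demo["temperature"].mean()

print("Scaler mean inside pipeline:", fitted_scaler.mean_[0])
print("Training temperature mean:  ", train_mean)
print("Full-data temperature mean: ", full_mean)

test_r2 = demo_pipeline.score(X_te, y_te)
print(f"Testing R^2: {test_r2:.4f}")
```

The scaler's mean should match the training mean, not the full-data mean, because `fit` only ever saw the training rows.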

## Task 3 – Demonstrating Data Leakage

Create a version of your dataset that erroneously includes the target variable (`sales`) as a feature. Train a model on this leaked dataset and compare the performance with your previous correct approach. Complete the code in the cell below.

In [None]:
# TODO: Create a copy of the dataset and include 'sales' in the features to simulate leakage
X_leak = df[['temperature', 'promotion']].copy()
X_leak['leak_feature'] = df['sales']  # Incorrect addition of target variable as a feature
y_leak = df['sales']

# Split the leaked dataset
X_train_leak, X_test_leak, y_train_leak, y_test_leak = train_test_split(X_leak, y_leak, test_size=0.2, random_state=42)

# Train a RandomForestRegressor on the leaked data
leak_model = RandomForestRegressor(random_state=42)
leak_model.fit(X_train_leak, y_train_leak)

train_score_leak = leak_model.score(X_train_leak, y_train_leak)
test_score_leak = leak_model.score(X_test_leak, y_test_leak)

print(f"Training R^2 with leakage: {train_score_leak:.4f}")
print(f"Testing R^2 with leakage: {test_score_leak:.4f}")
print("Compare this testing R^2 with the Task 2 pipeline: a near-perfect score is a warning sign of data leakage.")
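
A quick way to diagnose this kind of leakage is to look at the feature importances of the leaked model. The sketch below trains a clean and a leaky random forest on hypothetical re-simulated data and prints the importances of the leaky one. The helper `fit_and_score` and the names `X_clean` and `X_leaky` are assumptions introduced for this illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hypothetical re-simulation of the ice cream data (same formula, new generator)
rng = np.random.default_rng(42)
n_days = 90
temperature = rng.normal(25, 3, n_days).round(1)
promotion = rng.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperature + 60 * promotion + rng.normal(0, 20, n_days)

X_clean = pd.DataFrame({"temperature": temperature, "promotion": promotion})
X_leaky = X_clean.assign(leak_feature=sales)  # the target copied in as a feature
y_demo = pd.Series(sales, name="sales")


def fit_and_score(X, y):
    """Hypothetical helper: split, fit a random forest, return it with its test R^2."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=42)
    model = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
    return model, model.score(X_te, y_te)


clean_model, clean_r2 = fit_and_score(X_clean, y_demo)
leaky_model, leaky_r2 = fit_and_score(X_leaky, y_demo)

print(f"Testing R^2 without leakage: {clean_r2:.4f}")
print(f"Testing R^2 with leakage:    {leaky_r2:.4f}")

# Importances sum to 1; a single dominant feature deserves a closer look
importances = pd.Series(leaky_model.feature_importances_, index=X_leaky.columns)
print(importances.sort_values(ascending=False))
```

When one feature carries almost all of the importance and the test score is close to perfect, it is worth checking whether that feature would really be available at prediction time.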

## Task 4 – Model Persistence

Save your trained pipeline (from Task 2) using `pickle` and then load it back. Validate the loaded model by checking its performance on the test set. Complete the code in the cell below.

In [None]:
# TODO: Save the trained pipeline to disk using pickle
model_filename = "student_model_pipeline.pkl"
with open(model_filename, 'wb') as file:
    pickle.dump(pipeline, file)
print(f"Model saved to {model_filename}")

# TODO: Load the saved model and validate it on the test data
with open(model_filename, 'rb') as file:
    loaded_pipeline = pickle.load(file)

loaded_score = loaded_pipeline.score(X_test, y_test)
print(f"Loaded pipeline test score: {loaded_score:.4f}")
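
Beyond comparing scores, a stricter round-trip check is to confirm that the loaded pipeline returns the same predictions as the original. The sketch below does this on hypothetical re-simulated data and writes to a temporary directory so that no file is left behind. The names `demo_pipeline` and `restored_pipeline` are made up for this example.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical re-simulation of the ice cream data (same formula, new generator)
rng = np.random.default_rng(42)
n_days = 90
temperature = rng.normal(25, 3, n_days).round(1)
promotion = rng.choice([0, 1], size=n_days, p=[0.7, 0.3])
sales = 300 + 12 * temperature + 60 * promotion + rng.normal(0, 20, n_days)
X_demo = pd.DataFrame({"temperature": temperature, "promotion": promotion})

demo_pipeline = Pipeline([
    ("preprocessor", ColumnTransformer(
        transformers=[("scale_temp", StandardScaler(), ["temperature"])],
        remainder="passthrough",
    )),
    ("regressor", RandomForestRegressor(random_state=42)),
]).fit(X_demo, sales)

# Save and reload inside a temporary directory that is cleaned up afterwards
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_pipeline.pkl")
    with open(path, "wb") as file:
        pickle.dump(demo_pipeline, file)
    with open(path, "rb") as file:
        restored_pipeline = pickle.load(file)

# The restored pipeline should reproduce the original predictions
same_predictions = np.allclose(
    demo_pipeline.predict(X_demo), restored_pipeline.predict(X_demo)
)
print("Predictions identical after reload:", same_predictions)
```

Two practical cautions apply to `pickle`: only load files from sources you trust, since unpickling can execute arbitrary code, and reload with the same scikit-learn version that saved the model, since compatibility across versions is not guaranteed.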

## Reflection

In a few sentences, describe why it is important to prevent data leakage in model training and how the use of pipelines ensures a more reproducible and robust workflow.

*Write your answers in the cell below (as markdown or comments in a code cell).*

In [None]:
# Your reflection here
# For example:
# "Preventing data leakage is crucial because leaked information (test-set statistics
# or the target itself) is not available at prediction time. A model trained with it
# reports overly optimistic scores and then performs worse on genuinely new data.
# Pipelines bundle preprocessing and modeling into a single object, so transformers are
# fit on the training data only and the same steps are reapplied at prediction time.
# This reduces the chance of accidental leakage and makes experiments reproducible."

# (Feel free to modify the text above with your own reflections.)