# Preprocessing, Windowing and Calendar Features

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/giulatona/iecon2025_tutorial/blob/main/notebooks/02_preprocessing_windowing_features.ipynb)

This notebook demonstrates preprocessing techniques for time series forecasting:

- **Data Preprocessing**: Cleaning, normalization, and scaling
- **Windowing Techniques**: Creating sliding windows for sequence modeling
- **Calendar Features**: Adding temporal features (to be used as exogenous features)

**Dataset**: UCI Individual Household Electric Power Consumption  
**Goal**: Prepare data for machine learning forecasting models

## Setup and Data Loading

In [1]:
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from datetime import datetime, timedelta
import warnings
warnings.filterwarnings('ignore')

# Set plotting style
plt.style.use('default')
sns.set_palette("husl")

### Dataset Download (Hidden)
The following cell downloads the dataset if not available locally. This cell is hidden by default to keep the notebook clean.

In [3]:
# Download data if not available locally
import os
import urllib.request
import zipfile

# Check if running in Google Colab
in_colab = 'google.colab' in str(get_ipython())

if in_colab:
    print("Running in Google Colab - downloading dataset...")
    !wget -q https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip
    !unzip -q household_power_consumption.zip
    data_file = 'household_power_consumption.txt'
else:
    print("Running locally...")
    data_file = 'data/household_power_consumption.txt'
    
    # Create data directory if it doesn't exist
    os.makedirs('data', exist_ok=True)
    
    # Download dataset if it doesn't exist locally
    if not os.path.exists(data_file):
        print("Dataset not found locally. Downloading...")
        url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/00235/household_power_consumption.zip'
        zip_file = 'data/household_power_consumption.zip'
        
        # Download the zip file
        urllib.request.urlretrieve(url, zip_file)
        print("Download completed. Extracting...")
        
        # Extract the zip file
        with zipfile.ZipFile(zip_file, 'r') as zip_ref:
            zip_ref.extractall('data/')
        
        print("Extraction completed.")
    else:
        print("Dataset found locally.")

print(f"Using data file: {data_file}")

Running locally...
Dataset not found locally. Downloading...
Download completed. Extracting...
Extraction completed.
Using data file: data/household_power_consumption.txt


In [4]:
# Load and perform initial preprocessing
print("Loading dataset...")
df = pd.read_csv(data_file, sep=';', low_memory=False)

print(f"Dataset shape: {df.shape}")
print(f"Date range: {df['Date'].iloc[0]} to {df['Date'].iloc[-1]}")
df.head()

Loading dataset...
Dataset shape: (2075259, 9)
Date range: 16/12/2006 to 26/11/2010


| | Date | Time | Global_active_power | Global_reactive_power | Voltage | Global_intensity | Sub_metering_1 | Sub_metering_2 | Sub_metering_3 |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 16/12/2006 | 17:24:00 | 4.216 | 0.418 | 234.84 | 18.4 | 0.0 | 1.0 | 17.0 |
| 1 | 16/12/2006 | 17:25:00 | 5.36 | 0.436 | 233.63 | 23.0 | 0.0 | 1.0 | 16.0 |
| 2 | 16/12/2006 | 17:26:00 | 5.374 | 0.498 | 233.29 | 23.0 | 0.0 | 2.0 | 17.0 |
| 3 | 16/12/2006 | 17:27:00 | 5.388 | 0.502 | 233.74 | 23.0 | 0.0 | 1.0 | 17.0 |
| 4 | 16/12/2006 | 17:28:00 | 3.666 | 0.528 | 235.68 | 15.8 | 0.0 | 1.0 | 17.0 |


## Data Preprocessing Pipeline

### 1. Basic Data Cleaning

In [None]:
def clean_data(df):
    """
    Clean the household power consumption dataset.
    
    Parameters:
    df (pd.DataFrame): Raw dataset
    
    Returns:
    pd.DataFrame: Cleaned dataset with datetime index
    """
    # TODO: Implement data cleaning
    # - Combine Date and Time columns
    # - Convert '?' to NaN and handle missing values
    # - Convert data types
    # - Set datetime index
    pass

# Apply cleaning
# df_clean = clean_data(df)
print("Data cleaning function defined. Implementation needed.")
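
The TODO items above could be filled in roughly as follows. This is a minimal sketch, not the tutorial's implementation: the name `clean_data_sketch` is hypothetical, and filling gaps by time-based interpolation is an assumption chosen for illustration (dropping or forward-filling missing rows would be equally valid choices).

```python
import pandas as pd


def clean_data_sketch(df):
    """Hypothetical cleaning routine for the raw household power frame."""
    out = df.copy()
    # Combine the Date (dd/mm/yyyy) and Time (HH:MM:SS) strings into one timestamp
    out["datetime"] = pd.to_datetime(
        out["Date"] + " " + out["Time"], format="%d/%m/%Y %H:%M:%S"
    )
    out = out.drop(columns=["Date", "Time"]).set_index("datetime").sort_index()
    # Convert every remaining column to float; '?' markers become NaN
    out = out.apply(pd.to_numeric, errors="coerce")
    # Assumption: gaps between valid readings are filled by time-weighted interpolation
    out = out.interpolate(method="time")
    return out
```

On the loaded frame this would be called as `df_clean = clean_data_sketch(df)`.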

### 2. Data Normalization and Scaling

In [None]:
def normalize_features(df, method='standardize', columns=None):
    """
    Normalize features using different scaling methods.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    method (str): 'standardize', 'minmax', or 'robust'
    columns (list): Columns to normalize (None for all numeric)
    
    Returns:
    pd.DataFrame: Normalized dataframe
    dict: Scaler parameters for inverse transformation
    """
    # TODO: Implement different normalization methods
    # - Standard scaling (z-score)
    # - Min-max scaling
    # - Robust scaling
    # - Return scaler parameters for inverse transformation
    pass

print("Normalization function defined. Implementation needed.")
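
One way the three scaling methods listed in the TODO might look is sketched below. The names `normalize_features_sketch` and `denormalize_sketch`, and the layout of the returned parameter dictionary, are assumptions for illustration. Each method reduces to subtracting a centre and dividing by a scale, which keeps the inverse transformation simple.

```python
import pandas as pd


def normalize_features_sketch(df, method="standardize", columns=None):
    """Hypothetical scaler: returns the scaled frame and the parameters used."""
    cols = list(columns) if columns is not None else list(df.select_dtypes("number").columns)
    data = df[cols]
    if method == "standardize":      # z-score: zero mean, unit variance
        center, scale = data.mean(), data.std(ddof=0)
    elif method == "minmax":         # rescale to the [0, 1] range
        center, scale = data.min(), data.max() - data.min()
    elif method == "robust":         # median and interquartile range
        center, scale = data.median(), data.quantile(0.75) - data.quantile(0.25)
    else:
        raise ValueError(f"Unknown method: {method!r}")
    scale = scale.replace(0, 1.0)    # guard against constant columns
    out = df.copy()
    out[cols] = (data - center) / scale
    return out, {"center": center, "scale": scale, "columns": cols}


def denormalize_sketch(df_scaled, params):
    """Undo normalize_features_sketch using the stored parameters."""
    out = df_scaled.copy()
    cols = params["columns"]
    out[cols] = df_scaled[cols] * params["scale"] + params["center"]
    return out
```

To avoid leakage, the parameters would normally be computed on the training split only and then reused for validation and test data.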

## Windowing Techniques

### 1. Sliding Window Creation

In [None]:
def create_sliding_windows(data, window_size, forecast_horizon=1, step_size=1):
    """
    Create sliding windows for time series forecasting.
    
    Parameters:
    data (pd.DataFrame or np.array): Time series data
    window_size (int): Number of time steps in input window
    forecast_horizon (int): Number of time steps to predict
    step_size (int): Step size between windows
    
    Returns:
    tuple: (X, y) arrays for model training
    """
    # TODO: Implement sliding window creation
    # - Create input sequences (X) and target sequences (y)
    # - Handle multiple forecast horizons
    # - Support different step sizes
    # - Preserve temporal order
    pass

print("Sliding window function defined. Implementation needed.")
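
A minimal sketch of the windowing step described in the TODO is shown below. The name `create_sliding_windows_sketch` is hypothetical, and the sketch assumes that every column is both an input and a target; in practice the target would often be a single column selected from `y`.

```python
import numpy as np


def create_sliding_windows_sketch(data, window_size, forecast_horizon=1, step_size=1):
    """Hypothetical windowing helper returning (X, y) as 3-D arrays."""
    values = np.asarray(data, dtype=float)
    if values.ndim == 1:
        values = values[:, None]          # treat a 1-D series as one feature
    # Number of window start positions that leave room for the full horizon
    n_positions = len(values) - window_size - forecast_horizon + 1
    if n_positions <= 0:
        raise ValueError("Series too short for the requested window and horizon")
    starts = range(0, n_positions, step_size)
    # Inputs: window_size consecutive steps; targets: the steps right after them
    X = np.stack([values[s:s + window_size] for s in starts])
    y = np.stack([values[s + window_size:s + window_size + forecast_horizon] for s in starts])
    return X, y
```

`X` has shape `(n_windows, window_size, n_features)` and `y` has shape `(n_windows, forecast_horizon, n_features)`. Windows are generated in chronological order, so temporal order is preserved.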

## Calendar Features Engineering

### 1. Basic Temporal Features

In [None]:
def extract_temporal_features(df, datetime_col=None):
    """
    Extract basic temporal features from datetime index or column.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    datetime_col (str): Name of datetime column (None if using index)
    
    Returns:
    pd.DataFrame: Dataframe with added temporal features
    """
    # TODO: Implement temporal feature extraction
    # - Hour, day, month, year
    # - Day of week, day of year
    # - Quarter, week of year
    # - Is weekend, is holiday
    pass

print("Temporal features function defined. Implementation needed.")
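
The calendar fields listed in the TODO map directly onto pandas datetime accessors, as the sketch below suggests. The function name `extract_temporal_features_sketch` and the output column names are assumptions. The holiday flag is left out here because it needs an external calendar.

```python
import pandas as pd


def extract_temporal_features_sketch(df, datetime_col=None):
    """Hypothetical helper adding calendar columns from the index or a column."""
    out = df.copy()
    ts = pd.DatetimeIndex(out.index if datetime_col is None else out[datetime_col])
    out["hour"] = ts.hour
    out["day"] = ts.day
    out["month"] = ts.month
    out["year"] = ts.year
    out["day_of_week"] = ts.dayofweek          # Monday=0 ... Sunday=6
    out["day_of_year"] = ts.dayofyear
    out["quarter"] = ts.quarter
    # ISO week number; converted to plain integers for downstream models
    out["week_of_year"] = ts.isocalendar().week.astype(int).to_numpy()
    out["is_weekend"] = (ts.dayofweek >= 5).astype(int)
    return out
```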

### 2. Cyclical Encoding

In [None]:
def create_cyclical_features(df, columns):
    """
    Create cyclical encoding for temporal features using sine and cosine.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    columns (dict): Dictionary mapping column names to their periods
                   e.g., {'hour': 24, 'day_of_week': 7, 'month': 12}
    
    Returns:
    pd.DataFrame: Dataframe with cyclical features
    """
    # TODO: Implement cyclical encoding
    # - Apply sine and cosine transformations
    # - Handle different periods (hour, day, week, month)
    # - Preserve cyclical nature of time
    pass

def visualize_cyclical_features(df, feature_name, period):
    """
    Visualize cyclical features to verify encoding.
    
    Parameters:
    df (pd.DataFrame): Dataframe with cyclical features
    feature_name (str): Name of the original feature
    period (int): Period of the cyclical feature
    """
    # TODO: Implement visualization
    # - Plot original vs cyclical features
    # - Show circular representation
    # - Validate encoding correctness
    pass

print("Cyclical encoding functions defined. Implementation needed.")
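
Cyclical encoding maps a value `x` with period `P` to the pair `(sin(2πx/P), cos(2πx/P))`, so that the end of a cycle sits next to its start (hour 23 is close to hour 0). The sketch below illustrates this. The name `create_cyclical_features_sketch` and the `_sin`/`_cos` column suffixes are assumptions.

```python
import numpy as np
import pandas as pd


def create_cyclical_features_sketch(df, columns):
    """Hypothetical helper: add <name>_sin and <name>_cos for each periodic column."""
    out = df.copy()
    for name, period in columns.items():
        # Map the value onto an angle around the unit circle
        angle = 2.0 * np.pi * out[name] / period
        out[f"{name}_sin"] = np.sin(angle)
        out[f"{name}_cos"] = np.cos(angle)
    return out


# Example: hours 0 and 23 end up as neighbouring points on the circle
hours = pd.DataFrame({"hour": [0, 6, 12, 23]})
encoded = create_cyclical_features_sketch(hours, {"hour": 24})
```

A quick visual check is to scatter-plot the `_sin` column against the `_cos` column. The points should lie on a unit circle.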

### 3. Holiday and Special Events

In [None]:
def add_holiday_features(df, country='France'):
    """
    Add holiday indicators and special events.
    
    Parameters:
    df (pd.DataFrame): Input dataframe with datetime index
    country (str): Country for holiday calendar
    
    Returns:
    pd.DataFrame: Dataframe with holiday features
    """
    # TODO: Implement holiday features
    # - Use holidays library for country-specific holidays
    # - Add binary indicators for holidays
    # - Include special events (vacation periods, etc.)
    # - Add days before/after holidays
    pass

print("Holiday features function defined. Implementation needed.")
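
The TODO suggests the `holidays` library. To stay dependency-free, the sketch below instead hard-codes the fixed-date French public holidays, which is a deliberate simplification: movable holidays (Easter Monday, Ascension, Whit Monday) and school vacation periods are not covered. The function name, the constant, and the `month * 100 + day` encoding are all assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Fixed-date French public holidays encoded as month * 100 + day.
# Assumption: movable holidays are intentionally omitted in this sketch.
FIXED_FR_HOLIDAY_CODES = (101, 501, 508, 714, 815, 1101, 1111, 1225)


def add_fixed_holiday_features_sketch(df, holiday_codes=FIXED_FR_HOLIDAY_CODES):
    """Hypothetical helper: flag holidays and the days adjacent to them."""
    out = df.copy()
    days = out.index.normalize()              # drop the time-of-day part

    def flag(dates):
        codes = dates.month * 100 + dates.day
        return np.isin(codes, holiday_codes).astype(int)

    out["is_holiday"] = flag(days)
    # A day precedes a holiday if the following day is a holiday, and vice versa
    out["is_day_before_holiday"] = flag(days + pd.Timedelta(days=1))
    out["is_day_after_holiday"] = flag(days - pd.Timedelta(days=1))
    return out
```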

## Lag Features and Rolling Statistics

### 1. Lag Features

In [None]:
def create_lag_features(df, columns, lags):
    """
    Create lag features for specified columns.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    columns (list): Columns to create lags for
    lags (list): List of lag periods
    
    Returns:
    pd.DataFrame: Dataframe with lag features
    """
    # TODO: Implement lag feature creation
    # - Create lagged versions of specified columns
    # - Handle multiple lag periods
    # - Manage missing values from lagging
    pass

print("Lag features function defined. Implementation needed.")
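
Lag features can be built with `Series.shift`, as in the sketch below. The name `create_lag_features_sketch`, the `_lag_` column suffix, and the extra `dropna` parameter are assumptions. Dropping the leading rows is one way to manage the missing values that lagging introduces.

```python
import pandas as pd


def create_lag_features_sketch(df, columns, lags, dropna=True):
    """Hypothetical helper: add <col>_lag_<k> columns holding past values."""
    out = df.copy()
    for col in columns:
        for lag in lags:
            # shift(k) moves values down by k rows, so row t holds the value from t - k
            out[f"{col}_lag_{lag}"] = out[col].shift(lag)
    # The first max(lags) rows have no history; drop them if requested
    return out.dropna() if dropna else out


demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
lagged = create_lag_features_sketch(demo, ["a"], lags=[1, 2])
```

Note that `dropna()` also removes rows with missing values in any other column, so it is best applied after cleaning.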

### 2. Rolling Statistics

In [None]:
def create_rolling_features(df, columns, windows, statistics=('mean', 'std', 'min', 'max')):
    """
    Create rolling window statistics.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    columns (list): Columns to calculate statistics for
    windows (list): List of window sizes
    statistics (list): List of statistics to calculate
    
    Returns:
    pd.DataFrame: Dataframe with rolling features
    """
    # TODO: Implement rolling statistics
    # - Calculate rolling mean, std, min, max
    # - Support multiple window sizes
    # - Add rolling quantiles
    # - Handle edge cases
    pass

def create_expanding_features(df, columns, statistics=('mean', 'std')):
    """
    Create expanding window statistics.
    
    Parameters:
    df (pd.DataFrame): Input dataframe
    columns (list): Columns to calculate statistics for
    statistics (list): List of statistics to calculate
    
    Returns:
    pd.DataFrame: Dataframe with expanding features
    """
    # TODO: Implement expanding statistics
    # - Calculate expanding mean, std
    # - Useful for trend analysis
    pass

print("Rolling statistics functions defined. Implementation needed.")
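
The sketch below shows one way rolling and expanding statistics might be computed. The function names and output column names are hypothetical. Shifting by one step before aggregating is an assumption made here so that each row only uses values strictly before it, which avoids leaking the current value into its own features.

```python
import pandas as pd


def create_rolling_features_sketch(df, columns, windows, statistics=("mean", "std", "min", "max")):
    """Hypothetical helper: rolling statistics over past values only."""
    out = df.copy()
    for col in columns:
        past = df[col].shift(1)               # exclude the current observation
        for window in windows:
            rolled = past.rolling(window=window, min_periods=window)
            for stat in statistics:
                out[f"{col}_roll{window}_{stat}"] = rolled.agg(stat)
    return out


def create_expanding_features_sketch(df, columns, statistics=("mean", "std")):
    """Hypothetical helper: statistics over all past values up to each row."""
    out = df.copy()
    for col in columns:
        expanding = df[col].shift(1).expanding(min_periods=1)
        for stat in statistics:
            out[f"{col}_expanding_{stat}"] = expanding.agg(stat)
    return out


demo = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0, 5.0]})
rolled_demo = create_rolling_features_sketch(demo, ["a"], windows=[2], statistics=("mean", "max"))
expanded_demo = create_expanding_features_sketch(demo, ["a"], statistics=("mean",))
```

The leading rows of each rolling column stay NaN until a full window of past values is available.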

## Complete Preprocessing Pipeline

In [None]:
class TimeSeriesPreprocessor:
    """
    Complete preprocessing pipeline for time series forecasting.
    """
    
    def __init__(self, window_size=24, forecast_horizon=1, 
                 normalization='standardize', add_cyclical=True):
        """
        Initialize preprocessing pipeline.
        
        Parameters:
        window_size (int): Input sequence length
        forecast_horizon (int): Number of steps to predict
        normalization (str): Normalization method
        add_cyclical (bool): Whether to add cyclical features
        """
        # TODO: Initialize pipeline components
        pass
    
    def fit(self, df):
        """
        Fit preprocessing pipeline on training data.
        
        Parameters:
        df (pd.DataFrame): Training dataset
        
        Returns:
        self: Fitted preprocessor
        """
        # TODO: Fit all preprocessing steps
        pass
    
    def transform(self, df):
        """
        Transform dataset using fitted pipeline.
        
        Parameters:
        df (pd.DataFrame): Input dataset
        
        Returns:
        tuple: (X, y) for model training/prediction
        """
        # TODO: Apply all preprocessing steps
        pass
    
    def fit_transform(self, df):
        """
        Fit and transform in one step.
        
        Parameters:
        df (pd.DataFrame): Input dataset
        
        Returns:
        tuple: (X, y) for model training
        """
        return self.fit(df).transform(df)
    
    def inverse_transform(self, predictions):
        """
        Inverse transform predictions to original scale.
        
        Parameters:
        predictions (np.array): Model predictions
        
        Returns:
        np.array: Predictions in original scale
        """
        # TODO: Implement inverse transformation
        pass

print("TimeSeriesPreprocessor class defined. Implementation needed.")
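
To make the fit/transform contract concrete, here is a deliberately reduced sketch of such a pipeline. It only standardizes numeric columns and builds windows; calendar and cyclical features are omitted. The class name `MinimalPreprocessorSketch` and the `target_col` parameter are assumptions that do not appear in the skeleton above.

```python
import numpy as np
import pandas as pd


class MinimalPreprocessorSketch:
    """Hypothetical reduced pipeline: z-score scaling followed by windowing."""

    def __init__(self, target_col, window_size=24, forecast_horizon=1):
        self.target_col = target_col
        self.window_size = window_size
        self.forecast_horizon = forecast_horizon

    def fit(self, df):
        # Learn scaling statistics from the training data only
        numeric = df.select_dtypes("number")
        self.columns_ = list(numeric.columns)
        self.mean_ = numeric.mean()
        self.std_ = numeric.std(ddof=0).replace(0, 1.0)
        return self

    def transform(self, df):
        scaled = ((df[self.columns_] - self.mean_) / self.std_).to_numpy()
        target = scaled[:, self.columns_.index(self.target_col)]
        w, h = self.window_size, self.forecast_horizon
        n = len(scaled) - w - h + 1
        if n <= 0:
            raise ValueError("Not enough rows for the requested window and horizon")
        X = np.stack([scaled[i:i + w] for i in range(n)])
        y = np.stack([target[i + w:i + w + h] for i in range(n)])
        return X, y

    def fit_transform(self, df):
        return self.fit(df).transform(df)

    def inverse_transform(self, predictions):
        # Map scaled target values back to the original units
        scale = self.std_[self.target_col]
        return np.asarray(predictions) * scale + self.mean_[self.target_col]


demo = pd.DataFrame({"p": np.arange(10.0), "v": 2 * np.arange(10.0)})
pre = MinimalPreprocessorSketch(target_col="p", window_size=3, forecast_horizon=2)
X_demo, y_demo = pre.fit_transform(demo)
```

Keeping the learned statistics on the instance is what lets `transform` be applied to validation or test data without refitting.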

## Usage Examples and Testing

In [None]:
# TODO: Add example usage of the preprocessing pipeline
# Example:
# preprocessor = TimeSeriesPreprocessor(window_size=24, forecast_horizon=1)
# X_train, y_train = preprocessor.fit_transform(df_train)
# X_test, y_test = preprocessor.transform(df_test)

print("Usage examples to be implemented.")

## Summary

This notebook lays out a framework for time series preprocessing. The functions are defined as stubs to be implemented and cover:

### Key Components:
1. **Data Cleaning**: Handling missing values, data type conversion, datetime processing
2. **Normalization**: Multiple scaling methods for different use cases
3. **Windowing**: Sliding windows, overlapping sequences, multi-horizon support
4. **Calendar Features**: Temporal features, cyclical encoding, holiday indicators
5. **Lag Features**: Historical values and rolling statistics
6. **Pipeline**: Complete preprocessing class for reproducible workflows

### Next Steps:
- Implement the defined functions with specific logic
- Test with the household power consumption dataset
- Validate preprocessing quality
- Integrate with machine learning models

### Best Practices:
- Always preserve temporal order in time series data
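- Avoid data leakage: fit scalers on the training split only, and build lag and rolling features from past values only
- Split train/validation/test sets chronologically rather than by random shuffling
- Verify that scaling can be inverted so predictions can be reported in the original units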
- Document all preprocessing steps for reproducibility
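
The split-related practices above can be sketched as a chronological split. The function name `chronological_split_sketch` and the 70/15/15 default fractions are assumptions for illustration. The input is expected to be sorted by time already.

```python
import pandas as pd


def chronological_split_sketch(df, train_frac=0.7, val_frac=0.15):
    """Hypothetical helper: split a time-ordered frame without shuffling."""
    n = len(df)
    train_end = int(round(n * train_frac))
    val_end = int(round(n * (train_frac + val_frac)))
    # Earlier rows train the model; later rows are held out for evaluation
    return df.iloc[:train_end], df.iloc[train_end:val_end], df.iloc[val_end:]


demo = pd.DataFrame(
    {"value": range(100)},
    index=pd.date_range("2007-01-01", periods=100, freq="D"),
)
train, val, test = chronological_split_sketch(demo)
```

Any scaler would then be fitted on `train` alone and reused unchanged on `val` and `test`.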