# Retail Sales Forecasting using the M5 dataset with Granite Time Series - Few-shot finetuning, evaluation, and visualization

In this tutorial, we will explore [time series forecasting](https://www.ibm.com/think/insights/time-series-forecasting) using the [IBM Granite Time Series model](https://ibm.com/granite) to predict retail sales. We will cover key techniques such as few-shot forecasting and fine-tuning. We use the [M5 datasets](https://drive.google.com/drive/folders/1D6EWdVSaOtrP1LEFh1REjI3vej6iUS_4?usp=sharing) from the official M-Competitions [repository](https://github.com/Mcompetitions/M5-methods) to forecast future sales aggregated by state. The aim of this recipe is to showcase how to use a pre-trained time series foundation model for multivariate forecasting and to explore various features available with Granite Time Series Foundation Models.

This recipe uses TinyTimeMixers (TTMs), which are compact pre-trained models for multivariate time series forecasting, open-sourced by IBM Research. With fewer than 1 million parameters, TTM introduces the notion of the first-ever "tiny" pre-trained models for time series forecasting. TTM outperforms several popular benchmarks that demand billions of parameters in zero-shot and few-shot forecasting, and it can easily be fine-tuned for multivariate forecasts.

## Setting Up

### Install the TSFM Library

The [granite-tsfm library](https://github.com/ibm-granite/granite-tsfm) provides utilities for working with Time Series Foundation Models (TSFM). Here we retrieve and install the latest version of the library.

In [20]:
# Install the tsfm library
! pip install "granite-tsfm[notebooks] @ git+https://github.com/ibm-granite/granite-tsfm.git" -U

Collecting granite-tsfm@ git+https://github.com/ibm-granite/granite-tsfm.git (from granite-tsfm[notebooks]@ git+https://github.com/ibm-granite/granite-tsfm.git)
  Cloning https://github.com/ibm-granite/granite-tsfm.git to /private/var/folders/t7/xsc0cc9n5qnfsv5glqn6wzq40000gn/T/pip-install-4od4fevl/granite-tsfm_7a2783e2bf3d48c4ba8f223afb84d211
  Running command git clone --filter=blob:none --quiet https://github.com/ibm-granite/granite-tsfm.git /private/var/folders/t7/xsc0cc9n5qnfsv5glqn6wzq40000gn/T/pip-install-4od4fevl/granite-tsfm_7a2783e2bf3d48c4ba8f223afb84d211
  Resolved https://github.com/ibm-granite/granite-tsfm.git to commit 66a368f93427ee6b35832ab733727e5ffbb03c23
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done


### Import Packages

From `tsfm_public`, we use the TinyTimeMixer model, the time series preprocessor, the forecasting dataset class, and the plotting function. We also leverage a few components for the fine-tuning process.

In [None]:
import math
import os

import numpy as np
import pandas as pd
from torch.optim import AdamW
from torch.optim.lr_scheduler import OneCycleLR
from torch.utils.data import Subset
from transformers import EarlyStoppingCallback, Trainer, TrainingArguments

from tsfm_public import (
    ForecastDFDataset,
    TimeSeriesPreprocessor,
    TinyTimeMixerForPrediction,
    TrackingCallback,
    count_parameters,
)
from tsfm_public.toolkit.lr_finder import optimal_lr_finder
from tsfm_public.toolkit.time_series_preprocessor import prepare_data_splits
from tsfm_public.toolkit.util import select_by_timestamp
from tsfm_public.toolkit.visualization import plot_predictions

### Specify configuration variables

We specify the forecast length and the context length (both in time steps); the context length is set to match the pretrained model. We also declare the Granite Time Series model and the specific revision we are targeting.

The granite-timeseries TTM R2 model card lists several revisions of the model for various context lengths and prediction lengths. In this example we work with daily data, so we choose a revision suited to that resolution: 90 days of history to forecast the next 30 days. We set `forecast_length` to 28 days, which is shorter than the model's 30-day horizon; the `prediction_filter_length` argument passed when loading the model restricts the forecast to 28 steps.

In [22]:
forecast_length = 28
context_length = 90

TTM_MODEL_PATH = "ibm-granite/granite-timeseries-ttm-r2"
REVISION = "90-30-ft-l1-r2.1"
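
The revision tag appears to encode the context length and the prediction length as its first two numbers. The sketch below checks the configured lengths against the tag. The helper `parse_revision` is hypothetical, and the `<context>-<prediction>-...` naming pattern is an assumption inferred from the tag used here, not a documented API.

```python
def parse_revision(revision: str) -> tuple[int, int]:
    """Return (context_length, prediction_length) parsed from a revision tag.

    Assumes the tag starts with "<context>-<prediction>-", as in
    "90-30-ft-l1-r2.1". This pattern is inferred, not guaranteed.
    """
    context_str, prediction_str, *_ = revision.split("-")
    return int(context_str), int(prediction_str)


forecast_length = 28
context_length = 90
REVISION = "90-30-ft-l1-r2.1"

model_context, model_horizon = parse_revision(REVISION)

# The context length must match the pretrained model; the forecast length
# may be shorter than the model horizon because the output can be truncated.
assert context_length == model_context
assert forecast_length <= model_horizon
print(model_context, model_horizon)  # 90 30
```

A check like this fails early if the revision is changed without updating the two length variables.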

## Preparing the Data

As mentioned in the introduction, this notebook makes use of the [M5 datasets](https://drive.google.com/drive/folders/1D6EWdVSaOtrP1LEFh1REjI3vej6iUS_4?usp=sharing) from the official M-Competitions [repository](https://github.com/Mcompetitions/M5-methods). 


The original data includes hierarchy and product information. For this example, we aggregate the sales by state into three separate time series. The code for downloading the datasets and preparing them is available in `M5_retail_data_prep.py`. Here, we simply run the `prepare_data()` function to save the prepared dataset.

In [39]:
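! pip install wget
import wget

# NOTE: this URL must point to the raw `M5_retail_data_prep.py` helper script so that
# the import in the next cell works. The URL below references a different file
# (a notebook), so replace it with the location of `M5_retail_data_prep.py` before running.
url = 'https://raw.githubusercontent.com/ibm-granite-community/granite-timeseries-cookbook/refs/heads/main/recipes/Time_Series/Bike_Sharing_Finetuning_with_Exogenous.ipynb'
M5_retail_data_prep = wget.download(url)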

Collecting wget
  Downloading wget-3.2.zip (10 kB)
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: wget
  Building wheel for wget (pyproject.toml) ... done
  Created wheel for wget: filename=wget-3.2-py3-none-any.whl size=9684 sha256=acfa2c021398b56e39aa59057f3f7395586ab0ebed99b12fc66243911063406f
  Stored in directory: /Users/joesepi/Library/Caches/pip/wheels/40/b3/0f/a40dbd1c6861731779f62cc4babcb234387e11d697df70ee97
Successfully built wget
Installing collected packages: wget
Successfully installed wget-3.2


In [40]:
from M5_retail_data_prep import prepare_data
prepare_data()

INFO:p-27337:t-8601880640:M5_retail_data_prep.py:prepare_data:Temporary folder already exists, assuming data already downloaded.


Successfully saved the prepared M5 data to m5_for_state_level_forecasting.csv.gz.


### Read in the data

We read the CSV into a pandas dataframe, parse the `date` column as a datetime, and drop two unnecessary columns.

In [24]:
data_path = "m5_for_state_level_forecasting.csv.gz"

data = pd.read_csv(data_path, parse_dates=["date"]).drop(columns=["d", "weekday"])
data.head()

| Unnamed: 0 | state_id | sales | date | wm_yr_wk | wday | month | year | snap_CA | snap_TX | snap_WI | event_name_1 | event_type_1 | event_name_2 | event_type_2 | wm_yr_wk_sin | wday_sin | month_sin | enc_state_id_mean | enc_state_id_std |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | CA | 14195 | 2011-01-29 | 0 | 0 | 0 | 2011 | 0 | 0 | 0 | noevent | noevent | noevent | noevent | 0.0 | 0.0 | 0.0 | 14989.831155 | 3257.223505 |
| 1 | TX | 9438 | 2011-01-29 | 0 | 0 | 0 | 2011 | 0 | 0 | 0 | noevent | noevent | noevent | noevent | 0.0 | 0.0 | 0.0 | 9879.250392 | 1964.928938 |
| 2 | WI | 8998 | 2011-01-29 | 0 | 0 | 0 | 2011 | 0 | 0 | 0 | noevent | noevent | noevent | noevent | 0.0 | 0.0 | 0.0 | 9472.48092 | 2563.314535 |
| 3 | CA | 13805 | 2011-01-30 | 0 | 1 | 0 | 2011 | 0 | 0 | 0 | noevent | noevent | noevent | noevent | 0.0 | 0.781831 | 0.0 | 14989.831155 | 3257.223505 |
| 4 | TX | 9630 | 2011-01-30 | 0 | 1 | 0 | 2011 | 0 | 0 | 0 | noevent | noevent | noevent | noevent | 0.0 | 0.781831 | 0.0 | 9879.250392 | 1964.928938 |
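
The `wday_sin` value of 0.781831 for `wday = 1` in the preview is consistent with a sine encoding of the day-of-week index with a period of 7. The sketch below reproduces that value. The helper name `cyclical_sin` is hypothetical, and the exact formula used in `M5_retail_data_prep.py` is an assumption inferred from the numbers shown.

```python
import math


def cyclical_sin(value: int, period: int) -> float:
    """Sine-encode a periodic integer feature such as the day of the week."""
    return math.sin(2 * math.pi * value / period)


# wday = 0 and wday = 1 are the two day-of-week indices in the preview rows.
print(round(cyclical_sin(0, 7), 6))  # 0.0
print(round(cyclical_sin(1, 7), 6))  # 0.781831
```

An encoding of this kind gives the model a smooth, bounded representation of a calendar position in place of a raw integer index.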


Next, we must clean up the columns in our data and declare the names of the timestamp column, the target column to be predicted as well as the categorical column used to aggregate the data.

In [25]:
# Every column other than the timestamp, target, and ID is treated as a control column
cols = [c for c in data.columns if c not in ("date", "sales", "state_id")]

column_specifiers = {
    "timestamp_column": "date",
    "id_columns": ["state_id"],
    "target_columns": ["sales"],
    "control_columns": cols,
    "static_categorical_columns": ["state_id"],
    "categorical_columns": [
        "event_name_1",
        "event_type_1",
        "event_name_2",
        "event_type_2",
    ],
}

### Train the Preprocessor

The preprocessor is trained on the training portion of the input data to learn the scaling factors. The scaling will be applied when we use the preprocess method of the time series preprocessor.

In [26]:
tsp = TimeSeriesPreprocessor(
    **column_specifiers,
    context_length=context_length,
    prediction_length=forecast_length,
    scaling=True,
    encode_categorical=True,
    scaler_type="standard",
)

df_train = select_by_timestamp(
    data, timestamp_column=column_specifiers["timestamp_column"], end_timestamp="2016-05-23"
)

trained_tsp = tsp.train(df_train)
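
To make the idea of learned scaling factors concrete, here is a minimal sketch of standard scaling in plain Python: the mean and standard deviation are estimated on the training portion only and then reused for later data. The function names and the toy values are made up for illustration, and the sketch assumes a population standard deviation; it is not the preprocessor's implementation.

```python
from statistics import fmean, pstdev


def fit_standard_scaler(train_values):
    """Learn the mean and (population) standard deviation from training data only."""
    return fmean(train_values), pstdev(train_values)


def scale(values, mean, std):
    """Apply previously learned scaling factors."""
    return [(v - mean) / std for v in values]


def unscale(values, mean, std):
    """Map scaled values back to the original units."""
    return [v * std + mean for v in values]


# Toy series: the first four points play the role of the training portion.
series = [10.0, 12.0, 14.0, 16.0, 20.0, 8.0]
train, later = series[:4], series[4:]

mean, std = fit_standard_scaler(train)
scaled_later = scale(later, mean, std)       # uses training statistics only
restored = unscale(scaled_later, mean, std)  # back to the original units
print(mean, round(std, 3))  # 13.0 2.236
```

Fitting on the training portion alone keeps information from the validation and test periods out of the scaling factors.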

## Finetune the model

Now we focus on fine-tuning the pretrained model. Fine-tuning uses the exogenous columns declared above in addition to the sales history.

### Preparing the data for fine-tuning

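We split the data into training (first 50%), validation (next 25%), and test (final 25%) sets. The training set is used to fit the model, the validation set is used to monitor the loss for early stopping, and the test set is used to evaluate the final performance.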

In [27]:
split_params = {"train": [0, 0.5], "valid": [0.5, 0.75], "test": [0.75, 1.0]}

train_data, valid_data, test_data = prepare_data_splits(
    data, id_columns=column_specifiers["id_columns"], split_config=split_params, context_length=context_length
)
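
The sketch below shows one way fractional splits can be mapped to row ranges for a single series. It assumes that the validation and test splits are extended backwards by `context_length` rows so that their first forecast window has a full history. The function `split_ranges` and the row count of 1000 are hypothetical; this is a simplified illustration, not the library code.

```python
def split_ranges(n_rows, split_config, context_length):
    """Map fractional splits to [start, end) row ranges for one series.

    Non-training splits start `context_length` rows earlier so that the
    first window in each split has enough history (an assumption here).
    """
    ranges = {}
    for name, (start_frac, end_frac) in split_config.items():
        start = int(start_frac * n_rows)
        end = int(end_frac * n_rows)
        if name != "train":
            start = max(0, start - context_length)
        ranges[name] = (start, end)
    return ranges


split_params = {"train": [0, 0.5], "valid": [0.5, 0.75], "test": [0.75, 1.0]}

# 1000 rows per series is a made-up round number for illustration.
ranges = split_ranges(1000, split_params, context_length=90)
print(ranges)
# {'train': (0, 500), 'valid': (410, 750), 'test': (660, 1000)}
```

The overlapping rows serve only as history: the forecast targets of each split still fall inside that split's own fraction of the timeline.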

Because the `Trainer` cannot consume pandas dataframes directly, we construct torch datasets using `ForecastDFDataset`, a torch dataset class designed specifically for forecasting use cases.

In [28]:
frequency_token = tsp.get_frequency_token(tsp.freq)

dataset_params = column_specifiers.copy()
dataset_params["frequency_token"] = frequency_token
dataset_params["context_length"] = context_length
dataset_params["prediction_length"] = forecast_length


train_dataset = ForecastDFDataset(tsp.preprocess(train_data), **dataset_params)
valid_dataset = ForecastDFDataset(tsp.preprocess(valid_data), **dataset_params)
test_dataset = ForecastDFDataset(tsp.preprocess(test_data), **dataset_params)
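
Conceptually, a forecasting dataset turns each series into sliding windows made of a context segment followed by a prediction segment. The sketch below illustrates that idea on a toy list. The function `make_windows` is hypothetical, and treating `ForecastDFDataset` as enumerating every window with stride 1 is an assumption.

```python
def make_windows(series, context_length, prediction_length):
    """Return (past, future) pairs for every valid window position."""
    n_windows = len(series) - context_length - prediction_length + 1
    return [
        (
            series[i : i + context_length],
            series[i + context_length : i + context_length + prediction_length],
        )
        for i in range(max(0, n_windows))
    ]


series = list(range(10))  # toy series: 0, 1, ..., 9
windows = make_windows(series, context_length=4, prediction_length=2)
print(len(windows))  # 5
print(windows[0])    # ([0, 1, 2, 3], [4, 5])
print(windows[-1])   # ([4, 5, 6, 7], [8, 9])
```

Under this counting, a series of `n` rows yields `n - context_length - prediction_length + 1` windows, which is `n - 117` for the settings in this notebook.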

Now let's take a smaller sample from the torch datasets produced above.

In [None]:
# 20% training data (few-shot finetuning)
n_train_all = len(train_dataset)
train_index = np.random.permutation(n_train_all)[: int(0.2 * n_train_all)]
train_dataset = Subset(train_dataset, train_index)

# 25% validation data
n_valid_all = len(valid_dataset)
valid_index = np.random.permutation(n_valid_all)[: int(0.25 * n_valid_all)]
valid_dataset = Subset(valid_dataset, valid_index)

n_train_all, len(train_dataset), n_valid_all, len(valid_dataset)

(2601, 520, 1395, 348)
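
The `np.random.permutation` calls above are unseeded, so the selected subset changes from run to run. As a minimal sketch of a reproducible alternative, the snippet below draws the same fractions with a seeded generator from the standard library. The helper `sample_indices` and the seed value are hypothetical choices for illustration.

```python
import random


def sample_indices(n_total, fraction, seed):
    """Pick int(fraction * n_total) distinct indices, reproducibly."""
    rng = random.Random(seed)
    n_keep = int(fraction * n_total)
    return rng.sample(range(n_total), n_keep)


# Dataset sizes taken from the output above; the seed is arbitrary.
train_index = sample_indices(2601, 0.2, seed=42)
valid_index = sample_indices(1395, 0.25, seed=42)
print(len(train_index), len(valid_index))  # 520 348
```

The resulting index lists can be passed to `Subset` in the same way as the NumPy permutations.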

### Load the model for fine-tuning

We first load the TTM model available on Hugging Face using the model path and revision set above. This example has one target channel (`sales`, forecast separately for each of the three states) and several exogenous channels, and we set the configuration to take this into account. Note that we also enable channel mixing in the decoder and forecast channel mixing. This allows the decoder to be tuned to capture interactions between the channels and to adjust the forecasts based on interactions with the exogenous variables.

In [30]:
finetune_forecast_model = TinyTimeMixerForPrediction.from_pretrained(
    TTM_MODEL_PATH,
    revision=REVISION,
    context_length=context_length,
    prediction_filter_length=forecast_length,
    num_input_channels=tsp.num_input_channels,
    decoder_mode="mix_channel",  # exog: set to mix_channel to mix channels in the history
    prediction_channel_indices=tsp.prediction_channel_indices,
    exogenous_channel_indices=tsp.exogenous_channel_indices,
    fcm_context_length=1,  # exog: lag length used in the exogenous fusion; e.g., if today's sales can be affected by a discount within +/- 2 days, set this to 2
    fcm_use_mixer=True,  # exog: try True (first option) or False
    fcm_mix_layers=2,  # exog: number of layers for exogenous mixing
    enable_forecast_channel_mixing=True,  # exog: set to True to enable exogenous mixing
    fcm_prepend_past=True,  # exog: set to True to include lags from the history during exogenous infusion
)

Some weights of TinyTimeMixerForPrediction were not initialized from the model checkpoint at ibm-granite/granite-timeseries-ttm-r2 and are newly initialized: ['decoder.decoder_block.mixers.0.channel_feature_mixer.gating_block.attn_layer.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.gating_block.attn_layer.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc1.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc1.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc2.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.mlp.fc2.weight', 'decoder.decoder_block.mixers.0.channel_feature_mixer.norm.norm.bias', 'decoder.decoder_block.mixers.0.channel_feature_mixer.norm.norm.weight', 'decoder.decoder_block.mixers.1.channel_feature_mixer.gating_block.attn_layer.bias', 'decoder.decoder_block.mixers.1.channel_feature_mixer.gating_block.attn_layer.weight', 'decoder.decoder_block.mixers.1.channel_feature_mixer.mlp.fc1.bias', 'dec

### Optional: Freeze the TTM Backbone

During fine-tuning we can optionally freeze the backbone and tune only the parameters in the decoder. This reduces the number of parameters being tuned and preserves what the encoder learned during pretraining. In this notebook we keep the TTM backbone unfrozen, but we include the code for users who want to experiment with freezing it.

In [31]:
freeze_backbone = False
if freeze_backbone:
    print(
        "Number of params before freezing backbone",
        count_parameters(finetune_forecast_model),
    )

    # Freeze the backbone of the model
    for param in finetune_forecast_model.backbone.parameters():
        param.requires_grad = False

    # Count params
    print(
        "Number of params after freezing the backbone",
        count_parameters(finetune_forecast_model),
    )

### Set up a Trainer for Fine-tuning

Before configuring the Trainer, we set the number of epochs and the batch size, and we use `optimal_lr_finder` to suggest a learning rate for fine-tuning.

In [32]:
num_epochs = 50  # Ideally, use more epochs (preferably offline on a GPU for faster computation)
batch_size = 64

learning_rate, finetune_forecast_model = optimal_lr_finder(
    finetune_forecast_model,
    train_dataset,
    batch_size=batch_size,
    enable_prefix_tuning=True,
)
print("OPTIMAL SUGGESTED LEARNING RATE =", learning_rate)

INFO:p-27337:t-8601880640:lr_finder.py:optimal_lr_finder:LR Finder: Running learning rate (LR) finder algorithm. If the suggested LR is very low, we suggest setting the LR manually.
INFO:p-27337:t-8601880640:lr_finder.py:optimal_lr_finder:LR Finder: Using CPU.
INFO:p-27337:t-8601880640:lr_finder.py:optimal_lr_finder:LR Finder: Suggested learning rate = 0.0005214008287999684


OPTIMAL SUGGESTED LEARNING RATE = 0.0005214008287999684


### Train the Model

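Here we configure the `Trainer` with early stopping and a one-cycle learning rate schedule, fine-tune the model on the training data, and then evaluate it on the test set.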

In [33]:
OUT_DIR = "ttm_finetuned_models/"

print(f"Using learning rate = {learning_rate}")
finetune_forecast_args = TrainingArguments(
    output_dir=os.path.join(OUT_DIR, "output"),
    overwrite_output_dir=True,
    learning_rate=learning_rate,
    num_train_epochs=num_epochs,
    do_eval=True,
    eval_strategy="epoch",
    per_device_train_batch_size=batch_size,
    per_device_eval_batch_size=2 * batch_size,
    dataloader_num_workers=4,
    report_to=None,
    save_strategy="epoch",
    logging_strategy="epoch",
    save_total_limit=1,
    logging_dir=os.path.join(OUT_DIR, "logs"),  # Make sure to specify a logging directory
    load_best_model_at_end=True,  # Load the best model when training ends
    metric_for_best_model="eval_loss",  # Metric to monitor for early stopping
    greater_is_better=False,  # For loss
    use_cpu=True,
)

# Create the early stopping callback
early_stopping_callback = EarlyStoppingCallback(
    early_stopping_patience=10,  # Number of epochs with no improvement after which to stop
    early_stopping_threshold=0.0,  # Minimum improvement required to consider as improvement
)
tracking_callback = TrackingCallback()

# Optimizer and scheduler
optimizer = AdamW(finetune_forecast_model.parameters(), lr=learning_rate)
scheduler = OneCycleLR(
    optimizer,
    learning_rate,
    epochs=num_epochs,
    steps_per_epoch=math.ceil(len(train_dataset) / (batch_size)),
)

finetune_forecast_trainer = Trainer(
    model=finetune_forecast_model,
    args=finetune_forecast_args,
    train_dataset=train_dataset,
    eval_dataset=valid_dataset,
    callbacks=[early_stopping_callback, tracking_callback],
    optimizers=(optimizer, scheduler),
)

# Fine tune
finetune_forecast_trainer.train()

finetune_forecast_trainer.evaluate(test_dataset)

Using learning rate = 0.0005214008287999684


INFO:p-93600:t-8601880640:config.py:<module>:PyTorch version 2.6.0 available.


| Epoch | Training Loss | Validation Loss |
|---|---|---|
| 1 | 0.659 | 0.579633 |
| 2 | 0.6119 | 0.522812 |
| 3 | 0.557 | 0.470904 |
| 4 | 0.5186 | 0.471321 |
| 5 | 0.5086 | 0.493303 |
| 6 | 0.4743 | 0.44996 |
| 7 | 0.4812 | 0.440951 |
| 8 | 0.4585 | 0.460977 |
| 9 | 0.4677 | 0.437332 |
| 10 | 0.4446 | 0.438951 |


[TrackingCallback] Mean Epoch Time = 34.879443016919225 seconds, Total Train Time = 2605.0954711437225


{'eval_loss': 0.35255715250968933,
 'eval_runtime': 31.9955,
 'eval_samples_per_second': 43.694,
 'eval_steps_per_second': 0.344,
 'epoch': 44.0}
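
To illustrate how patience-based early stopping behaves, the sketch below replays the ten validation losses listed in the training log through a simple stopping rule. The function `early_stopping_epoch` is a hypothetical stand-in written for this example, not the `EarlyStoppingCallback` implementation, and the patience of 2 in the second call is chosen only to show a stop being triggered.

```python
def early_stopping_epoch(val_losses, patience, threshold=0.0):
    """Return (stop_epoch, best_epoch), both 1-indexed.

    stop_epoch is None if the patience is never exhausted.
    """
    best_loss = float("inf")
    best_epoch = None
    bad_epochs = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if best_loss - loss > threshold:
            # Improvement: remember it and reset the counter.
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch, best_epoch
    return None, best_epoch


# Validation losses for epochs 1-10 from the training log above.
val_losses = [
    0.579633, 0.522812, 0.470904, 0.471321, 0.493303,
    0.44996, 0.440951, 0.460977, 0.437332, 0.438951,
]

print(early_stopping_epoch(val_losses, patience=10))  # (None, 9)
print(early_stopping_epoch(val_losses, patience=2))   # (5, 3)
```

With a patience of 10, training continues past these ten epochs. A much smaller patience would have stopped at epoch 5 and missed the lower losses reached later.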

### Plot the Predictions vs. Actuals

Plot the predictions vs. actuals for some randomly sampled time intervals in the test dataset.

In [None]:
plot_predictions(
    model=finetune_forecast_trainer.model,
    dset=test_dataset,
    plot_prefix="test_finetune",
    channel=0,
)