# LLM-based Approaches for Forecasting: Chronos

**Approximate Learning Time**: Up to 2 hours

---

In this module, we will explore LLM-based approaches for time series forecasting. LLMs (Large Language Models) are trained on vast amounts of online data to predict the next token in a sequence. Because forecasting is also a next-step prediction problem, LLMs are a promising candidate for forecasting tasks, and there have been several attempts to leverage them for time series forecasting.

Broadly, the community has proposed three types of approaches:

1. **Directly prompting pre-trained LLMs** (e.g., GPT-4, LLaMA):
   This method involves using pre-trained LLMs for forecasting by simply prompting them. While we will discuss this approach here, the hands-on tutorial for it is left as an exercise in the `LLMTime` folder. That tutorial is based on the work **LLMTime** ([Gruver et al. (2023)](https://arxiv.org/pdf/2310.07820)), though it uses a different dataset and requires API keys. You are encouraged to follow it and apply it to the dataset we've been using throughout this tutorial.

2. **Training LLMs directly on massive time series datasets**:
   This approach involves using LLMs specifically trained for time series forecasting, as seen in models like **Chronos** ([Ansari et al. (2024)](https://arxiv.org/abs/2403.07815)), which we will cover in this notebook. These models are used in much the same way as one would prompt a pre-trained LLM. We will compare results from **Chronos** with other models we have explored so far.

3. **Reprogramming LLMs like GPT-2**:
   The focus of the next notebook is on reprogramming existing LLMs for time series forecasting. We will discuss **TimeLLM**, a recently published approach by [Jin et al. (2023)](https://arxiv.org/abs/2310.01728) where the authors reprogram open-source models like GPT-2 to function as time series forecasters. Because this approach is computationally demanding (it requires loading large models such as GPT-2), we will only briefly touch on the implementation in the next notebook.


---

## In-Context Learning

Large Language Models (LLMs) are built using transformer architectures. After being trained on a vast corpus of text to predict the next token in a sequence, these models are ready to perform inference based on the input provided to them.

We interact with LLMs by inputting text and receiving a generated response. For example, if we want to perform sentiment classification, we can prompt the model in different ways:

<ins>**Zero-shot learning**</ins>: The model completes the task without any prior examples or guidance.

```txt
Classify the following sentence as either negative or positive:
"I received a broken chair."
```

In this case, the LLM may not have seen this exact task during training, but it can still figure out how to perform it. This is why it's called “learning” – the LLM generalizes to a task it hasn't directly encountered before. Zero-shot learning is useful for tasks that the model may not have explicitly seen during training.

In <ins>**few-shot learning**</ins>, the prompt includes a few examples to guide the model. Here's an example of a 2-shot learning prompt:

```txt
You are an AI assistant tasked with classifying the sentiment of sentences.
You should respond with "Positive" or "Negative".

Here are some examples:
Sentence: "I had a bad experience."
Your response: Negative

Sentence: "I had a blast!"
Your response: Positive

Now, classify this sentence:
"I received a broken chair."
```

LLMs demonstrate a surprising ability to learn from context within the prompt. Although we don't have a concrete theory on how LLMs perform in-context learning, it’s speculated that they build a generalized world model during training, which allows them to adapt quickly to unseen tasks.
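
As a small illustration of how such a prompt can be assembled programmatically, here is a minimal sketch. The helper name `build_few_shot_prompt` is hypothetical and not part of any library; it only reproduces the 2-shot prompt shown above from a list of labelled examples.

```python
def build_few_shot_prompt(examples, query):
    """Assemble a k-shot sentiment prompt from (sentence, label) pairs."""
    lines = [
        "You are an AI assistant tasked with classifying the sentiment of sentences.",
        'You should respond with "Positive" or "Negative".',
        "",
        "Here are some examples:",
    ]
    # One block per labelled example; the number of examples is the "k" in k-shot.
    for sentence, label in examples:
        lines += [f'Sentence: "{sentence}"', f"Your response: {label}", ""]
    lines += ["Now, classify this sentence:", f'"{query}"']
    return "\n".join(lines)


examples = [("I had a bad experience.", "Negative"), ("I had a blast!", "Positive")]
prompt = build_few_shot_prompt(examples, "I received a broken chair.")
print(prompt)
```

Passing an empty list of examples yields a prompt with no demonstrations, which is close in spirit to the zero-shot setting.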

--- 

## Zero-shot Forecasting

In this tutorial, we'll explore how LLMs can be used for time series forecasting through prompting. Time series data can be converted into a tokenized form that LLMs expect as input. LLMs, trained on billions of words and capable of predicting the next token in a sequence, can use their predictive abilities for time series forecasting.

Recently, researchers have reframed time series forecasting as a token prediction task. This means that time series data is tokenized, fed into the LLM, and the model predicts future tokens, which are then decoded back into numerical values.

The detailed process includes:
1. **Scaling**: Time series values are scaled (e.g., using min-max scaling) to keep values within a suitable range.
   
2. **Tokenization**: The scaled values are tokenized into strings that LLMs can understand. This tokenization process is specific to the LLM being used.

3. **Prediction**: The LLM predicts the next tokens based on the input sequence. In this framework, **forecasting is reframed as a classification task**, where the LLM selects the most likely next tokens from a distribution.
   
4. **Decoding**: The predicted tokens are then decoded back into strings, which are parsed into numerical values.

5. **Inverse scaling**: Finally, the decoded values are converted back to their original scale through **inverse scaling**, restoring the values to their real-world range.


The diagram below illustrates this prompting flow:

<div style="text-align: center; padding: 20px;">
<img src="LLMTime/img/llmtime.png" style="max-width: 70%; clip-path: inset(2px); height: auto; border-radius: 15px; box-shadow: 0px 4px 8px rgba(0, 0, 0, 0.1);">
</div>
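
The five steps above can be sketched end to end in a few lines. This is a simplified illustration, not the LLMTime implementation: the function names `encode` and `decode` are hypothetical, the serialization format (fixed-width digits separated by commas) is an assumption, and the LLM call is replaced by a hard-coded continuation string.

```python
import numpy as np


def encode(values, precision=2):
    """Min-max scale to [0, 1], then serialize as fixed-width digit strings."""
    values = np.asarray(values, dtype=float)
    lo = values.min()
    span = values.max() - lo
    if span == 0:
        span = 1.0  # constant series: avoid division by zero
    scaled = (values - lo) / span
    width = precision + 1
    # Drop the decimal point: 0.50 becomes "050" when precision=2.
    tokens = [f"{int(round(v * 10**precision)):0{width}d}" for v in scaled]
    return ", ".join(tokens), lo, span


def decode(text, lo, span, precision=2):
    """Parse the digit strings back into numbers and undo the scaling."""
    scaled = np.array([int(tok) for tok in text.split(",") if tok.strip()], dtype=float)
    return scaled / 10**precision * span + lo


history = [1.0, 1.5, 2.0, 1.5]
prompt, lo, span = encode(history)
print(prompt)  # 000, 050, 100, 050

# Placeholder for the text an LLM might return when asked to continue the prompt.
continuation = "075, 025"
forecast = decode(continuation, lo, span)
print(forecast)  # [1.75 1.25]
```

The scaling parameters (`lo`, `span`) are computed from the history only and reused for decoding, so the forecast lands back on the original scale.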

### LLMTime: Zero-shot Forecasting with GPT-4

In a recent study, LLMTime by Gruver et al. (2023) used GPT-4 and LLaMA for time series forecasting. These models, trained extensively on online data, were prompted to predict future values in time series data. The authors claim that LLMs prefer simpler explanations, which helps them generate accurate forecasts.

The LLMTime tutorial, included in the accompanying folder, uses weather data to predict average maximum temperature. However, you are free to experiment with any dataset of your choice.

### Chronos LLM for Zero-shot Forecasting

Chronos is an LLM specifically trained on time series data. A large collection of both public and synthetically created time series datasets was scaled and tokenized, similarly to the process described above. Chronos was trained on these quantized time series to predict future values.

Unlike GPT-4, Chronos is specifically designed for univariate time series forecasting and ignores time and frequency information, treating the time series purely as a sequence of values.
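
To make "scaled and quantized" more concrete, the sketch below scales a series by its mean absolute value and maps each scaled value to the index of a uniform bin, which then plays the role of a token ID. This follows the general idea described in the Chronos paper, but it is only an illustration: the function names are hypothetical, and `NUM_BINS`, `LOW`, and `HIGH` are illustrative values rather than the settings of the released models.

```python
import numpy as np

NUM_BINS = 100           # illustrative; not the vocabulary size of the real model
LOW, HIGH = -15.0, 15.0  # illustrative range covered by the bins (in scaled units)


def quantize(context):
    """Mean-scale a series and map each value to a uniform bin index (token ID)."""
    context = np.asarray(context, dtype=float)
    scale = np.mean(np.abs(context))
    if scale == 0:
        scale = 1.0  # all-zero series: avoid division by zero
    scaled = context / scale
    edges = np.linspace(LOW, HIGH, NUM_BINS + 1)
    centers = (edges[:-1] + edges[1:]) / 2
    # np.digitize returns 1-based bin positions; clip values outside [LOW, HIGH].
    token_ids = np.clip(np.digitize(scaled, edges) - 1, 0, NUM_BINS - 1)
    return token_ids, scale, centers


def dequantize(token_ids, scale, centers):
    """Map token IDs back to bin centers and undo the scaling."""
    return centers[token_ids] * scale


context = np.array([10.0, 12.0, 9.0, 11.0])
token_ids, scale, centers = quantize(context)
recon = dequantize(token_ids, scale, centers)
print(token_ids)
print(recon)
```

The reconstruction differs from the input by at most half a bin width (times the scale), which is the precision lost when a real value is turned into a discrete token.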

In **Table 2** in Appendix B of Ansari et al. (2024), you’ll find the datasets Chronos was trained on. They are categorized into:
- **Pretraining-only datasets**: Used only for model training.
- **In-domain evaluation datasets**: Partly used for training, with the later steps of each series held out for evaluation.
- **Zero-shot evaluation datasets**: These datasets were not seen during training.

For this tutorial, we are focusing on the **Exchange Rate dataset**, which Chronos did not encounter during training, making it a zero-shot evaluation task.


**References**:

[(Gruver et al. 2023) Large Language Models Are Zero-Shot Time Series Forecasters](https://arxiv.org/pdf/2310.07820)

[(Jin et al. 2023) Time-LLM: Time Series Forecasting by Reprogramming Large Language Models](https://arxiv.org/abs/2310.01728)

[(Ansari et al. 2024) Chronos: Learning the Language of Time Series](https://arxiv.org/abs/2403.07815)

--- 

Let's load the log daily returns of exchange rates, and split the data into train, validation, and test subsets!


In [None]:
import pathlib
import numpy as np
import pandas as pd
import torch
from chronos import ChronosPipeline

from termcolor import colored
import sys; sys.path.append("../")
import utils

# WARNING: To compare different models on the same horizon, keep these values the same across the notebooks
FORECASTING_HORIZON = [4, 8, 12] # weeks 
MAX_FORECASTING_HORIZON = max(FORECASTING_HORIZON)

PREDICTION_LENGTH = MAX_FORECASTING_HORIZON

DIRECTORY_PATH_TO_SAVE_RESULTS = pathlib.Path('../results/DIY/').resolve()
MODEL_NAME = "Chronos"

RESULTS_DIRECTORY = DIRECTORY_PATH_TO_SAVE_RESULTS / MODEL_NAME
if RESULTS_DIRECTORY.exists():
    print(colored(f'Directory {str(RESULTS_DIRECTORY)} already exists.'
           '\nThis notebook will overwrite results in the same directory.'
           '\nYou can also create a new directory if you want to keep this directory untouched.'
           ' Just change the `MODEL_NAME` in this notebook.\n', "red" ))
else:
    RESULTS_DIRECTORY.mkdir(parents=True)

data, transformed_data = utils.load_tutotrial_data(dataset='exchange_rate', log_transform=True)
data = transformed_data

train_val_data = data.iloc[:-MAX_FORECASTING_HORIZON]
train_data, val_data = train_val_data.iloc[:-MAX_FORECASTING_HORIZON], train_val_data.iloc[-MAX_FORECASTING_HORIZON:]
test_data = data.iloc[-MAX_FORECASTING_HORIZON:]
print(f"Number of steps in training data: {len(train_data)}\nNumber of steps in validation data: {len(val_data)}\nNumber of steps in test data: {len(test_data)}")


%load_ext autoreload
%autoreload 2

--- 

## Forecast


Chronos is a pre-trained LLM specifically designed for time series forecasting. This means we only need to prompt the LLM to generate forecasts, making the process straightforward.

For this tutorial, we will use the smaller version of Chronos, which is based on the T5 model ([Wikipedia](https://en.wikipedia.org/wiki/T5_(language_model))). If you are running the code on CPUs, ensure that the device is set to `"cpu"`.

One important aspect to consider is how Chronos accepts the time series input. The following example is adapted from their [official documentation](https://github.com/amazon-science/chronos-forecasting?tab=readme-ov-file), and we will use it as a reference in our tutorial. Since Chronos performs probabilistic forecasting, it requires specifying the number of samples you want to generate. Keep in mind that this step may take around 2 minutes to complete.
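
To see what a sample-based forecast looks like before running the model, here is a minimal sketch that uses random numbers in place of real Chronos output. It assumes the `[num_series, num_samples, prediction_length]` layout and shows how the samples can be reduced to a median point forecast and an 80% interval; the array sizes are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
num_series, num_samples, prediction_length = 3, 100, 12

# Stand-in for model output: one row of sampled trajectories per series.
samples = rng.normal(size=(num_series, num_samples, prediction_length))

# Reduce over the sample axis to get a point forecast and an uncertainty band.
median = np.median(samples, axis=1)
low, high = np.quantile(samples, [0.1, 0.9], axis=1)

print(median.shape)  # (3, 12)
```

More samples give smoother quantile estimates at the cost of longer inference time.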


In [None]:
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-small",
    device_map="mps",  # "mps" targets Apple Silicon; use "cpu" for CPU inference or "cuda" for NVIDIA GPUs
    torch_dtype=torch.bfloat16,
)

# context must be either a 1D tensor, a list of 1D tensors,
# or a left-padded 2D tensor with batch as the first dimension
# forecast shape: [num_series, num_samples, prediction_length]
forecast = pipeline.predict(
    context=torch.tensor(train_val_data.transpose().values),
    prediction_length=MAX_FORECASTING_HORIZON,
    num_samples=100,
)

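# The model outputs a sample-based distribution, so we keep the median as a point estimate.
# Reorder to [num_samples, prediction_length, num_series], then take the median over samples.
forecast = forecast.permute(1, 2, 0)
forecast_median = np.median(forecast.cpu().numpy(), axis=0)

# Note: the column names end in "_mean", but the values stored are medians.
AUGMENTED_COL_NAMES = [f"{MODEL_NAME}_{col}_mean" for col in data.columns]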
test_predictions_df = pd.DataFrame(forecast_median, columns=AUGMENTED_COL_NAMES, index=test_data.index)

# save them to the directory
test_predictions_df.to_csv(f"{str(RESULTS_DIRECTORY)}/predictions.csv", index=True)
print(test_predictions_df.shape)
test_predictions_df.head()

--- 

## Evaluate 

Let's compute the metrics by comparing the predictions with the target data. Note that we have to rename the columns of the dataframe to match the column names the function expects.
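
For reference, MASE (mean absolute scaled error) divides the forecast's mean absolute error by the in-sample mean absolute error of a naive forecast that repeats the previous value. The sketch below is a standalone illustration of that definition; the internals of `utils.get_mase_metrics` are not shown in this notebook and may differ, and the `seasonality` parameter is an assumption added for generality.

```python
import numpy as np


def mase(history, target, prediction, seasonality=1):
    """MAE of the forecast divided by the in-sample MAE of a (seasonal) naive forecast."""
    history = np.asarray(history, dtype=float)
    target = np.asarray(target, dtype=float)
    prediction = np.asarray(prediction, dtype=float)
    naive_mae = np.mean(np.abs(history[seasonality:] - history[:-seasonality]))
    return np.mean(np.abs(target - prediction)) / naive_mae


history = [1.0, 2.0, 3.0, 4.0]
target = [5.0, 6.0]

print(mase(history, target, [5.0, 6.0]))  # 0.0 (perfect forecast)
print(mase(history, target, [4.0, 4.0]))  # 1.5 (repeating the last observed value)
```

Values below 1 mean the forecast beats the naive one-step baseline on average; values above 1 mean it does worse.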

In [None]:
# evaluate metrics
target_data = data[-MAX_FORECASTING_HORIZON:]
model_metrics, records = utils.get_mase_metrics(
    historical_data=train_val_data,
    test_predictions=test_predictions_df.rename(
            columns={x:x.split("_")[1] for x in test_predictions_df.columns
        }),
    target_data=target_data,
    forecasting_horizons=FORECASTING_HORIZON,
    columns=data.columns, 
    model_name=MODEL_NAME
)
records = pd.DataFrame(records)

records.to_csv(f"{str(RESULTS_DIRECTORY)}/metrics.csv", index=False)
records[['col', 'horizon', 'mase']].pivot(index=['horizon'], columns='col')

--- 

## Compare Models

In [None]:
utils.display_results(path=DIRECTORY_PATH_TO_SAVE_RESULTS, metric='mase')

--- 

## Plot Forecasts

In [None]:
fig, axs = utils.plot_forecasts(
    historical_data=train_val_data,
    forecast_directory_path=DIRECTORY_PATH_TO_SAVE_RESULTS,
    target_data=target_data,
    columns=data.columns,
    n_history_to_plot=10, 
    forecasting_horizon=MAX_FORECASTING_HORIZON,
    dpi=200,
    exclude_models=['LSTM'],
    plot_se=False
)

--- 

## Conclusion

In this tutorial, we explored two approaches for using LLMs in time series forecasting. First, we learned how to leverage general-purpose LLMs, such as GPT-4 and LLaMA, by framing forecasting as a token prediction task. Then, we applied Chronos, an LLM specifically pre-trained for time series data, to perform forecasting tasks.

---

## Exercises

- Use the mean of the forecast samples instead of the median as the point estimate, and check how it performs on MASE.

- Plot the standard errors of the forecast distribution.

- Follow the LLMTime tutorial to perform forecasting using GPT-4 or LLaMA.

- Apply a normalization procedure (e.g., **min-max scaling**) to the data, ensuring that only the training data is used for fitting the scaler. Perform the modeling process on the normalized data and, after generating the final model's predictions, invert the normalization to return the output to its original scale. See `sklearn.preprocessing.MinMaxScaler` ([documentation](https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html)).

- Additionally, perform the modeling on the **raw data**, without applying any transformation (such as converting it into log daily returns), to compare results directly with the untransformed dataset.


---

## Next Steps

Head to the last notebook in this module to learn how to reprogram LLMs like GPT-2 for time series forecasting.

---