# MANATEE(lm): Market Analysis based on language model architectures

[![Python](https://img.shields.io/pypi/pyversions/tensorflow.svg)](https://badge.fury.io/py/tensorflow)
![Maintainer](https://img.shields.io/badge/maintainer-@louisbrulenaudet-blue)

This Colab project applies large language model (LLM) architectures to time series forecasting, based on the paper "Chronos: Learning the Language of Time Series" from Amazon Web Services and Amazon Supply Chain Optimization Technologies.

From the source:
Time series forecasting is an essential component of decision-making across various domains, including retail, energy, finance, healthcare, climate science, among others. Traditionally, forecasting has been dominated by statistical models such as ARIMA and ETS.

The emergence of large language models (LLMs) with zero-shot learning capabilities has ignited interest in developing “foundation models” for time series. In the context of LLMs, this interest has been pursued through two main avenues: directly prompting pretrained LLMs in natural language and fine-tuning LLMs for time series tasks.

In this work, we take a step back and ask: what are the fundamental differences between a language model that predicts the next token, and a time series forecasting model that predicts the next values? Despite the apparent distinction — tokens from a finite dictionary versus values from an unbounded, usually continuous domain — both endeavors fundamentally aim to model the sequential structure of the data to predict future patterns. Shouldn't good language models “just work” on time series? This naive question prompts us to challenge the necessity of time-series-specific modifications, and answering it led us to develop Chronos, a language modeling framework minimally adapted for time series forecasting. Chronos tokenizes time series into discrete bins through simple scaling and quantization of real values. In this way, we can train off-the-shelf language models on this “language of time series,” with no changes to the model architecture. Remarkably, this straightforward approach proves to be effective and efficient, underscoring the potential for language model architectures to address a broad range of time series problems with minimal modifications.

[...]
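
The scaling and quantization step described in the excerpt can be illustrated with a short sketch. This is a simplified illustration, not the Chronos implementation: the function names `tokenize` and `detokenize`, the bin count, and the bin range are assumptions chosen for readability, and special tokens are left out.

```python
import numpy as np


def tokenize(values, n_bins=10, low=-5.0, high=5.0):
    """Scale a series by its mean absolute value, then map it to bin ids."""
    values = np.asarray(values, dtype=np.float64)
    scale = np.mean(np.abs(values))
    scale = scale if scale > 0 else 1.0  # avoid dividing by zero
    scaled = values / scale

    # Uniform bin edges over [low, high]; values outside the range
    # fall into the first or last bin.
    edges = np.linspace(low, high, n_bins + 1)
    tokens = np.digitize(scaled, edges[1:-1])  # ids in [0, n_bins - 1]
    return tokens, scale, edges


def detokenize(tokens, scale, edges):
    """Map bin ids back to real values using the bin centers."""
    centers = (edges[:-1] + edges[1:]) / 2
    return centers[tokens] * scale


tokens, scale, edges = tokenize([1.0, 2.0, 3.0, 4.0, 5.0])
reconstructed = detokenize(tokens, scale, edges)
```

The round trip is lossy: each value is recovered only up to the width of its bin, which is the price paid for turning real values into a finite vocabulary.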

## Citing this project

If you use this code in your research, please use the following BibTeX entry.

```BibTeX
@misc{louisbrulenaudet2023,
  author =       {Louis Brulé Naudet},
  title =        {A time series forecasting based on language model architectures showcase},
  year =         {2024}
}
```

## Feedback

If you have any feedback, please reach out at [louisbrulenaudet@icloud.com](mailto:louisbrulenaudet@icloud.com).

# Configuration

## Efficient Data Manipulation and Analysis
Polars is used for data manipulation because it is fast and memory-efficient on tabular and time series data:
- Polars parallelizes operations, running data transformations and computations across multiple CPU cores. This reduces computation time, which helps with time-sensitive tasks such as time series forecasting;
- its expressive API covers data cleaning, transformation, and analytics with little code;
- it provides dedicated functions for temporal data, such as date ranges and offsets, which this notebook uses to build the forecast index;
- it uses memory-efficient data structures, which makes it suitable for large datasets even in memory-constrained environments.

## Deep Learning Framework
PyTorch (imported as `torch`) is the deep learning framework behind the forecasting model. In this notebook it builds the context tensor passed to the Chronos pipeline and sets the data type (`torch.bfloat16`) used when loading the model.

## Data Visualization
For interactive and insightful data visualization, the code utilizes plotly.express and plotly.graph_objects. These libraries offer a wide range of visualization options, allowing users to create interactive plots and charts for better data understanding.

## Financial Data Access
The alpaca package provides access to financial data sources such as stock and cryptocurrency historical data, latest quotes, and trading information. It offers convenient APIs for fetching data and executing trading orders, making it a valuable asset for financial data analysis and algorithmic trading.


In [1]:
!pip3 install alpaca-py polars plotly
!pip install git+https://github.com/amazon-science/chronos-forecasting.git

Collecting alpaca-py
  Downloading alpaca_py-0.19.0-py3-none-any.whl (110 kB)
Collecting sseclient-py<2.0.0,>=1.7.2 (from alpaca-py)
  Downloading sseclient_py-1.8.0-py2.py3-none-any.whl (8.8 kB)
Collecting websockets<12.0.0,>=11.0.3 (from alpaca-py)
  Downloading websockets-11.0.3-cp310-cp310-manylinux_2_5_x86_64.manylinux1_x86_64.manylinux_2_17_x86_64.manylinux2014_x86_64.whl (129 kB)
Installing collected packages: sseclient-py, websockets, alpaca-py
Successfully installed alpaca-py-0.19.0 ssec

In [4]:
import json
import os

from datetime import datetime
from time import sleep

import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import polars as pl
import torch

from alpaca.data import (
    CryptoHistoricalDataClient,
    StockHistoricalDataClient,
    StockLatestQuoteRequest
)
from alpaca.data.requests import StockBarsRequest
from alpaca.data.timeframe import TimeFrame
from alpaca.trading.client import TradingClient
from alpaca.trading.enums import OrderSide, TimeInForce
from alpaca.trading.requests import GetAssetsRequest
from chronos import ChronosPipeline
from google.colab import userdata
from plotly.subplots import make_subplots

# notebook.force_gpu_runtime()

# Class definition

In [5]:
class Security:
    """
    A class representing a financial security for fetching and plotting historical data.

    Attributes
    ----------
    api_key : str
        The Alpaca API key for authentication.

    secret_key : str
        The Alpaca secret key for authentication.

    symbol : str
        The symbol of the security.

    stock_client : StockHistoricalDataClient
        The Alpaca client for fetching historical stock data.

    dataframe : pl.DataFrame
        DataFrame containing the fetched historical data.
    """
    def __init__(
        self,
        api_key: str,
        secret_key: str,
        symbol: str
    ) -> None:
        """
        Initialize the Security object with API keys and symbol.

        Parameters
        ----------
        api_key : str
            The Alpaca API key.

        secret_key : str
            The Alpaca secret key.

        symbol : str
            The symbol of the security.
        """
        self.api_key = api_key
        self.secret_key = secret_key
        self.symbol = symbol
        self.stock_client = StockHistoricalDataClient(
            api_key=self.api_key,
            secret_key=self.secret_key
        )
        self.dataframe = None


    def fetch(
        self,
        start: datetime,
        timeframe: TimeFrame = TimeFrame.Day
    ) -> pl.DataFrame | None:
        """
        Fetches historical data for the security.

        Parameters
        ----------
        start : datetime
            The start date for fetching historical data.

        timeframe : TimeFrame, optional
            The timeframe for fetching historical data. Default is Day.

        Returns
        -------
        pl.DataFrame or None
            A DataFrame containing the fetched historical data, or None if
            the request fails.
        """
        try:
            request_params = StockBarsRequest(
                symbol_or_symbols=self.symbol,
                timeframe=timeframe,
                start=start,
            )

            quotes = self.stock_client.get_stock_bars(request_params)
            self.dataframe = pl.from_dicts(quotes[self.symbol])

            return self.dataframe

        except Exception as e:
            print(f"Error fetching historical data: {e}")
            return None

# Data fetching

In [6]:
security = Security(
    api_key=userdata.get("alpaca_api_key"),
    secret_key=userdata.get("alpaca_api_secret"),
    symbol="AAPL"
)

dataframe = security.fetch(
    start=datetime(2020, 9, 1)
)

dataframe

| symbol | timestamp | open | high | low | close | volume | trade_count | vwap |
|---|---|---|---|---|---|---|---|---|
| str | datetime[μs] | f64 | f64 | f64 | f64 | f64 | f64 | f64 |
| AAPL | 2020-09-01 04:00:00 | 132.76 | 134.8 | 130.53 | 134.18 | 1.62353762e8 | 1.494294e6 | 132.902394 |
| AAPL | 2020-09-02 04:00:00 | 137.59 | 137.98 | 127.0 | 131.4 | 2.10023657e8 | 1.843494e6 | 131.705031 |
| AAPL | 2020-09-03 04:00:00 | 126.91 | 128.84 | 120.5 | 120.88 | 2.74068039e8 | 2.360762e6 | 123.441146 |
| AAPL | 2020-09-04 04:00:00 | 120.1 | 123.7 | 110.89 | 120.96 | 3.44528755e8 | 2.962331e6 | 118.038516 |
| AAPL | 2020-09-08 04:00:00 | 114.16 | 118.99 | 112.68 | 112.82 | 2.45981952e8 | 2.00368e6 | 115.257751 |
| AAPL | 2020-09-09 04:00:00 | 117.26 | 119.14 | 115.26 | 117.32 | 1.90263275e8 | 1.308845e6 | 117.4086 |
| AAPL | 2020-09-10 04:00:00 | 120.36 | 120.5 | 112.5 | 113.42 | 1.92230707e8 | 1.444583e6 | 115.976572 |
| AAPL | 2020-09-11 04:00:00 | 114.57 | 115.23 | 110.0 | 112.0 | 1.87577679e8 | 1.401259e6 | 112.057091 |
| AAPL | 2020-09-14 04:00:00 | 114.72 | 115.93 | 112.8 | 115.355 | 1.50551407e8 | 1.013528e6 | 114.627706 |
| AAPL | 2020-09-15 04:00:00 | 118.33 | 118.829 | 113.61 | 115.54 | 1.91579878e8 | 1.290607e6 | 116.415375 |


# Model loading

In [105]:
pipeline = ChronosPipeline.from_pretrained(
    "amazon/chronos-t5-large",
    device_map="auto",
    torch_dtype=torch.bfloat16,
)

# Forecasting

In [106]:
def make_forecast(
    context: torch.Tensor,
    prediction_length: int,
    pipeline
) -> torch.Tensor:
    """
    Generate a forecast using the specified context and prediction length.

    Parameters
    ----------
    context : torch.Tensor or list of torch.Tensor
        The context data used for generating the forecast.

    prediction_length : int
        The length of the forecast.

    pipeline : ChronosPipeline
        The pretrained Chronos forecasting pipeline.

    Returns
    -------
    torch.Tensor
        The sampled forecasts, with shape
        (num_series, num_samples, prediction_length).

    Examples
    --------
    >>> import torch
    >>> from chronos import ChronosPipeline
    >>> context = torch.tensor([1.0, 2.0, 3.0, 4.0, 5.0])
    >>> prediction_length = 20
    >>> pipeline = ChronosPipeline.from_pretrained("amazon/chronos-t5-small")
    >>> forecast = make_forecast(context, prediction_length, pipeline)
    """
    forecast = pipeline.predict(
        context,
        prediction_length
    )

    return forecast


forecast = make_forecast(
    context=torch.tensor(dataframe["close"]),
    prediction_length=20,
    pipeline=pipeline
)

forecast

tensor([[[175.6135, 176.8246, 178.0357, 176.8246, 176.8246, 176.8246, 176.8246,
          175.6135, 176.8246, 176.8246, 176.8246, 178.0357, 178.0357, 180.4580,
          180.4580, 181.6691, 180.4580, 180.4580, 181.6691, 180.4580],
         [173.1912, 171.9801, 169.5578, 170.7690, 171.9801, 170.7690, 168.3467,
          167.1356, 164.7133, 169.5578, 169.5578, 167.1356, 165.9244, 161.0799,
          153.8132, 155.0243, 153.8132, 159.8688, 161.0799, 162.2911],
         [171.9801, 173.1912, 174.4023, 174.4023, 173.1912, 174.4023, 169.5578,
          170.7690, 170.7690, 169.5578, 173.1912, 174.4023, 174.4023, 173.1912,
          174.4023, 175.6135, 174.4023, 174.4023, 173.1912, 171.9801],
         [174.4023, 173.1912, 174.4023, 174.4023, 176.8246, 176.8246, 178.0357,
          180.4580, 180.4580, 176.8246, 175.6135, 176.8246, 179.2469, 178.0357,
          179.2469, 180.4580, 178.0357, 178.0357, 178.0357, 176.8246],
         [174.4023, 176.8246, 171.9801, 175.6135, 176.8246, 178.0357, 175.61

In [107]:
def calculate_quantiles(
    forecast_data: np.ndarray,
    quantiles: tuple = (0.1, 0.5, 0.9),
    axis: int = 0
) -> np.ndarray:
    """
    Calculate quantiles from forecast data.

    Parameters
    ----------
    forecast_data : numpy.ndarray
        Array-like object containing forecast data.

    quantiles : sequence of float, optional
        Quantiles to compute. Default is (0.1, 0.5, 0.9).

    axis : int, optional
        Axis along which to compute quantiles. Default is 0.

    Returns
    -------
    numpy.ndarray
        An array whose first axis indexes the requested quantiles, so it can
        be unpacked into one array per quantile.

    Examples
    --------
    >>> import numpy as np
    >>> forecast_data = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
    >>> low, median, high = calculate_quantiles(forecast_data)
    """
    return np.quantile(forecast_data, quantiles, axis)


low, median, high = calculate_quantiles(
    forecast_data=forecast[0].numpy(),
    quantiles=[0.1, 0.5, 0.9],
    axis=0
)

low, median, high

(array([171.98007202, 170.64783936, 169.43671722, 168.34669495,
        169.19449158, 170.40561371, 168.22558289, 166.89334869,
        164.47109222, 168.58892975, 165.80332642, 164.59220581,
        167.01446075, 160.83771515, 155.99320221, 156.11431427,
        158.17323303, 162.04883728, 163.25997162, 162.16994934]),
 array([174.40234375, 173.19120789, 173.79677582, 174.40234375,
        175.00791168, 175.61347961, 175.61347961, 175.00791168,
        175.61347961, 175.00791168, 173.79677582, 175.00791168,
        176.21903992, 175.00791168, 173.79677582, 173.79677582,
        174.40234375, 174.40234375, 173.19120789, 174.40234375]),
 array([175.61347961, 176.94571228, 176.94571228, 176.94571228,
        176.94571228, 180.45797729, 180.45797729, 182.88023376,
        182.15354919, 181.91132355, 182.88023376, 183.00134583,
        184.33358002, 183.24357147, 182.15354919, 186.75583649,
        186.75583649, 186.75583649, 187.24028625, 186.15027771]))

# Data post-processing

In [109]:
def calculate_forecast_index(
    dataframe: pl.DataFrame,
    offset_days: int,
    interval: str = "1d"
) -> pl.Series:
    """
    Calculate the forecast index based on the last timestamp in the dataframe and an offset.

    Parameters
    ----------
    dataframe : pl.DataFrame
        The input Polars DataFrame containing the timestamps.

    offset_days : int
        The number of days by which to offset the last timestamp.

    interval : str, optional
        The interval for the date range, e.g., "1d" for daily, "1w" for weekly. Default is "1d".

    Returns
    -------
    pl.Series
        A Polars Series named "date" that runs from the last timestamp to the
        offset timestamp, both included.

    Examples
    --------
    >>> from datetime import datetime
    >>> import polars as pl
    >>> dataframe = pl.DataFrame({
    ...     "timestamp": [datetime(2022, 1, 1), datetime(2022, 1, 2), datetime(2022, 1, 3)]
    ... })
    >>> forecast_index = calculate_forecast_index(dataframe, offset_days=20)
    >>> len(forecast_index)  # 2022-01-03 through 2022-01-23, both included
    21
    """
    # Get the last timestamp in the dataframe
    last_timestamp = dataframe.select(
        pl.last(
            "timestamp"
        )
    )

    # Offset the timestamp by the specified number of days
    offset_timestamp = last_timestamp.with_columns(
        offset=pl.col("timestamp").dt.offset_by(f"{offset_days}d")
    )

    # Create a date range from the original timestamp to the offset timestamp
    forecast_index_dates = pl.date_range(
        start=offset_timestamp["timestamp"],
        end=offset_timestamp["offset"],
        interval=interval,
        eager=True
    ).alias("date")

    return forecast_index_dates


forecast_index = calculate_forecast_index(
    dataframe=dataframe,
    offset_days=20
)

forecast_index

date
datetime[μs]
2024-03-15 04:00:00
2024-03-16 04:00:00
2024-03-17 04:00:00
2024-03-18 04:00:00
2024-03-19 04:00:00
2024-03-20 04:00:00
2024-03-21 04:00:00
2024-03-22 04:00:00
2024-03-23 04:00:00
2024-03-24 04:00:00
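
The index above advances by calendar day, so it includes weekends (2024-03-16 and 2024-03-17), while the daily bars fetched earlier only cover trading days. A possible alternative, sketched below with the standard library only, generates the next weekdays instead. The helper name `next_business_days` is hypothetical, and market holidays are not handled.

```python
from datetime import datetime, timedelta


def next_business_days(last_timestamp: datetime, count: int) -> list[datetime]:
    """Return the next `count` weekdays strictly after `last_timestamp`."""
    days = []
    current = last_timestamp
    while len(days) < count:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            days.append(current)
    return days


# Last observed bar in the output above: Friday 2024-03-15 04:00.
business_index = next_business_days(datetime(2024, 3, 15, 4, 0), count=20)
```

This yields exactly one date per forecast step and starts after the last observed bar, so it can be passed to the plotting function without trimming.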


# Plotting

In [110]:
def create_forecast_plot(
    dataframe: pl.DataFrame,
    date: list,
    median: np.ndarray,
    high: np.ndarray,
    low: np.ndarray
) -> go.Figure:
    """
    Create a Plotly figure for time series forecasting with historical data, median forecast, and prediction interval.

    Parameters
    ----------
    dataframe : pl.DataFrame
        The input Polars DataFrame containing the historical data.

    date : list
        List of datetime objects representing the forecast dates.

    median : array-like
        Array-like object containing the median forecast values.

    high : array-like
        Array-like object containing the upper bound of the prediction interval.

    low : array-like
        Array-like object containing the lower bound of the prediction interval.

    Returns
    -------
    go.Figure
        A Plotly figure object.

    Examples
    --------
    >>> import plotly.graph_objects as go
    >>> import numpy as np
    >>> import polars as pl
    >>> dataframe = pl.DataFrame({"timestamp": ["2022-01-01", "2022-01-02", "2022-01-03"], "close": [100, 110, 120]})
    >>> date = ["2022-01-04", "2022-01-05", "2022-01-06"]
    >>> median = [105, 115, 125]
    >>> high = [110, 120, 130]
    >>> low = [100, 110, 120]
    >>> fig = create_forecast_plot(dataframe, date, median, high, low)
    >>> fig.show()
    """
    fig = go.Figure()

    # Add historical data to the plot
    fig.add_trace(
        go.Scatter(
            x=dataframe["timestamp"],
            y=dataframe["close"],
            mode="lines",
            name="Historical Data",
            line=dict(color="royalblue")
        )
    )

    # Add median forecast data to the plot
    fig.add_trace(
        go.Scatter(
            x=date,
            y=median,
            mode="lines",
            name="Median Forecast",
            line=dict(color="tomato")
        )
    )

    # Add prediction interval (fill area between lines) to the plot
    fig.add_trace(
        go.Scatter(
            x=np.concatenate([date, date[::-1]]),
            y=np.concatenate([high, low[::-1]]),
            fill="toself",
            fillcolor="tomato",
            line=dict(color="rgba(255,255,255,0)"),
            name="80% Prediction Interval",
            showlegend=True,
            opacity=0.3
        )
    )

    # Update the layout for better visualization
    fig.update_layout(
        title="Time series forecasting with 80% Prediction Interval",
        xaxis_title="Time",
        yaxis_title="Value",
        legend=dict(
            yanchor="top",
            y=0.99,
            xanchor="left",
            x=0.01
        ),
        template="plotly_dark"
    )

    return fig


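fig = create_forecast_plot(
    dataframe=dataframe,
    # forecast_index starts at the last observed timestamp, so drop that
    # first entry to align the remaining 20 dates with the 20 forecast steps.
    date=forecast_index[1:].to_list(),
    median=median,
    high=high,
    low=low
)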

fig.show()
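
The shaded band in the figure relies on a closed polygon: the x values run forward along the upper bound and then backward along the lower bound, and `fill="toself"` fills the enclosed area. The sketch below isolates that construction with NumPy; the helper name `interval_polygon` and the sample numbers are illustrative.

```python
import numpy as np


def interval_polygon(x, low, high):
    """Build the outline of the band between `low` and `high` over `x`."""
    x = np.asarray(x)
    low = np.asarray(low)
    high = np.asarray(high)

    # Forward along the upper bound, then back along the lower bound.
    poly_x = np.concatenate([x, x[::-1]])
    poly_y = np.concatenate([high, low[::-1]])
    return poly_x, poly_y


poly_x, poly_y = interval_polygon(
    x=[0, 1, 2],
    low=[1.0, 2.0, 3.0],
    high=[4.0, 5.0, 6.0],
)
```

All three inputs must have the same length; otherwise the x and y outlines no longer describe the same polygon.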