# Time Series Datasets Using GluonTS Library

In this notebook, we will explore the GluonTS library, a Python library designed for time series modeling with a focus on deep learning models, based on PyTorch and MXNet.

We will develop models with this library later in the series. This notebook focuses on GluonTS's dataset loading capabilities.

Library documentation: [GluonTS Documentation](https://ts.gluon.ai/stable/index.html#)

## Installation

Install the library by running the following command in your shell:

```bash
pip install gluonts
```

## Dataset Loading with GluonTS

GluonTS provides the `gluonts.dataset` package ([documentation](https://ts.gluon.ai/stable/api/gluonts/gluonts.dataset.html)), which includes a subpackage `gluonts.dataset.repository` for downloading commonly available datasets.

### Available Datasets

You can view the list of available datasets in the [source code](https://ts.gluon.ai/stable/_modules/gluonts/dataset/repository/datasets.html#get_download_path) or by accessing the `dataset_names` variable in the subpackage. Many of these datasets are similar to those we have explored in previous notebooks.

For example, to download the Traffic dataset, use the key `'traffic'`.

**Note:** The first download may take some time. The dataset is then stored in `$HOME/.gluonts`, and subsequent calls check this directory before downloading again.
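To see what has already been cached, one option is to list the sub-directories of the cache folder. This is a minimal sketch that assumes the `$HOME/.gluonts` location mentioned in the note; the helper name `cached_dataset_dirs` is hypothetical, and the exact folder layout may differ between GluonTS versions.

```python
from pathlib import Path


def cached_dataset_dirs(cache_root: Path) -> list[str]:
    """Return the names of sub-directories under cache_root, or [] if it does not exist."""
    if not cache_root.is_dir():
        return []
    return sorted(p.name for p in cache_root.iterdir() if p.is_dir())


# Assumed default cache location, taken from the note above.
default_root = Path.home() / ".gluonts"
print(cached_dataset_dirs(default_root))
```

An empty list simply means nothing has been downloaded to that location yet.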

### GluonTS vs. Loading Your Own Data

GluonTS applies preprocessing to the datasets and specifies meta information about train-test splits. You can review this preprocessing by examining the specific loading functions in the source code. For instance, the "traffic" dataset follows the preprocessing defined in `generate_lstnet_dataset` ([source code link](https://github.com/jgasthaus/gluon-ts/blob/c47ac9a0e11439edb9bdaae80975fefd035ae595/src/gluonts/dataset/repository/_lstnet.py#L125)). The dataset is generated with `rolling_evaluations=7`, so every series appears 7 times in the test split, each copy one prediction window longer than the previous one.

## Setup

In [2]:
import pathlib
import pandas as pd
import matplotlib.pyplot as plt

import utils_tfb # copied a specific function to read data
from gluonts.dataset.repository.datasets import get_dataset, dataset_names

In [3]:
print("Available datasets: ", dataset_names)

Available datasets:  ['constant', 'exchange_rate', 'solar-energy', 'electricity', 'traffic', 'exchange_rate_nips', 'electricity_nips', 'traffic_nips', 'solar_nips', 'wiki2000_nips', 'wiki-rolling_nips', 'taxi_30min', 'kaggle_web_traffic_with_missing', 'kaggle_web_traffic_without_missing', 'kaggle_web_traffic_weekly', 'm1_yearly', 'm1_quarterly', 'm1_monthly', 'nn5_daily_with_missing', 'nn5_daily_without_missing', 'nn5_weekly', 'tourism_monthly', 'tourism_quarterly', 'tourism_yearly', 'cif_2016', 'london_smart_meters_without_missing', 'wind_farms_without_missing', 'car_parts_without_missing', 'dominick', 'fred_md', 'pedestrian_counts', 'hospital', 'covid_deaths', 'kdd_cup_2018_without_missing', 'weather', 'm3_monthly', 'm3_quarterly', 'm3_yearly', 'm3_other', 'm4_hourly', 'm4_daily', 'm4_weekly', 'm4_monthly', 'm4_quarterly', 'm4_yearly', 'm5', 'uber_tlc_daily', 'uber_tlc_hourly', 'airpassengers', 'australian_electricity_demand', 'electricity_hourly', 'electricity_weekly', 'rideshare_wi

## Download & Load Traffic Dataset

The **Traffic dataset** is a multivariate dataset of hourly data from the California Department of Transportation covering 2015-2016 (Lai et al. describe it as 48 months). It includes road occupancy rates (values between 0 and 1) measured by various sensors on San Francisco Bay area freeways. This dataset was first used by Lai et al. (2017) to evaluate the performance of different modeling techniques.

[[1] Lai et al. 2017, Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/abs/1703.07015)

In [4]:
dataset = get_dataset("traffic")

**Metadata**:
GluonTS datasets come with metadata that provides specific information about the time series, including:

- **Frequency of Data:** Indicates the time interval of the data points (e.g., hourly 'H', monthly 'M').
- **Prediction Length:** Specifies the forecasting horizon.
- **Cardinality:** The number of distinct values of a static categorical feature. For Traffic, `feat_static_cat_0` has cardinality 862, one category per time series.
- **Exogenous Variables:** Details about any additional variables present in the dataset.


In [5]:
dataset.metadata

MetaData(freq='H', target=None, feat_static_cat=[CategoricalFeatureInfo(name='feat_static_cat_0', cardinality='862')], feat_static_real=[], feat_dynamic_real=[], feat_dynamic_cat=[], prediction_length=24)

In [6]:
freq = dataset.metadata.freq
prediction_length = dataset.metadata.prediction_length

print(f"Frequency of the dataset: {freq}")
print(f"Prediction horizon: {prediction_length}")

Frequency of the dataset: H
Prediction horizon: 24


## Training Dataset

GluonTS provides methods to organize and split time series datasets into subsets for training machine learning models. Let's explore how GluonTS handles these datasets.

Each entry of a GluonTS dataset is a dictionary with the following fields:

- **'target':** The observed values of one time series as a 1D NumPy array. A multivariate dataset stores each dimension as a separate entry.
- **'start':** A `pandas.Period` giving the timestamp of the first observation and the sampling frequency.
- **'item_id':** The identifier of the time series (dimension).
- **'feat_static_cat':** A NumPy array of static categorical features; here, the index of the series.
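As a rough illustration of these fields, the sketch below builds a toy entry by hand with NumPy and pandas. The values are made up and nothing is loaded from GluonTS. It only shows that the timestamp of any observation follows from `start` and the position in `target`.

```python
import numpy as np
import pandas as pd

# Hypothetical toy entry mirroring the fields listed above (values are made up).
entry = {
    "target": np.array([0.1, 0.2, 0.3, 0.4, 0.5], dtype=np.float32),
    "start": pd.Period("2015-01-01 00:00", freq="h"),
    "feat_static_cat": np.array([0], dtype=np.int32),
    "item_id": 0,
}

# Only the first timestamp is stored; later ones follow from the frequency.
n_steps = entry["target"].shape[0]
end = entry["start"] + (n_steps - 1)  # Period of the last observation

print(f"{n_steps} hourly observations from {entry['start']} to {end}")
```

Storing a single `start` instead of a full index keeps entries compact, which matters when a dataset holds hundreds of long series.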

In [62]:
n_dim = len(dataset.train)
n_steps_training = next(iter(dataset.train))['target'].shape[0]

print(f"Number of time series (dimensions) in the training dataset: {n_dim}")
print(f"Number of time steps in time series in the training dataset: {n_steps_training}")

print("An example of a time series")
next(iter(dataset.train))

Number of time series (dimensions) in the training dataset: 862
Number of time steps in time series in the training dataset: 14036
An example of a time series


{'target': array([0.0048, 0.0072, 0.004 , ..., 0.053 , 0.0533, 0.05  ], dtype=float32),
 'start': Period('2015-01-01 00:00', 'h'),
 'feat_static_cat': array([0], dtype=int32),
 'item_id': 0}

## Test Dataset

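Following Lai et al. (2017), the forecasting task involves making hourly predictions for one day (24 steps). The test split contains 7 rolling windows per series: each of the 862 series appears 7 times, and each copy extends the previous one by 24 hours. An estimator trained on the training set can therefore be evaluated on 7 successive days.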

The structure of the test dataset will become clearer when we create our own custom dataset below.

In [10]:
print(f"Number of time series in the test dataset: {len(dataset.test)}")

# Collect the distinct series lengths found in the test split
lens = {x['target'].shape[0] for x in dataset.test}

print("Unique time horizons in the test dataset:", lens)

print("An example of a time series")
next(iter(dataset.test))

Number of time series in the test dataset: 6034
Unique time horizons in the test dataset: {14084, 14180, 14204, 14156, 14060, 14132, 14108}
An example of a time series


{'target': array([0.0048, 0.0072, 0.004 , ..., 0.0467, 0.0412, 0.0386], dtype=float32),
 'start': Period('2015-01-01 00:00', 'H'),
 'feat_static_cat': array([0], dtype=int32),
 'item_id': 0}
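The test-set numbers printed above can be reproduced with simple arithmetic, assuming each rolling evaluation extends the training series by one `prediction_length`. The helper `rolling_test_lengths` is a hypothetical name used only for this sketch.

```python
def rolling_test_lengths(
    train_length: int, prediction_length: int, rolling_evaluations: int
) -> list[int]:
    """Length of each test series when every window adds one prediction horizon."""
    return [
        train_length + k * prediction_length
        for k in range(1, rolling_evaluations + 1)
    ]


# Values taken from the outputs above: 862 series, 14036 training steps, horizon 24.
lengths = rolling_test_lengths(train_length=14036, prediction_length=24, rolling_evaluations=7)
n_test_entries = 862 * 7

print(lengths)         # [14060, 14084, 14108, 14132, 14156, 14180, 14204]
print(n_test_entries)  # 6034
```

The lengths match the set of unique lengths reported for `dataset.test`, and 862 series times 7 windows gives the 6034 test entries.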

## Train-Test Split

How is a dataset split into training and testing sets? We will look at the train-test splitting process in the forecasting notebook.

## Custom Datasets with GluonTS

We can also import our own datasets into the GluonTS format. Let's take an example of a raw dataset without a clear definition of frequency.

Specifically, we will use the `economics_97.csv` dataset we downloaded earlier in the `forecasting` folder, which lacks a defined time index. Let's load this dataset into GluonTS.


In [11]:
TS_DATA_FOLDER = pathlib.Path("./forecasting").resolve()
# Use a separate name so the GluonTS `dataset` loaded earlier is not overwritten
dataset_path = TS_DATA_FOLDER / "economics_97.csv"
data = utils_tfb.read_data(str(dataset_path))

In [12]:
data

| date | channel_1 |
|---|---|
| 1970-01-01 00:00:00.000000001 | 38.396118 |
| 1970-01-01 00:00:00.000000002 | 32.301302 |
| 1970-01-01 00:00:00.000000003 | 28.912815 |
| 1970-01-01 00:00:00.000000004 | 31.747465 |
| 1970-01-01 00:00:00.000000005 | 38.663177 |
| ... | ... |
| 1970-01-01 00:00:00.000000068 | 30.995002 |
| 1970-01-01 00:00:00.000000069 | 35.926088 |
| 1970-01-01 00:00:00.000000070 | 30.613136 |
| 1970-01-01 00:00:00.000000071 | 28.850088 |


**Note:** The date column was parsed as nanosecond offsets from the Unix epoch (1970-01-01), while we expect yearly measurements for this dataset. Let's clean this up and load it into GluonTS.
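One possible way to repair such an index in pandas is to replace it with a yearly `PeriodIndex`. This is a sketch on a small stand-in frame, not the actual file. The start year 1990 is an assumption, since the raw data carries no real dates.

```python
import pandas as pd

# Toy stand-in for the loaded frame: row positions were parsed as nanoseconds since the epoch.
raw = pd.DataFrame(
    {"channel_1": [38.4, 32.3, 28.9, 31.7]},
    index=pd.to_datetime([1, 2, 3, 4], unit="ns"),
)

# Replace the meaningless index with yearly periods; 1990 is an assumed start year.
clean = raw.copy()
clean.index = pd.period_range(start="1990", periods=len(raw), freq="Y")

print(clean)
```

The observations themselves are untouched; only the index changes.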

For this example, we'll assume the following:

1. **Prediction Length:** The forecasting horizon we care about, e.g., predicting 4 years ahead.
2. **Rolling Windows:** The number of successive evaluation windows, each `prediction_length` steps long, held out at the end of the series.

In the rolling window approach, each evaluation uses all observations before the window as history and predicts the next `prediction_length` steps.
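The arithmetic behind such a split can be sketched on a plain Python list, without GluonTS. The function name `rolling_split` and the toy series of 71 values are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple


def rolling_split(
    values: Sequence[float], prediction_length: int, rolling_windows: int
) -> Tuple[List[float], List[List[float]]]:
    """Hold out rolling_windows * prediction_length points and build growing test series."""
    holdout = prediction_length * rolling_windows
    if holdout >= len(values):
        raise ValueError("not enough observations for the requested windows")
    n_train = len(values) - holdout
    train = list(values[:n_train])
    # Each test series ends one prediction window later than the previous one.
    tests = [
        list(values[: n_train + (k + 1) * prediction_length])
        for k in range(rolling_windows)
    ]
    return train, tests


series = [float(i) for i in range(71)]  # toy data
train, tests = rolling_split(series, prediction_length=4, rolling_windows=6)

print(len(train))               # 47
print([len(t) for t in tests])  # [51, 55, 59, 63, 67, 71]
print(tests[0][-4:])            # [47.0, 48.0, 49.0, 50.0]
```

In each test series, the last `prediction_length` values are the ones a forecast would be compared against, and everything before them serves as history.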

In [117]:
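from gluonts.dataset.common import ListDataset

freq = '1Y'
start = pd.Period("01-01-1990", freq=freq)
prediction_length = 4
rolling_windows = 6

values = data['channel_1'].values
n_train = len(values) - rolling_windows * prediction_length

# Training series: everything except the last rolling_windows * prediction_length points
train_ds = ListDataset(
    [{"target": values[:n_train], "start": start}],
    freq=freq
)

# Test series: one entry per rolling window, each prediction_length points longer.
# Explicit end indices avoid the `[:-0]` slice, which would return an empty array
# for the last window.
test_ds = ListDataset(
    [
        {"target": values[:n_train + (x + 1) * prediction_length], "start": start}
        for x in range(rolling_windows)
    ],
    freq=freq
)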



**Note:** The test dataset contains one entry per rolling window, which is why it is built as a list of series rather than a single series. We will explore this concept in more detail in the next notebook on forecasting.

## Next Steps

Proceed to the next notebook (`02_forecasting_basics.ipynb`) to formally understand forecasting basics and how to perform time series splitting for model development.