# Time Series Datasets: GluonTS

In this notebook, we will look at GluonTS, a Python library for time series modeling that focuses on deep learning based models built on PyTorch and MXNet.

We will use the library to develop models later in the series. This notebook only aims to get us familiar with the dataset loading capabilities of GluonTS.

Library documentation: [https://ts.gluon.ai/](https://ts.gluon.ai/stable/index.html#)

Let's install the library! Run the following command in your shell:

```bash
pip install "gluonts"
```

`gluonts` provides a package named `gluonts.dataset` ([documentation](https://ts.gluon.ai/stable/api/gluonts/gluonts.dataset.html)). We will use its subpackage `gluonts.dataset.repository` to download several commonly used datasets that are available online.

Which datasets are available in the package? We can look at the list of available datasets in the [source code](https://ts.gluon.ai/stable/_modules/gluonts/dataset/repository/datasets.html#get_download_path) or in the `dataset_names` variable from the subpackage. Most of these datasets are similar to the ones we looked at in the previous notebooks.

For example, to download the Traffic [1] dataset, we can use the corresponding key.

**Note:** The first download may take some time; the dataset is stored in `$HOME/.gluonts`. On later calls, GluonTS looks for the dataset in `$HOME/.gluonts` before downloading it again.
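
To check what has already been downloaded, one can inspect the cache folder directly. This is a minimal sketch that assumes only the `$HOME/.gluonts` location mentioned in the note; `list_cached` is a hypothetical helper, not part of GluonTS, and nothing is assumed about the layout inside the folder.

```python
import pathlib


def list_cached(cache_dir):
    """Return the names of the entries directly under cache_dir (empty list if it does not exist)."""
    if not cache_dir.is_dir():
        return []
    return sorted(p.name for p in cache_dir.iterdir())


# Assumption: the cache location is the one stated in the note above.
cache_dir = pathlib.Path.home() / ".gluonts"
print(f"Cache folder: {cache_dir}")
print(f"Entries found: {list_cached(cache_dir)}")
```

An empty list simply means that nothing has been downloaded yet, so the first `get_dataset` call will take longer.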


**GluonTS vs Loading your own data**

Note: GluonTS applies its own preprocessing to the datasets and also specifies some meta information about the train-test splits. One can find this information by reading the specific loading functions in the source code. For example, the "traffic" data follows the preprocessing specified in `generate_lstnet_dataset` ([link](https://github.com/jgasthaus/gluon-ts/blob/c47ac9a0e11439edb9bdaae80975fefd035ae595/src/gluonts/dataset/repository/_lstnet.py#L125)). From the source code, one can infer that the dataset is loaded with `rolling_evaluations=7`. As a result, we should expect 7 test series for each time series, each incrementally longer than the previous one, for evaluating our models.


## Setup

In [2]:
import pathlib
import pandas as pd
import matplotlib.pyplot as plt

import utils_tfb # copied a specific function to read data
from gluonts.dataset.repository.datasets import get_dataset, dataset_names

In [3]:
print("Available datasets: ", dataset_names)

Available datasets:  ['constant', 'exchange_rate', 'solar-energy', 'electricity', 'traffic', 'exchange_rate_nips', 'electricity_nips', 'traffic_nips', 'solar_nips', 'wiki2000_nips', 'wiki-rolling_nips', 'taxi_30min', 'kaggle_web_traffic_with_missing', 'kaggle_web_traffic_without_missing', 'kaggle_web_traffic_weekly', 'm1_yearly', 'm1_quarterly', 'm1_monthly', 'nn5_daily_with_missing', 'nn5_daily_without_missing', 'nn5_weekly', 'tourism_monthly', 'tourism_quarterly', 'tourism_yearly', 'cif_2016', 'london_smart_meters_without_missing', 'wind_farms_without_missing', 'car_parts_without_missing', 'dominick', 'fred_md', 'pedestrian_counts', 'hospital', 'covid_deaths', 'kdd_cup_2018_without_missing', 'weather', 'm3_monthly', 'm3_quarterly', 'm3_yearly', 'm3_other', 'm4_hourly', 'm4_daily', 'm4_weekly', 'm4_monthly', 'm4_quarterly', 'm4_yearly', 'm5', 'uber_tlc_daily', 'uber_tlc_hourly', 'airpassengers', 'australian_electricity_demand', 'electricity_hourly', 'electricity_weekly', 'rideshare_wi

## Download & Load Traffic Dataset

**Traffic dataset**: A multivariate dataset containing a collection of 48 months (2015-2016) of hourly data from the California Department of Transportation. The data contains the road occupancy rates (values between 0 and 1) measured by different sensors on San Francisco Bay area freeways. This dataset was first used by Lai et al. (2017); refer to the paper for a closer look at how different modeling techniques perform on these datasets.

[[1] Lai et al. 2017 Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/abs/1703.07015)

In [4]:
dataset = get_dataset("traffic")

## Metadata

GluonTS datasets carry metadata that describes the specifics of the time series, such as:
- frequency of the data (hourly 'H', monthly 'M', etc.),
- prediction length, i.e. the forecasting horizon,
- cardinality: number of time series in the dataset (1 for a univariate dataset),
- other exogenous variables, if present.

In [5]:
dataset.metadata

MetaData(freq='H', target=None, feat_static_cat=[CategoricalFeatureInfo(name='feat_static_cat_0', cardinality='862')], feat_static_real=[], feat_dynamic_real=[], feat_dynamic_cat=[], prediction_length=24)

In [6]:
freq = dataset.metadata.freq
prediction_length = dataset.metadata.prediction_length

print(f"Frequency of the dataset: {freq}")
print(f"Prediction horizon: {prediction_length}")

Frequency of the dataset: H
Prediction horizon: 24


## Training Dataset

GluonTS has methods to present time series and split them into subsets for training estimators / machine learning models. 
Let's look at how GluonTS presents these datasets.

In [62]:
n_dim = len(dataset.train)
n_steps_training = next(iter(dataset.train))['target'].shape[0]

print(f"Number of time series (dimensions) in the training dataset: {n_dim}")
print(f"Number of time steps in time series in the training dataset: {n_steps_training}")

print("An example of a time series")
next(iter(dataset.train))

Number of time series (dimensions) in the training dataset: 862
Number of time steps in time series in the training dataset: 14036
An example of a time series


{'target': array([0.0048, 0.0072, 0.004 , ..., 0.053 , 0.0533, 0.05  ], dtype=float32),
 'start': Period('2015-01-01 00:00', 'h'),
 'feat_static_cat': array([0], dtype=int32),
 'item_id': 0}

Here is how time series data is arranged in each entry of the dataset:

- 'target': the observations of one dimension (a multivariate dataset has multiple dimensions), represented as a 1D numpy array.
- 'start': the timestamp of the first observation, stored as a pandas `Period` that also carries the frequency of the observations.
- 'item_id': the id of the feature or dimension.
- 'feat_static_cat': the index of the feature or dimension, stored in a numpy array.
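
Since an entry stores only the values and the starting point, the full time index has to be rebuilt from 'start' and the length of 'target'. The following is a rough sketch with a hand-made entry; the five target values are toy numbers chosen for illustration and the entry is a plain dictionary, not something loaded through GluonTS.

```python
import numpy as np
import pandas as pd

# A hand-made entry with the same keys as the one printed above (toy values).
entry = {
    "target": np.array([0.0048, 0.0072, 0.0040, 0.0039, 0.0042], dtype=np.float32),
    "start": pd.Period("2015-01-01 00:00", freq="h"),
    "feat_static_cat": np.array([0], dtype=np.int32),
    "item_id": 0,
}

# 'start' plus the length of 'target' is enough to rebuild the time index;
# the frequency is taken from the Period object itself.
index = pd.period_range(start=entry["start"], periods=len(entry["target"]))
series = pd.Series(entry["target"], index=index)
print(series)
```

Storing a single `Period` instead of a full index keeps each entry compact, which matters when a dataset holds hundreds of long series.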

## Test Dataset

Following Lai et al. (2017), the forecasting task is to make hourly predictions for a day. For every time series, the test dataset contains 7 series of increasing length. Each of them starts with the training data, which serves as the input to the model, and extends it by one or more prediction windows.

The arrangement of the test dataset into an iterable will become clearer when we create our own custom dataset below.
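
The sizes we should expect in the test dataset can be worked out by hand. This sketch uses only numbers that appear in this notebook (862 series, 14036 training steps, a prediction length of 24 and `rolling_evaluations=7`) and assumes that each rolling evaluation extends the training series by exactly one more prediction window.

```python
# Values taken from the outputs and notes in this notebook.
n_series = 862             # number of series in the training dataset
train_length = 14036       # time steps per training series
prediction_length = 24     # from dataset.metadata
rolling_evaluations = 7    # from the GluonTS loading function

# Assumption: each rolling evaluation adds one more prediction window to the training series.
test_lengths = [
    train_length + (k + 1) * prediction_length
    for k in range(rolling_evaluations)
]

print("Test series lengths:", test_lengths)
print("Number of test entries:", n_series * rolling_evaluations)
```

The results can be compared with the output of the next cell.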

In [10]:
print(f"Number of time series in the test dataset: {len(dataset.test)}")

lens = set()
for x in iter(dataset.test):
    lens.add(x['target'].shape[0])

print("Unique time horizons in the test dataset:", lens)

print("An example of a time series")
next(iter(dataset.test))

Number of time series in the test dataset: 6034
Unique time horizons in the test dataset: {14084, 14180, 14204, 14156, 14060, 14132, 14108}
An example of a time series


{'target': array([0.0048, 0.0072, 0.004 , ..., 0.0467, 0.0412, 0.0386], dtype=float32),
 'start': Period('2015-01-01 00:00', 'H'),
 'feat_static_cat': array([0], dtype=int32),
 'item_id': 0}

## Train-Test split

Given a dataset, how is this train-test split done? We will learn more about splitting in the notebook on forecasting.



## Custom Datasets with GluonTS

We can also import our own datasets in the GluonTS format.

We will take an example of a raw dataset with no clear definition of frequency.

Specifically, the `forecasting` folder we downloaded earlier contains `economics_97.csv`, which has no clear definition of the time index. Let's load that dataset in GluonTS.


In [11]:
TS_DATA_FOLDER = pathlib.Path("./forecasting").resolve()
# Use a separate name so that the GluonTS `dataset` object loaded earlier is not overwritten
data_path = TS_DATA_FOLDER / "economics_97.csv"
data = utils_tfb.read_data(str(data_path))

In [12]:
data

| date | channel_1 |
|---|---|
| 1970-01-01 00:00:00.000000001 | 38.396118 |
| 1970-01-01 00:00:00.000000002 | 32.301302 |
| 1970-01-01 00:00:00.000000003 | 28.912815 |
| 1970-01-01 00:00:00.000000004 | 31.747465 |
| 1970-01-01 00:00:00.000000005 | 38.663177 |
| ... | ... |
| 1970-01-01 00:00:00.000000068 | 30.995002 |
| 1970-01-01 00:00:00.000000069 | 35.926088 |
| 1970-01-01 00:00:00.000000070 | 30.613136 |
| 1970-01-01 00:00:00.000000071 | 28.850088 |


**Note:** The date column has nanosecond resolution, whereas we expect this dataset to contain yearly measurements of economic metrics. Let's clean that up and load the data in GluonTS.
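
One way to clean up such a placeholder index in pandas is to replace it with a yearly `PeriodIndex`. This is only a sketch on a toy frame: the values are made up, and the yearly frequency and the 1990 start year are assumptions for illustration, since the raw file does not state them.

```python
import pandas as pd

# Toy frame that mimics the placeholder index above: integers parsed as nanoseconds.
raw = pd.DataFrame(
    {"channel_1": [38.4, 32.3, 28.9, 31.7, 38.7]},
    index=pd.to_datetime([1, 2, 3, 4, 5]),
)

# Assumption for illustration: observations are yearly and start in 1990.
yearly_index = pd.period_range(start="1990", periods=len(raw), freq="Y")

clean = raw.copy()
clean.index = yearly_index
print(clean)
```

Only the index is replaced; the order and the values of the observations stay untouched.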

For this example, we need to assume the following:

1. Prediction length: the forecasting horizon we care about. Let's say we care about predicting 4 years ahead.
2. Rolling windows: the number of evaluations we run to set up the right metric.

With rolling windows, every evaluation looks at the historical data available up to that window and then makes predictions over the forecasting horizon determined by the prediction length.
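
The bookkeeping behind rolling windows can be sketched in plain Python. `rolling_window_ends` is a hypothetical helper written for this illustration, and the series is a toy list of 71 integers; the sketch assumes that consecutive windows do not overlap and that the last window ends at the last observation.

```python
def rolling_window_ends(n_obs, prediction_length, rolling_windows):
    """Return (train_end, test_ends) as slice end indices into a series of n_obs points."""
    train_end = n_obs - rolling_windows * prediction_length
    if train_end <= 0:
        raise ValueError("Series is too short for the requested windows")
    test_ends = [train_end + (k + 1) * prediction_length for k in range(rolling_windows)]
    return train_end, test_ends


# Toy series of 71 points, a 4-step horizon and 6 rolling windows.
series = list(range(71))
horizon = 4
train_end, test_ends = rolling_window_ends(len(series), prediction_length=horizon, rolling_windows=6)

print(f"Training slice: series[:{train_end}]")
for end in test_ends:
    history, future = series[:end - horizon], series[end - horizon:end]
    print(f"history of {len(history)} points -> forecast indices {future[0]}..{future[-1]}")
```

Working with explicit end indices is a deliberate choice: a negative-offset slice such as `series[:-0]` is empty in Python, which is easy to hit when the last window reaches the end of the series.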


In [117]:
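from gluonts.dataset.common import ListDataset

freq = '1Y'
start = pd.Period("01-01-1990", freq=freq)
prediction_length = 4
rolling_windows = 6

values = data['channel_1'].values
# Number of observations left for training once the evaluation windows are held out
n_train = len(values) - rolling_windows * prediction_length

# Training set: everything before the evaluation windows
train_ds = ListDataset(
    [{"target": values[:n_train], "start": start}],
    freq=freq
)

# Test set: one entry per rolling window, each one prediction_length longer than the previous.
# Explicit end indices are used because `values[:-0]` would be empty for the last window.
test_ds = ListDataset(
    [
        {"target": values[:n_train + (x + 1) * prediction_length], "start": start}
        for x in range(rolling_windows)
    ],
    freq=freq
)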



As we can see, the test dataset is an iterable with one entry per rolling window. We will learn more about this in the next notebook on forecasting.

## Next Steps

Now, move on to the next notebook to formally understand forecasting: its basics and how to split time series for model development.