# Time Series Datasets Using GluonTS Library

**Approximate Learning time:** Up to 1 hour

--- 

In this notebook, we explore a second method for accessing time series datasets. The GluonTS library ([documentation](https://ts.gluon.ai/stable/)) offers tools for handling time series data for modeling and provides an interface to a wide variety of models. Although the library is still under active development, the research community frequently uses its data-handling capabilities when building models.

We will use GluonTS for data handling in Module 6, so this notebook serves as a good introduction to interfacing with GluonTS, with the primary focus on exploring the datasets available within the library.

---

## Datasets through GluonTS

GluonTS provides the `gluonts.dataset` package ([documentation](https://ts.gluon.ai/stable/api/gluonts/gluonts.dataset.html)), which includes a subpackage `gluonts.dataset.repository` for downloading commonly available datasets. You can view the list of available datasets in the [source code](https://ts.gluon.ai/stable/_modules/gluonts/dataset/repository/datasets.html#get_download_path) or by accessing the `dataset_names` variable in the subpackage. Many of these datasets are similar to those we have explored in previous notebooks.

**Note:** The first call for a given dataset may take some time because the data is downloaded and cached in `$HOME/.gluonts`. Subsequent calls check this directory before downloading again.
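To see what has already been cached, the directory mentioned in the note can be inspected directly. The following is a minimal sketch using only the standard library; the helper name `list_cached_entries` is hypothetical, and the layout inside `$HOME/.gluonts` is not assumed, so the function simply reports every subdirectory it finds.

```python
from pathlib import Path


def list_cached_entries(cache_root):
    """Return the relative paths of all subdirectories under cache_root.

    Hypothetical helper: it only walks the file system and does not call GluonTS.
    """
    root = Path(cache_root).expanduser()
    if not root.is_dir():
        # Nothing has been downloaded yet (or the cache lives elsewhere).
        return []
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_dir())


# An empty list means no dataset has been cached in this location so far.
print(list_cached_entries("~/.gluonts"))
```

An empty result before the first `get_dataset` call and a non-empty one afterwards would confirm that the download was cached.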

In [1]:
from gluonts.dataset.repository.datasets import get_dataset, dataset_names

print("Number of datasets available in GluonTS", len(dataset_names))
print("Available datasets: ", dataset_names)

Number of datasets available in GluonTS 62
Available datasets:  ['constant', 'exchange_rate', 'solar-energy', 'electricity', 'traffic', 'exchange_rate_nips', 'electricity_nips', 'traffic_nips', 'solar_nips', 'wiki2000_nips', 'wiki-rolling_nips', 'taxi_30min', 'kaggle_web_traffic_with_missing', 'kaggle_web_traffic_without_missing', 'kaggle_web_traffic_weekly', 'm1_yearly', 'm1_quarterly', 'm1_monthly', 'nn5_daily_with_missing', 'nn5_daily_without_missing', 'nn5_weekly', 'tourism_monthly', 'tourism_quarterly', 'tourism_yearly', 'cif_2016', 'london_smart_meters_without_missing', 'wind_farms_without_missing', 'car_parts_without_missing', 'dominick', 'fred_md', 'pedestrian_counts', 'hospital', 'covid_deaths', 'kdd_cup_2018_without_missing', 'weather', 'm3_monthly', 'm3_quarterly', 'm3_yearly', 'm3_other', 'm4_hourly', 'm4_daily', 'm4_weekly', 'm4_monthly', 'm4_quarterly', 'm4_yearly', 'm5', 'uber_tlc_daily', 'uber_tlc_hourly', 'airpassengers', 'australian_electricity_demand', 'electricity_h



--- 

## Traffic Dataset

The **Traffic dataset** is a multivariate dataset containing two years (2015-2016) of hourly data from the California Department of Transportation. It records road occupancy rates (values between 0 and 1) measured by sensors on San Francisco Bay Area freeways. The dataset was first introduced in Lai et al. (2017).


[(Lai et al. 2017) Modeling Long- and Short-Term Temporal Patterns with Deep Neural Networks](https://arxiv.org/abs/1703.07015)

--- 

GluonTS loads datasets with the function `get_dataset` ([docs](https://ts.gluon.ai/stable/api/gluonts/gluonts.dataset.repository.html#gluonts.dataset.repository.get_dataset)). The returned object is of type `TrainDatasets` ([docs](https://ts.gluon.ai/stable/api/gluonts/gluonts.dataset.common.html#gluonts.dataset.common.TrainDatasets)), which has three main attributes: `metadata`, `train`, and `test`.


In [2]:
dataset = get_dataset("traffic")
dataset.metadata

MetaData(freq='H', target=None, feat_static_cat=[CategoricalFeatureInfo(name='feat_static_cat_0', cardinality='862')], feat_static_real=[], feat_dynamic_real=[], feat_dynamic_cat=[], prediction_length=24)

**Metadata**:
GluonTS datasets include metadata that provides specific information about the time series, including:

- **Frequency of Data**: Indicates the time interval of the data points (e.g., hourly 'H', monthly 'M').
- **Prediction Length**: Specifies the length of the forecasting horizon.
- **Features**: A dataset can contain four types of features:
    - **feat_static_cat**: These are static categorical features. The ID of a time series is specified by this feature, and its cardinality indicates the number of time series. Other such features, representing different aspects of the time series, can also be present and are labeled as `feat_static_cat_x`.
    - **feat_static_real**: These are static real-valued features that remain constant for a particular time series.
    - **feat_dynamic_cat**: These are dynamic categorical features, meaning they change over time. Their values at each time step $t$ are known before the target (the value to be predicted) is observed, which makes them usable as inputs to a model.
    - **feat_dynamic_real**: Similarly, these are real-valued features that vary over time.

- **target**: This is a list of the time series to be predicted. If none is specified, all unique categories in `feat_static_cat_0` are used as targets. This also influences how the training and testing split is handled, which we will cover in subsequent modules.

**Train & Test**:

There are various methods to split a time series dataset into training and testing subsets, which we will explore in detail starting in Module 3. 
In general, for time series forecasting tasks, the training dataset consists of values up to a certain time step. The test dataset includes all of those values plus additional time steps that need to be forecasted.

In GluonTS, each entry of a dataset has at least the following fields:
- **'target'**: Each time series is represented as a 1D numpy array.
- **'start'**: Contains information about the starting point and the frequency of the observations.
- **'item_id'**: The unique identifier for each time series.

The test dataset will have the same attributes as the training dataset but with more values corresponding to the additional time steps that need to be predicted.

In [3]:
next(iter(dataset.train))

{'target': array([0.0048, 0.0072, 0.004 , ..., 0.053 , 0.0533, 0.05  ], dtype=float32),
 'start': Period('2015-01-01 00:00', 'H'),
 'feat_static_cat': array([0], dtype=int32),
 'item_id': 0}

In [4]:
print(f"Number of observations in train: {next(iter(dataset.train))['target'].shape[0]}")
print(f"Number of observations in corresponding test: {next(iter(dataset.test))['target'].shape[0]}")

Number of observations in train: 14036
Number of observations in corresponding test: 14060
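The two counts printed above differ by 14060 - 14036 = 24, which matches the `prediction_length` reported in the metadata. The relationship between a train entry and its test counterpart can be sketched on toy data as follows. The helper `extra_steps` and the toy entries are hypothetical and only mirror the `target` and `item_id` fields described earlier; they are not part of GluonTS.

```python
def extra_steps(train_entry, test_entry):
    """Return how many observations the test entry adds to the train entry."""
    train_target = list(train_entry["target"])
    test_target = list(test_entry["target"])
    # The test series is expected to start with exactly the training observations.
    if test_target[: len(train_target)] != train_target:
        raise ValueError("test target does not extend the train target")
    return len(test_target) - len(train_target)


# Toy entries: the test target repeats the train target and adds two more steps.
toy_train = {"item_id": 0, "target": [0.1, 0.2, 0.3, 0.4]}
toy_test = {"item_id": 0, "target": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]}

print(extra_steps(toy_train, toy_test))
```

Because the helper converts the targets with `list`, the same check should also work on the NumPy arrays returned by `next(iter(dataset.train))` and `next(iter(dataset.test))`.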


---

## Data Transformation with GluonTS

GluonTS provides a wide range of templated features to transform time series data. 
To take advantage of this, we need to define transformation templates and let GluonTS apply them to the datasets. 

In this section, we will define one such transformation.

**Transformation Steps:**

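1. **Remove Unwanted Features**: In `dataset.train`, the field `feat_static_cat` is present. We will remove it, along with the real-valued feature fields (`feat_static_real` and `feat_dynamic_real`) in case they are present.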
2. **Ensure Data Format**: Verify that the data is in the form of a NumPy array.
3. **Add Time Features**: We will add features derived from the time index, such as the hour of the day or the day of the week. Because they depend only on the timestamps, they are also known for the time steps being predicted, and GluonTS computes them for us. The helper `time_features_from_frequency_str` returns a suitable set of time features for the frequency of the dataset.
4. **Add an Age Feature**: Although not commonly used in traditional models, the age feature becomes relevant in transformer-like architectures. We will learn more about these models in **Module 6**.
5. **Stack Time and Age Features**: Once the time and age features are created, we will stack them together for further processing.
6. **Rename Keys**: Finally, we will rename certain internal keys: change `time_feat` to `time_features` and `target` to `values` for clarity.

Most of the transformation classes are self-explanatory. Common arguments include `target_field`, which names the field holding the time series; `output_field`, which names the field where the transformed result is stored; and `start_field`, which names the field holding the starting timestamp.


Applying these transformations is straightforward: combine them in a `Chain` and pass the dataset through it. With `is_train=True`, the generated feature arrays have the same length as the target already present in each entry. With `is_train=False`, they have length `len(target) + pred_length`, so they also cover the time steps to be forecast.
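The idea behind chaining can be illustrated in plain Python before using the real classes. The sketch below is not GluonTS code: `remove_fields`, `add_position_feature`, `rename_fields`, and `apply_chain` are hypothetical stand-ins that operate on dictionaries, and the position feature is a placeholder for real time features. It shows how each step receives an entry and the `is_train` flag, and how the flag changes the length of the generated feature.

```python
def remove_fields(names):
    """Build a step that drops the given keys from an entry."""
    def step(entry, is_train):
        return {k: v for k, v in entry.items() if k not in names}
    return step


def add_position_feature(target_field, output_field, pred_length):
    """Build a step that adds a feature covering the target (and the horizon at test time)."""
    def step(entry, is_train):
        n = len(entry[target_field])
        length = n if is_train else n + pred_length
        return {**entry, output_field: list(range(length))}
    return step


def rename_fields(mapping):
    """Build a step that renames keys according to mapping."""
    def step(entry, is_train):
        return {mapping.get(k, k): v for k, v in entry.items()}
    return step


def apply_chain(steps, entries, is_train):
    """Lazily pass every entry through all steps, in order."""
    for entry in entries:
        for step in steps:
            entry = step(entry, is_train)
        yield entry


toy_entries = [{"target": [0.1, 0.2, 0.3], "feat_static_cat": [0], "item_id": 0}]
steps = [
    remove_fields({"feat_static_cat"}),
    add_position_feature("target", "time_feat", pred_length=2),
    rename_fields({"time_feat": "time_features", "target": "values"}),
]

train_entry = next(apply_chain(steps, toy_entries, is_train=True))
test_entry = next(apply_chain(steps, toy_entries, is_train=False))
print(train_entry)
print(len(train_entry["time_features"]), len(test_entry["time_features"]))
```

In this toy setup the feature has 3 values with `is_train=True` and 3 + 2 = 5 values with `is_train=False`, which is the same length rule described above for the real transformations.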


In [5]:
from gluonts.time_feature import time_features_from_frequency_str
from gluonts.dataset.field_names import FieldName # Offers a mapping from attributes to string names 
from gluonts.transform import (
    AddAgeFeature,
    AddTimeFeatures, 
    Chain,
    RemoveFields,
    RenameFields,
    AsNumpyArray,
    VstackFeatures,
)


print("Recommended time features: ", time_features_from_frequency_str(dataset.metadata.freq))

remove_field_names=[FieldName.FEAT_STATIC_REAL, FieldName.FEAT_DYNAMIC_REAL, FieldName.FEAT_STATIC_CAT]
transformation = Chain(
    [RemoveFields(field_names=remove_field_names)]
    + [
        AsNumpyArray(
            field=FieldName.TARGET,
            expected_ndim=1,
        ),
        AddTimeFeatures(
            start_field=FieldName.START,
            target_field=FieldName.TARGET,
            output_field=FieldName.FEAT_TIME,
            time_features=time_features_from_frequency_str(dataset.metadata.freq),
            pred_length=24,
        ),
        AddAgeFeature(
            target_field=FieldName.TARGET,
            output_field=FieldName.FEAT_AGE,
            pred_length=24,
            log_scale=True,
        ),
        VstackFeatures(
            output_field=FieldName.FEAT_TIME,
            input_fields=[FieldName.FEAT_TIME, FieldName.FEAT_AGE]
        ),
        RenameFields(
            mapping={
                FieldName.FEAT_TIME: "time_features",
                FieldName.TARGET: "values",
            }
        )
    ]
)

# The chain is applied lazily, so we only transform and inspect the first entry.
entry = next(iter(transformation.apply(dataset.train, is_train=True)))
print(entry)

print("Shape of time features: ", entry['time_features'].shape)
print("Shape of values: ", entry['values'].shape)



Recommended time features:  [<function hour_of_day at 0x125f171a0>, <function day_of_week at 0x125f172e0>, <function day_of_month at 0x125f17420>, <function day_of_year at 0x125f17560>]
{'start': Period('2015-01-01 00:00', 'H'), 'item_id': 0, 'time_features': array([[-0.5       , -0.45652175, -0.41304347, ...,  0.23913044,
         0.2826087 ,  0.32608697],
       [ 0.        ,  0.        ,  0.        , ...,  0.5       ,
         0.5       ,  0.5       ],
       [-0.5       , -0.5       , -0.5       , ..., -0.3       ,
        -0.3       , -0.3       ],
       [-0.5       , -0.5       , -0.5       , ...,  0.1       ,
         0.1       ,  0.1       ],
       [ 0.30103   ,  0.47712126,  0.60206   , ...,  4.1472125 ,
         4.1472435 ,  4.1472745 ]], dtype=float32), 'values': array([0.0048, 0.0072, 0.004 , ..., 0.053 , 0.0533, 0.05  ], dtype=float32)}
Shape of time features:  (5, 14036)
Shape of values:  (14036,)
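To make the five rows of `time_features` less opaque, they can be recomputed by hand for the first and last time steps. The sketch below uses only the standard library and assumes the following normalizations, which are consistent with the printed values but are stated here as assumptions rather than taken from the GluonTS source: hour of day scaled as `hour / 23 - 0.5`, day of week as `weekday / 6 - 0.5`, day of month as `(day - 1) / 30 - 0.5`, day of year as `(day_of_year - 1) / 365 - 0.5`, and a log-scaled age of `log10(2 + t)`. The function name `hourly_time_features` is hypothetical.

```python
import math
from datetime import datetime, timedelta


def hourly_time_features(start, length):
    """Recompute the four calendar features and the age feature for an hourly series."""
    rows = [[], [], [], [], []]
    for t in range(length):
        ts = start + timedelta(hours=t)
        rows[0].append(ts.hour / 23 - 0.5)                        # hour of day
        rows[1].append(ts.weekday() / 6 - 0.5)                    # day of week (Monday = 0)
        rows[2].append((ts.day - 1) / 30 - 0.5)                   # day of month
        rows[3].append((ts.timetuple().tm_yday - 1) / 365 - 0.5)  # day of year
        rows[4].append(math.log10(2 + t))                         # log-scaled age
    return rows


# Same start and length as the first training series shown above.
features = hourly_time_features(datetime(2015, 1, 1), 14036)

print("first column:", [round(row[0], 4) for row in features])
print("last column: ", [round(row[-1], 4) for row in features])
```

The first and last columns agree with the array printed by the notebook up to the displayed precision, which suggests that the first four rows are the recommended time features in order and the fifth row is the age feature added by `AddAgeFeature` and stacked by `VstackFeatures`.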


--- 
## Conclusion

We have learned how GluonTS handles time series datasets and explored its functionality for transforming them. This experience will be valuable when we build forecasting models using the Informer architecture in Module 6.

---
## Next Steps

Now that we are familiar with various time series datasets and tools to explore them, let's formally define the forecasting problem in the next module. We will learn about the dataset splitting required for modeling and the evaluation metrics used to assess model performance.

---