# TFT PyTorch model with baseline-v-0-raw.parquet
- This notebook builds a Temporal Fusion Transformer (TFT) model, following the sample data and code in the PyTorch Forecasting documentation:
https://pytorch-forecasting.readthedocs.io/en/stable/_modules/pytorch_forecasting/models/temporal_fusion_transformer.html

Goal of the notebook
  - Create a sample TFT model end to end, from data loading to prediction.
  - Building on this notebook, the next notebook (link to be updated) applies the simple dataset from our study to the TFT model.

- Edited by Rumi Nakagawa
- Spring 2023 Capstone


## Other references:
TFT with pytorch

1. https://pytorch-forecasting.readthedocs.io/en/stable/tutorials/stallion.html

2. https://pytorch-forecasting.readthedocs.io/en/stable/_modules/pytorch_forecasting/models/temporal_fusion_transformer.html

3. https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Forecasting/TFT#getting-the-data

4. https://towardsdatascience.com/temporal-fusion-transformer-time-series-forecasting-with-deep-learning-complete-tutorial-d32c1e51cd91

TFT with Tensorflow
1. https://github.com/greatwhiz/tft_tf2

2. https://towardsdatascience.com/temporal-fusion-transformer-googles-model-for-interpretable-time-series-forecasting-5aa17beb621


# 0. Preparation

## Mount google drive
- Make sure the mounted drive is the user's own Drive (files in shared folders are not accessible).

In [1]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [2]:
%cd drive/MyDrive/

/content/drive/MyDrive


In [3]:
IN_COLLAB = 'google.colab' in str(get_ipython())

#TODO: CHANGE THIS BASED ON YOUR OWN LOCAL SETTINGS
# MY_HOME_ABS_PATH = "/content/drive/MyDrive/W210/co2-flux-hourly-gpp-modeling"
MY_HOME_ABS_PATH =  "/content/drive/MyDrive"

In [4]:
# This is already done above
# if IN_COLLAB:
#   from google.colab import drive
#   drive.mount('/content/drive/')

## Import libraries

In [232]:
import os
import warnings

warnings.filterwarnings("ignore")  # avoid printing out absolute paths
print(os.getcwd())
# os.chdir("../../..")

/content/drive/MyDrive


#### (pip install)

In [233]:
!pip install pytorch_lightning

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
ERROR: Operation cancelled by user

In [234]:
!pip install pytorch_forecasting

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [235]:
! pip install statsmodels --upgrade

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [236]:
# Pinning this version is required to avoid errors in later cells
!pip install pytorch_lightning==1.9.0

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
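Because a specific `pytorch_lightning` version is pinned above, it can help to confirm which versions are actually active after the installs. The following is a minimal sketch using only the standard library. The helper names are illustrative and not part of the notebook, and the distribution names assume the PyPI spellings `pytorch-lightning` and `pytorch-forecasting`.

```python
from importlib import metadata


def installed_version(package: str):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def major_version(version: str) -> int:
    """Extract the leading integer of a version string such as '1.9.0'."""
    return int(version.split(".")[0])


# Report what is installed; prints None for packages that are missing
for name in ("pytorch-lightning", "pytorch-forecasting"):
    print(name, installed_version(name))
```

If the reported major version of `pytorch-lightning` is not 1, restarting the runtime after the pinned install is usually the first thing to try.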


numba may potentially 

In [237]:
!pip install numba

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [238]:
!pip install azure.storage.blob 

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [239]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


### pytorch libraries

In [240]:
import copy
from pathlib import Path
import warnings

import numpy as np
import pandas as pd
import pytorch_lightning as pl
from pytorch_lightning.callbacks import EarlyStopping, LearningRateMonitor
from pytorch_lightning.loggers import TensorBoardLogger
import torch

from pytorch_forecasting import Baseline, TemporalFusionTransformer, TimeSeriesDataSet
from pytorch_forecasting.data import GroupNormalizer
from pytorch_forecasting.metrics import SMAPE, PoissonLoss, QuantileLoss
from pytorch_forecasting.models.temporal_fusion_transformer.tuning import optimize_hyperparameters
from pytorch_forecasting import BaseModel, MAE

# Load data from Azure blob

In [39]:
MY_HOME_ABS_PATH

'/content/drive/MyDrive'

In [138]:
import sys
sys.path.append('/content/drive/MyDrive/.cred')
sys.path.append('/content/drive/MyDrive/tools')
sys.path.append('/content/drive/MyDrive/tools/CloudIO')

In [241]:
import os
os.environ["PYARROW_IGNORE_TIMEZONE"] = "1"
import math
import json

import pandas as pd  # plain pandas: the outputs below show pandas.core.frame.DataFrame objects
from calendar import monthrange
from datetime import datetime
from io import BytesIO

import matplotlib.pyplot as plt
import plotly.express as px
import plotly.graph_objects as go

# Load local custom modules
import sys
if IN_COLLAB:
  os.chdir(MY_HOME_ABS_PATH)
  # sys.path.insert(0,os.path.abspath("./code/src/tools"))
  sys.path.insert(0,os.path.abspath("tools"))
else:
  sys.path.append(os.path.abspath("tools"))

from CloudIO.AzStorageClient import AzStorageClient
from data_pipeline_lib import *

pd.set_option('display.max_rows', 500)
pd.set_option('display.max_columns', None)
pd.set_option('display.float_format', lambda x: '%.5f' % x)
  

In [243]:
root_dir =  MY_HOME_ABS_PATH
tmp_dir =  root_dir + os.sep + '.tmp'
raw_data_dir = tmp_dir
data_dir = root_dir + os.sep + 'data'
cred_dir = root_dir + os.sep + '.cred'
az_cred_file = cred_dir + os.sep + 'azblobcred.json'

# if IN_COLLAB:
#   raw_data_dir = "/content/drive/MyDrive/CO2_flux_gpp_modeling/DS_capstone_23Spring_CO2/Data/half_hourly_data"

In [244]:
# Define target dataset
container = "baseline-data"
ext = "parquet"
# ver = "1"
# blob_name_base = f"baseline_all_v_{ver}"
# train_blob_name_base = f"baseline-train-v-{ver}"
# test_blob_name_base = f"baseline-test-v-{ver}"


In [245]:
# AzStorageClient.listBlobs(container)
container = "baseline-data"
cred_file = az_cred_file

In [246]:
from azure.storage.blob import BlobServiceClient

if os.path.exists(cred_file):
  connect_str = ""
  with open(cred_file, "rb") as f:
      data = json.load(f)
      connect_str = data['connectionstr']
      blob_svc_client = BlobServiceClient.from_connection_string(connect_str)
      tokens = connect_str.split(';')
      for t in tokens:
        if "AccountName=" in t:
          AccountName = t[len("AccountName="):]
        elif "AccountKey=" in t:
          AccountKey = t[len("AccountKey="):]
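The token loop above pulls `AccountName` and `AccountKey` out of the connection string. A more general sketch, assuming the usual `Key=Value;Key=Value` layout, is shown below. The function name and the sample string are made up for illustration. Each pair is split on the first `=` only, since a key value may itself end in `=` padding.

```python
def parse_connection_string(connect_str: str) -> dict:
    """Split 'Key1=Value1;Key2=Value2' into a dict, splitting each pair on the first '=' only."""
    pairs = {}
    for token in connect_str.split(";"):
        if not token:
            continue  # skip empty pieces, e.g. from a trailing ';'
        key, _, value = token.partition("=")
        pairs[key] = value
    return pairs


# Made-up example string; not a real credential
example = (
    "DefaultEndpointsProtocol=https;AccountName=demoaccount;"
    "AccountKey=abc123==;EndpointSuffix=core.windows.net"
)
parts = parse_connection_string(example)
print(parts["AccountName"])
```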

In [249]:
# Container and file name
container = "baseline-data"
blob_name = "baseline-train-v-0-raw.parquet"

In [250]:
# Download the parquet file if there is no local copy
# ref: https://stackoverflow.com/a/68940709

# Join with os.path.join so the cached file lands inside tmp_dir
# (tmp_dir + blob_name has no separator between directory and file name)
local_path = os.path.join(tmp_dir, blob_name)

data_df = None
if not os.path.exists(local_path):
    # Initialize Azure Storage Client
    azStorageClient = AzStorageClient(az_cred_file)
    # Download blob to stream
    file_stream = azStorageClient.downloadBlob2Stream(container, blob_name)
    # Read parquet and cache it locally
    os.makedirs(tmp_dir, exist_ok=True)
    data_df = pd.read_parquet(file_stream, engine='pyarrow')
    data_df.to_parquet(local_path)
else:
    data_df = pd.read_parquet(local_path)

print(f"size: {data_df.shape}")
data_df.head()

size: (1485926, 33)
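The download-once pattern in the cell above can be sketched independently of Azure. In this illustration, `fetch` stands in for the blob download, and all names (`cached_path`, `load_with_cache`, `fake_fetch`) are hypothetical. The sketch also shows building the cache path with `os.path.join`, which inserts the separator that plain concatenation of a directory and a file name would omit.

```python
import os
import tempfile


def cached_path(cache_dir: str, name: str) -> str:
    """Build the cache file path with a proper separator between directory and file name."""
    return os.path.join(cache_dir, name)


def load_with_cache(cache_dir: str, name: str, fetch) -> bytes:
    """Return the bytes for `name`, calling fetch() only when no local copy exists."""
    path = cached_path(cache_dir, name)
    if os.path.exists(path):
        with open(path, "rb") as f:
            return f.read()
    payload = fetch()
    os.makedirs(cache_dir, exist_ok=True)
    with open(path, "wb") as f:
        f.write(payload)
    return payload


calls = []


def fake_fetch() -> bytes:
    """Stand-in for a remote download; records how often it was called."""
    calls.append(1)
    return b"parquet-bytes"


with tempfile.TemporaryDirectory() as tmp:
    first = load_with_cache(tmp, "demo.parquet", fake_fetch)   # triggers the fetch
    second = load_with_cache(tmp, "demo.parquet", fake_fetch)  # served from the cache

print(len(calls))
```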


Unnamed: 0,GPP_NT_VUT_REF,TA_ERA,SW_IN_ERA,LW_IN_ERA,VPD_ERA,P_ERA,PA_ERA,datetime,year,month,day,hour,date,EVI,NDVI,NIRv,b1,b2,b3,b4,b5,b6,b7,IGBP,koppen,minute,site_id,elevation,lat,long,koppen_sub,c3c4,c4_percent
16228,-0.53574,5.311,25.016,272.218,1.708,0.0,97.939,2001-01-01 08:30:00,2001,1,1,8,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16229,0.86438,5.744,59.734,272.218,1.738,0.0,97.939,2001-01-01 09:00:00,2001,1,1,9,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,0,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16230,-0.02627,6.176,91.235,272.218,1.767,0.0,97.939,2001-01-01 09:30:00,2001,1,1,9,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16231,-0.17229,6.608,79.264,333.933,1.797,0.05,97.939,2001-01-01 10:00:00,2001,1,1,10,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,0,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16232,1.20865,7.043,94.929,333.933,1.817,0.0,97.923,2001-01-01 10:30:00,2001,1,1,10,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59


In [251]:
type(data_df)

pandas.core.frame.DataFrame

# Data Preprocessing

In [252]:
data_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1485926 entries, 16228 to 2021172
Data columns (total 33 columns):
 #   Column          Non-Null Count    Dtype         
---  ------          --------------    -----         
 0   GPP_NT_VUT_REF  1485926 non-null  float64       
 1   TA_ERA          1485926 non-null  float64       
 2   SW_IN_ERA       1485926 non-null  float64       
 3   LW_IN_ERA       1485926 non-null  float64       
 4   VPD_ERA         1485926 non-null  float64       
 5   P_ERA           1485926 non-null  float64       
 6   PA_ERA          1485926 non-null  float64       
 7   datetime        1485926 non-null  datetime64[ns]
 8   year            1485926 non-null  int64         
 9   month           1485926 non-null  int64         
 10  day             1485926 non-null  int64         
 11  hour            1485926 non-null  int64         
 12  date            1485926 non-null  datetime64[ns]
 13  EVI             1485926 non-null  float64       
 14  NDVI          

In [150]:
data_df.describe()

Unnamed: 0,GPP_NT_VUT_REF,TA_ERA,SW_IN_ERA,LW_IN_ERA,VPD_ERA,P_ERA,PA_ERA,year,month,day,hour,EVI,NDVI,NIRv,b1,b2,b3,b4,b5,b6,b7,minute,elevation,lat,long,koppen_sub,c4_percent
count,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0,1485926.0
mean,5.86211,14.87724,377.56323,316.15936,10.11574,0.03757,90.79849,2009.75172,6.53593,15.74438,11.89376,0.29739,0.53815,0.13063,0.07894,0.24123,0.04288,0.07109,0.25797,0.20036,0.11929,14.99925,971.07777,41.75447,-62.35813,16.51431,8.82611
std,7.52193,9.04516,268.53741,47.26856,9.61817,0.18091,8.84375,4.52259,2.98991,8.81076,3.7739,0.13132,0.21932,0.07137,0.0554,0.06086,0.03907,0.0416,0.0696,0.09369,0.07861,15.00001,866.88847,7.77408,59.15444,8.71496,15.9951
min,-49.7372,-29.74,0.001,142.77,0.0,0.0,67.405,2001.0,1.0,1.0,3.0,-0.11958,-0.18252,-0.01715,0.0054,0.0305,0.0,0.0,0.0,0.0132,0.0,0.0,129.0,31.7365,-121.5574,6.0,0.0
25%,0.45157,8.692,138.70625,283.869,3.18,0.0,85.242,2006.0,4.0,8.0,9.0,0.20336,0.3415,0.08147,0.0337,0.1984,0.0196,0.0422,0.2021,0.1189,0.0515,0.0,234.0,36.6058,-110.8661,8.0,0.0
50%,3.23428,14.931,344.5285,317.35,6.987,0.0,93.167,2010.0,7.0,16.0,12.0,0.28756,0.56813,0.11822,0.0639,0.2308,0.034,0.0632,0.2611,0.1868,0.1033,0.0,689.0,40.0329,-97.4888,14.0,0.04
75%,9.4872,21.282,587.463,349.289,13.732,0.0,98.717,2013.0,9.0,23.0,15.0,0.36012,0.70597,0.15872,0.1166,0.2775,0.0545,0.0912,0.3152,0.2791,0.1824,30.0,1531.0,45.5598,3.5957,26.0,10.72
max,85.0309,42.587,1094.341,473.011,75.684,15.493,103.383,2020.0,12.0,31.0,23.0,2.38835,0.93551,0.42385,0.7971,0.7729,0.7689,0.7865,0.4666,0.428,0.3573,30.0,3050.0,61.84741,24.29477,27.0,55.39


## Add static features

- `time_idx` gives the order of samples in time and is required by the PyTorch Forecasting time-series dataset. It also makes it easier to compute aggregated features when several observations share the same time point.

- We could also create categorical features, either from averages or by converting time features (e.g. month) to categories.

In [151]:
# sample df from pytorch libraries(for reference)
# from pytorch_forecasting.data.examples import get_stallion_data
# sample_data = get_stallion_data()
# sample_data 

Unnamed: 0,agency,sku,volume,date,industry_volume,soda_volume,avg_max_temp,price_regular,price_actual,discount,avg_population_2017,avg_yearly_household_income_2017,easter_day,good_friday,new_year,christmas,labor_day,independence_day,revolution_day_memorial,regional_games,fifa_u_17_world_cup,football_gold_cup,beer_capital,music_fest,discount_in_percent,timeseries
0,Agency_22,SKU_01,52.27200,2013-01-01,492612703,718394219,25.84524,1168.90367,1069.16619,99.73748,48151,132110,0,0,1,0,0,0,0,0,0,0,0,0,8.53257,0
238,Agency_37,SKU_04,0.00000,2013-01-01,492612703,718394219,26.50500,1852.27364,1611.46630,240.80734,32769,96761,0,0,1,0,0,0,0,0,0,0,0,0,13.00064,5
237,Agency_59,SKU_03,812.92140,2013-01-01,492612703,718394219,22.21974,1270.79501,1197.18426,73.61075,1219986,218902,0,0,1,0,0,0,0,0,0,0,0,0,5.79250,9
236,Agency_11,SKU_01,316.44000,2013-01-01,492612703,718394219,25.36000,1176.15540,1082.75749,93.39791,135561,100461,0,0,1,0,0,0,0,0,0,0,0,0,7.94095,14
235,Agency_05,SKU_05,420.90930,2013-01-01,492612703,718394219,24.07901,1327.00340,1207.82299,119.18040,3044268,182944,0,0,1,0,0,0,0,0,0,0,0,0,8.98117,22
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6765,Agency_08,SKU_03,9.81360,2017-12-01,618073219,919709619,25.37367,1706.41026,1455.26206,251.14820,71662,123610,0,0,0,1,0,0,0,0,0,0,0,0,14.71793,336
6764,Agency_60,SKU_05,2235.34950,2017-12-01,618073219,919709619,23.08107,1898.98156,1528.61611,370.36545,2180611,211936,0,0,0,1,0,0,0,0,0,0,0,0,19.50337,188
6763,Agency_19,SKU_05,87.54300,2017-12-01,618073219,919709619,27.43259,1902.16069,1547.29973,354.86095,48146,122042,0,0,0,1,0,0,0,0,0,0,0,0,18.65568,162
6771,Agency_60,SKU_03,325.87920,2017-12-01,618073219,919709619,23.08107,1704.50381,1444.44391,260.05990,2180611,211936,0,0,0,1,0,0,0,0,0,0,0,0,15.25722,187


In [152]:
data_df.head()

Unnamed: 0,GPP_NT_VUT_REF,TA_ERA,SW_IN_ERA,LW_IN_ERA,VPD_ERA,P_ERA,PA_ERA,datetime,year,month,day,hour,date,EVI,NDVI,NIRv,b1,b2,b3,b4,b5,b6,b7,IGBP,koppen,minute,site_id,elevation,lat,long,koppen_sub,c3c4,c4_percent
16228,-0.53574,5.311,25.016,272.218,1.708,0.0,97.939,2001-01-01 08:30:00,2001,1,1,8,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16229,0.86438,5.744,59.734,272.218,1.738,0.0,97.939,2001-01-01 09:00:00,2001,1,1,9,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,0,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16230,-0.02627,6.176,91.235,272.218,1.767,0.0,97.939,2001-01-01 09:30:00,2001,1,1,9,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16231,-0.17229,6.608,79.264,333.933,1.797,0.05,97.939,2001-01-01 10:00:00,2001,1,1,10,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,0,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59
16232,1.20865,7.043,94.929,333.933,1.817,0.0,97.923,2001-01-01 10:30:00,2001,1,1,10,2001-01-01,0.24998,0.73349,0.10592,0.0222,0.1444,0.0074,0.0267,0.1486,0.0977,0.0,EBF,Temperate,30,FR-Pue,270.0,43.7413,3.5957,8,C3,6.59


In [153]:
type(data_df)

pandas.core.frame.DataFrame

### Add time index to df

- Using `Series.rank` looks the best and fastest!
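To make the behaviour of a dense rank concrete, the sketch below reimplements it with the standard library and compares it with an index based on elapsed half-hour steps. Both function names are illustrative, and the elapsed-step variant is an alternative for comparison, not what the notebook does. A dense rank numbers the distinct timestamps consecutively, so a time point that is absent from every series leaves no gap in the index, while the elapsed-step index keeps the gap.

```python
from datetime import datetime, timedelta


def dense_rank_index(timestamps):
    """Map each timestamp to its 0-based rank among the distinct timestamps."""
    order = {ts: i for i, ts in enumerate(sorted(set(timestamps)))}
    return [order[ts] for ts in timestamps]


def elapsed_step_index(timestamps, step=timedelta(minutes=30)):
    """Map each timestamp to the number of whole steps since the earliest timestamp."""
    start = min(timestamps)
    return [(ts - start) // step for ts in timestamps]


stamps = [
    datetime(2001, 1, 1, 7, 30),
    datetime(2001, 1, 1, 8, 0),
    datetime(2001, 1, 1, 9, 30),  # 08:30 and 09:00 are absent everywhere
    datetime(2001, 1, 1, 8, 0),   # same time point observed at a second site
]

print(dense_rank_index(stamps))    # [0, 1, 2, 1]
print(elapsed_step_index(stamps))  # [0, 1, 4, 1]
```

Which variant fits depends on whether gaps shared by all sites should count as missing time steps.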

In [154]:
# add time index
# Index is helpful to find the order of rows in timeline 
# data_df["time_idx"] = data_df["year"]*24*30* + data_df["month"]*24*30 + data_df["day"]*24 + data_df["hour"]# year times 12 + month and find the time "ID"
# data_df["time_idx"] -= data_df["time_idx"].min() # substract minimum "ID" from the original to reduce the magnitude of IDs  

data_df["time_idx"] = data_df['datetime'].rank(method='dense').sub(1).astype(int)
# print (data_df)
print(f'time index = 0 {data_df[data_df["time_idx"] == 0]["datetime"]}')
print("")
print(f'time index = 1 {data_df[data_df["time_idx"] == 1]["datetime"]}')
print("")
print(f'time index = mean({int(data_df["time_idx"].mean())}) {data_df[data_df["time_idx"] == int(data_df["time_idx"].mean())]["datetime"]}')
print("")
print(f'time index = median({int(data_df["time_idx"].median())}) {data_df[data_df["time_idx"] == int(data_df["time_idx"].median())]["datetime"]}')
print("")
print(f'time index = max({int(data_df["time_idx"].max())}) {data_df[data_df["time_idx"] == int(data_df["time_idx"].max())]["datetime"]}')

# data_df["time_idx"].min()


# add additional features
# data_df["month"] = data_df.date.dt.month.astype(str).astype("category")  # categories have be strings
# data_df["log_volume"] = np.log(data_df.volume + 1e-8)
# data_df["avg_volume_by_sku"] = data_df.groupby(["time_idx", "sku"], observed=True).volume.transform("mean")
# data_df["avg_volume_by_agency"] = data_df.groupby(["time_idx", "agency"], observed=True).volume.transform("mean")

# we want to encode special days as one variable and thus need to first reverse one-hot encoding
# special_days = [
#     "easter_day",
#     "good_friday",
#     "new_year",
#     "christmas",
#     "labor_day",
#     "independence_day",
#     "revolution_day_memorial",
#     "regional_games",
#     "fifa_u_17_world_cup",
#     "football_gold_cup",
#     "beer_capital",
#     "music_fest",
# ]
# data_df[special_days] = data_df[special_days].apply(lambda x: x.map({0: "-", 1: x.name})).astype("category")
# data_df.sample(10, random_state=521)

time index = 0 247290   2001-01-01 07:30:00
554392   2001-01-01 07:30:00
Name: datetime, dtype: datetime64[ns]

time index = 1 247291   2001-01-01 08:00:00
554393   2001-01-01 08:00:00
Name: datetime, dtype: datetime64[ns]

time index = mean(92468) 95009     2010-04-19 13:00:00
400565    2010-04-19 13:00:00
514543    2010-04-19 13:00:00
633123    2010-04-19 13:00:00
726338    2010-04-19 13:00:00
930452    2010-04-19 13:00:00
1080385   2010-04-19 13:00:00
1245209   2010-04-19 13:00:00
1316391   2010-04-19 13:00:00
1415774   2010-04-19 13:00:00
1520379   2010-04-19 13:00:00
1720111   2010-04-19 13:00:00
1954425   2010-04-19 13:00:00
Name: datetime, dtype: datetime64[ns]

time index = median(92431) 94975     2010-04-18 09:30:00
312148    2010-04-18 09:30:00
400532    2010-04-18 09:30:00
514510    2010-04-18 09:30:00
633090    2010-04-18 09:30:00
726305    2010-04-18 09:30:00
930419    2010-04-18 09:30:00
1080351   2010-04-18 09:30:00
1245175   2010-04-18 09:30:00
1316358   2010-04-18 09:30:0

## Convert to TS dataset

In [215]:
max_prediction_length = 1000
max_encoder_length = 180000
training_cutoff = data_df["time_idx"].max() - max_prediction_length
training_cutoff

195704
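The cutoff is simply the largest `time_idx` minus the prediction horizon, so the printed value of 195704 implies a maximum `time_idx` of 196704. A small sketch of the same split on toy indices follows; the function name is illustrative.

```python
def split_by_cutoff(time_indices, max_prediction_length):
    """Split indices into a training part (<= cutoff) and a held-out part (> cutoff)."""
    cutoff = max(time_indices) - max_prediction_length
    train = [t for t in time_indices if t <= cutoff]
    held_out = [t for t in time_indices if t > cutoff]
    return cutoff, train, held_out


# Toy example: indices 0..9 with a horizon of 3 steps
cutoff, train, held_out = split_by_cutoff(list(range(10)), max_prediction_length=3)
print(cutoff, len(train), held_out)  # 6 7 [7, 8, 9]
```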


**Some key variables**

`group_ids` (List[str])

- list of column names identifying a time series. This means that the group_ids identify a sample together with the time_idx. If you have only one timeseries, set this to the name of column that is constant.

`allow_missing_timesteps=True` 

- if to allow missing timesteps that are automatically filled up. Missing values refer to gaps in the time_idx, e.g. if a specific timeseries has only samples for 1, 2, 4, 5, the sample for 3 will be generated on-the-fly. Allow missings does not deal with NA values. You should fill NA values before passing the dataframe to the TimeSeriesDataSet.
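Before relying on `allow_missing_timesteps`, it can be useful to see where the gaps are. The sketch below lists the missing steps per series, using the 1, 2, 4, 5 example from the description above. The site names and the helper are hypothetical.

```python
def missing_steps(time_indices):
    """Return the time steps between min and max that do not occur in the series."""
    present = set(time_indices)
    return [t for t in range(min(present), max(present) + 1) if t not in present]


# Hypothetical series: site_a lacks step 3, site_b is complete
series_by_site = {"site_a": [1, 2, 4, 5], "site_b": [0, 1, 2, 3]}
gaps = {site: missing_steps(idx) for site, idx in series_by_site.items()}
print(gaps)  # {'site_a': [3], 'site_b': []}
```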




(TS parameters to be updated on Sunday)

In [217]:
data_df

Unnamed: 0,GPP_NT_VUT_REF,time_idx,site_id,TA_ERA,SW_IN_ERA,LW_IN_ERA,VPD_ERA,P_ERA,PA_ERA,EVI,NDVI,NIRv,b1,b2,b3,b4,b5,b6,b7,elevation,lat,long
16228,-0.53574,2,FR-Pue,5.31100,25.01600,272.21800,1.70800,0.00000,97.93900,0.24998,0.73349,0.10592,0.02220,0.14440,0.00740,0.02670,0.14860,0.09770,0.00000,270.00000,43.74130,3.59570
16229,0.86438,3,FR-Pue,5.74400,59.73400,272.21800,1.73800,0.00000,97.93900,0.24998,0.73349,0.10592,0.02220,0.14440,0.00740,0.02670,0.14860,0.09770,0.00000,270.00000,43.74130,3.59570
16230,-0.02627,4,FR-Pue,6.17600,91.23500,272.21800,1.76700,0.00000,97.93900,0.24998,0.73349,0.10592,0.02220,0.14440,0.00740,0.02670,0.14860,0.09770,0.00000,270.00000,43.74130,3.59570
16231,-0.17229,5,FR-Pue,6.60800,79.26400,333.93300,1.79700,0.05000,97.93900,0.24998,0.73349,0.10592,0.02220,0.14440,0.00740,0.02670,0.14860,0.09770,0.00000,270.00000,43.74130,3.59570
16232,1.20865,6,FR-Pue,7.04300,94.92900,333.93300,1.81700,0.00000,97.92300,0.24998,0.73349,0.10592,0.02220,0.14440,0.00740,0.02670,0.14860,0.09770,0.00000,270.00000,43.74130,3.59570
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
2021168,7.46182,189873,IT-Lav,7.23800,243.77800,247.15600,4.12800,0.00000,87.15300,0.27206,0.49962,0.09898,0.06610,0.19810,0.05090,0.06500,0.15800,0.06300,0.02880,1353.00000,45.95620,11.28132
2021169,5.30198,189874,IT-Lav,7.21100,203.38000,247.15600,3.78300,0.00000,87.14600,0.27206,0.49962,0.09898,0.06610,0.19810,0.05090,0.06500,0.15800,0.06300,0.02880,1353.00000,45.95620,11.28132
2021170,8.55760,189875,IT-Lav,7.18400,132.72200,244.51000,3.43900,0.00000,87.13900,0.27206,0.49962,0.09898,0.06610,0.19810,0.05090,0.06500,0.15800,0.06300,0.02880,1353.00000,45.95620,11.28132
2021171,4.31962,189876,IT-Lav,7.57500,85.64500,244.51000,3.94700,0.00000,87.13000,0.27206,0.49962,0.09898,0.06610,0.19810,0.05090,0.06500,0.15800,0.06300,0.02880,1353.00000,45.95620,11.28132


In [218]:
data_df.columns

Index(['GPP_NT_VUT_REF', 'time_idx', 'site_id', 'TA_ERA', 'SW_IN_ERA',
       'LW_IN_ERA', 'VPD_ERA', 'P_ERA', 'PA_ERA', 'EVI', 'NDVI', 'NIRv', 'b1',
       'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 'elevation', 'lat', 'long'],
      dtype='object')

In [219]:
data_df = data_df[['GPP_NT_VUT_REF', 'time_idx', 'site_id',
                   'TA_ERA', 'SW_IN_ERA', 'LW_IN_ERA', 'VPD_ERA','P_ERA', 'PA_ERA',
                   'EVI', 'NDVI', 'NIRv', 'b1', 'b2', 'b3', 'b4', 'b5', 'b6', 'b7', 
                   'elevation', 'lat', 'long'
                   ]]

In [220]:
training = TimeSeriesDataSet(
    # limit training 
    data_df[lambda x: x.time_idx <= training_cutoff], 
    # time index is used as input to create TS dataset
    time_idx="time_idx",
    target="GPP_NT_VUT_REF",
    group_ids=["site_id"],
    allow_missing_timesteps=True,
    # min_encoder_length=max_encoder_length // 2,  # keep encoder length long (as it is in the validation set)
    # max_encoder_length=max_encoder_length,
    # min_prediction_length=1,
    # max_prediction_length=max_prediction_length,
    # static_categoricals=["agency", "sku"],
    # static_reals=["avg_population_2017", "avg_yearly_household_income_2017"],
    # time_varying_known_categoricals=["special_days", "month"],
    # variable_groups={"special_days": special_days},  # group of categorical variables can be treated as one variable
    # time_varying_known_reals=["time_idx", "price_regular", "discount_in_percent"],
    # time_varying_unknown_categoricals=[],
    # time_varying_unknown_reals=[
    #     "volume",
    #     "log_volume",
    #     "industry_volume",
    #     "soda_volume",
    #     "avg_max_temp",
    #     "avg_volume_by_agency",
    #     "avg_volume_by_sku",
    # ],
    # target_normalizer=GroupNormalizer(
    #     groups=["agency", "sku"], transformation="softplus"
    # ),  # use softplus and normalize by group
    # add_relative_time_idx=True,
    # add_target_scales=True,
    # add_encoder_length=True,
)

`allow_missing_timesteps=True`

Hint for handling the following error:
```
AssertionError: Time difference between steps has been idenfied as larger than 1 - set allow_missing_timesteps=True
```
https://github.com/jdb78/pytorch-forecasting/issues/134



### Create validation set

In [221]:
# create validation set (predict=True) which means to predict the last max_prediction_length points in time
# for each series
validation = TimeSeriesDataSet.from_dataset(training, data_df, predict=True, stop_randomization=True)
validation

TimeSeriesDataSet[length=14](
	time_idx='time_idx',
	target='GPP_NT_VUT_REF',
	group_ids=['site_id'],
	weight=None,
	max_encoder_length=30,
	min_encoder_length=30,
	min_prediction_idx=0,
	min_prediction_length=1,
	max_prediction_length=1,
	static_categoricals=[],
	static_reals=[],
	time_varying_known_categoricals=[],
	time_varying_known_reals=[],
	time_varying_unknown_categoricals=[],
	time_varying_unknown_reals=[],
	variable_groups={},
	constant_fill_strategy={},
	allow_missing_timesteps=True,
	lags={},
	add_relative_time_idx=False,
	add_target_scales=False,
	add_encoder_length=False,
	target_normalizer=EncoderNormalizer(
	method='standard',
	center=True,
	max_length=None,
	transformation=None,
	method_kwargs={}
),
	categorical_encoders={'__group_id__site_id': NaNLabelEncoder(add_nan=False, warn=True)},
	scalers={},
	randomize_length=None,
	predict_mode=True
)

# Create dataloaders for the model

In [222]:
# create dataloaders for model
batch_size = 32  # set this between 32 and 128
train_dataloader = training.to_dataloader(train=True, batch_size=batch_size, num_workers=0)
val_dataloader = validation.to_dataloader(train=False, batch_size=batch_size * 10, num_workers=0)

# Create "Baseline" model 

### (Reference on "baseline")

- The Baseline model uses the last known target value as its prediction.

https://pytorch-forecasting.readthedocs.io/en/stable/api/pytorch_forecasting.models.baseline.Baseline.html#pytorch_forecasting.models.baseline.Baseline
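The idea behind this baseline can be sketched without the library: repeat the last observed value over the forecast horizon and score it with the mean absolute error (MAE). The functions below are a plain-Python illustration with made-up numbers, not the library's implementation.

```python
def last_value_forecast(history, horizon):
    """Predict the last observed value for every step of the horizon."""
    return [history[-1]] * horizon


def mean_absolute_error(actuals, predictions):
    """Average absolute difference between actual and predicted values."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / len(actuals)


history = [1.0, 2.0, 4.0]  # observed target values
actuals = [5.0, 3.0]       # values that followed
preds = last_value_forecast(history, horizon=len(actuals))
mae = mean_absolute_error(actuals, preds)
print(preds, mae)  # [4.0, 4.0] 1.0
```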

In [223]:
# from pytorch_forecasting import BaseModel, MAE

# # generating predictions
# predictions = Baseline().predict(dataloader)

# # calculate baseline performance in terms of mean absolute error (MAE)
# metric = MAE()
# model = Baseline()
# for x, y in dataloader:
#     metric.update(model(x), y)

# metric.compute()

In [224]:
# calculate baseline mean absolute error, i.e. predict next value as the last available value from the history
actuals = torch.cat([y for x, (y, weight) in iter(val_dataloader)])
baseline_predictions = Baseline().predict(val_dataloader)
(actuals - baseline_predictions).abs().mean().item()

0.828112781047821

# Train the Temporal Fusion Transformer

- Set `optimizer="adam"` explicitly; otherwise an error is raised.

In [225]:
# configure network and trainer
pl.seed_everything(42)
trainer = pl.Trainer(
    gpus=0,
    # clipping gradients is a hyperparameter and important to prevent divergence
    # of the gradient for recurrent neural networks
    # gradient_clip_val=0.1,
)


tft = TemporalFusionTransformer.from_dataset(
    training,
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.03,
    hidden_size=16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience=4,
    optimizer="adam"
)
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

INFO:lightning_fabric.utilities.seed:Global seed set to 42
  rank_zero_deprecation(
INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs


Number of parameters in network: 14.2k
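`output_size=7` pairs with `QuantileLoss()`: the network outputs one value per quantile. The sketch below shows the standard pinball (quantile) loss for a single prediction. The quantile list is what I understand to be the library default and should be treated as an assumption, and the library may scale or aggregate the loss differently.

```python
def pinball_loss(actual, predicted, quantile):
    """Pinball loss: under-prediction is weighted by q, over-prediction by (1 - q)."""
    error = actual - predicted
    return max(quantile * error, (quantile - 1) * error)


# Assumed default quantiles (7 values, matching output_size=7)
quantiles = [0.02, 0.1, 0.25, 0.5, 0.75, 0.9, 0.98]

# For a high quantile, predicting too low costs more than predicting too high
under = pinball_loss(actual=10.0, predicted=8.0, quantile=0.9)
over = pinball_loss(actual=10.0, predicted=12.0, quantile=0.9)
print(under > over)  # True
```

At the 0.5 quantile the loss reduces to half the absolute error, which is why the median output can be compared against the MAE baseline.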


In [226]:
# find optimal learning rate
res = trainer.tuner.lr_find(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
    max_lr=10.0,
    min_lr=1e-6,
)

# print(f"suggested learning rate: {res.suggestion()}")
# fig = res.plot(show=True, suggest=True)
# fig.show()


Finding best initial lr:   0%|          | 0/100 [00:00<?, ?it/s]

AssertionError: ignored
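A learning-rate finder of this kind trains briefly while sweeping the learning rate from `min_lr` to `max_lr` and then looks at how the loss responds. The sketch below generates a log-spaced sweep over the same range. It is an illustration of the idea only: the function name is made up, the 100 steps are taken from the progress bar above, and the exact schedule Lightning uses may differ.

```python
def exponential_lr_sweep(min_lr, max_lr, num_steps):
    """Return num_steps learning rates spaced evenly on a log scale from min_lr to max_lr."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = max_lr / min_lr
    return [min_lr * ratio ** (i / (num_steps - 1)) for i in range(num_steps)]


lrs = exponential_lr_sweep(min_lr=1e-6, max_lr=10.0, num_steps=100)
print(lrs[0], lrs[-1])
```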

## tft with optimized learning rate

In [227]:
# configure network and trainer
# early_stop_callback = EarlyStopping(monitor="val_loss", min_delta=1e-4, patience=10, verbose=False, mode="min")
lr_logger = LearningRateMonitor()  # log the learning rate
logger = TensorBoardLogger("lightning_logs")  # logging results to a tensorboard

In [228]:
trainer = pl.Trainer(
    max_epochs=5,
    # max_steps=100,
    gpus=0,
    enable_model_summary=True,
    gradient_clip_val=0.1,
    limit_train_batches=30,  # comment in for training, running validation every 30 batches
    fast_dev_run=False,  # comment in to check that the network or dataset has no serious bugs
    # fit may stop early when fast_dev_run is set to True
    callbacks=[lr_logger],#, early_stop_callback], # logger + early stopping callback
    logger=logger,
)

INFO:pytorch_lightning.utilities.rank_zero:GPU available: False, used: False
INFO:pytorch_lightning.utilities.rank_zero:TPU available: False, using: 0 TPU cores
INFO:pytorch_lightning.utilities.rank_zero:IPU available: False, using: 0 IPUs
INFO:pytorch_lightning.utilities.rank_zero:HPU available: False, using: 0 HPUs
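With `limit_train_batches=30` and the `batch_size = 32` set earlier, each epoch sees only a small part of the data. The arithmetic is sketched below; the helper name is illustrative, and the counts are training windows drawn by the dataloader, not raw dataframe rows.

```python
def max_samples_seen(batch_size, limit_train_batches, max_epochs):
    """Upper bound on training samples per epoch and over the whole run."""
    per_epoch = batch_size * limit_train_batches
    return per_epoch, per_epoch * max_epochs


per_epoch, total = max_samples_seen(batch_size=32, limit_train_batches=30, max_epochs=5)
print(per_epoch, total)  # 960 4800
```

These numbers suit a quick smoke test; a full training run would raise or remove the batch limit.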


In [229]:
tft = TemporalFusionTransformer.from_dataset(
    training,
    
    # not meaningful for finding the learning rate but otherwise very important
    learning_rate=0.03,
    hidden_size=16,  # most important hyperparameter apart from learning rate
    # number of attention heads. Set to up to 4 for large datasets
    attention_head_size=1,
    dropout=0.1,  # between 0.1 and 0.3 are good values
    hidden_continuous_size=8,  # set to <= hidden_size
    output_size=7,  # 7 quantiles by default
    loss=QuantileLoss(),
    # reduce learning rate if no improvement in validation loss after x epochs
    reduce_on_plateau_patience=4,
    optimizer="adam")
print(f"Number of parameters in network: {tft.size()/1e3:.1f}k")

Number of parameters in network: 14.2k


If `fast_dev_run` is set to true in the trainer config, training keeps stopping with the message that `max_steps=1` was reached, even when `max_steps=100` is set.


In [230]:
val_dataloader

<torch.utils.data.dataloader.DataLoader at 0x7f699f23afa0>

In [231]:
trainer.fit(
    tft,
    train_dataloaders=train_dataloader,
    val_dataloaders=val_dataloader,
)

INFO:pytorch_lightning.callbacks.model_summary:
   | Name                               | Type                            | Params
----------------------------------------------------------------------------------------
0  | loss                               | QuantileLoss                    | 0     
1  | logging_metrics                    | ModuleList                      | 0     
2  | input_embeddings                   | MultiEmbedding                  | 0     
3  | prescalers                         | ModuleDict                      | 0     
4  | static_variable_selection          | VariableSelectionNetwork        | 0     
5  | encoder_variable_selection         | VariableSelectionNetwork        | 0     
6  | decoder_variable_selection         | VariableSelectionNetwork        | 0     
7  | static_context_variable_selection  | GatedResidualNetwork            | 1.1 K 
8  | static_context_initial_hidden_lstm | GatedResidualNetwork            | 1.1 K 
9  | static_context_initial_cell_

Sanity Checking: 0it [00:00, ?it/s]

Training: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

Validation: 0it [00:00, ?it/s]

INFO:pytorch_lightning.utilities.rank_zero:`Trainer.fit` stopped: `max_epochs=5` reached.


# Evaluate performance

In [180]:
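# load the best model according to the validation loss
# (with early stopping enabled, this is not necessarily the last epoch)
best_model_path = trainer.checkpoint_callback.best_model_path
# best_model_path is an empty string if no checkpoint was saved (for example, when this
# cell runs before trainer.fit); loading "" is a likely cause of the IsADirectoryError below,
# so fall back to the in-memory model in that case
best_tft = TemporalFusionTransformer.load_from_checkpoint(best_model_path) if best_model_path else tft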

IsADirectoryError: ignored

In [None]:
# calculate mean absolute error on validation set
actuals = torch.cat([y[0] for x, y in iter(val_dataloader)])
predictions = best_tft.predict(val_dataloader)
(actuals - predictions).abs().mean()

In [None]:
# raw predictions are a dictionary from which all kinds of information, including quantiles, can be extracted
raw_predictions, x = best_tft.predict(val_dataloader, mode="raw", return_x=True)

In [None]:
for idx in range(10):  # plot 10 examples
    best_tft.plot_prediction(x, raw_predictions, idx=idx, add_loss_to_title=True);