# Module - Benchmarking
Ontime provides a Benchmark class that runs several prediction models on several datasets and collects the results for comparison.

In [1]:
# Add src to the path so that the ontime package can be imported
import sys
sys.path.insert(0, '../../../../src')

from ontime.module.benchmarking.benchmark import Benchmark



## Initialization
A Benchmark instance can be initialized with a list of datasets, models and metrics to run through. When run() is invoked, it trains and tests every model on every dataset, and computes every metric on the predicted data.
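The loop described above can be sketched in plain Python. This is only an illustration of the idea, not ontime's implementation: `run_benchmark`, `naive_last`, `naive_mean` and the toy datasets are hypothetical names introduced here, and the split by `train_proportion` is a simple chronological cut assumed for the example.

```python
from statistics import mean

def naive_last(train):
    """Hypothetical model: repeat the last training value."""
    last = train[-1]
    return lambda horizon: [last] * horizon

def naive_mean(train):
    """Hypothetical model: repeat the mean of the training values."""
    avg = mean(train)
    return lambda horizon: [avg] * horizon

def mae(actual, predicted):
    """Mean absolute error between two equally long sequences."""
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def run_benchmark(datasets, models, metrics, train_proportion=0.9):
    """Fit every model on every dataset and compute every metric."""
    results = {}
    for model_name, fit in models.items():
        for ds_name, series in datasets.items():
            # Chronological split: the first part trains, the rest tests.
            split = int(len(series) * train_proportion)
            train, test = series[:split], series[split:]
            forecast = fit(train)(len(test))
            results[(model_name, ds_name)] = {
                metric_name: fn(test, forecast)
                for metric_name, fn in metrics.items()
            }
    return results

datasets = {"ramp": list(range(10)), "flat": [5.0] * 10}
models = {"last": naive_last, "mean": naive_mean}
metrics = {"MAE": mae}

results = run_benchmark(datasets, models, metrics)
for key, scores in results.items():
    print(key, scores)
```

Each (model, dataset) pair gets its own entry, which is the same shape of result that a report per model and per dataset needs.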


### Preparing models
Models are wrapped in BenchmarkModelHolders that will instantiate them for each dataset.
If a model can't be instantiated and invoked the way BenchmarkModelHolder's implementation does it, a child class can be written and submitted instead.

In [2]:
from ontime.core.time_series.time_series import TimeSeries
from typing import Any
from darts.models import ARIMA, BATS

m1 = Benchmark.BenchmarkModelHolder(ARIMA, 'ARIMA', {'p': 12, 'd': 1, 'q': 2})
m2 = Benchmark.BenchmarkModelHolder(BATS, 'BATS', {'use_trend': True})
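To illustrate why a holder stores the model class and its arguments rather than a model instance, here is a minimal sketch of the pattern. `ModelHolderSketch` and `MovingAverage` are hypothetical stand-ins written for this example; they are not ontime's BenchmarkModelHolder or a Darts model, and the real holder's method signatures may differ.

```python
class ModelHolderSketch:
    """Hypothetical stand-in for a model holder (not ontime's class)."""

    def __init__(self, model_class, name, arguments=None):
        self.model_class = model_class
        self.name = name
        self.arguments = dict(arguments or {})
        self.model = None

    def instantiate(self):
        # Build a fresh, untrained model so that no fitted state
        # leaks from one dataset to the next.
        self.model = self.model_class(**self.arguments)
        return self.model


class MovingAverage:
    """Toy model: forecasts the mean of the last `window` values."""

    def __init__(self, window=3):
        self.window = window
        self.level = None

    def fit(self, values):
        tail = values[-self.window:]
        self.level = sum(tail) / len(tail)
        return self

    def predict(self, n):
        return [self.level] * n


holder = ModelHolderSketch(MovingAverage, "MA", {"window": 2})
first = holder.instantiate().fit([1, 2, 3, 4])
second = holder.instantiate().fit([10, 20])
print(first.predict(2), second.predict(2))
```

Because `instantiate()` is called once per dataset, the two fitted models are independent objects even though they come from the same holder.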

### BenchmarkModelHolder child class example
Here we use a Darts RNN (based on PyTorch), since BenchmarkModelHolder does not currently support PyTorch models by default.

In [27]:
from darts.models import RNNModel
import pandas as pd

class RNNHolder(Benchmark.BenchmarkModelHolder):
    def __init__(self, name = "RNN", arguments_dict=None):
        if arguments_dict is None:
            arguments_dict = {'input_chunk_length': 1, 'pred_length': 1}
        self.model = None
        self.name = name
        self.input_chunk_length = arguments_dict['input_chunk_length']
        self.pred_length = arguments_dict['pred_length']
        self.train_tmp = None
        self.train_cov = None
    
    def instantiate(self, train_set: TimeSeries, test_set: TimeSeries, **kwargs):
        self.model = RNNModel(    
            model="RNN",
            input_chunk_length=self.input_chunk_length,
            output_chunk_length=1,
            n_epochs=20,
        )
    
    def fit(self, training_set: TimeSeries, test_set: TimeSeries, target_column, multivariate= False):
        train_target = training_set[target_column]
        self.train_cov = training_set.drop_columns(target_column)
        self.model.fit(train_target)#, future_covariates = self.train_cov)
        
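    def predict(self, pred_length, test_set: TimeSeries, target_column, multivariate = False):
        # The pred_length argument is overridden by the test set length minus one.
        pred_length = len(test_set.time_index) - 1
        target = test_set[target_column]
        # Covariates are assembled here but not passed to the model
        # (see the commented-out arguments below).
        cov = pd.concat([self.train_cov.pd_dataframe(), test_set.drop_columns(target_column).pd_dataframe()])
        cov = TimeSeries.from_pandas(cov)
        # With series=target, Darts forecasts the steps that follow the end of that series.
        return self.model.predict(pred_length, 
                    series = target, 
                    #past_covariates = cov, 
                    #future_covariates = cov, 
                    verbose = False)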

In [28]:
m3 = RNNHolder(name='RNN', arguments_dict = {'input_chunk_length': 10, 'pred_length': 1})


### Preparing datasets
Datasets submitted to a Benchmark must be of type TimeSeries. Wrapping a dataset in a BenchmarkDataset beforehand lets you name it, supply training and test sets, and declare whether it is univariate or multivariate. A tuple of TimeSeries (train set, test set) can also be submitted.

In [29]:
from ontime.module.data.dataset import Dataset
from darts.utils.model_selection import train_test_split

d1 = Dataset.AirPassengersDataset.load() 
ausbeer = Dataset.AusBeerDataset.load()
d2 = Benchmark.BenchmarkDataset(ausbeer, multivariate = False, name = "AusBeerDataset")
heartrate = Dataset.HeartRateDataset.load()
heartrate_train, heartrate_test = train_test_split(heartrate, test_size = 0.5)
d3 = (heartrate_train, heartrate_test)

### Preparing metrics
Metrics must be wrapped in a Benchmark.BenchmarkMetric instance. Again, if the function can't be invoked as in BenchmarkMetric's implementation, a child class can be written and submitted.

In [30]:
import darts

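# Note: this metric is labelled "RMSE", but the function passed is Darts' coefficient of variation;
# pass darts.metrics.metrics.rmse instead to report an actual RMSE.
me1 = Benchmark.BenchmarkMetric(name="RMSE", metric_function=darts.metrics.metrics.coefficient_of_variation)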
me2 = Benchmark.BenchmarkMetric(name="MAE", metric_function=darts.metrics.metrics.mae)
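Since the name given to a metric is only a label, it helps to know how the underlying functions differ. The sketch below computes RMSE, MAE and coefficient of variation with NumPy on made-up numbers. The coefficient of variation is written here in its usual form, RMSE as a percentage of the mean of the actual values; that Darts uses exactly this definition is an assumption of this example.

```python
import numpy as np

def rmse(actual, predicted):
    """Root mean squared error."""
    return float(np.sqrt(np.mean((actual - predicted) ** 2)))

def mae(actual, predicted):
    """Mean absolute error."""
    return float(np.mean(np.abs(actual - predicted)))

def coefficient_of_variation(actual, predicted):
    """RMSE expressed as a percentage of the mean of the actual values."""
    return 100.0 * rmse(actual, predicted) / float(np.mean(actual))

# Illustrative values only, not taken from any dataset in this notebook.
actual = np.array([100.0, 110.0, 120.0, 130.0])
predicted = np.array([90.0, 110.0, 130.0, 130.0])

print("RMSE:", rmse(actual, predicted))
print("MAE:", mae(actual, predicted))
print("CV (%):", coefficient_of_variation(actual, predicted))
```

On the same forecast, RMSE is never smaller than MAE, whereas the coefficient of variation is a percentage and can fall on either side of both.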

## Creating a Benchmark

In [31]:
benchmark = Benchmark(datasets = [d1, d2, d3], # datasets submitted as simple TimeSeries will be given a number as a name
                      models = [m1, m2, m3], 
                      metrics = [me1, me2], 
                      train_proportion=0.9)

Datasets, models and metrics can also be added after instantiation. Adding a dataset this way lets you give it a name.

In [32]:
benchmark2 = Benchmark()
benchmark2.add_model(m1)
benchmark2.add_dataset(d1, name = "AirPassengerDataset")
benchmark2.add_metric(me1)

Once the models and datasets have been added, the run() method trains an instance of every model on each dataset individually and computes the metrics. The verbose parameter prints the status and results of the process as it progresses, and the debug parameter prints error messages (warnings are printed regardless).

In [33]:
benchmark.run(verbose = True, debug = True)

Starting evaluation...
Evaluation for model ARIMA
on dataset 0, column #Passengers 
training... 

  warn('Non-stationary starting autoregressive parameters'


done, took 1.317589521408081
testing... done, took 0.004683971405029297
RMSE: Index(['#Passengers'], dtype='object', name='component') #Passengers
3.6823375607771083
MAE: Index(['#Passengers'], dtype='object', name='component') #Passengers
13.397485829577487
on dataset AusBeerDataset, column Y 
training... 



done, took 1.8718085289001465
testing... done, took 0.00490117073059082
RMSE: Index(['Y'], dtype='object', name='component') Y
4.350713610378453
MAE: Index(['Y'], dtype='object', name='component') Y
15.369339480209591
on dataset 2, column Heart rate 
training... done, took 3.502897024154663
testing... done, took 0.028519868850708008
RMSE: Index(['Heart rate'], dtype='object', name='component') Heart rate
6.606337568439427
MAE: Index(['Heart rate'], dtype='object', name='component') Heart rate
5.044218953626089
Evaluation for model BATS
on dataset 0, column #Passengers 
training... 



done, took 8.070533275604248
testing... done, took 0.0036497116088867188
RMSE: Index(['#Passengers'], dtype='object', name='component') #Passengers
8.321211346269285
MAE: Index(['#Passengers'], dtype='object', name='component') #Passengers
33.570078021466806
on dataset AusBeerDataset, column Y 
training... done, took 13.329428911209106
testing... done, took 0.0023772716522216797
RMSE: Index(['Y'], dtype='object', name='component') Y
4.378445630437863
MAE: Index(['Y'], dtype='object', name='component') Y
13.398303193058819
on dataset 2, column Heart rate 
training... 

darts.models.forecasting.torch_forecasting_model INFO  Train dataset contains 104 samples.
darts.models.forecasting.torch_forecasting_model INFO  Time series values are 64-bits; casting model to float64.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params
---------------------------------------------------
0 | criterion     | MSELoss          | 0     
1 | train_metrics | MetricCollection | 0     
2 | val_metrics   | MetricCollection | 0     
3 | rnn           | RNN              | 700   
4 | V             | Linear           | 26    
---------------------------------------------------
726       Trainable params
0         Non-trainable params
726       Total params
0.003     Total estimated model params size (MB)


done, took 17.23955535888672
testing... done, took 0.002389192581176758
RMSE: Index(['Heart rate'], dtype='object', name='component') Heart rate
6.577731759671614
MAE: Index(['Heart rate'], dtype='object', name='component') Heart rate
4.963032869193333
Evaluation for model RNN
on dataset 0, column #Passengers 
training... 

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=20` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
darts.models.forecasting.torch_forecasting_model INFO  Train dataset contains 165 samples.
darts.models.forecasting.torch_forecasting_model INFO  Time series values are 64-bits; casting model to float64.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params
---------------------------------------------------
0 | criterion     | MSELoss          | 0     
1 | train_metrics | MetricCollection | 

done, took 1.2166600227355957
testing... done, took 0.08661961555480957
RMSE: Index(['#Passengers'], dtype='object', name='component') #Passengers
nan
MAE: Index(['#Passengers'], dtype='object', name='component') #Passengers
nan
on dataset AusBeerDataset, column Y 
training... 

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=20` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
darts.models.forecasting.torch_forecasting_model INFO  Train dataset contains 876 samples.
darts.models.forecasting.torch_forecasting_model INFO  Time series values are 64-bits; casting model to float64.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]

  | Name          | Type             | Params
---------------------------------------------------
0 | criterion     | MSELoss          | 0     
1 | train_metrics | MetricCollection | 

done, took 1.6621193885803223
testing... done, took 0.08975815773010254
RMSE: Index(['Y'], dtype='object', name='component') Y
nan
MAE: Index(['Y'], dtype='object', name='component') Y
nan
on dataset 2, column Heart rate 
training... 

Training: |          | 0/? [00:00<?, ?it/s]

`Trainer.fit` stopped: `max_epochs=20` reached.
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
HPU available: False, using: 0 HPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]


done, took 7.097806453704834
testing... done, took 0.28192710876464844
RMSE: Index(['Heart rate'], dtype='object', name='component') Heart rate
nan
MAE: Index(['Heart rate'], dtype='object', name='component') Heart rate
nan


  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)
  return _methods._mean(a, axis=axis, dtype=dtype,
  ret = ret.dtype.type(ret / rcount)


To view the results, call get_report() and print the returned value.

In [35]:
print(benchmark.get_report())

Model ARIMA:
Supported univariate datasets: ✓
Supported multivariate datasets: unknown
Dataset 0, #Passengers:
nb features: 1
target column: #Passengers
training set size: 128
training time: 1.317589521408081
test set size: 16
testing time: 0.004683971405029297
RMSE: 3.6823375607771083
MAE: 13.397485829577487
Dataset AusBeerDataset, Y:
nb features: 1
target column: Y
training set size: 189
training time: 1.8718085289001465
test set size: 22
testing time: 0.00490117073059082
RMSE: 4.350713610378453
MAE: 15.369339480209591
Dataset 2, Heart rate:
nb features: 1
target column: Heart rate
training set size: 900
training time: 3.502897024154663
test set size: 900
testing time: 0.028519868850708008
RMSE: 6.606337568439427
MAE: 5.044218953626089


Model BATS:
Supported univariate datasets: ✓
Supported multivariate datasets: unknown
Dataset 0, #Passengers:
nb features: 1
target column: #Passengers
training set size: 128
training time: 8.070533275604248
test set size: 16
testing time: 0.00364971

You can also get the results by calling get_report_dataframes(). They are returned as a dictionary with the model names as keys and dataframes as values, where the columns are the measures (testing time, metrics, etc.) and the index holds the dataset names.
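A dictionary of per-model dataframes like the one described above is easy to combine for side-by-side comparison. The sketch below rebuilds a small part of such a dictionary by hand, using the metric values printed by the run earlier in this notebook and keeping the labels as reported there, then stacks it with pandas. The `reports` variable is a hypothetical stand-in for the returned dictionary, and dataset names are written as strings here to keep the index sortable.

```python
import pandas as pd

datasets = ["0", "AusBeerDataset", "2"]

# Hand-built stand-in for the dictionary: model name -> dataframe.
reports = {
    "ARIMA": pd.DataFrame(
        {"RMSE": [3.682338, 4.350714, 6.606338],
         "MAE": [13.397486, 15.369339, 5.044219]},
        index=datasets,
    ),
    "BATS": pd.DataFrame(
        {"RMSE": [8.321211, 4.378446, 6.577732],
         "MAE": [33.570078, 13.398303, 4.963033]},
        index=datasets,
    ),
}

# Stack all models into one frame indexed by (model, dataset).
combined = pd.concat(reports, names=["model", "dataset"])

# One row per dataset, one column per model, for a single metric.
mae_by_model = combined["MAE"].unstack("model")

# Name of the model with the lowest MAE on each dataset.
best = mae_by_model.idxmin(axis=1)
print(mae_by_model)
print(best)
```

Keeping the model name as an index level rather than a column makes it straightforward to pivot any single measure into a dataset-by-model table.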

In [36]:
dfs = benchmark.get_report_dataframes()
for model_name, df in dfs.items():
    print(f'{model_name}:')
    print(df)
    print("--------------------------------------------------")

ARIMA:
                nb features target column  training set size  training time  \
0                         1   #Passengers                128       1.317590   
AusBeerDataset            1             Y                189       1.871809   
2                         1    Heart rate                900       3.502897   

                test set size  testing time  prediction      RMSE        MAE  
0                          16      0.004684         NaN  3.682338  13.397486  
AusBeerDataset             22      0.004901         NaN  4.350714  15.369339  
2                         900      0.028520         NaN  6.606338   5.044219  
--------------------------------------------------
BATS:
                nb features target column  training set size  training time  \
0                         1   #Passengers                128       8.070533   
AusBeerDataset            1             Y                189      13.329429   
2                         1    Heart rate                900      