# Module - Benchmarking
Ontime provides a Benchmark class that runs a set of prediction models on a set of datasets and reports timing and metric results for each combination.

In [1]:
# Add the local src directory to the path so the ontime package can be imported
import sys
sys.path.insert(0, '../../../../src')

from ontime.module.benchmarking.benchmark import Benchmark

## Initialization
A Benchmark instance can be initialized with a list of datasets, models and metrics to run through. When run() is invoked, it trains and tests every model on every dataset and computes every metric on the predictions.
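
The "every model on every dataset, every metric on the predictions" loop can be sketched in plain Python. This is only an illustration of the idea, not Benchmark's implementation: `run_benchmark`, `naive_last`, `naive_mean` and `mae` are hypothetical names, and the models are toy functions working on lists.

```python
from statistics import mean

# Hypothetical stand-ins for forecasting models: each takes a training list
# and a horizon, and returns that many predicted values.
def naive_last(train, horizon):
    return [train[-1]] * horizon

def naive_mean(train, horizon):
    return [mean(train)] * horizon

def mae(actual, predicted):
    return mean(abs(a - p) for a, p in zip(actual, predicted))

def run_benchmark(datasets, models, metrics):
    """Evaluate every model on every (train, test) pair with every metric."""
    results = {}
    for model_name, model in models.items():
        for dataset_name, (train, test) in datasets.items():
            predicted = model(train, len(test))
            results[(model_name, dataset_name)] = {
                metric_name: metric(test, predicted)
                for metric_name, metric in metrics.items()
            }
    return results

# Toy datasets, already split into (train, test)
datasets = {
    "ramp": ([1, 2, 3, 4, 5, 6, 7, 8], [9, 10]),
    "flat": ([5.0] * 8, [5.0, 5.0]),
}
models = {"last": naive_last, "mean": naive_mean}
results = run_benchmark(datasets, models, {"MAE": mae})
```

The result holds one entry per (model, dataset) pair, which is the same grid the report further down presents model by model.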


### Preparing models
Models are wrapped in BenchmarkModelHolders, which instantiate them for each dataset.
If a model can't be instantiated and invoked the way BenchmarkModelHolder's implementation does it, a child class can be written and submitted instead.

In [2]:
from darts.models import ARIMA

m1 = Benchmark.BenchmarkModelHolder(ARIMA, 'ARIMA', {'p': 12, 'd': 1, 'q': 2})

In [3]:
from ontime.core.time_series.time_series import TimeSeries
from typing import Any
from darts.models import BATS

# Example of a BenchmarkModelHolder child class.
# BATS could be submitted directly, as was done with ARIMA; the subclass is for illustration only.
class BATSHolder(Benchmark.BenchmarkModelHolder):
    def __init__(self, name: str, arguments_dict: dict = None):
        super().__init__(BATS, name, arguments_dict=arguments_dict)
        
    def instantiate(self, train_set: TimeSeries, test_set: TimeSeries):
        self.model_instance = BATS(**self.args)

    def fit(self, training_set: TimeSeries, test_set: TimeSeries):
        self.model_instance.fit(training_set)

    def predict(self, horizon: Any, dataset: TimeSeries) -> Any:
        return self.model_instance.predict(horizon)

In [4]:
m2 = BATSHolder(name='BATS', arguments_dict = {'use_trend': True})
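
One reason to hand the benchmark a holder rather than a ready-made model is that a fresh instance can be built for each dataset. The sketch below illustrates that pattern with plain Python. `ModelHolder` and `DriftModel` are hypothetical names, and the method signatures are simplified compared with the BATSHolder example above.

```python
class ModelHolder:
    """Hypothetical holder: stores a model class and its arguments."""

    def __init__(self, model_class, name, arguments=None):
        self.model_class = model_class
        self.name = name
        self.arguments = dict(arguments or {})
        self.model_instance = None

    def instantiate(self):
        # Build a new instance each time, so nothing fitted on one dataset
        # carries over to the next one.
        self.model_instance = self.model_class(**self.arguments)

    def fit(self, train):
        self.model_instance.fit(train)

    def predict(self, horizon):
        return self.model_instance.predict(horizon)


class DriftModel:
    """Toy model: continues from the last value using the average step."""

    def __init__(self, scale=1.0):
        self.scale = scale
        self.last = None
        self.step = None

    def fit(self, train):
        self.last = train[-1]
        self.step = (train[-1] - train[0]) / (len(train) - 1)

    def predict(self, horizon):
        return [self.last + self.scale * self.step * (i + 1) for i in range(horizon)]


holder = ModelHolder(DriftModel, "Drift", {"scale": 1.0})
forecasts = {}
for dataset_name, train in {"up": [1, 2, 3, 4], "down": [8, 6, 4]}.items():
    holder.instantiate()  # fresh model per dataset
    holder.fit(train)
    forecasts[dataset_name] = holder.predict(2)
```

A subclass only needs to override the steps whose calling convention differs, which is what the BATSHolder example does.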


### Preparing datasets
Datasets submitted to a Benchmark must be of type TimeSeries. Wrapping a dataset in a BenchmarkDataset beforehand makes it possible to name it, to provide explicit training and test sets, and to declare whether it is univariate or multivariate. A tuple of TimeSeries (train set, test set) can also be submitted.

In [5]:
from ontime.module.data.dataset import Dataset
from darts.utils.model_selection import train_test_split

d1 = Dataset.AirPassengersDataset.load() 
ausbeer = Dataset.AusBeerDataset.load()
d2 = Benchmark.BenchmarkDataset(ausbeer, multivariate = False, name = "AusBeerDataset")
heartrate = Dataset.HeartRateDataset.load()
heartrate_train, heartrate_test = train_test_split(heartrate, test_size = 0.5)
d3 = (heartrate_train, heartrate_test)

### Preparing metrics
Metrics must be wrapped in a Benchmark.BenchmarkMetric instance. Again, if the function can't be invoked as in BenchmarkMetric's implementation, a child class can be written and submitted.

In [6]:
import darts

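# Caution: the metric named "RMSE" wraps darts' coefficient_of_variation, not an RMSE function.
# Use darts.metrics.metrics.rmse as metric_function to report a true RMSE.
me1 = Benchmark.BenchmarkMetric(name="RMSE", metric_function=darts.metrics.metrics.coefficient_of_variation)
me2 = Benchmark.BenchmarkMetric(name="MAE", metric_function=darts.metrics.metrics.mae)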
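
To make the difference between these metrics concrete, here is a small plain-Python sketch of MAE, RMSE and the coefficient of variation. It assumes the coefficient of variation is defined as RMSE expressed as a percentage of the mean of the actual values, which is my understanding of the darts function used above; the helper functions are illustrative and do not call darts.

```python
from math import sqrt

def mae(actual, predicted):
    # Mean absolute error
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def rmse(actual, predicted):
    # Root mean squared error
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))

def coefficient_of_variation(actual, predicted):
    # Assumed definition: RMSE as a percentage of the mean actual value
    return 100 * rmse(actual, predicted) / (sum(actual) / len(actual))

actual = [100.0, 200.0, 300.0]
predicted = [110.0, 190.0, 330.0]

mae_value = mae(actual, predicted)
rmse_value = rmse(actual, predicted)
cv_value = coefficient_of_variation(actual, predicted)
```

RMSE can never be smaller than MAE on the same predictions, whereas the coefficient of variation can be, because it is scaled by the mean of the series. This is worth keeping in mind when reading the values reported under the "RMSE" label in the run output, since that metric wraps coefficient_of_variation.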

## Creating a Benchmark

In [7]:
benchmark = Benchmark(datasets = [d1, d2, d3], # datasets submitted as simple TimeSeries will be given a number as a name
                      models = [m1, m2], 
                      metrics = [me1, me2], 
                      train_proportion=0.9)
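
For datasets that are not already split, train_proportion decides how much of each series goes to training. A minimal sketch of such a chronological split is shown below. `split_by_proportion` is a hypothetical helper working on lists, and truncating the cut index is an assumption: the library's exact rounding rule is not shown here, so the sizes in the report may follow a different rule.

```python
def split_by_proportion(series, train_proportion):
    """Split a sequence chronologically: first part for training, rest for testing."""
    if not 0 < train_proportion < 1:
        raise ValueError("train_proportion must be strictly between 0 and 1")
    cut = int(len(series) * train_proportion)  # assumption: truncate toward zero
    return series[:cut], series[cut:]

series = list(range(20))
train, test = split_by_proportion(series, 0.9)
```

The split keeps the time order intact: the test set always comes after the training set, so the model is never evaluated on values older than those it was fitted on.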

Datasets, models and metrics can also be added after instantiation. Adding a dataset this way also makes it possible to give it a name.

In [8]:
benchmark2 = Benchmark()
benchmark2.add_model(m1)
benchmark2.add_dataset(d1, name = "AirPassengerDataset")
benchmark2.add_metric(me1)

Once the models and datasets have been added, the run() method trains an instance of every model on each dataset individually and computes the metrics. The verbose parameter prints the status and results of the process as it progresses, and the debug parameter prints error messages (warnings are printed regardless).
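
The role of the two flags can be sketched as a timed step with optional progress and error output. This is an illustration only: `run_step` is a hypothetical helper, and continuing after a failure instead of stopping is an assumption about the behavior, not something confirmed by the library.

```python
import time

def run_step(label, func, verbose=False, debug=False):
    """Time one step; return (result, seconds), or (None, None) if it fails."""
    if verbose:
        print(f"{label}... ", end="")
    start = time.perf_counter()
    try:
        result = func()
    except Exception as error:
        if verbose:
            print("failed")
        if debug:
            # Error details are shown only when debug is enabled
            print(f"error during {label}: {error!r}")
        return None, None
    elapsed = time.perf_counter() - start
    if verbose:
        print(f"done, took {elapsed}")
    return result, elapsed

# A step that succeeds and a step that raises
ok_result, ok_time = run_step("training", lambda: sum(range(1000)), verbose=True)
bad_result, bad_time = run_step("testing", lambda: 1 / 0, verbose=True, debug=True)
```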

In [9]:
benchmark.run(verbose = True, debug = True)

Starting evaluation...
Evaluation for model ARIMA
on dataset 1 
training... 

  warn('Non-stationary starting autoregressive parameters'


done, took 0.7708535194396973
testing... done, took 0.004489421844482422
RMSE: 3.6823375607771083
MAE: 13.397485829577487
on dataset AusBeerDataset 
training... 



done, took 0.8976242542266846
testing... done, took 0.004706382751464844
RMSE: 4.350713610378453
MAE: 15.369339480209591
on dataset 3 
training... done, took 3.46706223487854
testing... done, took 0.030436277389526367
RMSE: 6.606337568439427
MAE: 5.044218953626089
Evaluation for model BATS
on dataset 1 
training... 



done, took 7.672419548034668
testing... done, took 0.0022764205932617188
RMSE: 8.321211346269285
MAE: 33.570078021466806
on dataset AusBeerDataset 
training... done, took 12.87717056274414
testing... done, took 0.0022611618041992188
RMSE: 4.378445630437863
MAE: 13.398303193058819
on dataset 3 
training... done, took 16.338427543640137
testing... done, took 0.002260923385620117
RMSE: 6.577731759671614
MAE: 4.963032869193333


To view the results, call get_report() and print the returned value.

In [10]:
print(benchmark.get_report())

Model ARIMA:
Supported univariate datasets: ✓
Supported multivariate datasets: unknown
Dataset 1:
nb features: 1
training set size: 128
training time: 0.7708535194396973
test set size: 16
testing time: 0.004489421844482422
RMSE: 3.6823375607771083
MAE: 13.397485829577487
Dataset AusBeerDataset:
nb features: 1
training set size: 189
training time: 0.8976242542266846
test set size: 22
testing time: 0.004706382751464844
RMSE: 4.350713610378453
MAE: 15.369339480209591
Dataset 3:
nb features: 1
training set size: 900
training time: 3.46706223487854
test set size: 900
testing time: 0.030436277389526367
RMSE: 6.606337568439427
MAE: 5.044218953626089


Model BATS:
Supported univariate datasets: ✓
Supported multivariate datasets: unknown
Dataset 1:
nb features: 1
training set size: 128
training time: 7.672419548034668
test set size: 16
testing time: 0.0022764205932617188
RMSE: 8.321211346269285
MAE: 33.570078021466806
Dataset AusBeerDataset:
nb features: 1
training set size: 189
training time: 

You can also get the results by calling get_report_dataframes(). It returns a dictionary with the model names as keys and dataframes as values. In each dataframe, the columns are the measures (testing time, metrics, etc.) and the index holds the dataset names.

In [13]:
dfs = benchmark.get_report_dataframes()
# Iterate over (model name, dataframe) pairs
for model_name, df in dfs.items():
    print(f'{model_name}:')
    print(df)
    print("")

ARIMA:
               nb features training set size training time test set size  \
1                      1.0             128.0      0.770854          16.0   
AusBeerDataset         1.0             189.0      0.897624          22.0   
3                      1.0             900.0      3.467062         900.0   

               testing time      RMSE        MAE  
1                  0.004489  3.682338  13.397486  
AusBeerDataset     0.004706  4.350714  15.369339  
3                  0.030436  6.606338   5.044219  

BATS:
               nb features training set size training time test set size  \
1                      1.0             128.0       7.67242          16.0   
AusBeerDataset         1.0             189.0     12.877171          22.0   
3                      1.0             900.0     16.338428         900.0   

               testing time      RMSE        MAE  
1                  0.002276  8.321211  33.570078  
AusBeerDataset     0.002261  4.378446  13.398303  
3                  
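
Once the results are available per model, a common next step is to pick the best model for each dataset. The sketch below does this with plain dictionaries, using the MAE values from the run output above. It assumes the MAE column of each dataframe has first been converted to a dict (for instance with pandas' `Series.to_dict()`); `best_model_per_dataset` is a hypothetical helper.

```python
# MAE per model and dataset, copied from the run output above
mae_by_model = {
    "ARIMA": {
        "1": 13.397485829577487,
        "AusBeerDataset": 15.369339480209591,
        "3": 5.044218953626089,
    },
    "BATS": {
        "1": 33.570078021466806,
        "AusBeerDataset": 13.398303193058819,
        "3": 4.963032869193333,
    },
}

def best_model_per_dataset(scores):
    """Return {dataset: model} with the lowest score for each dataset."""
    best = {}
    for model, per_dataset in scores.items():
        for dataset, value in per_dataset.items():
            if dataset not in best or value < best[dataset][1]:
                best[dataset] = (model, value)
    return {dataset: model for dataset, (model, _) in best.items()}

winners = best_model_per_dataset(mae_by_model)
```

Lower is better for MAE, so the comparison uses `<`; a metric where higher is better would need the opposite comparison.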