# Module - Benchmarking
Ontime provides a Benchmark class that can be used to run a number of prediction models on a number of datasets.

In [24]:
from ontime.module.benchmarking import AbstractBenchmarkModel, BenchmarkMode, BenchmarkDataset, BenchmarkMetric, Benchmark
import ontime as on

## Initialization
A Benchmark instance can be initialized with lists of datasets, models and metrics. When run() is invoked, it trains (if needed) and tests every model on every dataset, and computes every metric on the predicted data.


### Preparing datasets

Datasets submitted to a Benchmark must be of type TimeSeries, wrapped in a BenchmarkDataset. BenchmarkDataset lets you name a dataset, define its training and test splits, and specify how the data is split to perform a rolling evaluation.

In [25]:
from ontime.module.datasets.dataset import Dataset
from darts.utils.missing_values import fill_missing_values # for filling missing values in the time series (for models that don't handle missing values)

datasets = [
    BenchmarkDataset(on.TimeSeries.from_darts(fill_missing_values(Dataset.TemperatureDataset.load())), input_length=96, gap=0, stride=96, horizon=24, name="Daily temperature"),
    BenchmarkDataset(on.TimeSeries.from_darts(fill_missing_values(Dataset.MonthlyMilkDataset.load())), input_length=24, gap=0, stride=24, horizon=2, name="Monthly milk production", train_proportion=0.8),
]
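
To make the windowing parameters concrete, here is a minimal sketch in plain Python. It assumes (this is an interpretation, not confirmed against Ontime's implementation) that each window takes `input_length` points as input, skips `gap` points, targets the next `horizon` points, and that successive windows start `stride` points apart. The helper `rolling_windows` and the series length of 60 are hypothetical.

```python
from typing import List, Tuple


def rolling_windows(n_points: int, input_length: int, gap: int,
                    stride: int, horizon: int) -> List[Tuple[range, range]]:
    """Return (input_indices, target_indices) pairs for a rolling evaluation.

    Hypothetical helper: it illustrates one plausible reading of the
    parameters, not Ontime's actual splitting code.
    """
    windows = []
    start = 0
    # A window fits only if input, gap and horizon all lie inside the series.
    while start + input_length + gap + horizon <= n_points:
        input_idx = range(start, start + input_length)
        target_start = start + input_length + gap
        target_idx = range(target_start, target_start + horizon)
        windows.append((input_idx, target_idx))
        start += stride
    return windows


# Settings of the "Monthly milk production" dataset, applied to a
# hypothetical series of 60 points.
windows = rolling_windows(n_points=60, input_length=24, gap=0, stride=24, horizon=2)
for input_idx, target_idx in windows:
    print(f"input [{input_idx.start}, {input_idx.stop}) -> "
          f"target [{target_idx.start}, {target_idx.stop})")
```

Under this reading, setting `stride` equal to `input_length` gives non-overlapping input windows, while a smaller `stride` yields more, overlapping evaluations.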

### Preparing models  

Models are wrapped according to the AbstractBenchmarkModel interface. Wrappers implementing this interface will instantiate the model for each dataset.  
Ontime provides some wrappers around darts models. For models without a provided wrapper, a custom wrapper implementing the AbstractBenchmarkModel interface can be written.  
A mode should be provided to the wrapper constructor, which defines how the model is evaluated. The mode used in this notebook is:
- 'ZERO_SHOT': the model is not trained, and the evaluation is done on the test set. It is used for models that already have trained weights, available through checkpoints, or for some models from darts, where predictions are made directly using the fitted data as input.

In [26]:
from ontime.core.time_series.time_series import TimeSeries
from ontime.module.benchmarking.darts import SimpleDartsBenchmarkModel
from typing import Any
from darts.models import AutoARIMA, ExponentialSmoothing

models = [
    SimpleDartsBenchmarkModel("AutoARIMA", model=AutoARIMA(start_p=8, max_p=12, start_q=1), mode=BenchmarkMode.ZERO_SHOT),
    SimpleDartsBenchmarkModel("ExponentialSmoothing", model=ExponentialSmoothing(), mode=BenchmarkMode.ZERO_SHOT),
]

### Preparing metrics
Each metric function must be passed to the BenchmarkMetric constructor. If BenchmarkMetric's implementation cannot invoke the function as is, a child class can be written and submitted instead.

In [27]:
import darts

metrics = [
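    # NOTE: this entry is labeled "RMSE" but calls darts' coefficient_of_variation
    # (RMSE as a percentage of the mean of the actual values), so the "RMSE"
    # figures reported below are percentages, not raw RMSE.
    BenchmarkMetric(name="RMSE", metric_function=darts.metrics.metrics.coefficient_of_variation),
    BenchmarkMetric(name="MAE", metric_function=darts.metrics.metrics.mae),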
]
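
For reference, the quantities involved can be sketched in plain Python, independently of darts. The sketch assumes the coefficient of variation is defined as 100 · RMSE / mean(actual), which matches darts' documented definition as far as I know; the function names and the toy values are illustrative only.

```python
import math
from typing import Sequence


def mae(actual: Sequence[float], pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, pred)) / len(actual)


def rmse(actual: Sequence[float], pred: Sequence[float]) -> float:
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, pred)) / len(actual))


def coefficient_of_variation(actual: Sequence[float], pred: Sequence[float]) -> float:
    """RMSE expressed as a percentage of the mean of the actual values."""
    mean_actual = sum(actual) / len(actual)
    return 100.0 * rmse(actual, pred) / mean_actual


# Toy values, not taken from the benchmark datasets.
actual = [10.0, 12.0, 14.0, 16.0]
pred = [11.0, 11.0, 15.0, 15.0]

print("MAE: ", mae(actual, pred))
print("RMSE:", rmse(actual, pred))
print("CV:  ", coefficient_of_variation(actual, pred))
```

Because RMSE is never smaller than MAE on the same data, a reported "RMSE" below the MAE (as for the milk dataset in the run output) is a sign that the value is a percentage rather than a raw error.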

## Creating and running a Benchmark

In [28]:
benchmark = Benchmark(datasets=datasets,
                      models=models, 
                      metrics=metrics)

Datasets, models and metrics can also be added after instantiation. Here, a named dataset is added to the existing benchmark.

In [29]:
benchmark.add_dataset(BenchmarkDataset(Dataset.ETTh1Dataset.load(), input_length=1000, gap=0, stride=96, horizon=96, name="ETTh1"))

Once the models and datasets have been added, the run() method trains (where needed) and evaluates each model on each dataset individually, then computes the metrics. The verbose parameter prints the status and results of the process as it progresses, and the debug parameter prints error messages (warnings are always printed).

In [30]:
benchmark.run(verbose=True)

Starting evaluation...
Evaluation for model AutoARIMA
on dataset Daily temperature
testing... done, took 36.21504044532776
getting predictions... Computed metrics: {'RMSE': 26.641847048151092, 'MAE': 2.3869256792441744}
on dataset Monthly milk production
testing... done, took 2.63100528717041
getting predictions... Computed metrics: {'RMSE': 6.546880336744911, 'MAE': 42.87108187419369}
on dataset ETTh1
testing... 

darts.models.forecasting.forecasting_model ERROR ValueError: Model `AutoARIMA` only supports univariate TimeSeries instances


Couldn't complete evaluation.
Evaluation for model ExponentialSmoothing
on dataset Daily temperature
testing... done, took 1.1827118396759033
getting predictions... Computed metrics: {'RMSE': 23.865682024876946, 'MAE': 2.1524183769911573}
on dataset Monthly milk production
testing... done, took 0.08522582054138184
getting predictions... Computed metrics: {'RMSE': 0.26974472413569617, 'MAE': 2.2499863752185547}
on dataset ETTh1
testing... 

darts.models.forecasting.forecasting_model ERROR ValueError: Model `ExponentialSmoothing` only supports univariate TimeSeries instances


Couldn't complete evaluation.
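
The log above shows that a failure on one dataset does not abort the run: the benchmark moves on to the next model and dataset pair. The following is a minimal sketch of such a loop, not Ontime's actual implementation; `run_all`, `Evaluator` and `univariate_only` are hypothetical names, and catching only `ValueError` is an assumption made for the example.

```python
import time
from typing import Callable, Dict, List, Optional

# Hypothetical stand-in: an evaluator takes a dataset name and returns a score.
Evaluator = Callable[[str], float]


def run_all(models: Dict[str, Evaluator],
            datasets: List[str]) -> Dict[str, Dict[str, Optional[dict]]]:
    """Evaluate every model on every dataset, recording failures as None."""
    results: Dict[str, Dict[str, Optional[dict]]] = {}
    for model_name, evaluate in models.items():
        results[model_name] = {}
        for dataset_name in datasets:
            start = time.perf_counter()
            try:
                score = evaluate(dataset_name)
            except ValueError as err:
                # Record the failure and move on to the next dataset.
                print(f"{model_name} on {dataset_name}: "
                      f"couldn't complete evaluation ({err})")
                results[model_name][dataset_name] = None
                continue
            elapsed = time.perf_counter() - start
            results[model_name][dataset_name] = {"score": score, "testing time": elapsed}
    return results


def univariate_only(dataset_name: str) -> float:
    """Toy evaluator that rejects a dataset named 'multivariate'."""
    if dataset_name == "multivariate":
        raise ValueError("only supports univariate series")
    return 1.0


results = run_all({"toy model": univariate_only}, ["univariate", "multivariate"])
```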


## Visualizing results

The benchmark automatically stores measures and metrics computed during the run, available through class attributes.

### Measures and metrics
To view the results, call get_report() and print the returned value.

In [31]:
print(benchmark.get_report())

Model AutoARIMA:
Supported univariate datasets: ✓
Supported multivariate datasets: X
Dataset Daily temperature:
nb features: 1
target column: ['Daily minimum temperatures']
training set size: 2555
training time: 0
test set size: 1097
testing time: 36.21504044532776
metrics: {'RMSE': 26.641847048151092, 'MAE': 2.3869256792441744}
Dataset Monthly milk production:
nb features: 1
target column: ['Pounds per cow']
training set size: 133
training time: 0
test set size: 35
testing time: 2.63100528717041
metrics: {'RMSE': 6.546880336744911, 'MAE': 42.87108187419369}
Dataset ETTh1:
couldn't complete training on ETTh1


Model ExponentialSmoothing:
Supported univariate datasets: ✓
Supported multivariate datasets: X
Dataset Daily temperature:
nb features: 1
target column: ['Daily minimum temperatures']
training set size: 2555
training time: 0
test set size: 1097
testing time: 1.1827118396759033
metrics: {'RMSE': 23.865682024876946, 'MAE': 2.1524183769911573}
Dataset Monthly milk production:
nb fea

You can also get the results as dataframes by calling get_report_df(). The results are then returned as a dataframe with model names as columns, dataset names as main rows, and measures as sub-rows.

In [32]:
df_1, df_2 = benchmark.get_report_df()
df_1

| Statistic | AutoARIMA | ExponentialSmoothing |
|---|---|---|
| supports univariate | ✓ | ✓ |
| supports multivariate | X | X |


In [33]:
df_2

| Dataset | Metric | AutoARIMA | ExponentialSmoothing |
|---|---|---|---|
| Daily temperature | training time | 0.0 | 0.0 |
| Daily temperature | testing time | 36.21504 | 1.182712 |
| Daily temperature | RMSE | 26.641847 | 23.865682 |
| Daily temperature | MAE | 2.386926 | 2.152418 |
| Monthly milk production | training time | 0.0 | 0.0 |
| Monthly milk production | testing time | 2.631005 | 0.085226 |
| Monthly milk production | RMSE | 6.54688 | 0.269745 |
| Monthly milk production | MAE | 42.871082 | 2.249986 |
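
A frame with this layout (a two-level row index of dataset and metric, one column per model) can be queried with standard pandas tools. The sketch below builds a stand-in frame by hand from the MAE values shown above rather than calling get_report_df(), so it only assumes the displayed layout; the variable names are illustrative.

```python
import pandas as pd

# MAE values copied from the results shown above.
report = {
    ("Daily temperature", "MAE"): {"AutoARIMA": 2.386926, "ExponentialSmoothing": 2.152418},
    ("Monthly milk production", "MAE"): {"AutoARIMA": 42.871082, "ExponentialSmoothing": 2.249986},
}

# Rows are (dataset, metric) pairs; columns are model names.
index = pd.MultiIndex.from_tuples(list(report.keys()), names=["Dataset", "Metric"])
df = pd.DataFrame(list(report.values()), index=index)

# Select one metric across all datasets, then find the best model per dataset.
mae_by_dataset = df.xs("MAE", level="Metric")
best_model = mae_by_dataset.idxmin(axis=1)
print(best_model)
```

Using `xs` on the `Metric` level keeps the comparison restricted to one metric at a time, which matters here because the rows mix timings and error measures with different units.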


### Plotting

By default (see the `nb_predictions` argument of the `benchmark.run()` method), the benchmark generates a prediction for one random input sample of each dataset with each model. The predictions, along with the input and target series, are stored in a dictionary and can be retrieved by calling `benchmark.get_predictions()`. The predictions can be plotted using the `plot_predictions` method.

In [34]:
predictions = benchmark.get_predictions()

In [35]:
# currently, Ontime plotting module needs the time index to be named 'time'
def rename_index(ts, name='time'):
    df = ts.pd_dataframe()
    df.rename_axis(name, inplace=True)
    return TimeSeries.from_dataframe(df)

In [36]:
input = rename_index(predictions['inputs']['Daily temperature'][0]).rename({'Daily minimum temperatures': 'input'})
target = rename_index(predictions['targets']['Daily temperature'][0]).rename({'Daily minimum temperatures': 'target'})
prediction = rename_index(predictions['predictions']['AutoARIMA']['Daily temperature'][0]).rename({'Daily minimum temperatures': 'prediction'})

In [37]:
(on.Plot()
    .add(on.marks.line, input)
    .add(on.marks.line, target)
    .add(on.marks.line, prediction, type='dashed')
    .properties(width=600, height=200)
    .show()
)

In [38]:
prediction = rename_index(predictions['predictions']['ExponentialSmoothing']['Daily temperature'][0]).rename({'Daily minimum temperatures': 'prediction'})

In [39]:
(on.Plot()
    .add(on.marks.line, input)
    .add(on.marks.line, target)
    .add(on.marks.line, prediction, type='dashed')
    .properties(width=600, height=200)
    .show()
)