[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ourownstory/neural_prophet/blob/master/tutorials/UnderstandeTheBenchmarkingPipeline.ipynb)

# How to create a custom evaluation pipeline
This tutorial takes you behind the scenes of the benchmark template and guides you through creating your own custom
evaluation pipeline by explaining each processing step.


```python
if 'google.colab' in str(get_ipython()):
    !pip install git+https://github.com/ourownstory/test-of-time.git # may take a while
    #!pip install test-of-time # much faster, but may not have the latest upgrades/bugfixes

import pandas as pd
from neuralprophet import set_log_level, set_random_seed
from tot.df_utils import (
    _check_min_df_len,
    prep_or_copy_df,
    check_dataframe,
    handle_missing_data,
    split_df,
    return_df_in_original_format,
    maybe_drop_added_dates,
)
from tot.exp_utils import evaluate_forecast
from tot.models_neuralprophet import NeuralProphetModel
```

```python
# only show error messages from NeuralProphet
set_log_level("ERROR")
```

The benchmark templates offered by test-of-time are a quick and simple way to compare multiple models and datasets.
Defining and running a benchmark triggers a pipeline that returns the benchmark results. The benchmark
is subdivided into multiple experiments, which are executed in the pipeline. Every experiment run follows the
same evaluation steps. Let's have a closer look at the individual steps.
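As a mental model, a benchmark can be pictured as a grid of experiments, one per dataset and model combination, each running the same ordered steps. The sketch below is hypothetical: `Experiment`, `EVALUATION_STEPS`, `build_experiments` and `run_experiment` are illustrative names, not part of the test-of-time API, the cross-product structure is an assumption, and the steps are only logged rather than executed.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels for the eight evaluation steps listed in this tutorial
EVALUATION_STEPS = [
    "data-specific pre-processing",
    "model definition",
    "model-specific pre-processing",
    "fit",
    "predict",
    "model-specific post-processing",
    "data-specific post-processing",
    "evaluation",
]


@dataclass(frozen=True)
class Experiment:
    """One (dataset, model) pairing within a benchmark (hypothetical structure)."""

    dataset_name: str
    model_name: str


def build_experiments(dataset_names, model_names):
    # Assumption: a benchmark is the cross product of its datasets and models
    return [Experiment(d, m) for d, m in product(dataset_names, model_names)]


def run_experiment(experiment):
    # Every experiment walks through the same ordered steps; here they are only logged
    return [f"{experiment.dataset_name}/{experiment.model_name}: {step}" for step in EVALUATION_STEPS]


experiments = build_experiments(["air_passengers", "dataset_b"], ["model_a", "model_b"])
for experiment in experiments:
    print(experiment, "->", len(run_experiment(experiment)), "steps")
```

Keeping the step sequence identical for every experiment is what makes the results of different models and datasets comparable.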

### Evaluation processing steps
1. Data-specific pre-processing:
    Remark: This processing is specific to the data and independent of the model.
    - Prepares the dataframe to have an ID column
    - Performs a basic sanity check
    - Handles missing data
    - Splits the data into train and test datasets

2. Model definition
    - Sets the random seed to ensure reproducibility
    - Defines the model parameters
    - Instantiates the model

3. Model-specific data pre-processing:
    - Checks that the train and test data are long enough for the model configuration and adjusts them so that each model can be fitted and can predict correctly

4. Fit model:
    - Calls the fit method on the instantiated model object

5. Predict model:
    - Calls the predict method to create the forecast

6. Model-specific data post-processing:
    - Reverts the model-specific adjustments so that the data is returned in a consistent format across models

7. Data-specific post-processing:
    - Removes the ID column if it was added during pre-processing
    - Drops any added dates

8. Evaluation:
    - Evaluates the forecasts based on selected error metrics

### Load data
Let's load the air passengers dataset as an example and walk through the pipeline step by step.

```python
data_location = "https://raw.githubusercontent.com/ourownstory/neuralprophet-data/main/datasets/"
df_air = pd.read_csv(data_location + 'air_passengers.csv')
```

### 1. Data-specific pre-processing

```python
# prep_or_copy_df() ensures that the df has an "ID" column to be usable in the further process
df_air, received_ID_col, received_single_time_series, _ = prep_or_copy_df(df_air)
# check_dataframe() performs a basic sanity check on the data
df_air = check_dataframe(df_air, check_y=True)
# handle_missing_data() imputes missing data at the given frequency ('MS' = month start)
df_air = handle_missing_data(df_air, freq='MS')
# split_df() splits the data into train and test data
df_air_train, df_air_test = split_df(
    df=df_air,
    test_percentage=0.40,
)
```
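The two data-dependent operations of this step, filling gaps at a fixed frequency and splitting into train and test data, can be sketched in plain pandas. This is a simplified illustration, not the test-of-time implementation: it assumes that missing dates are reinserted by reindexing onto a complete `'MS'` (month start) date range, that the gaps are imputed by linear interpolation (one possible strategy), and that the split is purely chronological. `fill_missing_months` and `chronological_split` are hypothetical helpers.

```python
import pandas as pd


def fill_missing_months(df):
    """Reindex onto a complete month-start ('MS') range and interpolate y (hypothetical helper)."""
    full_range = pd.date_range(df["ds"].min(), df["ds"].max(), freq="MS")
    filled = df.set_index("ds").reindex(full_range)
    filled["y"] = filled["y"].interpolate(method="linear")
    return filled.rename_axis("ds").reset_index()


def chronological_split(df, test_percentage):
    """Keep the oldest rows for training and the most recent ones for testing."""
    n_test = int(len(df) * test_percentage)
    return df.iloc[: len(df) - n_test].copy(), df.iloc[len(df) - n_test :].copy()


# Toy monthly series with March 2020 missing
toy = pd.DataFrame({
    "ds": pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01", "2020-05-01", "2020-06-01"]),
    "y": [10.0, 12.0, 16.0, 18.0, 20.0],
})
toy_filled = fill_missing_months(toy)  # March is reinserted with y = 14.0
train, test = chronological_split(toy_filled, test_percentage=0.40)
print(len(train), len(test))  # 4 2
```

Splitting by position rather than at random keeps every test date after every training date, which is what a forecasting evaluation needs.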

### 2. Model definition

```python
# fix the random seed so that repeated runs give the same results
set_random_seed(42)
model_class = NeuralProphetModel
params = {
    "n_forecasts": 3,
    "n_lags": 7,
    "seasonality_mode": "multiplicative",
    "learning_rate": 0.03,
    "_data_params": {},
}
model = model_class(params=params)
```
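`set_random_seed(42)` makes training reproducible by fixing the seed of the random number generators used during training, for example for weight initialization. The underlying principle can be illustrated with Python's standard `random` module alone: re-seeding a generator with the same value replays the same sequence of draws. The helper `draw_initial_weights` is hypothetical and only stands in for what happens inside the model.

```python
import random


def draw_initial_weights(seed, n=3):
    """Draw n pseudo-random 'weights' from a generator seeded with `seed` (illustrative only)."""
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(n)]


first_run = draw_initial_weights(seed=42)
second_run = draw_initial_weights(seed=42)
other_seed = draw_initial_weights(seed=7)

print(first_run == second_run)  # True: same seed, same draws
print(first_run == other_seed)  # False: a different seed yields different draws
```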

### 3. Model-specific data pre-processing

```python
# check that the train and test dfs contain enough samples for this model configuration
_check_min_df_len(df=df_air_train, min_len=model.n_forecasts + model.n_lags)
_check_min_df_len(df=df_air_test, min_len=model.n_forecasts)
# extend the test df with historic values from the train df
df_air_test = model.maybe_extend_df(df_air_train, df_air_test)
```
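Why `n_forecasts + n_lags`? With autoregression enabled, every training sample needs `n_lags` past values as input and `n_forecasts` future values as targets, so a shorter training set cannot yield a single sample. The test set, in turn, lacks the history needed for its first forecasts, which is why it is extended with values from the end of the training set. The sketch below assumes that the extension prepends exactly the last `n_lags` training rows; `extend_test_with_history` is a hypothetical helper, not the actual `maybe_extend_df` code.

```python
import pandas as pd


def extend_test_with_history(df_train, df_test, n_lags):
    """Prepend the last n_lags rows of the train df so the first test rows have lagged inputs."""
    if n_lags == 0:
        return df_test.copy()
    history = df_train.tail(n_lags)
    return pd.concat([history, df_test], ignore_index=True)


n_lags, n_forecasts = 7, 3
min_train_len = n_lags + n_forecasts  # 10 rows are needed for one training sample

ds = pd.date_range("2020-01-01", periods=20, freq="MS")
df = pd.DataFrame({"ds": ds, "y": range(20)})
df_train, df_test = df.iloc[:12], df.iloc[12:]

assert len(df_train) >= min_train_len, "train df too short for a single sample"
df_test_extended = extend_test_with_history(df_train, df_test, n_lags)
print(len(df_test), len(df_test_extended))  # 8 15
```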

### 4. Fit model

```python
# model.model is the wrapped NeuralProphet object; fit it on the training data
model.model.fit(df=df_air_train, freq='MS', progress="none", minimal=True)
```
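Fitting an autoregressive model means learning from samples cut out of the training series with a sliding window: `n_lags` consecutive values as input and the next `n_forecasts` values as targets. The following plain-Python sketch shows only that windowing idea; NeuralProphet builds its samples internally (together with its trend and seasonality inputs), and `make_windows` is a hypothetical name.

```python
def make_windows(series, n_lags, n_forecasts):
    """Return (inputs, targets) pairs: n_lags past values -> next n_forecasts values."""
    samples = []
    for start in range(len(series) - n_lags - n_forecasts + 1):
        inputs = series[start : start + n_lags]
        targets = series[start + n_lags : start + n_lags + n_forecasts]
        samples.append((inputs, targets))
    return samples


series = list(range(12))
windows = make_windows(series, n_lags=7, n_forecasts=3)
print(len(windows))  # 3
print(windows[0])    # ([0, 1, 2, 3, 4, 5, 6], [7, 8, 9])
```

A series of exactly `n_lags + n_forecasts` values yields exactly one window, which matches the minimum training length checked in step 3.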

### 5. Predict model

```python
# the model's own predict function returns the forecasts as a df
fcst_train = model.model.predict(df=df_air_train)
fcst_test = model.model.predict(df=df_air_test)
```

### 6. Model-specific data post-processing

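```python
# maybe_drop_added_values_from_df() is a method of the model object, so the post-processing
# follows the model configuration: it removes the values added to the test df in step 3
fcst_test, df_air_test = model.maybe_drop_added_values_from_df(fcst_test, df_air_test)
```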
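Conceptually, this step undoes the test-set extension from step 3: the prepended historic rows only served as lagged inputs and should not be scored. Assuming the first `n_lags` rows are the added ones, the clean-up amounts to slicing them off both the forecast and the test df. `drop_prepended_rows` is a hypothetical helper, not the code behind `maybe_drop_added_values_from_df`, and the toy forecast values are made up.

```python
import pandas as pd


def drop_prepended_rows(fcst, df, n_added):
    """Remove the first n_added rows from forecast and test df and reset their indices."""
    return (
        fcst.iloc[n_added:].reset_index(drop=True),
        df.iloc[n_added:].reset_index(drop=True),
    )


n_lags = 7
ds = pd.date_range("2020-01-01", periods=10, freq="MS")
df_test_extended = pd.DataFrame({"ds": ds, "y": range(10)})
# toy forecast: the first n_lags rows carry no prediction because they lack lagged inputs
fcst_extended = df_test_extended.assign(yhat1=[float("nan")] * n_lags + [7.5, 8.5, 9.5])

fcst_clean, df_test_clean = drop_prepended_rows(fcst_extended, df_test_extended, n_added=n_lags)
print(len(fcst_clean), len(df_test_clean))  # 3 3
```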

### 7. Data-specific post-processing

```python
# in case an 'ID' column was previously added, return_df_in_original_format() removes it again
fcst_train_df = return_df_in_original_format(fcst_train, received_ID_col, received_single_time_series)
fcst_test_df = return_df_in_original_format(fcst_test, received_ID_col, received_single_time_series)
# in case missing dates were added during imputation, maybe_drop_added_dates() removes them again
fcst_train_df, df_air_train = maybe_drop_added_dates(fcst_train_df, df_air_train)
fcst_test_df, df_air_test = maybe_drop_added_dates(fcst_test_df, df_air_test)
```
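Both clean-up operations boil down to simple pandas filters. The sketch below assumes that the auxiliary column is named `"ID"` and that dates added during imputation can be recognized by comparing against the dates present in the raw data; it also ignores the single versus multiple time series distinction. `restore_original_format` and `drop_imputed_dates` are hypothetical helpers, not the test-of-time implementations.

```python
import pandas as pd


def restore_original_format(df, received_id_col):
    """Drop the auxiliary 'ID' column if the input did not come with one."""
    if not received_id_col and "ID" in df.columns:
        return df.drop(columns="ID")
    return df


def drop_imputed_dates(fcst, original_dates):
    """Keep only rows whose date was present in the raw data."""
    return fcst[fcst["ds"].isin(original_dates)].reset_index(drop=True)


original_dates = pd.to_datetime(["2020-01-01", "2020-02-01", "2020-04-01"])
fcst = pd.DataFrame({
    "ds": pd.date_range("2020-01-01", periods=4, freq="MS"),  # includes imputed 2020-03-01
    "y": [10.0, 12.0, 14.0, 16.0],
    "yhat1": [10.5, 11.5, 14.2, 15.8],
    "ID": "series_a",
})
fcst = restore_original_format(fcst, received_id_col=False)
fcst = drop_imputed_dates(fcst, original_dates)
print(fcst)
```

Removing imputed dates before evaluation ensures that the metrics are computed only against values that were actually observed.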

### 8. Evaluation

```python
# evaluate_forecast() computes the selected error metrics
result_train, result_test = evaluate_forecast(
    fcst_train_df, fcst_test_df, metrics=['MAPE', 'MAE', 'RMSE'], metadata=None
)
print(result_test)
```

{'MAPE': 8.895154, 'MAE': 37.84229, 'RMSE': 46.43487}
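For reference, the three metrics can be written out in a few lines of plain Python. This sketch uses the textbook definitions, with MAPE expressed in percent (which the magnitude of the printed MAPE suggests); test-of-time's own implementation may differ in details such as the handling of missing values or zero actuals.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def mape(actual, predicted):
    """Mean absolute percentage error in percent; assumes no actual value is zero."""
    return 100.0 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


actual = [100.0, 200.0, 400.0]
predicted = [110.0, 190.0, 380.0]
print(mae(actual, predicted))   # ≈ 13.33
print(rmse(actual, predicted))  # ≈ 14.14
print(mape(actual, predicted))  # ≈ 6.67
```

RMSE weights the single 20-unit error more heavily than MAE does, while MAPE scales each error by the size of the actual value.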
