Confusing documentation regarding Independent multi-time series forecasting and exogenous features #531

Closed
valentin-fngr opened this issue Sep 2, 2023 · 7 comments
Labels
documentation Improvements or additions to documentation question Further information is requested

Comments

@valentin-fngr

Hi,

In your user guide "Independent multi-time series forecasting", you provide a very good tutorial on how to learn from multiple independent time series, but you do not give any indication of how to supply exogenous features.

There is a drawing that confuses me a lot, as I do not know whether I am supposed to process my dataframe that way or whether you are just showing what happens under the hood:

image

I have the following columns :

id : the id of my time series
X1 - X800 : a list of 800 features
y : my target

What dataframe format does backtesting_forecaster_multiseries expect?

The given dataset used in the example and the drawing above do not align.

Thank you very much

@valentin-fngr
Author

After reading the source code, I figured out that the drawing represents what happens under the hood.
Nonetheless, I find it confusing, and it does not explain how to prepare the data when combining multiple series with exogenous features.

I still do not know how to prepare my data in that case.

@JoaquinAmatRodrigo
Owner

Hi @valentin-fngr,
Thanks for using skforecast.
You are right, the graph represents the transformation applied internally by the forecaster.
The current implementation does not allow a different exogenous variable for each time series; exogenous variables are treated as global and shared by all series.

We will try to improve the documentation in the next release for better clarity.

@JavierEscobarOrtiz
Collaborator

JavierEscobarOrtiz commented Sep 3, 2023

Hello @valentin-fngr

Thanks for opening the issue. If I understand you correctly, you want to use a different exogenous variable for each series. Could you give us some information about your use case? We have this feature in the backlog, but the only use case that comes to mind is, for example, modeling one time series per country where each country needs its own vacation indicator.

As for the documentation, yes, you are right, this is what happens under the hood. Please see the updated user guide at the following link, where I have added a new section on exogenous variables:

https://skforecast.org/latest/user_guides/independent-multi-time-series-forecasting

It is important to know that the dataframe that ForecasterAutoregMultiSeries expects looks like this:

image

One column for each target series (item_1, item_2 and item_3), plus additional columns for the exogenous variables.
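As a rough illustration of that layout, here is a minimal pandas sketch that builds such a dataframe from synthetic data. The series names follow the description above; the exogenous column names (month, day_of_week), the daily frequency and the random values are assumptions made only for this example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Common datetime index shared by the series and the exogenous variables
index = pd.date_range("2023-01-01", periods=10, freq="D")

# One column per target series
series = pd.DataFrame(
    rng.normal(size=(10, 3)),
    index=index,
    columns=["item_1", "item_2", "item_3"],
)

# Exogenous variables (hypothetical names): one value per timestamp,
# shared by every series rather than specific to one of them
exog = pd.DataFrame(
    {"month": index.month, "day_of_week": index.dayofweek},
    index=index,
)

# Wide layout: series columns first, exogenous columns after
data = pd.concat([series, exog], axis=1)
print(data.head())
```

From a dataframe like this, the series columns and the exogenous columns would presumably be selected separately and passed as the series and exog arguments.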

Please, let us know your thoughts on this 😄

@valentin-fngr
Author

Hi @JavierEscobarOrtiz,

Thanks for your reply.
Here is my use case :

I have 10 different stocks (my time series). All of them have the same 100 features, X1 to X100, but with different values.
For example:

image

Here, HUB is the time series id. Imagine there are rows like these for several different values of HUB.

I thought about implementing the feature and submitting a pull request. I started, but I ran out of time and could not put together a clean pull request with all the associated tests.

Regarding your modification, I think it is much better and clears up any confusion. I would really suggest building that feature, as a lot of people might benefit from it.

Valentin

@valentin-fngr
Author

Do you have any workaround for my specific problem?
I would like to keep using your library for its simplicity.

@JoaquinAmatRodrigo
Owner

Hi @valentin-fngr,
Unfortunately, we do not currently have a solution for incorporating different exogenous variables for each series in the multi-series forecasting models. However, this feature is currently on our radar and is part of our backlog for upcoming feature releases.

We are always open to new ideas. If you have any suggestions, we would be happy to discuss them with you.

@valentin-fngr
Author

So, I came up with a workaround. It is not exactly what I want, but it yields interesting results, so I might as well share it.

I create pivot tables for both the targets and the features, for use with the multi-series approach:

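# Assumes _df (long format: 'timestamp', 'hub', the feature columns and 'y'),
# forecaster, lags_grid and param_grid are already defined.
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error
# Import path as of skforecast 0.10; it may differ in other versions.
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries

# features: one column per (feature, hub) pair
feature_cols = list(_df.columns[:140])  # the first 140 columns are taken to be the features
_df_features = _df.pivot_table(index='timestamp', columns='hub', values=feature_cols)
_df_features.columns = [f'{col[0]}_{col[1]}' for col in _df_features.columns]
_df_features = _df_features.sort_index()
_df_features = _df_features.asfreq('H')

# targets: one column per hub
_df_series = pd.pivot_table(
    data=_df,
    values="y",
    index="timestamp",
    columns=["hub"]
)
_df_series.columns.name = None
_df_series.columns = [f"hub_{col}" for col in _df_series.columns]
_df_series = _df_series.sort_index()
_df_series = _df_series.asfreq('H')

train_end = "2022-03-31 11:00:00"

results = grid_search_forecaster_multiseries(
              forecaster         = forecaster,
              series             = _df_series,  # pivoted targets, not the long-format _df
              exog               = _df_features,
              levels             = None,
              lags_grid          = lags_grid,
              param_grid         = param_grid,
              steps              = 24,
              gap                = 12,
              metric             = [mean_absolute_error, mean_squared_error],
              initial_train_size = len(_df_series.loc[:train_end]),
              refit              = True,
              fixed_train_size   = True,
              return_best        = False,
              n_jobs             = 'auto',
              verbose            = False,
              show_progress      = True
          )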

So basically, for each timestamp, this stacks the features of every time series side by side.
This is the best I can do, and it is not too bad. But you can imagine that with 800 features for 10 time series, I end up with 8000 exogenous features, which, surprisingly, helps my Ridge model, haha.
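To make the reshaping concrete, here is a small self-contained sketch of the same pivot on toy data. The values, the two hubs (A and B), the daily frequency and the names long_df, exog and series are made up for illustration; only the column roles (timestamp, hub, features, y) follow the snippet above.

```python
import pandas as pd

# Toy long-format data: 2 hubs x 3 timestamps, 2 features and a target
timestamps = pd.date_range("2022-01-01", periods=3, freq="D")
long_df = pd.DataFrame({
    "timestamp": list(timestamps) * 2,
    "hub": ["A"] * 3 + ["B"] * 3,
    "X1": [1.0, 2.0, 3.0, 10.0, 20.0, 30.0],
    "X2": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6],
    "y": [5.0, 6.0, 7.0, 50.0, 60.0, 70.0],
})
feature_cols = ["X1", "X2"]

# Features: one column per (feature, hub) pair, flattened to "<feature>_<hub>"
exog = long_df.pivot_table(index="timestamp", columns="hub", values=feature_cols)
exog.columns = [f"{feature}_{hub}" for feature, hub in exog.columns]

# Targets: one column per hub
series = long_df.pivot_table(index="timestamp", columns="hub", values="y")
series.columns = [f"hub_{hub}" for hub in series.columns]

# The number of exogenous columns is n_features * n_hubs
print(exog.shape, series.shape)
```

With 2 features and 2 hubs the wide exogenous table has 4 columns; the same multiplication is what turns 800 features and 10 series into 8000 columns.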

@JavierEscobarOrtiz JavierEscobarOrtiz added documentation Improvements or additions to documentation question Further information is requested labels Sep 13, 2023