Seasonality changepoint detection does not seem to work with cross-validation for Silverkite #55

Closed
julioasotodv opened this issue Oct 14, 2021 · 6 comments

Comments

@julioasotodv

Hi,

First of all, thank you for open-sourcing this library. It's really complete and well thought out (as is the Silverkite algorithm itself).

However, I think I have spotted a potential bug:

It seems that the seasonality_changepoints_dict option in ModelComponentsParam triggers an error inside pandas when Silverkite is run with cross-validation.

Here's a complete example (using Greykite 0.2.0):

import pandas as pd
import numpy as np

# Load airline passengers dataset (with monthly data):
air_passengers = pd.read_csv("https://raw.githubusercontent.com/jbrownlee/Datasets/master/airline-passengers.csv")
air_passengers["Month"] = pd.to_datetime(air_passengers["Month"])
air_passengers = air_passengers.set_index("Month").asfreq("MS").reset_index()

# Prepare Greykite configs:
from greykite.framework.templates.autogen.forecast_config import (ComputationParam, 
                                                                  EvaluationMetricParam, 
                                                                  EvaluationPeriodParam,
                                                                  ForecastConfig, 
                                                                  MetadataParam, 
                                                                  ModelComponentsParam)

# Metadata:
metadata_params = MetadataParam(date_format=None,  # infer
                                freq="MS",
                                time_col="Month",
                                train_end_date=None,
                                value_col="Passengers")

# Eval metric:
evaluation_metric_params = EvaluationMetricParam(agg_func=np.sum,   # Sum all forecasts...
                                                 agg_periods=12,    # ...Over 12 months
                                                 cv_report_metrics=["MeanSquaredError", "MeanAbsoluteError", "MeanAbsolutePercentError"],
                                                 cv_selection_metric="MeanAbsolutePercentError",
                                                 null_model_params=None,
                                                 relative_error_tolerance=None)

# Eval procedure (CV & backtest):
evaluation_period_params = EvaluationPeriodParam(cv_expanding_window=False,
                                                 cv_horizon=0,   # No CV for now. CHANGE THIS
                                                 cv_max_splits=5,
                                                 cv_min_train_periods=24,
                                                 cv_periods_between_splits=6,
                                                 cv_periods_between_train_test=0,
                                                 cv_use_most_recent_splits=False,
                                                 periods_between_train_test=0,
                                                 test_horizon=12)

# Config for seasonality changepoints
seasonality_components_df = pd.DataFrame({"name": ["conti_year"],
                                          "period": [1.0],
                                          "order": [5],
                                          "seas_names": ["yearly"]})

# Model components (quite long):
model_components_params = ModelComponentsParam(autoregression={"autoreg_dict": "auto"},
                                               
                                               changepoints={"changepoints_dict":  [{"method":"auto",
                                                                                     "potential_changepoint_n": 50,
                                                                                     "no_changepoint_proportion_from_end": 0.2,
                                                                                     "regularization_strength": 0.01}],
                                                             
                                                             # Seasonality changepoints
                                                             "seasonality_changepoints_dict": [{"regularization_strength": 0.6,
                                                                                                "no_changepoint_proportion_from_end": 0.8,
                                                                                                "seasonality_components_df": seasonality_components_df,
                                                                                                "potential_changepoint_n": 50,
                                                                                                "resample_freq":"MS"},
                                                                                               ]
                                                            },
                                               
                                               custom={"fit_algorithm_dict": [{"fit_algorithm": "linear"},
                                                                              ],
                                                       "feature_sets_enabled": "auto",
                                                       "min_admissible_value": 0.0},
                                               
                                               events={"holiday_lookup_countries": None,
                                                       "holidays_to_model_separately": None,
                                                       },
                                               
                                               growth={"growth_term":["linear"]},
                                               
                                               hyperparameter_override={"input__response__outlier__z_cutoff": [100.0],
                                                                        "input__response__null__impute_algorithm": ["ts_interpolate"]},
                                               
                                               regressors=None,
                                               
                                               lagged_regressors=None,
                                               
                                               seasonality={"yearly_seasonality": [5],
                                                            "quarterly_seasonality": ["auto"],
                                                            "monthly_seasonality": False,
                                                            "weekly_seasonality": False,
                                                            "daily_seasonality": False},
                                               
                                               uncertainty=None)

# Computation
computation_params = ComputationParam(n_jobs=1,
                                      verbose=3)


# Define forecaster:
from greykite.framework.templates.forecaster import Forecaster

# defines forecast configuration
config=ForecastConfig(model_template="SILVERKITE",
                      forecast_horizon=12,
                      coverage=0.8,
                      metadata_param=metadata_params,
                      evaluation_metric_param=evaluation_metric_params,
                      evaluation_period_param=evaluation_period_params,
                      model_components_param=model_components_params,
                      computation_param=computation_params,
                     )

# Run:
# creates forecast
forecaster = Forecaster()
result = forecaster.run_forecast_config(df=air_passengers, 
                                        config=config 
                                        )

If we run the code above, everything works as expected. However, if we activate cross-validation (by increasing cv_horizon to 5, for instance), every CV fit fails and the CV scores come back as NaN. This happens unless we disable seasonality changepoints (by removing seasonality_changepoints_dict).

The crash traceback looks as follows:

5 fits failed out of a total of 5.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
5 fits failed with the following error:
Traceback (most recent call last):
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\sklearn\model_selection\_validation.py", line 681, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\sklearn\pipeline.py", line 394, in fit
    self._final_estimator.fit(Xt, y, **fit_params_last_step)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\sklearn\estimator\simple_silverkite_estimator.py", line 239, in fit
    self.model_dict = self.silverkite.forecast_simple(
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\algo\forecast\silverkite\forecast_simple_silverkite.py", line 708, in forecast_simple
    trained_model = super().forecast(**parameters)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\algo\forecast\silverkite\forecast_silverkite.py", line 719, in forecast
    seasonality_changepoint_result = get_seasonality_changepoints(
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\algo\changepoint\adalasso\changepoint_detector.py", line 1177, in get_seasonality_changepoints
    result = cd.find_seasonality_changepoints(**seasonality_changepoint_detection_args)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\common\python_utils.py", line 787, in fn_ignore
    return fn(*args, **kwargs)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\algo\changepoint\adalasso\changepoint_detector.py", line 736, in find_seasonality_changepoints
    seasonality_df = build_seasonality_feature_df_with_changes(
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\greykite\algo\changepoint\adalasso\changepoints_utils.py", line 237, in build_seasonality_feature_df_with_changes
    fs_truncated_df.loc[(features_df["datetime"] < date).values, cols] = 0
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\indexing.py", line 719, in __setitem__
    indexer = self._get_setitem_indexer(key)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\indexing.py", line 646, in _get_setitem_indexer
    self._ensure_listlike_indexer(key)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\indexing.py", line 709, in _ensure_listlike_indexer
    self.obj._mgr = self.obj._mgr.reindex_axis(
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\internals\base.py", line 89, in reindex_axis
    return self.reindex_indexer(
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\internals\managers.py", line 670, in reindex_indexer
    self.axes[axis]._validate_can_reindex(indexer)
  File "C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\pandas\core\indexes\base.py", line 3785, in _validate_can_reindex
    raise ValueError("cannot reindex from a duplicate axis")
ValueError: cannot reindex from a duplicate axis


C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning:

One or more of the test scores are non-finite: [nan]

C:\Users\SOTOVJU1\Anaconda3\envs\greykite\lib\site-packages\sklearn\model_selection\_search.py:969: UserWarning:

One or more of the train scores are non-finite: [nan]
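
The final ValueError in the traceback is raised when pandas has to reindex an axis whose labels are not unique. A quick way to check whether a set of column names contains repeats is sketched below. It uses only the standard library, and the column names are hypothetical, not Greykite's actual naming; for a real DataFrame one would pass `list(df.columns)`.

```python
from collections import Counter


def duplicated_labels(labels):
    """Return the labels that occur more than once, in first-seen order."""
    counts = Counter(labels)
    return [label for label in counts if counts[label] > 1]


# Hypothetical column names, for illustration only.
columns = ["cp_1950_01_sin1", "cp_1950_02_sin1", "cp_1950_01_sin1"]
print(duplicated_labels(columns))
```

An empty result means the labels are unique, so a duplicate-axis error would have to come from somewhere else.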

It would be great to be able to cross-validate when seasonality changepoint detection is activated, as it allows the model to learn multiplicative seasonalities, for instance, in a similar fashion to Prophet or Orbit.

Thank you!

@KaixuYang
Contributor

Hi @julioasotodv, thanks for the detailed example. Somehow the code ran without error on my machine (after I changed cv_horizon to 5). I see you are using an Anaconda environment. What are your Python, pandas, and sklearn versions?

@julioasotodv
Author

Hi @KaixuYang, after some more testing (also on Mac, as I was using Windows before) it seems that my pandas version was the issue.

I was using pandas=1.3.3. After downgrading to pandas 1.2 (for instance, pandas=1.2.5) the issue goes away and the CV results are no longer NaN, so it looks like some pandas operation in greykite\algo\changepoint\adalasso\changepoint_detector.py does not play well with pandas>=1.3.

So downgrading to pandas 1.2 solves the issue, which is great to know.
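
A small check along these lines can flag an environment that matches the versions reported in this thread. It is only a sketch: the helper names are made up, and the `>= (1, 3)` boundary is an assumption based on the two data points above (1.3.3 failed, 1.2.5 worked), not on anything documented by pandas or Greykite.

```python
def major_minor(version):
    """Return the (major, minor) part of a version string such as '1.3.3'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def reported_as_affected(pandas_version):
    # Assumption from this thread only: 1.3.3 failed, 1.2.5 worked.
    return major_minor(pandas_version) >= (1, 3)


print(reported_as_affected("1.3.3"))
print(reported_as_affected("1.2.5"))
```

In practice the argument would be `pd.__version__`.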

Shall I keep this issue open, or perhaps open a PR in the meantime to modify setup.py accordingly?

Thanks a lot

@KaixuYang
Contributor

Hi @julioasotodv, thanks for the investigation. Actually, the root cause of this issue is that this is a monthly data set: since there are too many potential changepoints, duplicates end up in the columns. Instead of forcing the pandas version to be 1.2, we would like to fix this issue so that it works with pandas 1.3 as well. Could you help us by submitting a PR to fix this? I think a reasonable fix would be the following: in line 230 of greykite.algo.changepoint.adalasso, the changepoints contain duplicates. We want to eliminate the duplicates with the util function greykite.common.python_utils.unique_elements_in_list. Could you test whether this resolves the problem? Thanks!
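
To illustrate how too many candidates on monthly data can collide, here is a standard-library sketch. It assumes, purely for illustration, that candidate dates are spread evenly over the training window and then snapped to the start of their month; how Greykite actually places candidates is not shown in this thread, and the helper names are hypothetical.

```python
from datetime import date, timedelta


def evenly_spaced_dates(start, end, n):
    """Return n dates spread evenly between start and end, endpoints excluded."""
    span = (end - start).days
    return [start + timedelta(days=span * i // (n + 1)) for i in range(1, n + 1)]


def to_month_start(d):
    """Snap a date to the first day of its month (the "MS" frequency)."""
    return d.replace(day=1)


# A 24-month training window, matching cv_min_train_periods=24 above.
start, end = date(1949, 1, 1), date(1951, 1, 1)
candidates = [to_month_start(d) for d in evenly_spaced_dates(start, end, 50)]

# 50 candidates cannot all be distinct when only 24 month starts exist.
print(len(candidates), len(set(candidates)))
```

Any feature columns named after these dates would then repeat as well, which is consistent with the duplicate-axis error in the traceback.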

@julioasotodv
Author

Hi @KaixuYang, thank you for reaching out.

I understand the issue. However, as far as I know, there is no change in pandas 1.3 that should produce these two different behaviors.

Will try to debug and search in pandas' changelogs before "hardcoding" a greykite.common.python_utils.unique_elements_in_list patch, just to make sure there would not be any side effects.

Thanks!

@KaixuYang
Contributor

Thanks @julioasotodv! Yeah, actually unique_elements_in_list should have been there as a safeguard in that function even for earlier versions of pandas, but we failed to add it. It would be great to add it so that no duplicated columns are generated.
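
A minimal sketch of such a safeguard is shown below. `unique_in_order` is a hypothetical stand-in written for this example, not Greykite's actual `unique_elements_in_list`, whose implementation is not shown in this thread. It drops repeats while keeping the first occurrence of each item, so the changepoints stay in their original order.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    # dict keys are unique and preserve insertion order (Python 3.7+).
    return list(dict.fromkeys(items))


# Hypothetical changepoint dates with repeats, for illustration only.
changepoint_dates = ["1950-01-01", "1950-01-01", "1950-02-01", "1950-01-01"]
deduplicated = unique_in_order(changepoint_dates)
print(deduplicated)
```

Preserving order matters here because a plain `set` would also remove the duplicates but could shuffle the changepoints.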

@KaixuYang
Contributor

Fixed in the next release.


2 participants