DateTimeIndex warning #329

Closed
Hussam1 opened this issue Jan 16, 2023 · 5 comments
Labels
question Further information is requested

Comments

@Hussam1

Hussam1 commented Jan 16, 2023

I am getting the following warning:

UserWarning: y has DatetimeIndex index but no frequency. Index is overwritten with a RangeIndex of step 1

although the index is a DatetimeIndex and the frequency is set to "D".

This happens when I try to replicate your code for LGBMRegressor:

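import yfinance as yf
import datetime as dt
from lightgbm import LGBMRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster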

spxl = yf.Ticker("SPXL")
hist = spxl.history(start="2015-01-01")
hist = hist.asfreq("D")
data = hist.dropna()
type(data.index)
#Output: pandas.core.indexes.datetimes.DatetimeIndex

#Split data into train-val-test
#==============================================================================
data = data.loc['2015-01-01': '2022-12-31']
end_train = '2019-12-31'
end_validation = '2020-12-31'
data_train = data.loc[: end_train, :].copy()
data_val   = data.loc[end_train:end_validation, :].copy()
data_test  = data.loc[end_validation:, :].copy()

#Create forecaster
#==============================================================================
forecaster = ForecasterAutoreg(
                regressor = LGBMRegressor(),
                lags = 7
            )

#Grid search of hyper-parameters and lags
#==============================================================================
#Regressor hyper-parameters
param_grid = {
    'n_estimators': [100, 500],
    'max_depth': [3, 5, 10],
    'learning_rate': [0.01, 0.1]
}

#Lags used as predictors
lags_grid = [7]
results_grid_q10 = grid_search_forecaster(
                            forecaster         = forecaster,
                            y                  = data.loc[:end_validation, 'Close'],
                            param_grid         = param_grid,
                            lags_grid          = lags_grid,
                            steps              = 7,
                            refit              = True,
                            metric             = 'mean_squared_error',
                            initial_train_size = int(len(data_train)),
                            fixed_train_size   = False,
                            return_best        = True,
                            verbose            = False
                    )
@JavierEscobarOrtiz
Collaborator

JavierEscobarOrtiz commented Jan 17, 2023

Hello @Hussam1,

This line of code is causing the problem: data = hist.dropna(). Dropping NaNs creates gaps in the series, so the index loses its frequency. Check this issue.

Check how freq disappears (and the length is reduced):

import yfinance as yf
import datetime as dt

spxl = yf.Ticker("SPXL")
hist = spxl.history(start="2015-01-01")
hist = hist.asfreq("D")
print(hist.index)

DatetimeIndex(['2015-01-02 00:00:00-05:00', '2015-01-03 00:00:00-05:00',
'2015-01-04 00:00:00-05:00', '2015-01-05 00:00:00-05:00',
'2015-01-06 00:00:00-05:00', '2015-01-07 00:00:00-05:00',
'2015-01-08 00:00:00-05:00', '2015-01-09 00:00:00-05:00',
'2015-01-10 00:00:00-05:00', '2015-01-11 00:00:00-05:00',
...
'2023-01-04 00:00:00-05:00', '2023-01-05 00:00:00-05:00',
'2023-01-06 00:00:00-05:00', '2023-01-07 00:00:00-05:00',
'2023-01-08 00:00:00-05:00', '2023-01-09 00:00:00-05:00',
'2023-01-10 00:00:00-05:00', '2023-01-11 00:00:00-05:00',
'2023-01-12 00:00:00-05:00', '2023-01-13 00:00:00-05:00'],
dtype='datetime64[ns, America/New_York]', name='Date', length=2934, freq='D')

data = hist.dropna()
print(data.index)

DatetimeIndex(['2015-01-02 00:00:00-05:00', '2015-01-05 00:00:00-05:00',
'2015-01-06 00:00:00-05:00', '2015-01-07 00:00:00-05:00',
'2015-01-08 00:00:00-05:00', '2015-01-09 00:00:00-05:00',
'2015-01-12 00:00:00-05:00', '2015-01-13 00:00:00-05:00',
'2015-01-14 00:00:00-05:00', '2015-01-15 00:00:00-05:00',
...
'2022-12-30 00:00:00-05:00', '2023-01-03 00:00:00-05:00',
'2023-01-04 00:00:00-05:00', '2023-01-05 00:00:00-05:00',
'2023-01-06 00:00:00-05:00', '2023-01-09 00:00:00-05:00',
'2023-01-10 00:00:00-05:00', '2023-01-11 00:00:00-05:00',
'2023-01-12 00:00:00-05:00', '2023-01-13 00:00:00-05:00'],
dtype='datetime64[ns, America/New_York]', name='Date', length=2023, freq=None)
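
The same behaviour can be reproduced without downloading anything. The following is only a minimal sketch on a synthetic series (the dates and values are made up for illustration, they are not the SPXL data): asfreq("D") inserts weekends as NaN and sets the frequency, and dropna() removes those rows again, leaving an irregular index with no frequency.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the downloaded prices: ten business days only
# (Mon 2 Jan 2023 to Fri 13 Jan 2023). Values are arbitrary.
idx = pd.bdate_range("2023-01-02", periods=10)
close = pd.Series(np.arange(10, dtype=float), index=idx, name="Close")

# asfreq("D") adds the missing calendar days (the weekend) as NaN rows
# and the resulting index carries a daily frequency.
daily = close.asfreq("D")
print(daily.index.freqstr, len(daily))

# dropna() removes the weekend rows again. The index is no longer regular,
# so pandas cannot keep a frequency on it.
dropped = daily.dropna()
print(dropped.index.freq, len(dropped))
```

Checking `index.freq` right before calling the forecaster is a quick way to see whether the warning will be raised.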

@JavierEscobarOrtiz added the "question" (Further information is requested) label Jan 17, 2023
@Hussam1
Author

Hussam1 commented Jan 17, 2023

Thanks @JavierEscobarOrtiz for the response. The problem is that filling the "gap" is sometimes not optimal or correct from a business-case perspective. For instance, in this situation trading happens only on business days, and assuming any values for the "gap" days would distort the purpose.

Even if we use hist = hist.asfreq("B"), we will still have the gap. Or do you mean it is OK to leave them as null as long as there are no gaps in the DatetimeIndex?

Edit:

I tested your answer: yes, filling the gap with any figures makes the warning go away. But how would you solve the problem of having to fill the gap when it doesn't make sense business-wise? For instance, a company doesn't have sales every day and you still need to model the daily sales. Of course you can always resample to a lower frequency (weekly, for example), but don't you think this is a limitation of the library?

@JoaquinAmatRodrigo
Owner

Hi @Hussam1,

Forecasting with missing values is always a challenge. How to solve it depends a lot on the business case. Based on what you are explaining, it may make sense to propagate the value of the last business day.

You may also benefit from the weighted time series forecasting feature that skforecast offers.

https://www.cienciadedatos.net/documentos/py46-forecasting-time-series-missing-values.html

https://joaquinamatrodrigo.github.io/skforecast/0.6.0/faq/forecasting-time-series-with-missing-values.html
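
As a rough illustration of both suggestions, the sketch below forward-fills a synthetic daily series and builds a weight function that returns 0 for the filled dates and 1 for the observed ones. The data, the name custom_weights and the 0/1 weighting scheme are assumptions made for this example; the function is meant to be passed to the forecaster through its weight_func argument, and the linked guide should be checked for the exact usage in your skforecast version.

```python
import numpy as np
import pandas as pd

# Synthetic business-day series (values are arbitrary).
idx = pd.bdate_range("2023-01-02", periods=10)
close = pd.Series(np.arange(10, dtype=float), index=idx, name="Close")

# Make the index regular, remember which rows were missing, then
# propagate the last observed value into the gaps.
daily = close.asfreq("D")
filled_dates = daily.index[daily.isna()]
daily_filled = daily.ffill()

def custom_weights(index):
    """Return 0 for dates that were filled in and 1 for observed dates."""
    return np.where(index.isin(filled_dates), 0, 1)

weights = custom_weights(daily_filled.index)
print(daily_filled.index.freqstr)
print(weights)
```

With weights like these the filled rows keep the index regular but are ignored when fitting, provided the chosen regressor supports sample weights.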

@JavierEscobarOrtiz
Collaborator

JavierEscobarOrtiz commented Jan 18, 2023

Hello @Hussam1,

Yes, you are right. One of the main limitations of an autoregressive model is that the series cannot be incomplete. Since the prediction t+1 depends on its past values (lags) it will not make sense for the gap between the lags to be different for each prediction.
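
A small sketch of why this matters, using made-up business-day dates: when the rows are simply shifted by position, "lag 1" is one day back for most rows but three days back for every Monday, so the same predictor column would mix different time distances.

```python
import pandas as pd

# Ten business days (Mon 2 Jan 2023 to Fri 13 Jan 2023), weekend rows absent.
idx = pd.bdate_range("2023-01-02", periods=10)
dates = pd.Series(idx, index=idx)

# Time elapsed between each row and the row used as its lag 1
# (a positional shift, which is what happens once the index is ignored).
lag1_gap = dates - dates.shift(1)
print(lag1_gap.dropna().value_counts())
```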

Along with @JoaquinAmatRodrigo's solutions, I think another one can be tried:

  • Did you try your idea of hist = hist.asfreq("B")? Since the series would then have a freq, the warning should be gone. As you mention, it only makes sense for a series that stops on Friday and starts again on Monday. Disclaimer: the values on business days cannot be NaN.
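
A minimal sketch of this idea on synthetic data (dates and values are made up; Monday 16 Jan 2023 is left out to play the role of a market holiday): asfreq("B") gives the index a business-day frequency without adding weekends, but any business day with no data shows up as NaN and still has to be handled, here with a forward fill.

```python
import numpy as np
import pandas as pd

# Business days from 9 to 20 Jan 2023, with one of them removed to
# simulate a holiday on which no price was recorded.
all_days = pd.bdate_range("2023-01-09", periods=10)
holiday = pd.Timestamp("2023-01-16")
idx = pd.DatetimeIndex([d for d in all_days if d != holiday])
close = pd.Series(np.arange(len(idx), dtype=float), index=idx, name="Close")
print(close.index.freq)

# Business-day frequency: weekends are not added, the holiday is (as NaN).
business = close.asfreq("B")
print(business.index.freqstr, int(business.isna().sum()))

# Business days cannot stay NaN, so fill the holiday with the last value.
business = business.ffill()
```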

@Hussam1
Author

Hussam1 commented Jan 18, 2023

Thank you very much @JoaquinAmatRodrigo @JavierEscobarOrtiz for your answers and suggestions. I agree that for the company sales business case (which is my real case), propagating the last business day's sales might be a good option, in addition to resampling to a lower frequency such as weekly/monthly.

I appreciate the effort you put into this library; it is super helpful!
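
For reference, the resampling option could look like the sketch below. The sales figures and dates are invented for illustration; the point is only that aggregating irregular daily records into weekly totals yields a regular index with a frequency.

```python
import pandas as pd

# Hypothetical sales recorded on irregular days only.
dates = pd.to_datetime(
    ["2023-01-02", "2023-01-05", "2023-01-10",
     "2023-01-11", "2023-01-19", "2023-01-27"]
)
sales = pd.Series([10.0, 20.0, 5.0, 15.0, 30.0, 25.0], index=dates, name="sales")

# Aggregate to weekly totals (weeks ending on Sunday by default).
weekly = sales.resample("W").sum()
print(weekly)
print(weekly.index.freqstr)
```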

@Hussam1 closed this as completed Jan 18, 2023