time series forecasting on data with missing values #297

Closed
hadisotudeh opened this issue Nov 20, 2021 · 9 comments · Fixed by #362

@hadisotudeh
Collaborator

It seems the example provided in the documentation does not support data with some missing values.

I thought the example would handle missing values, but it does not seem to:

image

@hadisotudeh
Collaborator Author

I filled in the missing rows/data and I still get the same error. The time column is in '2021-11-06 23:59:00' format and increases minute by minute.

@hadisotudeh
Collaborator Author

Adding this line fixed the issue:

"estimator_list": ["prophet"]
It seems the error is caused by the other estimators.

@int-chaos
Collaborator

This is because arima and sarimax do not support missing values, but prophet does.

import numpy as np
from flaml import AutoML

X_train = np.arange('2021-11-06', '2021-11-07', dtype='datetime64[m]')
y_train = np.random.random(size=len(X_train))

print(X_train)

automl = AutoML()
automl.fit(
    X_train=X_train[:1380],  # a single column of timestamp
    y_train=y_train[:1380],  # value for each timestamp
    period=60,  # time horizon to forecast, e.g., 60 minutes
    task="ts_forecast",
    time_budget=5,  # time budget in seconds
    estimator_list=["arima", "sarimax"],
    log_file_name="test_minutes.log",
)
print(automl.predict(X_train[60:]))

I just tested this for arima and sarimax and it works.

@sonichi
Collaborator

sonichi commented Nov 21, 2021

@int-chaos DataTransformer.fit_transform() is supposed to fill the missing values. Is it not doing so for time series data?

@int-chaos
Collaborator

int-chaos commented Nov 21, 2021

No, it is not. In DataTransformer.fit_transform(), the timestamp column is popped out and then, after all the necessary transformations, inserted back in. The transformer therefore does not know that timestamps are missing from the series and only fills in missing values for the exogenous variables.

That is something I can work on implementing in DataTransformer.fit_transform().
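As a rough illustration of what filling in missing timestamps could look like outside FLAML, here is a minimal pandas sketch. It is not the DataTransformer implementation; the helper name `fill_missing_timestamps`, the column names `ds` and `y`, and the choice of linear interpolation are all assumptions made for this example. It also assumes the timestamps are already unique, since reindexing fails on duplicate index values.

```python
import numpy as np
import pandas as pd


def fill_missing_timestamps(df, time_col="ds", freq="min"):
    """Reindex df onto a regular time grid and interpolate the gaps.

    Hypothetical helper: time_col and freq are illustrative defaults,
    not FLAML parameters. Timestamps must be unique.
    """
    df = df.sort_values(time_col).set_index(time_col)
    full_index = pd.date_range(df.index.min(), df.index.max(), freq=freq)
    # Absent timestamps become rows of NaN after reindexing.
    df = df.reindex(full_index)
    # Time-weighted linear interpolation fills the newly created NaN rows.
    df = df.interpolate(method="time")
    return df.rename_axis(time_col).reset_index()


# Minute-level series with two timestamps removed to simulate gaps.
timestamps = pd.date_range("2021-11-06 00:00", periods=10, freq="min")
raw = pd.DataFrame({"ds": timestamps, "y": np.arange(10, dtype=float)})
raw = raw.drop(index=[3, 7]).reset_index(drop=True)

filled = fill_missing_timestamps(raw)
print(filled)
```

Interpolation is only one possible fill strategy; forward fill or a seasonal fill may suit some series better.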

@int-chaos
Collaborator

@hadisotudeh Does it work after you fill in the missing timestamps? If not, please check that you have the latest version of flaml: in a previous version the data was shuffled, which caused this problem. If the problem persists, please send me the dataset and I will test it on my end.

@hadisotudeh
Collaborator Author

@int-chaos, first, I had missing values. I filled them in and tried again, but it did not work. Later, I found out that my data also had duplicate rows, as well as rows with the same timestamp but different values in the output column. "prophet" worked without throwing any exception, while the ARIMA-family libraries threw a vague exception.

After I also fixed the duplicates, the model.fit part worked.
It would be a good idea to implement the steps I mentioned above as part of the pipeline, with meaningful messages.

I expected automl to tell me that my data has missing values or duplicate rows (whether or not it handles them).
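The kind of check described above can be sketched with pandas before calling fit. This is only an illustration of the idea, not something FLAML provides: `check_time_index` is a hypothetical helper, and the fixed one-minute frequency is an assumption that matches the data described in this thread.

```python
import pandas as pd


def check_time_index(timestamps, freq="min"):
    """Return (missing, duplicated) timestamps for a supposedly regular series.

    Hypothetical helper for illustration; not part of FLAML.
    """
    ts = pd.DatetimeIndex(pd.to_datetime(timestamps))
    expected = pd.date_range(ts.min(), ts.max(), freq=freq)
    # Timestamps on the regular grid that never occur in the data.
    missing = expected.difference(ts.unique())
    # Timestamps that occur more than once, each reported a single time.
    duplicated = ts[ts.duplicated()].unique()
    return missing, duplicated


ts = pd.to_datetime(
    [
        "2021-11-06 23:55:00",
        "2021-11-06 23:56:00",
        "2021-11-06 23:56:00",  # duplicate timestamp
        "2021-11-06 23:58:00",  # 23:57 is missing
        "2021-11-06 23:59:00",
    ]
)
missing, duplicated = check_time_index(ts)
if len(missing):
    print(f"{len(missing)} missing timestamp(s): {list(missing)}")
if len(duplicated):
    print(f"{len(duplicated)} duplicated timestamp(s): {list(duplicated)}")
```

Running a check like this first makes it possible to report a clear message before the data reaches an estimator that fails on gaps or duplicates.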

P.S. Is there an option in the forecasting model to limit it to only positive predictions?

@int-chaos
Collaborator

Yes, duplicate rows will lead to errors for the ARIMA family libraries and I agree with you that the exceptions are unclear.

Thank you for your suggestions on data handling and warnings/exceptions. I will work on implementing those.

To my knowledge, there is no built-in functionality in the forecasting models to limit them to only positive predictions. You can check out facebook/prophet#1668 for more information. TL;DR: clipping the predictions manually with np.clip(predictions, 0, None) is the best option, but clipping is currently not supported in FLAML.
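A minimal sketch of the manual clipping mentioned above. The `predictions` array is made-up data standing in for the output of automl.predict(...). Passing `None` as the upper bound spells out that only a lower bound is applied, which also works on NumPy versions where the third argument of np.clip is required.

```python
import numpy as np

# Placeholder values standing in for the output of automl.predict(...).
predictions = np.array([-0.3, 0.0, 1.2, -2.5, 4.0])

# Lower bound of 0, no upper bound.
clipped = np.clip(predictions, 0, None)
print(clipped)
```

Note that clipping at 0 yields non-negative values, so a forecast can still be exactly zero.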

@sonichi
Collaborator

sonichi commented Nov 22, 2021

One way to ensure positive predictions is to log-transform the labels before fit, and exp-transform the predictions afterwards. If the labels are distributed in a wide range this is worth trying.
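The transform described above can be sketched as follows. To keep the example self-contained, a straight-line fit with np.polyfit stands in for the real forecaster (an assumption for illustration only; with FLAML the log-labels would be passed as y_train instead), and the data is synthetic. It assumes strictly positive labels, since np.log is undefined at zero and below.

```python
import numpy as np

rng = np.random.default_rng(0)
t = np.arange(200)
# Synthetic, strictly positive labels spanning a wide range.
y = np.exp(0.02 * t + rng.normal(scale=0.1, size=t.size))

# Fit on the log-labels; a straight line stands in for the real forecaster.
log_y = np.log(y)
slope, intercept = np.polyfit(t, log_y, deg=1)

# Forecast in log space, then map back to the original scale with exp.
t_future = np.arange(200, 260)
log_pred = slope * t_future + intercept
pred = np.exp(log_pred)
print(pred[:5])
```

Because exp is positive for every real input, the back-transformed forecasts are positive whatever the model predicts in log space.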

3 participants