reproducibility and random state for AutoML.fit() #151

Closed
zhonghua-zheng opened this issue Aug 4, 2021 · 13 comments · Fixed by #150

@zhonghua-zheng (Collaborator)

Hello, I wonder if it is possible to reproduce the results of flaml.AutoML.fit()?
If so, could you please kindly let me know how to set the random_state (or seed) for flaml.AutoML.fit()?
Thanks!

@qingyun-wu (Collaborator)

Hi @zzheng93, that's a very good question, and thank you for raising it.

A short answer is: for now, unfortunately, we cannot guarantee that flaml.AutoML.fit() is 100% reproducible.

Here is the reason: although we use random seeds to control all the explicit randomness in AutoML, we cannot control the randomness in runtime. Time measurements are used in multiple places of flaml.AutoML to prioritize the search over multiple estimators and the retraining, given the time budget and the time left. The possible fluctuation in those measurements (usually subtle for a fixed task) means flaml.AutoML.fit() is not strictly reproducible.

We didn't anticipate this being a big issue because such randomness usually doesn't make a big difference in performance. There are ways to work around it if reproducibility is really desired. Would you like to share your use case and how much reproducibility matters to you?

Thanks!

@zhonghua-zheng (Collaborator, Author)


Hi @qingyun-wu, thank you very much for your prompt reply.
I set "estimator_list": ['lgbm', 'rf', 'xgboost', 'catboost']. With the same code and data, sometimes lgbm is reported as the best estimator and sometimes xgboost. Even with the same best estimator (e.g., lgbm), the models are different every time. Is there any way to keep the results consistent?

@sonichi (Collaborator) commented Aug 4, 2021

How long is the time budget? The results seem to indicate that the time budget is not enough to conclude what model is best.

@zhonghua-zheng (Collaborator, Author)

Hi @sonichi , thank you very much! I just used the default configuration. I have ~1M training samples. It would be great if you could offer some suggestions regarding the configurations (e.g., time budget)!

@sonichi (Collaborator) commented Aug 4, 2021

Oh the default budget is 60 seconds, which is too small for your use case. I'd suggest trying 1 hour (time_budget=3600) if that's affordable. If your desired time budget is shorter than that, then simply use your desired budget.

@zhonghua-zheng (Collaborator, Author)

Thank you @sonichi , I will follow your suggestions!

@qingyun-wu (Collaborator)


Hi @zzheng93,

Could you try calling np.random.seed(seed) before automl.fit()?

This presumably can help reduce the randomness.
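For reference, a minimal sketch of this workaround on synthetic data (the seed value, the dataset, and the short time budget here are just placeholders for illustration; zhonghua-zheng's full configuration appears further down the thread):

import numpy as np
from sklearn.datasets import make_regression
from flaml import AutoML

# Placeholder data standing in for the real training set.
X_train, y_train = make_regression(n_samples=1000, n_features=10, random_state=0)

np.random.seed(42)  # seed NumPy's global RNG right before fitting; 42 is an arbitrary choice
automl = AutoML()
automl.fit(X_train=X_train, y_train=y_train, task="regression", time_budget=60)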

@zhonghua-zheng (Collaborator, Author)


Thank you @qingyun-wu! I'll try it.

@sonichi linked a pull request Aug 12, 2021 that will close this issue
@zhonghua-zheng (Collaborator, Author)

Hi, I have followed your suggestions (seed and time_budget=3600). The best_estimator seems to have converged (in my case, to "lgbm"). Although the results differ between runs, they are very close (e.g., r2 on the testing data varied from 0.80 to 0.82).

However, I noticed the models are overfitting. For example, n_estimators is >25000. I wonder if there is any approach in FLAML to prevent overfitting and make the models simpler?

{'n_estimators': 26744,
 'num_leaves': 1009,
 'min_child_samples': 24,
 'learning_rate': 0.01940044907863862,
 'subsample': 1.0,
 'log_max_bin': 7,
 'colsample_bytree': 0.5739585077256129,
 'reg_alpha': 0.0009765625,
 'reg_lambda': 1.087179087107877,
 'FLAML_sample_size': 160000}

Here is the configuration:

import numpy as np
from flaml import AutoML

automl = AutoML()
automl_settings = {
    "metric": 'r2',
    "estimator_list": ['lgbm', 'rf', 'xgboost', 'catboost'],
    "task": 'regression',
    "log_file_name": "./train.log",
}

np.random.seed(66)  # seed NumPy's global RNG right before fit(), as suggested above
automl.fit(X_train=X_train, y_train=y_train,
           verbose=0, time_budget=3600,
           **automl_settings)

print(f"best_estimator: {automl.best_estimator}")
print(automl.best_config)

Thanks!

@qingyun-wu (Collaborator)


Hi @zzheng93, thank you for following up. I am wondering why you conclude that the models are overfitting. Is that conclusion based purely on the fact that the model found is complex, or do you also observe a large gap between the validation error and the test error for this model? Thanks!

@zhonghua-zheng (Collaborator, Author)

Hi @qingyun-wu, thank you for your prompt response. I guess it is overfitting because n_estimators is much larger than in the model trained with time_budget=300 (where n_estimators is about 1200), but the testing errors are similar (R2 = ~0.8).

The training error from the time_budget=3600 configuration (R2 = 0.99) is indeed smaller than that of the model trained with time_budget=300 (R2 = 0.93).

@qingyun-wu (Collaborator)


Hi @zzheng93, in your case, at least the larger model (the model found with the larger time budget) does not give worse testing results, so it is less worrisome (assuming you only care about the test error and do not worry about the model size), right? A more proactive suggestion is to design your own metric to guide AutoML's search. For example, this custom_metric penalizes the training loss, with the hope of alleviating overfitting; it was suggested by other data scientists using FLAML. Perhaps you can give it a try and let us know whether it helps in your case (a sketch of the idea follows below).

Thank you!
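For illustration, here is a minimal sketch of that idea adapted to the regression setting in this thread, written against the custom-metric interface that automl.fit() accepts (a callable returning the value to minimize plus a dict of metrics to log). The function name, the penalty weight alpha, and the trailing *args/**kwargs are illustrative assumptions rather than the exact custom_metric referenced above:

from sklearn.metrics import r2_score

def custom_metric(X_val, y_val, estimator, labels,
                  X_train, y_train, weight_val=None, weight_train=None,
                  *args, **kwargs):
    # Use 1 - r2 as a loss so that lower is better.
    val_loss = 1 - r2_score(y_val, estimator.predict(X_val), sample_weight=weight_val)
    train_loss = 1 - r2_score(y_train, estimator.predict(X_train), sample_weight=weight_train)
    alpha = 0.5  # placeholder weight; a larger alpha penalizes the train/validation gap more
    # Penalize configurations whose validation loss is much worse than their training loss.
    return val_loss + alpha * (val_loss - train_loss), {
        "val_loss": val_loss,
        "train_loss": train_loss,
    }

# Then pass the callable instead of the 'r2' string, e.g.:
# automl.fit(X_train=X_train, y_train=y_train, metric=custom_metric,
#            task='regression', time_budget=3600)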

@sonichi (Collaborator) commented Sep 1, 2021

This custom metric function is added as an example to the notebook https://github.com/microsoft/FLAML/blob/main/notebook/flaml_automl.ipynb in #178

@sonichi closed this as completed Sep 8, 2021