fix: prepare_data crashes with non shuffle split + sample_weight (#887, #1553)#1554
Merged
thinkall merged 1 commit intoMay 28, 2026
Merged
Conversation
…on non-shuffle splits
There was a problem hiding this comment.
Pull request overview
Fixes two related bugs in GenericTask.prepare_data that crash AutoML.fit when combining a non-shuffle split (time/group) with sample_weight. The first bug was caused by a missing elif, leading to double-splitting of the weights; the second by state.sample_weight_all only being set on the shuffle path, causing an AttributeError during final retraining.
Changes:
- Chain the
groupholdout branch aselifof thetimebranch so the time/group/classification/regression split branches are mutually exclusive. - Initialize
state.sample_weight_all = sample_weight_fullbefore split branching so non-shuffle paths populate the attribute consumed byAutoMLState.prepare_sample_train_data. - Add a regression test exercising
split_type="time"+sample_weight+retrain_full=Trueholdout.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| flaml/automl/task/generic_task.py | Initializes sample_weight_all for all split paths and makes the group holdout branch mutually exclusive with the surrounding branches. |
| test/automl/test_split.py | Adds regression test covering issues #887 and #1553 (time split + sample_weight + retrain_full). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why are these changes needed?
Two related bugs in
GenericTask.prepare_datathat surface together whenever a user combines a non-shuffle split (split_type="time"or"group") withsample_weight:[Bug]: holdout + sample_weight + time/group split raises ValueError due to missing elif in prepare_data branching #1553 (
ValueError: Found input variables with inconsistent numbers of samples) —prepare_datawas splitting the data twice. Theif split_type == "group":branch atflaml/automl/task/generic_task.py:956was a standaloneif, so after thetime/groupbranch ran (correctly subsettingstate.fit_kwargs["sample_weight"]), the chainedelif self.is_classification()/elif self.is_regression()branches also fired and tried to split the already-subset weight against the still-fullX_train_all. One-character fix: chain thegroupbranch off the precedingtimebranch withelif.'AutoMLState' has no attribute 'sample_weight_all' when split_type='time' #887 (
AttributeError: 'AutoMLState' has no attribute 'sample_weight_all') —AutoMLState.prepare_sample_train_datareadsself.sample_weight_allwhenever"sample_weight"is infit_kwargs, but the attribute was only set inside theSHUFFLE_SPLIT_TYPESbranch. Forsplit_type="time"or"group"(or any other non-shuffle path), the attribute was never created, raising the AttributeError during the final retrain step. Fix proposed by @sonichi in 'AutoMLState' has no attribute 'sample_weight_all' when split_type='time' #887 (2023): initializestate.sample_weight_all = sample_weight_fullonce before the shuffle branching; the SHUFFLE branch still overwrites it aftershuffle(...)returns the shuffled weight.The two bugs surface in the same call (
AutoML.fit(... split_type="time", sample_weight=..., retrain_full=True)) — theValueErrorfires first inprepare_data, and once that's fixed, theAttributeErrorbecomes reachable during the retrain step. So both fixes share a single regression test.Related issue number
Closes #887
Closes #1553
Checks