Document missing value handling behavior for AutoML estimators #1473
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
@copilot, re-plan and re-do the task based on the new main branch.
The task has been successfully rebased on the new main branch via merge commit c8b1d0d. The missing value handling documentation in FAQ.md is intact and properly integrated with the latest changes from main (Python 3.13 support, LGBMEstimator objective parameter, and BlendSearch fixes). No conflicts were encountered during the merge.
@copilot, FLAML uses SimpleImputer to fill missing values in num_columns in ts_data.py and data.py. Read the two files carefully and update your PR.
Updated the FAQ documentation based on the actual implementation in data.py and ts_data.py. The key correction: FLAML does automatically preprocess missing values through the DataTransformer class, using SimpleImputer with the median strategy for numerical columns and a "__NAN__" placeholder for categorical columns. See commit a235e9f.
FLAML's behavior with missing values was undocumented. Users encountered errors without understanding how FLAML handles missing values.
Changes
Added comprehensive FAQ section documenting FLAML's automatic missing value preprocessing:
Automatic Preprocessing by FLAML:
FLAML automatically preprocesses missing values through its DataTransformer class (for classification/regression) and DataTransformerTS class (for time series):
Numerical columns: imputed using sklearn.impute.SimpleImputer with median strategy
Categorical columns: filled with a "__NAN__" placeholder, treated as a distinct category
This preprocessing happens automatically in DataTransformer.fit_transform() (see flaml/automl/data.py lines 357-369 and flaml/automl/time_series/ts_data.py lines 429-440).
Example of automatic preprocessing:
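A minimal sketch of the equivalent steps, written against pandas and scikit-learn directly rather than calling FLAML's DataTransformer. The column names and toy data are made up for illustration, and FLAML's actual implementation may differ in detail:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy data (hypothetical): one numerical and one categorical column, each with a gap.
df = pd.DataFrame({
    "age": [25.0, np.nan, 35.0, 45.0],
    "city": pd.Series(["NY", None, "LA", "NY"], dtype="category"),
})
num_cols = ["age"]
cat_cols = ["city"]

out = df.copy()

# Numerical columns: replace NaN with the column median.
out[num_cols] = SimpleImputer(strategy="median").fit_transform(out[num_cols])

# Categorical columns: register "__NAN__" as a category, then fill gaps with it,
# so that "missing" becomes a distinct category of its own.
for col in cat_cols:
    out[col] = out[col].cat.add_categories("__NAN__").fillna("__NAN__")

print(out)
```

After this runs, the missing age is filled with the median of the observed ages and the missing city carries the "__NAN__" category, so no NaN values reach the estimator.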
Estimator-specific additional handling:
After FLAML's preprocessing, some estimators have additional native capabilities:
lgbm, xgboost, xgb_limitdepth - Can handle any remaining NaN values natively
catboost - Additional sophisticated missing value strategies
histgb - Native NaN handling
Estimators relying on preprocessing:
rf, extra_tree - sklearn tree ensembles (require preprocessing, automatically done by FLAML)
lrl1, lrl2 - LogisticRegression variants (require preprocessing, automatically done by FLAML)
kneighbor, sgd - Other sklearn estimators (require preprocessing, automatically done by FLAML)
Advanced customization:
For custom preprocessing needs, use the skip_transform=True parameter or an sklearn Pipeline with custom imputation strategies.
Core principle: FLAML automatically preprocesses missing values using SimpleImputer (median for numerical) and a "__NAN__" placeholder (for categorical) before passing data to estimators.
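For the custom-imputation route mentioned under advanced customization, a sketch of an sklearn Pipeline that swaps the default median for mean imputation might look like the following. The data and the choice of LogisticRegression are illustrative assumptions, and how such a pipeline is combined with FLAML (for example together with skip_transform=True) is not shown here:

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy data (hypothetical): two numerical features with one gap each.
X = np.array([
    [1.0, np.nan],
    [2.0, 0.0],
    [np.nan, 1.0],
    [4.0, 1.0],
    [5.0, 0.0],
    [6.0, 1.0],
])
y = np.array([0, 0, 0, 1, 1, 1])

# Custom strategy: impute with the column mean instead of the median,
# then fit an estimator that cannot handle NaN on its own.
pipe = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("model", LogisticRegression()),
])
pipe.fit(X, y)
preds = pipe.predict(X)
```

Keeping the imputer inside the pipeline means the fill values are learned from the training data only and reused unchanged at prediction time.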