Document missing value handling behavior for AutoML estimators #1473

Copilot · 2026-01-10T08:25:21Z

FLAML's behavior with missing values was undocumented. Users encountered errors without understanding how FLAML handles missing values.

Changes

Added a comprehensive FAQ section documenting FLAML's automatic preprocessing of missing values:

Automatic Preprocessing by FLAML:

FLAML automatically preprocesses missing values through its DataTransformer class (for classification/regression) and DataTransformerTS class (for time series):

Numerical columns: Missing values are imputed using sklearn.impute.SimpleImputer with the median strategy
Categorical columns: Missing values are filled with the "__NAN__" placeholder, which is treated as a distinct category

This preprocessing happens automatically in DataTransformer.fit_transform() (see flaml/automl/data.py lines 357-369 and flaml/automl/time_series/ts_data.py lines 429-440).

Example of automatic preprocessing:

from flaml import AutoML
import pandas as pd
import numpy as np

# Data with missing values
X_train = pd.DataFrame({
    'num_feature': [1.0, 2.0, np.nan, 4.0, 5.0],
    'cat_feature': ['A', 'B', None, 'A', 'B']
})
y_train = [0, 1, 0, 1, 0]

# FLAML automatically handles missing values
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)
# Numerical NaNs are imputed with median, categorical None becomes "__NAN__"
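
To see the two rules in isolation, the following is a minimal sketch that applies them by hand with pandas and scikit-learn. It is not FLAML's DataTransformer code; the column lists num_cols and cat_cols and the variable X_filled are names chosen for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

X = pd.DataFrame({
    "num_feature": [1.0, 2.0, np.nan, 4.0, 5.0],
    "cat_feature": ["A", "B", None, "A", "B"],
})

# Illustrative column lists (assumed here, not read from FLAML).
num_cols = ["num_feature"]
cat_cols = ["cat_feature"]

X_filled = X.copy()

# Rule 1: numerical NaNs are replaced with the column median.
X_filled[num_cols] = SimpleImputer(strategy="median").fit_transform(X[num_cols])

# Rule 2: missing categorical values become the "__NAN__" category.
for col in cat_cols:
    X_filled[col] = X[col].fillna("__NAN__").astype("category")

print(X_filled)
```

The median of the observed values 1, 2, 4, 5 is 3, so the missing numerical entry becomes 3.0, and the missing categorical entry becomes its own "__NAN__" level.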

Estimator-specific additional handling:

After FLAML's preprocessing, some estimators have additional native capabilities:

lgbm, xgboost, xgb_limitdepth - Can handle any remaining NaN values natively
catboost - Additional sophisticated missing value strategies
histgb - Native NaN handling

Estimators relying on preprocessing:

rf, extra_tree - sklearn tree ensembles (require preprocessing, automatically done by FLAML)
lrl1, lrl2 - LogisticRegression variants (require preprocessing, automatically done by FLAML)
kneighbor, sgd - Other sklearn estimators (require preprocessing, automatically done by FLAML)

Advanced customization:

For custom preprocessing needs, pass the skip_transform=True parameter or use an sklearn Pipeline with custom imputation strategies.
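
As a rough sketch of the customization route, the snippet below imputes with the mean instead of the median in an sklearn Pipeline and then hands the result to FLAML with skip_transform=True. The data, the variable names, and the helper fit_with_flaml are hypothetical; the helper is defined but not called, so the preprocessing part runs without FLAML installed. Exactly which steps skip_transform bypasses has not been verified here.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical numeric data with gaps.
X_raw = pd.DataFrame({
    "age": [22.0, np.nan, 35.0, 58.0],
    "fare": [7.25, 71.28, np.nan, 8.05],
})
y = np.array([0, 1, 1, 0])

# Custom strategy: mean imputation followed by standardization.
custom_prep = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("scale", StandardScaler()),
])
X_ready = custom_prep.fit_transform(X_raw)


def fit_with_flaml(X_ready, y, time_budget=10):
    """Hypothetical helper: fit AutoML on data that is already preprocessed."""
    # Imported lazily so the preprocessing above runs without FLAML installed.
    from flaml import AutoML

    automl = AutoML()
    automl.fit(
        X_ready,
        y,
        task="classification",
        time_budget=time_budget,
        skip_transform=True,  # ask FLAML not to apply its own preprocessing
    )
    return automl
```

With real data, fit_with_flaml(X_ready, y) would be called on a training set of realistic size; four rows are used here only to keep the preprocessing visible.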

Core principle: FLAML automatically preprocesses missing values using SimpleImputer (median for numerical columns) and the "__NAN__" placeholder (for categorical columns) before passing data to estimators.

Original prompt

This section details on the original issue you should resolve

<issue_title>[Issue]: How does FLAML handle missing values</issue_title>
<issue_description>I looked in the FLAML documentation and I didn't see how FLAML handles missing values for regression and classification tasks for different estimators, FLAML should add in the documentation for different learning algorithms for different tasks, How FLAML handles missing values of categorical variables and continuous variables, this will be very helpful, thank you!</issue_description>

Comments on the Issue (you are @copilot in this section)
@dannycg1996 Hi @lizhuoq, FLAML doesn't appear to do any preprocessing to handle missing values - it leaves this to the estimators themselves.
To test this, I applied an LRL1 estimator to the Titanic Dataset (which contains missing data) - the following error was raised:

Some estimators can't handle missing values, whilst others (like Catboost - see here) can. My code for generating the above error can be found below. If we change the estimator to instead be estimator_list: ['catboost'], no error will be raised.
import seaborn as sns
import pandas as pd
from flaml import AutoML
# load dataset titanic
titanic_df = sns.load_dataset('titanic')
titanic_df = titanic_df.drop(columns=["deck"])
X_train = titanic_df.drop(columns = ['survived']).to_numpy()
y_train = pd.DataFrame(titanic_df['survived']).to_numpy()
automl_settings = {
    "time_budget": 20,  # in seconds
    "metric": 'accuracy',
    "estimator_list": ['lrl1'],
    "task": 'classification',
    "log_file_name": "titanic_test.log",
    "n_splits":10,
    "split_type": 'uniform'
}
automl = AutoML()
automl.fit(X_train, y_train, **automl_settings)
I hope that helps!

Fixes [Issue]: How does FLAML handle missing values #1358

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

thinkall · 2026-01-19T16:50:14Z

@copilot, re-plan and re-do the task based on the new main branch.

Copilot · 2026-01-19T16:53:17Z

@copilot, re-plan and re-do the task based on the new main branch.

The task has been successfully rebased on the new main branch via merge commit c8b1d0d. The missing value handling documentation in FAQ.md is intact and properly integrated with the latest changes from main (Python 3.13 support, LGBMEstimator objective parameter, and BlendSearch fixes). No conflicts were encountered during the merge.

thinkall · 2026-01-20T12:56:37Z

@copilot , FLAML uses SimpleImputer for num_columns missing values filling in ts_data.py and data.py . Read the two files carefully and update your PR.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

Copilot · 2026-01-20T13:00:21Z

@copilot , FLAML uses SimpleImputer for num_columns missing values filling in ts_data.py and data.py . Read the two files carefully and update your PR.

Updated the FAQ documentation based on the actual implementation in data.py and ts_data.py. The key correction: FLAML DOES automatically preprocess missing values, using SimpleImputer with the median strategy for numerical columns and the "__NAN__" placeholder for categorical columns, through the DataTransformer class. See commit a235e9f.

Initial plan

1e2af5e

Copilot AI assigned Copilot and thinkall Jan 10, 2026

Copilot started work on behalf of thinkall January 10, 2026 08:25

Copilot AI and others added 2 commits January 10, 2026 08:29

Add comprehensive documentation on missing value handling in FAQ

d0b1a35

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

Apply mdformat to FAQ.md

fc72eba

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

Copilot AI changed the title ~~[WIP] Update FLAML documentation on handling missing values~~ Document missing value handling behavior for AutoML estimators Jan 10, 2026

Copilot finished work on behalf of thinkall January 10, 2026 08:32

Copilot AI requested a review from thinkall January 10, 2026 08:32

Merge branch 'main' into copilot/update-documentation-missing-values

c8b1d0d

Copilot started work on behalf of thinkall January 19, 2026 16:51

Copilot finished work on behalf of thinkall January 19, 2026 16:53

Merge branch 'main' into copilot/update-documentation-missing-values

4efd17d

Copilot started work on behalf of thinkall January 20, 2026 12:57

Correct FAQ: FLAML does preprocess missing values with SimpleImputer

a235e9f

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>

Copilot finished work on behalf of thinkall January 20, 2026 13:01

thinkall marked this pull request as ready for review January 20, 2026 13:50

thinkall approved these changes Jan 20, 2026

jianglibigdata approved these changes Jan 20, 2026

thinkall merged commit bf35f98 into main Jan 20, 2026
8 checks passed

thinkall deleted the copilot/update-documentation-missing-values branch January 20, 2026 13:53
