Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Jan 10, 2026

FLAML's behavior with missing values was undocumented. Users encountered errors without understanding how FLAML handles missing values.

Changes

Added comprehensive FAQ section documenting FLAML's automatic missing value preprocessing:

Automatic Preprocessing by FLAML:

FLAML automatically preprocesses missing values through its DataTransformer class (for classification/regression) and DataTransformerTS class (for time series):

  1. Numerical columns: Missing values are imputed using sklearn.impute.SimpleImputer with median strategy
  2. Categorical columns: Missing values are filled with "__NAN__" placeholder, treated as a distinct category

This preprocessing happens automatically in DataTransformer.fit_transform() (see flaml/automl/data.py lines 357-369 and flaml/automl/time_series/ts_data.py lines 429-440).

Example of automatic preprocessing:

from flaml import AutoML
import pandas as pd
import numpy as np

# Data with missing values
X_train = pd.DataFrame({
    'num_feature': [1.0, 2.0, np.nan, 4.0, 5.0],
    'cat_feature': ['A', 'B', None, 'A', 'B']
})
y_train = [0, 1, 0, 1, 0]

# FLAML automatically handles missing values
automl = AutoML()
automl.fit(X_train, y_train, task="classification", time_budget=60)
# Numerical NaNs are imputed with median, categorical None becomes "__NAN__"

Estimator-specific additional handling:

After FLAML's preprocessing, some estimators have additional native capabilities:

  • lgbm, xgboost, xgb_limitdepth - Can handle any remaining NaN values natively
  • catboost - Additional sophisticated missing value strategies
  • histgb - Native NaN handling

Estimators relying on preprocessing:

  • rf, extra_tree - sklearn tree ensembles (require preprocessing, automatically done by FLAML)
  • lrl1, lrl2 - LogisticRegression variants (require preprocessing, automatically done by FLAML)
  • kneighbor, sgd - Other sklearn estimators (require preprocessing, automatically done by FLAML)

Advanced customization:

For custom preprocessing needs, use skip_transform=True parameter or sklearn Pipeline with custom imputation strategies.

Core principle: FLAML automatically preprocesses missing values using SimpleImputer (median for numerical) and "NAN" placeholder (for categorical) before passing data to estimators.

Original prompt

This section details on the original issue you should resolve

<issue_title>[Issue]: How does FLAML handle missing values</issue_title>
<issue_description>I looked in the FLAML documentation and I didn't see how FLAML handles missing values for regression and classification tasks for different estimators, FLAML should add in the documentation for different learning algorithms for different tasks, How FLAML handles missing values of categorical variables and continuous variables, this will be very helpful, thank you!</issue_description>

Comments on the Issue (you are @copilot in this section)

@dannycg1996 Hi @lizhuoq, FLAML doesn't appear to do any preprocessing to handle missing values - it leaves this to the estimators themselves.

To test this, I applied an LRL1 estimator to the Titanic Dataset (which contains missing data) - the following error was raised:
Image

Some estimators can't handle missing values, whilst others (like Catboost - see here) can. My code for generating the above error can be found below. If we change the estimator to instead be estimator_list: ['catboost'], no error will be raised.

import seaborn as sns
import pandas as pd
from flaml import AutoML
# load dataset titanic
titanic_df = sns.load_dataset('titanic')
titanic_df = titanic_df.drop(columns=["deck"])
X_train = titanic_df.drop(columns = ['survived']).to_numpy()
y_train = pd.DataFrame(titanic_df['survived']).to_numpy()
automl_settings = {
    "time_budget": 20,  # in seconds
    "metric": 'accuracy',
    "estimator_list": ['lrl1'],
    "task": 'classification',
    "log_file_name": "titanic_test.log",
    "n_splits":10,
    "split_type": 'uniform'
}
automl = AutoML()
automl.fit(X_train, y_train, **automl_settings)

I hope that helps!


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

Copilot AI and others added 2 commits January 10, 2026 08:29
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Copilot AI changed the title [WIP] Update FLAML documentation on handling missing values Document missing value handling behavior for AutoML estimators Jan 10, 2026
Copilot AI requested a review from thinkall January 10, 2026 08:32
@thinkall
Copy link
Collaborator

@copilot, re-plan and re-do the task based on the new main branch.

Copy link
Contributor Author

Copilot AI commented Jan 19, 2026

@copilot, re-plan and re-do the task based on the new main branch.

The task has been successfully rebased on the new main branch via merge commit c8b1d0d. The missing value handling documentation in FAQ.md is intact and properly integrated with the latest changes from main (Python 3.13 support, LGBMEstimator objective parameter, and BlendSearch fixes). No conflicts were encountered during the merge.

@thinkall
Copy link
Collaborator

@copilot , FLAML uses SimpleImputer for num_columns missing values filling in ts_data.py and data.py . Read the two files carefully and update your PR.

Co-authored-by: thinkall <3197038+thinkall@users.noreply.github.com>
Copy link
Contributor Author

Copilot AI commented Jan 20, 2026

@copilot , FLAML uses SimpleImputer for num_columns missing values filling in ts_data.py and data.py . Read the two files carefully and update your PR.

Updated the FAQ documentation based on the actual implementation in data.py and ts_data.py. The key correction: FLAML DOES automatically preprocess missing values using SimpleImputer with median strategy for numerical columns and "NAN" placeholder for categorical columns through the DataTransformer class. See commit a235e9f.

@thinkall thinkall marked this pull request as ready for review January 20, 2026 13:50
thinkall
thinkall approved these changes Jan 20, 2026
@thinkall thinkall merged commit bf35f98 into main Jan 20, 2026
8 checks passed
@thinkall thinkall deleted the copilot/update-documentation-missing-values branch January 20, 2026 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Issue]: How does FLAML handle missing values

3 participants