[BUG] mlflow.xgboost.load_model: Warning and Missing Parameters using xgboost 2.0.0 #9659

nelsoncardenas · 2023-09-18T19:44:51Z

Issues Policy acknowledgement

I have read and agree to submit bug reports in accordance with the issues policy

Willingness to contribute

No. I cannot contribute a bug fix at this time.

MLflow version

mlflow, version 2.7.0

System information

OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Databricks Runtime 11.3 LTS
Python version: 3.9.5
yarn version, if running the dev UI: na
xgboost version: 2.0.0

python packages:

black==22.3.0
cryptography==40.0.2
databricks-sql-connector==2.5.1
databricks==0.2
mlflow==2.7.0
neuralforecast==1.5.0
numpy==1.24.3
pandas==1.5.3
pmdarima==2.0.3
prophet==1.1.3
scikit-learn==0.24.2
snowflake-sqlalchemy==1.4.7
tokenize-rt==4.2.1
xgboost==2.0.0

Describe the problem

When I call mlflow.xgboost.load_model, I receive this warning:

> /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/xgboost/core.py:160: 
UserWarning: [19:10:21] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated 
binary model format, please consider using `json` or `ubj`. Model format will default
 to JSON in XGBoost 2.2 if not specified.
  warnings.warn(smsg, UserWarning)

And when I get the model params from the loaded model, all of them are None except for 'objective'.

# Load model using mlflow
loaded_model = mlflow.xgboost.load_model(model_path)

# Print parameters to verify
print("Model Params:", loaded_model.get_params())

Model Params: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None,
 'colsample_bytree': None, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None,
 'gamma': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 
'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': None, 
'n_jobs': None, 'num_parallel_tree': None, 'random_state': None, 'reg_alpha': None,
 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None,
 'subsample': None, 'tree_method': None, 'validate_parameters': None, 
'verbosity': None}

Expected output (showing 'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 5):

Model Params: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None,
 'colsample_bytree': None, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None,
 'gamma': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': 0.1, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 
'max_depth': 5, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': 100, 
'n_jobs': None, 'num_parallel_tree': None, 'random_state': None, 'reg_alpha': None,
 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None,
 'subsample': None, 'tree_method': None, 'validate_parameters': None, 
'verbosity': None}
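
The difference between the two dictionaries above is easy to miss by eye. A small helper along these lines could list exactly which hyperparameters were dropped. This is only a sketch: `lost_params` is a hypothetical name, not part of MLflow or XGBoost, and the sample dictionaries contain just the four relevant keys copied from the outputs above.

```python
def lost_params(expected, loaded):
    """Return the params that had a value before saving but are None after loading."""
    return {
        name: value
        for name, value in expected.items()
        if value is not None and loaded.get(name) is None
    }


# Subset of the expected get_params() output shown above.
expected = {
    "objective": "reg:squarederror",
    "learning_rate": 0.1,
    "n_estimators": 100,
    "max_depth": 5,
}

# Subset of the get_params() output actually returned by the loaded model.
loaded = {
    "objective": "reg:squarederror",
    "learning_rate": None,
    "n_estimators": None,
    "max_depth": None,
}

print(lost_params(expected, loaded))
# → {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 5}
```

In the reproduction script, the same check would presumably be called as `lost_params(xgb_r.get_params(), loaded_model.get_params())`.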

Tracking information

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/xgboost/core.py:160: UserWarning: [19:27:40] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
  warnings.warn(smsg, UserWarning)
Model Params: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': None, 'n_jobs': None, 'num_parallel_tree': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}
MLflow version: 2.7.0
Tracking URI: databricks
Artifact URI: dbfs:/databricks/mlflow-tracking/2473451629624570/6467ea86f65145e596c98951284a81a3/artifacts
System information: Linux #49~20.04.1-Ubuntu SMP Wed Jul 12 12:44:56 UTC 2023
Python version: 3.9.5
MLflow version: 2.7.0
MLflow module location: /local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/mlflow/__init__.py
Tracking URI: databricks
Registry URI: databricks
Databricks runtime version: 11.3.x-scala2.12
Active experiment ID: 2473451629624570
Active run ID: 6467ea86f65145e596c98951284a81a3
Active run artifact URI: dbfs:/databricks/mlflow-tracking/2473451629624570/6467ea86f65145e596c98951284a81a3/artifacts
MLflow environment variables: 
  MLFLOW_CONDA_HOME: /databricks/conda
  MLFLOW_PYTHON_EXECUTABLE: /databricks/spark/scripts/mlflow_python.sh
  MLFLOW_TRACKING_URI: databricks
MLflow dependencies: 
  Flask: 2.2.5
  Jinja2: 3.1.2
  aiohttp: 3.8.5
  alembic: 1.12.0
  boto3: 1.21.18
  click: 8.0.3
  cloudpickle: 2.2.1
  databricks-cli: 0.17.7
  docker: 6.1.3
  entrypoints: 0.3
  gitpython: 3.1.36
  gunicorn: 21.2.0
  importlib-metadata: 6.8.0
  markdown: 3.4.4
  matplotlib: 3.4.3
  numpy: 1.24.3
  packaging: 23.1
  pandas: 1.5.3
  protobuf: 4.21.5
  psutil: 5.8.0
  pyarrow: 7.0.0
  pytz: 2021.3
  pyyaml: 6.0.1
  querystring-parser: 1.2.4
  requests: 2.26.0
  scikit-learn: 0.24.2
  scipy: 1.11.2
  sqlalchemy: 1.4.49
  sqlparse: 0.4.4
  virtualenv: 20.8.0

Code to reproduce issue

import mlflow
import mlflow.xgboost
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

# Load and split data
boston = load_boston()
X = boston.data
y = boston.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# Define and train model
params = {'learning_rate': 0.1, 'n_estimators': 100, 'max_depth': 5}
xgb_r = xgb.XGBRegressor(**params)
xgb_r.fit(X_train, y_train)

# Save model using mlflow
model_path = ".test_saving/saving_1"
mlflow.xgboost.save_model(xgb_r, model_path)

# Load model using mlflow
loaded_model = mlflow.xgboost.load_model(model_path)

# Print parameters to verify
print("Model Params:", loaded_model.get_params())

Stack trace

/local_disk0/.ephemeral_nfs/cluster_libraries/python/lib/python3.9/site-packages/xgboost/core.py:160: UserWarning: [19:33:06] WARNING: /workspace/src/c_api/c_api.cc:1240: Saving into deprecated binary model format, please consider using `json` or `ubj`. Model format will default to JSON in XGBoost 2.2 if not specified.
  warnings.warn(smsg, UserWarning)
Model Params: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'device': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'multi_strategy': None, 'n_estimators': None, 'n_jobs': None, 'num_parallel_tree': None, 'random_state': None, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}

Other info / logs

REPLACE_ME

What component(s) does this bug affect?

What interface(s) does this bug affect?

area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support

What language(s) does this bug affect?

language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages

What integration(s) does this bug affect?

integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations


BenWilson2 · 2023-09-18T23:03:11Z

Hi @nelsoncardenas, this isn't a bug. MLflow uses the legacy binary serialization format, not the .json or .ubj formats that were developed in more recent versions of XGBoost. As mentioned in the XGBoost documentation, when a model is saved in binary format, the additional 2.x params that are exposed through the sklearn wrapper's .get_params() method are not populated.
See: https://xgboost.readthedocs.io/en/stable/tutorials/saving_model.html for more details.

This new functionality may eventually make its way into MLflow, but we currently don't have it prioritized. It would be a somewhat involved implementation: all of the various wrappers and uses of XGBoost models within MLflow would need to be compatible with the change, and the change would need to be backwards compatible so that functionality is not broken for older versions of XGBoost in serving environments.

BenWilson2 · 2023-09-20T15:16:07Z

Hi @nelsoncardenas, I just did some validation of mlflow.xgboost.log_model() to determine how well the .ubj format is supported when the optional model_format argument is specified.

I was able to save and load successfully with both json and ubj formats (which retain the metadata params that you're interested in).

Could you try logging your model with this option (example below)?

mlflow.xgboost.log_model(xgb_model=model, 
                         artifact_path=artifact_path, 
                         input_example=train_x.iloc[[0]],
                         model_format="ubj",
                         metadata={"model_data_version": 1}
                         )
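
The same option can be applied to the local-path reproduction from the original report, since mlflow.xgboost.save_model also accepts a model_format argument. The following is a minimal sketch rather than a verified fix: `save_and_reload` is a hypothetical helper name, and whether every hyperparameter comes back populated may depend on the installed XGBoost version.

```python
def save_and_reload(model, path, model_format="ubj"):
    """Save an XGBoost model with MLflow in the given format, then load it back."""
    # Imported lazily so the helper can be defined even where MLflow is not installed.
    import mlflow.xgboost

    # model_format is passed through to XGBoost; "xgb" is MLflow's default
    # (the legacy binary format), while "json" and "ubj" are the newer formats.
    mlflow.xgboost.save_model(model, path, model_format=model_format)
    return mlflow.xgboost.load_model(path)
```

With the reporter's script, this would be used as `reloaded = save_and_reload(xgb_r, ".test_saving/saving_ubj")` followed by `print(reloaded.get_params())`. The path here is made up for illustration, and it should be a fresh one because save_model is expected to refuse a directory that already contains a model.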

nelsoncardenas · 2023-09-25T19:01:36Z

Hi @BenWilson2, thanks for your effort in helping me out. I managed to solve it by using version 1.7 of xgboost. I haven't had time to revisit this yet (because I need to create a new Databricks cluster), but it's on my radar.

github-actions · 2023-09-26T00:11:53Z

@mlflow/mlflow-team Please assign a maintainer and start triaging this issue.

nelsoncardenas added the "bug" (Something isn't working) label Sep 18, 2023

BenWilson2 closed this as completed Oct 6, 2023
