[WIP] [Draft] Autologging functionality for scikit-learn integration with XGBoost and LightGBM #4885
Conversation
mlflow/sklearn/__init__.py
Outdated
# copied from mlflow.xgboost
# link: https://github.com/mlflow/mlflow/blob/master/mlflow/xgboost.py#L392
# avoid cyclic import
def record_eval_results(eval_results, metrics_logger):
I think we should reorganize the helper functions in xgboost.py to make it easier to reuse them from other modules.
I agree. It is better to move record_eval_results and log_feature_importance_plot into a new file rather than keeping them in mlflow.xgboost. Otherwise, there would be a cyclic import issue. Do you have any idea where we should put them? Maybe a file in mlflow.utils?
Regarding the feature importance plot, XGBoost sklearn estimators provide normalized feature importance via the feature_importances_ property. The feature importance obtained from Booster.get_score() isn't normalized.
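As a side illustration of that point (not code from this PR), here is a minimal sketch comparing the two importance sources; note that the importance type reported by feature_importances_ can differ across XGBoost versions, so the numbers are not guaranteed to match.

import numpy as np
import xgboost as xgb

X = np.random.rand(100, 4)
y = X @ np.array([1.0, 2.0, 0.0, 0.5])
regressor = xgb.XGBRegressor(n_estimators=10).fit(X, y)

# Un-normalized scores keyed by feature name ("f0", "f1", ...);
# features that were never used in a split may be missing from the dict.
raw = regressor.get_booster().get_score(importance_type="weight")
total = sum(raw.values())
normalized = {name: score / total for name, score in raw.items()}

print(normalized)                      # manually normalized from Booster.get_score()
print(regressor.feature_importances_)  # already-normalized sklearn-style property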
mlflow/sklearn/__init__.py
Outdated
@@ -827,6 +839,8 @@ def autolog(
    silent=False,
    max_tuning_runs=5,
    log_post_training_metrics=True,
    xgboost_estimator=False,
    lightgbm_estimator=False,
lightgbm_estimator=False,
Can we address LightGBM in a separate PR?
That's no problem!
Rather than exposing xgboost_estimator in the top-level sklearn API, can we define a separate internal _autolog() API that contains this parameter and can be called by mlflow.sklearn.autolog() and mlflow.xgboost.autolog()? This way, mlflow.sklearn.autolog() only has one behavior: enable autologging for scikit-learn estimators, and mlflow.xgboost.autolog() has one behavior: enable autologging for XGBoost estimators (and other types of XGBoost models).
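For illustration, a rough sketch of what that split could look like (all names, signatures, and parameters below are assumptions for this sketch, not MLflow's actual internals):

def _autolog(flavor_name="sklearn", log_models=True, disable=False, silent=False):
    # Shared patching logic would live here; flavor_name decides whether plain
    # scikit-learn estimators or xgboost.sklearn estimators get patched.
    ...

def sklearn_autolog(**kwargs):
    # stands in for mlflow.sklearn.autolog(): scikit-learn estimators only
    _autolog(flavor_name="sklearn", **kwargs)

def xgboost_autolog(**kwargs):
    # stands in for mlflow.xgboost.autolog(): Booster training plus
    # xgboost.sklearn estimators
    _autolog(flavor_name="xgboost", **kwargs)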
Sure. This plan sounds good to me. I will revise the implementation accordingly. Thanks!
Hi @dbczumar @harupy, thank you for your feedback! I have revised the implementation.
I'd like to hear your feedback on this new version. Thanks! 🙏
mlflow/xgboost.py
Outdated
# save xgboost.sklearn estimators
if xgb_model.__module__ == "xgboost.sklearn":
    import mlflow.sklearn

    extra_xgboost_pip_requirements = get_default_pip_requirements()
    if extra_pip_requirements:
        extra_xgboost_pip_requirements += extra_pip_requirements
    mlflow.sklearn.save_model(
        sk_model=xgb_model,
        path=path,
        conda_env=conda_env,
        mlflow_model=mlflow_model,
        serialization_format=serialization_format,
        signature=signature,
        input_example=input_example,
        pip_requirements=pip_requirements,
        extra_pip_requirements=extra_xgboost_pip_requirements,
    )
    return
Fortunately, I don't think we need this logic since XGBRegressor and the other XGBoost scikit-learn models have a save_model() method. We should be able to use the existing mlflow.xgboost.save_model() to save XGBoost scikit-learn estimators.
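A quick sketch of that point (illustration only; the file name and the JSON format are assumptions and depend on the XGBoost version): the scikit-learn wrappers expose the same save_model() / load_model() API as xgboost.Booster.

import numpy as np
import xgboost as xgb

X, y = np.random.rand(50, 3), np.random.rand(50)
reg = xgb.XGBRegressor(n_estimators=5).fit(X, y)

reg.save_model("xgb_sklearn_model.json")   # native XGBoost serialization, no pickle needed

reloaded = xgb.XGBRegressor()
reloaded.load_model("xgb_sklearn_model.json")
print(type(reloaded))                      # round-trips as an XGBRegressor, not a Booster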
mlflow/xgboost.py
Outdated
@@ -99,6 +101,7 @@ def save_model(
    conda_env=None,
    mlflow_model=None,
    signature: ModelSignature = None,
    serialization_format: str = SERIALIZATION_FORMAT_CLOUDPICKLE,
I think we can drop this argument since XGBoost scikit-learn models can be saved using the same <xgboost_model>.save_model() API as other XGBoost models (see https://github.com/mlflow/mlflow/pull/4885/files#r730028775). Instead, we should make some changes that allow XGBoost sklearn models to be loaded using mlflow.xgboost.load_model() and used for inference via the pyfunc representation. See https://github.com/mlflow/mlflow/pull/4885/files#r730036638.
@@ -107,7 +110,7 @@ def save_model(
    Save an XGBoost model to a path on the local file system.

    :param xgb_model: XGBoost model (an instance of `xgboost.Booster`_) to be saved.
                      Note that models that implement the `scikit-learn API`_ are not supported.
I think the reason we said this before is that we don't currently handle loading of XGBoost scikit-learn models correctly.
For example:
from pprint import pprint
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import numpy as np
import mlflow
import mlflow.xgboost
boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target)
X_train, X_test, y_train, y_test = train_test_split(X, y)
regressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3
)
regressor.fit(X_train, y_train)
print(type(regressor))

with mlflow.start_run():
    x = mlflow.xgboost.log_model(regressor, "foo")
    uri = "runs:/" + mlflow.active_run().info.run_id + "/foo"
    loaded_model = mlflow.xgboost.load_model(uri)
    print(type(loaded_model))
This prints:
<class 'xgboost.sklearn.XGBRegressor'>
<class 'xgboost.core.Booster'>
Which indicates that the XGBRegressor model is correctly saved in MLflow format but is incorrectly reloaded as an xgboost.core.Booster object due to hardcoded logic here:
Line 260 in 706758e
model = xgb.Booster()
Can we update the model flavor specification to include the model class name and then, when the model is loaded, instantiate an instance of the model class and call load_model()? This should provide full support for saving / loading XGBoost scikit-learn models.
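A rough sketch of that idea (illustration only; the helper names and the model_class key are assumptions, not the actual MLflow implementation):

import xgboost as xgb

def _save(xgb_model, model_data_path, flavor_conf):
    # Persist with XGBoost's native format and record which class to rebuild later,
    # e.g. "Booster", "XGBRegressor", or "XGBClassifier".
    xgb_model.save_model(model_data_path)
    flavor_conf["model_class"] = type(xgb_model).__name__

def _load(model_data_path, flavor_conf):
    # Default to Booster so models saved before the field existed still load.
    model_class = flavor_conf.get("model_class", "Booster")
    model = getattr(xgb, model_class)()
    model.load_model(model_data_path)
    return model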
The other part that we need to address is the pyfunc inference format for XGBoost models. Currently, pyfunc inference assumes that booster classes are being used and converts the input into an xgb.DMatrix object. Instead, for scikit-learn XGBoost models, we should pass the input directly through to the model without converting it to a DMatrix.
For this purpose, it may be useful to record an additional piece of flavor state indicating the model type: "xgboost" or "xgboost-sklearn".
This sounds good to me! We can address this issue in a separate PR. If necessary, I'd be happy to help.
For this purpose, it may be useful to record an additional piece of flavor state indicating the model type: "xgboost" or "xgboost-sklearn".
Can we just inspect the model type?
class _NewXGBModelWrapper:
    def __init__(self, xgb_model):
        self.xgb_model = xgb_model

    def predict(self, dataframe):
        import xgboost as xgb

        # Only Booster objects need the input wrapped in a DMatrix;
        # sklearn-style estimators accept the dataframe directly.
        if isinstance(self.xgb_model, xgb.Booster):
            dataframe = xgb.DMatrix(dataframe)
        return self.xgb_model.predict(dataframe)
Can we patch log_model, save_model, load_model, and _load_pyfunc directly onto the imported mlflow.sklearn module from within mlflow.xgboost? Then saving / loading the model can also be handled by mlflow.xgboost without using sklearn functions.
@jwyyy Awesome work! I left some initial feedback about saving / loading / performing inference via pyfunc with XGBoost scikit-learn models. Mainly, I think we should expand mlflow.xgboost.save_model() to support XGBoost + scikit-learn without having to call mlflow.sklearn.save_model(), as proposed here: https://github.com/mlflow/mlflow/pull/4885/files#r730036638.
For modularity, it may be easier to break this part into a separate PR. Thank you so much for your contributions so far!
@dbczumar Thank you for your review and detailed feedback! I will address them in the next round of revision. Will comment more if I have questions.
Can you explain a little bit what this part in the sentence above refers to? Are you referring to …?
Hi @dbczumar, I studied your suggestions more over the weekend and have more thoughts to share. I think we agreed that …. This means that when we load it back, it should be loaded using …. However, this behavior is different from …. That is, …. Originally, the issue (the boston.txt example in #4296) was when the XGBoost sklearn estimator is called with …. To make it uniform across all cases, we should decide whether ….
For (1), we could update …. Thanks again for your feedback! Looking forward to hearing more discussion from you and @harupy.
Hi @jwyyy, the …. Let me know if you have any questions here. Thank you for your contributions!
Hi @dbczumar, thank you for your clarification! Now this internal behavior is clearer to me, and it makes a lot more sense. I will revise the implementation soon. Will let you know if other issues come up.
Hi @dbczumar @harupy, I revised the implementation as per our discussion and made a new commit. Here is a summary of what has been changed:
(When …)
Please let me know if I missed anything. I'd like to hear more feedback and suggestions from you! Thanks!
# initialize autologging for XGBoost sklearn estimators
import mlflow.sklearn

_wrap_patch(mlflow.sklearn, "log_model", log_model)
Maybe we can use setattr() here?
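For reference, the setattr() suggestion amounts to a plain attribute assignment on the imported module (sketch only; replacement_fn is a stand-in for whatever function mlflow.xgboost would supply):

import mlflow.sklearn

def replacement_fn(*args, **kwargs):
    ...

setattr(mlflow.sklearn, "log_model", replacement_fn)
# equivalent to:
mlflow.sklearn.log_model = replacement_fn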
Hi @jwyyy, apologies for the delay here. We're working to release MLflow 1.21.0. I'll make sure to provide thorough PR feedback within the next few days.
Sure. That's no problem! Looking forward to the new release 👍 I will make changes accordingly once we have more discussion. Thanks in advance!
if MODEL_CLASS == "Booster":
    model = xgb.Booster()
else:
    model = getattr(xgb, MODEL_CLASS)()
Instead of using a global variable, which has the downside of limiting how many times users can load different models (like you mentioned), can we read the model_class attribute of the flavor specification from the MLflow Model?
I realize this is challenging for _load_pyfunc because the pyfunc model only gives us access to the XGBoost model path. This is because we pass the data keyword argument to pyfunc.add_to_model here:
Line 159 in 40df337
data=model_data_subpath,
mlflow/mlflow/pyfunc/__init__.py
Line 666 in 40df337
data_path = os.path.join(local_path, conf[DATA]) if (DATA in conf) else local_path
aa563fb demonstrates how we can safely stop adding the data keyword to mlflow.pyfunc.add_to_model while maintaining backwards compatibility with older models that were saved with the data field.
@jwyyy can we split this work into a separate PR?
_wrap_patch(mlflow.sklearn, "log_model", log_model)
_wrap_patch(mlflow.sklearn, "save_model", save_model)
_wrap_patch(mlflow.sklearn, "get_default_pip_requirements", get_default_pip_requirements)
_wrap_patch(mlflow.sklearn, "get_default_conda_env", get_default_conda_env)
Rather than patching methods from mlflow.sklearn, can we instead add logic inside of mlflow.sklearn._autolog() that checks the model class and calls mlflow.xgboost.log_model() if the model class comes from the XGBoost scikit-learn integration?
Perhaps we can work on this after addressing https://github.com/mlflow/mlflow/pull/4885/files#r737029226 as part of a separate PR.
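A hypothetical sketch of that dispatch (the function name and artifact path are assumptions; the real check would live inside the sklearn autologging patch):

def _log_fitted_model(model, artifact_path="model"):
    # Route xgboost.sklearn estimators to the xgboost flavor; everything else
    # keeps using the sklearn flavor. Local imports keep module loading acyclic.
    if type(model).__module__.startswith("xgboost"):
        import mlflow.xgboost
        mlflow.xgboost.log_model(model, artifact_path)
    else:
        import mlflow.sklearn
        mlflow.sklearn.log_model(model, artifact_path)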
One potential issue with this approach is cyclic import: to use the sklearn autologging routine, we call import mlflow.sklearn inside mlflow.xgboost. To use mlflow.xgboost.log_model() inside mlflow.sklearn, we would need to call import mlflow.xgboost inside mlflow.sklearn. It is not necessarily a problem, depending on how we implement it. But would it be cleaner to move mlflow.xgboost.log_model() and .save_model() to a util file?
I realized that cyclic import may not be a problem, since each import occurs only when its enclosing function is called.
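That is, the function-local import pattern (hypothetical helper below) defers module resolution until call time, so mlflow.sklearn and mlflow.xgboost can each reference the other without an import cycle at module load time:

def log_xgb_sklearn_estimator(model, artifact_path):
    import mlflow.xgboost  # deferred: resolved when the function runs, not at import time
    return mlflow.xgboost.log_model(model, artifact_path)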
if isinstance(self.xgb_model, xgb.Booster):
    return self.xgb_model.predict(xgb.DMatrix(dataframe))
else:
    return self.xgb_model.predict(dataframe)
This looks awesome! Thanks @jwyyy!
@jwyyy Awesome progress! Thank you for your ongoing contribution! I've left a few more comments. Can we work on adding support for logging / loading XGBoost scikit-learn models via mlflow.xgboost in a separate PR for reviewability purposes?
@dbczumar Thanks for your feedback! I am working on the revision. Will let you know if there is any problem. I can create separate PRs to address each small issue, but we can still leave this PR as a template / draft. Since adding sklearn autologging for LightGBM would be very similar, I think it is a good idea to keep this PR as a roadmap.
Closing this PR, since #4296 is now resolved.
Signed-off-by: Junwen Yao, jwyiao@gmail.com.
What changes are proposed in this pull request?
This PR will enable autologging for XGBoost and LightGBM sklearn estimators. Resolves #4296.
(Part 1) autologging for XGBoost sklearn estimators.
(Part 2) working on autologging for LightGBM sklearn estimators.
How is this patch tested?
A short example is provided. Tests will be added later.
Release Notes
Is this a user-facing change?
This PR will enable autologging for XGBoost (LightGBM) sklearn estimators using mlflow.xgboost.autolog() (mlflow.lightgbm.autolog()).
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts: Artifact stores and artifact logging
area/build: Build and test infrastructure for MLflow
area/docs: MLflow documentation pages
area/examples: Example code
area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
area/models: MLmodel format, model serialization/deserialization, flavors
area/projects: MLproject format, project running backends
area/scoring: MLflow Model server, model deployment tools, Spark UDFs
area/server-infra: MLflow Tracking server backend
area/tracking: Tracking Service, tracking client APIs, autologging
Interface
area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
area/windows: Windows support
Language
language/r: R APIs and clients
language/java: Java APIs and clients
language/new: Proposals for new client languages
Integrations
integrations/azure: Azure and Azure ML integrations
integrations/sagemaker: SageMaker integrations
integrations/databricks: Databricks integrations
How should the PR be classified in the release notes? Choose one:
rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
rn/feature - A new user-facing feature worth mentioning in the release notes
rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
rn/documentation - A user-facing documentation change worth mentioning in the release notes