[WIP] [Draft] Autologging functionality for scikit-learn integration with XGBoost and LightGBM #4885

Closed
wants to merge 16 commits

Conversation

@jwyyy (Contributor) commented Oct 12, 2021

Signed-off-by: Junwen Yao, jwyiao@gmail.com.

What changes are proposed in this pull request?

This PR will enable autologging for XGBoost and LightGBM sklearn estimators. Resolves #4296.

(Part 1) autologging for XGBoost sklearn estimators.
(Part 2) working on autologging for LightGBM sklearn estimators.

How is this patch tested?

A short example is provided. Tests will be added later.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

This PR enables autologging for XGBoost and LightGBM sklearn estimators via mlflow.xgboost.autolog() and mlflow.lightgbm.autolog(), respectively.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

@github-actions github-actions bot added area/examples Example code area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs. labels Oct 12, 2021
@jwyyy jwyyy marked this pull request as draft October 12, 2021 04:48
# copied from mlflow.xgboost
# link: https://github.com/mlflow/mlflow/blob/master/mlflow/xgboost.py#L392
# avoid cyclic import
def record_eval_results(eval_results, metrics_logger):
Member

I think we should reorganize helper functions in xgboost.py to make it easier to reuse them from other modules.

Contributor Author

I agree. It would be better to move record_eval_results and log_feature_importance_plot into a new file rather than keeping them in mlflow.xgboost; otherwise there would be a cyclic import issue. Do you have any idea where we should put them? Maybe a file in mlflow.utils?

Regarding the feature importance plot, XGBoost sklearn estimators provide normalized feature importance via the feature_importances_ property, while the importance obtained from Booster.get_score() isn't normalized.
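A minimal sketch contrasting the two importance APIs (synthetic dataset; illustrative, not part of this PR):

import numpy as np
import xgboost as xgb
from sklearn.datasets import make_regression

# Fit a small sklearn-style XGBoost model on synthetic data
X, y = make_regression(n_samples=100, n_features=4, random_state=0)
reg = xgb.XGBRegressor(n_estimators=10).fit(X, y)

print(reg.feature_importances_)               # normalized: entries sum to 1.0
print(np.isclose(reg.feature_importances_.sum(), 1.0))
print(reg.get_booster().get_score())          # raw per-feature scores, not normalized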

@@ -827,6 +839,8 @@ def autolog(
silent=False,
max_tuning_runs=5,
log_post_training_metrics=True,
xgboost_estimator=False,
lightgbm_estimator=False,
@harupy (Member) Oct 12, 2021

Suggested change: remove lightgbm_estimator=False,

Can we address LightGBM in a separate PR?

Contributor Author

That's no problem!

@dbczumar (Collaborator) Oct 12, 2021

Rather than exposing xgboost_estimator in the top-level sklearn API, can we define a separate internal _autolog() API that contains this parameter and can be called by mlflow.sklearn.autolog() and mlflow.xgboost.autolog()? This way, mlflow.sklearn.autolog() only has one behavior: enable autologging for scikit-learn estimators, and mlflow.xgboost.autolog() has one behavior: enable autologging for XGBoost estimators (and other types of XGBoost models).
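A hedged sketch of the proposed split; the flavor_name parameter and function bodies are illustrative, not MLflow's actual implementation:

# mlflow/sklearn.py
def _autolog(flavor_name="sklearn"):
    # internal entry point: shared patching logic, parameterized by flavor
    print(f"enabling autologging for {flavor_name} estimators")

def autolog():
    # public API: one behavior -- autologging for scikit-learn estimators
    _autolog(flavor_name="sklearn")

# mlflow/xgboost.py would call the internal helper for XGBoost estimators:
def xgboost_autolog():
    # public API: one behavior -- autologging for XGBoost estimators
    _autolog(flavor_name="xgboost")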

Contributor Author

Sure. This plan sounds good to me. I will revise the implementation accordingly. Thanks!

@jwyyy jwyyy changed the title [WIP] Autologging functionality for scikit-learn integration with XGBoost and LightGBM [WIP] Autologging functionality for scikit-learn integration with XGBoost and LightGBM (Part 1) Oct 12, 2021
@github-actions github-actions bot removed the rn/feature Mention under Features in Changelogs. label Oct 12, 2021
@jwyyy (Contributor Author) commented Oct 12, 2021

Hi @dbczumar @harupy, thank you for your feedback! I have revised the implementation.

  1. LightGBM placeholders are removed from this PR. Will make another one for it separately.
  2. record_eval_results and log_feature_importance_plot are moved to mlflow.utils._xgboost_utils.
  3. _autolog is added to mlflow.sklearn to separate autologging behaviors for xgboost and sklearn.
  4. Add xgboost pip requirements to sklearn autologging when the estimator is from xgboost.sklearn.
  5. Tested the autologging for sklearn and xgboost examples.
  6. (sorry for messy commit history 😓 )

I'd like to hear your feedback on this new version. Thanks! 🙏

@github-actions github-actions bot added the rn/feature Mention under Features in Changelogs. label Oct 13, 2021
@jwyyy jwyyy requested review from harupy and dbczumar October 13, 2021 17:19
@jwyyy jwyyy marked this pull request as ready for review October 14, 2021 21:24
Comment on lines 147 to 164
# save xgboost.sklearn estimators
if xgb_model.__module__ == "xgboost.sklearn":
    import mlflow.sklearn

    extra_xgboost_pip_requirements = get_default_pip_requirements()
    if extra_pip_requirements:
        extra_xgboost_pip_requirements += extra_pip_requirements
    mlflow.sklearn.save_model(
        sk_model=xgb_model,
        path=path,
        conda_env=conda_env,
        mlflow_model=mlflow_model,
        serialization_format=serialization_format,
        signature=signature,
        input_example=input_example,
        pip_requirements=pip_requirements,
        extra_pip_requirements=extra_xgboost_pip_requirements,
    )
    return
Collaborator

Fortunately, I don't think we need this logic since XGBRegressor and the other XGBoost scikit-learn models have a save_model() method. We should be able to use the existing mlflow.xgboost.save_model() to save XGBoost scikit-learn estimators.
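For reference, a minimal sketch (not from this PR) of the native round trip this comment relies on; the file name is arbitrary:

import xgboost as xgb
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=50, n_features=3, random_state=0)
reg = xgb.XGBRegressor(n_estimators=5).fit(X, y)
reg.save_model("model.json")      # same native API as xgb.Booster.save_model()

loaded = xgb.XGBRegressor()
loaded.load_model("model.json")   # round-trips as the sklearn estimator
print(type(loaded))               # <class 'xgboost.sklearn.XGBRegressor'>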

@@ -99,6 +101,7 @@ def save_model(
conda_env=None,
mlflow_model=None,
signature: ModelSignature = None,
serialization_format: str = SERIALIZATION_FORMAT_CLOUDPICKLE,
@dbczumar (Collaborator) Oct 15, 2021

I think we can drop this argument since XGBoost scikit-learn models can be saved using the same <xgboost_model>.save_model() API as other XGBoost models. (See https://github.com/mlflow/mlflow/pull/4885/files#r730028775). Instead, we should make some changes that allow XGBoost sklearn models to be loaded using mlflow.xgboost.load_model() and used for inference via the pyfunc representation. See https://github.com/mlflow/mlflow/pull/4885/files#r730036638.

@@ -107,7 +110,7 @@ def save_model(
Save an XGBoost model to a path on the local file system.

:param xgb_model: XGBoost model (an instance of `xgboost.Booster`_) to be saved.
Note that models that implement the `scikit-learn API`_ are not supported.
Collaborator

I think the reason we said this before is that we don't currently handle loading of XGBoost scikit-learn models correctly.

For example:

import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_boston
from sklearn.model_selection import train_test_split

import mlflow
import mlflow.xgboost


boston = load_boston()
X = pd.DataFrame(boston.data, columns=boston.feature_names)
y = pd.Series(boston.target)
X_train, X_test, y_train, y_test = train_test_split(X, y)

regressor = xgb.XGBRegressor(
    n_estimators=100,
    reg_lambda=1,
    gamma=0,
    max_depth=3,
)
regressor.fit(X_train, y_train)

print(type(regressor))

# Log the sklearn-style model via the xgboost flavor, then reload it
with mlflow.start_run():
    mlflow.xgboost.log_model(regressor, "foo")
    uri = "runs:/" + mlflow.active_run().info.run_id + "/foo"

loaded_model = mlflow.xgboost.load_model(uri)
print(type(loaded_model))

This prints:

<class 'xgboost.sklearn.XGBRegressor'>
<class 'xgboost.core.Booster'>

This indicates that the XGBRegressor model is correctly saved in MLflow format but is incorrectly reloaded as an xgboost.core.Booster object due to hardcoded logic here:

model = xgb.Booster()

Can we update the model flavor specification to include the model class name and then, when the model is loaded, instantiate an instance of that class and call load_model()? This should provide full support for saving / loading XGBoost scikit-learn models.
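A hedged sketch of that idea; the helper name and model_class argument are hypothetical, not this PR's code:

import xgboost as xgb

def _load_xgboost_model(model_path, model_class="Booster"):
    # Instantiate the class recorded in the flavor spec (e.g. "XGBRegressor")
    # and restore its weights; Booster and the sklearn estimators both
    # expose a load_model() method.
    model = getattr(xgb, model_class)()
    model.load_model(model_path)
    return model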

@dbczumar (Collaborator) Oct 15, 2021

The other part that we need to address is the pyfunc inference format for XGBoost models. Currently, pyfunc inference assumes that booster classes are being used and converts the input into an xgb.DMatrix object. Instead, for scikit-learn XGBoost models, we should pass the input directly through to the model without converting it to a DMatrix.

For this purpose, it may be useful to record an additional piece of flavor state indicating the model type: "xgboost" or "xgboost-sklearn".

@harupy (Member) Oct 20, 2021

This sounds good to me! We can address this issue in a separate PR. If necessary, I'd be happy to help.

Member

For this purpose, it may be useful to record an additional piece of flavor state indicating the model type: "xgboost" or "xgboost-sklearn".

Can we just inspect the model type?

class _NewXGBModelWrapper:
    def __init__(self, xgb_model):
        self.xgb_model = xgb_model

    def predict(self, dataframe):
        import xgboost as xgb

        # Booster models require a DMatrix; sklearn estimators take raw input
        if isinstance(self.xgb_model, xgb.Booster):
            dataframe = xgb.DMatrix(dataframe)

        return self.xgb_model.predict(dataframe)

@jwyyy (Contributor Author) Oct 20, 2021

Can we patch log_model, save_model, load_model, and _load_pyfunc directly onto the loaded mlflow.sklearn module inside mlflow.xgboost? Then saving / loading models can also be handled by mlflow.xgboost without using sklearn functions.

@dbczumar (Collaborator) left a comment

@jwyyy Awesome work! I left some initial feedback about saving / loading / performing inference via pyfunc with XGBoost scikit-learn models. Mainly, I think we should expand mlflow.xgboost.save_model() to support XGBoost + scikit-learn without having to call mlflow.sklearn.save_model(), as proposed here: https://github.com/mlflow/mlflow/pull/4885/files#r730036638.

For modularity, it may be easier to break this part into a separate PR. Thank you so much for your contributions so far!

@jwyyy (Contributor Author) commented Oct 15, 2021

@dbczumar Thank you for your review and detailed feedback! I will address them in the next round of revision. Will comment more if I have questions.

For modularity, it may be easier to break this part into a separate PR. Thank you so much for your contributions so far!

Can you explain a little what "this part" refers to in the sentence above? Are you referring to the xgboost flavor vs. the xgboost-sklearn flavor?

@jwyyy (Contributor Author) commented Oct 18, 2021

Hi @dbczumar, I studied your suggestions more over the weekend and have more thoughts to share.

I think we agreed that the mlflow.sklearn autologging routine is used for logging XGBoost sklearn estimators. This logs the model during training using the mlflow.sklearn flavor (e.g. fit_mlflow() patched onto the original fit()). In particular, in _log_posttraining_metadata(), it is mlflow.sklearn.log_model() that is called (L1285). When an XGBoost sklearn estimator's fit() is patched, the estimator will be saved/logged as a sklearn estimator in mlflow.

This means that when we load it back, it should be loaded using the mlflow.sklearn API to keep everything consistent with what was logged during training. It also avoids DMatrix in inference, because predict() can take inputs without conversion (an xgboost.sklearn implementation).

However, this behavior is different from mlflow.xgboost.log_model() and .save_model() as you have shown in this example.

That is, mlflow.sklearn saves/logs models in the mlflow.sklearn flavor, but mlflow.xgboost saves/logs models as Boosters. (This also explains why serialization_format was added.) Inconsistency can happen if users try to load a model that was previously saved using mlflow.sklearn via the xgboost.sklearn API.

Originally, the issue (the boston.txt example in #4296) was that when an XGBoost sklearn estimator is trained with fit(), the training isn't autologged. (The official sklearn autologging example logs the model automatically.) Explicitly calling mlflow.xgboost.log_model() not only logs the model as a Booster but is also inconsistent with our proposal of using mlflow.sklearn to handle autologging. (Plus, it requires users to manually log training, and there can be early stopping parameters, which need to be checked and logged at each iteration.)

To make it uniform across all cases, we should decide whether

  1. saving/loading models are all in mlflow.sklearn flavor in both mlflow.sklearn and mlflow.xgboost for XGBoost sklearn estimators; or
  2. all in mlflow.xgboost flavor, i.e., using the provided load_model() and save_model() method in xgboost.sklearn.

For (1), we could update mlflow.xgboost.load_model() by adding a new model flavor xgboost-sklearn as you suggested. But I think we would essentially still need to utilize functions defined in mlflow.sklearn.

Thanks again for your feedback! Looking forward to hearing more discussions from you and @harupy .

@dbczumar (Collaborator)

Hi @jwyyy, the mlflow.xgboost.autolog() routine should be used to trigger autologging for XGBoost scikit-learn estimators, but this should be handled by an internal method within the mlflow.sklearn module called mlflow.sklearn._autolog(). When mlflow.sklearn._autolog() sees that the model class comes from xgboost, it should call mlflow.xgboost.log_model() instead of mlflow.sklearn.log_model(), and mlflow.xgboost.load_model() / the XGBoost pyfunc representation should be extended to support XGBoost scikit-learn models, as described in https://github.com/mlflow/mlflow/pull/4885/files#r730036638.

Let me know if you have any questions here. Thank you for your contributions!
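A minimal sketch of the dispatch described above; the helper name and module check are illustrative:

def _log_posttraining_model(estimator, artifact_path):
    # Route XGBoost sklearn estimators to the xgboost flavor; everything
    # else keeps the sklearn flavor. Imports are deferred to avoid cycles.
    if estimator.__class__.__module__.startswith("xgboost"):
        import mlflow.xgboost
        mlflow.xgboost.log_model(estimator, artifact_path)
    else:
        import mlflow.sklearn
        mlflow.sklearn.log_model(estimator, artifact_path)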

@jwyyy (Contributor Author) commented Oct 20, 2021

When mlflow.sklearn._autolog() sees that the model class comes from xgboost, it should call mlflow.xgboost.log_model() instead of mlflow.sklearn.log_model(), and mlflow.xgboost.load_model() / the XGBoost pyfunc representation should be extended to support XGBoost scikit-learn models.

Hi @dbczumar, thank you for your clarification! Now this internal behavior is clearer to me, and it makes a lot more sense. I will revise the implementation soon. Will let you know if other issues come up.

@jwyyy (Contributor Author) commented Oct 20, 2021

Hi @dbczumar @harupy, I revised the implementation per our discussion and made a new commit. Here is a summary of what has changed:

  1. The mlflow.sklearn saving / loading calls are now dropped and directly patched with the corresponding saving / loading functions in mlflow.xgboost. I didn't use safe_patch(), since inside the sklearn autologging routine safe_patch() will be used to patch fit() anyway; plus, the input arguments are not the same. This resolves the first issue: using mlflow.xgboost for both Boosters and XGBoost sklearn estimators.
  2. Regarding the model class specification, I chose to add a new module-level variable MODEL_CLASS to mlflow.xgboost. This also avoids adding new model-specific flavors. An extra flavor key-value, specifying which XGBoost sklearn model is used, will be logged in MLmodel (an example of the new MLmodel file):
...
flavors:
  python_function:
    data: model.xgb
    env: conda.yaml
    loader_module: mlflow.xgboost
    python_version: 3.6.13
  xgboost:
    data: model.xgb
    model_class: XGBRegressor # new 
    xgb_version: 1.4.2
...

(When xgboost.train() is used, model_class is Booster.) When the model is loaded, MODEL_CLASS will be set to the model class it reads. This resolves the issue of hard-coded saving / loading of XGBoost models. Alternatively, we could ask users to specify the model class, but I think it is better to have mlflow handle the model class specification directly. For example, the user who uses an XGBoost model might not be the person who trained it; manually setting the model class requires end users to know beforehand which class was used, making the process not very smart.

  3. Since MODEL_CLASS is new, it should be correctly handled in mlflow.pyfunc.load_model(). A few lines were added to set MODEL_CLASS before calling the module-level _load_pyfunc(). The prediction part is implemented as @harupy suggested.
  4. One drawback of this approach is that once MODEL_CLASS is set, it should not be used to log other models. So if users want to log another XGBoost model, they need to create a new autologging routine. (I suppose this case is rare.)

Please let me know if I missed anything. I'd like to hear more feedback and suggestions from you! Thanks!


# initialize autologging for XGBoost sklearn estimators
import mlflow.sklearn
_wrap_patch(mlflow.sklearn, "log_model", log_model)
Contributor Author

Maybe we can use setattr() here?

@dbczumar (Collaborator)
Hi @jwyyy, apologies for the delay here. We're working to release MLflow 1.21.0. I'll make sure to provide thorough PR feedback within the next few days.

@jwyyy (Contributor Author) commented Oct 22, 2021


Sure. That's no problem! Looking forward to the new release 👍 I will make changes accordingly once we have more discussion. Thanks in advance!

Comment on lines +259 to +262
if MODEL_CLASS == "Booster":
    model = xgb.Booster()
else:
    model = getattr(xgb, MODEL_CLASS)()
@dbczumar (Collaborator) Oct 27, 2021

Instead of using a global variable, which has the downside of limiting how many times users can load different models (like you mentioned), can we read the model_class attribute of the flavor specification from the MLflow Model?

I realize this is challenging for _load_pyfunc because the pyfunc model only gives us access to the XGBoost model path. This is because we pass the data keyword argument to pyfunc.add_to_model here:

data=model_data_subpath,

which causes special logic to execute when loading a pyfunc model here:

data_path = os.path.join(local_path, conf[DATA]) if (DATA in conf) else local_path

aa563fb demonstrates how we can safely stop adding the data keyword to mlflow.pyfunc.add_to_model while maintaining backwards compatibility with older models that were saved with the data field.

@jwyyy can we split this work into a separate PR?
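A hedged sketch of reading model_class from the flavor configuration instead of a global; the helper name is hypothetical and error handling is omitted:

import os
import xgboost as xgb
from mlflow.models import Model

def _load_model_from_local_path(local_path):
    # Read the flavor configuration recorded in the MLmodel file at save time
    mlflow_model = Model.load(os.path.join(local_path, "MLmodel"))
    flavor_conf = mlflow_model.flavors["xgboost"]
    model = getattr(xgb, flavor_conf.get("model_class", "Booster"))()
    model.load_model(os.path.join(local_path, flavor_conf["data"]))
    return model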

Comment on lines +560 to +567
_wrap_patch(mlflow.sklearn, "log_model", log_model)
_wrap_patch(mlflow.sklearn, "save_model", save_model)
_wrap_patch(mlflow.sklearn, "get_default_pip_requirements", get_default_pip_requirements)
_wrap_patch(mlflow.sklearn, "get_default_conda_env", get_default_conda_env)
Collaborator

Rather than patching methods from mlflow.sklearn, can we instead add logic inside of mlflow.sklearn._autolog() that checks the model class and calls mlflow.xgboost.log_model() if the model class comes from the XGBoost scikit-learn integration?

Perhaps we can work on this after addressing https://github.com/mlflow/mlflow/pull/4885/files#r737029226 as part of a separate PR.

Contributor Author

One potential issue with this approach is cyclic import: to use the sklearn autologging routine, we call import mlflow.sklearn inside mlflow.xgboost; to use mlflow.xgboost.log_model() inside mlflow.sklearn, we would need to call import mlflow.xgboost inside mlflow.sklearn. It is not necessarily a problem, depending on how we implement it. But would it be cleaner to move mlflow.xgboost.log_model() and .save_model() to a util file?

Contributor Author

I realized that cyclic import may not be a problem, since each import occurs only when its enclosing function is called.

Comment on lines +328 to +331
if isinstance(self.xgb_model, xgb.Booster):
    return self.xgb_model.predict(xgb.DMatrix(dataframe))
else:
    return self.xgb_model.predict(dataframe)
Collaborator

This looks awesome! Thanks @jwyyy!

@dbczumar (Collaborator) left a comment

@jwyyy Awesome progress! Thank you for your ongoing contribution! I've left a few more comments. Can we work on adding support for logging / loading XGBoost scikit-learn models via mlflow.xgboost in a separate PR for reviewability purposes?

@jwyyy (Contributor Author) commented Oct 27, 2021


@dbczumar Thanks for your feedback! I am working on the revision. Will let you know if there is any problem.

I can create separate PRs to address each small issue, but we can still leave this PR as a template / draft. Since adding sklearn autologging for LightGBM would be very similar, I think it is a good idea to keep this PR as a roadmap.

@jwyyy jwyyy changed the title [WIP] Autologging functionality for scikit-learn integration with XGBoost and LightGBM (Part 1) [WIP] [Draft] Autologging functionality for scikit-learn integration with XGBoost and LightGBM Oct 27, 2021
@jwyyy jwyyy marked this pull request as draft November 12, 2021 02:00
@jwyyy (Contributor Author) commented Jan 14, 2022

Closing this PR, since #4296 is now resolved.

@jwyyy jwyyy closed this Jan 14, 2022
Labels
area/examples Example code area/tracking Tracking service, tracking client APIs, autologging rn/feature Mention under Features in Changelogs.
Development

Successfully merging this pull request may close these issues.

[FR] Autologging functionality for scikit-learn integration with XGBoost (and LightGBM)