
Update mlflow.spark.log_model() to accept descendants of pyspark.Model #1519

merged 7 commits into master from ML-7612 on Jul 3, 2019


ankitmathur-db commented Jun 27, 2019

What changes are proposed in this pull request?

The mlflow.spark.log_model() method defines a spark_model parameter. This parameter corresponds to the PySpark pipeline that the caller is persisting in MLflow format. Currently, the spark_model parameter must be of type PipelineModel: the mlflow.spark._validate_model() method throws an exception if spark_model is not a PipelineModel.

However, a PipelineModel is just a collection of PySpark stages, and each stage is a descendant of Transformer. We should be able to handle any descendant of pyspark.ml.Model (itself a descendant of Transformer) because those generate predictions, and we can wrap them under the hood in a PipelineModel that we then save. We cannot support arbitrary Transformers because many of them generate "features" rather than predictions, which breaks other assumptions in mlflow.spark.

This commit extends the set of supported types for the spark_model parameter of mlflow.spark.log_model() to include any descendant of pyspark.ml.Model. We convert it to a PipelineModel before serializing so that we can use a consistent save/load implementation.

How is this patch tested?

Added new tests covering save_model and log_model for several types of PySpark models.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

mlflow.spark.log_model and mlflow.spark.save_model can now serialize any descendant of pyspark.ml.Model that implements MLReadable and MLWritable.

What component(s) does this PR affect?

  • UI
  • CLI
  • API
  • Examples
  • Docs
  • Tracking
  • Projects
  • Artifacts
  • Models
  • Scoring
  • Serving
  • R
  • Java
  • Python

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes
@ankitmathur-db ankitmathur-db requested a review from dbczumar Jun 27, 2019
@ankitmathur-db ankitmathur-db self-assigned this Jun 27, 2019

dbczumar left a comment

Looks great in general! Left a few nits. Additionally, we should update the parameter documentation for spark_model to reflect the new requirements (spark_model can now be any instance implementing the MLWritable and MLReadable interfaces).

dbczumar left a comment

LGTM! Nice work @ankitmathur-db!

@ankitmathur-db ankitmathur-db merged commit 2dadafa into master Jul 3, 2019
2 checks passed
continuous-integration/travis-ci/pr The Travis CI build passed
continuous-integration/travis-ci/push The Travis CI build passed
@ankitmathur-db ankitmathur-db deleted the ML-7612 branch Jul 3, 2019