feat: add utility for signature validation of logged model to dataset #6494

Merged on Sep 1, 2022 (18 commits)

serena-ruan
Collaborator

@serena-ruan serena-ruan commented Aug 17, 2022

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

Related Issues/PRs

Resolve #6092

What changes are proposed in this pull request?

Add the utility methods get_model_info and validate_schema.

How is this patch tested?

  • I have written tests (not required for typo or doc fix) and confirmed the proposed feature/bug-fix/change works.

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly by following the steps below.
  1. Click the Details link on the Preview docs check.
  2. Find the changed pages / sections and make sure they render correctly.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Add utility functions that enable users to (1) easily get model info directly from the model URI and (2) validate a dataset against a target schema.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@github-actions github-actions bot added the area/models and rn/feature labels Aug 17, 2022
PyFuncInput = Union[
pandas.DataFrame,
np.ndarray,
scipy.sparse.csc_matrix, # Why do we use a string format "scipy.sparse.csc_matrix" here before?
Collaborator Author

Raising a question here: these were previously given as the strings "scipy.sparse.csc_matrix" and "scipy.sparse.csr_matrix". Is there any specific reason for that?

…ds on numpy and pandas

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@dbczumar dbczumar self-requested a review August 18, 2022 05:00
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@WeichenXu123 WeichenXu123 self-requested a review August 19, 2022 12:32
)


def get_model_signature(model_uri: str) -> dict:
Collaborator

Instead of having two methods, it would be great to include the model signature in the ModelInfo class so that users can call get_model_info().signature. I realize that ModelInfo has a signature_dict attribute; is it possible to add a signature attribute of type ModelSignature (rather than dict) and mark signature_dict as deprecated via the @deprecated decorator?

Collaborator Author

Actually, the @deprecated decorator can only be applied to classes or methods, so I added another, similar decorator for this case (let me know if you have other preferences, since I'm not an expert here).
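
As an illustration of the idea discussed here (warning when a deprecated attribute is read), the following is a minimal stdlib-only sketch. The names `deprecated_property` and `InfoSketch` are hypothetical and are not taken from the MLflow code base; the decorator actually added in this PR may look different.

```python
import functools
import warnings


def deprecated_property(alternative: str):
    """Hypothetical helper: build a property whose getter warns on every read."""

    def decorator(getter):
        @functools.wraps(getter)
        def wrapper(self):
            warnings.warn(
                f"``{getter.__name__}`` is deprecated; use ``{alternative}`` instead.",
                category=FutureWarning,
                stacklevel=2,  # point the warning at the caller, not at this wrapper
            )
            return getter(self)

        return property(wrapper)

    return decorator


class InfoSketch:
    """Hypothetical stand-in for a model-info record (not MLflow's ModelInfo)."""

    def __init__(self, signature_dict=None):
        self._signature_dict = signature_dict

    @deprecated_property(alternative="signature")
    def signature_dict(self):
        return self._signature_dict
```

Reading `InfoSketch(...).signature_dict` still returns the stored value, but emits a `FutureWarning` that names the replacement attribute.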

mlflow/models/utils.py (outdated)
Comment on lines 562 to 573
if isinstance(
    data,
    (
        pd.DataFrame,
        np.ndarray,
        csc_matrix,
        csr_matrix,
        dict,
        pd.Series,
        list,
    ),
):
Collaborator

@dbczumar dbczumar Aug 20, 2022

Do we need this check? Looks like pyfunc.predict() doesn't check isinstance() before calling _enforce_schema:

    if input_schema is not None:
        data = _enforce_schema(data, input_schema)

If we do need the check, can we check isinstance(data, PyFuncInput)? Though I don't think the check is necessary.
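
A note on checking against a `Union` alias, should the check be kept: on some Python versions (3.9, for example) passing a `typing.Union` alias directly to `isinstance` raises `TypeError`. One portable option is to unpack the alias with `typing.get_args`. The sketch below uses a hypothetical `InputSketch` alias built from built-in types as a stand-in for `PyFuncInput`; it is not MLflow code.

```python
from typing import Union, get_args

# Hypothetical stand-in for a Union alias such as PyFuncInput.
InputSketch = Union[dict, list, tuple]

# get_args returns the member types as a tuple, which isinstance accepts.
ALLOWED_TYPES = get_args(InputSketch)


def is_supported(data) -> bool:
    """Return True if data is an instance of any member of the Union alias."""
    return isinstance(data, ALLOWED_TYPES)
```

This only works when every member of the alias is a real class; members written as strings would not be usable with `isinstance` this way.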

Collaborator Author

Yes, predict doesn't perform this check, so perhaps I can just delete it and call _enforce_schema directly.

csc_matrix,
csr_matrix,
dict,
pd.Series,
Collaborator

It doesn't look like pd.Series is considered valid by _enforce_schema (it will throw)

Collaborator Author

Oh I did miss some parts, will add support as well.

)


def validate_schema(data: DataInputType, expected_schema: Schema) -> DataInputType:
Collaborator

Following from https://github.com/mlflow/mlflow/pull/6494/files#r950650995, should we change this to:

Suggested change
- def validate_schema(data: DataInputType, expected_schema: Schema) -> DataInputType:
+ def validate_schema(data: PyFuncInput, expected_schema: Schema) -> DataInputType:

Since _enforce_schema doesn't support pandas series and validation of output schemas doesn't appear to be critically important?

Collaborator Author

I actually added support for pandas Series, since a Series can be converted directly into a pandas DataFrame and then go through the same validation. In fact, we could accept pandas.Series as a dataset input.

Collaborator

I think it would be nice to make PyFuncInput support the pd.Series type. @dbczumar Thoughts?

But @serena-ruan, your definition DataInputType = Union[PyFuncInput, PyFuncOutput] looks odd. We can reuse PyFuncInput directly here if we add pd.Series to PyFuncInput.

Collaborator Author

Ahh yes, will drop DataInputType.

Collaborator

@dbczumar dbczumar left a comment

@serena-ruan Apologies for the delay in the review. This is looking great! Just left a few comments - let me know if you have any questions.

@serena-ruan
Collaborator Author

Thanks for those comments! Didn't get a chance to take a look over the weekend, will address them later today!

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@serena-ruan
Collaborator Author

@dbczumar There's a constraint that if ModelInfo depends on ModelSignature, then we need to rely on pandas/numpy so we can't import them in mlflow-skinny. Besides, Model contains a method get_model_info so it also relies on ModelInfo, thus relying on pandas/numpy if we add ModelSignature into ModelInfo field. So I suggest we keep using signature_dict in ModelInfo and add another classmethod for ModelInfo to convert signature_dict back to ModelSignature using from_dict method. WDYT?

@dbczumar
Collaborator

@serena-ruan Can we import ModelSignature inside the implementation of the signature property method, instead of importing it at the top level of the module? That should resolve the issue.

@serena-ruan
Collaborator Author

This is a NamedTuple instead of a normal class containing attributes 🤔 But I can have a try

@dbczumar
Collaborator

Thanks! Hopefully that still works :). If not, we can convert from NamedTuple to class
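
The approach discussed in this thread (importing the dependency inside the property rather than at module top level) can be sketched as follows. This is a stdlib-only illustration under stated assumptions: `ModelInfoSketch` is a hypothetical class, `json` stands in for the heavier pandas/numpy-dependent import, and the signature values are assumed to be JSON strings. It also shows that a class-syntax `NamedTuple` can carry a property.

```python
from typing import Any, Dict, NamedTuple, Optional


class ModelInfoSketch(NamedTuple):
    """Hypothetical lightweight record; not MLflow's ModelInfo."""

    model_uri: str
    signature_dict: Optional[Dict[str, str]] = None

    @property
    def signature(self) -> Optional[Dict[str, Any]]:
        # Import inside the property so that importing this module never
        # pulls in the dependency; json stands in for a heavy import here.
        import json

        if self.signature_dict is None:
            return None
        return {key: json.loads(value) for key, value in self.signature_dict.items()}


info = ModelInfoSketch(
    model_uri="runs:/<run_id>/model",  # placeholder URI for illustration
    signature_dict={"inputs": '[{"name": "x", "type": "double"}]'},
)
parsed = info.signature
```

The trade-off is that an import error surfaces only when the property is first read, not when the module is imported.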

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Collaborator

@dbczumar dbczumar left a comment

@serena-ruan This is awesome! Almost ready for merge! Just a couple more small comments

Comment on lines 40 to 49
artifact_path: str = None,
flavors: Dict[str, Any] = None,
model_uri: str = None,
model_uuid: str = None,
run_id: str = None,
saved_input_example_info: Optional[Dict[str, Any]] = None,
signature_dict: Optional[Dict[str, Any]] = None,
signature=None, # Optional[ModelSignature]
utc_time_created: str = None,
mlflow_version: str = None,
Member

@harupy harupy Aug 25, 2022

Suggested change
- artifact_path: str = None,
- flavors: Dict[str, Any] = None,
- model_uri: str = None,
- model_uuid: str = None,
- run_id: str = None,
- saved_input_example_info: Optional[Dict[str, Any]] = None,
- signature_dict: Optional[Dict[str, Any]] = None,
- signature=None, # Optional[ModelSignature]
- utc_time_created: str = None,
- mlflow_version: str = None,
+ artifact_path: str,
+ flavors: Dict[str, Any],
+ model_uri: str,
+ model_uuid: str,
+ run_id: str,
+ saved_input_example_info: Optional[Dict[str, Any]],
+ signature_dict: Optional[Dict[str, Any]],
+ signature, # Optional[ModelSignature]
+ utc_time_created: str,
+ mlflow_version: str,

Do we need the default values? In the original ModelInfo, all the constructor arguments are positional.

Collaborator Author

In general I guess not, but since signature_dict will be deprecated, I suggest we keep the default value None for it :)

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@serena-ruan
Collaborator Author

@dbczumar @harupy Can we merge this in if there's no extra comments? :D

Collaborator

@dbczumar dbczumar left a comment

@serena-ruan Apologies for the delay! Excited to merge this once three small comments are addressed!

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@dbczumar dbczumar merged commit 9f3d3af into mlflow:master Sep 1, 2022
@serena-ruan serena-ruan deleted the serena/sigVal branch September 1, 2022 13:25
prithvikannan pushed a commit to prithvikannan/mlflow that referenced this pull request Sep 6, 2022
…mlflow#6494)

* feat: add get_model_info and validate_schema helper function

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* refactor functions in pyfunc to models.utils as validate_schema depends on numpy and pandas

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* remove enforce schema of predict output

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* fix docs issue

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* address comments

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* fix docs and import error

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* fix rsthtml

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* refactor ModelInfo

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* address comments

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* remove unused arg

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* address comments and fix docs

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* add warnings of signature_dict getter

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* remove warnings in ModelInfo initialization

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* remove unused stuff

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* address comments

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* move pd.Series inside pyFuncInput

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* address comments

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

* remove useless return

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>

Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
prithvikannan pushed a commit to prithvikannan/mlflow that referenced this pull request Sep 7, 2022
nnethery pushed a commit to nnethery/mlflow that referenced this pull request Feb 1, 2024
Labels
area/models MLmodel format, model serialization/deserialization, flavors rn/feature Mention under Features in Changelogs.
Development

Successfully merging this pull request may close these issues.

[FR] [Roadmap] Add utility for signature validation of logged model to dataset
4 participants