-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add utility for signature validation of logged model to dataset #6494
Conversation
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
mlflow/models/model.py
Outdated
PyFuncInput = Union[ | ||
pandas.DataFrame, | ||
np.ndarray, | ||
scipy.sparse.csc_matrix, # Why do we use a string format "scipy.sparse.csc_matrix" here before? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raising a question here: this was previous set as the string format of "scipy.sparse.csc_matrix" and "scipy.sparse.csr_matrix", is there any specific reason for them?
…ds on numpy and pands Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
mlflow/models/model.py
Outdated
) | ||
|
||
|
||
def get_model_signature(model_uri: str) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of having two methods, it would be great to include the model signature in the ModelInfo
class so that users can call get_model_info().signature
. I realize that ModelInfo
has a signature_dict
attribute; is it possible to add a signature
attribute of type ModelSignature
(rather than dict
) and mark signature_dict
as deprecated via the @deprecated
decorator?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually @depracted decorator can only be applied on classes or methods, so I added another similar decorator for this reference (let me know if you have other preferences since I'm not super expert here).
mlflow/models/utils.py
Outdated
if isinstance( | ||
data, | ||
( | ||
pd.DataFrame, | ||
np.ndarray, | ||
csc_matrix, | ||
csr_matrix, | ||
dict, | ||
pd.Series, | ||
list, | ||
), | ||
): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need this check? Looks like pyfunc.predict()
doesn't check isinstance()
before calling _enforce_schema
:
mlflow/mlflow/pyfunc/__init__.py
Lines 650 to 651 in 9d278b1
if input_schema is not None: | |
data = _enforce_schema(data, input_schema) |
If we do need the check, can we check isinstance(data, PyFuncInput)
? Though I don't think the check is necessary.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes it doesn't check for predict, perhaps I can just delete this check and call _enforce_schema directly.
mlflow/models/utils.py
Outdated
csc_matrix, | ||
csr_matrix, | ||
dict, | ||
pd.Series, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It doesn't look like pd.Series
is considered valid by _enforce_schema
(it will throw)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh I did miss some parts, will add support as well.
mlflow/models/utils.py
Outdated
) | ||
|
||
|
||
def validate_schema(data: DataInputType, expected_schema: Schema) -> DataInputType: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Following from https://github.com/mlflow/mlflow/pull/6494/files#r950650995, should we change this to:
def validate_schema(data: DataInputType, expected_schema: Schema) -> DataInputType: | |
def validate_schema(data: PyFuncInput, expected_schema: Schema) -> DataInputType: |
Since _enforce_schema
doesn't support pandas series and validation of output schemas doesn't appear to be critically important?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I actually added support of pandas series, as it can be directly turned into a pandas DataFrame and does the following validation. And in fact we could accept pandas.Series as a dataset input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is nice to make PyFuncInput
support pd.Series
type. @dbczumar Thoughts ?
But, @serena-ruan , your definition DataInputType = Union[PyFuncInput, PyFuncOutput]
looks weird. We can directly reuse PyFuncInput
here, if we add pd.Series
into PyFuncInput
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ahh yes, will drop DataInputType.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serena-ruan Apologies for the delay in the review. This is looking great! Just left a few comments - let me know if you have any questions.
Thanks for those comments! Didn't get a chance to take a look over the weekend, will address them later today! |
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
@dbczumar There's a constraint that if ModelInfo depends on ModelSignature, then we need to rely on pandas/numpy so we can't import them in mlflow-skinny. Besides, Model contains a method |
@serena-ruan Can we import |
This is a NamedTuple instead of a normal class containing attributes 🤔 But I can have a try |
Thanks! Hopefully that still works :). If not, we can convert from NamedTuple to class |
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serena-ruan This is awesome! Almost ready for merge! Just a couple more small comments
mlflow/models/model.py
Outdated
artifact_path: str = None, | ||
flavors: Dict[str, Any] = None, | ||
model_uri: str = None, | ||
model_uuid: str = None, | ||
run_id: str = None, | ||
saved_input_example_info: Optional[Dict[str, Any]] = None, | ||
signature_dict: Optional[Dict[str, Any]] = None, | ||
signature=None, # Optional[ModelSignature] | ||
utc_time_created: str = None, | ||
mlflow_version: str = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
artifact_path: str = None, | |
flavors: Dict[str, Any] = None, | |
model_uri: str = None, | |
model_uuid: str = None, | |
run_id: str = None, | |
saved_input_example_info: Optional[Dict[str, Any]] = None, | |
signature_dict: Optional[Dict[str, Any]] = None, | |
signature=None, # Optional[ModelSignature] | |
utc_time_created: str = None, | |
mlflow_version: str = None, | |
artifact_path: str, | |
flavors: Dict[str, Any], | |
model_uri: str, | |
model_uuid: str, | |
run_id: str, | |
saved_input_example_info: Optional[Dict[str, Any]], | |
signature_dict: Optional[Dict[str, Any]], | |
signature, # Optional[ModelSignature] | |
utc_time_created: str, | |
mlflow_version: str, |
Do we need the default values? In the original ModelInfo
, all the constructor arguments are positional.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I guess not, but for signature_dict as it will be deprecated I suggest we keep default value None for it :)
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@serena-ruan Apologies for the delay! Excited to merge this once three small comments are addressed!
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
…mlflow#6494) * feat: add get_model_info and validate_schema helper function Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor functions in pyfunc to models.utils as validate_schema depends on numpy and pands Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove enforce schema of predict output Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs issue Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs and import error Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix rsthtml Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor ModelInfo Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused arg Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments and fix docs Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * add warnings of signature_dict getter Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove warnings in ModelInfo initialization Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused stuff Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * move pd.Series inside pyFuncInput Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * addresss comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove useless return Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
…mlflow#6494) * feat: add get_model_info and validate_schema helper function Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor functions in pyfunc to models.utils as validate_schema depends on numpy and pands Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove enforce schema of predict output Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs issue Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs and import error Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix rsthtml Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor ModelInfo Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused arg Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments and fix docs Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * add warnings of signature_dict getter Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove warnings in ModelInfo initialization Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused stuff Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * move pd.Series inside pyFuncInput Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * addresss comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove useless return Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
…mlflow#6494) * feat: add get_model_info and validate_schema helper function Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor functions in pyfunc to models.utils as validate_schema depends on numpy and pands Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove enforce schema of predict output Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs issue Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix docs and import error Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * fix rsthtml Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * refactor ModelInfo Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused arg Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments and fix docs Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * add warnings of signature_dict getter Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove warnings in ModelInfo initialization Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove unused stuff Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * address comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * move pd.Series inside pyFuncInput Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * addresss comments Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> * remove useless return Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com> Signed-off-by: Xinyue Ruan <serena.rxy@gmail.com>
Signed-off-by: Xinyue Ruan serena.rxy@gmail.com
Related Issues/PRs
Resolve #6092
What changes are proposed in this pull request?
Add utility method for get_model_info and validate_schema.
How is this patch tested?
Does this PR change the documentation?
Details
link on thePreview docs
check.Release Notes
Is this a user-facing change?
Add utility functions that enable users to 1. easily get model info from the model uri directly; 2. validate dataset against the target schema.
What component(s), interfaces, languages, and integrations does this PR affect?
Components
area/artifacts
: Artifact stores and artifact loggingarea/build
: Build and test infrastructure for MLflowarea/docs
: MLflow documentation pagesarea/examples
: Example codearea/model-registry
: Model Registry service, APIs, and the fluent client calls for Model Registryarea/models
: MLmodel format, model serialization/deserialization, flavorsarea/pipelines
: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templatesarea/projects
: MLproject format, project running backendsarea/scoring
: MLflow Model server, model deployment tools, Spark UDFsarea/server-infra
: MLflow Tracking server backendarea/tracking
: Tracking Service, tracking client APIs, autologgingInterface
area/uiux
: Front-end, user experience, plotting, JavaScript, JavaScript dev serverarea/docker
: Docker use across MLflow's components, such as MLflow Projects and MLflow Modelsarea/sqlalchemy
: Use of SQLAlchemy in the Tracking Service or Model Registryarea/windows
: Windows supportLanguage
language/r
: R APIs and clientslanguage/java
: Java APIs and clientslanguage/new
: Proposals for new client languagesIntegrations
integrations/azure
: Azure and Azure ML integrationsintegrations/sagemaker
: SageMaker integrationsintegrations/databricks
: Databricks integrationsHow should the PR be classified in the release notes? Choose one:
rn/breaking-change
- The PR will be mentioned in the "Breaking Changes" sectionrn/none
- No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" sectionrn/feature
- A new user-facing feature worth mentioning in the release notesrn/bug-fix
- A user-facing bug fix worth mentioning in the release notesrn/documentation
- A user-facing documentation change worth mentioning in the release notes