Support optional inputs in model signatures #8438

Merged
apurva-koti merged 15 commits into mlflow:master from optional-input-types
May 26, 2023

@apurva-koti (Collaborator) commented May 15, 2023

Related Issues/PRs

#xxx

What changes are proposed in this pull request?

Add a boolean parameter, optional, to ColSpec that specifies whether a column is required for model inference or can be omitted. The rest of the code is updated accordingly:

  • Updates pyfunc schema checks to check for missing required columns. Missing optional columns are ignored, but optional columns that are provided are still type checked.
  • Defaults optional to False for backwards compatibility.
  • Prevents argument autofill for pyfunc models returned as Spark UDFs (the autofill added in "Enable spark_udf to use column names from model signature by default", #4236) when the model signature contains optional columns.

Example usage (pyfunc):

input_schema = Schema(
    [
        ColSpec("double", "a"),
        ColSpec("double", "b"),
        ColSpec("string", "c", optional=True),
        ColSpec("long", "d", optional=True),
    ]
)
signature = ModelSignature(inputs=input_schema)
data_1 = {"a": [1.0], "b": [1.0]}  # both optional columns omitted
data_2 = {"a": [1.0], "b": [1.0], "d": [2]}  # optional column "d" provided

for data in [data_1, data_2]:
    pd_data = pd.DataFrame(data)
    check = _enforce_schema(pd_data, signature.inputs)  # passes
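
The behaviour described above (missing optional columns are ignored, provided optional columns are still type checked, missing required columns fail) can be sketched without MLflow. This is a simplified illustration under stated assumptions, not MLflow's _enforce_schema: SimpleColSpec and check_columns are hypothetical names, and types are checked with plain Python isinstance rather than MLflow's data types.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SimpleColSpec:
    """Hypothetical stand-in for ColSpec: column name, Python type, optional flag."""

    name: str
    py_type: type
    optional: bool = False  # required by default, mirroring the backwards-compatible default


def check_columns(data: dict, schema: list) -> None:
    """Validate `data` (column name -> list of values) against `schema`.

    Raises ValueError for missing required columns and TypeError for
    values of the wrong type in any column that is present.
    """
    missing = [c.name for c in schema if not c.optional and c.name not in data]
    if missing:
        raise ValueError(f"Missing required columns: {missing}")
    for col in schema:
        if col.name not in data:
            continue  # a missing optional column is simply skipped
        for value in data[col.name]:
            if not isinstance(value, col.py_type):
                raise TypeError(
                    f"Column {col.name!r} expects {col.py_type.__name__}, got {value!r}"
                )


schema = [
    SimpleColSpec("a", float),
    SimpleColSpec("b", float),
    SimpleColSpec("c", str, optional=True),
    SimpleColSpec("d", int, optional=True),
]

check_columns({"a": [1.0], "b": [1.0]}, schema)  # passes: optional columns omitted
check_columns({"a": [1.0], "b": [1.0], "d": [2]}, schema)  # passes: "d" has the right type
```

In this sketch, omitting "b" raises ValueError, and passing "d" as a list of strings raises TypeError even though "d" is optional.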

Example (spark udf):

test_signature = {
    "inputs": '[{"name": "a", "type": "long"}, {"name": "b", "type": "long"}, {"name" : "c", "type": "long", "optional": "True"}]',
}
signature = ModelSignature.from_dict(test_signature)
...
udf = mlflow.pyfunc.spark_udf(...)

data = spark.createDataFrame(
    pd.DataFrame(columns=["a", "b"], data={"a": [1, 2], "b": [2, 3]})
)

res = data.withColumn("response", udf(*data.columns))  # columns are passed explicitly; calling udf() with no arguments would raise an exception
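
A side note on the encoding used in the inputs string above: it is JSON, and the standard-library sketch below (not MLflow code) shows that "optional": "True" parses to a Python string, whereas the JSON boolean true parses to True. How ModelSignature.from_dict treats the two is not stated in this PR, so this only illustrates the difference at the JSON level.

```python
import json

# Same column spec, once with a quoted "True" and once with a JSON boolean.
as_string = json.loads('[{"name": "c", "type": "long", "optional": "True"}]')
as_bool = json.loads('[{"name": "c", "type": "long", "optional": true}]')

print(type(as_string[0]["optional"]).__name__)  # str
print(type(as_bool[0]["optional"]).__name__)  # bool

# Any non-empty string is truthy, including "False".
print(bool("False"))  # True
```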

How is this patch tested?

  • Existing unit/integration tests
  • New unit/integration tests
  • Manual tests (describe details, including test results, below)

Does this PR change the documentation?

  • No. You can skip the rest of this section.
  • Yes. Make sure the changed pages / sections render correctly in the documentation preview.

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

Optional input columns can now be specified in model signatures. These columns can be omitted from input dataframes at prediction time.

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/recipes: Recipes, Recipe APIs, Recipe configs, Recipe Templates
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Apurva Koti <apurva.koti@databricks.com>
@mlflow-automation (Collaborator) commented May 15, 2023

Documentation preview for 96030fb will be available here when this CircleCI job completes successfully.


@WeichenXu123 (Collaborator) commented:

Could you update the PR description to include example code that shows the case your PR supports?

@apurva-koti (Collaborator, Author) commented:

@WeichenXu123 The PR is not ready yet. I'll add all of that when requesting review.

@github-actions bot added the area/models (MLmodel format, model serialization/deserialization, flavors) and rn/feature (Mention under Features in Changelogs) labels on May 17, 2023
@apurva-koti apurva-koti marked this pull request as ready for review May 25, 2023 21:26
Comment on lines +1304 to +1309
if input_schema and len(input_schema.optional_input_names()) > 0:
    raise MlflowException(
        message="Cannot apply UDF without column names specified when"
        " model signature contains optional columns.",
        error_code=INVALID_PARAMETER_VALUE,
    )
@apurva-koti (Collaborator, Author) commented:

Automatic parameter filling relies on the model signature to determine what the columns will be when the UDF is applied to the dataframe.
With optional columns, we can neither:

  • include one or all of them, as that would implicitly require those columns to exist in the dataframe, raising an error within pyspark
  • exclude one or all of them, as they would then not be selected from the dataframe at all

Given this, it seems reasonable to prevent this convenience for this case. Users can still manually pass in the list of columns to udf as follows:

test_signature = {
    "inputs": '[{"name": "a", "type": "long"}, {"name": "b", "type": "long"}, {"name" : "c", "type": "long", "optional": "True"}]',
}
signature = ModelSignature.from_dict(test_signature)
...
udf = mlflow.pyfunc.spark_udf(...)

data = spark.createDataFrame(
    pd.DataFrame(columns=["a", "b"], data={"a": [1, 2], "b": [2, 3]})
)

res = data.withColumn("response", udf(*data.columns)) # <-- calling udf() would throw an exception.
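
The decision described in this comment can be illustrated with a small MLflow-free sketch. resolve_udf_columns is a hypothetical helper, not the function inside spark_udf, and it raises ValueError where the real code raises MlflowException. It only mirrors the rule stated above: explicit column arguments are used as given, and autofill from the signature is rejected when the signature has optional columns.

```python
def resolve_udf_columns(args, input_names, optional_names):
    """Return the column names a UDF call should use (hypothetical helper)."""
    if args:
        return list(args)  # explicitly passed columns are used as-is
    if optional_names:
        # Autofill cannot know which optional columns exist in the DataFrame.
        raise ValueError(
            "Cannot apply UDF without column names specified when "
            "model signature contains optional columns."
        )
    return list(input_names)  # no optional columns: autofill from the signature


# Explicit columns work even though "c" is optional and absent.
print(resolve_udf_columns(("a", "b"), ["a", "b", "c"], ["c"]))  # ['a', 'b']

# With no optional columns, an empty call is filled from the signature.
print(resolve_udf_columns((), ["a", "b"], []))  # ['a', 'b']
```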

Collaborator:

Makes sense!

@WeichenXu123 (Collaborator) left a comment:

LGTM

@apurva-koti apurva-koti merged commit 61decf4 into mlflow:master May 26, 2023
26 checks passed
@apurva-koti apurva-koti deleted the optional-input-types branch May 26, 2023 17:59
Labels: area/models, rn/feature

3 participants