Add support for Decimal object to Double cast in ML Flow #6600
Conversation
@shitaoli-db Thanks for the contribution! The DCO check failed. Please sign off your commits by following the instructions here: https://github.com/mlflow/mlflow/runs/8028008004. See https://github.com/mlflow/mlflow/blob/master/CONTRIBUTING.rst#sign-your-work for more details.
Signed-off-by: Shitao Li <shitao.li@databricks.com>
Force-pushed b06f1c0 to 6399e81 (Compare)
mlflow/pyfunc/__init__.py (Outdated)

```diff
@@ -397,6 +402,15 @@ def _enforce_mlflow_datatype(name, values: pandas.Series, t: DataType):
     raise MlflowException(
         "Failed to convert column {0} from type {1} to {2}.".format(name, values.dtype, t)
     )
+if t == DataType.double and values.dtype == decimal.Decimal:
```
I thought in our test `values.dtype` is `object`, while `type(values[0])` is `decimal.Decimal`. Can you confirm?
Also, our unit test and e2e notebook test all passed when the object is a Decimal.
Ah, I didn't know `dtype == object` and `dtype == decimal.Decimal` can both be true. LGTM!
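The behavior discussed above can be sketched outside MLflow: pandas stores `decimal.Decimal` values in an object-dtype Series, so a conversion helper has to inspect the elements rather than trust the dtype alone. This is a minimal illustration, not MLflow's actual implementation; the helper name is hypothetical.

```python
import decimal

import numpy as np
import pandas as pd


def cast_decimal_to_double(values: pd.Series) -> pd.Series:
    """Cast an object-dtype Series of decimal.Decimal values to float64.

    Pandas keeps Decimal objects in an object-dtype column, so we check
    the elements themselves before casting; non-Decimal columns are
    returned unchanged.
    """
    if values.dtype == object and all(
        isinstance(v, decimal.Decimal) for v in values.dropna()
    ):
        return values.astype(np.float64)
    return values
```

A Series built from `decimal.Decimal` values will report `dtype == object`, and after the helper runs its dtype is `float64`.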
mlflow/pyfunc/__init__.py (Outdated)

```diff
@@ -397,6 +402,15 @@ def _enforce_mlflow_datatype(name, values: pandas.Series, t: DataType):
     raise MlflowException(
         "Failed to convert column {0} from type {1} to {2}.".format(name, values.dtype, t)
     )
+if t == DataType.double and values.dtype == decimal.Decimal:
+    # NB: Pyspark Decimal columne get converted to decimal.Decimal when converted to pandas
```
nit: columne -> column.
in order to support ... - is part of the comment missing?
Thank you. Done.
Looks good.
Can you also update docs please?
Can we make this work only in spark_udf environment or protect it with an argument? E.g. we could use an env variable or extra argument to pyfunc.load_model. Would that be feasible?
Signed-off-by: Shitao Li <shitao.li@databricks.com>
Let me try with an extra argument first. Do you know how we generated this message? If we use the extra argument, then we have to customize this message in the UI to let customers know that if they want a model trained on decimal data to work in inference, they have to pass an extra argument.
mlflow/pyfunc/__init__.py (Outdated)

```python
model_uri: str,
suppress_warnings: bool = False,
dst_path: str = None,
support_decimal: bool = False,
```
Is there a way to accomplish the goal of providing conditional decimal enforcement without adding this flag to the `mlflow.pyfunc.load_model()` API? I'd rather not extend this API for a niche use case.
I personally don't think we need this flag at all. Can we just always enable this conversion? I don't see any real use case of setting that to false (or only add that flag later when we confirm it's a real use case).
@dbczumar @tomasatdatabricks what do you think?
I think Tom suggested two approaches to me. The other option I can think of off the top of my head is to use an environment variable: https://github.com/mlflow/mlflow/blob/master/mlflow/environment_variables.py. I am not sure which approach is preferable; let's reach a quick agreement.
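For context, the environment-variable option under discussion could look roughly like the sketch below. This is a hypothetical illustration only: the variable name `MLFLOW_ENABLE_DECIMAL_TO_DOUBLE_CAST` and the helper are invented for this example (the conversation ultimately decided to always enable the cast, so no such flag was added).

```python
import decimal
import os

import numpy as np
import pandas as pd

# Hypothetical flag name; a real one would be registered in
# mlflow/environment_variables.py.
_ENABLE_DECIMAL_CAST = "MLFLOW_ENABLE_DECIMAL_TO_DOUBLE_CAST"


def maybe_cast_decimal_to_double(values: pd.Series) -> pd.Series:
    """Cast a Decimal column to float64 only when the opt-in env var is set."""
    if os.environ.get(_ENABLE_DECIMAL_CAST, "false").lower() != "true":
        # Opt-in not set: leave the column untouched.
        return values
    if values.dtype == object and all(
        isinstance(v, decimal.Decimal) for v in values.dropna()
    ):
        return values.astype(np.float64)
    return values
```

The advantage of this pattern over a `load_model()` argument is that it gates the behavior without widening a public API surface.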
I agree with @yxiong here
Reverted the change; now we always enable the cast.
Force-pushed 89bdb01 to 99d17ad (Compare)
LGTM!
Signed-off-by: Shitao Li <shitao.li@databricks.com>
Head branch was pushed to by a user without write access
autoformat
Signed-off-by: Shitao Li <shitao.li@databricks.com>
…ved the enforce schema away so we have to do the same. Signed-off-by: Shitao Li <shitao.li@databricks.com>
Signed-off-by: Shitao Li <shitao.li@databricks.com>
automerge
Signed-off-by: Shitao Li <shitao.li@databricks.com>
Head branch was pushed to by a user without write access
* Add Decimal to double cast when enforcing the schema.
* Add unit test for casting from Decimal to double.
* Fix the comment in the Python file.
* Fix lint problem.
* Move the change to utils due to a merge conflict. The previous commit had moved the enforce schema away, so we have to do the same.
* Clean up __init__.py since the merge resolver did not handle the merge conflict.
* Lint the utils.py.

Signed-off-by: Shitao Li <shitao.li@databricks.com>
Signed-off-by: Prithvi Kannan <prithvi.kannan@databricks.com>
What changes are proposed in this pull request?
Add support for casting `decimal.Decimal` objects to double during model schema enforcement, so that models with a double schema can accept decimal input (e.g. PySpark Decimal columns converted to pandas).
How is this patch tested?
Added a unit test for the Decimal-to-double cast in schema enforcement. Also added an end-to-end test with Databricks AutoML model inference, where the model has a double schema; inference was successful.
Does this PR change the documentation?
Release Notes
Is this a user-facing change?
Model inference now allows users to pass Decimal objects as input. Note that decimal input is only allowed when the schema type is double; the data will be cast to double and thus may lose precision.
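The precision caveat can be illustrated with plain Python: an IEEE 754 double carries roughly 15-17 significant decimal digits, so a `Decimal` with more digits than that is rounded when cast. The values below are chosen only for illustration.

```python
import decimal

# 20 significant digits: more than a double can represent.
d = decimal.Decimal("1.0000000000000000001")

# Casting to double (Python float) silently drops the trailing digit.
f = float(d)
# f == 1.0 — the extra precision is lost.
```

This is why the release note warns that decimal input "may lose precision" once it is cast to the model's double schema.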
What component(s), interfaces, languages, and integrations does this PR affect?

Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/pipelines: Pipelines, Pipeline APIs, Pipeline configs, Pipeline Templates
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging

Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support

Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages

Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:
- rn/breaking-change: The PR will be mentioned in the "Breaking Changes" section
- rn/none: No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature: A new user-facing feature worth mentioning in the release notes
- rn/bug-fix: A user-facing bug fix worth mentioning in the release notes
- rn/documentation: A user-facing documentation change worth mentioning in the release notes