Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix a bug that converts int64 to string when converting Protobuf to JSON #5010

Merged
merged 19 commits into from
Nov 11, 2021

Conversation

lichenran1234
Copy link
Contributor

@lichenran1234 lichenran1234 commented Nov 4, 2021

Signed-off-by: Chenran Li chenran.li@databricks.com

What changes are proposed in this pull request?

According to issue #4037, the returned JSON of the endpoints has creation_timestamp and last_updated_timestamp as strings, not numbers. It's different from what was documented in the official doc.

The reason is we are calling Google's MessageToJson API to convert protobuf to json, which implicitly converts int64/fixed64/unit64 fields to strings. And they claimed it's a feature not a bug (see the discussion).

According to the bug reporter, this bug doesn't exist in Azure ML mlflow server (which is essentially our Databricks mlflow server). That's because we are using ScalaPB's ToJson() API for all the Databricks endpoints, and it doesn't convert int64 to string.

There is no way to let MessageToJson API not convert int64 to strings. Nor are there any other good Python proto-to-json libraries. So to fix this bug, we have to choose from:

  • (what I'm doing in this PR) manually converting the int64/uint64/fixed64 fields back to numbers after calling MessageToJson
  • (too risky so I chose not to do) writing our customized MessageToJson API

How is this patch tested?

unit tests

Release Notes

Is this a user-facing change?

  • No. You can skip the rest of this section.
  • Yes. Give a description of this change to be included in the release notes for MLflow users.

(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)

What component(s), interfaces, languages, and integrations does this PR affect?

Components

  • area/artifacts: Artifact stores and artifact logging
  • area/build: Build and test infrastructure for MLflow
  • area/docs: MLflow documentation pages
  • area/examples: Example code
  • area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
  • area/models: MLmodel format, model serialization/deserialization, flavors
  • area/projects: MLproject format, project running backends
  • area/scoring: MLflow Model server, model deployment tools, Spark UDFs
  • area/server-infra: MLflow Tracking server backend
  • area/tracking: Tracking Service, tracking client APIs, autologging

Interface

  • area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
  • area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
  • area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
  • area/windows: Windows support

Language

  • language/r: R APIs and clients
  • language/java: Java APIs and clients
  • language/new: Proposals for new client languages

Integrations

  • integrations/azure: Azure and Azure ML integrations
  • integrations/sagemaker: SageMaker integrations
  • integrations/databricks: Databricks integrations

How should the PR be classified in the release notes? Choose one:

  • rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
  • rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
  • rn/feature - A new user-facing feature worth mentioning in the release notes
  • rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
  • rn/documentation - A user-facing documentation change worth mentioning in the release notes

Signed-off-by: Chenran Li <chenran.li@databricks.com>
@github-actions github-actions bot added area/model-registry Model registry, model registry APIs, and the fluent client calls for model registry rn/none List under Small Changes in Changelogs. labels Nov 4, 2021
@lichenran1234 lichenran1234 changed the title Fix an issue with the timestamps in the endpoint response Fix a bug that converts int64 to string when converting Protobuf to JSON Nov 4, 2021
@@ -28,6 +29,75 @@ def test_message_to_json():
"lifecycle_stage": "active",
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The bug wasn't caught by this test case, because there is no int fields here.

Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Copy link
Collaborator

@jinzhang21 jinzhang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for making the change! A couple minor comments.

},
],
}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add another line to convert the json back to proto and dict and assert it works?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!

return MessageToJson(message, preserving_proto_field_name=True)

# Google's MessageToJson API converts int64/fixed64/unit64 proto fields to JSON strings.
json_dict_with_int64_converted_to_str = json.loads(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_dict_with_int64_converted_to_str -> json_dict_with_int64_as_str to be consistent with json_dict_with_int64_as_numbers below

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done. Thanks!



def message_to_json(message):
"""Converts a message to JSON, using snake_case for field names."""
return MessageToJson(message, preserving_proto_field_name=True)

# Google's MessageToJson API converts int64/fixed64/unit64 proto fields to JSON strings.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add context to why, e.g. citing the comment protocolbuffers/protobuf#2954 (comment)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done

return json_dict


def _merge_json_dicts(from_dict, to_dict):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would a package work better here? Not sure about its quality though. https://pypi.org/project/jsonmerge/

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately that package has some weird bugs: it dropped lots of fields when merging.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Np. Just to clarify options. Custom implementation is fine.

Comment on lines 32 to 37
if field.label == FieldDescriptor.LABEL_REPEATED:
json_value = []
for v in value:
json_value.append(ftype(v))
else:
json_value = ftype(value)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

json_value = [ftype(v) for v in value] if field.label == FieldDescriptor.LABEL_REPEATED else ftype(value)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

Comment on lines 23 to 25
FieldDescriptor.TYPE_INT64,
FieldDescriptor.TYPE_UINT64,
FieldDescriptor.TYPE_FIXED64,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about these types?

TYPE_FIXED32 = 7
TYPE_INT32 = 5
TYPE_SFIXED32 = 15
TYPE_SFIXED64 = 16
TYPE_SINT32 = 17
TYPE_SINT64 = 18
TYPE_UINT32 = 13
TYPE_UINT64 = 4

And CPP types:
CPPTYPE_INT32 = 1
CPPTYPE_INT64 = 2
CPPTYPE_UINT32 = 3
CPPTYPE_UINT64 = 4

https://googleapis.dev/python/protobuf/latest/google/protobuf/descriptor.html

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For int32 types, according to the doc, they won't be converted to JSON strings so we don't need to add them here.

For CPP types, they are from a difference enum FieldDescriptor::CppType rather than FieldDescriptor::Type. They are used for field.cpp_type(). But here we only care about field.type().
Screen Shot 2021-11-04 at 19 11 17

I added two more int64 types (TYPE_SFIXED64 and TYPE_SINT64) to cover all the int64 types.

Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Copy link
Collaborator

@jinzhang21 jinzhang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks for the change, Chenran! There are some tests are failing. Please fix those.

converted from proto messages
"""

for key in from_dict:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for key in from_dict: -> for key, value in from_dict.items():

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!

value = to_dict[key]
if isinstance(value, dict):
_merge_json_dicts(from_dict[key], to_dict[key])
elif isinstance(value, list):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could value be a tuple?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! According to this page, Python dict constructed from a JSON string cannot have tuples:
Screen Shot 2021-11-05 at 10 49 31

Comment on lines 52 to 56
for i in range(len(value)):
if isinstance(value[i], dict):
_merge_json_dicts(from_dict[key][i], to_dict[key][i])
else:
to_dict[key][i] = from_dict[key][i]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Enumerate seems to be simpler here:

for i, v in enumerate(value):
if isinstance(v, dict):
_merge_json_dicts(v, to_dict[key][i])
else:
to_dict[key][i] = v

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done

@dbczumar dbczumar requested a review from harupy November 5, 2021 04:22
@dbczumar
Copy link
Collaborator

dbczumar commented Nov 5, 2021

@harupy FYI

return json_dict


def _merge_json_dicts(from_dict, to_dict):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @lichenran1234, can we add a test for _merge_json_dicts to make sure it works properly?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the comment, Harupy! Usually we don't write unit tests for private functions. Especially for this _merge_json_dicts function: the code inside it should be embedded inside message_to_json function, but I'm extracting it out for readability. So I guess we should follow the Test Behavior, Not Implementation principle.

Can you think of more test cases for message_to_json if you are worried that _merge_json_dicts may not work properly?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @lichenran1234 that don't test private function. But perhaps add more types of field to the unit test to verify that the public API works for all kinds of fields? e.g. we might want to verify that the following aspects are correctly translated:

  • default values (I worry about this one)
  • extensions (and this one)
  • proto maps
  • oneof
  • enums

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lichenran1234 Makes sense to not test _merge_json_dicts, thanks for the knowledge!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @harupy and @jinzhang21 ! I added a new test proto message so I can test this function extensively. I also added support for proto maps.

Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
@@ -0,0 +1,55 @@
syntax = "proto2";
Copy link
Member

@harupy harupy Nov 6, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to move this file under tests directory?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point! Done!

Copy link
Collaborator

@jinzhang21 jinzhang21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for addressing this issue, @lichenran1234 ! Could you please move the test protos to mlflow/tests/protos as @harupy suggested? The rest LGTM! Feel free to merge after the change.

Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Signed-off-by: Chenran Li <chenran.li@databricks.com>
Copy link
Member

@harupy harupy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@lichenran1234 lichenran1234 merged commit a20bb49 into mlflow:master Nov 11, 2021
@lichenran1234 lichenran1234 deleted the timestamp branch November 11, 2021 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/model-registry Model registry, model registry APIs, and the fluent client calls for model registry rn/none List under Small Changes in Changelogs.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants