Fix a bug that converts int64 to string when converting Protobuf to JSON #5010
Signed-off-by: Chenran Li <chenran.li@databricks.com>
@@ -28,6 +29,75 @@ def test_message_to_json():
    "lifecycle_stage": "active",
The bug wasn't caught by this test case because there are no int fields here.
Thanks for making the change! A couple minor comments.
        },
    ],
}
Add another line to convert the json back to proto and dict and assert it works?
Done. Thanks!
mlflow/utils/proto_json_utils.py
    return MessageToJson(message, preserving_proto_field_name=True)

    # Google's MessageToJson API converts int64/fixed64/uint64 proto fields to JSON strings.
    json_dict_with_int64_converted_to_str = json.loads(
json_dict_with_int64_converted_to_str -> json_dict_with_int64_as_str to be consistent with json_dict_with_int64_as_numbers below
Done. Thanks!
mlflow/utils/proto_json_utils.py
def message_to_json(message):
    """Converts a message to JSON, using snake_case for field names."""
    return MessageToJson(message, preserving_proto_field_name=True)

    # Google's MessageToJson API converts int64/fixed64/uint64 proto fields to JSON strings.
Add context to why, e.g. citing the comment protocolbuffers/protobuf#2954 (comment)
Good point! Done
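For readers who want the "why" spelled out: the proto3 JSON mapping encodes 64-bit integers as strings because many JSON consumers (JavaScript in particular) parse numbers as IEEE 754 doubles, which represent integers exactly only up to 2^53. A minimal sketch of that precision loss, using Python's float as a stand-in for a JavaScript number; the sample values are made up for illustration:

```python
# Python's float is an IEEE 754 double, the same representation
# JavaScript uses for every number.
big = 2**53 + 1          # hypothetical int64 value above the exact range
as_double = float(big)   # what a double-based JSON parser would hold

# The value does not survive the round trip through a double.
print(int(as_double) == big)

# A millisecond timestamp is far below 2**53, so it survives intact.
timestamp_ms = 1626000000000
print(int(float(timestamp_ms)) == timestamp_ms)
```

Timestamps like creation_timestamp fit comfortably in the exact range, which is why returning them as JSON numbers is safe for this API.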
    return json_dict


def _merge_json_dicts(from_dict, to_dict):
Would a package work better here? Not sure about its quality though. https://pypi.org/project/jsonmerge/
Unfortunately that package has some weird bugs: it dropped lots of fields when merging.
Np. Just to clarify options. Custom implementation is fine.
mlflow/utils/proto_json_utils.py
if field.label == FieldDescriptor.LABEL_REPEATED:
    json_value = []
    for v in value:
        json_value.append(ftype(v))
else:
    json_value = ftype(value)
json_value = [ftype(v) for v in value] if field.label == FieldDescriptor.LABEL_REPEATED else ftype(value)
Done, thanks!
mlflow/utils/proto_json_utils.py
FieldDescriptor.TYPE_INT64,
FieldDescriptor.TYPE_UINT64,
FieldDescriptor.TYPE_FIXED64,
What about these types?
TYPE_FIXED32 = 7
TYPE_INT32 = 5
TYPE_SFIXED32 = 15
TYPE_SFIXED64 = 16
TYPE_SINT32 = 17
TYPE_SINT64 = 18
TYPE_UINT32 = 13
TYPE_UINT64 = 4
And CPP types:
CPPTYPE_INT32 = 1
CPPTYPE_INT64 = 2
CPPTYPE_UINT32 = 3
CPPTYPE_UINT64 = 4
https://googleapis.dev/python/protobuf/latest/google/protobuf/descriptor.html
For int32 types, according to the doc, they won't be converted to JSON strings so we don't need to add them here.
For CPP types, they are from a different enum, FieldDescriptor::CppType, rather than FieldDescriptor::Type. They are used for field.cpp_type(). But here we only care about field.type().
I added two more int64 types (TYPE_SFIXED64 and TYPE_SINT64) to cover all the int64 types.
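To make the resulting type check concrete, here is a rough sketch of the full set of 64-bit integer field types. It hard-codes the numeric values of the FieldDescriptor.TYPE_* constants so it runs without protobuf installed; real code would reference the constants themselves. The helper name restore_int64 is hypothetical and not part of this PR:

```python
# Numeric values of google.protobuf.descriptor.FieldDescriptor.TYPE_* constants,
# hard-coded here only so the sketch is self-contained.
TYPE_INT64 = 3
TYPE_UINT64 = 4
TYPE_INT32 = 5
TYPE_FIXED64 = 6
TYPE_SFIXED64 = 16
TYPE_SINT64 = 18

# Every 64-bit integer type; these are the ones rendered as JSON strings.
INT64_TYPES = frozenset(
    {TYPE_INT64, TYPE_UINT64, TYPE_FIXED64, TYPE_SFIXED64, TYPE_SINT64}
)


def restore_int64(field_type, json_value):
    """Hypothetical helper: turn a stringified 64-bit value back into an int."""
    if field_type in INT64_TYPES and isinstance(json_value, str):
        return int(json_value)
    # 32-bit types (and anything else) are already JSON numbers; leave them alone.
    return json_value
```

The 32-bit types are deliberately left out of the set, matching the point above that they are not converted to strings in the first place.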
LGTM! Thanks for the change, Chenran! Some tests are failing. Please fix those.
mlflow/utils/proto_json_utils.py
    converted from proto messages
    """

    for key in from_dict:
for key in from_dict: -> for key, value in from_dict.items():
Done, thanks!
        value = to_dict[key]
        if isinstance(value, dict):
            _merge_json_dicts(from_dict[key], to_dict[key])
        elif isinstance(value, list):
Could value be a tuple?
Good question! According to this page, Python dict constructed from a JSON string cannot have tuples:
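A quick way to confirm this with only the standard library: json.loads always decodes a JSON array to a Python list, and a tuple passed to json.dumps is serialized as an array, so it comes back as a list after a round trip. The sample payloads below are made up for illustration:

```python
import json

# JSON arrays always decode to lists, at any nesting depth.
parsed = json.loads('{"tags": ["a", "b"], "nested": {"ids": [1, 2]}}')
print(type(parsed["tags"]).__name__)
print(type(parsed["nested"]["ids"]).__name__)

# A tuple is encoded as a JSON array, so it does not survive a round trip.
round_tripped = json.loads(json.dumps({"t": (1, 2)}))
print(type(round_tripped["t"]).__name__)
```

So checking only for dict and list is sufficient as long as the input comes from json.loads.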
mlflow/utils/proto_json_utils.py
for i in range(len(value)):
    if isinstance(value[i], dict):
        _merge_json_dicts(from_dict[key][i], to_dict[key][i])
    else:
        to_dict[key][i] = from_dict[key][i]
Enumerate seems to be simpler here:
for i, v in enumerate(value):
    if isinstance(v, dict):
        _merge_json_dicts(v, to_dict[key][i])
    else:
        to_dict[key][i] = v
Good point! Done
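Putting the suggestions in this thread together (.items() and enumerate), the merge could look roughly like the sketch below. This is an illustration rather than the exact code in the PR: the name merge_json_dicts and the sample dicts are hypothetical, and it assumes every key in from_dict also exists in to_dict with the same shape.

```python
def merge_json_dicts(from_dict, to_dict):
    """Overwrite values in to_dict with the matching values from from_dict.

    Assumes both dicts came from json.loads and that every key in from_dict
    is present in to_dict with the same nesting.
    """
    for key, value in from_dict.items():
        if isinstance(value, dict):
            # Nested message: merge recursively, keeping other keys in to_dict.
            merge_json_dicts(value, to_dict[key])
        elif isinstance(value, list):
            # Repeated field: merge element by element.
            for i, v in enumerate(value):
                if isinstance(v, dict):
                    merge_json_dicts(v, to_dict[key][i])
                else:
                    to_dict[key][i] = v
        else:
            to_dict[key] = value
    return to_dict


# from_dict holds only the int64 fields as numbers; to_dict is the full
# JSON dict in which the int64 fields are strings.
from_dict = {"creation_timestamp": 1626000000000, "versions": [{"ts": 5}], "ids": [1, 2]}
to_dict = {
    "name": "m",
    "creation_timestamp": "1626000000000",
    "versions": [{"ts": "5", "stage": "None"}],
    "ids": ["1", "2"],
}
merged = merge_json_dicts(from_dict, to_dict)
```

Fields that appear only in to_dict ("name" and "stage" here) are left untouched, which is the property the jsonmerge package reportedly failed to preserve.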
@harupy FYI
    return json_dict


def _merge_json_dicts(from_dict, to_dict):
Hi @lichenran1234, can we add a test for _merge_json_dicts to make sure it works properly?
Thanks for the comment, Harupy! Usually we don't write unit tests for private functions. Especially for this _merge_json_dicts function: the code inside it should be embedded inside the message_to_json function, but I'm extracting it out for readability. So I guess we should follow the Test Behavior, Not Implementation principle.
Can you think of more test cases for message_to_json if you are worried that _merge_json_dicts may not work properly?
I agree with @lichenran1234 that we shouldn't test private functions. But perhaps add more types of fields to the unit test to verify that the public API works for all kinds of fields? For example, we might want to verify that the following aspects are correctly translated:
- default values (I worry about this one)
- extensions (and this one)
- proto maps
- oneof
- enums
@lichenran1234 Makes sense to not test _merge_json_dicts, thanks for the knowledge!
Thanks @harupy and @jinzhang21! I added a new test proto message so I can test this function extensively. I also added support for proto maps.
@@ -0,0 +1,55 @@
syntax = "proto2";
Is it possible to move this file under the tests directory?
Good point! Done!
Thanks a lot for addressing this issue, @lichenran1234 ! Could you please move the test protos to mlflow/tests/protos as @harupy suggested? The rest LGTM! Feel free to merge after the change.
LGTM!
Signed-off-by: Chenran Li chenran.li@databricks.com
What changes are proposed in this pull request?
According to issue #4037, the returned JSON of the endpoints has creation_timestamp and last_updated_timestamp as strings, not numbers. It's different from what was documented in the official doc.
The reason is we are calling Google's MessageToJson API to convert protobuf to JSON, which implicitly converts int64/fixed64/uint64 fields to strings. And they claimed it's a feature, not a bug (see the discussion).
According to the bug reporter, this bug doesn't exist in Azure ML mlflow server (which is essentially our Databricks mlflow server). That's because we are using ScalaPB's ToJson() API for all the Databricks endpoints, and it doesn't convert int64 to string.
There is no way to make the MessageToJson API not convert int64 to strings. Nor are there any other good Python proto-to-json libraries. So to fix this bug, we have to choose from:
MessageToJson
MessageToJson API
How is this patch tested?
unit tests
Release Notes
Is this a user-facing change?
(Details in 1-2 sentences. You can just refer to another PR with a description if this PR is part of a larger change.)
What component(s), interfaces, languages, and integrations does this PR affect?
Components
- area/artifacts: Artifact stores and artifact logging
- area/build: Build and test infrastructure for MLflow
- area/docs: MLflow documentation pages
- area/examples: Example code
- area/model-registry: Model Registry service, APIs, and the fluent client calls for Model Registry
- area/models: MLmodel format, model serialization/deserialization, flavors
- area/projects: MLproject format, project running backends
- area/scoring: MLflow Model server, model deployment tools, Spark UDFs
- area/server-infra: MLflow Tracking server backend
- area/tracking: Tracking Service, tracking client APIs, autologging
Interface
- area/uiux: Front-end, user experience, plotting, JavaScript, JavaScript dev server
- area/docker: Docker use across MLflow's components, such as MLflow Projects and MLflow Models
- area/sqlalchemy: Use of SQLAlchemy in the Tracking Service or Model Registry
- area/windows: Windows support
Language
- language/r: R APIs and clients
- language/java: Java APIs and clients
- language/new: Proposals for new client languages
Integrations
- integrations/azure: Azure and Azure ML integrations
- integrations/sagemaker: SageMaker integrations
- integrations/databricks: Databricks integrations
How should the PR be classified in the release notes? Choose one:
- rn/breaking-change - The PR will be mentioned in the "Breaking Changes" section
- rn/none - No description will be included. The PR will be mentioned only by the PR number in the "Small Bugfixes and Documentation Updates" section
- rn/feature - A new user-facing feature worth mentioning in the release notes
- rn/bug-fix - A user-facing bug fix worth mentioning in the release notes
- rn/documentation - A user-facing documentation change worth mentioning in the release notes