bytes data type not JSON serializable #158

Closed
richard-to opened this issue Apr 26, 2024 · 12 comments
Labels: bug (Something isn't working), good first issues (Good for first-time contributors)

Comments

@richard-to
Collaborator

If you store bytes in Mesop state, the state does not appear to be JSON serializable by default.
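For reference, the underlying behavior can be reproduced with the standard library alone. This is only a sketch: the `State` dataclass below is a hypothetical stand-in, not Mesop's own state class, and it assumes the state is serialized with `json.dumps` over `asdict(...)` without a custom encoder.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class State:
    # Hypothetical state class holding raw bytes.
    data: bytes = b""


try:
    # The default JSON encoder has no representation for bytes.
    json.dumps(asdict(State(data=b"\x00\x01")))
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The `TypeError` raised here is what a custom encoder needs to prevent.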

@richard-to richard-to added the bug Something isn't working label Apr 26, 2024
@richard-to richard-to added the good first issues Good for first-time contributors label Jun 8, 2024
@zacharias1219
Contributor

Hello @richard-to, I would like to contribute to this issue. This is my first time, but I'm fairly confident in my skills to help this project.

@zacharias1219
Contributor

zacharias1219 commented Jun 9, 2024

What if we use a JSON encoder and edit the dataclass_utils.py file to call a function like this one, so that bytes become JSON serializable?

import json
import base64

def decode_bytes(dct):
    for key, value in dct.items():
        if isinstance(value, str) and value.startswith('base64:'):
            dct[key] = base64.b64decode(value[7:])  # Remove 'base64:' prefix
    return dct
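One property of a prefix-based scheme like this is worth seeing in a small round trip. The sketch below is an illustration only: `encode_bytes` is a hypothetical counterpart that does not appear in the thread, and `decode_bytes` is rewritten here to return a new dict rather than mutating its input. It shows that an ordinary string which happens to start with `base64:` is indistinguishable from encoded bytes once decoded.

```python
import base64

PREFIX = "base64:"


def encode_bytes(dct):
    # Hypothetical encoder: prefix base64 text so the decoder can spot it.
    return {
        key: PREFIX + base64.b64encode(value).decode("ascii")
        if isinstance(value, bytes)
        else value
        for key, value in dct.items()
    }


def decode_bytes(dct):
    # Any string carrying the prefix is treated as encoded bytes.
    return {
        key: base64.b64decode(value[len(PREFIX):])
        if isinstance(value, str) and value.startswith(PREFIX)
        else value
        for key, value in dct.items()
    }


state = {"blob": b"hi", "note": "base64:aGk="}
round_tripped = decode_bytes(encode_bytes(state))
print(round_tripped)  # "note" comes back as bytes, not as the original str
```

Wrapping the value in a dictionary with a reserved key avoids this particular ambiguity for plain strings.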

@richard-to
Collaborator Author

Hi @zacharias1219,

Thanks for wanting to contribute to Mesop. You're certainly welcome to contribute.

Yes, I agree with your approach of serializing the bytes to base64.

In terms of implementation, I think you can follow a similar strategy to what we did with the Pandas serialization, which was to convert the incompatible data type to a dictionary.

Example for bytes:

{"__python.bytes._": "base64-encoded-byte-string"}

We also have some other data types that need custom encoding: #387. So we want to create a consistent pattern.


We have a MesopJSONEncoder that you can add to.

We also have decode_mesop_json_state_hook for the decoding.
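As a rough illustration of the pattern described above (wrap the unsupported type in a tagged dictionary on encode, unwrap it in an object hook on decode), here is a minimal standalone sketch. The names `BytesAwareEncoder`, `decode_hook`, and `BYTES_KEY` are hypothetical and are not Mesop's actual `MesopJSONEncoder` or `decode_mesop_json_state_hook`; the exact key string is also an assumption.

```python
import base64
import json

BYTES_KEY = "__python.bytes.__"  # Assumed tag; the real key may differ.


class BytesAwareEncoder(json.JSONEncoder):
    def default(self, obj):
        # Called only for objects json cannot serialize on its own.
        if isinstance(obj, bytes):
            return {BYTES_KEY: base64.b64encode(obj).decode("ascii")}
        return super().default(obj)


def decode_hook(dct):
    # json.loads calls this for every decoded JSON object, innermost first.
    if BYTES_KEY in dct:
        return base64.b64decode(dct[BYTES_KEY])
    return dct


original = {"name": "logo", "blob": b"\x00\xffhi"}
payload = json.dumps(original, cls=BytesAwareEncoder)
restored = json.loads(payload, object_hook=decode_hook)
```

Because the hook runs on the innermost objects first, the tagged dictionary is already replaced by `bytes` by the time the enclosing object is processed.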


For getting started, you can start here: https://google.github.io/mesop/internal/contributing/ and here: https://google.github.io/mesop/internal/development/

We also recently set up GitHub Codespaces, which you can test out. There are no instructions yet since we just added it, and it probably needs some testing to work out the kinks. But it's available.

@richard-to
Collaborator Author

Also forgot to mention. For running the unit tests for the changes here, you can do:

bazel test //mesop/dataclass_utils:dataclass_utils_test

There should be a test file there that you can add your tests to.

@zacharias1219
Contributor

zacharias1219 commented Jun 10, 2024

Hey, thank you for commenting. I have an idea that I'll post here:

mesop/dataclass_utils/dataclass_utils.py

def default(self, obj):
    if isinstance(obj, bytes):
        return {"__python.bytes.__": base64.b64encode(obj).decode('utf-8')}
    if isinstance(obj, set):
        return {"__python.set.__": list(obj)}  # Convert set to list for JSON
    if isinstance(obj, tuple):
        return {"__python.tuple.__": list(obj)} # Convert tuple to list for JSON
    if isinstance(obj, datetime):
        return {"__python.datetime.__": obj.isoformat()} # ISO format for datetime
    # ... (existing code for Pandas DataFrame) ...
    return super().default(obj)

def decode_mesop_json_state_hook(dct):
    # ... (existing docstring) ...
    if "__python.bytes.__" in dct:
        return base64.b64decode(dct["__python.bytes.__"])
    if "__python.set.__" in dct:
        return set(dct["__python.set.__"])
    if "__python.tuple.__" in dct:
        return tuple(dct["__python.tuple.__"])
    if "__python.datetime.__" in dct:
        return datetime.fromisoformat(dct["__python.datetime.__"])
    # ... (existing code for Pandas DataFrame) ...
    return dct
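One detail of the standard library is relevant to the tuple branch above: `json.JSONEncoder.default` is only called for objects the encoder cannot already serialize, and tuples are serialized natively as JSON arrays. The sketch below uses a hypothetical `TaggingEncoder` (not Mesop code) to show that the set branch fires while the tuple branch does not.

```python
import json


class TaggingEncoder(json.JSONEncoder):
    # Hypothetical encoder used only for this illustration.
    def default(self, obj):
        if isinstance(obj, set):
            return {"__python.set.__": sorted(obj)}
        if isinstance(obj, tuple):
            # Not reached for plain tuples: json handles them before default().
            return {"__python.tuple.__": list(obj)}
        return super().default(obj)


encoded = json.dumps({"s": {1, 2}, "t": (1, 2)}, cls=TaggingEncoder)
print(encoded)  # {"s": {"__python.set.__": [1, 2]}, "t": [1, 2]}
```

So with this approach alone, a tuple would come back from `json.loads` as a list; preserving tuples would need handling somewhere other than `default`.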

This is what I could come up with at the moment. I would appreciate any suggested changes, and if you like it I can commit it and add tests as well.

@zacharias1219
Contributor

And sorry for editing multiple times, this is my first time with the issues interface on GitHub.

@richard-to
Collaborator Author

No problem about editing multiple times. Perfectly fine.

As for your proposed changes, they look fine at first glance. I would recommend starting with just bytes. Let's keep the number of changes small and self-contained. That is easier for the reviewer and allows you to test more thoroughly.

So feel free to post a pull request for bytes. Once you get that merged in you can add the other ones too.

A few notes:

We recently added a State Diff performance improvement. This covers some serialization stuff, so you'll need to make sure the following tests pass and have coverage for your change:

  • mesop/web/src/utils/diff_state_spec.ts (bazel test //mesop/web/src/utils:unit_tests)
  • mesop/dataclass_utils/diff_state_test.py (bazel test //mesop/dataclass_utils:dataclass_state_test)
  • mesop/dataclass_utils/dataclass_utils.py (bazel test //mesop/dataclass_utils:dataclass_utils_test)

@richard-to
Collaborator Author

Just FYI @zacharias1219: it looks like someone posted a pull request for the set/tuple/datetime issue #387, so you'll only need to do the bytes. Sorry about that. Next time I'll do a better job marking issues as assigned.

@zacharias1219
Contributor

Hey this is what I came up with

import base64  # Used to encode bytes as a JSON-safe string
import json
from dataclasses import asdict, dataclass, field, is_dataclass
from io import StringIO
from typing import Any, Type, TypeVar, cast, get_origin, get_type_hints

from deepdiff import DeepDiff, Delta
from deepdiff.operator import BaseOperator
from deepdiff.path import parse_path

from mesop.exceptions import MesopException

_PANDAS_OBJECT_KEY = "__pandas.DataFrame__"
_BYTES_OBJECT_KEY = "__python.bytes.__"  # New key for bytes
_DIFF_ACTION_DATA_FRAME_CHANGED = "data_frame_changed"

C = TypeVar("C")

def _check_has_pandas():
  """Checks if pandas exists since it is an optional dependency for Mesop."""
  try:
    import pandas  # noqa: F401

    return True

  except ImportError:
    return False

_has_pandas = _check_has_pandas()

def dataclass_with_defaults(cls: Type[C]) -> Type[C]:
  """
  Provides defaults for every attribute in a dataclass (recursively) so
  Mesop developers don't need to manually set default values
  """
  pass
  annotations = get_type_hints(cls)
  for name, type_hint in annotations.items():
    if name not in cls.__dict__:  # Skip if default already set
      if type_hint == int:
        setattr(cls, name, field(default=0))
      elif type_hint == float:
        setattr(cls, name, field(default=0.0))
      elif type_hint == str:
        setattr(cls, name, field(default=""))
      elif type_hint == bool:
        setattr(cls, name, field(default=False))
      elif get_origin(type_hint) == list:
        setattr(cls, name, field(default_factory=list))
      elif get_origin(type_hint) == dict:
        setattr(cls, name, field(default_factory=dict))
      elif isinstance(type_hint, type):
        setattr(
          cls, name, field(default_factory=dataclass_with_defaults(type_hint))
        )

  return dataclass(cls)

def serialize_dataclass(state: Any):
  if is_dataclass(state):
    json_str = json.dumps(asdict(state), cls=MesopJSONEncoder)
    return json_str
  else:
    raise MesopException("Tried to serialize state which was not a dataclass")

def update_dataclass_from_json(instance: Any, json_string: str):
  data = json.loads(json_string, object_hook=decode_mesop_json_state_hook)
  _recursive_update_dataclass_from_json_obj(instance, data)

def _recursive_update_dataclass_from_json_obj(instance: Any, json_dict: Any):
  for key, value in json_dict.items():
    if hasattr(instance, key):
      attr = getattr(instance, key)
      if isinstance(value, dict):
        # If the value is a dict, recursively update the dataclass.
        setattr(
          instance,
          key,
          _recursive_update_dataclass_from_json_obj(attr, value),
        )
      elif isinstance(value, list):
        updated_list: list[Any] = []
        for item in cast(list[Any], value):
          if isinstance(item, dict):
            # If the json item value is an instance of dict,
            # we assume it should be converted into a dataclass
            attr = getattr(instance, key)
            item_instance = instance.__annotations__[key].__args__[0]()
            updated_list.append(
              _recursive_update_dataclass_from_json_obj(item_instance, item)
            )
          else:
            # If the item is not a dict, append it directly.
            updated_list.append(item)
        setattr(instance, key, updated_list)
      else:
        # For other types, set the value directly.
        setattr(instance, key, value)
  return instance

class MesopJSONEncoder(json.JSONEncoder):
  """
  Custom JSON Encoder to handle special serialization cases.

  Since we support Pandas DataFrames in the Mesop table, users may need to store the
  the DataFrames in Mesop State. This means we need a way to serialize the DataFrame to
  JSON and back.

  For simplicity we will convert the DataFrame to JSON within the JSON serialized state.
  This makes it so we don't have to worry about serializing other data types used by
  Pandas. The "table" serialization format is verbose, but will ensure the most accurate
  deserialization back into a DataFrame.
  """

  def default(self, obj):
    try:
      import pandas as pd

      if isinstance(obj, pd.DataFrame):
        return {_PANDAS_OBJECT_KEY: pd.DataFrame.to_json(obj, orient="table")}
    except ImportError:
      pass

    if isinstance(obj, bytes):
      # Encode bytes as a base64 string wrapped in a tagged dict
      return {_BYTES_OBJECT_KEY: base64.b64encode(obj).decode('utf-8')}

    if is_dataclass(obj):
      return asdict(obj)

    if isinstance(obj, type):
      return str(obj)

    return super().default(obj)

def decode_mesop_json_state_hook(dct):
  """
  Object hook to decode JSON for Mesop state.

  Since we support Pandas DataFrames in the Mesop table, users may need to store the
  the DataFrames in Mesop State. This means we need a way to serialize the DataFrame to
  JSON and back.

  One thing to note is that pandas.NA becomes numpy.nan during deserialization.
  """
  if _has_pandas:
    import pandas as pd

    if _PANDAS_OBJECT_KEY in dct:
      return pd.read_json(StringIO(dct[_PANDAS_OBJECT_KEY]), orient="table")

  if _BYTES_OBJECT_KEY in dct:
    return base64.b64decode(dct[_BYTES_OBJECT_KEY])  # Decode bytes

  return dct

class DataFrameOperator(BaseOperator):
  """Custom operator to detect changes in DataFrames.

  DeepDiff does not support diffing DataFrames. See seperman/deepdiff#394.

  This operator checks if the DataFrames are equal or not. It does not do a deep diff of
  the contents of the DataFrame.
  """

  def match(self, level) -> bool:
    try:
      import pandas as pd

      return isinstance(level.t1, pd.DataFrame) and isinstance(
        level.t2, pd.DataFrame
      )
    except ImportError:
      # If Pandas is not installed, don't perform this check. We should log a warning.
      return False

  def give_up_diffing(self, level, diff_instance) -> bool:
    if not level.t1.equals(level.t2):
      diff_instance.custom_report_result(
        _DIFF_ACTION_DATA_FRAME_CHANGED, level, {"value": level.t2}
      )
    return True

def diff_state(state1: Any, state2: Any) -> str:
  """
  Diffs two state objects and returns the difference using DeepDiff's Delta format as a
  JSON string.

  DeepDiff does not support DataFrames yet. See DataFrameOperator.

  The to_flat_dicts method does not include custom report results, so we need to add
  those manually for the DataFrame case.
  """
  if not is_dataclass(state1) or not is_dataclass(state2):
    raise MesopException("Tried to diff state which was not a dataclass")

  custom_actions = []

  # Only use the DataFrameOperator if pandas exists.
  if _has_pandas:
    differences = DeepDiff(
      state1, state2, custom_operators=[DataFrameOperator()]
    )

    # Manually format dataframe diffs to flat dict format.
    if _DIFF_ACTION_DATA_FRAME_CHANGED in differences:
      custom_actions = [
        {
          "path": parse_path(path),
          "action": _DIFF_ACTION_DATA_FRAME_CHANGED,
          **diff,
        }
        for path, diff in differences[_DIFF_ACTION_DATA_FRAME_CHANGED].items()
      ]
  else:
    differences = DeepDiff(state1, state2)

  return json.dumps(
    Delta(differences, always_include_values=True).to_flat_dicts()
    + custom_actions,
    cls=MesopJSONEncoder,
  )

@richard-to
Collaborator Author

Can you create a pull request with your changes? It will be easier for us to review the changes that way. Thanks.

@zacharias1219
Contributor

I have created it.

@richard-to
Collaborator Author

Ok. Great. We'll take a look tomorrow morning (for us). Thanks a lot.
