Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow serialization of a subset of params #338

Merged
merged 7 commits into from
Sep 26, 2021

Conversation

fidelak
Copy link
Contributor

@fidelak fidelak commented Sep 18, 2021

Addresses #337

This is my first time contributing to a project. I'd be happy to hear any critiques

@edublancas
Copy link
Contributor

Hi, thanks for your contribution.

You're on the right track, but there there are a few details:

  • The codediffer.py's job is to compare parameters, so it shouldn't edit metadata, all necessary changes should take place in metadata.py
  • Instead of marking unserializable properties I'd say it's best to ignore them and send a warning
  • no need to add tests to test_dag.py, we can test everything in test_metadata.py

So here are my thoughts:

  • Add a function that ignores any top-level key that has at least one non-serializable object and send a warning
  • Call such function in Metadata.update
    def update(self, source_code, params):
  • Test that Metadata.update ignores non-serializable objects and emits the warning, tests would be similar to this one (
    def test_update_with_resource(tmp_directory):
    ), except we need to test several test cases (using @pytest.mark.parametrize).

Here's some code I quickly put together that should get you started:

import warnings
from copy import copy
from ploomber.env.expand import iterate_nested_dict


def is_json_serializable(o):
    # per:
    # https://github.com/python/cpython/blob/dea59cf88adf5d20812edda330e085a4695baba4/Lib/json/encoder.py#L73
    return isinstance(o,
                      (dict, list, tuple, str, int, float, bool)) or o is None


d = {"a": 1, "b": 2}
d = {"foo": ["bar", object()], "stuff": object(), "number": 1}
d = {"foo": ["bar", {"another": object()}]}
d = {"foo": [1, 2, object(), object()]}


def remove_non_serializable_top_keys(d):
    out = copy(d)

    # this iterates over every value in a possibly nested dictionary
    for (_, _, current_val, preffix) in iterate_nested_dict(d):
        if not is_json_serializable(current_val):
            top_key = preffix[0]
            if top_key in out:
                del out[top_key]
                warnings.warn(f'Top-key {top_key!r} contains unserializable '
                              f'object: {current_val!r}. The key will be '
                              'ignored.')

    return out


remove_non_serializable_top_keys(d)

Let me know if you're ok with these changes. Feel free to edit this PR or open a new one.

@fidelak
Copy link
Contributor Author

fidelak commented Sep 21, 2021

@edublancas Thanks for the feedback. There are a couple of things I'd like to ask you about.

  1. Your is_json_serializable: I had thought of just doing this, but I was unsure if any of the metadata parameters might have been objects with custom json encoding, which is why I opted to try json.dumps on each top level key instead. Am I overthinking this?

  2. I initially thought of just removing the top level keys, but then codediffer.py would always result in outdated_params=True for products that contained non-serializable parameters, so these products would always want to be updated. We will have to import remove_non_serializable_top_keys into codediffer.py and apply it to b_params in is_different, like so:

        if a_params is None or b_params is None:
            outdated_params = False
        else:
            params_to_compare = remove_non_serializable_top_keys(b_params)
            outdated_params = (a_params != params_to_compare)

        result = outdated_params or (a_norm != b_norm)

Is this correct?

Thanks for taking the time to help with this

@edublancas
Copy link
Contributor

  1. That's a reasonable observation, I thought about that too. However, we currently do not support custom serializers because we're calling json.dumps directly, which uses the default JSONEncoder (see here), such encoder directly tests for the type of object to serialize (see here). So I think we're good. Maybe in the future, we could support custom serializers but let's keep it simple for now
  2. Oh, good catch! I didn't think of that. But yes, you're right. We'd have to remove top-level keys using remove_non_serializable_top_keys in codediffer.py, then we should add some tests to test_codediffer.py to make sure it ignores non-serializable parameters.

We're on the same page now :)

@fidelak
Copy link
Contributor Author

fidelak commented Sep 22, 2021

@edublancas I've made the recommended changes and tested everything.

I put the remove_serializable_top_keys function in its own module ploomber/src/ploomber/products/serializeparams.py, since it's being used in metadata.py and codediffer.py. Let me know if you'd prefer it to be somewhere else.

@edublancas edublancas changed the base branch from master to params September 26, 2021 20:09
@edublancas edublancas merged commit a59ea30 into ploomber:params Sep 26, 2021
@edublancas
Copy link
Contributor

Thanks a lot for your contribution! Feel free to contribute/ask questions on any other open issues!

edublancas pushed a commit that referenced this pull request Sep 27, 2021
* Allow serialization of a subset of params

* Allow serialization of a subset of params #337

* parameter names

* changes based on suggestions

* formatting
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants