Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUGFIX] Fix copying of CheckpointConfig for substitution and printing purposes #3759

Conversation

alexsherstinsky
Copy link
Contributor

@alexsherstinsky alexsherstinsky commented Nov 30, 2021

Scope

This work fixes a bug in CheckpointConfig serialization, specifically the loss of batch_data as the side effect.

The implementation consists of:

  • Overriding CheckpointConfigSchema.dump(),
  • Overriding CheckpointConfig.__deepcopy__(), and
  • Implementing CheckpointConfig.__repr__() and CheckpointResult.__repr__() methods

in the way that:

  1. Extracts the batch_data values from the CheckpointConfiguration and saving them.
  2. Deleting the batch_data values from the CheckpointConfiguration and cloning/serializing the resulting "safe" configuration object.
  3. Restoring the batch_data values into the cloned and/or original configuration object.

Additional clean up consists of:

  • Fixing test(s) that returned incorrect results due to the previous implementation.
  • Making the get_runtime_batch_request() Checkpoint utility more elegant.

As the result of the above fixes, the impacted modules became simpler, and the amount of code has been reduced.

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

  • JIRA: ANT-20/ANT-22/ANT-57

After submitting your PR, CI checks will run and @tiny-tim-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

@netlify
Copy link

netlify bot commented Nov 30, 2021

✔️ Deploy Preview for niobium-lead-7998 ready!

🔨 Explore the source changes: e769faa

🔍 Inspect the deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/61a7a481dd5b66000704eb93

😎 Browse the preview: https://deploy-preview-3759--niobium-lead-7998.netlify.app

@github-actions
Copy link
Contributor

HOWDY! This is your friendly 🤖 CHANGELOG bot 🤖

Please don't forget to add a clear and succinct description of your change under the Develop header in docs_rtd/changelog.rst, if applicable. This will help us with the release process. See the Contribution checklist in the Great Expectations documentation for the type of labels to use!

Thank you!

@@ -377,13 +327,13 @@ def run(
"template_name": template_name,
"run_name_template": run_name_template,
"expectation_suite_name": expectation_suite_name,
"expectation_suite_ge_cloud_id": expectation_suite_ge_cloud_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why was this line removed?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was just moved to the bottom (new line 336)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner I simply moved it to a "more logical place" congruent with the order these arguments were supplied on input to the method.

@@ -11,6 +10,9 @@
from great_expectations.data_context.types.base import (
CheckpointConfig,
CheckpointConfigSchema,
delete_runtime_parameters_batch_data_references_from_config,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could/should these not be methods on the config object?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner In fact, even better: in the next commit, they will be moved to batch.py core module.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alexsherstinsky, if that is the case, could I ask that individual tests for the methods be added in the next commit (or is it PR?)

@@ -207,26 +207,11 @@ def send_email(
# TODO: <Alex>A common utility function should be factored out from DataContext.get_batch_list() for any purpose.</Alex>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this TODO still valid?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner Unfortunately, this comment is still valid -- we have not had a chance to do the work. I would like to leave it for now and remove only after we had a chance to go through the backlog formally. Thank you.

Copy link
Contributor

@NathanFarmer NathanFarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a few questions!

if (batch_data is not None) and (
self.batch_request.get("runtime_parameters") is not None

def delete_runtime_parameters_batch_data_references_from_config(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

val["batch_request"]["runtime_parameters"].pop("batch_data")


def restore_runtime_parameters_batch_data_references_into_config(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment on lines 2338 to 2361
def __deepcopy__(self, memo):
batch_data_references: Tuple[
Optional[Any], Optional[List[Any]]
] = get_runtime_parameters_batch_data_references_from_config(config=self)
delete_runtime_parameters_batch_data_references_from_config(config=self)

cls = self.__class__
result = cls.__new__(cls)
result._commented_map = CommentedMap()

memo[id(self)] = result
for key, value in self.to_json_dict().items():
# noinspection PyArgumentList
value_copy = copy.deepcopy(value, memo)
setattr(result, key, value_copy)

restore_runtime_parameters_batch_data_references_into_config(
config=self, batch_data_references=batch_data_references
)
restore_runtime_parameters_batch_data_references_into_config(
config=result, batch_data_references=batch_data_references
)

return result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a custom __deepcopy__ not be avoided here? Why isn't this something we can do where required, like we are doing in other places using these new functions in this PR?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NathanFarmer This custom __deepcoppy__ enables us to simplify / delete a lot of code. It fits, because it "knows" how to handle the copying for the (complex) CheckpointConfig object.

@@ -2337,6 +2337,7 @@ def test_newstyle_checkpoint_config_substitution_nested(
CheckpointConfig(
name="my_nested_checkpoint",
config_version=1,
template_name="my_nested_checkpoint_template_3",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a new test to be added for the bug you referenced?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NathanFarmer This test has effectively been added by repairing some existing tests. In addition, the functionality will be fully manifested in a near-future PR that focuses on checkpoint.run payload instrumentation, which is how the bug was detected in the first place (the batch_data values were empty in the runtime_parameter dictionary). Thanks.


return json_dict
def to_json_dict(self) -> dict:
return checkpointResultSchema.dump(self)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1

Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me aside from resolving Don and Nathan's comments :)

@@ -377,13 +327,13 @@ def run(
"template_name": template_name,
"run_name_template": run_name_template,
"expectation_suite_name": expectation_suite_name,
"expectation_suite_ge_cloud_id": expectation_suite_ge_cloud_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like it was just moved to the bottom (new line 336)

Copy link
Contributor

@Shinnnyshinshin Shinnnyshinshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you very much for this work @alexsherstinsky. Looks good on my end too, with one clarification question

@@ -361,8 +361,8 @@ def _validate_runtime_batch_request_specific_init_parameters(
):
if not (runtime_parameters and (isinstance(runtime_parameters, dict))):
raise TypeError(
f"""The type for runtime_parameters must be a dict object.
The type given is "{str(type(runtime_parameters))}", which is illegal."""
f"""The runtime_parameters must be a non-empty dict object.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful error message. Thank you for this

val["batch_request"]["runtime_parameters"].pop("batch_data")


def restore_runtime_parameters_batch_data_references_into_config(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (batch_data is not None) and (
self.batch_request.get("runtime_parameters") is not None

def delete_runtime_parameters_batch_data_references_from_config(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -11,6 +10,9 @@
from great_expectations.data_context.types.base import (
CheckpointConfig,
CheckpointConfigSchema,
delete_runtime_parameters_batch_data_references_from_config,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @alexsherstinsky, if that is the case, could I ask that individual tests for the methods be added in the next commit (or is it PR?)

Comment on lines +224 to +233
effective_batch_request: dict = dict(
**runtime_config_batch_request, **validation_batch_request
)
if "runtime_parameters" in effective_batch_request or isinstance(
validation_batch_request, RuntimeBatchRequest
):
batch_request_class = RuntimeBatchRequest
else:
batch_request_class = BatchRequest

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Much better!

Alex Sherstinsky added 4 commits November 30, 2021 20:21
…NT-22/alexsherstinsky/instrument_runtime_data_connector_for_usage_statistics-2021_11_23-10-overload_deepcopy_and_dump-2021_11_29-11
@alexsherstinsky
Copy link
Contributor Author

Tests have been added for new utilities.

In addition, BatchRequest serialization has also been dramatically simplified using these new utilities.

Copy link
Contributor

@NathanFarmer NathanFarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is a thing of beauty. Just one question about a TODO.

config=data, batch_data_references=batch_data_references
)
return data


class CheckpointConfig(BaseYamlConfig):
# TODO: <Alex>ALEX (does not work yet)</Alex>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@NathanFarmer We still need those comments -- the idea was good, but it did not work, and we ought to figure out how to make it work -- or discuss it and conclude that the current approach is the best that can be done. Thanks!

…trument_runtime_data_connector_for_usage_statistics-2021_11_23-10-overload_deepcopy_and_dump-2021_11_29-11
Copy link
Contributor

@NathanFarmer NathanFarmer left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a wonderful refactor 😄 LGTM!

Copy link
Contributor

@Shinnnyshinshin Shinnnyshinshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wonderful work 🙇🏼 thank you @alexsherstinsky

@alexsherstinsky alexsherstinsky merged commit 954da9f into develop Dec 1, 2021
@alexsherstinsky alexsherstinsky deleted the feature/ANT-20/ANT-22/alexsherstinsky/instrument_runtime_data_connector_for_usage_statistics-2021_11_23-10-overload_deepcopy_and_dump-2021_11_29-11 branch December 1, 2021 17:27
Copy link
Member

@roblim roblim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! <3

Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

] = validations_batch_data_list[idx]


def standardize_batch_request_display_ordering(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -40,6 +41,9 @@ def anonymize_batch_request(self, *args, **kwargs) -> Dict[str, List[str]]:
anonymized_batch_request_dict: Optional[
Union[str, dict]
] = self._anonymize_batch_request_properties(source=batch_request_dict)
anonymized_batch_request_dict = standardize_batch_request_display_ordering(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -59,3 +74,771 @@ def test_serialization_of_spark_df(spark_session):
df = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
sdf = spark_session.createDataFrame(df)
assert convert_to_json_serializable(sdf) == {"a": [1, 2, 3], "b": [4, 5, 6]}


def test_batch_request_deepcopy():
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for adding these tests!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants