Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] V1 Checkpoint #9590

Merged
merged 30 commits into from Mar 19, 2024
Merged

[FEATURE] V1 Checkpoint #9590

merged 30 commits into from Mar 19, 2024

Conversation

cdkini
Copy link
Member

@cdkini cdkini commented Mar 7, 2024

The V1 Checkpoint is a significant simplification of the v0.18 model. It now only takes two primary input params (validation definitions and actions) and utilizes Pydantic.

The complexity within this PR stems from the need for custom serialization logic; due to the shared usage of models like batch definitions and expectation suites, we need to ensure we denormalize our data to obtain a single source of truth about an object's state. We do this utilizing our new pattern of serializing directly into identifiers (which can be used to index into the appropriate store to retrieve the object in its accurate state).

  • Description of PR changes above includes a link to an existing GitHub issue
  • PR title is prefixed with one of: [BUGFIX], [FEATURE], [DOCS], [MAINTENANCE], [CONTRIB]
  • Code is linted - run invoke lint (uses ruff format + ruff check)
  • Appropriate tests and docs have been updated

For more information about contributing, see Contribute.

After you submit your PR, keep the page open and monitor the statuses of the various checks made by our continuous integration process at the bottom of the page. Please fix any issues that come up and reach out on Slack if you need help. Thanks for contributing!

Copy link

netlify bot commented Mar 7, 2024

Deploy Preview for niobium-lead-7998 canceled.

Name Link
🔨 Latest commit 571f245
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/65fa165ed5d5c70008826021

@@ -1110,6 +1111,16 @@ def render(self) -> None:
)
)

def serialize(self) -> _IdentifierBundle:
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved any encoding logic to the actual class being serialized (move from _encode_suite in validation config to serialize as instance method on suite).

I don't love exposing these publicly but how do we get around this? Do we want custom serializer classes to encapsulate this logic?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only serializes this object when it is used in another object right? I would call this serialize_identifier_bundle() or just identifier_bundler(). This name should be uniform across any object that needs to do this. I think this public since whatever object serializes this expects this method to exist. We could formalize it as an internal protocol.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree for suites and validation definitions but renderers and batch definitions have different outputs (_EncodedValidationData and class_name/module_name, respectively).

Comment on lines +1 to +16
from typing import Union

from great_expectations.compatibility.pydantic import (
BaseModel,
)


class _IdentifierBundle(BaseModel):
name: str
id: Union[str, None]


class _EncodedValidationData(BaseModel):
datasource: _IdentifierBundle
asset: _IdentifierBundle
batch_definition: _IdentifierBundle
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shared space with intermediate serialization models

@cdkini cdkini self-assigned this Mar 19, 2024
Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for doing this!
I've only looked at the first 2 files. I need to run an errand and I'll look more when I return.
I've left a few comments, mostly around naming.

great_expectations/checkpoint/v1_checkpoint.py Outdated Show resolved Hide resolved
great_expectations/checkpoint/v1_checkpoint.py Outdated Show resolved Hide resolved
great_expectations/checkpoint/v1_checkpoint.py Outdated Show resolved Hide resolved
great_expectations/checkpoint/v1_checkpoint.py Outdated Show resolved Hide resolved
"""

name: str
validations: List[ValidationConfig]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a list of ValidationDefinitions instead of config objects? That's what this shows: https://greatexpectations.atlassian.net/wiki/spaces/SUP/pages/917471267/Validation+Workflows

Or does the ValidationConfig need to be renamed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I think ValidationConfig does need to be renamed. That can be done in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Needs to be renamed!

f"Unable to retrieve validation config {id_bundle} from store"
)

if validation_config:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this if? With the try/except above, it looks like this will need to be defined.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

did it to appease mypy but updated to be a bit clearer

@@ -47,3 +48,22 @@ def build_batch_request(

def save(self) -> None:
self.data_asset._save_batch_config(self)

def serialize(self) -> _EncodedValidationData:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect a method called serialize on an object to serialize the object but this looks like it serializes the validation definition? Could you update the name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to serialize_validation_definition!

great_expectations/checkpoint/v1_checkpoint.py Outdated Show resolved Hide resolved
Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will look at tests after my next meeting. Another couple name comments. Thanks!

@@ -1110,6 +1111,16 @@ def render(self) -> None:
)
)

def serialize(self) -> _IdentifierBundle:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This only serializes this object when it is used in another object right? I would call this serialize_identifier_bundle() or just identifier_bundler(). This name should be uniform across any object that needs to do this. I think this public since whatever object serializes this expects this method to exist. We could formalize it as an internal protocol.

@@ -302,3 +264,16 @@ def _get_expectation_suite_and_validation_result_ids(
run_id=run_id,
)
return expectation_suite_identifier, validation_result_id

def serialize(self) -> _IdentifierBundle:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as before. I would rename to identifer_bundle().

self, serialized_checkpoint: str, expected_error: str
):
with pytest.raises(ValidationError) as e:
Checkpoint.parse_obj(serialized_checkpoint)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between parse_raw used above and parse_obj used here? I was expected the same deserialization call in both places.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

parse_raw is on a string and parse_obj is on a dict

Copy link
Contributor

@billdirks billdirks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@cdkini cdkini enabled auto-merge March 19, 2024 22:49
@cdkini cdkini added this pull request to the merge queue Mar 19, 2024
Merged via the queue into develop with commit 0246e0c Mar 19, 2024
68 checks passed
@cdkini cdkini deleted the f/v1-225/v1_checkpoint branch March 19, 2024 23:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants