Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] ParameterBuilder for Computing Average Unexpected Values Fractions for any Map Metric #4340

Conversation

alexsherstinsky
Copy link
Contributor

@alexsherstinsky alexsherstinsky commented Mar 5, 2022

Scope

The new MeanUnexpectedMapMetricMultiBatchParameterBuilder accepts any map metric as well as the required name of the ParameterBuilder that computes the total count metric and optionally the name of the ParameterBuilder that computes the number of rows with the empty value for the domain (e.g., an empty column). It then computes the unexpected counts of the map metric for each batch (e.g., "column_values.nonnull", ""column_values.unique", etc.) and then the ratios of the unexpected counts to the counts of non-null records, and finally outputs the mean of these ratios.

This result can then be used by an ExpectationConfigurationBuilder to determine whether or not this mean ratio is sufficiently low for an expectation to be emitted, which is beneficial in avoiding "validation error noise" subsequently.

Note

MeanUnexpectedMapMetricMultiBatchParameterBuilder requires the ParameterBuilder configurations for the total count and (optionally) the null count metric to appear earlier, since their results are pre-requisite for this parameter builder.

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

  • JIRA: GREAT-464/GREAT-498/GREAT-636

After submitting your PR, CI checks will run and @ge-cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

@netlify
Copy link

netlify bot commented Mar 5, 2022

✔️ Deploy Preview for niobium-lead-7998 ready!

🔨 Explore the source changes: b9eff5b

🔍 Inspect the deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/6226285443487b0008d8cef3

😎 Browse the preview: https://deploy-preview-4340--niobium-lead-7998.netlify.app

Copy link
Contributor

@Shinnnyshinshin Shinnnyshinshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you so much for this work @alexsherstinsky . the code + tests look great, but given the importance of this PR in the direction we are "nudging' it to go (ie more like a formal standard machine learning component), I think a synchronous discussion would really help. Would we be able to schedule some time on Monday?

@alexsherstinsky
Copy link
Contributor Author

Thank you so much for this work @alexsherstinsky . the code + tests look great, but given the importance of this PR in the direction we are "nudging' it to go (ie more like a formal standard machine learning component), I think a synchronous discussion would really help. Would we be able to schedule some time on Monday?

@Shinnnyshinshin Happy to offer my half-baked ideas. Although this PR in itself is only a hint at what a proper data pipeline should be. In particular, this is the first time we have a ParameterBuilder, which requires other parameters to be executed first, because it uses the values that they compute. As such, this is the first clear illustration of the capabilities of the current architecture, where ParameterContainter with its ParameterNode constituents acts as the holder of the State of a Rule during its execution lifecycle. This was envisioned almost a year ago, implemented last summer, and finally utilized. Thanks!

Alex Sherstinsky and others added 2 commits March 4, 2022 23:32
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
Alex Sherstinsky added 2 commits March 7, 2022 07:43
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
…EAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39' into feature/GREAT-464/GREAT-498/GREAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!


nonnull_count_values: np.ndarray = total_count_values - null_count_values

# Compute "unexpected_count" corresponding to "map_metric_name" (given as argument to this "ParameterBuilder").
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for these helpful comments.

metric_value_kwargs: Optional[Union[str, dict]] = None,
batch_list: Optional[List[Batch]] = None,
batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None,
json_serialize: Union[str, bool] = True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why can this be a string as well?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdkini Because of the $parameter syntax, potentially used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it's the responsibility of the parameter builder to determine the true value here? That value can't be pre-computed so we can narrow the type to bool?

parameter_values: Dict[str, Any] = {
self.fully_qualified_parameter_name: {
"value": convert_to_json_serializable(data=computed_parameter_value),
"value": convert_to_json_serializable(data=computed_parameter_value)
if json_serialize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be a string? If so, it'll always result in True. Are we okay with that?

Comment on lines +175 to +177
@property
def json_serialize(self) -> bool:
return self._json_serialize
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The input type is Union[bool ,str]. If we pass in a string, won't this be erroneous?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdkini The $parameter syntax requires this Union for now -- hope we can simplify later.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But what happens if I pass in a string value from the constructor? Per the __init__, it seem as though we're assigning the raw value to the attribute without any additional logic.

Doesn't this create a situation where we could possibly be returning a string from this bool property?

Comment on lines +28 to +35
_: MeanUnexpectedMapMetricMultiBatchParameterBuilder = (
MeanUnexpectedMapMetricMultiBatchParameterBuilder(
name="my_name",
map_metric_name="column_values.nonnull",
total_count_parameter_builder_name="my_total_count",
data_context=data_context,
)
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick - we don't even need assignment here right? Perfectly fine to keep this but we should probably have a consistent approach since I tend to omit assignment when I'm just testing if the behavior errors/logs/etc.

@alexsherstinsky alexsherstinsky merged commit bb306e0 into develop Mar 7, 2022
@alexsherstinsky alexsherstinsky deleted the feature/GREAT-464/GREAT-498/GREAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39 branch March 7, 2022 17:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants