[FEATURE] ParameterBuilder for Computing Average Unexpected Values Fractions for any Map Metric #4340

alexsherstinsky · 2022-03-05T00:50:16Z

Scope

The new MeanUnexpectedMapMetricMultiBatchParameterBuilder accepts any map metric (e.g., "column_values.nonnull", "column_values.unique", etc.), the required name of the ParameterBuilder that computes the total count metric, and, optionally, the name of the ParameterBuilder that computes the number of rows with the empty value for the domain (e.g., an empty column). For each batch, it computes the unexpected count of the map metric and then the ratio of that unexpected count to the count of non-null records; finally, it outputs the mean of these ratios across batches.
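The arithmetic described above can be sketched in plain Python. This is only an illustration, not the code in this PR: the function name mean_unexpected_fraction and its arguments are hypothetical, the per-batch counts are passed in directly rather than being resolved from other ParameterBuilders, and treating a batch with no non-null records as a ratio of 0.0 is an assumption made for the sketch.

```python
from statistics import mean
from typing import List, Optional, Sequence


def mean_unexpected_fraction(
    unexpected_counts: Sequence[int],
    total_counts: Sequence[int],
    null_counts: Optional[Sequence[int]] = None,
) -> float:
    """Average, over batches, of unexpected_count / nonnull_count (hypothetical helper)."""
    if null_counts is None:
        # The null count is optional; without it, every record counts as non-null.
        null_counts = [0] * len(total_counts)
    if not (len(unexpected_counts) == len(total_counts) == len(null_counts)):
        raise ValueError("All count sequences must have one entry per batch.")
    if len(total_counts) == 0:
        raise ValueError("At least one batch is required.")

    ratios: List[float] = []
    for unexpected, total, null in zip(unexpected_counts, total_counts, null_counts):
        nonnull = total - null
        # Assumption for this sketch: a batch with no non-null records contributes 0.0.
        ratios.append(unexpected / nonnull if nonnull > 0 else 0.0)
    return mean(ratios)
```

For example, two batches of 10 rows each, with 0 and 2 nulls and 1 and 2 unexpected values, give per-batch ratios of 1/10 and 2/8.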

This result can then be used by an ExpectationConfigurationBuilder to determine whether or not this mean ratio is sufficiently low for an expectation to be emitted, which is beneficial in avoiding "validation error noise" subsequently.

Note

MeanUnexpectedMapMetricMultiBatchParameterBuilder requires the ParameterBuilder configurations for the total count and (optionally) the null count metric to appear earlier, since their results are prerequisites for this parameter builder.
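Why the ordering matters can be shown with a toy model of builders that share state. Everything here is hypothetical (run_builders, the tuple-based builder representation, and the plain dict standing in for the Rule's parameter state); it is a sketch of the dependency, not the profiler's actual execution code.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-ins: a dict plays the role of the shared parameter state,
# and each "builder" is a (name, compute) pair that may read earlier results.
ParameterState = Dict[str, float]
Builder = Tuple[str, Callable[[ParameterState], float]]


def run_builders(builders: List[Builder]) -> ParameterState:
    """Run builders in list order; each one sees the values computed before it."""
    state: ParameterState = {}
    for name, compute in builders:
        state[name] = compute(state)
    return state


builders: List[Builder] = [
    ("my_total_count", lambda state: 100.0),
    ("my_null_count", lambda state: 20.0),
    # Depends on the two entries above, so it must come after them.
    ("my_nonnull_count", lambda state: state["my_total_count"] - state["my_null_count"]),
]

result = run_builders(builders)
```

If the dependent entry were listed first, its lookup of "my_total_count" would fail with a KeyError in this sketch, which is the analogue of configuring the prerequisite ParameterBuilders too late.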

Changes proposed in this pull request:

JIRA: GREAT-464/GREAT-498/GREAT-636

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

My code follows the Great Expectations style guide
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added unit tests where applicable and made sure that new and existing tests are passing.
I have run any local integration tests and made sure that nothing is broken.

netlify · 2022-03-05T00:50:21Z

✔️ Deploy Preview for niobium-lead-7998 ready!

🔨 Explore the source changes: b9eff5b

🔍 Inspect the deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/6226285443487b0008d8cef3

😎 Browse the preview: https://deploy-preview-4340--niobium-lead-7998.netlify.app

Shinnnyshinshin

Thank you so much for this work, @alexsherstinsky. The code and tests look great, but given the importance of this PR and the direction in which we are "nudging" it (i.e., more like a formal, standard machine learning component), I think a synchronous discussion would really help. Would we be able to schedule some time on Monday?

alexsherstinsky · 2022-03-05T06:52:13Z

@Shinnnyshinshin Happy to offer my half-baked ideas, although this PR in itself is only a hint at what a proper data pipeline should be. In particular, this is the first time we have a ParameterBuilder that requires other parameters to be computed first, because it uses the values that they produce. As such, this is the first clear illustration of the capabilities of the current architecture, where ParameterContainer, with its ParameterNode constituents, acts as the holder of the State of a Rule during its execution lifecycle. This was envisioned almost a year ago, implemented last summer, and is finally being utilized. Thanks!

anthonyburdi

LGTM!

anthonyburdi · 2022-03-07T16:27:19Z

...ule_based_profiler/parameter_builder/mean_unexpected_metric_multi_batch_parameter_builder.py

+
+        nonnull_count_values: np.ndarray = total_count_values - null_count_values
+
+        # Compute "unexpected_count" corresponding to "map_metric_name" (given as argument to this "ParameterBuilder").


Thank you for these helpful comments.

cdkini · 2022-03-07T16:54:09Z

...ule_based_profiler/parameter_builder/mean_unexpected_metric_multi_batch_parameter_builder.py

+        metric_value_kwargs: Optional[Union[str, dict]] = None,
+        batch_list: Optional[List[Batch]] = None,
+        batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None,
+        json_serialize: Union[str, bool] = True,


Why can this be a string as well?

@cdkini Because of the $parameter syntax, potentially used.

So it's the responsibility of the parameter builder to determine the true value here? Can't that value be pre-computed, so that we can narrow the type to bool?

cdkini · 2022-03-07T16:57:23Z

great_expectations/rule_based_profiler/parameter_builder/parameter_builder.py

        parameter_values: Dict[str, Any] = {
            self.fully_qualified_parameter_name: {
-                "value": convert_to_json_serializable(data=computed_parameter_value),
+                "value": convert_to_json_serializable(data=computed_parameter_value)
+                if json_serialize


Can this be a string? If so, it'll always result in True. Are we okay with that?
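The concern can be reproduced in isolation. In this sketch, maybe_serialize is a hypothetical stand-in for the conditional in the diff, and repr takes the place of convert_to_json_serializable; the point is only that any non-empty string, including "False" or an unresolved "$parameter" reference, is truthy in Python.

```python
from typing import Any, Union


def maybe_serialize(value: Any, json_serialize: Union[str, bool]) -> Any:
    """Hypothetical stand-in for the `... if json_serialize else ...` expression."""
    # repr() stands in for convert_to_json_serializable() in this sketch.
    return repr(value) if json_serialize else value


as_bool_false = maybe_serialize(1.5, False)                # flag honored
as_string_false = maybe_serialize(1.5, "False")            # non-empty string is truthy
as_reference = maybe_serialize(1.5, "$parameter.my_flag")  # unresolved reference is truthy too
```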

cdkini · 2022-03-07T16:58:13Z

great_expectations/rule_based_profiler/parameter_builder/parameter_builder.py

+    @property
+    def json_serialize(self) -> bool:
+        return self._json_serialize


The input type is Union[bool, str]. If we pass in a string, won't this be erroneous?

@cdkini The $parameter syntax requires this Union for now -- hope we can simplify later.

But what happens if I pass in a string value from the constructor? Per the __init__, it seems as though we're assigning the raw value to the attribute without any additional logic.

Doesn't this create a situation where we could possibly be returning a string from this bool property?
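One possible way to reconcile the two positions is to keep the raw Union on the attribute and narrow it to bool only at the point of use. The sketch below is an assumption-laden illustration, not the approach taken in this PR: SketchParameterBuilder, resolve_json_serialize, and the dict of already-computed parameters are all hypothetical, and the real "$parameter" resolution in the profiler is not modeled here.

```python
from typing import Dict, Union


class SketchParameterBuilder:
    """Hypothetical builder that stores the raw flag and annotates it honestly."""

    def __init__(self, json_serialize: Union[str, bool] = True) -> None:
        self._json_serialize = json_serialize

    @property
    def json_serialize(self) -> Union[str, bool]:
        # May still be an unresolved "$..." reference at this point.
        return self._json_serialize


def resolve_json_serialize(raw: Union[str, bool], parameters: Dict[str, object]) -> bool:
    """Narrow the raw flag to a real bool, looking up "$..." references (hypothetical)."""
    if isinstance(raw, bool):
        return raw
    if isinstance(raw, str) and raw.startswith("$"):
        resolved = parameters[raw]
        if not isinstance(resolved, bool):
            raise TypeError(f'"{raw}" resolved to {type(resolved).__name__}, expected bool.')
        return resolved
    raise TypeError(f"Unsupported json_serialize value: {raw!r}")


builder = SketchParameterBuilder(json_serialize="$parameter.my_flag")
flag = resolve_json_serialize(builder.json_serialize, {"$parameter.my_flag": False})
```

With this shape, a plain string such as "False" is rejected instead of being silently treated as True.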

cdkini · 2022-03-07T16:59:31Z

...ased_profiler/parameter_builder/test_mean_unexpected_metric_multi_batch_parameter_builder.py

+    _: MeanUnexpectedMapMetricMultiBatchParameterBuilder = (
+        MeanUnexpectedMapMetricMultiBatchParameterBuilder(
+            name="my_name",
+            map_metric_name="column_values.nonnull",
+            total_count_parameter_builder_name="my_total_count",
+            data_context=data_context,
+        )
+    )


Nitpick: we don't even need the assignment here, right? It's perfectly fine to keep this, but we should probably have a consistent approach, since I tend to omit the assignment when I'm just testing whether the behavior errors, logs, etc.

Alex Sherstinsky added 19 commits March 3, 2022 15:09

small refactor

1b1d805

Merge branch 'develop' into maintenance/GREAT-464/GREAT-498/GREAT-631…

ea1cb0a

…/alexsherstinsky/rule_based_profiler_use_same_utility_for_emitting_domain_objects-2022_03_03-35

Merge branch 'develop' into maintenance/GREAT-464/GREAT-498/GREAT-631…

332b0ca

…/alexsherstinsky/rule_based_profiler_use_same_utility_for_emitting_domain_objects-2022_03_03-35

clean up typ hints

5cccc02

refactor

522ad11

minor refactor/cleanup

f4be968

minor

eee8716

tightening up interface method usage for value-set parameter builder

7f9e33a

Merge branch 'develop' into feature/GREAT-464/GREAT-498/GREAT-632/ale…

e51d475

…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_03-37

cleanup

57c3015

clean up

a491dab

clean up -- add convenience domain generation method

53ac9a5

merge

2e03328

small refactor to add domain building utility and helpers directory

22e287c

init

a8c3968

bugfix to enable parameter builder share results

191121c

WIP

4ea4bc3

MeanUnexpectedMetricMultiBatchParameterBuilder and unit tests -- desi…

b309c61

…gned for computing average unexpected values fraction for any map metric.

Merge branch 'develop' into feature/GREAT-464/GREAT-498/GREAT-632/ale…

14ffa41

…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39

alexsherstinsky requested a review from cdkini March 5, 2022 00:50

alexsherstinsky requested review from NathanFarmer, donaldheppner, anthonyburdi and Shinnnyshinshin March 5, 2022 00:50

cleanup

3403232

Shinnnyshinshin reviewed Mar 5, 2022

View reviewed changes

Alex Sherstinsky and others added 2 commits March 4, 2022 23:32

minor

1f24991

Merge branch 'develop' into feature/GREAT-464/GREAT-498/GREAT-632/ale…

3ce45bb

…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39

Merge branch 'develop' into feature/GREAT-464/GREAT-498/GREAT-632/ale…

8628de1

…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39

alexsherstinsky requested a review from Shinnnyshinshin March 7, 2022 15:43

Alex Sherstinsky added 2 commits March 7, 2022 07:43

Merge branch 'develop' into feature/GREAT-464/GREAT-498/GREAT-632/ale…

2579e6d

…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39

anthonyburdi approved these changes Mar 7, 2022

View reviewed changes

cdkini reviewed Mar 7, 2022

View reviewed changes

alexsherstinsky merged commit bb306e0 into develop Mar 7, 2022

alexsherstinsky deleted the feature/GREAT-464/GREAT-498/GREAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39 branch March 7, 2022 17:12
