New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] ParameterBuilder for Computing Average Unexpected Values Fractions for any Map Metric #4340
Conversation
…/alexsherstinsky/rule_based_profiler_use_same_utility_for_emitting_domain_objects-2022_03_03-35
…/alexsherstinsky/rule_based_profiler_use_same_utility_for_emitting_domain_objects-2022_03_03-35
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_03-37
…gned for computing average unexpected values fraction for any map metric.
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
✔️ Deploy Preview for niobium-lead-7998 ready! 🔨 Explore the source changes: b9eff5b 🔍 Inspect the deploy log: https://app.netlify.com/sites/niobium-lead-7998/deploys/6226285443487b0008d8cef3 😎 Browse the preview: https://deploy-preview-4340--niobium-lead-7998.netlify.app |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for this work @alexsherstinsky . the code + tests look great, but given the importance of this PR in the direction we are "nudging' it to go (ie more like a formal standard machine learning component), I think a synchronous discussion would really help. Would we be able to schedule some time on Monday?
@Shinnnyshinshin Happy to offer my half-baked ideas. Although this PR in itself is only a hint at what a proper data pipeline should be. In particular, this is the first time we have a |
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
…xsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
…EAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39' into feature/GREAT-464/GREAT-498/GREAT-632/alexsherstinsky/rule_based_profiler_implement_uniqueness_column_domain_builder-2022_03_04-39
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
||
nonnull_count_values: np.ndarray = total_count_values - null_count_values | ||
|
||
# Compute "unexpected_count" corresponding to "map_metric_name" (given as argument to this "ParameterBuilder"). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for these helpful comments.
metric_value_kwargs: Optional[Union[str, dict]] = None, | ||
batch_list: Optional[List[Batch]] = None, | ||
batch_request: Optional[Union[BatchRequest, RuntimeBatchRequest, dict]] = None, | ||
json_serialize: Union[str, bool] = True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why can this be a string as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdkini Because of the $parameter
syntax, potentially used.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it's the responsibility of the parameter builder to determine the true value here? That value can't be pre-computed so we can narrow the type to bool?
parameter_values: Dict[str, Any] = { | ||
self.fully_qualified_parameter_name: { | ||
"value": convert_to_json_serializable(data=computed_parameter_value), | ||
"value": convert_to_json_serializable(data=computed_parameter_value) | ||
if json_serialize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this be a string? If so, it'll always result in True. Are we okay with that?
@property | ||
def json_serialize(self) -> bool: | ||
return self._json_serialize |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The input type is Union[bool ,str]
. If we pass in a string, won't this be erroneous?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdkini The $parameter
syntax requires this Union for now -- hope we can simplify later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But what happens if I pass in a string value from the constructor? Per the __init__
, it seem as though we're assigning the raw value to the attribute without any additional logic.
Doesn't this create a situation where we could possibly be returning a string from this bool property?
_: MeanUnexpectedMapMetricMultiBatchParameterBuilder = ( | ||
MeanUnexpectedMapMetricMultiBatchParameterBuilder( | ||
name="my_name", | ||
map_metric_name="column_values.nonnull", | ||
total_count_parameter_builder_name="my_total_count", | ||
data_context=data_context, | ||
) | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick - we don't even need assignment here right? Perfectly fine to keep this but we should probably have a consistent approach since I tend to omit assignment when I'm just testing if the behavior errors/logs/etc.
Scope
The new
MeanUnexpectedMapMetricMultiBatchParameterBuilder
accepts any map metric as well as the required name of theParameterBuilder
that computes the total count metric and optionally the name of theParameterBuilder
that computes the number of rows with the empty value for the domain (e.g., an empty column). It then computes the unexpected counts of the map metric for each batch (e.g.,"column_values.nonnull"
,""column_values.unique"
, etc.) and then the ratios of the unexpected counts to the counts of non-null records, and finally outputs the mean of these ratios.This result can then be used by an ExpectationConfigurationBuilder to determine whether or not this mean ratio is sufficiently low for an expectation to be emitted, which is beneficial in avoiding "validation error noise" subsequently.
Note
MeanUnexpectedMapMetricMultiBatchParameterBuilder
requires theParameterBuilder
configurations for the total count and (optionally) the null count metric to appear earlier, since their results are pre-requisite for this parameter builder.Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.
Changes proposed in this pull request:
After submitting your PR, CI checks will run and @ge-cla-bot will check for your CLA signature.
For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.
In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g.
closes #123
).Previous Design Review notes:
Definition of Done
Please delete options that are not relevant.
Thank you for submitting!