New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[FEATURE] DataAssistant: Enable passing directives to run() method using runtime_environment argument #5187
Conversation
✅ Deploy Preview for niobium-lead-7998 ready!
To edit notification comments on pull requests, go to your Netlify site settings. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This adds a lot of complexity to the rule-based profiler and another optional dict of Any.
If there are specific settings we want to control for profiling, we should use a well defined type to present these configuration options
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
@@ -441,13 +441,15 @@ def run( | |||
self, | |||
variables: Optional[Dict[str, Any]] = None, | |||
rules: Optional[Dict[str, Dict[str, Any]]] = None, | |||
**kwargs: dict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where can I, as a user, find the appropriate values that fit into this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdkini It has to be documented, just like in the Scope
of this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values are, in fact, in RuntimeEnvironmentColumnDomainTypeDirectivesKeys
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to get away from all these dictionaries, especially since we don't expect anyone to call these methods but us. No more kwargs!
We don't have to call these "RuntimeEnvironment" any more, they are no longer simple settings, they are "ColumnDomainDirectives". If we want to generalize it, we could say that ColumnDomainDirectives are just one type of Directive that could be passed to DataAssistants (and the Profiler). This makes it obvious we should build a lot more smarts/functionality into the type, and I think makes it easier for the user to understand what they are being asked to supply.
Significantly, the primary user of this IS US, the developers of great_expectations. Convenience methods for those that like dictionaries can be provided at a higher level, and reconciled as Directives by time we get to the actual DataAssistant.
And for those who come along to implement their own DataAssistants later, they will really appreciate knowing that this parameter is a way of providing Directives, vs. a bag of nondescript values
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are include_column_names
and exclude_column_names
available for every Data Assistant? If so, does it make sense to add these here separately from kwargs
to make that clear for the users when they use auto-complete? If not, does it make sense to add these to the Data Assistants that use them?
else: | ||
if not isinstance(property_value, property_value_type): | ||
raise ValueError( | ||
f'Unrecognized "{property_name}" directive -- must be "{property_value_type}" (or string).' | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this just be an elif
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cdkini Strictly speaking -- yes; however, when there is no final "else", the "elif" looks inelegant. This way, we have if/elif, which culminates on the else
. It is also a different point in that final isinstance
, so I preferred to distinguish it. Thanks.
data_assistant_result: DataAssistantResult = context.assistants.volume.run( | ||
batch_request=batch_request, | ||
**{ | ||
"exclude_column_names": exclude_column_names, | ||
}, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of unpacking the dictionary here, could we do
data_assistant_result: DataAssistantResult = context.assistants.volume.run( | |
batch_request=batch_request, | |
**{ | |
"exclude_column_names": exclude_column_names, | |
}, | |
) | |
data_assistant_result: DataAssistantResult = context.assistants.volume.run( | |
batch_request=batch_request, | |
exclude_column_names=exclude_column_names | |
) |
to make the functionality more apparent?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talagluck This is potentially problematic, because every time we add new arguments from any new DomainBuilder
implementations, we will have to add them here. This is difficult to maintain. I am going to try to come up with a way to do both.
@@ -441,13 +441,15 @@ def run( | |||
self, | |||
variables: Optional[Dict[str, Any]] = None, | |||
rules: Optional[Dict[str, Dict[str, Any]]] = None, | |||
**kwargs: dict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are include_column_names
and exclude_column_names
available for every Data Assistant? If so, does it make sense to add these here separately from kwargs
to make that clear for the users when they use auto-complete? If not, does it make sense to add these to the Data Assistants that use them?
@@ -441,13 +441,15 @@ def run( | |||
self, | |||
variables: Optional[Dict[str, Any]] = None, | |||
rules: Optional[Dict[str, Dict[str, Any]]] = None, | |||
**kwargs: dict, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to get away from all these dictionaries, especially since we don't expect anyone to call these methods but us. No more kwargs!
We don't have to call these "RuntimeEnvironment" any more, they are no longer simple settings, they are "ColumnDomainDirectives". If we want to generalize it, we could say that ColumnDomainDirectives are just one type of Directive that could be passed to DataAssistants (and the Profiler). This makes it obvious we should build a lot more smarts/functionality into the type, and I think makes it easier for the user to understand what they are being asked to supply.
Significantly, the primary user of this IS US, the developers of great_expectations. Convenience methods for those that like dictionaries can be provided at a higher level, and reconciled as Directives by time we get to the actual DataAssistant.
And for those who come along to implement their own DataAssistants later, they will really appreciate knowing that this parameter is a way of providing Directives, vs. a bag of nondescript values
""" | ||
This method applies multiple directives to obtain columns to be included as part of returned "Domain" objects. | ||
""" | ||
include_column_names: Optional[List[str]] = cast( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The return type of the _resolve_list_type_property is an empty list (and never None), so we can remove the Optional type hint here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donaldheppner I think we can -- in next commit (double checking now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donaldheppner Yes, we can. In next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donaldheppner This was done (committed).
expected_return_type=None, | ||
variables=variables, | ||
parameters=None, | ||
include_column_name_suffixes: Union[str, Iterable, List[str]] = cast( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we please leave this type as a list of string? Every time we go with a multitude of types, it creates a multitude of if statements later on
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donaldheppner Yes, we can. In next commit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@donaldheppner This was done (committed).
@@ -833,6 +846,82 @@ def _get_rules_as_dict(self) -> Dict[str, Rule]: | |||
rule: Rule | |||
return {rule.name: rule for rule in self.rules} | |||
|
|||
# noinspection PyUnusedLocal | |||
@staticmethod | |||
def _apply_runtime_environment( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do appreciate this improvement, thank-you Alex!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@talagluck I understand about making the arguments more apparent. I am going to try -- not sure yet; I am struggling with the issue of making this API call maintainable when we add new DomainBuilder
classes with their own directives. I will keep you posted.
@donaldheppner Your statement in #5187 (comment) that "nobody but us will be calling these APIs" is pointing to a different assumed requirements than this feature development has been following: specifically, I have been going under the assumption that the |
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
data_assistant_result: DataAssistantResult = context.assistants.volume.run( | ||
batch_request=batch_request, | ||
exclude_column_names=exclude_column_names, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"look mom, no dictionaries" - Alex Sherstinsky
exclude_column_names: | ||
- DOLocationID | ||
- RatecodeID | ||
- store_and_fwd_flag | ||
- payment_type | ||
- extra | ||
- mta_tax | ||
- improvement_surcharge | ||
- congestion_surcharge |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
❤️
Scope
**kwargs
toDataAssistant.run()
which can be configured with multiple overrides, including column inclusions/exclusions and other properties influencing the underlyingRuleBasedProfiler
execution.Syntax example:
The implementation attempts to associate the directives in
**kwargs
withDomain Types
, appropriate for a uniquely identifiedDomainBuilder
(e.g.,ColumnDomainBuilder
).ColumnDomain
builder to properly accept values set by setter methods and reconcile these values with those from existing configuration.Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.
Changes proposed in this pull request:
After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.
For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.
In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g.
closes #123
).Previous Design Review notes:
Definition of Done
Please delete options that are not relevant.
Thank you for submitting!