Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] DataAssistant: Enable passing directives to run() method using runtime_environment argument #5187

Conversation

alexsherstinsky
Copy link
Contributor

@alexsherstinsky alexsherstinsky commented May 23, 2022

Scope

  • Added standard **kwargs to DataAssistant.run() which can be configured with multiple overrides, including column inclusions/exclusions and other properties influencing the underlying RuleBasedProfiler execution.

Syntax example:

    exclude_column_names: List[str] = [
        "dropoff_datetime",
        "store_and_fwd_flag",
        "extra",
        "congestion_surcharge",
    ]
    data_assistant_result: DataAssistantResult = context.assistants.volume.run(
        batch_request=batch_request,       
        exclude_column_names=exclude_column_names,
         # can add other directives here,
    )

The implementation attempts to associate the directives in **kwargs with Domain Types, appropriate for a uniquely identified DomainBuilder (e.g., ColumnDomainBuilder).

  • Enhanced ColumnDomain builder to properly accept values set by setter methods and reconcile these values with those from existing configuration.
  • Added a unit test for column exclusion.
  • Fixed incorrectly configured existing unit test.

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

  • JIRA: GREAT-822/GREAT-905

After submitting your PR, CI checks will run and @cla-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

@netlify
Copy link

netlify bot commented May 23, 2022

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 5bab634
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/628d5f0dba2d2700084d734c
😎 Deploy Preview https://deploy-preview-5187--niobium-lead-7998.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

Copy link
Member

@donaldheppner donaldheppner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This adds a lot of complexity to the rule-based profiler and another optional dict of Any.

If there are specific settings we want to control for profiling, we should use a well defined type to present these configuration options

Alex Sherstinsky added 5 commits May 23, 2022 16:03
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 09:11
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 13:36
@@ -441,13 +441,15 @@ def run(
self,
variables: Optional[Dict[str, Any]] = None,
rules: Optional[Dict[str, Dict[str, Any]]] = None,
**kwargs: dict,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where can I, as a user, find the appropriate values that fit into this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdkini It has to be documented, just like in the Scope of this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values are, in fact, in RuntimeEnvironmentColumnDomainTypeDirectivesKeys.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to get away from all these dictionaries, especially since we don't expect anyone to call these methods but us. No more kwargs!

We don't have to call these "RuntimeEnvironment" any more, they are no longer simple settings, they are "ColumnDomainDirectives". If we want to generalize it, we could say that ColumnDomainDirectives are just one type of Directive that could be passed to DataAssistants (and the Profiler). This makes it obvious we should build a lot more smarts/functionality into the type, and I think makes it easier for the user to understand what they are being asked to supply.

Significantly, the primary user of this IS US, the developers of great_expectations. Convenience methods for those that like dictionaries can be provided at a higher level, and reconciled as Directives by time we get to the actual DataAssistant.

And for those who come along to implement their own DataAssistants later, they will really appreciate knowing that this parameter is a way of providing Directives, vs. a bag of nondescript values

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are include_column_names and exclude_column_names available for every Data Assistant? If so, does it make sense to add these here separately from kwargs to make that clear for the users when they use auto-complete? If not, does it make sense to add these to the Data Assistants that use them?

Comment on lines +407 to +412
else:
if not isinstance(property_value, property_value_type):
raise ValueError(
f'Unrecognized "{property_name}" directive -- must be "{property_value_type}" (or string).'
)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could this just be an elif?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cdkini Strictly speaking -- yes; however, when there is no final "else", the "elif" looks inelegant. This way, we have if/elif, which culminates on the else. It is also a different point in that final isinstance, so I preferred to distinguish it. Thanks.

Comment on lines 2473 to 2478
data_assistant_result: DataAssistantResult = context.assistants.volume.run(
batch_request=batch_request,
**{
"exclude_column_names": exclude_column_names,
},
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of unpacking the dictionary here, could we do

Suggested change
data_assistant_result: DataAssistantResult = context.assistants.volume.run(
batch_request=batch_request,
**{
"exclude_column_names": exclude_column_names,
},
)
data_assistant_result: DataAssistantResult = context.assistants.volume.run(
batch_request=batch_request,
exclude_column_names=exclude_column_names
)

to make the functionality more apparent?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talagluck This is potentially problematic, because every time we add new arguments from any new DomainBuilder implementations, we will have to add them here. This is difficult to maintain. I am going to try to come up with a way to do both.

@@ -441,13 +441,15 @@ def run(
self,
variables: Optional[Dict[str, Any]] = None,
rules: Optional[Dict[str, Dict[str, Any]]] = None,
**kwargs: dict,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are include_column_names and exclude_column_names available for every Data Assistant? If so, does it make sense to add these here separately from kwargs to make that clear for the users when they use auto-complete? If not, does it make sense to add these to the Data Assistants that use them?

@@ -441,13 +441,15 @@ def run(
self,
variables: Optional[Dict[str, Any]] = None,
rules: Optional[Dict[str, Dict[str, Any]]] = None,
**kwargs: dict,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we need to get away from all these dictionaries, especially since we don't expect anyone to call these methods but us. No more kwargs!

We don't have to call these "RuntimeEnvironment" any more, they are no longer simple settings, they are "ColumnDomainDirectives". If we want to generalize it, we could say that ColumnDomainDirectives are just one type of Directive that could be passed to DataAssistants (and the Profiler). This makes it obvious we should build a lot more smarts/functionality into the type, and I think makes it easier for the user to understand what they are being asked to supply.

Significantly, the primary user of this IS US, the developers of great_expectations. Convenience methods for those that like dictionaries can be provided at a higher level, and reconciled as Directives by time we get to the actual DataAssistant.

And for those who come along to implement their own DataAssistants later, they will really appreciate knowing that this parameter is a way of providing Directives, vs. a bag of nondescript values

"""
This method applies multiple directives to obtain columns to be included as part of returned "Domain" objects.
"""
include_column_names: Optional[List[str]] = cast(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The return type of the _resolve_list_type_property is an empty list (and never None), so we can remove the Optional type hint here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner I think we can -- in next commit (double checking now).

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner Yes, we can. In next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner This was done (committed).

expected_return_type=None,
variables=variables,
parameters=None,
include_column_name_suffixes: Union[str, Iterable, List[str]] = cast(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we please leave this type as a list of string? Every time we go with a multitude of types, it creates a multitude of if statements later on

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner Yes, we can. In next commit.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@donaldheppner This was done (committed).

@@ -833,6 +846,82 @@ def _get_rules_as_dict(self) -> Dict[str, Rule]:
rule: Rule
return {rule.name: rule for rule in self.rules}

# noinspection PyUnusedLocal
@staticmethod
def _apply_runtime_environment(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do appreciate this improvement, thank-you Alex!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@talagluck I understand about making the arguments more apparent. I am going to try -- not sure yet; I am struggling with the issue of making this API call maintainable when we add new DomainBuilder classes with their own directives. I will keep you posted.

@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 16:41
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 16:45
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 16:48
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 16:56
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 17:09
@alexsherstinsky
Copy link
Contributor Author

alexsherstinsky commented May 24, 2022

@donaldheppner Your statement in #5187 (comment) that "nobody but us will be calling these APIs" is pointing to a different assumed requirements than this feature development has been following: specifically, I have been going under the assumption that the run() method is user-facing API. Happy to discuss. /cc @talagluck

Alex Sherstinsky added 3 commits May 24, 2022 13:05
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
…ky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 22:12
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) May 24, 2022 22:43
@alexsherstinsky alexsherstinsky merged commit a053e7e into develop May 24, 2022
@alexsherstinsky alexsherstinsky deleted the feature/GREAT-822/GREAT-905/alexsherstinsky/rule_based_profiler/data_assistant/enable_column_type_domain_inclusions_exclusions_at_runtime-2022_05_23-147 branch May 24, 2022 23:27
Comment on lines +2473 to +2476
data_assistant_result: DataAssistantResult = context.assistants.volume.run(
batch_request=batch_request,
exclude_column_names=exclude_column_names,
)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"look mom, no dictionaries" - Alex Sherstinsky

Comment on lines +134 to +142
exclude_column_names:
- DOLocationID
- RatecodeID
- store_and_fwd_flag
- payment_type
- extra
- mta_tax
- improvement_surcharge
- congestion_surcharge
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants