[MAINTENANCE] Add proper unit tests for Column Histogram metric and use Column Value Partitioner in OnboardingDataAssistant #5267

alexsherstinsky
Contributor

Scope

This contribution accomplishes several related tasks:

  • Add proper unit tests for the "column.histogram" metric.
  • Use PartitionParameterBuilder in OnboardingDataAssistant, because it handles histogram-style requirements generally, for both categorical and finely-quantized (as an approximation to "continuous") column data values.
  • Update DataAssistant unit tests to handle the available ParameterNode keys robustly.
  • Various code styling cleanup.

Changes proposed in this pull request:

  • JIRA: GREAT-465/GREAT-982

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

Alex Sherstinsky added 2 commits June 8, 2022 09:00
…ky/rule_based_profiler/data_assistant/metrics/column_histogram_metric_must_support_integer_bins_for_all_execution_engine_options-2022_06_07-163

netlify bot commented Jun 8, 2022

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 2dd0ed7
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/62a0ecf8587664000995e62e
😎 Deploy Preview https://deploy-preview-5267--niobium-lead-7998.netlify.app

ghost commented Jun 8, 2022

Review these changes using an interactive CodeSee Map

Alex Sherstinsky added 2 commits June 8, 2022 09:08
Contributor

@NathanFarmer NathanFarmer left a comment

A couple of questions!

@@ -280,8 +280,10 @@ def _build_numeric_columns_rule() -> Rule:

# Step-2: Declare "ParameterBuilder" for every metric of interest.

column_histogram_metric_multi_batch_parameter_builder_for_metrics: ParameterBuilder = DataAssistant.commonly_used_parameter_builders.get_column_histogram_metric_multi_batch_parameter_builder(
json_serialize=True
column_partition_parameter_builder_for_metrics: ParameterBuilder = DataAssistant.commonly_used_parameter_builders.build_partition_parameter_builder(
Contributor

What exactly makes column.partition an improvement over using column.histogram for continuous data? On the face of it, it would seem like they do the same thing.

Contributor Author

@NathanFarmer The PartitionParameterBuilder is better because it has the logic to decide whether the underlying data is categorical or finely-quantized (as an approximation to "continuous"). This optimizes the histogram calculations for the best interpretation. In the histogram case, we produce probability-distribution-like output (and care about smoothing the tails). In the categorical case, we produce probability-mass-function-like output. Thanks!
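
To make the distinction in the reply above concrete, here is a minimal pure-Python sketch of the general idea. It is not the PartitionParameterBuilder implementation: the function name `summarize_column`, the `max_categories` threshold, and the `n_bins` default are all hypothetical and chosen only for illustration. The tail smoothing mentioned above is not modeled.

```python
from collections import Counter
from typing import Dict, Sequence, Union


def summarize_column(
    values: Sequence[Union[int, float, str]],
    max_categories: int = 10,  # hypothetical cutoff between "categorical" and "continuous"
    n_bins: int = 5,           # hypothetical number of equal-width bins
) -> Dict[str, object]:
    """Return PMF-like weights for categorical data, histogram-like weights otherwise.

    Assumes `values` is non-empty.
    """
    total = len(values)
    distinct = set(values)
    has_strings = any(isinstance(v, str) for v in values)

    if has_strings or len(distinct) <= max_categories:
        # Categorical case: one weight per distinct value (probability mass function).
        counts = Counter(values)
        labels = sorted(counts, key=str)
        return {
            "kind": "categorical",
            "values": labels,
            "weights": [counts[label] / total for label in labels],
        }

    # Finely-quantized case: equal-width bins between the observed min and max.
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    edges = [lo + i * width for i in range(n_bins + 1)]
    bin_counts = [0] * n_bins
    for v in values:
        # Clamp so that the maximum value falls into the last bin.
        idx = min(int((v - lo) / width), n_bins - 1)
        bin_counts[idx] += 1
    return {
        "kind": "continuous",
        "bins": edges,
        "weights": [c / total for c in bin_counts],
    }
```

For example, `summarize_column(["a", "b", "a", "a"])` takes the categorical branch, while `summarize_column(list(range(100)))` takes the binned branch with six bin edges.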

@@ -661,7 +664,7 @@ def _build_datetime_columns_rule() -> Rule:
"mostly": 1.0,
"strict_min": False,
"strict_max": False,
"bins": 10,
"allow_relative_error": "linear",
Contributor

This parameter seems to be important, but why can't this just be added to column.histogram?

Contributor Author

@NathanFarmer Because it would contravene the original API, whereby the column.histogram metric accepts a list containing the bin boundaries (the bins parameter). So I had to revert in order to stay true to the original intent. Thanks!
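
As an illustration of the API shape described in the reply above (bins passed as a list of boundaries rather than as an integer count), here is a small standalone sketch. It does not call Great Expectations: `histogram_counts` and `equal_width_boundaries` are hypothetical helpers, and the convention that the last bin includes its upper boundary is an assumption borrowed from common histogram implementations.

```python
from bisect import bisect_right
from typing import List, Sequence


def equal_width_boundaries(lo: float, hi: float, n_bins: int) -> List[float]:
    """Turn an integer bin count into an explicit list of n_bins + 1 boundaries."""
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(n_bins + 1)]


def histogram_counts(values: Sequence[float], bins: List[float]) -> List[int]:
    """Count values per bin, where `bins` is a sorted list of boundaries.

    Produces len(bins) - 1 counts. Bins are half-open [left, right), except
    the last one, which also includes its right boundary (an assumed convention).
    """
    counts = [0] * (len(bins) - 1)
    for v in values:
        if v < bins[0] or v > bins[-1]:
            continue  # value lies outside the declared boundaries
        idx = min(bisect_right(bins, v) - 1, len(counts) - 1)
        counts[idx] += 1
    return counts
```

A caller that only knows the desired number of bins would first build the boundaries, e.g. `histogram_counts(data, equal_width_boundaries(0, 10, 2))`, so the counting function itself keeps a single, list-based `bins` contract.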

Contributor

@NathanFarmer NathanFarmer left a comment

Looks good to me once that unused import gets removed!

Alex Sherstinsky added 2 commits June 8, 2022 11:39
…ky/rule_based_profiler/data_assistant/metrics/column_histogram_metric_must_support_integer_bins_for_all_execution_engine_options-2022_06_07-163
@alexsherstinsky alexsherstinsky enabled auto-merge (squash) June 8, 2022 18:40
@alexsherstinsky alexsherstinsky merged commit f03622d into develop Jun 8, 2022
@alexsherstinsky alexsherstinsky deleted the feature/GREAT-465/GREAT-982/alexsherstinsky/rule_based_profiler/data_assistant/metrics/column_histogram_metric_must_support_integer_bins_for_all_execution_engine_options-2022_06_07-163 branch June 8, 2022 19:33