[BUGFIX] fix incorrect pandas top rows usage #3091

Conversation

alexsherstinsky
Contributor

@alexsherstinsky alexsherstinsky commented Jul 21, 2021

Please annotate your PR title to describe what the PR does, then give a brief bulleted description of your PR below. PR titles should begin with [BUGFIX], [FEATURE], [DOCS], or [MAINTENANCE]. If a new feature introduces breaking changes for the Great Expectations API or configuration files, please also add [BREAKING]. You can read about the tags in our contributor checklist.

Changes proposed in this pull request:

  • JIRA: GE-320/GE-364

After submitting your PR, CI checks will run and @tiny-tim-bot will check for your CLA signature.

For a PR with nontrivial changes, we review with both design-centric and code-centric lenses.

In a design review, we aim to ensure that the PR is consistent with our relationship to the open source community, with our software architecture and abstractions, and with our users' needs and expectations. That review often starts well before a PR, for example in github issues or slack, so please link to relevant conversations in notes below to help reviewers understand and approve your PR more quickly (e.g. closes #123).

Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

@netlify

netlify bot commented Jul 21, 2021

✔️ Deploy Preview for knoxpod ready!

🔨 Explore the source changes: 3d5e065

🔍 Inspect the deploy log: https://app.netlify.com/sites/knoxpod/deploys/60f89377fcc10500084ad83c

😎 Browse the preview: https://deploy-preview-3091--knoxpod.netlify.app

@github-actions
Contributor

HOWDY! This is your friendly 🤖 CHANGELOG bot 🤖

Please don't forget to add a clear and succinct description of your change under the Develop header in docs/changelog.rst, if applicable. This will help us with the release process. See the Contribution checklist in the Great Expectations documentation for the type of labels to use!

Thank you!

@alexsherstinsky alexsherstinsky marked this pull request as ready for review July 21, 2021 20:15
Member

@cdkini cdkini left a comment

I reviewed this from a purely Python perspective, since my knowledge of the underlying systems is limited. I see the pandas-related issue and how it's resolved, so as long as the tests pass, I'll sign off.

Please do take a look at my comments (non-blocking) and let me know your thoughts.

Comment on lines 554 to +556
domain_values = df[column_name]

domain_values = domain_values[boolean_mapped_unexpected_values == True]
Member

Can we consolidate this into a single filtered df? Is that too complex to put on a single line?
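
For illustration, a minimal sketch of what a single-expression filter could look like. The data and the mask values here are hypothetical, and only `df`, `column_name`, and `boolean_mapped_unexpected_values` are names taken from the diff; this is not the PR's actual code.

```python
import pandas as pd

# Hypothetical data; in the PR the mask comes from the unexpected-condition metric.
df = pd.DataFrame({"a": [1, 2, 3, 3]})
column_name = "a"
boolean_mapped_unexpected_values = pd.Series([False, False, True, True])

# Two-step form, as in the diff.
domain_values = df[column_name]
two_step = domain_values[boolean_mapped_unexpected_values == True]

# Single-expression form: .loc selects rows by boolean mask and the column by label.
one_step = df.loc[boolean_mapped_unexpected_values, column_name]
```

Both forms select the same rows when the mask is a plain boolean Series aligned with the DataFrame's index, so the choice is mostly about readability.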

Comment on lines +800 to 801

if result_format["result_format"] == "COMPLETE":
Member

Is there any chance of result_format missing the "result_format" key? Should we make this a get()?
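
To make the difference concrete, a small sketch with a hypothetical `result_format` dictionary that lacks the "result_format" key. Whether that can actually happen in this code path is not established here.

```python
# Hypothetical dictionary with the "result_format" key missing.
result_format = {"partial_unexpected_count": 1}

# Direct indexing raises KeyError when the key is absent.
try:
    result_format["result_format"]
    raised = False
except KeyError:
    raised = True

# .get() returns None (or a supplied default) instead of raising,
# so the comparison simply evaluates to False.
is_complete = result_format.get("result_format") == "COMPLETE"
```

If the key is guaranteed by upstream validation, direct indexing has the advantage of failing loudly on a malformed configuration.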

Comment on lines +629 to 631
list(domain_values),
list(map_series),
)
Member

Thoughts on using df.columns.tolist() or df.columns.values.tolist()? The latter is more performant but I'm just thinking from a readability standpoint. I wasn't sure what list(df) did but that might just be from not having used pandas in a while.
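
For reference, a short sketch on a hypothetical two-column DataFrame showing that the three spellings yield the same list of column labels. It makes no claim about their relative speed.

```python
import pandas as pd

# Hypothetical DataFrame used only to compare the three spellings.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Iterating a DataFrame yields its column labels, so list(df) is a list of column names.
via_list = list(df)

# More explicit alternatives that say "columns" in the code itself.
via_tolist = df.columns.tolist()
via_values = df.columns.values.tolist()
```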

Comment on lines +566 to +584
unexpected_rows_metric = MetricConfiguration(
metric_name="column_values.unique.unexpected_rows",
metric_domain_kwargs={"column": "a"},
metric_value_kwargs={
"result_format": {"result_format": "SUMMARY", "partial_unexpected_count": 1}
},
metric_dependencies={
"unexpected_condition": condition_metric,
"table.columns": table_columns_metric,
},
)
results = engine.resolve_metrics(
metrics_to_resolve=(unexpected_rows_metric,), metrics=metrics
)
metrics.update(results)

assert metrics[unexpected_rows_metric.id]["a"].index == [2]
assert metrics[unexpected_rows_metric.id]["a"].values == [3]

Member

What is this test doing? Does it add coverage over this pandas issue that was not covered before? Should it be a separate test entirely or are we okay with appending it here?

I don't have the requisite knowledge to definitively say but I thought it was prudent to bring up such questions so you could evaluate.

Contributor Author

This test exercises a metric and a result format that together trigger the behavior: the number of rows returned from the pandas DataFrame must be less than its total number of rows.
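
A rough sketch of that kind of situation in plain pandas, using hypothetical data and a hypothetical uniqueness mask rather than the metric machinery from the test. Only `partial_unexpected_count` is a name taken from the test above.

```python
import pandas as pd

# Hypothetical column with one duplicated value.
df = pd.DataFrame({"a": [1, 2, 3, 3]})

# Rows whose value appears more than once (stand-in for the unexpected condition).
unexpected = df["a"].duplicated(keep=False)

# Keep only the first partial_unexpected_count unexpected rows.
partial_unexpected_count = 1
top_rows = df[unexpected].head(partial_unexpected_count)
```

Here the filtered result keeps the original index label of the first unexpected row and has fewer rows than the full DataFrame, which is the condition the test is meant to cover.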

@alexsherstinsky alexsherstinsky merged commit 0a830b4 into develop Jul 21, 2021
@alexsherstinsky alexsherstinsky deleted the bugfix/GE-320/GE-364/alexsherstinsky/fix_incorrect_pandas_top_rows_usage-2021_07-21-22 branch July 21, 2021 22:19
Shinnnyshinshin pushed a commit that referenced this pull request Jul 23, 2021
…https://github.com/great-expectations/great_expectations into docs/GDOC-199/snowflake-connections-closed-correctly

* 'docs/GDOC-199/snowflake-connections-closed-correctly' of https://github.com/great-expectations/great_expectations:
  Update util.py
  [MAINTENANCE] rename map_metric.py to map_metric_provider.py (with DeprecationWarning) for a better code readability/interpretability (#3103)
  [MAINTENANCE] rename ColumnMetricProvider to ColumnAggregateMetricProvider (with DeprecationWarning) (#3100)
  use correct variable name (#3069)
  [RELEASE] release candidate for 2021-07-22 (#3101)
  [FEATURE] SqlAlchemy engine support for column.most_common_value metric (#3020)
  [BUGFIX] Fix run_diagnostics for contrib expectations (#3096)
  [BUGFIX] Fix typos discovered by codespell (#3064)
  [DOCS] Migrating pages under guides/miscellaneous (#3094)
  [DOCS] Port over "How to run a Checkpoint in Airflow" from RTD to Docusaurus  (#3074)
  disable snowflake tests temporarily (#3093)
  [BUGFIX] fix incorrect pandas top rows usage (#3091)
2 participants