[BUGFIX] fix incorrect pandas top rows usage #3091

alexsherstinsky · 2021-07-21T19:38:50Z


Changes proposed in this pull request:

JIRA: GE-320/GE-364


Previous Design Review notes:

Definition of Done

Please delete options that are not relevant.

My code follows the Great Expectations style guide
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added unit tests where applicable and made sure that new and existing tests are passing.
I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!


netlify · 2021-07-21T19:38:54Z

✔️ Deploy Preview for knoxpod ready!

🔨 Explore the source changes: 3d5e065

🔍 Inspect the deploy log: https://app.netlify.com/sites/knoxpod/deploys/60f89377fcc10500084ad83c

😎 Browse the preview: https://deploy-preview-3091--knoxpod.netlify.app

github-actions · 2021-07-21T19:39:11Z

HOWDY! This is your friendly 🤖 CHANGELOG bot 🤖

Please don't forget to add a clear and succinct description of your change under the Develop header in docs/changelog.rst, if applicable. This will help us with the release process. See the Contribution checklist in the Great Expectations documentation for the type of labels to use!

✨ Thank you! ✨


cdkini

I reviewed from a purely Python perspective since my knowledge of the underlying systems is lacking. I see the pandas-related issue and how it's resolved, so as long as tests pass, I'll sign off.

Please do take a look at my comments (non-blocking) and let me know your thoughts.

cdkini · 2021-07-21T21:57:04Z

great_expectations/expectations/metrics/map_metric.py

    domain_values = df[column_name]

+    domain_values = domain_values[boolean_mapped_unexpected_values == True]


Can we consolidate this into a single filtered df? Is that too complex to put on a single line?
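A minimal sketch of the single-line form asked about here, assuming `boolean_mapped_unexpected_values` is a boolean Series aligned with `df`. The sample data and the way the mask is built are hypothetical; only the variable names come from the diff.

```python
import pandas as pd

# Hypothetical sample data; the real frame comes from the execution engine.
df = pd.DataFrame({"a": [1, 2, 3, 3]})
column_name = "a"

# Hypothetical mask: True where the value is duplicated (rows 2 and 3).
boolean_mapped_unexpected_values = df[column_name].duplicated(keep=False)

# Two-step form, as written in the diff.
domain_values = df[column_name]
two_step = domain_values[boolean_mapped_unexpected_values == True]  # noqa: E712

# Single-line form: select rows by mask and the column by label in one .loc call.
one_step = df.loc[boolean_mapped_unexpected_values, column_name]
```

Both forms return the same Series with the original index labels preserved, so the choice is mostly about readability.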

cdkini · 2021-07-21T22:01:41Z

great_expectations/expectations/metrics/map_metric.py

+
    if result_format["result_format"] == "COMPLETE":


Is there any chance of result_format missing the "result_format" key? Should we make this a get()?
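For illustration, the `get()` variant suggested here could look like the sketch below. The helper name `is_complete` is hypothetical and not part of the PR; whether a missing key should be treated as "not COMPLETE" rather than raised as an error is a design decision for the maintainers.

```python
def is_complete(result_format: dict) -> bool:
    """Return True only when the result format is explicitly COMPLETE.

    Hypothetical helper: dict.get() returns None for a missing key,
    so a missing "result_format" entry compares unequal instead of
    raising KeyError.
    """
    return result_format.get("result_format") == "COMPLETE"
```

With subscripting (`result_format["result_format"]`), a missing key fails loudly, which can be preferable if the key is guaranteed upstream.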

cdkini · 2021-07-21T22:05:36Z

great_expectations/expectations/metrics/map_metric.py

+            list(domain_values),
+            list(map_series),
        )


Thoughts on using df.columns.tolist() or df.columns.values.tolist()? The latter is more performant but I'm just thinking from a readability standpoint. I wasn't sure what list(df) did but that might just be from not having used pandas in a while.
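A short sketch of what `list(...)` does on pandas objects, using hypothetical data. Note that if `domain_values` and `map_series` are Series, as the earlier diff suggests, `list()` yields their values rather than column names.

```python
import pandas as pd

# Hypothetical frame for illustration only.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4]})

# Iterating a DataFrame yields its column labels.
columns_via_list = list(df)
columns_via_tolist = df.columns.tolist()

# Iterating a Series yields its values, not labels.
values_via_list = list(df["a"])
values_via_tolist = df["a"].tolist()
```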

cdkini · 2021-07-21T22:08:22Z

tests/expectations/metrics/test_core.py

+    unexpected_rows_metric = MetricConfiguration(
+        metric_name="column_values.unique.unexpected_rows",
+        metric_domain_kwargs={"column": "a"},
+        metric_value_kwargs={
+            "result_format": {"result_format": "SUMMARY", "partial_unexpected_count": 1}
+        },
+        metric_dependencies={
+            "unexpected_condition": condition_metric,
+            "table.columns": table_columns_metric,
+        },
+    )
+    results = engine.resolve_metrics(
+        metrics_to_resolve=(unexpected_rows_metric,), metrics=metrics
+    )
+    metrics.update(results)
+
+    assert metrics[unexpected_rows_metric.id]["a"].index == [2]
+    assert metrics[unexpected_rows_metric.id]["a"].values == [3]
+


What is this test doing? Does it add coverage over this pandas issue that was not covered before? Should it be a separate test entirely or are we okay with appending it here?

I don't have the requisite knowledge to definitively say but I thought it was prudent to bring up such questions so you could evaluate.

alexsherstinsky · 2021-07-21

This test exercises a metric and a result format that together trigger the behavior: the number of rows returned from the pandas DataFrame has to be less than its total number of rows.
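The thread does not show the original faulty line, so the following is only an illustration of one common pitfall when taking the top rows of a filtered frame, not a reconstruction of the bug. After boolean filtering, index labels no longer start at 0, so label-based and position-based selection can disagree. The data below is hypothetical.

```python
import pandas as pd

# Hypothetical data: the value 3 is duplicated at index labels 2 and 3.
df = pd.DataFrame({"a": [1, 2, 3, 3, 5]})
mask = df["a"].duplicated(keep=False)

# Filtering keeps the original index labels: [2, 3].
unexpected = df[mask]

# Position-based selection returns the first filtered row.
top_by_position = unexpected.iloc[:1]
top_by_head = unexpected.head(1)

# Label-based slicing stops at label 1, which precedes every remaining label,
# so nothing is selected.
top_by_label = unexpected.loc[:1]
```

Here `top_by_position` holds the row with index label 2 and value 3, which is the shape of result the test above asserts for a `partial_unexpected_count` of 1.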

…https://github.com/great-expectations/great_expectations into docs/GDOC-199/snowflake-connections-closed-correctly

* 'docs/GDOC-199/snowflake-connections-closed-correctly' of https://github.com/great-expectations/great_expectations:
  Update util.py
  [MAINTENANCE] rename map_metric.py to map_metric_provider.py (with DeprecationWarning) for a better code readability/interpretability (#3103)
  [MAINTENANCE] rename ColumnMetricProvider to ColumnAggregateMetricProvider (with DeprecationWarning) (#3100)
  use correct variable name (#3069)
  [RELEASE] release candidate for 2021-07-22 (#3101)
  [FEATURE] SqlAlchemy engine support for column.most_common_value metric (#3020)
  [BUGFIX] Fix run_diagnostics for contrib expectations (#3096)
  [BUGFIX] Fix typos discovered by codespell (#3064)
  [DOCS] Migrating pages under guides/miscellaneous (#3094)
  [DOCS] Port over "How to run a Checkpoint in Airflow" from RTD to Docusaurus (#3074)
  disable snowflake tests temporarily (#3093)
  [BUGFIX] fix incorrect pandas top rows usage (#3091)

Alex Sherstinsky added 3 commits July 21, 2021 11:30

fix a bug in the usage of Pandas for subscripting to get select number of rows

e0a3379

fix Pandas bug

c5f23e1

Merge branch 'develop' into bugfix/GE-320/GE-364/alexsherstinsky/fix_incorrect_pandas_top_rows_usage-2021_07-21-22

6beaf81

alexsherstinsky marked this pull request as ready for review July 21, 2021 20:15

alexsherstinsky requested review from jcampbell, cdkini, fjork3, petermoyer and Shinnnyshinshin July 21, 2021 20:15

Alex Sherstinsky added 3 commits July 21, 2021 13:16

f602009

Added a test of pandas slicing in MapMetrics functions.

9cd68b1

Merge branch 'develop' into bugfix/GE-320/GE-364/alexsherstinsky/fix_incorrect_pandas_top_rows_usage-2021_07-21-22

3d5e065

cdkini approved these changes Jul 21, 2021


alexsherstinsky merged commit 0a830b4 into develop Jul 21, 2021

alexsherstinsky deleted the bugfix/GE-320/GE-364/alexsherstinsky/fix_incorrect_pandas_top_rows_usage-2021_07-21-22 branch July 21, 2021 22:19
