[FEATURE] Support to include ID/PK in validation result for each row - SQL #6448

Shinnnyshinshin · 2022-11-29T18:19:38Z

Changes proposed in this pull request:

SQL implementation of the unexpected_index_column_names setting, which allows users to specify primary key (PK) columns for identifying rows that failed an Expectation (returned as part of the unexpected_index_list). The PR also enables unexpected_index_query to be returned to the user, which can be used to retrieve the full set of unexpected rows.

Changes were made to the map_metric_provider to take in the parameter from result_format and output unexpected_index_list as key-value pairs of the primary key column.

Note: This is only the SQL implementation. The Pandas implementation has already been merged, and Spark is to follow.

(Almost) closes #3195

What has changed?

unexpected_index_list and unexpected_index_query were added as validation dependencies for SqlAlchemyExecutionEngine.
_sqlalchemy_map_condition_index was added as a MapMetric.
- This metric returns the indices (truncated at 10 results for SQL) of the rows that contain unexpected values, expressed as the primary key values of the columns specified in unexpected_index_column_names. Note: unlike Pandas, the current SQL implementation has no default index to return when unexpected_index_column_names is not specified. In other words, to see unexpected_index_list, the user must specify unexpected_index_column_names.
_sqlalchemy_map_condition_query was added as a MapMetric.
- This metric returns the query that can be used to retrieve the full unexpected results.
- This method depends on two helper methods:
  - sql_post_compile_to_string(): Used by the _sqlalchemy_map_condition_query() method to compile a SQL select statement with post-compile parameters into a string. The logic is lifted directly from the SQLAlchemy documentation.
  - get_sqlalchemy_source_table_and_schema_selectable(): Used by the _sqlalchemy_map_condition_query() metric function to return the table associated with the current Batch, rather than the temp_table that is created as part of running the query.

Can you give me an Example? (heavily adapted from the Pandas example)

Given the following table animal_names:

import pandas as pd
import sqlalchemy as sa
from great_expectations.data_context.util import file_relative_path

# Write a small table with two candidate primary key columns to SQLite.
sqlite_path = file_relative_path(__file__, "../../test_sets/metrics_test.db")
sqlite_engine = sa.create_engine(f"sqlite:///{sqlite_path}")
df = pd.DataFrame(
    {
        "pk_1": [0, 1, 2, 3, 4, 5],
        "pk_2": ["zero", "one", "two", "three", "four", "five"],
        "animals": ["cat", "fish", "dog", "giraffe", "lion", "zebra"],
    }
)
df.to_sql(
    name="animal_names",
    con=sqlite_engine,
    index=False,
    if_exists="replace",
)

We could run the expect_column_values_to_be_in_set Expectation on the animals column with ["cat", "fish", "dog"] as the value_set (i.e., domestic animals).

expectationConfiguration = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={
        "column": "animals",
        "value_set": ["cat", "fish", "dog"],
        "result_format": {
            "result_format": "COMPLETE",
        },
    },
)

After running the ExpectationConfiguration, we would expect the unexpected values to be ["giraffe", "lion", "zebra"], which sit at zero-based row positions 3, 4, and 5 and are the values that are not in the value_set.

This PR enables the following configuration, which sets unexpected_index_column_names to pk_1.

expectationConfiguration = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={
        "column": "animals",
        "value_set": ["cat", "fish", "dog"],
        "result_format": {
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["pk_1"],
        },
    },
)

After running this new ExpectationConfiguration, we expect the unexpected_index_list to be [{"pk_1": 3}, {"pk_1": 4}, {"pk_1": 5}], which are the pk_1 values of the rows containing ["giraffe", "lion", "zebra"].
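To make the shape of that result concrete, here is a minimal sketch that reproduces the same key-value pairs with the standard-library sqlite3 module rather than Great Expectations. The helper name build_unexpected_index_list, the in-memory database, and the composite-key call are assumptions for illustration only; the default limit of 10 mirrors the truncation mentioned in the PR description, and this is not the library's implementation.

```python
import sqlite3


def build_unexpected_index_list(conn, table, column, value_set, pk_columns, limit=10):
    """Hypothetical helper: PK key-value pairs for rows whose value is not in value_set."""
    # Identifiers are interpolated directly, so they must come from trusted code.
    placeholders = ", ".join("?" for _ in value_set)
    pk_sql = ", ".join(pk_columns)
    query = (
        f"SELECT {pk_sql} FROM {table} "
        f"WHERE {column} IS NOT NULL AND {column} NOT IN ({placeholders}) "
        f"ORDER BY {pk_sql} LIMIT ?"
    )
    rows = conn.execute(query, [*value_set, limit]).fetchall()
    # One dict per unexpected row, keyed by the requested PK column names.
    return [dict(zip(pk_columns, row)) for row in rows]


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal_names (pk_1 INTEGER, pk_2 TEXT, animals TEXT)")
conn.executemany(
    "INSERT INTO animal_names VALUES (?, ?, ?)",
    [
        (0, "zero", "cat"),
        (1, "one", "fish"),
        (2, "two", "dog"),
        (3, "three", "giraffe"),
        (4, "four", "lion"),
        (5, "five", "zebra"),
    ],
)

domestic = ["cat", "fish", "dog"]
single_key = build_unexpected_index_list(conn, "animal_names", "animals", domestic, ["pk_1"])
composite_key = build_unexpected_index_list(
    conn, "animal_names", "animals", domestic, ["pk_1", "pk_2"]
)
print(single_key)
print(composite_key)
```

The composite-key call only shows what this sketch produces when two columns are passed; whether the SQL metric returns the same shape for multiple columns should be checked against the tests in this PR.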

What if I have a lot of rows in my table?

To retrieve the full list of unexpected values from the table, the result also contains an unexpected_index_query, which can be copied into a database client to retrieve all the unexpected rows.

SELECT animals, pk_1
FROM animal_names
WHERE animals IS NOT NULL AND (animals NOT IN ('cat', 'fish', 'dog'))

The query (along with unexpected_index_list) is returned as part of the result values returned by the Validator.

...
"unexpected_index_list": [{"pk_1": 3}, {"pk_1": 4}, {"pk_1": 5}],
"unexpected_index_query": "SELECT animals, pk_1 \n"
"FROM animal_names \n"
"WHERE animals IS NOT NULL AND (animals NOT IN "
"('cat', 'fish', 'dog'))",
...
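As a rough illustration of how the returned query might be used, the sketch below takes a result fragment shaped like the one above and runs its unexpected_index_query against an in-memory SQLite copy of animal_names. The result dict is hand-written here rather than produced by a Validator, and the database setup is an assumption made so the example is self-contained.

```python
import sqlite3

# Hand-written stand-in for the relevant part of a validation result.
result = {
    "unexpected_index_list": [{"pk_1": 3}, {"pk_1": 4}, {"pk_1": 5}],
    "unexpected_index_query": (
        "SELECT animals, pk_1 \n"
        "FROM animal_names \n"
        "WHERE animals IS NOT NULL AND (animals NOT IN "
        "('cat', 'fish', 'dog'))"
    ),
}

# Assumed setup: an in-memory table with the same contents as the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE animal_names (pk_1 INTEGER, pk_2 TEXT, animals TEXT)")
conn.executemany(
    "INSERT INTO animal_names VALUES (?, ?, ?)",
    [
        (0, "zero", "cat"),
        (1, "one", "fish"),
        (2, "two", "dog"),
        (3, "three", "giraffe"),
        (4, "four", "lion"),
        (5, "five", "zebra"),
    ],
)

# The query is not truncated, so it returns every unexpected row.
unexpected_rows = sorted(conn.execute(result["unexpected_index_query"]).fetchall())
for animal, pk_1 in unexpected_rows:
    print(pk_1, animal)
```

Because the query carries no row limit, it is the way to get past the 10-row truncation of unexpected_index_list on large tables.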

Definition of Done

Please delete options that are not relevant.

My code follows the Great Expectations style guide
I have performed a self-review of my own code
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
I have added unit tests where applicable and made sure that new and existing tests are passing.
I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!



netlify · 2022-11-29T18:22:12Z

✅ Deploy Preview for niobium-lead-7998 ready!

Name	Link
🔨 Latest commit	`2fe0627`
🔍 Latest deploy log	https://app.netlify.com/sites/niobium-lead-7998/deploys/638e4b59a7a82f0007c37fa8
😎 Deploy Preview	https://deploy-preview-6448--niobium-lead-7998.netlify.app


Shinnnyshinshin · 2022-12-05T17:32:55Z

great_expectations/expectations/metrics/util.py

+    if dialect_name in ["sqlite", "trino", "mssql"]:
+        params = (repr(compiled.params[name]) for name in compiled.positiontup)
+        query_as_string = re.sub(r"\?", lambda m: next(params), str(compiled))
+
+    else:
+        params = (repr(compiled.params[name]) for name in list(compiled.params.keys()))
+        query_as_string = re.sub(r"%\(.*?\)s", lambda m: next(params), str(compiled))
+
+        # bigquery inserts extra '`' character for compiled statement.
+        # clean up string before returning
+        if dialect_name == "bigquery":
+            query_as_string = re.sub(r"`", "", query_as_string)
+
+    return query_as_string


Return SQL query according to backend
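The substitution in the hunk above can be tried without a database. The following is a simplified sketch of the same idea, assuming only that the compiled statement uses either positional "?" placeholders or named "%(name)s" placeholders. The function names substitute_qmark and substitute_pyformat are hypothetical, and the named variant looks each parameter up by its name rather than by iteration order, which is a deliberate variation on the diff.

```python
import re


def substitute_qmark(sql, ordered_params):
    """Replace each '?' with the repr of the next parameter, in order."""
    params = (repr(value) for value in ordered_params)
    # Caveat: a literal '?' inside the SQL text would also be replaced.
    return re.sub(r"\?", lambda match: next(params), sql)


def substitute_pyformat(sql, named_params):
    """Replace each '%(name)s' with the repr of the parameter of that name."""
    return re.sub(
        r"%\((.*?)\)s",
        lambda match: repr(named_params[match.group(1)]),
        sql,
    )


qmark_sql = "SELECT animals, pk_1 FROM animal_names WHERE animals NOT IN (?, ?, ?)"
pyformat_sql = "SELECT animals FROM animal_names WHERE animals = %(animals_1)s AND pk_1 > %(pk_1_1)s"

qmark_query = substitute_qmark(qmark_sql, ["cat", "fish", "dog"])
pyformat_query = substitute_pyformat(pyformat_sql, {"animals_1": "cat", "pk_1_1": 3})
print(qmark_query)
print(pyformat_query)
```

Using repr() is convenient for producing a readable query string, but it is not a safe quoting mechanism, so the output is best treated as something to inspect or paste into a client rather than to build from untrusted input.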

This reverts commit 40cf1f6.

This reverts commit c5e9741.

Shinnnyshinshin · 2022-12-05T19:25:26Z

great_expectations/expectations/metrics/util.py

+
+        # bigquery inserts extra '`' character for compiled statement.
+        # clean up string before returning
+        if dialect_name == "bigquery":


TODO: adjust test to compare the query from bigquery, rather than adjust the query from bigquery to match the test

anthonyburdi

LGTM! Thank you for the synchronous review and for adding the tests for param substitution for each backend 🙇


Will Shin added 16 commits November 17, 2022 15:01

Before the fix

017350c

Merge branch 'develop' into b/dx-67/bugfix-metrics-return-empty-value

2af5d8d

cleaned up db references

43d30ac

Merge branch 'develop' into b/dx-67/bugfix-metrics-return-empty-value

0c93b5d

bugfix

015b84f

updated etst to not include extra comments or output

61867c2

Update test_map_metric.py

c2043bd

Update test_map_metric.py

583f89d

Update test_map_metric.py

93ae465

update column names

3b983a6

Merge branch 'develop' into b/dx-67/bugfix-metrics-return-empty-value

712d9e2

Merge branch 'develop' into b/dx-67/bugfix-metrics-return-empty-value

cb54172

oops

7dd34c3

Sql Metrics added from other PR

75cc9b2

github-actions bot added the core-team label Nov 29, 2022

Will Shin added 2 commits November 30, 2022 10:43

Update test_map_metric.py

f5d1a05

Shinnnyshinshin self-assigned this Nov 30, 2022

Will Shin added 6 commits November 30, 2022 12:35

see if this works for now

6ab5d3e

push before final check

1cfc4df

fixed

9bfdde1

Merge branch 'develop' into b/dx-67/sql-pk-id-metric

2447162

cleaned up

5a11c0e

cleaned up final function

75134e9

Shinnnyshinshin marked this pull request as ready for review December 1, 2022 01:47

very large

7f03f3d

Will Shin added 15 commits December 2, 2022 17:21

final set

a8a5997

much cleaner

1090b07

athena needs special treatment

159966f

Merge branch 'develop' into b/dx-67/sql-pk-id-metric

97cf510

monkeypatch everything

2783c04

Update test_metrics_util.py

8ca9b3f

wow what was that

47caaca

ok so we here go again

9e2c00d

i think i finally haev it this time

ed131f1

Update metrics_test.db

6deb72e

Update test_metrics_util.py

72494f9

now the tests are fixed

7d63cd2

q1

071cf19

clean up of tests

40cf1f6

Merge branch 'develop' into b/dx-67/sql-pk-id-metric

ba177f6

Shinnnyshinshin commented Dec 5, 2022


Will Shin added 2 commits December 5, 2022 09:39

Revert "clean up of tests"

c5e9741

This reverts commit 40cf1f6.

Revert "Revert "clean up of tests""

2eb7aa9

This reverts commit c5e9741.

Shinnnyshinshin commented Dec 5, 2022


updated after synchronous review

2abc598

anthonyburdi approved these changes Dec 5, 2022


Merge branch 'develop' into b/dx-67/sql-pk-id-metric

2fe0627

Shinnnyshinshin enabled auto-merge (squash) December 5, 2022 19:49

Shinnnyshinshin merged commit db4bc22 into develop Dec 5, 2022

Shinnnyshinshin deleted the b/dx-67/sql-pk-id-metric branch December 5, 2022 20:35

talagluck mentioned this pull request Dec 9, 2022

Support to include ID/PK in validation result for each row that failed an expectations #3195

Closed

talagluck mentioned this pull request Dec 21, 2022

Display selected columns in data docs in case of failures #5265

Closed

Shinnnyshinshin mentioned this pull request Dec 23, 2022

[FEATURE] ID/PK Rendering in DataDocs #6637

Merged

5 tasks

Shinnnyshinshin mentioned this pull request Dec 30, 2022

[FEATURE] Support to include ID/PK in validation result for each row - Spark #6676

Merged

6 tasks
