Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEATURE] Support to include ID/PK in validation result for each row - SQL #6448

Merged
merged 56 commits into from
Dec 5, 2022

Conversation

Shinnnyshinshin
Copy link
Contributor

@Shinnnyshinshin Shinnnyshinshin commented Nov 29, 2022

Changes proposed in this pull request:

SQL implementation of unexpected_index_columns value that allows users to specify a primary key (PK) column for identifying rows that failed an Expectation (usually returned as part of the unexpected_index_list). The PR also enables unexpected_index_query to be returned to the user, which

Changes were made to the map_metric_provider to take in the parameter from result_format and output unexpected_index_list as key-value pairs of the primary key column.

Note : This is only the SQL implementation, Pandas has been merged already, and Spark to follow.

  • (Almost) closes #3195

What has changed?

  • unexpected_index_list and unexpected_index_query was added as validation dependencies for SqlAlchemyExecutionEngine

  • _sqlalchemy_map_condition_index was added as MapMetric.

    • This Metric returns the indices (truncated at 10 results for SQL) that contain unexpected values, according to the Primary Key values specified using unexpected_index_column_names. Note Unlike Pandas, the current SQL implementation does not have a default index that will be returned if unexpected_index_column_names is not specified (In other words, in order to see unexpected_index_list, the user must specify unexpected_index_column_names)
  • _sqlalchemy_map_condition_query was added as MapMetric.

    • This Metric returns the query that can be used to retrieve the full unexpected results.
    • This method depends on two helper methods :
      • sql_post_compile_to_string() : Used by the _sqlalchemy_map_condition_query() method to compile SQL select statement with post-compile parameters into a string. Logic lifted directly from the sqlalchemy documentation documentation.
      • get_sqlalchemy_source_table_and_schema_selectable(): Used by _sqlalchemy_map_condition_query() metric function to return the table associated with the current Batch, rather than the temp_table that is created as part of running the query.

Can you give me an Example? (heavily adopted from the Pandas example)

Given the following table animal_names

sqlite_path = file_relative_path(
       __file__, "../../test_sets/metrics_test.db"
    )
sqlite_engine = sa.create_engine(f"sqlite:///{sqlite_path}")
df = pd.DataFrame(
       {
           "pk_1": [0, 1, 2, 3, 4, 5],
           "pk_2": ["zero", "one", "two", "three", "four", "five"],
            "animals": [
               "cat",
               "fish",
               "dog",
               "giraffe",
                "lion",
                "zebra",
            ],
        }
    )
df.to_sql(
     name="animal_names",
     con=sqlite_engine,
     index=False,
     if_exists="replace",
 )

We could run the expect_column_values_to_be_in_set Expectation on the animals column with ["cat", "fish", "dog"] as the value_set (ie domestic animals).

expectationConfiguration = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={
        "column": "animals",
        "value_set": ["cat", "fish", "dog"],
        "result_format": {
            "result_format": "COMPLETE",
        },
    },
)

After running the ExpectationConfiguration, we would expect the unexpected_index_list to be ["giraffe", "lion", "zebra"] which correspond to the indices of 4, 5, and 6, the values that are not in the value_set.

This PR enables the following configuration, which sets unexpected_index_columns to be pk_1.

expectationConfiguration = ExpectationConfiguration(
    expectation_type="expect_column_values_to_be_in_set",
    kwargs={
        "column": "animals",
        "value_set": ["cat", "fish", "dog"],
        "result_format": {
            "result_format": "COMPLETE",
            "unexpected_index_column_names": ["pk_1"],
        },
    },
)

After running this new ExpectationConfiguration, we expect the unexpected_index_list to be [{"pk_1": 3}, {"pk_1": 4}, {"pk_1": 5}] which correspond to the values in the pk_1 column of the indices of ["giraffe", "lion", "zebra"].

What if I have a lot of rows in my table?

In order to retrieve the full list unexpected values from the table, the result also contains a unexpected_index_query, which can be copied into a db client to retrieve all the unexpected rows.

SELECT animals, pk_1 
FROM animal_names 
WHERE animals IS NOT NULL AND (animals NOT IN "
  "('cat', 'fish', 'dog'))

The query (along with unexpected_index_list) is returned as part of the result values returned by the Validator.

...
"unexpected_index_list": [{"pk_1": 3}, {"pk_1": 4}, {"pk_1": 5}],
"unexpected_index_query": "SELECT animals, pk_1 \n"
"FROM animal_names \n"
  "WHERE animals IS NOT NULL AND (animals NOT IN "
  "('cat', 'fish', 'dog'))",
...

Definition of Done

Please delete options that are not relevant.

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

Will Shin added 16 commits November 17, 2022 15:01
…ithub.com/great-expectations/great_expectations into b/dx-67/bugfix-metrics-return-empty-value

* 'b/dx-67/bugfix-metrics-return-empty-value' of https://github.com/great-expectations/great_expectations:
  [MAINTENANCE] Migrate additional methods from `BaseDataContext` to other parts of context hierarchy (#6388)
  [MAINTENANCE] move `zep` -> `experimental` package (#6378)
* develop: (22 commits)
  [BUGFIX] issue-4295-fix-issue (#6164)
  [DOCS] add boto3 explanations on document (#6407)
  [FEATURE] add multiple column metric (#6372)
  [MAINTENANCE] Small refactor (#6422)
  [MAINTENANCE] Sorting batch IDs and typehints clean up (#6421)
  [MAINTENANCE] Clean Up Type Hints and Minor Refactoring For Better Code Elegance/Readability (#6418)
  [MAINTENANCE] Implement `RendererConfiguration` (#6412)
  [BUGFIX] updated capitalone setup.py file (#6410)
  [FEATURE]: DataProfilerUnstructuredDataAssistant Integration (#6400)
  [FEATURE] add new metric - query template values (#5994)
  [MAINTENANCE] Cleanup For Better Code Elegance/Readability (#6406)
  [MAINTENANCE] ZEP - `GxConfig` cleanup (#6404)
  [MAINTENANCE] Migrate remaining methods from `BaseDataContext` (#6403)
  [BUGFIX] Patch key-generation issue with `DataContext.save_profiler()` (#6405)
  [MAINTENANCE] Migrate additional CRUD methods from `BaseDataContext` to `AbstractDataContext` (#6395)
  [MAINTENANCE] ZEP add yaml methods to all experimental models (#6401)
  [FEATURE] ZEP Config serialize as YAML (#6398)
  [MAINTENANCE] Remove call to verify_library_dependent_modules for pybigquery (#6394)
  [MAINTENANCE] Make "IDDict.to_id()" serialization more efficient. (#6389)
  [RELEASE] 0.15.34 (#6397)
  ...
@ghost
Copy link

ghost commented Nov 29, 2022

👇 Click on the image for a new way to code review
  • Make big changes easier — review code in small groups of related files

  • Know where to start — see the whole change at a glance

  • Take a code tour — explore the change with an interactive tour

  • Make comments and review — all fully sync’ed with github

    Try it now!

Review these changes using an interactive CodeSee Map

Legend

CodeSee Map Legend

@netlify
Copy link

netlify bot commented Nov 29, 2022

Deploy Preview for niobium-lead-7998 ready!

Name Link
🔨 Latest commit 2fe0627
🔍 Latest deploy log https://app.netlify.com/sites/niobium-lead-7998/deploys/638e4b59a7a82f0007c37fa8
😎 Deploy Preview https://deploy-preview-6448--niobium-lead-7998.netlify.app
📱 Preview on mobile
Toggle QR Code...

QR Code

Use your smartphone camera to open QR code link.

To edit notification comments on pull requests, go to your Netlify site settings.

Will Shin added 2 commits November 30, 2022 10:43
* develop:
  [MAINTENANCE] Additional `sqlite` database fixture for `taxi_data` - All 2020 data in single table (#6455)
  [BUGFIX] Metrics return value no longer returns None for `unexpected_index_list` - Sql and Spark (#6392)
  [DOCS] add configuration of anonymous_usage_statistics for documentati… (#6293)
  [BUGFIX] Fix for `mssql` tests that depend on `datetime` to `string` conversion (#6449)
  [FEATURE] add multiple input metric (#6373)
  [CONTRIB] add expectation - check gaps in SCD tables (#6433)
  [CONTRIB] Add no days missing expectation (#6432)
  [CONTRIB] Feature/add two tables expectation (#6429)
  [CONTRIB] Add number of unique values expectation (#6425)
  [MAINTENANCE] Clean Up Variable Names In Test Modules, Type Hints, and Minor Refactoring For Better Code Elegance/Readability (#6444)
@Shinnnyshinshin Shinnnyshinshin self-assigned this Nov 30, 2022
@Shinnnyshinshin Shinnnyshinshin marked this pull request as ready for review December 1, 2022 01:47
Comment on lines 1026 to 1039
if dialect_name in ["sqlite", "trino", "mssql"]:
params = (repr(compiled.params[name]) for name in compiled.positiontup)
query_as_string = re.sub(r"\?", lambda m: next(params), str(compiled))

else:
params = (repr(compiled.params[name]) for name in list(compiled.params.keys()))
query_as_string = re.sub(r"%\(.*?\)s", lambda m: next(params), str(compiled))

# bigquery inserts extra '`' character for compiled statement.
# clean up string before returning
if dialect_name == "bigquery":
query_as_string = re.sub(r"`", "", query_as_string)

return query_as_string
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return SQL query according to backend

Will Shin added 2 commits December 5, 2022 09:39

# bigquery inserts extra '`' character for compiled statement.
# clean up string before returning
if dialect_name == "bigquery":
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: adjust test to compare the query from bigquery, rather than adjust the query from bigquery to match the test

Copy link
Member

@anthonyburdi anthonyburdi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thank you for the synchronous review and for adding the tests for param substitution for each backend 🙇

@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) December 5, 2022 19:49
@Shinnnyshinshin Shinnnyshinshin merged commit db4bc22 into develop Dec 5, 2022
@Shinnnyshinshin Shinnnyshinshin deleted the b/dx-67/sql-pk-id-metric branch December 5, 2022 20:35
Shinnnyshinshin pushed a commit that referenced this pull request Dec 5, 2022
* develop: (63 commits)
  [FEATURE] Support to include ID/PK in validation result for each row - SQL (#6448)
  [BUGFIX] Support slack channel name with webhook also (#6481)
  Query the database for datetime column splitter defaults (#6482)
  [MAINTENANCE] Move "Domain" to "great_expectations/core" to avoid circular imports; also add MetricConfiguration tests; and other clean up. (#6484)
  [MAINTENANCE] Reformat core expectation docstrings (#6423)
  [MAINTENANCE] Staging for build gallery (#6480)
  [MAINTENANCE] Move zep method from datasource to data asset. (#6477)
  [MAINTENANCE] Minor cleanup for better code readability (#6478)
  [MAINTENANCE] Misc updates to PR template (#6479)
  [CONTRIB] Add uniqueness expectation (#6473)
  [RELEASE] 0.15.36 (#6476)
  Add pretty representations for zep pydantic models (#6472)
  [BUGFIX] Contrib Expectation tracebacks (#6471)
  [BUGFIX] Add additional error checking to `ExpectationAnonymizer` (#6467)
  Add docstring for context.sources.add_postgres (#6459)
  [MAINTENANCE] fixing type hints in metrics utils module (#6469)
  [MAINTENANCE] Moving tutorials to great-expectations repo (#6464)
  [BUGFIX] Patch issue with call to `ExpectationAnonymizer` to ensure `DataContext` init events are captured (#6458)
  [BUGFIX] Support Table and Column Names Case Non-Sensitivity Relationship Between Snowflake, Oracle, DB2, etc. DBMSs (Upper Case) and SQLAlchemy (Lower Case) Representations (#6450)
  Add sorters to zep postgres datasource. (#6456)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants