Support to include ID/PK in validation result for each row that failed an expectations #3195

KentonParton · 2021-08-09T00:01:25Z

Is your feature request related to a problem? Please describe.
When an expectation is run the output includes a “partial_unexpected_list” property of values that were unexpected. While this is useful, in most cases, it doesn't allow teams to identify, resolve, or divert data with poor quality.

It would be great if a sample list or a complete list (result_format SUMMARY, COMPLETE) of IDs for each row that failed an expectation were included in the validation result.

Describe the solution you'd like
One could include a column name as an argument in the expectation that they would like to be included in the "partial_unexpected_list" (or a new property).

Describe alternatives you've considered
One could try to infer the PK for a table, but this is not possible for all engine types, e.g. Spark.

Additional context
Enabling teams to not only bring light to data quality issues but identify all rows allows them to address poor data quality in real-time instead of requiring manual intervention.

OmarSultan85 · 2021-08-10T07:56:10Z

This is a very important feature in my opinion. As Kenton mentioned, we would be able not only to assess the quality and correctness of the data, but also to take action and isolate erroneous rows rather than failing the entire dataset.

talagluck · 2021-08-10T17:54:09Z

Thanks for submitting this issue, @KentonParton, and the follow-up, @OmarSultan85 ! We will discuss internally, and I will get back to you in the next few days.

talagluck · 2021-09-02T21:06:17Z

Hi @KentonParton and @OmarSultan85 - apologies for delays! A part of the V3 API was making sure that this functionality was available, but @jcampbell and I worked on this draft PR (#3346) to link up all the pieces and get it working. When you have a chance, we'd love your thoughts on the interface of it, concerns, etc.

Thanks!

OmarSultan85 · 2021-09-10T03:21:33Z

Hi @talagluck ,

Thanks a lot for the update. I checked the PR, but I am not entirely sure I understood it correctly. My understanding is that the complete list of unexpected rows will now be returned as part of the validation result when the result_format is COMPLETE.

If that is the case, then I believe this would be of great use and would allow us to use Great Expectations in a pipeline that splits the incoming data in two: process the correct rows and insert them, and for the unexpected ones perform some kind of action to clean the data, notify business users, etc.
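A minimal sketch of the split described above, in plain pandas. It assumes the validation result exposes a list of ID values for the rows that failed; the `order_id` column, the sample data, and the `split_by_unexpected_ids` helper are hypothetical and not part of Great Expectations.

```python
import pandas as pd


def split_by_unexpected_ids(df: pd.DataFrame, id_column: str, unexpected_ids: list):
    """Split df into (valid_rows, unexpected_rows) using a list of failing IDs."""
    is_unexpected = df[id_column].isin(unexpected_ids)
    return df[~is_unexpected], df[is_unexpected]


# Hypothetical incoming batch; "order_id" plays the role of the ID/PK column.
batch = pd.DataFrame(
    {"order_id": [101, 102, 103, 104], "amount": [25.0, -3.0, 40.0, None]}
)

# Stand-in for the IDs that a validation result would report for failing rows.
unexpected_ids = [102, 104]

valid_rows, unexpected_rows = split_by_unexpected_ids(batch, "order_id", unexpected_ids)
# valid_rows can be inserted downstream; unexpected_rows can be cleaned or reported.
```

Keeping the split keyed on an ID column rather than a positional index means the same step should work whether the validation ran in pandas or against a database.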

Any way I can try this new feature before it is released? I can't wait to integrate it into our product and pipelines :)

talagluck · 2021-09-13T19:52:49Z

Thanks so much for the feedback, @OmarSultan85 ! That is indeed the purpose of the feature, and I'm glad to hear that that will work for you!

We are still figuring out the correct user interface for this, as it is possible that it will wind up returning far too much data, and so we would want to be explicit and careful so that we don't wind up including an entire table in validation results, for instance :) This is something that @KentonParton had called out, and is important to keep in mind.

What we're thinking right now is that we will add another result_format option (e.g. something like INCLUDE_ALL_UNEXPECTED_ROWS) that would return the entire contents of this. This would make sure that only users who definitely want this and know what they are doing would use this.

We are also still looking into specifying a list of columns that would be returned for each unexpected row. This is more along the lines of the original request for this issue. In this case, for instance, you would be able to say that where column_a has unexpected rows, return the contents of column_b and column_c. This seems achievable, but it goes pretty deep into the code (since it would affect all column_map expectations), and so we're working on it, but it will take some more time.
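To make the two ideas above concrete, here is a rough pandas-only sketch: it returns the contents of column_b and column_c for rows where column_a is unexpected, and only returns every failing row when the caller explicitly opts in. The function name and the `max_rows` and `include_all` parameters are assumptions for illustration, not the Great Expectations interface.

```python
import pandas as pd


def unexpected_rows_subset(df, column, is_expected, return_columns,
                           max_rows=20, include_all=False):
    """Return selected columns for rows where `column` fails `is_expected`.

    At most `max_rows` rows are returned unless include_all=True is passed,
    so a caller cannot pull back a whole table by accident.
    """
    passed = df[column].map(is_expected).astype(bool)
    failed = df[~passed]
    if not include_all:
        failed = failed.head(max_rows)
    return failed[return_columns].to_dict(orient="records")


df = pd.DataFrame(
    {
        "column_a": [1, -2, 3, -4],
        "column_b": ["w", "x", "y", "z"],
        "column_c": [10, 20, 30, 40],
    }
)

# Default behaviour: a capped sample of the failing rows.
rows = unexpected_rows_subset(
    df, "column_a", lambda v: v > 0, ["column_b", "column_c"], max_rows=1
)

# Explicit opt-in: every failing row.
all_rows = unexpected_rows_subset(
    df, "column_a", lambda v: v > 0, ["column_b", "column_c"], include_all=True
)
```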

sanmati7 · 2022-02-28T18:03:54Z

any updates on this?

abekfenn · 2022-05-06T00:30:21Z

Hi @talagluck, I noticed just now in #3764 that the GE team was planning on taking a look at this back in January.

Is progress being made against this issue on the GE jira board or has this been de-prioritized? If it has been de-prioritized (as tends to happen in our world) is this the kind of feature that the GE team would be able to offer guidance on?

Our dependence on pandas for the unexpected_index_list is fast becoming a constraint. We need this feature to be able to move to validating our data on a DB in order to allow us to continue to use GE as the size of our data scales.

talagluck · 2022-05-06T21:11:11Z

Hi @abekfenn - thanks so much for the ping, and thanks for calling this out. This is something that I've been picking up as I have time (and I don't have a lot of time). I don't think this is a huge amount of work to implement, especially in light of the work done in this PR for unexpected_rows; it's just a matter of resources. If this is something on which you have the capacity to work, I'd be happy to offer guidance and support for design and code review. Let me know and we can set up a call.

abekfenn · 2022-05-10T16:36:26Z

Hi @talagluck thanks for getting back to me. Totally understandable, my work day sounds very similar 😄. I would be grateful if we could get on a call to discuss the scope of the work. I'll reach out over Slack.

abekfenn · 2022-06-06T16:17:05Z

Update

I've started working on this but have run into an issue that I hope to figure out soon.

I do have some questions around the API that I would like GE/community feedback on.

With the introduction of unexpected_rows I'm torn between calling this new result format arg unexpected_index_columns or unexpected_row_columns. The idea is that a user should provide to this arg a list of columns in the table that should be subsetted from the unexpected rows.
This feature request was initially to support ID/PKs in the validation results, however I could see a use case whereby someone using the PandasExecutionEngine could want both unexpected_index_list and a subset of columns from the unexpected rows.

I see GE generally making use of binary args, but the arg row_condition led me to think this would be the cleanest naming convention in line with existing result format args, not to mention user interaction.

However, I wanted to be sure that the GE team/community wouldn't prefer something like include_unexpected_index_columns AND unexpected_index_columns?

Similarly, I'm wondering if the returned list should be a new field in the validation results or if we should simply set unexpected_index_list when unexpected_index_columns is set?

This would keep behavior consistent across SQLAlchemy and Pandas execution engines, but the only concern is that people may want to use the same functionality for pandas (returning arbitrary columns rather than the pandas index list), and so that might create some confusion.
Alternatively, we could set unexpected_index_list only for SQLAlchemy, but I think this only confuses matters, and it probably wouldn't do much to help keep behavior consistent across engines.

I'm personally leaning towards one arg for unexpected_row_columns and one new result field, unexpected_row_columns or something similar, but given that this deviates slightly from the original intention of this ticket, I wanted to open it for discussion.
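A rough sketch of the "one arg, one new result field" option, written in plain pandas rather than against the GE codebase. The name `unexpected_row_columns` is the candidate from this comment, not a released API, and `build_result` is a hypothetical helper. The pandas `unexpected_index_list` is left untouched so both can coexist.

```python
import pandas as pd


def build_result(df: pd.DataFrame, unexpected_mask: pd.Series, result_format: dict) -> dict:
    """Assemble a result dict with one optional, user-driven field."""
    unexpected = df[unexpected_mask]
    result = {
        "unexpected_count": int(unexpected_mask.sum()),
        # Pandas-only behaviour kept as-is: labels from the DataFrame index.
        "unexpected_index_list": unexpected.index.tolist(),
    }
    # Candidate arg name from this thread; only populated when the user asks for it.
    columns = result_format.get("unexpected_row_columns")
    if columns:
        result["unexpected_row_columns"] = unexpected[columns].to_dict(orient="records")
    return result


df = pd.DataFrame(
    {"pk": ["a", "b", "c"], "region": ["eu", "us", "eu"], "value": [1.0, None, None]}
)
mask = df["value"].isna()  # rows failing a "not null" style check

with_columns = build_result(
    df, mask, {"result_format": "COMPLETE", "unexpected_row_columns": ["pk", "region"]}
)
without_columns = build_result(df, mask, {"result_format": "COMPLETE"})
```

With a separate field, a SQL engine could fill `unexpected_row_columns` without having to emulate a pandas index, which is the consistency concern raised above.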

OmarSultan85 · 2022-06-27T21:37:12Z

Hello @abekfenn

Sorry, I just saw this post; I must have missed the notification. With regard to the naming convention, I believe what you are leaning towards sounds reasonable and can be easily understood.

I like the idea of having a subset of the columns along with the PK; it could be helpful in some use cases.

As for including it in the same list or returning a new validation result, I believe a validation result could be a better choice as most probably it would reduce the size of the dataset that the user is going to be working on and could make things easier to navigate through the unexpected IDs.

austiezr · 2022-07-25T15:06:10Z

Hey @abekfenn! Thanks for reaching out, great discussion here. I align with your thoughts here; one arg, one result field, with the aim of keeping it clean, simple, and consistent.

DLZRR · 2022-08-26T09:46:49Z

Hi @abekfenn, I was wondering how the implementation was coming along. As in your case, the Pandas unexpected index list is not a viable option for my company (large Dutch insurer) because of the cost in speed. I (and my team) had similar thoughts for the solution and would love to help with implementing this feature.

…hat failed an expectations
Ticket Number: great-expectations#3195
Problem:
- No arguments available to configure column subset of unexpected_rows
Solution:
- Add args to subset unexpected_rows
Note:

abekfenn · 2022-08-30T05:04:02Z

Hi all, I created a very much work-in-progress PR, #5876, at the request of @talagluck (thanks for your guidance thus far). I haven't been able to spend as much time on this as I would have hoped, so apologies for the delay.

Currently I've set up a new arg simply to see if I can get this new expectation registered and recognized, and would then implement the actual subsetting functionality thereafter. Right now, the expectation doesn't seem to get registered properly, so any help here would be greatly appreciated.

Shinnnyshinshin · 2022-11-10T03:11:53Z

Thank you very very much @abekfenn for getting the ball rolling, and all the work you put in. I was able to start where you left off, and have a PR for the Pandas implementation that is currently under review by the team #6329 . I'll post more updates here :)

abekfenn · 2022-11-15T19:30:39Z

@Shinnnyshinshin should this issue be re-opened to reflect ongoing development related to SQL engine support?

Shinnnyshinshin · 2022-11-16T17:11:52Z

Hi @abekfenn you're absolutely right 😅. Re-opening to track the ongoing development related to SQL and Spark

talagluck · 2022-12-09T12:45:21Z

Hi @KentonParton, @OmarSultan85, @sanmati7, and @abekfenn! There have been some exciting developments here, and this has been implemented for SQL and expanded for pandas, with Spark support coming soon. We'd love if you could try out the feature and let us know how it's working for you! Could you take a look at the description in #6448 to see how it works?

Thanks again to @KentonParton for opening this initially, @abekfenn for doing a bunch of work on the implementation, and everyone else for the feedback!

abekfenn · 2022-12-15T05:39:44Z

I'm a big fan of the changes you made @Shinnnyshinshin thank you so much for picking up the baton on this and getting this across the finish line.
I'm particularly fond of the implementation of unexpected_index_query. This should be a huge help since a great use case for GE on a database is larger datasets in the first place.
Can't wait to try this out!
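The thread names unexpected_index_query without showing what it contains. As an illustration only, and assuming it is a query string that selects the index columns of failing rows, the idea can be sketched with the standard-library `sqlite3` module. The table, columns, and `build_unexpected_index_query` helper are hypothetical.

```python
import sqlite3


def build_unexpected_index_query(table: str, index_columns: list, unexpected_condition: str) -> str:
    """Build a SELECT that returns only the index columns of failing rows.

    Plain string formatting is used for brevity; it is not safe for untrusted input.
    """
    column_list = ", ".join(index_columns)
    return f"SELECT {column_list} FROM {table} WHERE {unexpected_condition}"


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER PRIMARY KEY, amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 10.0), (2, None), (3, -5.0), (4, 7.5)],
)

# The query itself is small and can be stored in a result; the failing rows
# are only fetched when someone actually runs it.
query = build_unexpected_index_query(
    "orders", ["order_id"], "amount IS NULL OR amount < 0"
)
unexpected_ids = [row[0] for row in conn.execute(query)]
```

Returning a query instead of the rows themselves keeps the validation result small on large tables, which matches the use case described above.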

petermoyer added the community label Aug 9, 2021

talagluck added the triage label Aug 10, 2021

talagluck mentioned this issue Sep 2, 2021

[FEATURE] Add new result_format to include unexpected_row_list #3346

Merged

4 tasks

NathanFarmer added devrel and removed triage labels Sep 23, 2021

talagluck added core-engineering-queue and removed devrel labels Oct 26, 2021

talagluck mentioned this issue Nov 12, 2021

Add Metrics on other columns to the validation result #3671

Closed

This was referenced Nov 24, 2021

Result_format "COMPLETE" does return "unexpected_index_list": null #3736

Closed

Data docs does not contain the results of my expectations #3658

Closed

talagluck mentioned this issue Dec 1, 2021

capture unexpected _index_list for pyspark dataframe #3716

Closed

talagluck added devrel and removed core-engineering-queue labels Jan 14, 2022

This was referenced Jan 14, 2022

Ability to have additional columns in the unexpected_list result output #3764

Closed

Ability to include more useful aggregate metrics in the result object than just 'count' of failures #3765

Closed

jdimatteo mentioned this issue Feb 25, 2022

[Feature] Optionally Show (Partial) Unexpected Rows in Data Docs #4185

Closed

abekfenn mentioned this issue Apr 13, 2022

Add support for concurrent evaluation of an expectation suite in chunks #4855

Closed

talagluck added the feature label Jul 28, 2022

abekfenn mentioned this issue Aug 30, 2022

Subject: Support to include ID/PK in validation result for each row t… #5876

Merged

5 tasks

Shinnnyshinshin mentioned this issue Nov 10, 2022

[FEATURE] Support to include ID/PK in validation result for each row - Pandas #6329

Closed

5 tasks

Shinnnyshinshin closed this as completed in #5876 Nov 15, 2022

Shinnnyshinshin reopened this Nov 16, 2022

talagluck mentioned this issue Nov 23, 2022

unexpected_index_list on Spark #6359

Closed

Shinnnyshinshin mentioned this issue Dec 1, 2022

[FEATURE] Support to include ID/PK in validation result for each row - SQL #6448

Merged

6 tasks

talagluck mentioned this issue Dec 21, 2022

Display selected columns in data docs in case of failures #5265

Closed

Shinnnyshinshin mentioned this issue Dec 30, 2022

[FEATURE] Support to include ID/PK in validation result for each row - Spark #6676

Merged

6 tasks

Shinnnyshinshin closed this as completed in #6676 Jan 3, 2023
