Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support to include ID/PK in validation result for each row that failed an expectations #3195

Closed
KentonParton opened this issue Aug 9, 2021 · 19 comments · Fixed by #5876 or #6676
Closed
Labels
community devrel This item is being addressed by the Developer Relations Team feature

Comments

@KentonParton
Copy link

Is your feature request related to a problem? Please describe.
When an expectation is run the output includes a “partial_unexpected_list” property of values that were unexpected. While this is useful, in most cases, it doesn't allow teams to identify, resolve, or divert data with poor quality.

It would be great if a sample list or a complete list (result_format SUMMARY, COMPLETE) of ID's for each row that failed an expectation was included in the validation result.

Describe the solution you'd like
One could include a column name as an argument in the expectation that they would like to be included in the "partial_unexpected_list" (or a new property).

Describe alternatives you've considered
One could try infer the PK for a table but this is not possible for all engine types E.g. Spark.

Additional context
Enabling teams to not only bring light to data quality issues but identify all rows allows them to address poor data quality in real-time instead of requiring manual intervention.

@OmarSultan85
Copy link

This is a very important feature in my opinion, as Kenton mentioned, we will be able to not only identify the data quality and correctness on the data, but take action as well and help isolate erroneous rows rather than failing the entire dataset.

@talagluck talagluck added the triage Used by the GE core team to flag issues that were not yet triaged label Aug 10, 2021
@talagluck
Copy link
Contributor

Thanks for submitting this issue, @KentonParton, and the follow-up, @OmarSultan85 ! We will discuss internally, and I will get back to you in the next few days.

@talagluck
Copy link
Contributor

Hi @KentonParton and @OmarSultan85 - apologies for delays! A part of the V3 API was making sure that this functionality was available, but @jcampbell and I worked on this draft PR (#3346) to link up all the pieces and get it working. When you have a chance, we'd love your thoughts on the interface of it, concerns, etc.

Thanks!

@OmarSultan85
Copy link

Hi @talagluck ,

Thanks a lot for the update, I checked the PR but I am not entirely sure if I understood correctly. What I understand is that now the list of unexpected rows will be returned completely as part of the validation result when its of type COMPLETE.

If that is the case then I believe this would be of great use and would allows us to use great expectations in a pipeline that could split the incoming data into two, process correct rows and insert them and for the unexpected ones perform some kind of action to clean the data, notify business users, etc.

Anyway I can try this new feature before it is released? I can't wait to integrate it into our product and pipelines :)

@talagluck
Copy link
Contributor

Thanks so much for the feedback, @OmarSultan85 ! That is indeed the purpose of the feature, and I'm glad to hear that that will work for you!

We are still figuring out the correct user-interface for this, as it is possible that it will wind up returning far too much data, and so we would want to be explicit and careful so that we don't wind up including an entire table in validation, results for instance :) This is something that @KentonParton had called out, and is important to keep in mind.

What we're thinking right now is that we will add another result_format option (e.g. something like INCLUDE_ALL_UNEXPECTED_ROWS) that would return the entire contents of this. This would make sure that only users who definitely want this and know what they are doing would use this.

We are also still looking into specifying a list of columns that would be returned for each unexpected_rows. This is more along the lines of the original request for this issue. In this case, for instance, you would be able to say that where column_a has unexpected rows, return the contents of column_b and column_c. This seems achievable, but it goes pretty deep into the code (since it would affect all column_map expectations), and so we're working on it, but it will take some more time.

@NathanFarmer NathanFarmer added devrel This item is being addressed by the Developer Relations Team and removed triage Used by the GE core team to flag issues that were not yet triaged labels Sep 23, 2021
@talagluck talagluck added core-engineering-queue and removed devrel This item is being addressed by the Developer Relations Team labels Oct 26, 2021
@talagluck talagluck added devrel This item is being addressed by the Developer Relations Team and removed core-engineering-queue labels Jan 14, 2022
@sanmati7
Copy link

any updates on this?

@abekfenn
Copy link
Contributor

abekfenn commented May 6, 2022

Hi @talagluck I noticed just now that in #3764 that the GE team was planning on taking a look at this back in January.

Is progress being made against this issue on the GE jira board or has this been de-prioritized? If it has been de-prioritized (as tends to happen in our world) is this the kind of feature that the GE team would be able to offer guidance on?

Our dependence on pandas for the unexpected_index_list is fast becoming a constraint. We need this feature to be able to move to validating our data on a DB in order to allow us to continue to use GE as the size of our data scales.

@talagluck
Copy link
Contributor

Hi @abekfenn - thanks so much for the ping, and thanks for calling this out. This is something that I've been picking up as I have time (and I don't have a lot of time). I don't think this is a huge amount of work to implement, especially in light of the work done in this PR for unexpected_rows , it's just a matter of resources. If this is something on which you have the capacity to work, I'd be happy to offer guidance and support for design and code review. Let me know and we can set up a call.

@abekfenn
Copy link
Contributor

Hi @talagluck thanks for getting back to me. Totally understandable, my work day sounds very similar 😄. I would be grateful if we could get on a call to discuss the scope of the work. I'll reach out over Slack.

@abekfenn
Copy link
Contributor

abekfenn commented Jun 6, 2022

Update

I've started working on this but run into an issue that I hope to figure out soon.

I do have some questions around the API that I would like GE/community feedback on.

With the introduction of unexpected_rows I'm torn between calling this new result format arg, unexpected_index_columns or unexpected_row_columns. The idea being that a user should provide to this arg a list of columns in the table that should be sub-setted from the unexpected rows.
This feature request was initially to support ID/PKs in the validation results, however I could see a use case whereby someone using the PandasExecutionEngine could want both unexpected_index_list and a subset of columns from the unexpected rows.

  1. I see GE generally making use of binary args, but the arg row_condition led me to thinking this would be the cleanest naming convention in line with existing result format args, not to mention user interaction.
  • However, I wanted to be sure that the GE team/community wouldn't prefer something like include_unexpected_index_columns AND unexpected_index_columns?
  1. Similarly, I'm wondering if the returned list should be a new field in the validation results or if we should simply set unexpected_index_list when unexpected_index_columns is set?
  • This would keep behavior consistent across SQLAlchemy and Pandas execution engines but the only concern is that people may want to use the same functionality for pandas - returning arbitrary columns rather than the pandas index list - and so that might create some confusion
  • Alternatively, we could set unexpected_index_list only for SQLAlchemy but I think this only confuses matters, and it probably wouldn't do much to help keep behavior consistent across engines.

I'm personally leaning towards one arg for unexpected_row_columns and one new result field unexpected_row_columns or something alike but given that this deviates slightly from the original intention of this ticket, I wanted to open it for discussion.

@OmarSultan85
Copy link

Hello @abekfenn

Sorry I just saw this post must have missed the notification. With regards to the naming convention, I believe what you are leaning towards sounds reasonable and can be easily understood.

I like the idea of having a subset of the columns along with the PK, could be helpful in some usecases.

As for including it in the same list or returning a new validation result, I believe a validation result could be a better choice as most probably it would reduce the size of the dataset that the user is going to be working on and could make things easier to navigate through the unexpected IDs.

@austiezr
Copy link
Contributor

Hey @abekfenn! Thanks for reaching out, great discussion here. I align with your thoughts here; one arg, one result field, with the aim of keeping it clean, simple, and consistent.

@DLZRR
Copy link

DLZRR commented Aug 26, 2022

Hi @abekfenn, I was wondering how the implementation was coming along. Similar to you the Pandas unexpected index list is not a viable option for my company (large Dutch insurer) because of the cost in speed. I (and my team) had similar thoughts for the solution and would love to help with implementing this feature.

abekfenn added a commit to abekfenn/great_expectations that referenced this issue Aug 30, 2022
…hat failed an expectations

Ticket Number: great-expectations#3195

Problem:

- No arguments available to configure column subset of unexpected_rows

Solution:

- Add args to subset unexpected_rows

Note:
@abekfenn
Copy link
Contributor

Hi all, I created a very work in progress PR here #5876 at the request of @talagluck (thanks for your guidance thus far). I haven't been able to spend as much time on this as I would have hoped so apologies for the delay.

Currently I've setup a new arg simply to see if I can get this new expectation registered and recognized, and would then implement the actual subsetting functionality thereafter. Right now, the expectation doesn't seem to get registered properly, so any help here would be greatly appreciated.

@Shinnnyshinshin
Copy link
Contributor

Thank you very very much @abekfenn for getting the ball rolling, and all the work you put in. I was able to start where you left off, and have a PR for the Pandas implementation that is currently under review by the team #6329 . I'll post more updates here :)

@abekfenn
Copy link
Contributor

@Shinnnyshinshin should this issue be re-opened to reflect ongoing development related to SQL engine support?

@Shinnnyshinshin
Copy link
Contributor

Hi @abekfenn you're absolutely right 😅. Re-opening to track the ongoing development related to SQL and Spark

@talagluck
Copy link
Contributor

Hi @KentonParton, @OmarSultan85, @sanmati7, and @abekfenn! There have been some exciting developments here, and this has been implemented for SQL and expanded for pandas, with Spark support coming soon. We'd love if you could try out the feature and let us know how it's working for you! Could you take a look at the description in #6448 to see how it works?

Thanks again for @KentonParton for opening this initially, @abekfenn for doing a bunch of work on the implementation, and everyone else for the feedback!

@abekfenn
Copy link
Contributor

I'm a big fan of the changes you made @Shinnnyshinshin thank you so much for picking up the baton on this and getting this across the finish line.
I'm particularly fond of the implementation of unexpected_index_query. This should be a huge help since a great use case for GE on a database is larger datasets in the first place.
Can't wait to try this out!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment