Support to include ID/PK in validation result for each row that failed an expectations #3195
Comments
This is a very important feature in my opinion. As Kenton mentioned, we will be able not only to assess the quality and correctness of the data, but also to take action and isolate erroneous rows rather than failing the entire dataset. |
Thanks for submitting this issue, @KentonParton, and the follow-up, @OmarSultan85 ! We will discuss internally, and I will get back to you in the next few days. |
Hi @KentonParton and @OmarSultan85 - apologies for delays! A part of the V3 API was making sure that this functionality was available, but @jcampbell and I worked on this draft PR (#3346) to link up all the pieces and get it working. When you have a chance, we'd love your thoughts on the interface of it, concerns, etc. Thanks! |
Hi @talagluck, thanks a lot for the update. I checked the PR, but I am not entirely sure I understood it correctly. What I understand is that the complete list of unexpected rows will now be returned as part of the validation result when it's of type COMPLETE. If that is the case, then I believe this would be of great use and would allow us to use Great Expectations in a pipeline that splits the incoming data in two: process and insert the correct rows, and for the unexpected ones perform some kind of action to clean the data, notify business users, etc. Is there any way I can try this new feature before it is released? I can't wait to integrate it into our product and pipelines :) |
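A minimal sketch of the split described in this comment, assuming the validation result exposes the IDs of the rows that failed. The function name, the `order_id` column, and the sample rows are illustrative only and are not part of Great Expectations:

```python
from typing import Any, Dict, Iterable, List, Tuple

Row = Dict[str, Any]


def split_by_unexpected_ids(
    rows: Iterable[Row], id_column: str, unexpected_ids: Iterable[Any]
) -> Tuple[List[Row], List[Row]]:
    """Partition rows into (valid, unexpected) using the IDs of failed rows."""
    failed = set(unexpected_ids)  # set lookup keeps the pass over rows linear
    valid: List[Row] = []
    unexpected: List[Row] = []
    for row in rows:
        (unexpected if row[id_column] in failed else valid).append(row)
    return valid, unexpected


rows = [
    {"order_id": 1, "amount": 10.0},
    {"order_id": 2, "amount": -5.0},
    {"order_id": 3, "amount": 7.5},
]
# Hypothetical: IDs reported by a validation run for "amount must be >= 0".
valid, unexpected = split_by_unexpected_ids(rows, "order_id", [2])
```

The valid rows can then continue through the pipeline while the unexpected ones are routed to cleaning or notification.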
Thanks so much for the feedback, @OmarSultan85! That is indeed the purpose of the feature, and I'm glad to hear that it will work for you! We are still figuring out the correct user interface for this, as it is possible that it will wind up returning far too much data, so we want to be explicit and careful so that we don't wind up including an entire table in validation results, for instance :) This is something that @KentonParton had called out, and it is important to keep in mind. What we're thinking right now is that we will add another We are also still looking into specifying a list of columns that would be returned for each |
any updates on this? |
Hi @talagluck, I noticed just now in #3764 that the GE team was planning on taking a look at this back in January. Is progress being made against this issue on the GE Jira board, or has this been de-prioritized? If it has been de-prioritized (as tends to happen in our world), is this the kind of feature that the GE team would be able to offer guidance on? Our dependence on pandas for the unexpected_index_list is fast becoming a constraint. We need this feature to be able to move to validating our data on a DB, so that we can continue to use GE as the size of our data scales. |
Hi @abekfenn - thanks so much for the ping, and thanks for calling this out. This is something that I've been picking up as I have time (and I don't have a lot of time). I don't think this is a huge amount of work to implement, especially in light of the work done in this PR for |
Hi @talagluck thanks for getting back to me. Totally understandable, my work day sounds very similar 😄. I would be grateful if we could get on a call to discuss the scope of the work. I'll reach out over Slack. |
Update: I've started working on this but have run into an issue that I hope to figure out soon. I do have some questions around the API that I would like GE/community feedback on. With the introduction of
I'm personally leaning towards one arg for |
Hello @abekfenn, sorry, I just saw this post; I must have missed the notification. With regard to the naming convention, I believe what you are leaning towards sounds reasonable and can be easily understood. I like the idea of having a subset of the columns along with the PK; it could be helpful in some use cases. As for including it in the same list or returning a new validation result, I believe a validation result could be the better choice, as it would most probably reduce the size of the dataset that the user is going to be working on and could make it easier to navigate through the unexpected IDs. |
Hey @abekfenn! Thanks for reaching out, great discussion here. I align with your thoughts here; one arg, one result field, with the aim of keeping it clean, simple, and consistent. |
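To illustrate the column-subset idea discussed in the comments above, here is a rough sketch that keeps only the PK plus a few chosen columns for each unexpected row, with an optional cap so a result cannot grow to the size of the whole table. All names here (`project_unexpected_rows`, `pk_columns`, `extra_columns`, `max_rows`) are hypothetical and are not the arguments proposed in the PR:

```python
from typing import Any, Dict, List, Optional, Sequence

Row = Dict[str, Any]


def project_unexpected_rows(
    unexpected_rows: Sequence[Row],
    pk_columns: Sequence[str],
    extra_columns: Sequence[str] = (),
    max_rows: Optional[int] = None,
) -> List[Row]:
    """Keep only the PK and selected columns, optionally capping the row count."""
    # PK columns come first; skip extras that are already PK columns.
    keep = list(pk_columns) + [c for c in extra_columns if c not in pk_columns]
    projected = [{c: row[c] for c in keep} for row in unexpected_rows]
    return projected if max_rows is None else projected[:max_rows]


failed_rows = [
    {"id": 7, "name": "a", "amount": -1, "note": "x"},
    {"id": 9, "name": "b", "amount": -3, "note": "y"},
]
subset = project_unexpected_rows(failed_rows, ["id"], ["amount"])
capped = project_unexpected_rows(failed_rows, ["id"], max_rows=1)
```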
Hi @abekfenn, I was wondering how the implementation was coming along. As in your case, the pandas unexpected index list is not a viable option for my company (a large Dutch insurer) because of the cost in speed. My team and I had similar thoughts about the solution and would love to help with implementing this feature. |
…hat failed an expectations Ticket Number: great-expectations#3195 Problem: - No arguments available to configure column subset of unexpected_rows Solution: - Add args to subset unexpected_rows Note:
Hi all, I created a very work-in-progress PR, #5876, at the request of @talagluck (thanks for your guidance thus far). I haven't been able to spend as much time on this as I had hoped, so apologies for the delay. Currently I've set up a new arg simply to see whether I can get this new expectation registered and recognized; I would then implement the actual subsetting functionality afterwards. Right now the expectation doesn't seem to get registered properly, so any help here would be greatly appreciated. |
@Shinnnyshinshin should this issue be re-opened to reflect ongoing development related to SQL engine support? |
Hi @abekfenn you're absolutely right 😅. Re-opening to track the ongoing development related to SQL and Spark |
Hi @KentonParton, @OmarSultan85, @sanmati7, and @abekfenn! There have been some exciting developments here: this has been implemented for SQL and expanded for pandas, with Spark support coming soon. We'd love it if you could try out the feature and let us know how it's working for you! Could you take a look at the description in #6448 to see how it works? Thanks again to @KentonParton for opening this initially, to @abekfenn for doing a bunch of work on the implementation, and to everyone else for the feedback! |
I'm a big fan of the changes you made, @Shinnnyshinshin. Thank you so much for picking up the baton on this and getting it across the finish line. |
Is your feature request related to a problem? Please describe.
When an expectation is run, the output includes a “partial_unexpected_list” property containing the values that were unexpected. While this is useful, in most cases it doesn't allow teams to identify, resolve, or divert poor-quality data.
It would be great if a sample list or a complete list (result_format SUMMARY, COMPLETE) of IDs for each row that failed an expectation were included in the validation result.
Describe the solution you'd like
One could include a column name as an argument in the expectation that they would like to be included in the "partial_unexpected_list" (or a new property).
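The request can be sketched in plain Python as follows. This is not Great Expectations code: `check_column`, the `id_column` argument, and the `unexpected_id_list` field are hypothetical names chosen for illustration, and the only assumption about behaviour is the one stated in the request (a sample of IDs for SUMMARY, all IDs for COMPLETE):

```python
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def check_column(
    rows: List[Row],
    column: str,
    id_column: str,
    predicate: Callable[[Any], bool],
    result_format: str = "SUMMARY",
    sample_size: int = 20,
) -> Dict[str, Any]:
    """Validate one column and report the IDs of the rows that failed."""
    failed = [row for row in rows if not predicate(row[column])]
    # COMPLETE reports every failing ID; other formats report only a sample.
    reported = failed if result_format == "COMPLETE" else failed[:sample_size]
    return {
        "success": len(failed) == 0,
        "result": {
            "unexpected_count": len(failed),
            "partial_unexpected_list": [row[column] for row in failed[:sample_size]],
            "unexpected_id_list": [row[id_column] for row in reported],
        },
    }


data = [{"id": i, "age": age} for i, age in enumerate([30, -1, 45, -7, -2])]
summary = check_column(data, "age", "id", lambda v: v >= 0, sample_size=2)
complete = check_column(data, "age", "id", lambda v: v >= 0, "COMPLETE", sample_size=2)
```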
Describe alternatives you've considered
One could try to infer the PK for a table, but this is not possible for all engine types, e.g. Spark.
Additional context
Enabling teams not only to bring data quality issues to light but also to identify every affected row allows them to address poor data quality in real time instead of requiring manual intervention.