Data docs does not contain the results of my expectations #3658
Comments
Hi @abekfenn! Thanks so much for the question. While you are running a validate step and saving your suite with batch.save_expectation_suite(discard_failed_expectations=False), you will also need to run a Checkpoint in order for the results to show up in Data Docs:

```python
from great_expectations.checkpoint import LegacyCheckpoint

results = LegacyCheckpoint(
    name="_temp_checkpoint",
    data_context=context,
    batches=[
        {
            "batch_kwargs": batch_kwargs,
            "expectation_suite_names": [expectation_suite_name],
        }
    ],
).run()
validation_result_identifier = results.list_validation_result_identifiers()[0]
context.build_data_docs()
context.open_data_docs(validation_result_identifier)
```

You'll find some more information about configuring Checkpoints in our docs, though there will be slight differences between V2 and V3. I believe you'll have more expressiveness with your Checkpoints in V3. I'm going to close this issue for now, but feel free to chime in if you have additional questions here.
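(For reference, a rough sketch of a V3-style equivalent is below. This is illustrative, not an official recipe: the datasource, connector, asset, suite, and checkpoint names are all placeholders, and it assumes a Datasource with a RuntimeDataConnector has already been configured.)

```python
import pandas as pd
from great_expectations.data_context import DataContext
from great_expectations.core.batch import RuntimeBatchRequest

context = DataContext()  # loads an initialized great_expectations project
df = pd.DataFrame({"id": [1, 2, 3]})  # stand-in for your pipeline's dataframe

batch_request = RuntimeBatchRequest(
    datasource_name="my_datasource",             # placeholder
    data_connector_name="my_runtime_connector",  # placeholder RuntimeDataConnector
    data_asset_name="my_asset",
    runtime_parameters={"batch_data": df},
    batch_identifiers={"run_id": "example"},     # keys must match the connector config
)

context.add_checkpoint(
    name="my_checkpoint",
    config_version=1,
    class_name="SimpleCheckpoint",
)
results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    validations=[
        {"batch_request": batch_request, "expectation_suite_name": "my_suite"}
    ],
)
context.build_data_docs()
```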
Hi @talagluck, thanks so much for your response. This worked for me, but then I encountered a bug which was recently resolved for V3 (#3225). I've tried to switch my code over to V3 but keep encountering an error, and I cannot find any documentation on in-code expectations using V3 Checkpoints and Data Docs. If you could point out the error in my logic below, I would be very grateful.
Error message:
Hi @abekfenn - at first glance, I don't see anything obviously incorrect. What line is triggering the error? Is it the ...? Are you able to save the checkpoint, and then call ...?
Correct. Should I be referencing this path elsewhere too?
Hi @abekfenn - on second glance, it appears that you haven't configured a Store in the config you're using, as this appears to be commented out, though it's hard to tell exactly. Have you seen our docs on configuring a DataContext without a .yml file? I suspect this may be what you want. Currently it seems that you are using a hybrid approach, and so I think your Stores are not being configured correctly, and this can be tough to troubleshoot. I would recommend either working entirely with an in-memory DataContext, or entirely with a .yml-based config, at least initially.
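(For anyone following along: a minimal in-memory DataContext, configured entirely in code without a great_expectations.yml, can be built along these lines. The root directory path is a placeholder.)

```python
from great_expectations.data_context import BaseDataContext
from great_expectations.data_context.types.base import (
    DataContextConfig,
    FilesystemStoreBackendDefaults,
)

# All Stores (expectations, validations, data docs) default to this directory.
project_config = DataContextConfig(
    store_backend_defaults=FilesystemStoreBackendDefaults(
        root_directory="/tmp/great_expectations"  # placeholder path
    )
)
context = BaseDataContext(project_config=project_config)
```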
@talagluck thanks for getting back to me. I was trying different approaches to get it to work, using an in-memory DataContext vs. a .yml-based config. Am I not instantiating the context in-code with the following line?
I've deleted/commented out the following lines and I still get the IndexError processing the tuple.
In any case, there appears to be an amalgamation of issues that means I will have to run both batch.validate() and checkpoint.run(). Unfortunately, from what I've discovered:
This means that in order for me to both use Data Docs and process the unexpected_index_list from the validation results (to e.g. drop bad records), I have to run BOTH an in-code validation using ... P.S. I have yet to create issues for all of the above.
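(As an illustration of the in-code half of that workflow - a sketch assuming the V2 pandas Dataset API and result_format="COMPLETE", which is what surfaces unexpected_index_list on pandas backends:)

```python
# Validate in code, then drop the offending rows (V2 pandas API).
results = batch.validate(result_format="COMPLETE")

bad_rows = set()
for expectation_result in results.results:
    partial = expectation_result.result or {}
    bad_rows.update(partial.get("unexpected_index_list") or [])

clean_df = df.drop(index=bad_rows)  # keep only records that passed
```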
Hi @abekfenn - apologies, there's a bunch to work through here, so thanks for bearing with us! First - yes, you are right, it appears that you are correctly instantiating a DataContext in code. Is the directory that you are pointing to an initialized Great Expectations project? Are you able to share your Stores config from the great_expectations.yml for that project? For your discoveries above:
2: I was not aware of this issue. We are currently moving toward V3 being the primary API, and so are probably not able to prioritize a fix for this, but please do open an issue if this is a problem as well.
3: There are a few possibilities here (and this might be the case for 2 as well):
I don't fully understand why you need to run things two ways, though I agree that you shouldn't need to do this! Are you saying that you are using both V2 and V3? Could you please unpack what your use case is here? Hopefully some of this will be ironed out by the coming bugfixes, but it would be good to understand the end goal. I'm also a bit confused by your script above - are you adding the expectations and validating them every time? Do you also have an Expectation Suite that you are persisting and using?
Apologies as well - we've had a few different issues pertaining to ...
Apologies that this issue/ticket has ballooned in scope, but thank you for helping me work through it. I hope to see #3736 in the upcoming release and that it will solve these issues with respect to unexpected_index_list being None. Between that and #3634, I think my issue could be solved, but I have yet to be able to test this and have to get GE released internally.
Answering your responses
With respect to 1) and 2): this makes sense. Thank you for your response. With respect to 3):
My Script
To answer this:
My use case is this:
I am hoping to leverage GE to serve as both validation and automated data quality checks of my data in an ETL pipeline.
For this reason, I cannot just run ... I am probably not following the intended implementation of GE, but have struggled to find documentation that covers my use case of running validations in an ETL pipeline to trigger actions on my data. Thanks so much for your detailed responses, and have a great Thanksgiving if you are celebrating!
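(For context, running expectations in code against an in-memory dataframe under V3 might look roughly like the following - a sketch that reuses the same placeholder context, batch_request, and suite name as the earlier checkpoint sketch:)

```python
# Continuing with the placeholder context / batch_request defined above.
validator = context.get_validator(
    batch_request=batch_request,
    expectation_suite_name="my_suite",
)
validator.expect_column_values_to_not_be_null("id")
results = validator.validate(result_format="COMPLETE")
print(results.success)
```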
Hi @abekfenn - thanks for this. Regarding your script, thank you, this clears things up. I agree that evaluation parameters may simplify things quite a bit. In general, we think about the ExpectationSuite as the source of truth, so that we clearly lay out Expectations about our data and then validate those Expectations - adding them anew each time to the suite is somewhat counter to this, though I understand your use case. You may also want to take a look at our Rule-Based Profilers (though they are still somewhat experimental). You are correct that you would be running validations three times when adding the expectation, calling ... To confirm:
Did you mean to say that ...? Regarding #3736, I was mistaken - this is a backend issue, as currently ... Thanks also for explaining your use case further. We also hope to add support for returning an ID or PK for unexpected rows in the not-too-distant future, though I don't have a specific date for this, and I suspect that this would help as well.
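(On the evaluation-parameters suggestion above: a sketch of supplying a value at validation time instead of baking it into the suite. The parameter name "expected_row_count" is made up for illustration.)

```python
# Define the expectation once, with a $PARAMETER placeholder instead of a literal.
validator.expect_table_row_count_to_equal(
    value={"$PARAMETER": "expected_row_count"}
)
validator.save_expectation_suite(discard_failed_expectations=False)

# Supply the concrete value on each pipeline run.
results = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    evaluation_parameters={"expected_row_count": len(df)},
)
```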
And thank you! Hope you are having a great Thanksgiving weekend.
Thanks again for your response @talagluck. I was thinking of the code that runs my expectations (...). Yes, you are right, I made a typo - running ... I believe I am looking at the ValidationResults within ... using the code above/below, but please correct me if I'm wrong. FWIW, I DO see the unexpected_index_list in the results from within here after upgrading to GE 0.13.44, which contains the fix related to #3736. However, this wasn't the case in 0.13.43 - thank you for this fix! The type of ... To be clear, I am leveraging GE for in-memory dataframes using pandas at the moment, so this addresses my use case. Nevertheless, we do plan on moving to in-database data storage and maintenance in the relatively near future, so I do look forward to this getting added with #3195. Really appreciate all of your support - I'm very close to being able to release GE to production, and the support of your team has been instrumental in that. Can't wait to see GE revolutionize our data quality!
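(For completeness, pulling unexpected_index_list out of a V3 CheckpointResult looks roughly like this - a sketch based on the run_results mapping; the checkpoint name is a placeholder:)

```python
checkpoint_result = context.run_checkpoint(
    checkpoint_name="my_checkpoint",
    runtime_configuration={"result_format": "COMPLETE"},  # needed for unexpected_index_list
)

bad_rows = set()
for identifier, outcome in checkpoint_result.run_results.items():
    suite_result = outcome["validation_result"]
    for expectation_result in suite_result.results:
        partial = expectation_result.result or {}
        bad_rows.update(partial.get("unexpected_index_list") or [])
```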
Thanks so much for the follow-up, @abekfenn! Wishing you much luck as you move your implementation forward, and please don't hesitate to reach out with further questions or suggestions.
Hi @talagluck, let me know if you'd prefer I open another issue, but I have one more related question. Everything is working well for me, but when I try to set up an S3 store for Data Docs, the docs do not get built and sent to S3. I have a feeling this is probably something to do with my code. I am trying to set up S3-hosted Data Docs. I can manually upload these files, but I would like index.html to serve as the home page for each data asset. I am using the below yaml.
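(The YAML itself isn't shown above; for reference, a typical S3-backed Data Docs site in great_expectations.yml looks something like this - the bucket and prefix are placeholders:)

```yaml
data_docs_sites:
  s3_site:
    class_name: SiteBuilder
    store_backend:
      class_name: TupleS3StoreBackend
      bucket: my-data-docs-bucket   # placeholder
      prefix: data_docs             # optional key prefix
    site_index_builder:
      class_name: DefaultSiteIndexBuilder
```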
Quick follow-up question: I've noticed that when I do a lot of development locally, the metrics calculations start to take REALLY long (particularly the last step in metrics calculation). Is this due to a lot of historical results, or is there anything you might recommend to prevent this?
Hmm - I'm not sure exactly, but it depends on how you are building docs. I believe ... I'd be curious whether the site is being built (i.e., the files are there on S3) but the index is not populating it, and also what will happen if you remove your local_site from the config. As far as the metrics, this feels strange to me. It's a known issue that Data Docs takes longer to build over time, as the number of Validations increases, but I don't believe that Metrics should take any longer to run (this process is relatively lean, as it relies on the backend to run, and so shouldn't be much more expensive than getting the metrics directly with ...).
For posterity, figured it out: I had to add a call to update_data_docs by specifying an argument to the action_list param of checkpoint.run(). See:
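(The snippet referenced above isn't shown; the fix described would look roughly like this - a sketch of passing an action_list override to checkpoint.run():)

```python
results = checkpoint.run(
    action_list=[
        {
            "name": "store_validation_result",
            "action": {"class_name": "StoreValidationResultAction"},
        },
        {
            "name": "update_data_docs",
            "action": {"class_name": "UpdateDataDocsAction"},  # pushes docs to the S3 site
        },
    ],
)
```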
Great - thanks for the update, @abekfenn!
Describe the bug
This is probably not a bug and is user error, but I didn't see a suitable template. I am trying to run expectations via code (not the CLI) as part of my ETL pipeline, in order to validate data before it goes to production. I want to save the expectation results JSON, upload it to S3, and set up S3-hosted Data Docs to pull from those results and the expectation suite.
To Reproduce
Steps to reproduce the behavior:
Expected behavior
Data docs contains the results of my expectations.
Environment (please complete the following information):
Additional context
If there is a better way to do this that, say, better leverages existing great_expectations features, please do point me in that direction.
Notably, I couldn't make the CLI configuration of my great_expectations.yml work for me, as I need this to run dynamically in a pipeline, uploading to different locations depending on client.