Testing YAML config takes significant amount of time when partitioning large table #4965
Comments
Thanks for raising this, @cbuffett - we will review and be in touch.
Hi @cbuffett - thanks for your patience. I wanted to check, is this still an issue for you? If so, we were planning on figuring out prioritization this week.
@talagluck Yes, this would pose a serious performance impact, particularly around using the validator with batch parameters defined that correspond to a single table partition. The performance impact for testing YAML configurations is secondary as this isn't something that would run inside a production environment.
Hello @cbuffett. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy. To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example). You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar. Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗
Describe the bug
When testing a YAML config that defines a table partition, Great Expectations executes a SELECT DISTINCT <partition columns> FROM <table> query. For a large table (nearly half a trillion rows), that takes a significant amount of time (on the order of hours). This also appears to happen when fetching the validator, even if the batch_request specifies a single partition. It appears this is the result of ConfiguredAssetSqlDataConnector.get_batch_definition_list_from_batch_request() refreshing its data reference cache, which attempts to fetch all partitions.
To Reproduce
Steps to reproduce the behavior:
1. Define a YAML config with a tables section that partitions a very large table on multiple columns (e.g., date, hour, project_id).
2. Run context.test_yaml_config, or attempt to fetch a validator using a batch request that specifies a single partition via batch_filter_parameters.
Expected behavior
Options to limit/disable cache refreshing or otherwise improve performance
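One way to picture the requested behavior is the sketch below. It is not Great Expectations code: it uses Python's built-in sqlite3 module and a toy events table, and the helpers list_all_partitions, partition_exists, and resolve_partitions are hypothetical names introduced only for illustration. It contrasts the full SELECT DISTINCT enumeration described above with a targeted existence check that could be used when the filter parameters already name every partition column.

```python
import sqlite3


def list_all_partitions(conn, table, partition_columns):
    """Enumerate every partition (the SELECT DISTINCT pattern from the issue)."""
    cols = ", ".join(partition_columns)
    # Identifiers are interpolated directly; this assumes they come from trusted config.
    return conn.execute(f"SELECT DISTINCT {cols} FROM {table}").fetchall()


def partition_exists(conn, table, partition_filter):
    """Check one fully specified partition without enumerating the others."""
    where = " AND ".join(f"{col} = ?" for col in partition_filter)
    row = conn.execute(
        f"SELECT 1 FROM {table} WHERE {where} LIMIT 1",
        tuple(partition_filter.values()),
    ).fetchone()
    return row is not None


def resolve_partitions(conn, table, partition_columns, batch_filter_parameters=None):
    """Hypothetical shortcut: skip the full scan when the filter pins every column."""
    params = batch_filter_parameters or {}
    if set(params) == set(partition_columns):
        key = tuple(params[col] for col in partition_columns)
        return [key] if partition_exists(conn, table, params) else []
    # Otherwise fall back to enumerating all partitions, as happens today.
    return list_all_partitions(conn, table, partition_columns)


# Toy data standing in for the very large partitioned table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (date TEXT, hour INTEGER, project_id INTEGER, value REAL)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?, ?, ?)",
    [
        ("2022-04-01", 0, 1, 1.0),
        ("2022-04-01", 0, 1, 2.0),
        ("2022-04-01", 1, 2, 3.0),
        ("2022-04-02", 0, 1, 4.0),
    ],
)

PARTITION_COLUMNS = ["date", "hour", "project_id"]
single = resolve_partitions(
    conn, "events", PARTITION_COLUMNS,
    {"date": "2022-04-01", "hour": 1, "project_id": 2},
)
everything = resolve_partitions(conn, "events", PARTITION_COLUMNS)
```

On a table partitioned or indexed on these columns, the targeted query can typically stop at the first matching row, whereas the DISTINCT query has to consider every row. Whether the data connector's cache could be bypassed this way is an assumption, not something the issue confirms.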
Environment (please complete the following information):