Testing YAML config takes significant amount of time when partitioning large table #4965

Closed
cbuffett opened this issue Apr 26, 2022 · 4 comments
Labels: community, triage (Used by the GE core team to flag issues that were not yet triaged)

Comments

@cbuffett

Describe the bug
When testing a YAML config that defines a table partition, Great Expectations executes a SELECT DISTINCT <partition columns> FROM <table> query. For a large table (nearly half a trillion rows), this query takes hours to complete. The same query also appears to run when fetching the validator, even if the batch_request specifies a single partition. This appears to result from ConfiguredAssetSqlDataConnector.get_batch_definition_list_from_batch_request() refreshing its data reference cache, which attempts to fetch all partitions.
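
To make the cost difference concrete, here is a minimal sketch using an in-memory SQLite table as a stand-in for the Snowflake table. The table contents are made up, and the queries only illustrate the shape of the problem; they are not the exact SQL that Great Expectations generates.

```python
import sqlite3

# Toy stand-in for the large partitioned table. Column names mirror the
# issue's example; the rows are invented for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE project ("date" TEXT, "hour" INTEGER, project_id INTEGER, value REAL)'
)
rows = [
    (f"2022-04-{day:02d}", hour, project_id, 1.0)
    for day in range(1, 4)
    for hour in range(24)
    for project_id in range(5)
]
conn.executemany("INSERT INTO project VALUES (?, ?, ?, ?)", rows)

# Shape of the enumeration query described above: it has to scan the
# whole table to find every distinct partition-column combination.
all_partitions = conn.execute(
    'SELECT DISTINCT "date", "hour", project_id FROM project'
).fetchall()

# What a single-partition request actually needs: a filtered lookup that
# a warehouse can typically prune down to a small slice of the table.
single_partition_rows = conn.execute(
    'SELECT COUNT(*) FROM project WHERE "date" = ? AND "hour" = ? AND project_id = ?',
    ("2022-04-02", 13, 3),
).fetchone()[0]

print(len(all_partitions), single_partition_rows)
```

The number of distinct combinations grows with the product of the cardinalities of the partition columns, which is why the enumeration becomes so expensive on a table of this size.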

To Reproduce
Steps to reproduce the behavior:

  1. Define a datasource YAML config with a tables section that partitions a very large table on multiple columns (e.g., date, hour, project_id).
datasource_yaml = f"""
name: {datasource_name}
class_name: SimpleSqlalchemyDatasource
credentials:
  host: {snowflake_host}
  username: {snowflake_user}
  password: {snowflake_password}
  database: {snowflake_db}
  query:
    schema: {snowflake_schema}
    warehouse: {snowflake_warehouse}
    role: {snowflake_role}
  connect_args:
    authenticator: {snowflake_authenticator}
  drivername: snowflake
      
tables:  # Each key in the "tables" section is a table_name (key name "tables" in "SimpleSqlalchemyDatasource" configuration is reserved).
  project:
    partitioners:
      project_by_date_and_hour:
        include_schema_name: true
        schema_name: {snowflake_schema}
        data_asset_name_suffix: _project_id
        splitter_method: _split_on_multi_column_values
        splitter_kwargs:
          column_names: [date, hour, project_id]
"""
  2. Validate the YAML config using context.test_yaml_config, or attempt to fetch a validator using a batch request that specifies a single partition via batch_filter_parameters.
  3. The generated query that populates the data reference cache attempts to fetch all partitions, even when a single partition is specified. This can take a significant amount of time for large tables with high cardinality across the partition columns.
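
For reference, a single-partition batch request of the kind described in the steps above might be assembled as follows. This is a sketch only: the data connector name and data asset name are assumptions derived from the partitioner name, table name, and suffix in the YAML config, and the datasource, schema, and suite names are hypothetical placeholders.

```python
def single_partition_batch_request_kwargs(datasource_name, schema, date, hour, project_id):
    """Build keyword arguments for a batch request that targets one partition."""
    return {
        "datasource_name": datasource_name,
        # Assumption: the data connector is named after the partitioner.
        "data_connector_name": "project_by_date_and_hour",
        # Assumption: asset name is schema + table name + configured suffix.
        "data_asset_name": f"{schema}.project_project_id",
        "data_connector_query": {
            "batch_filter_parameters": {
                "date": date,
                "hour": hour,
                "project_id": project_id,
            }
        },
    }


# Hypothetical datasource and schema names, for illustration only.
kwargs = single_partition_batch_request_kwargs(
    "my_snowflake_datasource", "my_schema", "2022-04-26", 13, 42
)

# With Great Expectations 0.15.x installed, these would typically be used as:
#   from great_expectations.core.batch import BatchRequest
#   validator = context.get_validator(
#       batch_request=BatchRequest(**kwargs),
#       expectation_suite_name="my_suite",  # hypothetical suite name
#   )
print(kwargs["data_connector_query"])
```

Even though every partition column is pinned here, the behavior reported in this issue is that the data reference cache is still refreshed over all partitions before the request is resolved.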

Expected behavior
Options to limit/disable cache refreshing or otherwise improve performance
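
One possible shape for such an option, sketched as a toy class: skip the full partition enumeration whenever the batch filter parameters pin every partition column. ToyPartitionCatalog and all of its members are hypothetical names invented for this illustration; this is not Great Expectations code and not a proposal for its actual internals.

```python
class ToyPartitionCatalog:
    """Hypothetical illustration of lazy cache refreshing."""

    def __init__(self, column_names, list_all_partitions):
        self.column_names = list(column_names)
        # Expensive callable, standing in for the SELECT DISTINCT query.
        self._list_all_partitions = list_all_partitions
        self._cache = None
        self.full_scans = 0

    def _refresh(self):
        self._cache = [dict(p) for p in self._list_all_partitions()]
        self.full_scans += 1

    def resolve(self, batch_filter_parameters):
        # Fast path: every partition column is pinned, so the partition is
        # fully identified without enumerating the table.
        if set(batch_filter_parameters) >= set(self.column_names):
            return [{c: batch_filter_parameters[c] for c in self.column_names}]
        # Slow path: a partial filter needs the full list of partitions.
        if self._cache is None:
            self._refresh()
        return [
            p
            for p in self._cache
            if all(p.get(k) == v for k, v in batch_filter_parameters.items())
        ]


def expensive_listing():
    # Made-up partitions for the demo.
    return [
        {"date": "2022-04-26", "hour": h, "project_id": p}
        for h in range(24)
        for p in range(3)
    ]


catalog = ToyPartitionCatalog(["date", "hour", "project_id"], expensive_listing)
one = catalog.resolve({"date": "2022-04-26", "hour": 13, "project_id": 2})
scans_after_single = catalog.full_scans
many = catalog.resolve({"hour": 13})
print(scans_after_single, catalog.full_scans, len(many))
```

The trade-off in this sketch is that the fast path does not confirm the requested partition exists; a real implementation would need to decide whether to check that with a filtered query or to defer the check until the batch data is loaded.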

Environment (please complete the following information):

  • Operating System: Linux
  • Great Expectations Version: 0.15.2
  • DB: Snowflake
@talagluck
Contributor

Thanks for raising this, @cbuffett - we will review and be in touch.

@talagluck added the community, devrel (This item is being addressed by the Developer Relations Team), and triage labels and removed the devrel label on Apr 29, 2022
@github-actions bot added the stale (Stale issues and PRs) label on Jul 31, 2022
@great-expectations deleted a comment from github-actions bot on Aug 5, 2022
@kyleaton removed the stale label on Aug 5, 2022
@talagluck
Contributor

Hi @cbuffett - thanks for your patience. I wanted to check: is this still an issue for you? If so, we are planning to work out prioritization this week.

@cbuffett
Author

cbuffett commented Aug 9, 2022

@talagluck Yes, this would pose a serious performance impact, particularly around using the validator with batch parameters defined that correspond to a single table partition. The performance impact for testing YAML configurations is secondary as this isn't something that would run inside a production environment.

@molliemarie
Contributor

Hello @cbuffett. With the upcoming launch of Great Expectations Core (GX 1.0), we are closing old issues posted regarding previous versions. Moving forward, we will focus our resources on supporting and improving GX Core (version 1.0 and beyond). If you find that an issue you previously reported still exists in GX Core, we encourage you to resubmit it against the new version. With more resources dedicated to community support, we aim to tackle new issues swiftly. For specific details on what is GX-supported vs community-supported, you can reference our integration and support policy.

To get started on your transition to GX Core, check out the GX Core quickstart (click “Full example code” tab to see a code example).

You can also join our upcoming community meeting on August 28th at 9am PT (noon ET / 4pm UTC) for a comprehensive rundown of everything GX Core, plus Q&A as time permits. Go to https://greatexpectations.io/meetup and click “follow calendar” to follow the GX community calendar.

Thank you for being part of the GX community and thank you for submitting this issue. We're excited about this new chapter and look forward to your feedback on GX Core. 🤗

@molliemarie closed this as not planned on Aug 22, 2024

4 participants