Enable zstd in tests #8288

Closed
wants to merge 9 commits into from

arpad-m
Member

@arpad-m arpad-m commented Jul 5, 2024

Enables zstd compression in the testsuite. Builds on #8281.

Part of #5431


github-actions bot commented Jul 5, 2024

No tests were run or test report is not available

Test coverage report is not available

The comment gets automatically updated with the latest test results
4d8c7a3 at 2024-07-12T02:23:25.217Z :recycle:

@jcsp
Collaborator

jcsp commented Jul 9, 2024

I'm really curious about why this all passes without #8302 -- maybe we have a configuration issue where tests are running without vectored reads. Let's figure this out before proceeding.

@arpad-m
Member Author

arpad-m commented Jul 9, 2024

> I'm really curious about why this all passes without #8302

Yeah, it's really strange. To shed more light on the issue, I've:

  • filed #8324 ("Make vectored read_blobs function not fill buffer correctly") to try out completely broken vectored reads (the reads have the correct sizes, but the data is all garbage)
  • added a commit to this PR that breaks zstd decompression in non-vectored reads, to see how many of the current tests depend on compressed data with pre-vectored reads

@VladLazar
Contributor

> maybe we have a configuration issue where tests are running without vectored reads. Let's figure this out before proceeding.

CI tests and benchmarks use vectored reads by default. We even print out the read path config at startup (example from a random recent CI run):

2024-07-04T18:02:19.009463Z  INFO starting with get page implementation conf.get_impl=Vectored
2024-07-04T18:02:19.009471Z  INFO starting with vectored get page implementation conf.get_vectored_impl=Vectored
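
For anyone who wants to check this across many CI logs, here is a minimal sketch that pulls the `conf.*` settings out of startup lines like the two above. It assumes the log format shown here; the helper name `read_path_config` is hypothetical and not part of the test suite.

```python
import re

# Matches key=value pairs such as "conf.get_impl=Vectored".
CONF_RE = re.compile(r"\b(conf\.\w+)=(\w+)")


def read_path_config(log_text):
    """Collect the conf.* settings printed at startup (hypothetical helper)."""
    return dict(CONF_RE.findall(log_text))


# The two startup lines quoted in this comment.
LOG = """\
2024-07-04T18:02:19.009463Z  INFO starting with get page implementation conf.get_impl=Vectored
2024-07-04T18:02:19.009471Z  INFO starting with vectored get page implementation conf.get_vectored_impl=Vectored
"""

config = read_path_config(LOG)
for key, value in sorted(config.items()):
    print(f"{key} = {value}")
```

A test run that silently fell back to a non-vectored read path would show a different value for one of these keys.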

@arpad-m
Member Author

arpad-m commented Jul 10, 2024

After artificially introducing corruption in this PR when reading compressed full-page images (setting every returned byte to the same value), I don't see any pytest failures. On the other hand, when vectored reads are made to return zeroed-out data (#8324), there are a lot of failures.

In conclusion, we didn't notice the missing vectored read implementation not because vectored reads are untested, but because no test causes us to compress data. This might be due to a gap in test coverage, where no image layer contains well-compressible full-page images, or it might be due to the method I chose in this PR to enable zstd: maybe it simply doesn't work?
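
To make the "well-compressible" point concrete, here is a rough sketch comparing how much a page of zeros and a page of random bytes shrink. It uses Python's standard-library `zlib` as a stand-in for zstd, which is a swapped-in codec and not what the pageserver uses. The 8 KiB page size is the Postgres default, and `compression_ratio` is a hypothetical helper.

```python
import os
import zlib

PAGE_SIZE = 8192  # Postgres default page size


def compression_ratio(page):
    """Compressed size divided by original size; below 1.0 means the page shrank."""
    # zlib is only a stand-in here; the PR itself is about zstd.
    return len(zlib.compress(page)) / len(page)


zero_page = bytes(PAGE_SIZE)         # trivially compressible
random_page = os.urandom(PAGE_SIZE)  # effectively incompressible

zero_ratio = compression_ratio(zero_page)
random_ratio = compression_ratio(random_page)

print(f"zero page ratio:   {zero_ratio:.3f}")
print(f"random page ratio: {random_ratio:.3f}")
```

If test workloads mostly produced pages that look like the second case, a writer that only keeps the compressed form when it is smaller would rarely store anything compressed.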

arpad-m added a commit that referenced this pull request Jul 11, 2024
We need to pass on the configured compression param during image layer
generation.

This was an oversight of #8106, and the likely cause why #8288 didn't
bring any interesting regressions.

Part of #5431
@arpad-m
Member Author

arpad-m commented Jul 12, 2024

Found the culprit: put_image was calling write_blob instead of write_blob_maybe_compressed, so image layers were never written compressed. I've filed #8363 to address this oversight.
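
A toy model of why the injected corruption went unnoticed, as a sketch only. The names `put_image`, `write_blob` and `write_blob_maybe_compressed` come from this thread, but the Python shapes below are illustrative and are not the real Rust signatures. The `RAW`/`COMPRESSED` tags and the "keep the compressed form only if it is smaller" rule are assumptions, and `zlib` stands in for zstd.

```python
import zlib

RAW, COMPRESSED = 0, 1  # hypothetical tags, not the real on-disk format


def write_blob(out, data):
    """Store the blob as-is."""
    out.append((RAW, data))


def write_blob_maybe_compressed(out, data):
    """Assumed behaviour: keep the compressed form only when it is smaller."""
    packed = zlib.compress(data)  # zlib stands in for zstd
    if len(packed) < len(data):
        out.append((COMPRESSED, packed))
    else:
        out.append((RAW, data))


def read_blob(entry, corrupt_decompression=False):
    tag, payload = entry
    if tag == RAW:
        return payload
    data = zlib.decompress(payload)
    if corrupt_decompression:
        # Fault injection: every returned byte is the same value.
        return b"\xaa" * len(data)
    return data


def put_image_buggy(out, img):
    write_blob(out, img)  # the oversight: compression is never attempted


def put_image_fixed(out, img):
    write_blob_maybe_compressed(out, img)


page = b"neon" * 2048  # 8 KiB of highly compressible data

buggy, fixed = [], []
put_image_buggy(buggy, page)
put_image_fixed(fixed, page)

# With the buggy writer the corrupted decompression path is never reached.
buggy_ok = read_blob(buggy[0], corrupt_decompression=True) == page
fixed_ok = read_blob(fixed[0], corrupt_decompression=True) == page
print(buggy_ok, fixed_ok)
```

In this model the buggy writer never produces a compressed blob, so breaking decompression changes nothing, which matches the "no pytest failures" observation earlier in the thread.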

@arpad-m
Member Author

arpad-m commented Jul 12, 2024

This is the newest CI run. Sadly, no Allure report is available because of timeouts.

@arpad-m
Member Author

arpad-m commented Jul 12, 2024

Manually extracted list, 76 failures:

 FAILED test_runner/regress/test_ancestor_branch.py::test_ancestor_branch[debug-pg14]
 FAILED test_runner/regress/test_branch_and_gc.py::test_branch_creation_before_gc[debug-pg14]
 FAILED test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg14-cascade-1-10]
 FAILED test_runner/regress/test_branching.py::test_branching_with_pgbench[debug-pg14-flat-1-10]
 FAILED test_runner/regress/test_compaction.py::test_uploads_and_deletions[debug-pg14-legacy]
 FAILED test_runner/regress/test_compaction.py::test_sharding_compaction[debug-pg14-None-None]
 FAILED test_runner/regress/test_disk_usage_eviction.py::test_fast_growing_tenant[debug-pg14-absolute]
 FAILED test_runner/regress/test_compaction.py::test_sharding_compaction[debug-pg14-4-16]
 FAILED test_runner/regress/test_compaction.py::test_uploads_and_deletions[debug-pg14-tiered]
 FAILED test_runner/regress/test_compaction.py::test_sharding_compaction[debug-pg14-4-32768]
 FAILED test_runner/regress/test_disk_usage_eviction.py::test_fast_growing_tenant[debug-pg14-relative_equal]
 FAILED test_runner/regress/test_disk_usage_eviction.py::test_fast_growing_tenant[debug-pg14-relative_spare]
 FAILED test_runner/regress/test_import.py::test_import_from_pageserver_small[debug-pg14]
 FAILED test_runner/regress/test_compatibility.py::test_forward_compatibility[debug-pg14]@compatibility
 FAILED test_runner/regress/test_layer_eviction.py::test_gc_of_remote_layers[debug-pg14]
 FAILED test_runner/regress/test_import.py::test_import_from_vanilla[debug-pg14]
 FAILED test_runner/regress/test_lsn_mapping.py::test_lsn_mapping[debug-pg14-True]
 FAILED test_runner/regress/test_next_xid.py::test_import_at_2bil[debug-pg14]
 FAILED test_runner/regress/test_ondemand_download.py::test_ondemand_download_large_rel[debug-pg14]
 FAILED test_runner/regress/test_pageserver_generations.py::test_generations_upgrade[debug-pg14]
 FAILED test_runner/regress/test_pageserver_generations.py::test_deletion_queue_recovery[debug-pg14-validate-keep]
 FAILED test_runner/regress/test_pageserver_generations.py::test_deferred_deletion[debug-pg14]
 FAILED test_runner/regress/test_ondemand_download.py::test_download_remote_layers_api[debug-pg14]
 FAILED test_runner/regress/test_lsn_mapping.py::test_lsn_mapping[debug-pg14-False]
 FAILED test_runner/regress/test_pageserver_generations.py::test_deletion_queue_recovery[debug-pg14-validate-lose]
 FAILED test_runner/regress/test_pageserver_generations.py::test_multi_attach[debug-pg14]
 FAILED test_runner/regress/test_ondemand_download.py::test_ondemand_download_timetravel[debug-pg14]
 FAILED test_runner/regress/test_pageserver_secondary.py::test_location_conf_churn[debug-pg14-2]
 FAILED test_runner/regress/test_pageserver_secondary.py::test_location_conf_churn[debug-pg14-1]
 FAILED test_runner/regress/test_pageserver_generations.py::test_deletion_queue_recovery[debug-pg14-no-validate-keep]
 FAILED test_runner/regress/test_pageserver_generations.py::test_emergency_mode[debug-pg14]
 FAILED test_runner/regress/test_pageserver_generations.py::test_deletion_queue_recovery[debug-pg14-no-validate-lose]
 FAILED test_runner/regress/test_pageserver_generations.py::test_upgrade_generationless_local_file_paths[debug-pg14]
 FAILED test_runner/regress/test_pageserver_generations.py::test_eviction_across_generations[debug-pg14]
 FAILED test_runner/regress/test_pageserver_crash_consistency.py::test_local_only_layers_after_crash[debug-pg14]
 FAILED test_runner/regress/test_pg_regress.py::test_sql_regress[debug-pg14-None]
 FAILED test_runner/regress/test_pageserver_secondary.py::test_live_migration[debug-pg14]
 FAILED test_runner/regress/test_next_xid.py::test_multixid_wraparound_import[debug-pg14]
 FAILED test_runner/regress/test_pageserver_secondary.py::test_secondary_downloads[debug-pg14]
 FAILED test_runner/regress/test_remote_storage.py::test_remote_storage_upload_queue_retries[debug-pg14]
 FAILED test_runner/regress/test_pg_regress.py::test_isolation[debug-pg14-None]
 FAILED test_runner/regress/test_pg_regress.py::test_pg_regress[debug-pg14-None]
 FAILED test_runner/regress/test_recovery.py::test_pageserver_recovery[debug-pg14]
 FAILED test_runner/regress/test_pageserver_secondary.py::test_location_conf_churn[debug-pg14-3]
 FAILED test_runner/regress/test_remote_storage.py::test_remote_timeline_client_calls_started_metric[debug-pg14]
 FAILED test_runner/regress/test_sharding.py::test_sharding_ingest_gaps[debug-pg14]
 FAILED test_runner/regress/test_remote_storage.py::test_timeline_deletion_with_files_stuck_in_upload_queue[debug-pg14]
 FAILED test_runner/regress/test_storage_controller.py::test_storage_controller_s3_time_travel_recovery[debug-pg14]
 FAILED test_runner/regress/test_sharding.py::test_sharding_split_compaction[debug-pg14-compact-shard-ancestors-enqueued]
 FAILED test_runner/regress/test_sharding.py::test_sharding_split_compaction[debug-pg14-compact-shard-ancestors-localonly]
 FAILED test_runner/regress/test_sharding.py::test_sharding_split_compaction[debug-pg14-None]
 FAILED test_runner/regress/test_sharding.py::test_sharding_split_compaction[debug-pg14-compact-shard-ancestors-persistent]
 FAILED test_runner/regress/test_tenant_conf.py::test_tenant_config[debug-pg14]
 FAILED test_runner/regress/test_tenants_with_remote_storage.py::test_tenants_many[debug-pg14]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-before-schedule]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-before-index-deleted-at]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-before-rm]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-before-index-delete]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-after-index-delete]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-before-rm]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-after-rm]
 FAILED test_runner/regress/test_timeline_delete.py::test_timeline_delete_resumed_on_attach[debug-pg14]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITHOUT_RESTART-timeline-delete-before-index-deleted-at]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-before-schedule]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-after-rm]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-before-index-delete]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_timeline_exercise_crash_safety_failpoints[debug-pg14-Check.RETRY_WITH_RESTART-timeline-delete-after-index-delete]
 FAILED test_runner/regress/test_timeline_delete.py::test_delete_orphaned_objects[debug-pg14]
 FAILED test_runner/regress/test_timeline_size.py::test_timeline_physical_size_post_compaction[debug-pg14]
 FAILED test_runner/regress/test_timeline_size.py::test_timeline_physical_size_post_gc[debug-pg14]
 FAILED test_runner/regress/test_vm_bits.py::test_vm_bit_clear_on_heap_lock_blackbox[debug-pg14]
 FAILED test_runner/regress/test_truncate.py::test_truncate[debug-pg14]
 FAILED test_runner/regress/test_wal_acceptor.py::test_s3_eviction[debug-pg14-0.0-True]
 FAILED test_runner/regress/test_wal_acceptor.py::test_s3_eviction[debug-pg14-0.0-False]
 FAILED test_runner/regress/test_wal_acceptor.py::test_s3_eviction[debug-pg14-0.2-True]
 FAILED test_runner/regress/test_wal_acceptor.py::test_s3_eviction[debug-pg14-0.2-False]
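
A list like this can be tallied per test file with a short script. The sketch below assumes the `FAILED <path>::<test>[<params>]` line format seen above; `failures_per_file` is a hypothetical helper, and the sample is just three lines copied from the list.

```python
import re
from collections import Counter

# Captures the file path in lines shaped like "FAILED <path>::<test>[<params>]".
FAILED_RE = re.compile(r"^\s*FAILED\s+(\S+?)::", re.MULTILINE)


def failures_per_file(summary):
    """Count FAILED entries per test file (hypothetical helper)."""
    return Counter(FAILED_RE.findall(summary))


# Three entries copied from the list above.
SAMPLE = """\
 FAILED test_runner/regress/test_compaction.py::test_uploads_and_deletions[debug-pg14-legacy]
 FAILED test_runner/regress/test_compaction.py::test_sharding_compaction[debug-pg14-None-None]
 FAILED test_runner/regress/test_truncate.py::test_truncate[debug-pg14]
"""

counts = failures_per_file(SAMPLE)
for path, n in counts.most_common():
    print(f"{n:3d}  {path}")
```

Run against the full list, this shows which test modules account for most of the failures.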

@arpad-m
Member Author

arpad-m commented Jul 12, 2024

Extracted error for test_sharding_compaction:

  stdout:
    Starting existing endpoint ep-workload-17d9-7d14...
    Starting postgres node at 'postgresql://cloud_admin@127.0.0.1:15009/postgres'
    SIGKILL & wait the started process
  stderr:
    command failed: compute startup failed: failed to get basebackup@0/0 from pageserver postgresql://no_user@localhost:15004

    Caused by:
        0: failed to iterate over archive
        1: db error: ERROR: invalid SlruKind::Clog record: block.len()=23
        2: ERROR: invalid SlruKind::Clog record: block.len()=23

@arpad-m
Member Author

arpad-m commented Jul 12, 2024

Closing this; I will open a new branch for further experiments.

@arpad-m arpad-m closed this Jul 12, 2024
@arpad-m arpad-m deleted the arpad/compression_6 branch July 12, 2024 14:05
@arpad-m arpad-m mentioned this pull request Jul 12, 2024
skyzh pushed a commit that referenced this pull request Jul 15, 2024
We need to pass on the configured compression param during image layer
generation.

This was an oversight of #8106, and the likely cause why #8288 didn't
bring any interesting regressions.

Part of #5431
arpad-m added a commit that referenced this pull request Jul 18, 2024
Successor of #8288, just enable zstd in tests. Also adds a test that
creates easily compressible data.

Part of #5431

---------

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>
problame pushed a commit that referenced this pull request Jul 22, 2024
Successor of #8288, just enable zstd in tests. Also adds a test that
creates easily compressible data.

Part of #5431

---------

Co-authored-by: John Spray <john@neon.tech>
Co-authored-by: Joonas Koivunen <joonas@neon.tech>