Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: fix image layer creation check that inhibited compaction #7420

Merged
merged 3 commits into from
Apr 26, 2024

Conversation

VladLazar
Copy link
Contributor

@VladLazar VladLazar commented Apr 18, 2024

Problem

PR #7230 attempted to introduce a WAL ingest threshold for checking whether enough deltas are stacked to warrant creating a new image layer. However, this check was incorrectly performed at the compaction partition level instead of the timeline level. Hence, it inhibited GC for any keys outside of the first partition.

Summary of Changes

Hoist the check up to the timeline level.

Testing and Deployment

I generated the layer map diagram from the output of test_compaction_smoke and checked that it looks as expected.
Before the fix there's no image layers. For deployment, I suggest we go region by region and monitor pageserver_layers_visited_per_read_global. If people have ideas for any more testing, happy to do it.

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

Copy link

github-actions bot commented Apr 18, 2024

2958 tests run: 2837 passed, 0 failed, 121 skipped (full report)


Flaky tests (7)

Postgres 16

Postgres 15

  • test_fully_custom_config: debug
  • test_crafted_wal_end[last_wal_record_xlog_switch]: debug
  • test_multi_attach: debug

Postgres 14

  • test_crafted_wal_end[simple]: debug

Code coverage* (full report)

  • functions: 28.1% (6483 of 23056 functions)
  • lines: 46.9% (46049 of 98152 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
aa2ea71 at 2024-04-26T13:27:03.263Z :recycle:

PR #7230 attempted to introduce a WAL ingest threshold for checking
whether enough deltas are stacked to warrant creating a new image layer.
However, this check was incorrectly performed at the compaction partition
level instead of the timeline level. Hence, it inhibited GC for any keys
outside of the first partition.

Hoist the check up to the timeline level.

We should probably allow compaction to catch up across the fleet before
re-enabling this. In the meantime, I'll test with a real tenant to make
sure we don't inhibit compaction in an unexpected way again.
@VladLazar VladLazar force-pushed the vlad/fix-image-layer-creation-check branch from df46294 to 4460f98 Compare April 24, 2024 15:41
@VladLazar VladLazar marked this pull request as ready for review April 24, 2024 17:08
@VladLazar VladLazar requested a review from a team as a code owner April 24, 2024 17:08
Copy link
Contributor

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent that the test now shows expected behaviour with the config removed.

Do I understand correctly that only the self.last_image_layer_creation_check_at.store was at the wrong place, but everything could be hoisted as that was the original intention? It is a bit frustrating how non-obviously this place is better, partially due to how the code is organized (the per-partition detail method is above or comes before of per-timeline or the caller).

@VladLazar
Copy link
Contributor Author

Do I understand correctly that only the self.last_image_layer_creation_check_at.store was at the wrong place, but everything could be hoisted as that was the original intention?

Yes, that's right

Copy link
Contributor

@problame problame left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm surprised this doesn't break a bunch of tests. Like, the ones where we write a certain amount of data and then check for an image layer being created.

But 🤷, LGTM

As a side note: I don't like we use "threshold" in the config variable name here; because we already have image_creation_threshold, whose unit is "number of delta layers" whereas this variable's unit is "amount of WAL bytes".
But, I guess it's too late for that.

pageserver/src/tenant/timeline.rs Show resolved Hide resolved
pageserver/src/tenant/timeline.rs Outdated Show resolved Hide resolved
@VladLazar
Copy link
Contributor Author

I'm surprised this doesn't break a bunch of tests. Like, the ones where we write a certain amount of data and then check for an image layer being created.

Probably for the same reason why these tests passed in #7230. They don't write a keyspace that's wide enough to get multiple partitions?

@problame
Copy link
Contributor

Good to go from my side.

@VladLazar VladLazar enabled auto-merge (squash) April 25, 2024 17:21
@VladLazar VladLazar disabled auto-merge April 26, 2024 12:40
@VladLazar VladLazar merged commit af43f78 into main Apr 26, 2024
54 checks passed
@VladLazar VladLazar deleted the vlad/fix-image-layer-creation-check branch April 26, 2024 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants