-
Notifications
You must be signed in to change notification settings - Fork 360
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: avoid starving background task permits in eviction task #7471
Conversation
this solves the constraining problem just as well.
returning impl Drop or any other opaque type would stop us from re-acquring the permit.
this might have some adverse effects just as well for tenants having many many timelines, but then again, now the permit will be held for much less.
No longer to be included. |
2772 tests run: 2654 passed, 0 failed, 118 skipped (full report)Code coverage* (full report)
* collected from Rust tests only The comment gets automatically updated with the latest test results
f52f618 at 2024-04-23T06:41:58.151Z :recycle: |
Realized that if all compaction tasks are blocked, we will not have any early flushes due to memory pressure either. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, isn't the root of the issue that the background task permits are tenant-scoped (gc, compaction), and this eviction task mis-uses them as timeline-scoped? And so, high-timeline-count tenants have disproportionate impact on others.
Maybe we should instead rectify that?
I wrote this to be included with this weeks release, and I think it will remove the problem. If you want to rewrite it, go ahead, but I think we must have this fixed for next week because the ephemeral file protection hinges on being able to run compaction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM as a point fix.
Seems like we all agree that eviction task should be later refactored to be one per tenant to make this easier to reason about.
As seen with a recent incident, eviction tasks can cause pageserver-wide permit starvation on the background task semaphore when synthetic size calculation takes a long time for a tenant that has more than our permit number of timelines or multiple tenants that have slow synthetic size and total number of timelines exceeds the permits. Metric links can be found in the internal slack thread.
As a solution, release the permit while waiting for the state guarding the synthetic size calculation. This will most likely hurt the eviction task eviction performance, but that does not matter because we are hoping to get away from it using OnlyImitiate policy anyway and rely solely on disk usage-based eviction.