fix: avoid starving background task permits in eviction task #7471

Merged
merged 7 commits into from
Apr 24, 2024


koivunej
Contributor

@koivunej koivunej commented Apr 23, 2024

As seen in a recent incident, eviction tasks can starve the pageserver-wide background task semaphore of permits when synthetic size calculation takes a long time. Each per-timeline eviction task holds a permit while it waits, so starvation happens when a single tenant has more timelines than there are permits, or when several tenants have slow synthetic size calculations and their combined timeline count exceeds the permits. Metric links can be found in the internal Slack thread.

As a solution, release the permit while waiting for the state guarding the synthetic size calculation. This will most likely hurt the eviction task's performance, but that does not matter, because we hope to move away from it anyway by using the OnlyImitiate policy and relying solely on disk-usage-based eviction.
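A minimal, std-only sketch of the release-while-waiting pattern described above. All names are hypothetical: Semaphore and Permit stand in for the background task semaphore and its permit, eviction_iteration for one timeline's eviction pass, and size_state for the state guarding the synthetic size calculation. It uses blocking threads rather than async tasks and is not the pageserver's actual code; it only illustrates the drop-then-re-acquire ordering and why it prevents starvation.

```rust
use std::sync::{Condvar, Mutex};
use std::thread;

/// Minimal counting semaphore; a hypothetical stand-in for the
/// background task semaphore.
struct Semaphore {
    available: Mutex<usize>,
    cond: Condvar,
}

/// RAII permit: gives its slot back to the semaphore when dropped.
struct Permit<'a> {
    sem: &'a Semaphore,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Semaphore { available: Mutex::new(permits), cond: Condvar::new() }
    }

    /// Blocks until a slot is free, then takes it.
    fn acquire(&self) -> Permit<'_> {
        let mut free = self.available.lock().unwrap();
        while *free == 0 {
            free = self.cond.wait(free).unwrap();
        }
        *free -= 1;
        Permit { sem: self }
    }

    fn available(&self) -> usize {
        *self.available.lock().unwrap()
    }
}

impl Drop for Permit<'_> {
    fn drop(&mut self) {
        *self.sem.available.lock().unwrap() += 1;
        self.sem.cond.notify_one();
    }
}

/// Hypothetical per-timeline eviction step. `size_state` stands in for the
/// tenant-wide state guarding the synthetic size calculation.
fn eviction_iteration<'a>(sem: &'a Semaphore, size_state: &Mutex<u64>) -> Permit<'a> {
    let permit = sem.acquire();
    // ... permit-guarded work for this timeline ...

    // Release the permit *before* blocking on the shared state, so a slow
    // synthetic size calculation cannot pin background task permits.
    drop(permit);
    let _synthetic_size: u64 = *size_state.lock().unwrap();

    // Re-acquire before continuing with permit-guarded work.
    sem.acquire()
}

/// Holds `size_state` (simulating a slow calculation) while `timelines`
/// eviction passes contend for `permits` slots, then checks that an
/// unrelated task can still get a permit.
/// Returns (other_task_got_permit, permits_available_afterwards).
fn demo(permits: usize, timelines: usize) -> (bool, usize) {
    let sem = Semaphore::new(permits);
    let size_state = Mutex::new(0u64);
    let mut other_task_got_permit = false;
    thread::scope(|s| {
        let slow_calculation = size_state.lock().unwrap();
        for _ in 0..timelines {
            s.spawn(|| {
                let _permit = eviction_iteration(&sem, &size_state);
            });
        }
        // E.g. compaction: if every timeline kept its permit while waiting on
        // `size_state`, this would block forever once timelines >= permits.
        drop(sem.acquire());
        other_task_got_permit = true;
        drop(slow_calculation);
    });
    (other_task_got_permit, sem.available())
}
```

With `demo(2, 5)`, five timelines contend for two permits while the synthetic size state is held. Because each timeline drops its permit before waiting, the unrelated task still gets a permit, and both permits are free again once every thread finishes. If eviction_iteration kept `permit` alive across the lock instead, the same call would deadlock.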

This solves the constraining problem just as well. Returning impl Drop or any other opaque type would stop us from re-acquiring the permit.

This might have some adverse effects for tenants with very many timelines, but then again, the permit will now be held for much less time.
@koivunej koivunej requested a review from problame April 23, 2024 05:57
@koivunej koivunej requested a review from a team as a code owner April 23, 2024 05:57
@koivunej koivunej changed the title from Joonas/avoid bg task starvation to fix: avoid starving background task permits in eviction task Apr 23, 2024
@koivunej
Contributor Author

koivunej commented Apr 23, 2024

Hotfix to be included in #7447; please review accordingly. Rationale: we currently have starvations lasting at least 2 hours.

Update: no longer to be included.


2772 tests run: 2654 passed, 0 failed, 118 skipped (full report)


Code coverage* (full report)

  • functions: 28.1% (6464 of 23043 functions)
  • lines: 46.8% (45676 of 97580 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
f52f618 at 2024-04-23T06:41:58.151Z

@koivunej koivunej enabled auto-merge (squash) April 23, 2024 06:56
@koivunej
Contributor Author

Realized that if all compaction tasks are blocked, we will not have any early flushes due to memory pressure either.

Contributor

@problame problame left a comment


Hm, isn't the root of the issue that the background task permits are tenant-scoped (gc, compaction), and this eviction task mis-uses them as timeline-scoped? And so, high-timeline-count tenants have disproportionate impact on others.

Maybe we should instead rectify that?

@koivunej
Contributor Author

> Hm, isn't the root of the issue that the background task permits are tenant-scoped (gc, compaction), and this eviction task mis-uses them as timeline-scoped? And so, high-timeline-count tenants have disproportionate impact on others.
>
> Maybe we should instead rectify that?

I wrote this to be included in this week's release, and I think it will remove the problem. If you want to rewrite it, go ahead, but I think we must have this fixed by next week, because the ephemeral file protection hinges on being able to run compaction.

Contributor

@jcsp jcsp left a comment


LGTM as a point fix.

It seems we all agree that the eviction task should later be refactored to run one per tenant, which would make this easier to reason about.
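To make the tenant-scoped vs. timeline-scoped argument concrete, here is a back-of-the-envelope model, assuming (as before this fix) that an eviction task blocked behind a slow synthetic size calculation keeps its permit. The function name and the permit counts in the example are hypothetical, not the pageserver's real configuration.

```rust
/// Worst case: how many background task permits can be pinned at once when
/// every eviction task is blocked behind a slow synthetic size calculation
/// while holding its permit. `timelines_per_tenant[i]` is tenant i's
/// timeline count; `per_tenant` selects one eviction task per tenant
/// instead of one per timeline.
fn pinned_permits(timelines_per_tenant: &[usize], permits: usize, per_tenant: bool) -> usize {
    let holders: usize = if per_tenant {
        // One eviction task (so at most one permit) per tenant with timelines.
        timelines_per_tenant.iter().filter(|&&n| n > 0).count()
    } else {
        // One eviction task per timeline, each holding a permit while blocked.
        timelines_per_tenant.iter().sum()
    };
    // Cannot pin more permits than the semaphore hands out.
    holders.min(permits)
}
```

For example, with a hypothetical pool of 4 permits, a single tenant with 10 timelines pins all 4 permits under per-timeline tasks but only 1 under per-tenant tasks. Per-tenant tasks cap each tenant's impact at one permit, the same granularity as gc and compaction, although enough slow tenants could still exhaust the pool.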

@koivunej koivunej merged commit a60035b into main Apr 24, 2024
53 checks passed
@koivunej koivunej deleted the joonas/avoid_bg_task_starvation branch April 24, 2024 08:39

3 participants