dube: timeout individual layer evictions, log progress and record metrics #6131

Merged
koivunej merged 21 commits into main from remove_eviction_batching_cont on Feb 29, 2024

Conversation

koivunej
Member

@koivunej koivunej commented Dec 13, 2023

Because of bugs, evictions could hang and stall the disk usage eviction task. One such bug is known and fixed in #6928. Guard each layer eviction with a modest timeout, treating timed-out evictions as failures to be conservative.

In addition, add logging and metrics recording on each eviction iteration:

  • log collection completion with duration and number of layers
    • per tenant collection time is observed in a new histogram
    • per tenant layer count is observed in a new histogram
  • record metric for collected, selected and evicted layer counts
  • log if eviction takes more than 10s
  • log eviction completion with eviction duration
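
As a rough illustration of the behaviour described above, here is a minimal sketch, not the code from this PR. It uses only the standard library: each eviction runs on its own thread and the caller waits on a channel with `recv_timeout`, whereas the pageserver itself is async. All names (`EvictionJob`, `EvictionCounts`, `evict_all`, `slow_after`) are hypothetical, and evictions run one at a time here for simplicity.

```rust
use std::sync::mpsc::{self, RecvTimeoutError};
use std::thread;
use std::time::{Duration, Instant};

/// Hypothetical stand-in for "evict one layer".
pub type EvictionJob = Box<dyn FnOnce() -> Result<(), String> + Send>;

/// Hypothetical counters for one eviction iteration.
/// `timed_out` is a subset of `failed`: a timed-out eviction counts as a failure.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct EvictionCounts {
    pub selected: usize,
    pub evicted: usize,
    pub failed: usize,
    pub timed_out: usize,
}

/// Runs every job, waiting at most `per_layer_timeout` for each one, and
/// logs once if the whole iteration takes longer than `slow_after`.
pub fn evict_all(
    jobs: Vec<EvictionJob>,
    per_layer_timeout: Duration,
    slow_after: Duration,
) -> EvictionCounts {
    let started = Instant::now();
    let mut counts = EvictionCounts {
        selected: jobs.len(),
        ..Default::default()
    };
    let mut warned_slow = false;

    for job in jobs {
        let (tx, rx) = mpsc::channel();
        thread::spawn(move || {
            // The receiver may already be gone after a timeout; ignore that.
            let _ = tx.send(job());
        });

        match rx.recv_timeout(per_layer_timeout) {
            Ok(Ok(())) => counts.evicted += 1,
            Ok(Err(_)) => counts.failed += 1,
            Err(RecvTimeoutError::Timeout) => {
                // Conservative choice: we do not know the outcome, so call it a failure.
                counts.failed += 1;
                counts.timed_out += 1;
            }
            // The job panicked before sending a result.
            Err(RecvTimeoutError::Disconnected) => counts.failed += 1,
        }

        if !warned_slow && started.elapsed() > slow_after {
            eprintln!("eviction still running after {:?}", started.elapsed());
            warned_slow = true;
        }
    }

    eprintln!("eviction completed in {:?}: {:?}", started.elapsed(), counts);
    counts
}
```

In this sketch a timed-out job keeps running on its thread. Only the wait is bounded, so a single stuck eviction cannot stall the loop.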

Additionally, remove dead code for which no dead-code warnings appeared in the earlier PR.

Follow-up to: #6060.

@koivunej koivunej requested a review from a team as a code owner December 13, 2023 16:27
@koivunej koivunej requested review from shanyp and removed request for a team December 13, 2023 16:27

github-actions bot commented Dec 13, 2023

2448 tests run: 2326 passed, 0 failed, 122 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_pageserver_chaos[None]: release

Code coverage* (full report)

  • functions: 28.8% (6933 of 24085 functions)
  • lines: 47.4% (42570 of 89824 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
d97f1e1 at 2024-02-29T20:29:58.767Z :recycle:

@koivunej
Member Author

Should do the same for eviction_task.

Contributor

@problame problame left a comment

While convincing myself of the cancel-safety of evict_layers, I noticed that it leaks the JoinSet js on cancellation.
Well, it does join_handle.abort() on each task.

So, I think secondary_tenant.evict_layer() has probably not been audited for cancellation safety.
I don't know the invariants, but it doesn't look cancel-safe (it could be cancelled by the runtime when returning Poll::Ready from remove_file). It's a pre-existing issue though, and secondaries aren't prod-enabled yet, so let's punt that to a new issue. At a minimum, the SecondaryTenant::evict_layer function needs a comment stating that it promises to be cancel-safe.

pageserver/src/disk_usage_eviction_task.rs (outdated, resolved)
pageserver/src/disk_usage_eviction_task.rs (outdated, resolved)
@koivunej
Member Author

So, I think secondary_tenant.evict_layer() has probably not been audited for cancellation safety.

Excellent point. This PR was written before secondary tenants, or at least their eviction, existed, and I did not revisit it while fixing conflicts.

@koivunej koivunej force-pushed the remove_eviction_batching_cont branch from 106fc0a to b7b106f Compare February 28, 2024 01:57
@koivunej koivunej changed the title disk_usage_eviction_task: timeout individual layer evictions, log progress dube: timeout individual layer evictions, log progress and record metrics Feb 28, 2024
Contributor

@problame problame left a comment

Love the smoke test, great work!

I like the added metrics, not just their names.

Not sure why you went for a timeout parameter instead of making callers who care use tokio::time::timeout, but 🤷

pageserver/src/tenant/storage_layer/layer/tests.rs (outdated, resolved)
pageserver/src/tenant/timeline/eviction_task.rs (outdated, resolved)
pageserver/src/tenant/secondary.rs (resolved)
pageserver/src/tenant/timeline.rs (resolved)
pageserver/src/metrics.rs (outdated, resolved)
pageserver/src/disk_usage_eviction_task.rs (resolved)
@koivunej
Member Author

Not sure why you went for a timeout parameter instead of making callers who care use tokio::time::timeout, but 🤷

I was thinking that the mandatory nature of the parameter makes sense here. Waiting is a nice-to-have; it's not required except for our tests. The timeout does not cancel the eviction.
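
To illustrate the distinction drawn here, the following is a minimal sketch under stated assumptions, not the API from this PR. The timeout bounds only how long the caller waits, and the eviction itself keeps running to completion. A plain thread stands in for the async task, and the names `Eviction`, `start_eviction`, `wait` and `join` are hypothetical.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{mpsc, Arc};
use std::thread;
use std::time::Duration;

/// Hypothetical handle to an eviction running in the background.
pub struct Eviction {
    done: Arc<AtomicBool>,
    completion: mpsc::Receiver<()>,
    worker: thread::JoinHandle<()>,
}

/// Starts a simulated eviction that takes `work_duration` to finish.
pub fn start_eviction(work_duration: Duration) -> Eviction {
    let done = Arc::new(AtomicBool::new(false));
    let done_in_worker = Arc::clone(&done);
    let (tx, rx) = mpsc::channel();

    let worker = thread::spawn(move || {
        thread::sleep(work_duration); // stand-in for the real eviction work
        done_in_worker.store(true, Ordering::SeqCst);
        let _ = tx.send(()); // nobody may be waiting any more; that is fine
    });

    Eviction { done, completion: rx, worker }
}

impl Eviction {
    /// Waits up to `timeout` for completion and reports whether the eviction
    /// has finished. Giving up on the wait does not stop the eviction.
    pub fn wait(&self, timeout: Duration) -> bool {
        let _ = self.completion.recv_timeout(timeout);
        self.done.load(Ordering::SeqCst)
    }

    /// Blocks until the eviction thread exits and reports whether it finished.
    pub fn join(self) -> bool {
        self.worker.join().expect("eviction thread panicked");
        self.done.load(Ordering::SeqCst)
    }
}
```

A caller that does not care about the outcome can pass a short timeout and move on, while a test can pass a generous one to observe completion.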

@koivunej koivunej enabled auto-merge (squash) February 29, 2024 11:35
@koivunej koivunej force-pushed the remove_eviction_batching_cont branch from 2be668d to 4e6ad8e Compare February 29, 2024 12:32
@koivunej
Member Author

Rebase needed for build-image tools.

@koivunej koivunej force-pushed the remove_eviction_batching_cont branch from 4e6ad8e to 583eb5e Compare February 29, 2024 14:05
@koivunej
Member Author

Rebase needed (?) for test_encode #6963.

@koivunej koivunej force-pushed the remove_eviction_batching_cont branch from 583eb5e to d97f1e1 Compare February 29, 2024 19:22
@koivunej koivunej enabled auto-merge (squash) February 29, 2024 19:23
@koivunej koivunej merged commit ee93700 into main Feb 29, 2024
54 checks passed
@koivunej koivunej deleted the remove_eviction_batching_cont branch February 29, 2024 20:54
Development

Successfully merging this pull request may close these issues.

layer: unimplemented support for evicting wanted deleted layers
2 participants