Skip to content

Commit

Permalink
fix(layer): remove the need to repair internal state (#7030)
Browse files Browse the repository at this point in the history
## Problem

The current implementation of struct Layer supports canceled read
requests, but those will leave the internal state such that a following
`Layer::keep_resident` call will need to repair the state. In
pathological cases seen during generation numbers resetting in staging
or with too many in-progress on-demand downloads, this repair activity
will need to wait for the download to complete, which stalls disk
usage-based eviction. Similar stalls have been observed in staging near
disk-full situations, where downloads failed because the disk was full.

Fixes #6028 or the "layer is present on filesystem but not evictable"
problems by:
1. not canceling pending evictions by a canceled
`LayerInner::get_or_maybe_download`
2. completing post-download initialization of the `LayerInner::inner`
from the download task

Not canceling evictions above case (1) and always initializing (2) lead
to plain `LayerInner::inner` always having the up-to-date information,
which leads to the old `Layer::keep_resident` never having to wait for
downloads to complete. Finally, the `Layer::keep_resident` is replaced
with `Layer::is_likely_resident`. These fix #7145.

## Summary of changes

- add a new test showing that a canceled get_or_maybe_download should
not cancel the eviction
- switch to using a `watch` internally rather than a `broadcast` to
avoid hanging eviction while a download is ongoing
- doc changes for new semantics and cleanup
- fix `Layer::keep_resident` to use just `self.0.inner.get()` as truth
as `Layer::is_likely_resident`
- remove `LayerInner::wanted_evicted` boolean as no longer needed

Builds upon: #7185. Cc: #5331.
  • Loading branch information
koivunej committed Mar 21, 2024
1 parent a95c41f commit 2206e14
Show file tree
Hide file tree
Showing 7 changed files with 1,085 additions and 328 deletions.
7 changes: 3 additions & 4 deletions pageserver/src/tenant/mgr.rs
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,6 @@
//! page server.

use camino::{Utf8DirEntry, Utf8Path, Utf8PathBuf};
use futures::stream::StreamExt;
use itertools::Itertools;
use pageserver_api::key::Key;
use pageserver_api::models::ShardParameters;
Expand Down Expand Up @@ -1662,9 +1661,9 @@ impl TenantManager {
.layers
.read()
.await
.resident_layers()
.collect::<Vec<_>>()
.await;
.likely_resident_layers()
.collect::<Vec<_>>();

for layer in timeline_layers {
let relative_path = layer
.local_path()
Expand Down
Loading

1 comment on commit 2206e14

@github-actions
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2786 tests run: 2641 passed, 2 failed, 143 skipped (full report)


Failures on Postgres 14

  • test_heavy_write_workload[neon_off-github-actions-selfhosted-10-5-5]: release
  • test_bulk_insert[neon-github-actions-selfhosted]: release
# Run all failed tests locally:
scripts/pytest -vv -n $(nproc) -k "test_heavy_write_workload[neon_off-release-pg14-github-actions-selfhosted-10-5-5] or test_bulk_insert[neon-release-pg14-github-actions-selfhosted]"

Code coverage* (full report)

  • functions: 28.4% (7181 of 25277 functions)
  • lines: 47.1% (44155 of 93761 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
2206e14 at 2024-03-21T02:28:03.374Z :recycle:

Please sign in to comment.