Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pageserver: don't store tenant/timeline ID on each layer, synchronize with Timeline shutdown #5967

Closed
wants to merge 7 commits into from

Conversation

jcsp
Copy link
Contributor

@jcsp jcsp commented Nov 28, 2023

Problem

  • Layers are implicitly scoped to a Timeline: it is wasteful to store the timeline+tenant ID on each one
  • Layers store their full local path, but only need it occasionally & can construct it on-demand via Timeline
  • Layer::drop does I/O to a Timeline's local path, but this can happen after the Timeline is shut down and destroyed.

Summary of changes

Checklist before requesting a review

  • I have performed a self-review of my code.
  • If it is a core feature, I have added thorough tests.
  • Do we need to implement analytics? if so did you add the relevant metrics to the dashboard?
  • If this PR requires public announcement, mark it with /release-notes label and add several sentences in this section.

Checklist before merging

  • Do not forget to reformat commit message to not include the above checklist

@jcsp jcsp added c/storage/pageserver Component: storage: pageserver a/tech_debt Area: related to tech debt labels Nov 28, 2023
Copy link

github-actions bot commented Nov 28, 2023

2148 tests run: 2064 passed, 0 failed, 84 skipped (full report)


Flaky tests (4)

Postgres 16

Postgres 15

  • test_physical_replication: release

Postgres 14

  • test_statvfs_pressure_min_avail_bytes: debug
  • test_statvfs_pressure_usage: debug

Code coverage (full report)

  • functions: 54.8% (9370 of 17084 functions)
  • lines: 82.1% (54423 of 66250 lines)

The comment gets automatically updated with the latest test results
93e069c at 2023-12-11T18:50:38.500Z :recycle:

}

/// Use this instead of `local_path` if you already have a Timeline to hand.
pub(crate) fn build_local_path(
Copy link
Member

@koivunej koivunej Dec 7, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Who uses this from outside? Or why does it need to be exposed at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Used by:

  • Timeline, constructing paths for layer files in e.g. create_delta_layer
  • RemoteTimelineClient, constructing path to download to.

Comment on lines 492 to 498
// We will only do I/O during drop if our Timeline's layer_gate is open: this avoids
// the risk that we would race with Timeline::shutdown and end up doing I/O to a timeline
// path for which the Timeline object has been torn down already.
let _gate_guard = match timeline.layer_gate.enter() {
Ok(g) => g,
Err(GateError::GateClosed) => return,
};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now that I am looking at this, wouldn't it just be better to acquire a gate for all layers before creating them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the simplest reasoning about correctness, we could hold a gate around the lifetime of all layer objects.

However, that's relatively expensive (layers are very numerous) and isn't necessary: we already block timeline shutdown on all related tasks completing, so that should mean that layers aren't doing any I/O after shutdown complete. The only exception to that is here in drop(), where we need to check the gate in case some other code kept a Layer alive past the point that the other code kept its GateGuard (or task_mgr task lifetime).

Comment on lines 1112 to 1114
let local_path = self
.local_path()
.map_err(|_| EvictionCancelled::TimelineGone)?;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But now we would be upgrading the timeline all over? It must already be upgraded to get to here because otherwise we would be running without a span.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not upgraded frequently (the paths that need the local path aren't called super often), but yeah, I agree that it's a bit dangerous to make this implicit, since callers might naively use it for logging or something.

I've refactored this to take Timeline as a parameter: as you say, callers generally have a Timeline to hand already at the point they call this - 0f53319

Copy link
Member

@koivunej koivunej left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Getting rid of ids in the persistentlayerdesc is a good change but I dislike the adhoc timeline upgrading which is now spread around layer.

@@ -1004,7 +1038,10 @@ impl LayerInner {
//
// FIXME: this is not true anymore, we can safely evict wanted deleted files.
} else if can_evict && evict {
let span = tracing::info_span!(parent: None, "layer_evict", tenant_id = %self.desc.tenant_shard_id.tenant_id, shard_id = %self.desc.tenant_shard_id.shard_slug(), timeline_id = %self.desc.timeline_id, layer=%self, %version);
// If timeline is alive, we can construct a span with IDs for this function.
let span = self.timeline.upgrade().map(|timeline| {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koivunej at this point, we could instead return if the timeline is gone. I didn't make that change because it wasn't obvious if that would violate any other expectations, but it's probably fine since once Timeline is destroyed we don't have any obligation to make progress with eviction -- your thoughts?

@jcsp jcsp marked this pull request as ready for review December 11, 2023 09:58
@jcsp jcsp requested a review from a team as a code owner December 11, 2023 09:58
@jcsp jcsp requested review from arpad-m and removed request for a team December 11, 2023 09:58
Copy link
Member

@arpad-m arpad-m left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should

pageserver/src/tenant/timeline/layer_manager.rs Outdated Show resolved Hide resolved
pageserver/src/tenant/storage_layer/layer.rs Outdated Show resolved Hide resolved
@jcsp
Copy link
Contributor Author

jcsp commented Apr 5, 2024

#7082 fixed the important part of this (deletions happening after timeline shutdown)

@jcsp jcsp closed this Apr 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
a/tech_debt Area: related to tech debt c/storage/pageserver Component: storage: pageserver
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants