metrics: record time to update gc info as a per timeline metric (#7473)
We know from a [recent incident] that updating gc info can take a very
long time, and holding `Tenant::gc_cs` affects many per-tenant
operations in the system. We need a direct way to observe the time it
takes. The solution is to add metrics so that we know when this happens:
- 2 new per-timeline metrics
- 1 new global histogram

Verified that the buckets are okay-ish in [dashboard]. In our current
state we will see a lot more observations in the `+Inf` bucket, but that
is probably okay; at least we can learn which timelines are having
issues.

Can we afford to add these metrics? A bit unclear, see [another
dashboard] with top pageserver `/metrics` response sizes.

[dashboard]:
https://neonprod.grafana.net/d/b7a5a5e2-1276-4bb0-9e3a-b4528adb6eb6/storage-operations-histograms-in-prod?orgId=1&var-datasource=ZNX49CDVz&var-instance=All&var-operation=All&from=now-7d&to=now

[another dashboard]:
https://neonprod.grafana.net/d/MQx4SN-Vk/metric-sizes-on-prod-and-some-correlations?orgId=1

[recent incident]:
https://neondb.slack.com/archives/C06UEMLK7FE/p1713817696580119?thread_ts=1713468604.508969&cid=C06UEMLK7FE
koivunej committed Apr 29, 2024
1 parent 75b4440 commit 3695a1e
Showing 4 changed files with 41 additions and 1 deletion.
27 changes: 27 additions & 0 deletions pageserver/src/metrics.rs
@@ -51,6 +51,9 @@ pub(crate) enum StorageTimeOperation {
    #[strum(serialize = "gc")]
    Gc,

    #[strum(serialize = "update gc info")]
    UpdateGcInfo,

    #[strum(serialize = "create tenant")]
    CreateTenant,
}
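
For context on the hunk above: strum's `serialize` attribute controls the
string the variant converts to, and that string becomes the operation label
value on the emitted metrics. A minimal sketch of the mechanism, assuming the
strum crates; the pageserver may derive a different strum trait, and
`IntoStaticStr` is shown purely for illustration:

use strum_macros::IntoStaticStr;

#[derive(IntoStaticStr)]
enum Op {
    #[strum(serialize = "update gc info")]
    UpdateGcInfo,
}

fn main() {
    // The serialized form, not the variant name, ends up as the label value.
    let label: &'static str = Op::UpdateGcInfo.into();
    assert_eq!(label, "update gc info");
}
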
@@ -1910,6 +1913,22 @@ impl StorageTimeMetricsTimer {
        self.metrics.timeline_count.inc();
        self.metrics.global_histogram.observe(duration);
    }

    /// Turns this timer into one that will always record -- usually this means recording
    /// regardless of whether an early `?` return path was taken in a function.
    pub(crate) fn record_on_drop(self) -> AlwaysRecordingStorageTimeMetricsTimer {
        AlwaysRecordingStorageTimeMetricsTimer(Some(self))
    }
}

pub(crate) struct AlwaysRecordingStorageTimeMetricsTimer(Option<StorageTimeMetricsTimer>);

impl Drop for AlwaysRecordingStorageTimeMetricsTimer {
    fn drop(&mut self) {
        if let Some(inner) = self.0.take() {
            inner.stop_and_record();
        }
    }
}

/// Timing facilities for a globally histogrammed metric, which is supported by per tenant and
@@ -1970,6 +1989,7 @@ pub(crate) struct TimelineMetrics {
    pub imitate_logical_size_histo: StorageTimeMetrics,
    pub load_layer_map_histo: StorageTimeMetrics,
    pub garbage_collect_histo: StorageTimeMetrics,
    pub update_gc_info_histo: StorageTimeMetrics,
    pub last_record_gauge: IntGauge,
    resident_physical_size_gauge: UIntGauge,
    /// copy of LayeredTimeline.current_logical_size
@@ -2030,6 +2050,12 @@ impl TimelineMetrics {
            &shard_id,
            &timeline_id,
        );
        let update_gc_info_histo = StorageTimeMetrics::new(
            StorageTimeOperation::UpdateGcInfo,
            &tenant_id,
            &shard_id,
            &timeline_id,
        );
        let last_record_gauge = LAST_RECORD_LSN
            .get_metric_with_label_values(&[&tenant_id, &shard_id, &timeline_id])
            .unwrap();
@@ -2072,6 +2098,7 @@ impl TimelineMetrics {
            logical_size_histo,
            imitate_logical_size_histo,
            garbage_collect_histo,
            update_gc_info_histo,
            load_layer_map_histo,
            last_record_gauge,
            resident_physical_size_gauge,
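
The `record_on_drop` addition above is a drop guard: the observation happens
in `Drop`, so the duration is recorded on every exit path, including early `?`
returns. A minimal standalone sketch of the idea, with illustrative names
rather than the pageserver's actual types:

use std::time::Instant;

// Stand-in for AlwaysRecordingStorageTimeMetricsTimer: records elapsed time on drop.
struct DropTimer(Instant);

impl Drop for DropTimer {
    fn drop(&mut self) {
        // The real code observes a histogram and bumps per-timeline counters here.
        println!("observed {:?}", self.0.elapsed());
    }
}

fn update_gc_info(fail: bool) -> Result<(), String> {
    let _timer = DropTimer(Instant::now());
    if fail {
        return Err("early exit".into()); // still recorded: _timer is dropped here
    }
    Ok(())
}

fn main() {
    let _ = update_gc_info(true);
    let _ = update_gc_info(false);
}
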
6 changes: 5 additions & 1 deletion pageserver/src/tenant.rs
@@ -3402,7 +3402,11 @@ impl Tenant {
        // is in progress (which is not a common case).
        //
        // See more on issue #2748, condensed out of the initial PR review.
-       let mut shared_cache = self.cached_logical_sizes.lock().await;
+       let mut shared_cache = tokio::select! {
+           locked = self.cached_logical_sizes.lock() => locked,
+           _ = cancel.cancelled() => anyhow::bail!("cancelled"),
+           _ = self.cancel.cancelled() => anyhow::bail!("tenant is shutting down"),
+       };

        size::gather_inputs(
            self,
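
The tenant.rs hunk replaces an unconditional `.lock().await` with a
`tokio::select!`, so waiting for the logical-size cache no longer outlives a
cancellation request. A minimal sketch of the same pattern, assuming the tokio
and tokio-util crates (names here are illustrative):

use std::sync::Arc;
use tokio::sync::Mutex;
use tokio_util::sync::CancellationToken;

async fn locked_or_cancelled(
    cache: Arc<Mutex<u64>>,
    cancel: CancellationToken,
) -> anyhow::Result<u64> {
    // Wait for the lock, but bail out as soon as cancellation is requested.
    let guard = tokio::select! {
        locked = cache.lock() => locked,
        _ = cancel.cancelled() => anyhow::bail!("cancelled"),
    };
    Ok(*guard)
}

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let cache = Arc::new(Mutex::new(42));
    println!("{}", locked_or_cancelled(cache, CancellationToken::new()).await?);
    Ok(())
}
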
3 changes: 3 additions & 0 deletions pageserver/src/tenant/size.rs
@@ -118,6 +118,9 @@ pub(super) async fn gather_inputs(
    ctx: &RequestContext,
) -> anyhow::Result<ModelInputs> {
    // refresh is needed to update gc related pitr_cutoff and horizon_cutoff
    //
    // FIXME: if a single timeline is deleted while the gc info refresh is ongoing, we will
    // fail the whole computation, which does not make sense from a billing perspective.
    tenant
        .refresh_gc_info(cancel, ctx)
        .await
6 changes: 6 additions & 0 deletions pageserver/src/tenant/timeline.rs
@@ -4345,6 +4345,12 @@ impl Timeline {
        cancel: &CancellationToken,
        ctx: &RequestContext,
    ) -> anyhow::Result<()> {
        let _timer = self
            .metrics
            .update_gc_info_histo
            .start_timer()
            .record_on_drop();

        // First, calculate pitr_cutoff_timestamp and then convert it to LSN.
        //
        // Some unit tests depend on garbage-collection working even when
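
One Rust detail in the hunk above: the guard is bound to `_timer`, not `_`.
The bare wildcard pattern does not bind the value, so `let _ = ...` would drop
the timer (and record a near-zero duration) immediately instead of at scope
exit. A small general-language illustration:

struct Guard;

impl Drop for Guard {
    fn drop(&mut self) {
        println!("recorded");
    }
}

fn main() {
    let _ = Guard;      // `_` binds nothing: dropped (and "recorded") right away
    println!("after `_`");
    let _timer = Guard; // `_timer` keeps the guard alive until the end of main
    println!("after `_timer`");
}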

1 comment on commit 3695a1e

@github-actions

2876 tests run: 2742 passed, 0 failed, 134 skipped (full report)


Flaky tests (1)

Postgres 14

  • test_partial_evict_tenant[relative_spare]: release

Code coverage* (full report)

  • functions: 28.3% (6549 of 23146 functions)
  • lines: 47.0% (46243 of 98480 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
3695a1e at 2024-04-29T05:10:57.021Z :recycle:
