metrics: record time to update gc info as a per timeline metric #7473
We know from a recent incident that updating gc info can take a very long time, and holding `Tenant::gc_cs` affects many per-tenant operations in the system. We need a direct way to observe how long it takes. The solution is to add metrics so that we know when this happens.

Verified in the dashboard that the buckets are okay-ish. In our current state we will see a lot more observations land in `Inf`, but that is probably okay; at least we can learn which timelines are having issues.

Can we afford to add these metrics? A bit unclear; see another dashboard with the top pageserver `/metrics` response sizes.
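To make the two open points concrete (observations landing in `Inf`, and the cost of per-timeline series), here is a minimal std-only sketch. It is not the pageserver implementation: the bucket bounds, `TimelineKey`, `Histogram`, and `PerTimelineMetrics` are all hypothetical names and values chosen for illustration.

```rust
use std::collections::HashMap;
use std::time::Duration;

/// Number of finite buckets (assumed for illustration).
const NUM_FINITE: usize = 5;

/// Upper bounds of the finite buckets, in seconds (hypothetical values,
/// not the ones used by this PR).
const BUCKETS_SECS: [f64; NUM_FINITE] = [0.1, 1.0, 10.0, 60.0, 300.0];

/// Hypothetical label pair identifying one timeline.
#[derive(Clone, Debug, PartialEq, Eq, Hash)]
struct TimelineKey {
    tenant_id: String,
    timeline_id: String,
}

/// Non-cumulative per-bucket counts; the extra last slot is the Inf overflow.
#[derive(Debug, Default)]
struct Histogram {
    counts: [u64; NUM_FINITE + 1],
}

impl Histogram {
    /// Count the observation in the first bucket whose bound is >= elapsed,
    /// or in the overflow slot if it exceeds every finite bound.
    fn observe(&mut self, elapsed: Duration) {
        let secs = elapsed.as_secs_f64();
        let slot = BUCKETS_SECS
            .iter()
            .position(|&le| secs <= le)
            .unwrap_or(NUM_FINITE);
        self.counts[slot] += 1;
    }

    /// Observations that exceeded every finite bucket bound.
    fn overflow(&self) -> u64 {
        self.counts[NUM_FINITE]
    }
}

/// One histogram per timeline, created lazily on first observation.
#[derive(Default)]
struct PerTimelineMetrics {
    histograms: HashMap<TimelineKey, Histogram>,
}

impl PerTimelineMetrics {
    fn observe(&mut self, key: &TimelineKey, elapsed: Duration) {
        self.histograms
            .entry(key.clone())
            .or_default()
            .observe(elapsed);
    }

    /// Bucket series this would expose: one per bucket (including Inf)
    /// for every timeline that has been observed.
    fn series_count(&self) -> usize {
        self.histograms.len() * (NUM_FINITE + 1)
    }
}
```

With these assumed bounds, a 900-second update lands in the overflow slot, which still identifies the affected timeline even though the exact duration is lost. `series_count` grows linearly with the number of timelines, which is the cost behind the `/metrics` response-size question.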