metrics: record time to update gc info as a per timeline metric #7473

Merged: 3 commits, Apr 29, 2024


koivunej
Contributor

@koivunej koivunej commented Apr 23, 2024

We know from a recent incident that updating gc info can take a very long time, and holding Tenant::gc_cs for that long affects many per-tenant operations in the system. We need a direct way to observe how long it takes. This PR adds metrics so that we know when this happens:

  • 2 new per-timeline metrics
  • 1 new global histogram
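
As a rough illustration of the idea above (not the code from this PR), the sketch below times a piece of work performed while a lock is held and records the result per timeline. It assumes the two per-timeline metrics are a running count and a running sum of seconds; `GcInfoMetrics`, `TimelineTimings` and `timed_under_lock` are hypothetical names, the timeline id is a plain string, and a `std::sync::Mutex` stands in for whatever lock type `Tenant::gc_cs` really is.

```rust
use std::collections::HashMap;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical per-timeline pair: number of updates and their total duration.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
struct TimelineTimings {
    count: u64,
    sum_seconds: f64,
}

/// Hypothetical registry keyed by timeline id (a plain string in this sketch).
#[derive(Default)]
struct GcInfoMetrics {
    per_timeline: Mutex<HashMap<String, TimelineTimings>>,
}

impl GcInfoMetrics {
    /// Record one finished update for `timeline_id`.
    fn observe(&self, timeline_id: &str, elapsed: Duration) {
        let mut map = self.per_timeline.lock().unwrap();
        let entry = map.entry(timeline_id.to_string()).or_default();
        entry.count += 1;
        entry.sum_seconds += elapsed.as_secs_f64();
    }

    /// Read back the timings recorded so far for `timeline_id`, if any.
    fn get(&self, timeline_id: &str) -> Option<TimelineTimings> {
        self.per_timeline.lock().unwrap().get(timeline_id).copied()
    }
}

/// Run `work` while holding `gc_cs` and record how long the work took.
fn timed_under_lock<T>(
    gc_cs: &Mutex<()>,
    metrics: &GcInfoMetrics,
    timeline_id: &str,
    work: impl FnOnce() -> T,
) -> T {
    let _guard = gc_cs.lock().unwrap();
    // Start the clock only after the lock is acquired, so that waiting
    // for the lock is not counted as time spent updating.
    let started = Instant::now();
    let result = work();
    metrics.observe(timeline_id, started.elapsed());
    result
}

fn main() {
    let gc_cs = Mutex::new(());
    let metrics = GcInfoMetrics::default();
    let value = timed_under_lock(&gc_cs, &metrics, "timeline-a", || 2 + 2);
    let seen = metrics.get("timeline-a").unwrap();
    println!("result={value} count={} sum={:.6}s", seen.count, seen.sum_seconds);
}
```

Keeping a count and a sum per timeline is cheap compared with a full histogram per timeline, and dividing sum by count over a time window still gives an average duration for each timeline.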

Verified in a dashboard that the buckets are okay-ish. In our current state, we will see a lot more observations land in the +Inf bucket, but that is probably okay; at least we can learn which timelines are having issues.
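
To make the remark about Inf concrete, here is a minimal sketch of a Prometheus-style cumulative histogram, under the assumption that "Inf" refers to the implicit +Inf bucket that catches every observation above the largest finite bound. The `Histogram` type and the bucket bounds are hypothetical and are not the ones used in this PR.

```rust
/// Minimal cumulative histogram in the Prometheus style.
struct Histogram {
    /// Finite upper bounds in seconds, sorted ascending (hypothetical values).
    bounds: Vec<f64>,
    /// cumulative[i] counts observations <= bounds[i]; the last slot is +Inf.
    cumulative: Vec<u64>,
}

impl Histogram {
    fn new(bounds: Vec<f64>) -> Self {
        let slots = bounds.len() + 1;
        Histogram { bounds, cumulative: vec![0; slots] }
    }

    fn observe(&mut self, seconds: f64) {
        // Find the first bucket whose bound covers the value; buckets are
        // cumulative, so that bucket and every later one are incremented.
        let first = self
            .bounds
            .iter()
            .position(|bound| seconds <= *bound)
            .unwrap_or(self.bounds.len());
        for slot in &mut self.cumulative[first..] {
            *slot += 1;
        }
    }

    /// Total number of observations, which equals the +Inf bucket.
    fn total(&self) -> u64 {
        *self.cumulative.last().unwrap()
    }

    /// Observations that exceeded every finite bound.
    fn over_largest_bound(&self) -> u64 {
        let n = self.cumulative.len();
        if n >= 2 {
            self.cumulative[n - 1] - self.cumulative[n - 2]
        } else {
            self.cumulative[0]
        }
    }
}

fn main() {
    let mut h = Histogram::new(vec![0.001, 0.01, 0.1, 1.0]);
    for seconds in [0.0005, 0.05, 3.0, 120.0] {
        h.observe(seconds);
    }
    println!("total={} over_largest_bound={}", h.total(), h.over_largest_bound());
}
```

In this sketch a 3 second and a 120 second update are indistinguishable, since both only show up in +Inf. That limits how much a global histogram alone can say about slow updates.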

Can we afford to add these metrics? That is a bit unclear; see the other dashboard showing the top pageserver /metrics response sizes.

@koivunej koivunej requested a review from problame April 23, 2024 06:40
@koivunej koivunej requested a review from a team as a code owner April 23, 2024 06:40

2772 tests run: 2654 passed, 0 failed, 118 skipped (full report)


Code coverage* (full report)

  • functions: 28.1% (6466 of 23044 functions)
  • lines: 46.8% (45690 of 97613 lines)

* collected from Rust tests only


The comment gets automatically updated with the latest test results
22592a2 at 2024-04-23T07:22:29.195Z :recycle:

Contributor

@problame problame left a comment


No strong objections, some comments.
And I don't have opinions on whether we can afford the additional metrics.

pageserver/src/tenant.rs (resolved)
pageserver/src/tenant/size.rs (resolved)
pageserver/src/tenant/timeline.rs (resolved)
@koivunej
Contributor Author

We discussed this more last week. The guess is that we should have room for this metric, so merging.

@koivunej koivunej merged commit 3695a1e into main Apr 29, 2024
52 of 53 checks passed
@koivunej koivunej deleted the update_gc_info_storage_time branch April 29, 2024 04:14