Improve slow operations observability in safekeepers #8188

petuhovskiy · 2024-06-27T15:58:07Z

After #8022 was deployed to staging, I noticed many timeouts. Inspecting the logs showed that some operations take ~20 seconds and run while holding the shared state lock. This usually happens right after a redeploy, because compute reconnections put a high load on disks. This commit tries to improve observability around slow operations.
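For illustration only, here is a minimal sketch of the general idea of timing an operation that runs under a lock and reporting it when it is slow. It is not the code from this PR: `timed_op`, `bump_under_lock`, and the threshold parameter are hypothetical names, and the sketch uses only the Rust standard library rather than the safekeeper's own metrics and logging.

```rust
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Hypothetical helper (not from the PR): runs `op`, measures how long it
/// took, and reports whether it exceeded `threshold`.
pub fn timed_op<T>(
    name: &str,
    threshold: Duration,
    op: impl FnOnce() -> T,
) -> (T, Duration, bool) {
    let started = Instant::now();
    let result = op();
    let elapsed = started.elapsed();
    let slow = elapsed > threshold;
    if slow {
        // A real service would emit a structured log line or a metric here.
        eprintln!("slow operation {name}: took {elapsed:?}, threshold {threshold:?}");
    }
    (result, elapsed, slow)
}

/// Example use: time an update that is performed while a lock is held,
/// so that lock acquisition and the work itself are both covered.
pub fn bump_under_lock(state: &Mutex<u64>, threshold: Duration) -> (u64, bool) {
    let (value, _elapsed, slow) = timed_op("bump", threshold, || {
        let mut guard = state.lock().unwrap();
        *guard += 1;
        *guard
    });
    (value, slow)
}
```

Measuring around the whole closure means the reported duration includes the time spent waiting for the lock, which is usually what matters when other callers time out.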

Non-observability changes:

TimelineState::finish_change now skips update if nothing has changed
wal_residence_guard() timeout is set to 30s
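The first change in the list above, skipping the update when nothing has changed, could look roughly like the sketch below. This is an assumption-laden illustration, not the actual `TimelineState` code: `PersistedState`, `StateStore`, the field names, and the `writes` counter are all hypothetical.

```rust
/// Hypothetical persistent state; the fields are illustrative only.
#[derive(Clone, Debug, PartialEq, Eq)]
pub struct PersistedState {
    pub commit_lsn: u64,
    pub backup_lsn: u64,
}

/// Hypothetical store that counts how many times it actually persisted.
#[derive(Debug)]
pub struct StateStore {
    pub current: PersistedState,
    pub writes: usize,
}

impl StateStore {
    pub fn new(initial: PersistedState) -> Self {
        Self { current: initial, writes: 0 }
    }

    /// Hands out a copy of the current state for the caller to modify.
    pub fn start_change(&self) -> PersistedState {
        self.current.clone()
    }

    /// Persists `updated` only if it differs from the current state.
    /// Returns true if a write happened, false if it was skipped.
    pub fn finish_change(&mut self, updated: PersistedState) -> bool {
        if updated == self.current {
            return false; // nothing changed: skip the (potentially slow) write
        }
        self.current = updated;
        self.writes += 1; // stands in for a disk write in this sketch
        true
    }
}
```

Skipping no-op updates matters most when the write happens under a shared lock, since every avoided write shortens the time other operations wait.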

github-actions · 2024-06-27T16:45:55Z

2940 tests run: 2823 passed, 0 failed, 117 skipped (full report)

Flaky tests (1)

Postgres 14

test_lr_with_slow_safekeeper: release

Code coverage* (full report)

functions: 32.6% (6896 of 21128 functions)
lines: 50.0% (53957 of 107939 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
53c0746 at 2024-06-27T16:45:54.462Z


Improve slow operations observability in safekeepers

53c0746

petuhovskiy requested a review from a team as a code owner June 27, 2024 15:58

petuhovskiy requested review from jcsp and arssher and removed request for jcsp June 27, 2024 15:58

arssher approved these changes Jun 27, 2024


petuhovskiy merged commit 1d66ca7 into main Jun 27, 2024
64 checks passed

petuhovskiy deleted the sk-slow-ops branch June 27, 2024 17:39

petuhovskiy mentioned this pull request Jun 28, 2024

Add buckets to safekeeper ops metrics #8194

Merged

petuhovskiy added a commit that referenced this pull request Jun 28, 2024

Add buckets to safekeeper ops metrics (#8194)

c22c6a6

In #8188 I forgot to specify buckets for new operations metrics. This commit fixes that.
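To show why explicit buckets matter for a latency histogram, here is a small standard-library sketch. It is not the metrics code from #8194: `OpsHistogram` and the bucket bounds used in the usage note are hypothetical, and the sketch keeps per-bucket counts rather than the cumulative counts a Prometheus-style histogram exposes.

```rust
/// Hypothetical histogram with explicit upper bounds (in seconds) plus one
/// implicit overflow bucket for observations above the largest bound.
pub struct OpsHistogram {
    bounds: Vec<f64>,
    counts: Vec<u64>,
}

impl OpsHistogram {
    pub fn new(mut bounds: Vec<f64>) -> Self {
        bounds.sort_by(|a, b| a.partial_cmp(b).expect("bounds must not be NaN"));
        let counts = vec![0; bounds.len() + 1];
        Self { bounds, counts }
    }

    /// Counts the observation in the first bucket whose upper bound is
    /// greater than or equal to it, or in the overflow bucket otherwise.
    pub fn observe(&mut self, seconds: f64) {
        let idx = self
            .bounds
            .iter()
            .position(|bound| seconds <= *bound)
            .unwrap_or(self.bounds.len());
        self.counts[idx] += 1;
    }

    /// Per-bucket (non-cumulative) counts, overflow bucket last.
    pub fn counts(&self) -> &[u64] {
        &self.counts
    }
}
```

With assumed bounds such as `vec![0.1, 1.0, 10.0, 30.0]`, a ~20 second operation lands in the 30 second bucket instead of being indistinguishable from everything else above the largest bound, which is the situation that arises when the buckets do not reach far enough for slow operations.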
