Expose LSN and replication delay as metrics #7610

save-buffer · 2024-05-03T18:04:54Z

Problem

We currently have no way to see what the current LSN of a compute is, and in the case of read replicas, we don't know what the difference in LSNs is.

Summary of changes

Adds metrics for the current LSN and the replication delay.
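For readers unfamiliar with how a "difference in LSNs" is measured, here is a minimal sketch. It is not taken from this PR. It assumes only the standard Postgres textual LSN format `X/Y`, where both halves are hexadecimal and the value is the 64-bit byte position `(X << 32) | Y`. The helper names `parse_lsn` and `lag_bytes` are hypothetical and exist only for illustration.

```python
def parse_lsn(text: str) -> int:
    """Convert a Postgres LSN string such as '0/16B3748' to a byte position.

    The part before the slash is the high 32 bits and the part after it
    is the low 32 bits, both written in hexadecimal.
    """
    hi, lo = text.strip().split("/")
    return (int(hi, 16) << 32) | int(lo, 16)


def lag_bytes(primary_lsn: str, replica_lsn: str) -> int:
    """Return how many bytes of WAL the replica is behind the primary.

    Hypothetical helper: a positive result means the replica has not yet
    reached the primary's position.
    """
    return parse_lsn(primary_lsn) - parse_lsn(replica_lsn)


if __name__ == "__main__":
    # Example values are made up for illustration.
    print(lag_bytes("0/3000060", "0/3000000"))
```

Inside the database, the built-in `pg_wal_lsn_diff` function performs the same subtraction, so a metrics query would normally not need client-side parsing like this.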

github-actions · 2024-05-03T18:41:55Z

2886 tests run: 2759 passed, 0 failed, 127 skipped (full report)

Flaky tests (3)

Postgres 16

test_gc_aggressive: debug
test_vm_bit_clear_on_heap_lock: debug

Postgres 14

test_lock_time_tracing: release

Code coverage* (full report)

functions: 31.4% (6241 of 19881 functions)
lines: 47.1% (46742 of 99293 lines)

* collected from Rust tests only

The comment gets automatically updated with the latest test results.
057672c at 2024-05-06T22:59:58.713Z

vm-image-spec.yaml

hlinnaka · 2024-05-05T07:46:49Z

What is the difference between the neon_collector and neon_collector_autoscaling sections in the yaml file? Which metrics are supposed to go in which?

hlinnaka · 2024-05-05T07:57:03Z

What is the difference between the neon_collector and neon_collector_autoscaling sections in the yaml file? Which metrics are supposed to go in which?

Ok, found the explanation in the commit message of commit 4b55dad:

As discussed in neondatabase/autoscaling#895, we want to have a separate sql_exporter for simple metrics to avoid overloading the database because the autoscaling agent needs to scrape at a higher interval. The new exporter is exposed at port 9499.

So if I understand correctly, the point of the "autoscaling" metrics is to expose the metrics that autoscaling uses to make scaling decisions. The replication lag metrics don't seem necessary for that. So I think these new metrics should only be added to the neon_collector metrics, not neon_collector_autoscaling.

save-buffer · 2024-05-06T18:01:34Z

So I think these new metrics should only be added to the neon_collector metrics, not neon_collector_autoscaling.

Got it, good catch! I just kind of assumed they were there because we for some reason had different specs for pods and vms. Just removed it.

hlinnaka reviewed May 5, 2024

save-buffer requested a review from hlinnaka May 6, 2024 18:21

save-buffer force-pushed the sasha_compute_metrics branch from 9531ba4 to d86ffaf Compare May 6, 2024 18:22

hlinnaka reviewed May 6, 2024

save-buffer added 4 commits May 6, 2024 15:17

Expose LSN and replication delay as metrics

185c000

Add some more metrics

cd558eb

Respond to Heikki

4d6cf24

Fix metric name

057672c

save-buffer force-pushed the sasha_compute_metrics branch from a5ed329 to 057672c Compare May 6, 2024 22:17

hlinnaka approved these changes May 8, 2024

andreasscherbaum added the c/storage/compute Component: storage: compute label May 8, 2024

save-buffer merged commit 21e1a49 into main May 8, 2024
56 checks passed

save-buffer deleted the sasha_compute_metrics branch May 8, 2024 15:49
