vm-builder: add SQL exporter to vector #878

skyzh · 2024-03-25T19:10:03Z

This pull request makes the LFC metrics available at the vector metrics endpoint. We use the default scrape interval to avoid overload the database.

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh · 2024-03-26T15:04:29Z

Welcome to Alpine!
 ~ This is the VM :) ~
neonvm:~# curl 10.0.2.18:9100/metrics | grep lfc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26016  100 26016    0     0  12.1M      0 --:--:-- --:--:-- --:--:-- 24.8M
# HELP lfc_cache_size_limit lfc_cache_size_limit
# TYPE lfc_cache_size_limit gauge
lfc_cache_size_limit 12700352512 1711465391821
# HELP lfc_hits lfc_hits
# TYPE lfc_hits gauge
lfc_hits 0 1711465391821
# HELP lfc_misses lfc_misses
# TYPE lfc_misses gauge
lfc_misses 374 1711465391821
# HELP lfc_used lfc_used
# TYPE lfc_used gauge
lfc_used 111 1711465391821
# HELP lfc_writes lfc_writes
# TYPE lfc_writes gauge
lfc_writes 374 1711465391821
neonvm:~#

skyzh · 2024-03-26T15:05:17Z

mark ready for review

Bodobolero · 2024-03-26T15:26:52Z

@skyzh pls also add the working set size in pages to the sql_exporter as a metric
https://github.com/neondatabase/neon/blob/8dfe3a070cd04dd2310ed07e1f38f4257dd43cd8/vm-image-spec.yaml#L188

select approximate_working_set_size(false);

Bodobolero

select approximate_working_set_size(false);

is missing in sql_exporter metrics

Do we really require 1 sec interval for those metrics ?
For working set size and hit ratios a much smaller interval like 1 minute or so would be sufficient.
It is not just using resources in compute but also in our victoriametrics database if we scrape so frequently?

skyzh · 2024-03-26T15:45:07Z

I'm using the default scrape interval for sql exporters, which is usually 15 secs (1 sec is for host information).

skyzh · 2024-03-26T15:46:07Z

select approximate_working_set_size(false);

will have a separate pull request on the compute side and make it into the next release.

sharnoff · 2024-03-26T15:54:54Z

It is not just using resources in compute but also in our victoriametrics database if we scrape so frequently?

These metrics are exclusively consumed by the autoscaler-agent and not persisted anywhere.

The autoscaler-agent fetches metrics (and potentially makes a scaling decision) every 5 seconds.

IMO it's worthwhile to not need to worry about whether those metrics are stale; I don't immediately see a reason not to fetch sql-exporter metrics every 5 seconds (or even every 1 second) - presumably it's pretty cheap?

skyzh · 2024-03-26T16:20:18Z

There are some aggregations and joins on the system catalog, and therefore, if there are a lot of tables, it might be slow to execute.

https://github.com/neondatabase/neon/blob/47d2b3a4830f6d5ecb84086e785ec0f913390176/vm-image-spec.yaml#L162-L172

A better approach is to separate two sql exporters: the ones that are cheap to scrape, and the expensive ones. All LFC metrics are basically O(1) operations when being retrieved from the database.

skyzh · 2024-03-28T17:35:38Z

This pull request will be merged once the compute + console release is done next week. There are still some leftover works from upgrading the compute, and I will ensure that at the time this pull request gets merged, the metrics I put in this vector config will be available from all neon user projects.

Bodobolero

thanks for adding the working set size

ref neondatabase/autoscaling#878 ref neondatabase/autoscaling#872 Add `approximate_working_set_size` to sql exporter so that autoscaling can use it in the future. --------- Signed-off-by: Alex Chi Z <chi@neon.tech> Co-authored-by: Peter Bendel <peterbendel@neon.tech>

sharnoff

Apologies for the delayed review. Took a while to think this over.

High-level question:

I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per each VM). AFAICT the trade-offs would be:

More implementation work required in autoscaler-agent to pass the data around
The simplest implementation would probably pull every 5 seconds (although it wouldn't be too tricky to have a different scrape interval...)
Rollout would depend on autoscaling -> cplane release (to add per-VM setting), instead of autoscaling -> compute image release
We wouldn't have spooky action at a distance w.r.t. sql-exporter split across this repo and neondatabase/neon

What do you think?

Separately, could you prefix "vm-builder: " to the start of this PR title for consistent naming? Helps with organization 🙏

neonvm/tools/vm-builder/files/vector.yaml

sharnoff · 2024-04-01T23:24:26Z

neonvm/tools/vm-builder/files/vector.yaml

+    # The data might be delayed as we use the default scrape interval. Some SQLs are not
+    # trivial to execute, and therefore we don't want them to run every second.


How much work would it be to run this at the same scrape interval? (or at least, <5s?)

I ask because it may make the scaling algorithm tricky (specifically: expecting counters to remain the same, not because nothing's happening, but because metrics are delayed). I'm sure it can be worked around, but it'd be good to weigh the options for tech debt like this.

I think we just need to split two sqlexporters (one for quick scrapes, and one for slow scrapes).

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>

skyzh · 2024-04-02T20:12:15Z

I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per each VM).

Do you mean that the autoscaler-agent can directly query LFC data by logging in to the database using Postgres protocol + cloud_admin? In this case, do we need to store that data somewhere inside cplane? I thought the reason that we have vector.dev is to keep some history of the data and avoid storing them by autoscaling-agent itself, but I could be wrong...

skyzh · 2024-04-02T20:18:05Z

Okay I just re-read https://github.com/neondatabase/autoscaling/blob/main/ARCHITECTURE.md and I think I had some misunderstanding there. So I feel the best approach is:

Scrape the data directly using SQL. The autoscaling-agent will connect to the database using Postgres protocol and run SQL queries to collect the data.
Find a way to access these data. As the agent runs outside of the compute pod, it cannot directly log in to cloud_admin. There are 3 approaches: assign a temporary password for cloud_admin; modify ip list so that the agent can connect without a password; create a proxy inside the pod on some port to delegate SQL queries.

skyzh · 2024-04-02T20:20:20Z

create a proxy inside the pod on some port to delegate SQL queries

can we make this into vm-monitor?

sharnoff · 2024-04-03T01:42:55Z

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.

Omrigan · 2024-04-03T19:42:30Z

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

That feels like smashing abstraction layers. Does it yield any practical advantage over below?

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.

This also, but less so than above. I see now a practical advantage (no bufferization) of bypassing vector. Also, @neondatabase/billing folks say vector is unreliable 🙂 so I guess this is my preference.

skyzh · 2024-04-03T20:53:08Z

I can open a pull request on the compute side to have vm-monitor directly exposing the metrics

skyzh · 2024-04-03T21:39:57Z

I have a demo pull request ready but not tested yet, feel free to leave some comments before we finalize the code + method of exposing the metrics. neondatabase/neon#7302

add SQL exporter to vector

48851d7

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh requested a review from sharnoff March 25, 2024 19:10

use ip instead of localhost, add new metrics to input

08a3116

Signed-off-by: Alex Chi Z <chi@neon.tech>

skyzh marked this pull request as ready for review March 26, 2024 15:04

skyzh requested a review from Bodobolero March 26, 2024 15:04

Bodobolero requested changes Mar 26, 2024

View reviewed changes

skyzh mentioned this pull request Mar 26, 2024

neonvm: add LFC approximate working set size to metrics neondatabase/neon#7252

Merged

5 tasks

skyzh requested a review from Bodobolero March 28, 2024 17:35

Bodobolero approved these changes Mar 28, 2024

View reviewed changes

sharnoff reviewed Apr 1, 2024

View reviewed changes

Omrigan self-requested a review April 2, 2024 11:48

skyzh changed the title ~~add SQL exporter to vector~~ vm-builder: add SQL exporter to vector Apr 2, 2024

Update neonvm/tools/vm-builder/files/vector.yaml

75cd80d

Co-authored-by: Em Sharnoff <sharnoff@neon.tech>

skyzh mentioned this pull request Apr 3, 2024

Epic: Local File Cache Enhancements neondatabase/neon#6919

Closed

sharnoff mentioned this pull request Apr 4, 2024

agent: Refactor metrics definitions #892

Merged

Omrigan mentioned this pull request Apr 11, 2024

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm #872

Closed

skyzh closed this May 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vm-builder: add SQL exporter to vector #878

vm-builder: add SQL exporter to vector #878

skyzh commented Mar 25, 2024 •

edited

Loading

skyzh commented Mar 26, 2024

skyzh commented Mar 26, 2024

Bodobolero commented Mar 26, 2024 •

edited

Loading

Bodobolero left a comment •

edited

Loading

skyzh commented Mar 26, 2024 •

edited

Loading

skyzh commented Mar 26, 2024

sharnoff commented Mar 26, 2024

skyzh commented Mar 26, 2024

skyzh commented Mar 28, 2024

Bodobolero left a comment

sharnoff left a comment •

edited

Loading

sharnoff Apr 1, 2024

skyzh Apr 2, 2024

skyzh commented Apr 2, 2024 •

edited

Loading

skyzh commented Apr 2, 2024

skyzh commented Apr 2, 2024 •

edited

Loading

sharnoff commented Apr 3, 2024

Omrigan commented Apr 3, 2024

skyzh commented Apr 3, 2024

skyzh commented Apr 3, 2024

		# The data might be delayed as we use the default scrape interval. Some SQLs are not
		# trivial to execute, and therefore we don't want them to run every second.

vm-builder: add SQL exporter to vector #878

vm-builder: add SQL exporter to vector #878

Conversation

skyzh commented Mar 25, 2024 • edited Loading

skyzh commented Mar 26, 2024

skyzh commented Mar 26, 2024

Bodobolero commented Mar 26, 2024 • edited Loading

Bodobolero left a comment • edited Loading

Choose a reason for hiding this comment

skyzh commented Mar 26, 2024 • edited Loading

skyzh commented Mar 26, 2024

sharnoff commented Mar 26, 2024

skyzh commented Mar 26, 2024

skyzh commented Mar 28, 2024

Bodobolero left a comment

Choose a reason for hiding this comment

sharnoff left a comment • edited Loading

Choose a reason for hiding this comment

sharnoff Apr 1, 2024

Choose a reason for hiding this comment

skyzh Apr 2, 2024

Choose a reason for hiding this comment

skyzh commented Apr 2, 2024 • edited Loading

skyzh commented Apr 2, 2024

skyzh commented Apr 2, 2024 • edited Loading

sharnoff commented Apr 3, 2024

Omrigan commented Apr 3, 2024

skyzh commented Apr 3, 2024

skyzh commented Apr 3, 2024

skyzh commented Mar 25, 2024 •

edited

Loading

Bodobolero commented Mar 26, 2024 •

edited

Loading

Bodobolero left a comment •

edited

Loading

skyzh commented Mar 26, 2024 •

edited

Loading

sharnoff left a comment •

edited

Loading

skyzh commented Apr 2, 2024 •

edited

Loading

skyzh commented Apr 2, 2024 •

edited

Loading