Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vm-builder: add SQL exporter to vector #878

Closed
wants to merge 3 commits into from
Closed

Conversation

skyzh
Copy link
Member

@skyzh skyzh commented Mar 25, 2024

ref neondatabase/neon#5949
close #872

This pull request makes the LFC metrics available at the vector metrics endpoint. We use the default scrape interval to avoid overload the database.

Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh skyzh requested a review from sharnoff March 25, 2024 19:10
Signed-off-by: Alex Chi Z <chi@neon.tech>
@skyzh
Copy link
Member Author

skyzh commented Mar 26, 2024

Welcome to Alpine!
 ~ This is the VM :) ~
neonvm:~# curl 10.0.2.18:9100/metrics | grep lfc
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 26016  100 26016    0     0  12.1M      0 --:--:-- --:--:-- --:--:-- 24.8M
# HELP lfc_cache_size_limit lfc_cache_size_limit
# TYPE lfc_cache_size_limit gauge
lfc_cache_size_limit 12700352512 1711465391821
# HELP lfc_hits lfc_hits
# TYPE lfc_hits gauge
lfc_hits 0 1711465391821
# HELP lfc_misses lfc_misses
# TYPE lfc_misses gauge
lfc_misses 374 1711465391821
# HELP lfc_used lfc_used
# TYPE lfc_used gauge
lfc_used 111 1711465391821
# HELP lfc_writes lfc_writes
# TYPE lfc_writes gauge
lfc_writes 374 1711465391821
neonvm:~# 

@skyzh skyzh marked this pull request as ready for review March 26, 2024 15:04
@skyzh skyzh requested a review from Bodobolero March 26, 2024 15:04
@skyzh
Copy link
Member Author

skyzh commented Mar 26, 2024

mark ready for review

@Bodobolero
Copy link

Bodobolero commented Mar 26, 2024

@skyzh pls also add the working set size in pages to the sql_exporter as a metric
https://github.com/neondatabase/neon/blob/8dfe3a070cd04dd2310ed07e1f38f4257dd43cd8/vm-image-spec.yaml#L188

select approximate_working_set_size(false);

Copy link

@Bodobolero Bodobolero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

select approximate_working_set_size(false);

is missing in sql_exporter metrics

Do we really require 1 sec interval for those metrics ?
For working set size and hit ratios a much smaller interval like 1 minute or so would be sufficient.
It is not just using resources in compute but also in our victoriametrics database if we scrape so frequently?

@skyzh
Copy link
Member Author

skyzh commented Mar 26, 2024

I'm using the default scrape interval for sql exporters, which is usually 15 secs (1 sec is for host information).

@skyzh
Copy link
Member Author

skyzh commented Mar 26, 2024

select approximate_working_set_size(false);

will have a separate pull request on the compute side and make it into the next release.

@sharnoff
Copy link
Member

It is not just using resources in compute but also in our victoriametrics database if we scrape so frequently?

These metrics are exclusively consumed by the autoscaler-agent and not persisted anywhere.

The autoscaler-agent fetches metrics (and potentially makes a scaling decision) every 5 seconds.

IMO it's worthwhile to not need to worry about whether those metrics are stale; I don't immediately see a reason not to fetch sql-exporter metrics every 5 seconds (or even every 1 second) - presumably it's pretty cheap?

@skyzh
Copy link
Member Author

skyzh commented Mar 26, 2024

There are some aggregations and joins on the system catalog, and therefore, if there are a lot of tables, it might be slow to execute.

https://github.com/neondatabase/neon/blob/47d2b3a4830f6d5ecb84086e785ec0f913390176/vm-image-spec.yaml#L162-L172

A better approach is to separate two sql exporters: the ones that are cheap to scrape, and the expensive ones. All LFC metrics are basically O(1) operations when being retrieved from the database.

@skyzh
Copy link
Member Author

skyzh commented Mar 28, 2024

This pull request will be merged once the compute + console release is done next week. There are still some leftover works from upgrading the compute, and I will ensure that at the time this pull request gets merged, the metrics I put in this vector config will be available from all neon user projects.

@skyzh skyzh requested a review from Bodobolero March 28, 2024 17:35
Copy link

@Bodobolero Bodobolero left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for adding the working set size

skyzh added a commit to neondatabase/neon that referenced this pull request Mar 29, 2024
ref neondatabase/autoscaling#878
ref neondatabase/autoscaling#872

Add `approximate_working_set_size` to sql exporter so that autoscaling
can use it in the future.

---------

Signed-off-by: Alex Chi Z <chi@neon.tech>
Co-authored-by: Peter Bendel <peterbendel@neon.tech>
Copy link
Member

@sharnoff sharnoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Apologies for the delayed review. Took a while to think this over.

High-level question:

I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per each VM). AFAICT the trade-offs would be:

  • More implementation work required in autoscaler-agent to pass the data around
  • The simplest implementation would probably pull every 5 seconds (although it wouldn't be too tricky to have a different scrape interval...)
  • Rollout would depend on autoscaling -> cplane release (to add per-VM setting), instead of autoscaling -> compute image release
  • We wouldn't have spooky action at a distance w.r.t. sql-exporter split across this repo and neondatabase/neon

What do you think?


Separately, could you prefix "vm-builder: " to the start of this PR title for consistent naming? Helps with organization 🙏

neonvm/tools/vm-builder/files/vector.yaml Outdated Show resolved Hide resolved
Comment on lines +24 to +25
# The data might be delayed as we use the default scrape interval. Some SQLs are not
# trivial to execute, and therefore we don't want them to run every second.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much work would it be to run this at the same scrape interval? (or at least, <5s?)

I ask because it may make the scaling algorithm tricky (specifically: expecting counters to remain the same, not because nothing's happening, but because metrics are delayed). I'm sure it can be worked around, but it'd be good to weigh the options for tech debt like this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we just need to split two sqlexporters (one for quick scrapes, and one for slow scrapes).

@Omrigan Omrigan self-requested a review April 2, 2024 11:48
@skyzh skyzh changed the title add SQL exporter to vector vm-builder: add SQL exporter to vector Apr 2, 2024
Co-authored-by: Em Sharnoff <sharnoff@neon.tech>
@skyzh
Copy link
Member Author

skyzh commented Apr 2, 2024

I think this can also be implemented by telling the autoscaler-agent another port to fetch metrics from (per each VM).

Do you mean that the autoscaler-agent can directly query LFC data by logging in to the database using Postgres protocol + cloud_admin? In this case, do we need to store that data somewhere inside cplane? I thought the reason that we have vector.dev is to keep some history of the data and avoid storing them by autoscaling-agent itself, but I could be wrong...

@skyzh
Copy link
Member Author

skyzh commented Apr 2, 2024

Okay I just re-read https://github.com/neondatabase/autoscaling/blob/main/ARCHITECTURE.md and I think I had some misunderstanding there. So I feel the best approach is:

  • Scrape the data directly using SQL. The autoscaling-agent will connect to the database using Postgres protocol and run SQL queries to collect the data.
  • Find a way to access these data. As the agent runs outside of the compute pod, it cannot directly log in to cloud_admin. There are 3 approaches: assign a temporary password for cloud_admin; modify ip list so that the agent can connect without a password; create a proxy inside the pod on some port to delegate SQL queries.

@skyzh
Copy link
Member Author

skyzh commented Apr 2, 2024

create a proxy inside the pod on some port to delegate SQL queries

can we make this into vm-monitor?

@sharnoff
Copy link
Member

sharnoff commented Apr 3, 2024

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.

@Omrigan
Copy link
Contributor

Omrigan commented Apr 3, 2024

Having the autoscaler-agent fetch the data via SQL is an interesting idea — hadn't thought of that.

That feels like smashing abstraction layers. Does it yield any practical advantage over below?

I was thinking more like having it fetch metrics from sql-exporter via prometheus (http). It's already exposed from every VM, so the change would be entirely within the autoscaler-agent.

This also, but less so than above. I see now a practical advantage (no bufferization) of bypassing vector. Also, @neondatabase/billing folks say vector is unreliable 🙂 so I guess this is my preference.

@skyzh
Copy link
Member Author

skyzh commented Apr 3, 2024

I can open a pull request on the compute side to have vm-monitor directly exposing the metrics

@skyzh
Copy link
Member Author

skyzh commented Apr 3, 2024

I have a demo pull request ready but not tested yet, feel free to leave some comments before we finalize the code + method of exposing the metrics. neondatabase/neon#7302

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make metrics LFC number of hits, number of misses and working set size available to autoscaling algorithm
4 participants