
Add Prometheus metrics #694

Closed
wants to merge 5 commits

Conversation

@avanier (Contributor) commented Feb 9, 2022

This pull request implements #627.

The goal here is to expose some basic metrics over a Prometheus endpoint.

This is very much a draft PR for now; I expect to rebase it into something cleaner before taking it out of draft state. (Edit: undrafted, yay! \o/)
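
For readers skimming this thread, the endpoint under discussion has roughly the following shape. This is a minimal sketch assuming Echo and the VictoriaMetrics/metrics package, with an illustrative metric name — not necessarily the exact code in this PR:

```go
package main

import (
	"net/http"

	"github.com/VictoriaMetrics/metrics"
	"github.com/labstack/echo/v4"
)

func main() {
	e := echo.New()

	// A counter other parts of the app would increment (name is illustrative).
	pushed := metrics.NewCounter("listmonk_messages_pushed_total")
	pushed.Inc()

	// Serve everything registered with the metrics package, plus Go process
	// metrics (the `true` flag), in the Prometheus text exposition format.
	e.GET("/api/metrics", func(c echo.Context) error {
		c.Response().Header().Set(echo.HeaderContentType, "text/plain; charset=utf-8")
		c.Response().WriteHeader(http.StatusOK)
		metrics.WritePrometheus(c.Response(), true)
		return nil
	})

	e.Logger.Fatal(e.Start(":9000"))
}
```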

@avanier avanier force-pushed the feature/add-prometheus-metrics branch from ead525c to 08b7d62 on February 9, 2022 08:25
@mr-karan (Collaborator) left a comment

I've gone through the PR, and yeah, this would require a few design changes. Overall the approach looks fine. It needs some more thought on designing the API for the internal metrics package; I've suggested a simpler approach, please have a look.

I've not yet gone through the Echo middleware part, shall do so once these changes are finalised.

Outdated review comments (resolved) on: cmd/admin.go, cmd/handlers.go (×2), cmd/main.go (×2), config.toml.sample, internal/bounce/bounce.go, internal/metrics/handler.go (×3)
@avanier avanier marked this pull request as ready for review February 9, 2022 10:49
@avanier avanier marked this pull request as draft February 9, 2022 10:49
@avanier avanier changed the title from "Feature/add prometheus metrics" to "Add Prometheus metrics" on Feb 9, 2022
@avanier avanier force-pushed the feature/add-prometheus-metrics branch from 8fe9250 to 86801f2 on February 22, 2022 11:13
@avanier avanier marked this pull request as ready for review February 22, 2022 11:14
@avanier (Contributor, Author) commented Feb 22, 2022

Almighty, I think this might be able to come out of draft now.

I believe this should be in a better form than previously and closer to what you had in mind.

One thing you'll note is that I did not go with the string concat to declare the metric name and labels inline. Hopefully this is an acceptable compromise. I understand how the string concat gets it out of the way, line-count-wise, but while metric names are technically labels at storage time on most backends, they're never referred to as such in the Prometheus or VictoriaMetrics codebases. I felt that keeping the reference to the metric name separate from the labels was worthwhile for "standards adherence".
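
For illustration, here is a minimal sketch of the two styles being contrasted, assuming the VictoriaMetrics/metrics package (metric names are illustrative, not the PR's actual identifiers):

```go
package main

import (
	"fmt"
	"os"

	"github.com/VictoriaMetrics/metrics"
)

// Metric name kept as its own reference, separate from the labels.
const reqMetric = "listmonk_http_requests_total"

func recordRequest(path, status string) {
	// Name and labels are joined only at the call site.
	metrics.GetOrCreateCounter(fmt.Sprintf(`%s{path=%q,status=%q}`, reqMetric, path, status)).Inc()

	// The inline "string concat" alternative folds the name and labels into
	// one ad-hoc literal wherever the metric is touched:
	//   metrics.GetOrCreateCounter(`listmonk_http_requests_total{path="/api/lists",status="200"}`).Inc()
}

func main() {
	recordRequest("/api/lists", "200")
	metrics.WritePrometheus(os.Stdout, false) // dump the registered series to stdout
}
```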

Either way, if this doesn't pass muster, it's not a big refactor. I'm hoping this is unobtrusive enough to slip by. 😄

(Also, this is a mess of commits. I might rebase that into something better organized.)

Edit: Rebased on master and split into new cleaner commits.

@avanier avanier force-pushed the feature/add-prometheus-metrics branch 3 times, most recently from b8e8a92 to 2cd9384 on February 22, 2022 11:30
@avanier (Contributor, Author) commented Mar 1, 2022

@knadh && @mr-karan. May I bump for review?

@avanier avanier force-pushed the feature/add-prometheus-metrics branch from 2cd9384 to 99ca9b2 on March 1, 2022 21:46
@mr-karan (Collaborator) commented Mar 2, 2022

@avanier I'll check out these changes locally and review them by the end of this week.

@mr-karan (Collaborator) commented Mar 5, 2022

@avanier Firstly, excellent work so far :) Most of the things are in place; I just had to do some minor refactoring on how the config is loaded, on function/variable naming, and to remove some redundant code that I found.

I was not really sure if I could push the changes to your fork (if you give write access, maybe I can). For now, I've pushed the changes to this repo itself and you can see what has changed: bc7acfe. Please let me know if the proposed changes are fine.

I have some minor concerns before we land this upstream:

  1. Do we keep the metrics config in config.toml (or in the UI)? If keeping it inside config.toml is fine, then when users upgrade, they won't necessarily have this config block. To accommodate that, I've loaded a default config inside the app: https://github.com/knadh/listmonk/blob/feature/add-prometheus-metrics/cmd/init.go#L317-L320. This should be fine IMO (a sketch of the pattern follows after this list).

  2. Over here, https://github.com/knadh/listmonk/blob/feature/add-prometheus-metrics/internal/metrics/http.go#L60-L63, we are using reflection to see if the handler's function type is indeed a Not Found handler. Since the type is a function, I guess there's no other way to check apart from using reflect. But this check will happen for each and every request, and having a reflect op on every single incoming request may not be a good idea. @knadh What do you think?
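
Regarding point 1, a minimal sketch of that defaults pattern, assuming koanf's confmap provider (the keys are illustrative; the linked init.go is the authoritative version):

```go
package main

import (
	"fmt"
	"log"

	"github.com/knadh/koanf"
	"github.com/knadh/koanf/providers/confmap"
)

func main() {
	ko := koanf.New(".")

	// Register defaults for the new [metrics] block first, so installations
	// upgrading with an older config.toml (no [metrics] section) still get
	// working values. File/env/DB providers loaded afterwards override these.
	if err := ko.Load(confmap.Provider(map[string]interface{}{
		"metrics.export_process_metrics": true,
		"metrics.export_http_metrics":    true,
	}, "."), nil); err != nil {
		log.Fatal(err)
	}

	fmt.Println(ko.Bool("metrics.export_http_metrics")) // true even without a [metrics] block
}
```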

Apart from this, it mostly LGTM! @knadh can do a final review once we discuss the above points.
Thanks!

@knadh (Owner) commented Mar 5, 2022

  1. It should be in the UI, probably under Performance. config.toml only has config that's required for the app to boot.

  2. Yeah, it's not ideal to reflect on every single request. Can't we just do response.status == 404 or something? Also, why should 404 be excluded from metrics?

@mr-karan (Collaborator) commented Mar 5, 2022

It should be in the UI, probably under Performance. config.toml only has config that's required for the app to boot.

Okay @avanier, we need to take that into consideration in this PR. I can help out with the frontend changes if you want.

Also, why should 404 be excluded from metrics?

We're not excluding 404s from the metrics. Since we are embedding the HTTP handler path inside metric labels, there's a risk of high cardinality (which Prometheus, or any time-series DB, doesn't like, as it makes things slower for them).

For example, on visiting /abc, the metric will look like:

requests_total{path="/abc", status="404"}

There could be millions of requests to random paths that don't exist, and they would end up creating unique metric labels (each unique label set is basically a unique time series). This would have downstream effects on Prometheus or whatever metrics collection server is used. So the intended behavior should be: we define a hard-coded string (like /not-found) and use it as the path value for anything that 404s.

requests_total{path="/not-found", status="404"}

Can't we just do response.status == 404 or something?

Hm, yeah, we probably can. This should work, I think.
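
A rough sketch of what that middleware could look like with the status check and the /not-found normalisation, again assuming Echo and the VictoriaMetrics/metrics package (metric names are illustrative):

```go
package main

import (
	"fmt"
	"net/http"
	"time"

	"github.com/VictoriaMetrics/metrics"
	"github.com/labstack/echo/v4"
)

// requestMetrics records a counter and a duration per request, and collapses
// 404s onto a fixed "/not-found" path so unknown URLs can't mint unbounded
// label values.
func requestMetrics(next echo.HandlerFunc) echo.HandlerFunc {
	return func(c echo.Context) error {
		start := time.Now()
		err := next(c)

		// c.Path() is the registered route template (e.g. "/campaign/:id"),
		// which already keeps parameterised routes down to one series each.
		path := c.Path()
		status := c.Response().Status
		// The error check covers unmatched routes where the 404 hasn't been
		// written to the response yet at this point in the chain.
		if status == http.StatusNotFound || err == echo.ErrNotFound {
			path, status = "/not-found", http.StatusNotFound
		}

		metrics.GetOrCreateCounter(fmt.Sprintf(`listmonk_http_requests_total{path=%q,status="%d"}`, path, status)).Inc()
		metrics.GetOrCreateSummary(fmt.Sprintf(`listmonk_http_request_duration_seconds{path=%q}`, path)).UpdateDuration(start)
		return err
	}
}

func main() {
	e := echo.New()
	e.Use(requestMetrics)
	e.GET("/", func(c echo.Context) error { return c.String(http.StatusOK, "ok") })
	e.Logger.Fatal(e.Start(":9000"))
}
```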

@avanier (Contributor, Author) commented Mar 7, 2022

@mr-karan I gave this a look. Many thanks for the edits! 🙏 I've cherry-picked bc7acfe into this branch. Normally you should have been able to push as I did enable the Allow edits by maintainers ✔️ on the PR. ¯\_(ツ)_/¯

That said, I do have a few questions:

  • From what I can see in your edits, this makes Prometheus metrics enabled by default. Most applications I've seen implement a way to disable metrics altogether, though I don't see any issue with this being enabled by default.
  • I see you've removed all the code relating to _info metrics. Those are pretty useful in Kubernetes environments to trace deployments rolling out or to correlate version changes with other issues. Would you mind if I brought that back in? Did you see any issues I could address with how that code was written?
  • I see you've made the metrics namespace of listmonk a configurable option instead of a hardcoded value. Is there a specific reason for this? Normally, if a user had multiple Listmonks running, standard Prometheus doctrine would have you scrape each instance while adding extra labels (in Kubernetes this would be the default behaviour). Having different metric namespaces would make writing queries that aggregate multiple Listmonks hairy in the best of scenarios, and it also brings documentation challenges, since the namespace would have to be factored in whenever we refer to those metrics. I don't think we'd be doing users a service by making this configurable.

As for the reflection thing, I've amended that to a simple 404 check. I did a quick double-check, and as far as I can tell the end result should be identical.

@knadh Regarding the presence of these settings in the UI, I would personally keep it only in the config file. Considering there are no Prometheus solutions out there that can be configured by clicking, I don't think we'd be leaving too many users of that ecosystem stranded. Either way, I would argue this could be implemented separately if needed? (That and my Vue.js is more than rusty ... 😅 )

@knadh (Owner) commented Mar 7, 2022

@knadh Regarding the presence of these settings in the UI, I would personally keep it only in the config file. Considering there are no Prometheus solutions out there that can be configured by clicking, I don't think we'd be leaving too many users of that ecosystem stranded.

While this is subjective, as I said, listmonk only has config it requires to boot in the TOML file. That's a simple standard that can be followed; otherwise we end up in situations like this, where every setting requires deliberation on where it should go.

@mr-karan @avanier if the API is behind the BasicAuth, then it can just be on all the time, maybe.

@mr-karan (Collaborator) commented Mar 7, 2022

From what I can see in your edits, this would make Prometheus metrics enabled by default. Most applications I've seen always implemented a way to disable metrics altogether. I don't see any issues with this being enabled by default.

That's correct. I've implemented it in such a way that if the config block isn't present (which it won't be for people upgrading), reasonable defaults are loaded and Prometheus metrics are exposed.

I don't think having an option to disable metrics would make a significant difference, since we are collecting metrics inside the function blocks anyway. Disabling them would simply stop exposing the API handler/response. I don't think it's a big deal; it's an internal API anyway.

if the API is behind the BasicAuth API key, then it can just be on all the time maybe.

Yep, it'll be on all the time as discussed ^ .

I see you're removed all the code relating to _info metrics. Those are pretty useful to have in Kubernetes environments to have a trace of deployments rolling out or correlate version changes to other issues

Hm, seconded. Although I did have a chat with @knadh about exposing metrics like build/version info, and exposing that metric on every single request seemed wasteful. We can reconsider it, though.

I don't think we'd be doing users a service by making this configurable.

Fair point, I agree with you. We can remove it from the config and hardcode it as a constant in the app as listmonk.

@avanier (Contributor, Author) commented Mar 8, 2022

Alrighty, I'll get on making the changes.

I started having a proper-er look and the Vue part is more straightforward than I expected. That said, I'm still facing significant challenges understanding how koanf is used.

Right now, this is my general understanding of how the settings API and koanf should be interacting. (Excuse the crude diagram)

sequenceDiagram
  participant main as main.go
  participant init as init.go
  participant koanf as koanf
  participant config as config file
  participant postgres
  participant environment
  participant request as HTTP request

main->>init: init() 
init->>config: initConfigFiles()
config->>koanf: set config values from config file
init->>environment: [...] initConfigFiles()
environment->>koanf: set config values from environment
init->>postgres: initDB()
postgres->>koanf: set config values from database

main->>init: main()
init->>koanf: initConstants() queries koanf
koanf->>init: returns config values<br />this is where the app settings will be maintained for the API
main->>init: initMetrics()
init->>init: metrics module initialized

main->>main: Application is happy and runs

request->>main: request to /api/settings<br />this actually happens in handlers.go ... but this graph is big enough
main->>request: return App constants as JSON

Right now I've pushed c5b92ef to a separate branch. If I query /api/settings, everything I've set returns false, most likely because nil is being evaluated as falsey. I'm somehow expecting koanf to do "something" with the app constants so that the right values are returned at query time, but that assumption is obviously incorrect on my part. @mr-karan, I could use pointers on how to get this to work correctly, if I may ask. 🙏

Sidenote: I love how the app SIGHUPs itself when settings are updated, that's a really clean and simple way to handle this I hadn't seen before. 🤘

Once I have the /api/settings part working correctly, "I think" I should be fine with putting the front end part together.

edit: Inserting values manually into the settings table in Postgres, I can see this behaves as expected in the API. So I'm really missing the link between koanf and the API endpoint somehow.

@knadh (Owner) commented Mar 8, 2022

Ah, I thought we weren't doing config or UI settings but planning to keep metrics always on behind the standard BasicAuth API.

@avanier (Contributor, Author) commented Mar 8, 2022

Errrr ... You had me under the impression that UI settings were a must for any exposed setting.

Right now Prometheus would export the following:

  • Select bespoke metrics like bounces, ship-rate, or other things we manually implement as always-on
  • HTTP metrics for all routes through an Echo middleware as a toggle-able
  • Golang process metrics as a toggle-able

If we make none of those things configurable ... that would solve all those issues >_< .

I wouldn't think of the Golang metrics as too expensive to collect and export, but few of them are actually very useful in production unless you're a Listmonk contributor who knows how to do Golang performance optimization.

At-large HTTP metrics are usually the bread and butter of Prometheus queries, but they run the risk of being high cardinality depending on the size of your deployment. (Also what constitutes high cardinality depends on which Prometheus backend you use, so YMMV.) Either way, they're generally high-ish value.

The bounce metrics and ship-rate metrics are indicative of immediately actionable issues.


TL;DR

@knadh I don't mind going the extra mile and doing the UI work for toggles, but if you want to avoid this PR becoming a 🐰 🕳️, relegating the toggles to a separate PR could make a lot of sense.

@knadh (Owner) commented Mar 8, 2022

@mr-karan @avanier if the API is behind the BasicAuth, then it can just be on all the time, maybe.
#694 (comment)

Yep, it'll be on all the time as discussed ^ .
#694 (comment)

Making some toggleable and others not isn't consistent. It's okay to have the metrics on all the time. I'm guessing that internally it's very cheap to just do Incr() in a bunch of places. Maybe we could do a benchmark and then decide on this?

@mr-karan (Collaborator) commented Mar 9, 2022

+1 It makes sense for Listmonk to always expose metrics out of the box, as a good default.

The current config looks like:

# Prometheus Metrics.
[metrics]
export_process_metrics = true
export_http_metrics = true

Neither of these really needs to be "toggled"; as a sane default, both can always be true.

At-large HTTP metrics are usually the bread and butter of Prometheus queries, but they run the risk of being high cardinality depending on the size of your deployment. (Also what constitutes high cardinality depends on which Prometheus backend you use, so YMMV.) Either way, they're generally high-ish value.

We took care of the cardinality issue by ensuring the labels aren't dynamic in any metric. Don't think this will be a problem in any listmonk deployment.

Maybe we could do a benchmark and then decide on this?

+1 @avanier, we can probably run HTTP load tests on some API endpoints. On all of the API endpoints we have a middleware which does .Increment/.UpdateDuration, and we can compare against the same endpoints without metrics.
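
As a starting point for that comparison, a throwaway micro-benchmark of the per-request hot path could look like this, assuming the VictoriaMetrics/metrics package (the series name is made up). The real test would still be an HTTP load test with the middleware on and off:

```go
// bench_test.go — run with: go test -bench=. -benchmem
package bench

import (
	"testing"

	"github.com/VictoriaMetrics/metrics"
)

// BenchmarkCounterInc measures the hot path being discussed: a single counter
// increment per request.
func BenchmarkCounterInc(b *testing.B) {
	c := metrics.GetOrCreateCounter(`listmonk_http_requests_total{path="/api/lists",status="200"}`)
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		c.Inc()
	}
}
```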

@knadh (Owner) commented Apr 3, 2022

Any updates on this?

@avanier (Contributor, Author) commented Apr 4, 2022

I was under the impression that there was a desire to do performance testing on this. Otherwise, going over what we discussed, would I be correct in assuming the only change left is to remove the configuration toggles? I could probably get that in by the end of the week.

@sbusso commented Jun 26, 2022

This looks like an almost-done PR and an interesting feature. Is there any chance of getting it merged at some point?

@knadh (Owner) commented Jun 26, 2022

Yep, this is on my to-do to review and merge in the next release.

@knadh knadh self-assigned this Jun 26, 2022
@knadh (Owner) commented Jul 13, 2022

Whew, I finally picked this up (for the upcoming release). I was making a few changes, primarily moving the config away from the config file into the settings UI. While playing around with the /api/metrics endpoint, I realised that the unique click/view links sent in e-mails would also be logged. In our production instance, for example, that's millions of unique URLs (with unique campaign+subscriber UUIDs), all of which the metrics lib will track in memory.

That's not ideal. Thoughts? @avanier @mr-karan

@knadh (Owner) commented Jul 14, 2022

Thinking about it, HTTP handler metrics aren't very meaningful in listmonk. There are only a handful of external URIs, and they vary widely with unique params. There's not much value in tracking the ~4 public handlers.

@avanier (Contributor, Author) commented Jul 15, 2022

@knadh Sorry for the late answer. If I remember correctly, I had limited the path depth collected for those metrics to avoid this cardinality explosion. It's also not the most important thing to monitor. It should be fine to drop the metric and revisit this later if needed.

(The only legitimate use-case I can think of right now is catching errors on missing assets or the like. But that's something that could just as easily be done at the level of an ingress controller or a CDN.)

edit: After double-checking, I indeed did not commit the change about path depth.
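
(For the record, the uncommitted path-depth idea would have looked something like this hypothetical helper: keep only the first n path segments before using the value as a label.)

```go
package main

import (
	"fmt"
	"strings"
)

// truncatePath is a hypothetical sketch of the uncommitted "limit the path depth"
// idea mentioned above: keep only the first n segments of a request path so deep,
// arbitrary URLs cannot mint unbounded label values.
func truncatePath(p string, n int) string {
	parts := strings.Split(strings.Trim(p, "/"), "/")
	if len(parts) > n {
		parts = parts[:n]
	}
	return "/" + strings.Join(parts, "/")
}

func main() {
	fmt.Println(truncatePath("/campaign/abc-123/def-456/click", 1)) // "/campaign"
	fmt.Println(truncatePath("/health", 2))                         // "/health"
}
```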

@ChrisTG742 (Contributor) commented

Does this implementation put additional load on your server even if one doesn't use Prometheus at all? Or is it just some sort of REST endpoint that only becomes active when called via the API?
I'm asking since there are other tools out there that have implemented rather CPU-hungry and sluggish Prometheus support, and I'm afraid listmonk could lose its lightweight character, too.

@knadh (Owner) commented Sep 27, 2022

@ChrisTG742 apps emit the Prometheus exposition format (simple string formatting), which any system that can read the format can consume; it doesn't have to be Prometheus. From a load perspective, there wouldn't be an issue. But what we were discussing was whether fine-grained per-URI metrics are actually useful in an app like listmonk, which only has a tiny handful of public handlers.
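
For illustration, the exposition format is nothing more than plain text lines like these (names and values are made up):

listmonk_campaign_bounces_total{type="hard"} 12
go_goroutines 8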

(The only legitimate use-case I can think of right now is catching errors on missing assets or the like. But that's something that could just as easily be done at the level of an ingress controller or a CDN.)

@avanier I agree. Let's close this PR and revisit metrics support in the future when there's a stronger usecase. Thanks!

@knadh knadh closed this Sep 27, 2022