Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose metrics for external consumption #627

Open
avanier opened this issue Dec 16, 2021 · 16 comments
Open

Expose metrics for external consumption #627

avanier opened this issue Dec 16, 2021 · 16 comments
Assignees
Labels
enhancement New feature or request

Comments

@avanier
Copy link
Contributor

avanier commented Dec 16, 2021

As a listmonk administrator,
I want to have metrics available externally
So that I can consume them and implement things like alerting with outboard systems.


Context

In our particular deployment of listmonk we're processing bounces from SES and the relevant subscribers are not being blocklisted automatically. Exporting metrics that we could hook into our monitoring system would help bring issues to our attention as they arise.

Misc

I'm specifically thinking of opening a PR to expose Prometheus metrics. I've seen that some of the stuff is already instrumented with go-metrics for internal dashboards, so building on that should be no biggie.

Do note, I'm suggesting exposing Prometheus metrics since I'm used to the tooling but we could just as well do something else like Influx, or Statsd. Whichever the community feels is the best common denominator.

@avanier avanier added the enhancement New feature or request label Dec 16, 2021
@knadh
Copy link
Owner

knadh commented Dec 17, 2021

Hi @avanier, what kind of metrics are you proposing? Can you give some examples?

Prometheus format should be fine.

@avanier
Copy link
Contributor Author

avanier commented Dec 17, 2021

For my particular use-case I know could use exporting something like listmonk_bounces_processed or something along those lines. Otherwise I'm guessing messenger / mailer errors? So generally I'm thinking of things representative of "things that you can do something about and that waiting to discover wouldn't be great".

The Go Prometheus client also has a bunch of "free" ones like build_info that's pretty handy in Kubernetes to watch your deployments roll out.

I'm open to suggestions if there anything that you feel should be part of a first pass.

@knadh
Copy link
Owner

knadh commented Dec 18, 2021

Makes sense. Off the top of my my head:

  • Bounce counts
  • Campaign send rates for running campaigns
  • Campaign counts grouped by status?
  • Subscriber counts grouped by status? (can be an expensive query).

Many of these queries already exist here and here.

@avanier
Copy link
Contributor Author

avanier commented Jan 27, 2022

Hey, status update with good new and bad news.

Good news: I got started and it's looking good. You can check the branch I've got going on here.

Bad news: I likely won't be able to use very many of the queries due to the way Prometheus groks about counters. I mean, I probably could, but making a blocking metrics request is usually an anti-pattern for Prometheus metrics. (It can lead to significant problems at scrape-time which can screw up the metrics themselves once aggregated.)

Silver lining: There's some really nice hooks in Listmonk to latch on, so it shouldn't be too hard to get the metrics we want exposed provided no one objects to me strewing around a bunch of counters all over the place.

Context: I'm a sysadmin by trade, so the engineering part of software engineering is really not my forte. I could absolutely use feedback on how to structure this so it makes sense to maintain in the long run.

@knadh
Copy link
Owner

knadh commented Jan 27, 2022

Was checking out the WIP. The bounce counters increment, but on changing a setting somewhere, listmonk will restart resetting it to 0. Are the metrics supposed to be ephemeral like that?

To return accurate states, eg: sending stats of running campaigns or total bounce counts per campaign etc., a separate goroutine could query the DB for metrics periodically and keep it in a global state somewhere and return on GET /metrics. I am not sure if this is the Prometheus way of doing things though. @mr-karan do you have any thoughts?

@mr-karan
Copy link
Collaborator

a separate goroutine could query the DB for metrics periodically and keep it in a global state somewhere and return on GET /metrics. I am not sure if this is the Prometheus way of doing things though

@knadh Okay, so as I understand going through the thread, these metrics are different from instrumentation metrics that you would typically write. Instrumentation of code via metrics means that, when a thing happens, you notify the metrics collector to update a certain value that it keeps in memory. A very simple example:

query, err := fetchFromDB()
if err!=nil {
  errCounter.increment()
}

The key thing to note here is that, we add metrics in the code path as and when real world state changes and not poll for it (like an external Goroutine). This is because here, source of truth is the app itself and not a DB.

However, in case of Listmonk, things like bounces and some other metrics aren't instrumentation metrics but more like analytical. And since they can't be pre-determined (inside the code path), we'll need to do what @knadh suggested:

a separate goroutine could query the DB for metrics periodically and keep it in a global state somewher

This is not a pattern I've observed usually however. I think this also warrants the usage of https://github.com/mr-karan/store-exporter (self plug 😄), if these metrics can be pulled from DB.

However, if there are certain metrics that aren't recorded on DB but only held within the app, then I think we can bundle the metrics exposition inside listmonk as well.

Sidenote

I mean, I probably could, but making a blocking metrics request is usually an anti-pattern for Prometheus metrics. (It can lead to significant problems at scrape-time which can screw up the metrics themselves once aggregated.)

@avanier Yeah, that's correct. But in the above proposed solution, we can have a separate Goroutine to update the metrics state. When the HTTP request for /metrics comes, it doesn't compute on the fly but it simply flushes the content of metrics map object to the response buffer (hence not blocking).

@avanier
Copy link
Contributor Author

avanier commented Jan 27, 2022

The whole idea is tickling my spider-senses, but I can't find any issue with @mr-karan 's suggestion. I've never done it this way and that store-exporter seems to make a lot of sense. I'll give it a try. 💪

A few random thoughts:

  • @knadh Yes, Prometheus application metrics are meant to be ephemeral. Prometheus is designed to handle counter resets, so that's not an issue either. It's actually a desirable feature since it allows you to see when your application restarts. It's especially useful for clustered applications. Ultimately, for alerting purposes, all we're interested in is the rate, so that disarms a lot edge cases too.
  • If we do this with metrics pulled from the database, we'll be locking ourselves out of the ability of getting granular metrics about which Listmonk instance has done which work if we were to run more than one Listmonk instance in a deployment. I could see this being particularly useful when looking at campaign send rates. (Right now I'm not sure we can do that, can we?)

@knadh
Copy link
Owner

knadh commented Jan 27, 2022

However, in case of Listmonk, things like bounces and some other metrics aren't instrumentation metrics but more like analytical.

If we're just counting bounces as they come, campaign messages as they're sent, subscribers as they're inserted etc., that's all instrumentation I guess? If that's the case, then we can just add counters instead of depending on the DB states.

I could see this being particularly useful when looking at campaign send rates. (Right now I'm not sure we can do that, can we?)

The --passive mode runs a listmonk instance that does not process campaigns. While this can be used to run multiple instances, only one instance connected to the DB should be processing campaigns and sending messages.

@avanier
Copy link
Contributor Author

avanier commented Jan 27, 2022

If we're just counting bounces as they come, campaign messages as they're sent, subscribers as they're inserted etc., that's all instrumentation I guess? If that's the case, then we can just add counters instead of depending on the DB states.

That's usually how I've seen it used. I think if we mean to do "transactional-level accuracy analytics", I think #660 makes more sense for that use-case. Not that you can't use Prometheus for business intelligence, but it wasn't designed for that.

@knadh
Copy link
Owner

knadh commented Jan 29, 2022

the *.Inc() approach makes sense then. Just that it doesn't feel right to bring in Prometheus packages and semantics all across listmonk. A simpler internal package abstraction, something like internal/metrics that wraps the Prometheus metrics package, that can be injected into listmonk's internal packages will be ideal.

@mr-karan
Copy link
Collaborator

Yep, that should be possible ^

I'd also add that maybe using https://github.com/VictoriaMetrics/metrics is a better idea than https://github.com/prometheus/client_golang/ because of a much simpler API and a lot less dependencies.

@knadh
Copy link
Owner

knadh commented Jan 30, 2022

I'd also add that maybe using https://github.com/VictoriaMetrics/metrics is a better idea

+1

@avanier
Copy link
Contributor Author

avanier commented Jan 30, 2022

Ooh, yeah, I'm liking where this is going.

  1. Putting it in a internal/metrics internal package is an idea my sysadmin lizard brain can wrap itself around. I should be able to pull that off and it makes sense. 💪
  2. I had completely forgotten about the VictoriaMetrics libraries! We use their product at work and it's just stellar. Also, the guys there are super nice! 😁

It will take me a while to scrounge up time to put this together, but I'll likely report back within a week or two with something.

@avanier
Copy link
Contributor Author

avanier commented Feb 8, 2022

Here's an update as promised.

Diff Linky

General thoughts:

  • I need a review on how I'm using koanf and the configuration system in general. I have no idea of what I'm doing, but I'm pretty sure I'm doing it wrong.
  • I hadn't registered how barebones the VictoriaMetrics metrics library is and I can't help but feel I'm reinventing the wheel for what the Echo Prometheus middleware provides by default for observing http traffic. I guess I'm taking that kind of stuff for granted, hence why I didn't explicitly state I was expecting to get that out of the metrics too. 😕 This is not an actionnable comment, just a feeling I've yet to process. (I just finished writing this)
  • The way I implemented the function to record bounces is very crude. I do plan to refactor that so it's a bit more generic and better named. The whole thing needs a second pass.
  • Most of my time was spent grokking this rather than getting actual work done. Apparently 🧠 == hard.
  • It might be time for me to actually press the button and make a PR

@knadh && @mr-karan , I could use your feedback before I proceed butchering this thing some more. 😈 😁

@mr-karan
Copy link
Collaborator

mr-karan commented Feb 9, 2022

Nice work @avanier and thanks for this!

Can you create a PR (Draft is fine too). I am unable to leave comments on the diff link.

@avanier
Copy link
Contributor Author

avanier commented Feb 9, 2022

@mr-karan I've collected this into a PR and gave it a once-over.

It should be a bit more readable now.

@knadh knadh self-assigned this Dec 30, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

3 participants