Substracting result of rate/irate gives "no data point" result #1802

Closed
LuboVarga opened this Issue Jul 11, 2016 · 8 comments

LuboVarga commented Jul 11, 2016

What did you do?
I executed a query in the Prometheus console, something like this:

irate(rabbitmq_queue_messages_published_total{env="env1",queue=~"app.legacy.*"}[10m])
-
irate(amqp_response_duration_milliseconds_count{env="env1",job="app"}[10m])

What did you expect to see?
I expected to see the difference of these values:

irate(rabbitmq_queue_messages_published_total{env="env1",queue=~"app.legacy.*"}[10m])
or
irate(amqp_response_duration_milliseconds_count{env="env1",job="app"}[10m])

That is 0 - 0, so I expected a result of 0.

What did you see instead? Under which circumstances?
I got no data when there is no change at the current time (when both rates, or possibly just one of them, evaluate to 0).

  • Prometheus version:

    prometheus, version 0.19.2 (branch: master, revision: 23ca13c)
    build user: root@134dc6bbc274
    build date: 20160529-18:58:00
    go version: go1.6.2

Member

brian-brazil commented Jul 11, 2016

Those have different labels; it sounds like you want group_left, as that's a many-to-one match.

I'd also advise against using irate across services, it's going to be quite racy. rate is what you want.
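For context, an arithmetic operator such as "-" between two instant vectors only produces output for pairs of series whose label sets match, and here one side carries queue while the other carries job, so nothing pairs up. A minimal sketch of the suggestion above, assuming env is the only label the two sides share and that the right-hand side can be aggregated down to one series per env (both are assumptions, not something confirmed in this thread):

```promql
# Many queue series on the left, one aggregated series per env on the right.
# on(env) restricts matching to the env label; group_left allows many-to-one.
rate(rabbitmq_queue_messages_published_total{env="env1",queue=~"app.legacy.*"}[10m])
  - on(env) group_left
sum(rate(amqp_response_duration_milliseconds_count{env="env1",job="app"}[10m])) by (env)
```

The result keeps the labels of the left-hand ("many") side, so there is one output series per queue.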

Author

LuboVarga commented Jul 11, 2016

Thanks for the advice. Now I understand the difference between the "-" and "or" operators (by default, "or" also returns the entries of the right vector that have no matching label set on the left; see the documentation).

Is it possible to match the right vector by a regular expression on a label? On one of the metrics we do not have correctly parsed labels because of a legacy app, so there are no common labels between the left and right operands.

For example:
Left operand series:

rabbitmq_queue_messages_delivered_total{env="dprh5",instance="drab5.asdf.sk:9090",job="rabbit",queue="app1.legacy.consumer@test3-dprh5app1"}    10707
rabbitmq_queue_messages_delivered_total{env="dprh5",instance="drab5.asdf.sk:9090",job="rabbit",queue="appW.legacy.consumer@test3-dprh5app3"}    147047
rabbitmq_queue_messages_delivered_total{env="dprh5",instance="drab5.asdf.sk:9090",job="rabbit",queue="MyAPP.legacy.consumer@test3-dprh5app1"}   38767

Right operand series:

amqp_response_duration_milliseconds_count{env="test3",instance="dprh5app1.asdf.sk:22551",job="app1",statusCode="200"}   5326
amqp_response_duration_milliseconds_count{env="test3",instance="dprh5app3.asdf.sk:22552",job="appW",statusCode="201"}   10748
amqp_response_duration_milliseconds_count{env="test3",instance="dprh5app1.asdf.sk:22550",job="MyAPP",statusCode="200"}  1302

As you can see, on the left-hand side there is a queue label whose first part is the application name. The right-hand operand has a nice job label, where the application name is stored directly. Is there a way to compute the difference of these metrics nicely? (Relabeling is probably the most correct approach, but we would rather not do it now.)

PS: A non-generic way to make the "-" operator work for me is to wrap the left and right operands in sum. Something like this:

sum(rabbitmq_queue_messages_delivered_total{env="dprh5",queue=~"bets.legacy.*@test3-.*"})
-
sum(amqp_response_duration_milliseconds_count{env="test3",job="bets"})
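A more generic variant of the same idea, offered only as a sketch: the PromQL function label_replace can derive a job label from the queue name at query time, so both sides end up with a common label to match on. This was not proposed in the thread; it assumes a Prometheus version that provides label_replace and that every queue name starts with the application name followed by a dot.

```promql
# Overwrite job on the RabbitMQ series with the part of queue before the first dot,
# then aggregate both sides by job so the series match one-to-one.
sum(
  label_replace(
    rabbitmq_queue_messages_delivered_total{env="dprh5"},
    "job", "$1", "queue", "([^.]+)\\..*"
  )
) by (job)
-
sum(amqp_response_duration_milliseconds_count{env="test3"}) by (job)
```

With the sample series above, the left side would yield job values app1, appW and MyAPP, which line up with the job labels on the right.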

Member

brian-brazil commented Jul 11, 2016

What exactly are you trying to measure here?

It's not advised to have job varying for a single target; honor_labels should only be used in very specific use cases.

Author

LuboVarga commented Jul 11, 2016

There are a bunch of applications that communicate through AMQP (a RabbitMQ server). One RabbitMQ server is used by many applications from different environments. Monitoring of the RabbitMQ server is thus labeled env=dprh5, which is not exactly true, as it also serves many other environments (for example, test3). Also, there is a single set of metrics for many applications and no application label exists, as RabbitMQ does not know which exchange is used by which application. We only have an internal rule that a queue name starts with the application name, followed by a dot.

I am trying to write an alert rule on the difference between the count of RPC requests (messages) published to the RabbitMQ server (the rabbitmq_queue_messages_delivered_total metric with "wrong" labels) and the number of responses generated for the other system (amqp_response_duration_milliseconds_count), which is an application metric. This is an integration alert rule, which would also cover, for example, messages lost during processing in the application.

I would like a rule that detects a "permanent" (for example, longer than 5 minutes) increase in the difference between requests in RabbitMQ and responses generated by our application. As processing can take a few seconds, it will also be a challenge (for me) to write a correct condition for detecting that this difference is increasing. I will probably try rate (though I doubt it is the best fit for this task). Then I plan to calculate the minimum difference over some recent past window (using the offset modifier) and compare it with the current minimum. I hope this will be the right solution to my whole problem.
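One possible reading of the offset idea, as a rough sketch only: compare the current request/response gap with the gap five minutes earlier and flag the case where it has grown. The selectors are taken from the sum example earlier in this thread; the 5m window and the "> 0" threshold are illustrative assumptions.

```promql
# Gap between delivered requests and generated responses now ...
(
  sum(rabbitmq_queue_messages_delivered_total{env="dprh5",queue=~"bets.legacy.*@test3-.*"})
  -
  sum(amqp_response_duration_milliseconds_count{env="test3",job="bets"})
)
-
# ... minus the same gap five minutes ago.
(
  sum(rabbitmq_queue_messages_delivered_total{env="dprh5",queue=~"bets.legacy.*@test3-.*"} offset 5m)
  -
  sum(amqp_response_duration_milliseconds_count{env="test3",job="bets"} offset 5m)
)
> 0
```

Because this subtracts raw counter values, a restart of either process (a counter reset) would distort the result; wrapping each selector in increase(...[5m]) instead of using offset is a reset-aware alternative.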

Member

brian-brazil commented Jul 11, 2016

For this sort of problem working just off request rates can miss problems if the difference is small but prolonged. What you really want is a way to capture delay.

As discussed in https://prometheus.io/docs/practices/instrumentation/#offline-processing I'd recommend alerting if the data is taking too long to get processed, either by using timestamps on the messages or by injecting fake messages to act as heartbeats.
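A minimal sketch of the heartbeat approach, assuming the application exports a gauge holding the Unix timestamp of the last heartbeat message it finished processing. The metric name app_last_processed_heartbeat_timestamp_seconds and the 300-second threshold are hypothetical, not taken from this thread.

```promql
# Seconds since the most recent heartbeat was processed, per job;
# returns a series only when that age exceeds five minutes.
time() - max(app_last_processed_heartbeat_timestamp_seconds) by (job) > 300
```

This measures delay directly, so a small but prolonged backlog shows up even when request and response rates look nearly identical.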

Author

LuboVarga commented Jul 11, 2016

Thanks for the assistance and advice. I will try to use it and will probably post the final solution here afterwards. In any case, this is not a bug (just a slightly different matching rule for the "-" operator than for the "or" operator), so I will close this issue.

@LuboVarga LuboVarga closed this Jul 11, 2016

Author

LuboVarga commented Aug 23, 2016

OK, in the end we do not directly monitor what I described here, but these things work for us:

  • monitor for a nonzero message count in the AMQP RPC queue lasting longer than a minute
  • monitor the internal processing duration of messages
  • check that there is at least one publisher response every two hours (we log status codes for each processed message, and there has to be at least one message every two hours)
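The exact rules were not posted, so the expressions below are only a guess at what the three checks above might look like. The gauge rabbitmq_queue_messages, the _sum series, the job="app1" selector and all thresholds are assumptions; only amqp_response_duration_milliseconds_count appears in this thread.

```promql
# 1. Messages waiting in an RPC queue (pair with a one-minute "for" duration in the alert rule).
rabbitmq_queue_messages{queue=~"app1.legacy.*"} > 0

# 2. Average processing duration over the last five minutes above one second.
rate(amqp_response_duration_milliseconds_sum{job="app1"}[5m])
  /
rate(amqp_response_duration_milliseconds_count{job="app1"}[5m])
> 1000

# 3. No responses recorded during the last two hours.
sum(increase(amqp_response_duration_milliseconds_count{job="app1"}[2h])) by (job) == 0
```

The third expression returns nothing if the series is absent altogether, so a missing target would need a separate check.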


lock bot commented Mar 24, 2019

This thread has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

@lock lock bot locked and limited conversation to collaborators Mar 24, 2019
