Metrics not updated on external probe timeout #653
Comments
Thanks for the bug report, @ltagliamonte-dd. "total" should increment if there is a timeout. I'll look into it.
Problem seems to be here: cloudprober/probes/external/external.go Line 501 in 30fb6ab
We do results processing in the goroutine that gets canceled on timeout, so on a timeout the counters are never updated. We should move this out.
Also, add tests that would have caught this bug. Ref: #653 PiperOrigin-RevId: 393020350
@ltagliamonte-dd This should now be fixed in the "master" branch (master tag). If you can verify with the latest changes, that would be great. I'll try to cut the next release (milestone: v0.11.3) in a week or so.
Closing this issue now. Please feel free to reopen if it doesn't work. You can track the release here:
Thanks for the quick resolution, will test it out and reopen if necessary.
Thank you @manugarg, I was able to re-test and it looks like the issue has been fixed.
We use cloudprober with an external probe that executes a check on Consul.
When the external probe hits a timeout, the Prometheus metric "total" for that probe doesn't get incremented.
In the logs I can clearly see that the external probe hits the timeout:
But the Prometheus metric doesn't get any increments:
I'm making the probe fail with an iptables rule (it can't connect to Consul for the check). As soon as I make the probe pass again (by removing the iptables rule), the Prometheus "total" counter starts updating again with the correct number (it then also reports the number of checks that failed).
The problem with this behavior is that if the probe keeps timing out, the configured alert will never trigger, because the metric never updates and my alert is configured on the diff between the total and success counters.
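The effect on the alert can be shown with a small arithmetic sketch in Go. The `sample` type and `failuresBetween` function are hypothetical, and comparing two scrapes of the counters is an assumption about how the diff is evaluated; the actual alert rule is not shown in this report. When neither counter moves, the difference stays at zero, so a threshold on it cannot fire.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of the two counters at scrape time.
type sample struct {
	total, success int64
}

// failuresBetween returns how many probe runs failed between two scrapes,
// computed as the growth in total minus the growth in success.
func failuresBetween(prev, cur sample) int64 {
	return (cur.total - prev.total) - (cur.success - prev.success)
}

func main() {
	// Buggy behavior: five timeouts occur, but neither counter changes.
	frozen := failuresBetween(sample{100, 100}, sample{100, 100})

	// Expected behavior: each timeout increments total only.
	counted := failuresBetween(sample{100, 100}, sample{105, 100})

	fmt.Println(frozen, counted) // prints "0 5"
}
```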
Currently we are using cloudprober@v0.11.0. I didn't find any existing issue related to this behavior.
Flat line in the monitoring system shows this behavior as well:
![Screen Shot 2021-08-23 at 12 55 45 PM](https://user-images.githubusercontent.com/51684360/130511044-7a261d02-b08b-4de5-ba3f-defa0097eb02.png)