This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

Metrics not updated on external probe timeout #653

Closed
ltagliamonte-dd opened this issue Aug 23, 2021 · 6 comments

ltagliamonte-dd commented Aug 23, 2021

We use cloudprober with an external probe that runs a check against Consul.
When the external probe times out, the Prometheus "total" metric for that probe is not incremented.

In the logs I can clearly see that the external probe hits the timeout:

E0823 19:30:51.106972      74 external.go:482] [cloudprober.consul_probe/consul_us_west_2] context deadline exceeded

But the Prometheus metric is not incremented:

total{ptype="external",probe="consul_probe/consul_us_west_2",dst=""} 

I'm making the probe fail with an iptables rule, so the check can't connect to Consul.

As soon as I make the probe pass again by removing the iptables rule, the Prometheus total counter starts updating again with the correct number, which also includes the checks that failed.

The problem with this behavior is that if the probe keeps timing out, the configured alert never triggers: the metric never updates, and my alert is based on the difference between the total and success counters.
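To make the alerting arithmetic concrete, here is a minimal sketch in Go. The `sample` type, the `failuresBetween` helper, and the counter values are all made up for illustration; they are not part of cloudprober or of the alert rule actually in use. It assumes the alert compares how much "total" and "success" grew between two scrapes.

```go
package main

import "fmt"

// sample is a hypothetical snapshot of a probe's two counters at scrape time.
type sample struct {
	total, success int64
}

// failuresBetween returns how many probe runs failed between two scrapes:
// the growth in total minus the growth in success.
func failuresBetween(prev, cur sample) int64 {
	return (cur.total - prev.total) - (cur.success - prev.success)
}

func main() {
	// Expected behavior: three timed-out runs raise total but not success.
	fmt.Println(failuresBetween(sample{100, 100}, sample{103, 100}))
	// Reported behavior: neither counter moves while the probe times out.
	fmt.Println(failuresBetween(sample{100, 100}, sample{100, 100}))
}
```

With these illustrative values, the first case yields 3 failures and the second yields 0, which is why an alert built on this difference stays silent while both counters are flat.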

We are currently using cloudprober@v0.11.0. I didn't find any existing issue related to this behavior.

A flat line in the monitoring system shows this behavior as well:
[Screenshot: Screen Shot 2021-08-23 at 12 55 45 PM]

@manugarg
Contributor

Thanks for the bug report, @ltagliamonte-dd. "total" should increment if there is a timeout. I'll look into it.

@manugarg manugarg self-assigned this Aug 23, 2021
@manugarg
Contributor

Problem seems to be here:

p.processProbeResult(&probeStatus{

We do results processing in the goroutine that gets canceled on timeout. We should move this out.
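The general shape of such a fix can be sketched as follows. This is an illustration of the pattern only, not cloudprober's actual code: `counters`, `runProbe`, and `demo` are hypothetical names. The worker goroutine only reports an outcome on a buffered channel, and the caller updates the counters after the `select`, so a timeout still counts toward the total.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// counters is a hypothetical stand-in for a probe's total/success metrics.
type counters struct {
	total, success int
}

// runProbe runs check with a timeout. The worker goroutine only sends its
// outcome on a channel; the counters are updated by the caller, so they are
// updated even when the context deadline is exceeded.
func runProbe(parent context.Context, timeout time.Duration, c *counters, check func(context.Context) error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the worker never blocks after a timeout
	go func() { done <- check(ctx) }()

	var err error
	select {
	case err = <-done:
	case <-ctx.Done():
		err = ctx.Err()
	}

	// Result processing happens here, outside the goroutine tied to ctx.
	c.total++
	if err == nil {
		c.success++
	}
}

// demo runs one check that times out and one that succeeds.
func demo() counters {
	var c counters
	slow := func(ctx context.Context) error {
		select {
		case <-time.After(time.Second):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
	fast := func(ctx context.Context) error { return nil }

	runProbe(context.Background(), 20*time.Millisecond, &c, slow)
	runProbe(context.Background(), time.Second, &c, fast)
	return c
}

func main() {
	c := demo()
	fmt.Printf("total=%d success=%d\n", c.total, c.success)
}
```

In this sketch the timed-out run increments the total but not the success count, which is the behavior the issue expects from the "total" metric.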

manugarg added a commit that referenced this issue Aug 26, 2021
Also, add tests that would have caught this bug.

Ref: #653
PiperOrigin-RevId: 393020350
@manugarg manugarg added this to the v0.11.3 milestone Aug 26, 2021
@manugarg
Contributor

@ltagliamonte-dd This should now be fixed in the "master" branch (master tag). If you can verify with the latest changes, that would be great. I'll try to cut the next release (milestone: v0.11.3) in a week or so.

@manugarg
Contributor

Closing this issue now. Please feel free to reopen if it doesn't work. You can track the release here:
https://github.com/google/cloudprober/milestone/11

@ltagliamonte

Thanks for the quick resolution, will test it out and reopen if necessary.

@ltagliamonte-dd
Author

Thank you @manugarg, I was able to re-test and it looks like the issue has been fixed.
