This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

[Bug] Cloudprober stops working #144

Closed
Daxten opened this issue Jul 30, 2018 · 10 comments

@Daxten

Daxten commented Jul 30, 2018

We are using Cloudprober to ping ~20 hosts currently. From time to time it stops working, without crashing the container. The HTTP Endpoint still works, but there are no new results generated.
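
For reference, a setup like this usually boils down to a config along the following lines (a minimal sketch with placeholder host names and intervals, not the exact config in use):

    # Minimal sketch of a ping-probe setup like the one described.
    # Host names and intervals are placeholders.
    probe {
      name: "ping_hosts"
      type: PING
      targets {
        host_names: "host1.example.com,host2.example.com"
      }
      interval_msec: 10000
      timeout_msec: 5000
    }

    # Probe results are exposed over the built-in HTTP endpoint,
    # which Prometheus then scrapes.
    surfacer {
      type: PROMETHEUS
    }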

@manugarg
Contributor

@Daxten Thanks for the report. Can you tell me a little bit more about your setup?

  • Are you running on GCE?
  • Can you access container logs?
  • Where are you writing your data?
  • Do symptoms recover on their own, i.e. do you start seeing results after a while without taking any action?
  • Can you share your config (you can scrub the internal details, of course)?

@manugarg manugarg self-assigned this Jul 30, 2018
@Daxten
Author

Daxten commented Jul 30, 2018

Wow, thanks for getting back to me so fast!

  • We are running Rancher with Cattle; I think the problem can be boiled down to "we are using a basic Docker container".
  • Yes, I can access the container logs in that case. I will take a closer look the next time it happens, but at least STDOUT didn't show anything interesting last time. Is there anywhere else I should look?
  • We are using Prometheus to poll the data.
  • No, it does not recover, and our health-check service (port opens / returns a sane HTTP response code) doesn't mark it as failed (which would recreate the container).
  • Yes, I can share the config with you tomorrow.

@manugarg
Contributor

I think I'll wait for config before commenting further. It does sound like a bug in cloudprober that is getting surfaced by something in your environment.

Also, which cloudprober are you running: the latest cloudprober image, a build from source, or the last release (0.9.3)?

@Daxten
Author

Daxten commented Aug 1, 2018

Hi, I sent you the config (via mail).

With 0.9.3 the problem persisted until a restart, I think.

I switched to latest about a week ago, and with this version it seems to start producing results again on its own after a few hours.

@manugarg
Contributor

manugarg commented Aug 1, 2018

@Daxten I got the config, thanks! I'll certainly recommend using the latest cloudprober image instead of 0.9.3 -- there have been some bug fixes since that version.

I am not sure why cloudprober would stop working. Thinking through a few possibilities aloud:

  1. It's possible there is some bug in the Prometheus surfacer. One way to rule this out is to look at the logs: by default, cloudprober logs all probe results, so if the issue is in the surfacer, logs will keep being generated even when data stops showing up on the Prometheus handler.

  2. It's possible that Prometheus is rejecting the data for some reason. I remember someone had an issue where their clocks were not synchronized, so Prometheus rejected the data thinking it was old. We added an option to not include the timestamp in the Prometheus output (see the sketch after this list):

    optional bool include_timestamp = 2 [default = true];

     (Also, you should be able to rule this out using the logs.)

  3. It's possible that cloudprober's internal global resolver has run into some bug. To rule this out, you could look at the sysvars variables. Cloudprober exports some sysvars variables by default, for example uptime_msec. Try to access that variable and see if it continues to generate data when the other data disappears.

  4. There is also a possibility of some bug in the HTTP probe code. That could be ruled out using the same method.
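
As a rough illustration of point 2, a surfacer block that turns the timestamp off might look like this (a minimal sketch; the prometheus_surfacer field name is assumed from the surfacer proto quoted above, so adjust to your actual config):

    # Sketch: disable timestamps in the Prometheus output.
    # Field names assumed from the surfacer proto quoted above.
    surfacer {
      type: PROMETHEUS
      prometheus_surfacer {
        include_timestamp: false
      }
    }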

Also, you said you didn't see anything in the logs. Can you try mapping /tmp as a volume ("-v /tmp:/tmp") and see if it generates any logs? I think cloudprober will try to log to /tmp if it's not running on GCE (on GCE, logs go to Stackdriver Logging).

@manugarg
Contributor

manugarg commented Aug 3, 2018

Regarding my last comment about logging, I verified that our docker image's default command line is set to log to stderr:

ENTRYPOINT ["/cloudprober", "--logtostderr"]

So unless you're overriding the docker image entrypoint, cloudprober should be logging to stderr rather than a file under /tmp.

@Daxten
Author

Daxten commented Aug 6, 2018

Hey,
thanks for helping out so much. I sent you the log just now. I will create a checklist of the other points you mentioned and check them the next time it happens.

@manugarg
Contributor

Hi @Daxten,

I got the logs. I also responded over email, but to close the loop here:

===
Looking at the logs, it seems that this is not probe specific as all probes stop outputting the data at around the same time. I'll try to add some more logging and profiling, and provide you with a different container image version. Sorry, it may require some more work.

Just to collect some more info:

  • You're not seeing any CPU/memory-exhaustion issues on the node or container?
  • Can you share your deployment environment with me: how is the Docker container run, and which volumes are mapped into it? If there is a Kubernetes config, can you share that as well?
  • Which Docker image are you running? I recently cut release 0.9.4; maybe you can pin your deployment to that version so that we have something definitive to work with.
===

I improved the logging in the last couple of changes. Can you retry with the "latest" container?

@manugarg
Contributor

manugarg commented Dec 8, 2018

@Daxten, I was wondering if you're still experiencing this issue. Can we close it if you're not?

Thanks,
Manu

@manugarg
Contributor

Closing this due to inactivity. Please feel free to reopen if it's still a problem. I'll be more than happy to debug this with you. Cheers.
