This repository has been archived by the owner on Nov 5, 2021. It is now read-only.

[Bug] Cloudprober stops working #144

Closed
Daxten opened this issue Jul 30, 2018 · 10 comments

@Daxten

Daxten commented Jul 30, 2018

We are using Cloudprober to ping ~20 hosts currently. From time to time it stops working, without crashing the container. The HTTP Endpoint still works, but there are no new results generated.
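
For reference, a setup like this usually boils down to a config along the following lines (a minimal sketch with placeholder host names and intervals, not the exact config in use):

    # Minimal sketch of a ping-probe setup like the one described.
    # Host names and intervals are placeholders.
    probe {
      name: "ping_hosts"
      type: PING
      targets {
        host_names: "host1.example.com,host2.example.com"
      }
      interval_msec: 10000
      timeout_msec: 5000
    }

    # Probe results are exposed over the built-in HTTP endpoint,
    # which Prometheus then scrapes.
    surfacer {
      type: PROMETHEUS
    }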

@manugarg
Contributor

@Daxten Thanks for the report. Can you tell me a little bit more about your setup?

  • Are you running on GCE?
  • Can you access container logs?
  • Where are you writing your data?
  • Do symptoms recover on their own, i.e. do you start seeing results after a while without taking any action?
  • Can you share your config (you can scrub the internal details, of course)?

@manugarg manugarg self-assigned this Jul 30, 2018
@Daxten
Author

Daxten commented Jul 30, 2018

Wow, thanks for getting back to me so fast!

  • We are running Rancher with Cattle; I think the problem can be boiled down to "we are using a basic Docker container".
  • Yes, I can access the container logs in that case. I will take a closer look the next time it happens, but at least STDOUT didn't show anything interesting last time. Is there anywhere else I should look?
  • We are using Prometheus to poll the data.
  • No, it does not recover, and our health-check service (port opens / returns a sane HTTP response code) doesn't mark it as failed (which would recreate the container).
  • Yes, I can share the config with you tomorrow.

@manugarg
Contributor

I think I'll wait for config before commenting further. It does sound like a bug in cloudprober that is getting surfaced by something in your environment.

Also, which cloudprober are you running: the latest cloudprober image, a build from source, or the last release (0.9.3)?

@Daxten
Author

Daxten commented Aug 1, 2018

Hi, I sent you the config (via mail).

With 0.9.3 the problem persisted until a restart, I think.

I switched to latest about a week ago, and with this version it seems to start producing results again on its own after a few hours.

@manugarg
Contributor

manugarg commented Aug 1, 2018

@Daxten I got the config, thanks! I'll certainly recommend using the latest cloudprober image instead of 0.9.3 -- there have been some bug fixes since that version.

I am not sure why cloudprober would stop working. Thinking through a few possibilities aloud:

  1. It's possible there is some bug in the Prometheus surfacer. One way to rule this out is to look at the logs: by default, cloudprober logs all probe results, so if the issue is in the surfacer, logs will keep being generated even when data stops showing up on the Prometheus handler.

  2. It's possible that Prometheus is rejecting the data for some reason. I remember someone had an issue where their clocks were not synchronized, so Prometheus rejected the data thinking it was old. We added an option to not include the timestamp in the Prometheus output (see the sketch after this list):

    optional bool include_timestamp = 2 [default = true];

     (Also, you should be able to rule this out using the logs.)

  3. It's possible that cloudprober's internal global resolver has run into some bug. To rule this out, you could look at the sysvars variables. Cloudprober exports some sysvars variables by default, for example uptime_msec. Try to access that variable and see if it continues to generate data when the other data disappears.

  4. There is also a possibility of some bug in the HTTP probe code. That could be ruled out using the same method.
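
As a rough illustration of point 2, a surfacer block that turns the timestamp off might look like this (a minimal sketch; the prometheus_surfacer field name is assumed from the surfacer proto quoted above, so adjust to your actual config):

    # Sketch: disable timestamps in the Prometheus output.
    # Field names assumed from the surfacer proto quoted above.
    surfacer {
      type: PROMETHEUS
      prometheus_surfacer {
        include_timestamp: false
      }
    }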

Also, you said you didn't see anything in the logs. Can you try mapping /tmp as a volume ("-v /tmp:/tmp") and see if it generates any logs? I think cloudprober will try to log to /tmp if it's not running on GCE (on GCE, logs go to Stackdriver Logging).

@manugarg
Contributor

manugarg commented Aug 3, 2018

Regarding my last comment about logging, I verified that our docker image's default command line is set to log to stderr:

ENTRYPOINT ["/cloudprober", "--logtostderr"]

So unless you're overriding the docker image entrypoint, cloudprober should be logging to stderr rather than a file under /tmp.

@Daxten
Author

Daxten commented Aug 6, 2018

Hey,
thanks for helping out so much. I sent you the log just now. I will create a checklist of the other points you mentioned and check them the next time it happens.

@manugarg
Contributor

Hi @Daxten,

I got the logs. I also responded over email, but to close the loop here:

===
Looking at the logs, it seems that this is not probe specific as all probes stop outputting the data at around the same time. I'll try to add some more logging and profiling, and provide you with a different container image version. Sorry, it may require some more work.

Just to collect some more info:

  • You're not seeing any CPU/memory-exhaustion issues on the node or container?
  • Can you share your deployment environment with me: how is the Docker container run, and which volumes are mapped into it? If there is a Kubernetes config, can you share that as well?
  • Which Docker image are you running? I recently cut release 0.9.4; maybe you can pin your deployment to that version so that we have something definitive to work with.
===

I improved the logging in the last couple of changes. Can you retry with the "latest" container?

@manugarg
Contributor

manugarg commented Dec 8, 2018

@Daxten, I was wondering if you're still experiencing this issue. Can we close it if you're not?

Thanks,
Manu

@manugarg
Contributor

Closing this due to inactivity. Please feel free to reopen if it's still a problem. I'll be more than happy to debug this with you. Cheers.
