
Containers in restarting mode are still reported but the timestamp falls outside of retention policy... #3383

Closed
crazycodr opened this issue Oct 24, 2017 · 17 comments · Fixed by #3437
Labels: bug (unexpected problem or unintended behavior)

Comments

@crazycodr

Bug report

I just started using the TICK stack in favor of our old Prometheus stack and found this: when a container is stuck in restarting mode, the Telegraf docker plugin generates points that fall outside my already fairly wide retention policy (180d), which probably means it is sending a timestamp that is extremely low, or maybe even a zero value.

Relevant telegraf.conf:

Nothing special, just:

  • Ensure the docker input plugin is scanning for metrics
  • Ensure you are sending your data to an InfluxDB output (a minimal config sketch follows)
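A minimal telegraf.conf along these lines matches that description; the endpoint, URL, and database name here are placeholders, not values from the original report:

[[inputs.docker]]
  # collect from the local daemon over the unix socket
  endpoint = "unix:///var/run/docker.sock"

[[outputs.influxdb]]
  # whichever InfluxDB instance holds the 180d retention policy
  urls = ["http://localhost:8086"]
  database = "telegraf"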

System info:

Ubuntu 16.04 LTS: 4.4.0-66-generic
Docker version 1.12.6, build 78d1802
Telegraf (running from docker) using "telegraf:alpine"

Steps to reproduce:

  1. Create a docker environment where:
    1.1. a container is stuck in restarting mode (a custom container that fails to start because of an error in its docker-entrypoint.sh usually does the trick; see the one-liner after this list)
    1.2. an InfluxDB instance is running
    1.3. a Telegraf instance with the docker plugin is active (no special config)
  2. Let Telegraf run; you should eventually get "Points beyond retention policy: Response Error: Status Code [400], expected [204]"
  3. Stop the container that is in restarting mode and the warnings about points beyond retention policy will stop
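One quick way to get a container stuck in restarting mode (my own shortcut, not necessarily the reporter's exact setup) is to run a command that exits immediately under an always-restart policy:

docker run -d --name broken --restart always alpine sh -c 'exit 1'

Docker keeps restarting it, and docker ps shows the container in the Restarting state.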

Expected behavior:

You should not be submitting points that result in an invalid timestamp

Actual behavior:

You are submitting points that result in an invalid timestamp, which generates error messages

@crazycodr
Author

To be noted: it is actually ironic, but the container stuck in restarting mode is the Prometheus container I was trying to replace... LOL

@danielnelson
Contributor

Can you enable a file output and reproduce the error? This will let us see exactly what we are submitting to InfluxDB.

@crazycodr
Author

I will put this on my todo list for this week

danielnelson added the bug (unexpected problem or unintended behavior) label on Oct 24, 2017
@jspiewak

We were just diagnosing this issue and tracked it down to a container that we were stopping.
Telegraf is reporting a negative timestamp:

{"fields": {"usage_percent": 0, "limit": 0, "container_id": "2f9a5944f17e923b3bd5e9d0a6af9d0112fa9c3ce0f43c011d5b238ac01e6032", "max_usage": 0, "usage": 0}, "tags": {"container_version": "latest", "container_name": "yyyyyy", "host": "apps9", "container_image": "xxxxxx"}, "name": "docker_container_mem", "timestamp": -6795364578}

@danielnelson
Contributor

@jspiewak Can you add the output of:

curl --unix-socket /var/run/docker.sock http://localhost/containers/2f9a5944f17e923b3bd5e9d0a6af9d0112fa9c3ce0f43c011d5b238ac01e6032/stats

@jspiewak

@danielnelson our system runs customer workloads and we create, start, stop, and then destroy containers all the time. That container is long gone.

That said, we are regularly seeing the negative timestamp occur, so if there is a different thing you would like me to try, happy to give it a go.

In case it is at all useful, we are running Docker 17.04 and telegraf 1.4.1

@danielnelson
Contributor

Do you think you can capture this from another container exhibiting the bug?

@jspiewak

Doubtful. The scenario we have is that our system decides the container should no longer be running, it is stopped, and then it is destroyed. My guess is that the telegraf docker plugin obtains a list of all running containers, then while it is obtaining the stats for the containers, the container in question is stopped and destroyed. When telegraf goes to obtain stats, something doesn't go quite right, and the negative timestamp is the consequence.

FWIW, we likely run too many containers per host, so there will be times when 90 containers need metrics collected in a pass.
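A rough sketch of the collection pattern described above, using the Docker Go client; this is only an illustration of where the race could sit, not Telegraf's actual code:

package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewEnvClient()
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// 1) List the containers that are running right now.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	// 2) Fetch stats for each one. The guess above is that a container
	// stopped and destroyed in this window yields a stats payload whose
	// "read" field decodes to the zero time.Time.
	for _, c := range containers {
		resp, err := cli.ContainerStats(ctx, c.ID, false)
		if err != nil {
			continue // the container may already be gone
		}
		var stats types.StatsJSON
		_ = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		fmt.Println(c.ID[:12], stats.Read.UnixNano()) // zero time shows up negative
	}
}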

@danielnelson
Contributor

I could create a special build of Telegraf that will dump the JSON when an error occurs; would you be able to run it until a negative timestamp is found?

@jspiewak

Sure thing. Would it be helpful to upgrade to 1.4.3 first?

@danielnelson
Contributor

Here is a build of 1.4.3 with the logging patch; it will log the JSON from the endpoint if the timestamp is negative. It is a Debian package; let me know if you need a different package format.

@jspiewak

We are running telegraf as a Docker container, actually

@danielnelson
Contributor

docker pull danielnelson/telegraf:22d26745

@jspiewak

jspiewak commented Nov 2, 2017

FYI, looks like the image is missing the entrypoint.sh, and is much smaller than 1.4.3:

$ docker images | grep tele
danielnelson/telegraf                                    22d26745              179e4180d22a        45 hours ago        136MB
telegraf                                                 1.4.3                 1325ecd2dbf7        7 days ago          207MB

We are using the Chef Docker cookbook to configure the container, and something about not having the entrypoint broke the update to use this image.

Anyway, it is now deployed in production so we should have something by tomorrow. What can I grep for in the output, since we have it log the metrics as well?

@jspiewak

jspiewak commented Nov 2, 2017

I don't see any JSON debugging, but I had the processors.printer enabled anyway:

docker_container_cpu,host=apps16 throttling_throttled_time=0i,usage_total=0i,usage_system=0i,throttling_periods=0i,throttling_throttled_periods=0i,container_id="503106b28c85a71502fcc6350dfee60a61b4a99f604c02429ebbadba739e4869",usage_percent=0,usage_in_usermode=0i,usage_in_kernelmode=0i -6795364578871345152

@jspiewak

jspiewak commented Nov 3, 2017

By the way, -6795364578871345152 appears to be a magic number:

https://groups.google.com/d/msg/influxdb/ZT1NOsKIHCs/9Ojqh083IyAJ
https://gist.github.com/patdowney/b64a4fee790ac66dd1a2

@danielnelson
Contributor

Okay, it all makes sense now, and it explains why the logging didn't work too: this is just the zero value for a time.Time. I think the best solution is to ignore the time reported by docker and just always use the current time according to Telegraf.
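For reference, -6795364578871345152 is exactly what Go's zero time.Time produces from UnixNano (the year-1 value overflows int64), which matches the line protocol above. The proposed fix amounts to falling back to the collector's own clock whenever Docker reports a zero read time; a minimal sketch of that idea, not the actual patch in #3437:

package main

import (
	"fmt"
	"time"
)

// metricTime picks the timestamp to attach to a docker metric: if the
// stats endpoint reported a zero time (as it does for containers that
// are restarting or being torn down), fall back to the current time.
// This is a hypothetical helper, not Telegraf's real function.
func metricTime(read time.Time) time.Time {
	if read.IsZero() {
		return time.Now()
	}
	return read
}

func main() {
	var zero time.Time
	// Prints -6795364578871345152, the "magic number" from the line
	// protocol above: UnixNano overflows int64 for the zero time.
	fmt.Println(zero.UnixNano())

	// With the fallback, the metric gets a sane, positive timestamp.
	fmt.Println(metricTime(zero).UnixNano() > 0)
}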
