
Containers in restarting mode are still reported but the timestamp falls outside of retention policy... #3383

Closed
crazycodr opened this issue Oct 24, 2017 · 17 comments · Fixed by #3437
Labels: bug (unexpected problem or unintended behavior)

Comments

@crazycodr

Bug report

I just started using the TICK stack in favor of our old Prometheus stack and found this: when a container is stuck in restarting mode, the Telegraf docker plugin generates points that fall outside my already fairly wide retention policy (180d), which probably means it is sending a timestamp that is extremely low, or maybe even a zero value.

Relevant telegraf.conf:

Nothing special, just:

  • Ensure the docker input plugin is scanning for metrics
  • Ensure you are sending your data to an InfluxDB output (a minimal config sketch follows)
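A minimal telegraf.conf along these lines matches that description; the endpoint, URL, and database name here are placeholders, not values from the original report:

[[inputs.docker]]
  # collect from the local daemon over the unix socket
  endpoint = "unix:///var/run/docker.sock"

[[outputs.influxdb]]
  # whichever InfluxDB instance holds the 180d retention policy
  urls = ["http://localhost:8086"]
  database = "telegraf"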

System info:

Ubuntu 16.04 LTS: 4.4.0-66-generic
Docker version 1.12.6, build 78d1802
Telegraf (running from docker) using "telegraf:alpine"

Steps to reproduce:

  1. Create a docker environment where:
    1.1. a container is stuck in restarting mode (a custom container that fails to start because of an error in its docker-entrypoint.sh usually does the trick; see the one-liner after this list)
    1.2. an InfluxDB instance is running
    1.3. a Telegraf instance with the docker plugin is active (no special config)
  2. Let Telegraf run; you should eventually get "Points beyond retention policy: Response Error: Status Code [400], expected [204]"
  3. Stop the container that is in restarting mode and the warnings about points beyond retention policy will stop
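One quick way to get a container stuck in restarting mode (my own shortcut, not necessarily the reporter's exact setup) is to run a command that exits immediately under an always-restart policy:

docker run -d --name broken --restart always alpine sh -c 'exit 1'

Docker keeps restarting it, and docker ps shows the container in the Restarting state.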

Expected behavior:

You should not be submitting points that result in an invalid timestamp

Actual behavior:

You are submitting points that result in an invalid timestamp, which generates error messages

@crazycodr
Author

To be noted: it is actually ironic, but the container stuck in restarting mode is the Prometheus container I was trying to replace... LOL

@danielnelson
Contributor

Can you enable a file output and reproduce the error? This will let us see exactly what we are submitting to InfluxDB.

@crazycodr
Author

I will put this on my todo list for this week

danielnelson added the bug (unexpected problem or unintended behavior) label on Oct 24, 2017
@jspiewak

We were just diagnosing this issue and tracked it down to a container that we were stopping.
Telegraf is reporting a negative timestamp:

{"fields": {"usage_percent": 0, "limit": 0, "container_id": "2f9a5944f17e923b3bd5e9d0a6af9d0112fa9c3ce0f43c011d5b238ac01e6032", "max_usage": 0, "usage": 0}, "tags": {"container_version": "latest", "container_name": "yyyyyy", "host": "apps9", "container_image": "xxxxxx"}, "name": "docker_container_mem", "timestamp": -6795364578}

@danielnelson
Contributor

@jspiewak Can you add the output of:

curl --unix-socket /var/run/docker.sock http://localhost/containers/2f9a5944f17e923b3bd5e9d0a6af9d0112fa9c3ce0f43c011d5b238ac01e6032/stats

@jspiewak

@danielnelson our system runs customer workloads and we create, start, stop, and then destroy containers all the time. That container is long gone.

That said, we are regularly seeing the negative timestamp occur, so if there is a different thing you would like me to try, happy to give it a go.

In case it is at all useful, we are running Docker 17.04 and telegraf 1.4.1

@danielnelson
Contributor

Do you think you can capture this from another container exhibiting the bug?

@jspiewak

Doubtful. The scenario we have is that our system decides the container should no longer be running, it is stopped, and then it is destroyed. My guess is that the telegraf docker plugin obtains a list of all running containers, then while it is obtaining the stats for the containers, the container in question is stopped and destroyed. When telegraf goes to obtain stats, something doesn't go quite right, and the negative timestamp is the consequence.

FWIW, we likely run too many containers per host, so there will be times when 90 containers need metrics collected in a pass.
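A rough sketch of the collection pattern described above, using the Docker Go client; this is only an illustration of where the race could sit, not Telegraf's actual code:

package main

import (
	"context"
	"encoding/json"
	"fmt"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewEnvClient()
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// 1) List the containers that are running right now.
	containers, err := cli.ContainerList(ctx, types.ContainerListOptions{})
	if err != nil {
		panic(err)
	}

	// 2) Fetch stats for each one. The guess above is that a container
	// stopped and destroyed in this window yields a stats payload whose
	// "read" field decodes to the zero time.Time.
	for _, c := range containers {
		resp, err := cli.ContainerStats(ctx, c.ID, false)
		if err != nil {
			continue // the container may already be gone
		}
		var stats types.StatsJSON
		_ = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		fmt.Println(c.ID[:12], stats.Read.UnixNano()) // zero time shows up negative
	}
}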

@danielnelson
Contributor

I could create a special build of Telegraf that will dump the JSON when an error occurs; would you be able to run it until a negative timestamp is found?

@jspiewak

Sure thing. Would it be helpful to upgrade to 1.4.3 first?

@danielnelson
Contributor

Here is a build of 1.4.3 with the logging patch; it will log the JSON from the endpoint if the timestamp is negative. It is a Debian package; let me know if you need a different package format.

@jspiewak

We are running telegraf as a Docker container, actually

@danielnelson
Contributor

docker pull danielnelson/telegraf:22d26745

@jspiewak

jspiewak commented Nov 2, 2017

FYI, looks like the image is missing the entrypoint.sh, and is much smaller than 1.4.3:

$ docker images | grep tele
danielnelson/telegraf                                    22d26745              179e4180d22a        45 hours ago        136MB
telegraf                                                 1.4.3                 1325ecd2dbf7        7 days ago          207MB

We are using the Chef Docker cookbook to configure the container, and something about not having the entrypoint broke the update to use this image.

Anyway, it is now deployed in production so we should have something by tomorrow. What can I grep for in the output, since we have it log the metrics as well?

@jspiewak

jspiewak commented Nov 2, 2017

I don't see any JSON debugging, but I had the processors.printer enabled anyway:

docker_container_cpu,host=apps16 throttling_throttled_time=0i,usage_total=0i,usage_system=0i,throttling_periods=0i,throttling_throttled_periods=0i,container_id="503106b28c85a71502fcc6350dfee60a61b4a99f604c02429ebbadba739e4869",usage_percent=0,usage_in_usermode=0i,usage_in_kernelmode=0i -6795364578871345152

@jspiewak

jspiewak commented Nov 3, 2017

By the way, -6795364578871345152 appears to be a magic number:

https://groups.google.com/d/msg/influxdb/ZT1NOsKIHCs/9Ojqh083IyAJ
https://gist.github.com/patdowney/b64a4fee790ac66dd1a2

@danielnelson
Contributor

Okay, it all makes sense now, and it explains why the logging didn't work too: this is just the zero value for a time.Time. I think the best solution is to ignore the time reported by docker and just always use the current time according to Telegraf.
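For reference, -6795364578871345152 is exactly what Go's zero time.Time produces from UnixNano (the year-1 value overflows int64), which matches the line protocol above. The proposed fix amounts to falling back to the collector's own clock whenever Docker reports a zero read time; a minimal sketch of that idea, not the actual patch in #3437:

package main

import (
	"fmt"
	"time"
)

// metricTime picks the timestamp to attach to a docker metric: if the
// stats endpoint reported a zero time (as it does for containers that
// are restarting or being torn down), fall back to the current time.
// This is a hypothetical helper, not Telegraf's real function.
func metricTime(read time.Time) time.Time {
	if read.IsZero() {
		return time.Now()
	}
	return read
}

func main() {
	var zero time.Time
	// Prints -6795364578871345152, the "magic number" from the line
	// protocol above: UnixNano overflows int64 for the zero time.
	fmt.Println(zero.UnixNano())

	// With the fallback, the metric gets a sane, positive timestamp.
	fmt.Println(metricTime(zero).UnixNano() > 0)
}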
