Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove cont_id tag from docker plugin #1015

Merged
merged 1 commit into from
Apr 18, 2016
Merged

Remove cont_id tag from docker plugin #1015

merged 1 commit into from
Apr 18, 2016

Conversation

sparrc
Copy link
Contributor

@sparrc sparrc commented Apr 12, 2016

this would make container_id a field

also renaming cont_name and cont_image to container_name and
container_image.

closes #1014

@sparrc
Copy link
Contributor Author

sparrc commented Apr 12, 2016

cc @jchauncey @gena01 @stanch

including some potentially interested parties for review. See description above for an overview of the changes this would bring, some of which would require adjustment to your queries.

@gena01
Copy link

gena01 commented Apr 12, 2016

@sparrc can you make this one optional?

I know it's an edge case at this point but graphing metrics localized to a single container is based on container_id and if we remove that then we can't track the behavior of the container over time.

@jchauncey
Copy link

Yeah container id is definitely needed

@sparrc
Copy link
Contributor Author

sparrc commented Apr 12, 2016

@gena01 @jchauncey what if we made it a field instead of a tag?

@gena01
Copy link

gena01 commented Apr 13, 2016

@sparrc can you remind me what the difference is?

@sparrc
Copy link
Contributor Author

sparrc commented Apr 13, 2016

@gena01

In line-protocol terms:

cpu,host=localhost usage=99 <timestamp>
     ^tag            ^field

basically, tags are indexed and fields are not. This means that WHERE queries on fields will be less performant than tags. You will still be able to run SELECT queries on fields. The major drawback of tags is that each unique set of measurement/tags creates a new "series" in influxdb, which increases cardinality of the database (which significantly effects performance)

There's a more detailed doc here: https://docs.influxdata.com/influxdb/v0.12/concepts/schema_and_data_layout/#encouraged-schema-design

@jchauncey
Copy link

What is the reasoning behind removing it? Just trying to get a little context.

@sparrc
Copy link
Contributor Author

sparrc commented Apr 13, 2016

some users complained because when you bring a docker container up and down, the ID will change. This means that for every container restart you will create a new container_id tag value, which creates a new series in InfluxDB.

Creating a new series increases the cardinality and too many series can bog down the performance of the database.

So the reason behind this is because it can easily create a very poorly performing InfluxDB schema design, since most docker users stop & restart containers frequently.

@jchauncey
Copy link

Ah ok. My first thought is more philosophical in that I wonder how they are graphing the data? I would want to slice the data along the container_id so when new containers come online or go away I would see that reflected in the graph. Container names are pointless in almost all cases because they are either going to be teh same or some random junk that is hard to parse anyways.

If going from tag to field helps with performance than thats fine with me.

@sparrc
Copy link
Contributor Author

sparrc commented Apr 13, 2016

@jchauncey, that should still work, SELECT container_id FROM docker will still be a valid query.

The main difference would be that SELECT <some_metric> FROM docker WHERE container_id='...' will have worse performance.

@jchauncey
Copy link

K i doubt people are doing the second (unless they have some templated query they are using to filter a graph)

@sparrc
Copy link
Contributor Author

sparrc commented Apr 13, 2016

cc @tripledes @asdfsx @AdithyaBenny @zstyblik @sporokh including some more people who I think will want to know about this potential change

@sporokh
Copy link

sporokh commented Apr 14, 2016

@sparrc Thanks a lot for notifying. Much appreciate.

@asdfsx
Copy link

asdfsx commented Apr 14, 2016

thx for notifying!!!

@zstyblik
Copy link
Contributor

@sparrc turning container ID into field makes sense as you've explained.

@gena01
Copy link

gena01 commented Apr 15, 2016

@sparrc would it make sense to make cont_name a field as well?

@sparrc
Copy link
Contributor Author

sparrc commented Apr 15, 2016

I suppose it depends on how the user is using container names in docker. Some people might setup their containers with the same name on restart, others might go with the randomized docker default names.....

but since docker randomizes by default, I'm thinking that you're right and that we should make container_name a field as well.

Anyone here have other thoughts? Do you ever do WHERE queries using the cont_name tag? (ie SELECT <metric> FROM docker_cpu WHERE cont_name='foo')

@jchauncey
Copy link

Knowing how containers are named in kubernetes it would probably be best if container name was a field.

@zarnovican
Copy link
Contributor

I'm one of those hit by "high cardinality of docker measurements" issue. From my point, #1017 would fix the problem. You would simply filter-out tag which has randomized value.

However, we have static container names, which is used in Grafana queries like:

SELECT "usage_percent" FROM "docker_cpu"
WHERE "host" = '$host' AND $timeFilter
GROUP BY "cont_name"

Would I be able to GROUP BY name if it was a field ??

@tripledes
Copy link

tripledes commented Apr 17, 2016

@sparrc thanks for pinging.

I guess this change makes sense specially for InfluxDB, right? Just wondering would the change affect other outputs which are not that flexible from a data structure point of view...any idea?

@whetherharder
Copy link

May be the best way is to make it configurable ? In my particular case
grouping data by cont_name is important to have required detalization
17.04.2016 14:01 пользователь "Sergio Jimenez" notifications@github.com
написал:

@sparrc https://github.com/sparrc thanks for pinging.

I guess this change makes sense specially for InfluxDB, right? Just
wondering would the change would affect other outputs which are not that
flexible from a data structure point of view...any idea?


You are receiving this because you are subscribed to this thread.
Reply to this email directly or view it on GitHub
#1015 (comment)

@sparrc
Copy link
Contributor Author

sparrc commented Apr 17, 2016

Investigating it further, I actually think that we'll need to keep container_name a tag (but I will keep container_id a field).

The reason is that, if we don't make container_name a tag, then there is nothing to uniquely identify two containers with the same image. If container_name is creating too much cardinality then users can use the new tagexclude argument to remove it from their measurements if they don't need it (see #1017)

To @tripledes point, this is true for InfluxDB, Graphite, and prometheus

@sparrc
Copy link
Contributor Author

sparrc commented Apr 17, 2016

Thinking about this a bit more, I think that it makes the most sense to rename a few of the measurements, specifically:

  • docker_cpu
  • docker_mem
  • docker_net

would be renamed to:

  • docker_container_cpu
  • docker_container_mem
  • docker_container_net

Why? Because these metrics are specifically tracking per-container stats. The problem with per-container stats, in some use-cases, is that if containers are short-lived AND names are not kept consistent, then the series cardinality will balloon very quickly.

So adding "_container" to each metric would:

  • make it more clear that these metrics are per-container, and
  • allow users to easily drop per-container metrics if cardinality is an issue (namedrop = ["docker_container_*"])

NOTE to clarify, we will still be changing container_id to a field, and container_name will remain a tag (so GROUP BY and WHERE will still work on container_name).

thoughts?

@jchauncey
Copy link

Sounds good to me.

@sporokh
Copy link

sporokh commented Apr 18, 2016

@sparrc totally agree with you considering container_name. I think it should be keep as tag, first of all cause you've already mentioned - we removing the ability to identify two containers with the same image, and the second one - stable production environments run by docker/docker-compose do not use randomly generated names.
I guess the problem only with:
Kubernetes
Docker when you don't set "--name"

@sparrc sparrc force-pushed the cs1014 branch 3 times, most recently from 90dd487 to f559214 Compare April 18, 2016 17:39
- renaming cont_name and cont_image to container_name and
container_image.
- cont_id is now a field, called container_id
- docker_cpu, docker_mem, docker_net measurements have been renamed to
  docker_container_cpu, docker_container_mem, and docker_container_net

closes #1014
closes #1052
@sparrc
Copy link
Contributor Author

sparrc commented Apr 18, 2016

OK, this has now been merged into master as I think I've given everyone time to provide input, and overall I felt like there was positive feedback on this change, despite it being a breaking change.

see the changelog for a full rundown

thanks all for the input!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

remove cont_id tag from docker plugin
9 participants