
metrics: prometheus integration road map #27307

Open
stevvooe opened this Issue Oct 11, 2016 · 35 comments

@stevvooe
Contributor

stevvooe commented Oct 11, 2016

At the Docker Distributed Systems Summit, we discussed the integration of prometheus into docker. We clarified several points about how to match the two models, both at the engine level and at the swarm-mode level.

Most importantly, we defined the high level steps that need to be taken to achieve a nice result:

  • Integrate a prometheus metrics output for the internal behavior of docker itself. Currently, we are interested in instrumenting container startup, but this may include further metrics. This will help the Docker Maintainer and Contributor team make docker better while also allowing us to learn about using prometheus in a production product. (#25820; a small configuration sketch follows this list)
  • Expose externally observable container metrics. This should replace use cases that involve docker stats and external metrics exporters. The focus must be on a stable and scalable schema, while supporting the future goals of application-level metrics. We may have to rely on externally supported target discovery, but it may make sense to tackle some of this in the interim.
  • Define a cluster-level discovery mechanism to expose topology to prometheus discovery. This will include integration into swarm mode as well as single-node target discovery. We may find that we need to do this in conjunction with externally observable metrics.
  • Expose per-container metrics including application-level proxied metrics. This will mean that application and container-level metrics will be served under a single target. No aggregation will be done in the docker engine. Such an integration will be a pass-through proxy, annotated with target-specific data, such as container ids.
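
As a concrete sketch of the first bullet, the engine metrics endpoint from #25820 is expected to be enabled through a daemon-level metrics address. The flag name, experimental gating, and port below are assumptions based on the in-flight proposal and may change before it ships:

```
# Sketch only: flag name and port are assumptions, not a committed interface.
dockerd --experimental --metrics-addr=0.0.0.0:9323

# The engine's internal metrics in Prometheus text format.
curl http://localhost:9323/metrics
```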

cc @juliusv

@juliusv

juliusv commented Oct 11, 2016

Thanks @stevvooe, that's a great summary of our discussion! One question:

This will mean that application and container-level metrics will be served under a single target.

Does this mean you'd want to merge the internal whitebox metrics from an application's /metrics endpoint with the externally observable container metrics supplied by Docker itself? I'd say there are cases where you'd want to scrape one but not the other (for example, a cluster admin would only want to ingest external metrics), so maybe it's better to keep them separate? But maybe I'm misunderstanding it.

@brian-brazil

brian-brazil commented Oct 11, 2016

Expose externally observable container metrics.

This should go somewhere other than /metrics; otherwise these metrics will take significant resources away from both infrastructure-level and service-level monitoring, since each would end up having to ingest metrics for everyone else's machines/containers.

Expose per-container metrics including application-level proxied metrics.

Scraping of applications should happen directly. Application metrics are not container metrics, and similarly container metrics are not application metrics.

Such an integration will be a pass-through proxy, annotated with target-specific data, such as container ids.

This is an anti-pattern in Prometheus. Target labels should come from service discovery and relabelling, not the target itself.

@stevvooe

Contributor

stevvooe commented Oct 12, 2016

@brian-brazil @juliusv Saying something is an anti-pattern is an anti-pattern. 🐹

Let's try to view this more as a high-level road map than a place for discussion of technical details.

The design discussion needs to really be around how we integrate target discovery to control metadata tagging. If I was not clear above, there will be no tagging if we do any kind of proxying. This will all be done via the target metadata associated with the scrape. While I agree that it would be ideal for applications to be scraped directly, there may be complexities created in that model that can be avoided if the management plane is leveraged.

The options are pretty much as follows:

  1. Have one, gigantic slow metric endpoint that does everything.
  2. Expose internal engine targets with a separate target for each container, along with application metadata. This has the benefit of keeping container performance associated with the application.
  3. Separate out into internal engine targets, container targets, and application targets.

I suspect the right answer is between 2 and 3.

Remember, the goal here is to require almost zero configuration for this to work out of the box. Anything that involves running a specific container on a specific network can start making this more complex for end users.

BUT, let's discuss this in further detail after we've been through the step of exporting engine metrics (step 1 on the road map). I suspect there are some misunderstandings around terminology that will become clearer once we have more experience. Specifically, there are some details around target discovery that I'm a little unclear on.

@juliusv

juliusv commented Oct 12, 2016

@brian-brazil As background information, the motivation for scraping apps through a transparent proxy (with target metadata coming via the SD as target labels, not in the metrics output itself) is that on Docker, people run everything in their own little segmented virtual Docker networks (per service or similar), so Prometheus would be unable to reach apps directly except for the ones in its own Docker network. That transparent proxy would effectively punch a hole through that networking segmentation for the purpose of gathering metrics.
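
As a purely hypothetical sketch of that transparent proxy idea (not Docker's actual implementation), the engine or a helper would terminate the scrape on an address Prometheus can reach and pass it through unmodified to the application's metrics endpoint on the segmented network; identifying labels would still come from service discovery metadata rather than from the metrics body:

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
)

func main() {
	// Address of an application's metrics endpoint on a segmented Docker
	// network that an external Prometheus cannot reach directly
	// (illustrative value only).
	target, err := url.Parse("http://10.0.3.5:8080")
	if err != nil {
		log.Fatal(err)
	}

	// Pass the scrape through unmodified; no labels are injected into the
	// payload here, matching the "tagging comes from target metadata" point
	// made earlier in the thread.
	proxy := httputil.NewSingleHostReverseProxy(target)
	http.Handle("/metrics", proxy)
	log.Fatal(http.ListenAndServe(":9099", nil))
}
```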

@jimmidyson

jimmidyson commented Oct 12, 2016

This is a great idea - looking forward to it already. In time, it feels like it would perhaps make sense for this to be part of the OCI spec?

@brian-brazil

brian-brazil commented Oct 12, 2016

Thanks for the clarification. I'd propose 3). A transparent proxy sounds like a good idea to handle varying network deployments.

I don't think injecting metrics into the application's metrics via the proxy is workable due to collisions, the potential for multiple applications living inside one container and generally getting in the way (consider the blackbox or snmp exporter, container metrics are irrelevant for its typical usage).

Specifically, there are some details around target discovery that I'm a little unclear on.

Anything in particular we can help clarify?

@stevvooe

Contributor

stevvooe commented Oct 12, 2016

I don't think injecting metrics into the application's metrics via the proxy is workable due to collisions, the potential for multiple applications living inside one container and generally getting in the way (consider the blackbox or snmp exporter, container metrics are irrelevant for its typical usage).

We'll have to try a few things out here. I think we'll know more when we start seeing the volume of container metrics.

It might be good to review these PRs based on metrics endpoint output, rather than purely code. I think it will clear up a lot of confusion and that will be the interface that we standardize upon.

Anything in particular we can help clarify?

I just need to go read the implementation. This one is on me.

@brian-brazil

brian-brazil commented Oct 12, 2016

We'll have to try a few things out here. I think we'll know more when we start seeing the volume of container metrics.

I don't see the volume being a major issue as long as we avoid 1); the challenge is more around semantics and edge cases.

I just need to go read the implementation. This one is on me.

The short version is that you give Prometheus a list of targets, each target having whatever metadata might be useful as a set of key/value pairs. I'm guessing here we'd be talking about a regular poll of the Docker API to get this information. That's then offered up to the user to munge with relabelling.

The main things to do then are determine how/what information is pulled from Docker, and what the example Prometheus configuration looks like for this (there'll likely be a moderate amount of boilerplate, so you want something copy&pasteable).
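
A minimal sketch of what that boilerplate might look like, assuming the Docker side periodically writes a standard file_sd target file; the file path and label names here are hypothetical, not an agreed schema:

```yaml
scrape_configs:
  - job_name: docker-containers
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/docker-*.json   # written by the Docker side
    relabel_configs:
      # Only keep targets that opted in via a (hypothetical) label.
      - source_labels: [docker_label_prometheus_enable]
        regex: "true"
        action: keep
      # Promote discovery metadata to a stable target label.
      - source_labels: [docker_service_name]
        target_label: service
```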

@jmkgreen

jmkgreen commented Dec 12, 2016

Is there a discussion document that defines the scope of this work that you guys can supply a reference to? Having stumbled upon this, it looks both interesting and potentially very broad.

Some goals and non-goals would be helpful.

@stevvooe

Contributor

stevvooe commented Dec 13, 2016

@jmkgreen What aren't we covering in the description of this issue? Indeed, it is broad. Creating a rigid, structured plan is a lot of work and won't necessarily produce a better result, especially when there is a feedback component to each stage.

As we take on each item here, we will create issues that cover its details. Scope and goals will be better defined at that time. If we need to adjust scope based on the results of that work, we will adjust this road map.

Is there something specific missing here? What decision are you trying to make?

@jmkgreen

jmkgreen commented Dec 14, 2016

The original post reads as though there is already a set of intentions in mind. When talking of goals and non-goals, I do not expect a long bullet list, but I would like to know what is to be monitored and who the expected audience of its output is.

Are we talking about operations people monitoring hardware capacity ("we need to order more machines"), the operating capacity of individual docker containers ("we need to spread the load more" or "we can evacuate these machines for reboot and still cope"), or application engineers ("we clearly have a memory leak in production" or "we need to look at the db - it is really slow")?

Monitoring is an enormous topic. What are you trying to do here?

@stevvooe

Contributor

stevvooe commented Dec 14, 2016

@jmkgreen If you read carefully, your questions are covered in the bullet points in the description. I'll admit, there is a leap in context in favor of brevity. It might help to read up on prometheus to better understand the points.

Let me break this down into simpler bullet points:

  • Docker engine metrics
  • Externally observable container metrics (CPU, memory, etc.)
  • Integrated target discovery (which endpoints to scrape metrics from)
  • Application-level metrics (forward targets for running container applications)

With all of these implemented, one should be able to hit all of the described use cases. However, the focus of this exercise is the data flow. How you consume those metrics and who consumes them is really out of scope; that should be up to the operators of the infrastructure.

As you have posed your inquiry, it sounds like you are just critiquing the methodology. If you could expand on why you're asking this question or what conclusion you are trying to reach, I may be able to give you a better answer.

@jmkgreen

jmkgreen commented Dec 15, 2016

So this boils down to allowing for discovery of topology and the polling of metrics both within Docker and its hosted applications? If so, that's a fine summary, and a welcome move. Finally, I assume this will result in something both prometheus and others can work against?

@brian-brazil

brian-brazil commented Dec 15, 2016

Prometheus is an open ecosystem; we currently have parsers in Go and Python for our format that others can use to integrate with whatever they like. For example, with our Python client library it's possible to write less than 10 lines of code that regularly fetch Prometheus-formatted data and push it out to Graphite.

I don't know the full details on the discovery and forwarding, but I'd be very surprised if there's anything that ties it to Prometheus.

@stevvooe

Contributor

stevvooe commented Dec 20, 2016

@jmkgreen Expanding on @brian-brazil's response, that is exactly the case. Prometheus was chosen for its format. There may need to be community work to flow prom data into other systems, but filling those gaps will help both the prom community and docker users alike.

@stevvooe

Contributor

stevvooe commented Jan 25, 2017

Note that a default prometheus port has been requested in prometheus/prometheus#2366.

cc @FrenchBen

@lukemarsden

Contributor

lukemarsden commented Feb 15, 2017

@stevvooe do you have an idea of how to make something (a proxy or a prom instance, for example) run on all docker networks? @justincormack and I might take a look at this today.

@stevvooe

Contributor

stevvooe commented Feb 15, 2017

@stevvooe do you have an idea of how to make something (a proxy or a prom instance, for example) run on all docker networks?

You'd have to update the service every time a new network is added.

I think we'll have to come up with a better solution here, such as a metrics network or making it easy to export these metrics at the node level. Prometheus is very target-oriented, and having a "direct path" to the exported service is fairly important to its model.

@nustiueudinastea

nustiueudinastea commented Mar 30, 2017

Hi all, we (ContainerSolutions and Weave) made a proof of concept implementation for Prometheus service discovery in Swarm. You can see a demo in this video: https://drive.google.com/open?id=0B-ef0kzr77N8Zm5WRkZob0x3dEk

The POC repo can be found here: https://github.com/ContainerSolutions/prometheus-swarm-discovery . The implementation is quite barebones at the moment; it works by continuously scanning all the Swarm services and connecting the Prometheus container to all the networks that belong to Swarm services with running tasks. The user doesn't need to configure anything.
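
For readers unfamiliar with the Docker Go client, a rough sketch of that scan-and-connect loop (illustrative only, not the POC's actual code) might look like:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

// connectPrometheus attaches the given Prometheus container to every network
// used by a swarm service so that scrapes can reach tasks on segmented
// networks. Error handling is trimmed for brevity.
func connectPrometheus(ctx context.Context, cli *client.Client, promContainerID string) error {
	services, err := cli.ServiceList(ctx, types.ServiceListOptions{})
	if err != nil {
		return err
	}
	seen := map[string]bool{}
	for _, svc := range services {
		for _, vip := range svc.Endpoint.VirtualIPs {
			if seen[vip.NetworkID] {
				continue
			}
			seen[vip.NetworkID] = true
			// May fail if already attached; a real tool would inspect the
			// current attachments first.
			if err := cli.NetworkConnect(ctx, vip.NetworkID, promContainerID, nil); err != nil {
				log.Printf("connect %s: %v", vip.NetworkID, err)
			}
		}
	}
	return nil
}

func main() {
	cli, err := client.NewEnvClient()
	if err != nil {
		log.Fatal(err)
	}
	// The Prometheus container id is passed as a placeholder argument.
	if err := connectPrometheus(context.Background(), cli, os.Args[1]); err != nil {
		log.Fatal(err)
	}
}
```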

Would be happy to hear some opinions! @justincormack @juliusv @stevvooe

@bvis

bvis commented Mar 30, 2017

@nustiueudinastea Very interesting approach.
Could you share your /etc/prometheus/prometheus.yml config file to help me understand how you process the file-based service discovery?

@nustiueudinastea

nustiueudinastea commented Mar 30, 2017

@bvis the config can be found in the POC repo: https://github.com/ContainerSolutions/prometheus-swarm-discovery/blob/master/prometheus-configs/prometheus.yaml. If you take a look at the docker-compose.yml file in the same repo, you will see that the discovery tool container shares a volume with the prometheus container, through which it writes the scrape target file.
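
For reference, the target file it writes is standard Prometheus file_sd JSON, along the lines of the hypothetical entry below; the exact label set is up to the discovery tool:

```json
[
  {
    "targets": ["10.0.3.5:8080"],
    "labels": {
      "swarm_service": "frontend",
      "swarm_task": "frontend.1"
    }
  }
]
```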

@bvis

bvis commented Mar 30, 2017

@nustiueudinastea I've seen it through the source code of the image. I have some questions/suggestions:

  1. I've seen some errors in the logs, probably not important:
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | time="2017-03-30T06:11:45Z" level=info msg="Connecting network monitoring(10.0.3.0) to 6889df04fc7abc6cfd5e6da90d37904e38ce2f85fbc31e3f4445bb3cc868f908(10.0.3.253)"
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | panic: Error response from daemon: network 2a4sjfnp8n6fhaxk5oeikaxtt not found
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    |
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | goroutine 1 [running]:
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | panic(0x759480, 0xc420ab62d0)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/usr/local/go/src/runtime/panic.go:500 +0x1a1
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | main.connectNetworks(0xc4206603c8, 0xc420a79240, 0x40)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/swarm.go:79 +0x8b6
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | main.discoverSwarm(0xc420a79240, 0x40)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/swarm.go:223 +0x129c
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | main.discoveryProcess(0xc42009ad80, 0xc4201cca80, 0x0, 0x6)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/swarm.go:254 +0x362
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra.(*Command).execute(0xc42009ad80, 0xc4201cc9c0, 0x6, 0x6, 0xc42009ad80, 0xc4201cc9c0)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra/command.go:648 +0x443
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc42009afc0, 0xc42016dee8, 0x1, 0xc42009afc0)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra/command.go:734 +0x367
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra.(*Command).Execute(0xc42009afc0, 0xc42016dee0, 0x1)
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/vendor/github.com/spf13/cobra/command.go:693 +0x2b
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | main.main()
prometheus_swarm-discover.1.jxp6ny5bppdo@moby    | 	/go/src/github.com/weaveworks/prometheus-swarm/swarm.go:277 +0x36f
  2. I see that all the service tasks are automatically discovered and scraped by the swarm-discovery system. Wouldn't it make sense to activate it with a label instead, i.e. a whitelist (label -> prometheus.discover: "true") rather than a blacklist (label -> prometheus.ignore: "true")?

  3. Would it be possible to add more labels to the swarm-endpoints.json file, such as the node hostname and other service labels?

@nustiueudinastea

nustiueudinastea commented Mar 30, 2017

@bvis

  1. This shouldn't happen. Perhaps some kind of timing issue where the network disappeared in the meantime? If you open an issue on the project page and add more details about how you ended up getting that error, I will look into it. Is it happening continuously?

  2. Agreed. That can be easily changed and perhaps we could add a configuration flag to the tool, which makes the discovery implicit or explicit. I would like to see what other people think regarding this feature.

  3. Most definitely. If you can, please open an issue with this request.

The POC was done quickly to see if we could get it to work; we'll have to see how much further effort we put into the POC, or whether we attempt a plugin implementation for Prometheus.

@thaJeztah

Member

thaJeztah commented Mar 30, 2017

Thanks @nustiueudinastea, that looks promising

@fabio-barile

fabio-barile commented Jun 4, 2017

Hi guys,
Do you have a roadmap for the service discovery?
Monitoring our services would be a great feature and would make it easier to build auto-scaling tools ;)
Thanks

@bklau

bklau commented Jun 4, 2017

I would like to see "Expose externally observable container metrics" time-averaged and somehow integrated into an out-of-the-box, built-in autoscaling feature. Autoscaling is a MUST for Swarm Mode to be adopted at the enterprise level.

@vfarcic

vfarcic commented Jun 5, 2017

@bklau Maybe http://monitor.dockerflow.com/tutorial/ can help until an out-of-the-box solution arrives.

@Rucknar

Rucknar commented Sep 1, 2017

@stevvooe @thaJeztah What's the best way to track the progress of the metrics being implemented? "Expose externally observable container metrics" has a ton of benefit and would remove the need for cAdvisor etc. on a lot of systems. Are you open to PRs for such work?

@cpuguy83

Contributor

cpuguy83 commented Sep 1, 2017

@Rucknar We expect to get "Expose externally observable container metrics" when containerd 1.0 is integrated.

@stevvooe

Contributor

stevvooe commented Sep 1, 2017

Here is an example of the metrics from containerd for a container with an id of "foo1":

# HELP container_blkio_io_service_bytes_recursive_bytes The blkio io service bytes recursive
# TYPE container_blkio_io_service_bytes_recursive_bytes gauge
container_blkio_io_service_bytes_recursive_bytes{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Async"} 1.07159552e+08
container_blkio_io_service_bytes_recursive_bytes{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Read"} 0
container_blkio_io_service_bytes_recursive_bytes{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Sync"} 81920
container_blkio_io_service_bytes_recursive_bytes{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Total"} 1.07241472e+08
container_blkio_io_service_bytes_recursive_bytes{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Write"} 1.07241472e+08
# HELP container_blkio_io_serviced_recursive_total The blkio io servied recursive
# TYPE container_blkio_io_serviced_recursive_total gauge
container_blkio_io_serviced_recursive_total{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Async"} 892
container_blkio_io_serviced_recursive_total{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Read"} 0
container_blkio_io_serviced_recursive_total{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Sync"} 888
container_blkio_io_serviced_recursive_total{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Total"} 1780
container_blkio_io_serviced_recursive_total{container_id="foo1",device="/dev/nvme0n1",major="259",minor="0",namespace="default",op="Write"} 1780
# HELP container_cpu_kernel_nanoseconds The total kernel cpu time
# TYPE container_cpu_kernel_nanoseconds gauge
container_cpu_kernel_nanoseconds{container_id="foo1",namespace="default"} 3.7e+08
# HELP container_cpu_throttle_periods_total The total cpu throttle periods
# TYPE container_cpu_throttle_periods_total gauge
container_cpu_throttle_periods_total{container_id="foo1",namespace="default"} 0
# HELP container_cpu_throttled_periods_total The total cpu throttled periods
# TYPE container_cpu_throttled_periods_total gauge
container_cpu_throttled_periods_total{container_id="foo1",namespace="default"} 0
# HELP container_cpu_throttled_time_nanoseconds The total cpu throttled time
# TYPE container_cpu_throttled_time_nanoseconds gauge
container_cpu_throttled_time_nanoseconds{container_id="foo1",namespace="default"} 0
# HELP container_cpu_total_nanoseconds The total cpu time
# TYPE container_cpu_total_nanoseconds gauge
container_cpu_total_nanoseconds{container_id="foo1",namespace="default"} 1.091985041e+09
# HELP container_cpu_user_nanoseconds The total user cpu time
# TYPE container_cpu_user_nanoseconds gauge
container_cpu_user_nanoseconds{container_id="foo1",namespace="default"} 7.1e+08
# HELP container_hugetlb_failcnt_total The hugetlb failcnt
# TYPE container_hugetlb_failcnt_total gauge
container_hugetlb_failcnt_total{container_id="foo1",namespace="default",page="1GB"} 0
container_hugetlb_failcnt_total{container_id="foo1",namespace="default",page="2MB"} 0
# HELP container_hugetlb_max_bytes The hugetlb maximum usage
# TYPE container_hugetlb_max_bytes gauge
container_hugetlb_max_bytes{container_id="foo1",namespace="default",page="1GB"} 0
container_hugetlb_max_bytes{container_id="foo1",namespace="default",page="2MB"} 0
# HELP container_hugetlb_usage_bytes The hugetlb usage
# TYPE container_hugetlb_usage_bytes gauge
container_hugetlb_usage_bytes{container_id="foo1",namespace="default",page="1GB"} 0
container_hugetlb_usage_bytes{container_id="foo1",namespace="default",page="2MB"} 0
# HELP container_memory_active_anon_bytes The active_anon amount
# TYPE container_memory_active_anon_bytes gauge
container_memory_active_anon_bytes{container_id="foo1",namespace="default"} 2.666496e+06
# HELP container_memory_active_file_bytes The active_file amount
# TYPE container_memory_active_file_bytes gauge
container_memory_active_file_bytes{container_id="foo1",namespace="default"} 7.671808e+06
# HELP container_memory_cache_bytes The cache amount used
# TYPE container_memory_cache_bytes gauge
container_memory_cache_bytes{container_id="foo1",namespace="default"} 5.0950144e+07
# HELP container_memory_dirty_bytes The dirty amount
# TYPE container_memory_dirty_bytes gauge
container_memory_dirty_bytes{container_id="foo1",namespace="default"} 380928
# HELP container_memory_hierarchical_memory_limit_bytes The hierarchical_memory_limit amount
# TYPE container_memory_hierarchical_memory_limit_bytes gauge
container_memory_hierarchical_memory_limit_bytes{container_id="foo1",namespace="default"} 9.223372036854772e+18
# HELP container_memory_hierarchical_memsw_limit_bytes The hierarchical_memsw_limit amount
# TYPE container_memory_hierarchical_memsw_limit_bytes gauge
container_memory_hierarchical_memsw_limit_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_inactive_anon_bytes The inactive_anon amount
# TYPE container_memory_inactive_anon_bytes gauge
container_memory_inactive_anon_bytes{container_id="foo1",namespace="default"} 1.0752e+07
# HELP container_memory_inactive_file_bytes The inactive_file amount
# TYPE container_memory_inactive_file_bytes gauge
container_memory_inactive_file_bytes{container_id="foo1",namespace="default"} 3.2526336e+07
# HELP container_memory_kernel_failcnt_total The kernel failcnt
# TYPE container_memory_kernel_failcnt_total gauge
container_memory_kernel_failcnt_total{container_id="foo1",namespace="default"} 0
# HELP container_memory_kernel_limit_bytes The kernel limit
# TYPE container_memory_kernel_limit_bytes gauge
container_memory_kernel_limit_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_kernel_max_bytes The kernel maximum usage
# TYPE container_memory_kernel_max_bytes gauge
container_memory_kernel_max_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_kernel_usage_bytes The kernel usage
# TYPE container_memory_kernel_usage_bytes gauge
container_memory_kernel_usage_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_kerneltcp_failcnt_total The kerneltcp failcnt
# TYPE container_memory_kerneltcp_failcnt_total gauge
container_memory_kerneltcp_failcnt_total{container_id="foo1",namespace="default"} 0
# HELP container_memory_kerneltcp_limit_bytes The kerneltcp limit
# TYPE container_memory_kerneltcp_limit_bytes gauge
container_memory_kerneltcp_limit_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_kerneltcp_max_bytes The kerneltcp maximum usage
# TYPE container_memory_kerneltcp_max_bytes gauge
container_memory_kerneltcp_max_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_kerneltcp_usage_bytes The kerneltcp usage
# TYPE container_memory_kerneltcp_usage_bytes gauge
container_memory_kerneltcp_usage_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_mapped_file_bytes The mapped_file amount used
# TYPE container_memory_mapped_file_bytes gauge
container_memory_mapped_file_bytes{container_id="foo1",namespace="default"} 1.071104e+07
# HELP container_memory_oom_total The number of times a container received an oom event
# TYPE container_memory_oom_total gauge
container_memory_oom_total{container_id="foo",namespace="default"} 0
container_memory_oom_total{container_id="foo1",namespace="default"} 0
# HELP container_memory_pgfault_bytes The pgfault amount
# TYPE container_memory_pgfault_bytes gauge
container_memory_pgfault_bytes{container_id="foo1",namespace="default"} 32935
# HELP container_memory_pgmajfault_bytes The pgmajfault amount
# TYPE container_memory_pgmajfault_bytes gauge
container_memory_pgmajfault_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_pgpgin_bytes The pgpgin amount
# TYPE container_memory_pgpgin_bytes gauge
container_memory_pgpgin_bytes{container_id="foo1",namespace="default"} 36285
# HELP container_memory_pgpgout_bytes The pgpgout amount
# TYPE container_memory_pgpgout_bytes gauge
container_memory_pgpgout_bytes{container_id="foo1",namespace="default"} 23195
# HELP container_memory_rss_bytes The rss amount used
# TYPE container_memory_rss_bytes gauge
container_memory_rss_bytes{container_id="foo1",namespace="default"} 2.666496e+06
# HELP container_memory_rss_huge_bytes The rss_huge amount used
# TYPE container_memory_rss_huge_bytes gauge
container_memory_rss_huge_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_swap_failcnt_total The swap failcnt
# TYPE container_memory_swap_failcnt_total gauge
container_memory_swap_failcnt_total{container_id="foo1",namespace="default"} 0
# HELP container_memory_swap_limit_bytes The swap limit
# TYPE container_memory_swap_limit_bytes gauge
container_memory_swap_limit_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_swap_max_bytes The swap maximum usage
# TYPE container_memory_swap_max_bytes gauge
container_memory_swap_max_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_swap_usage_bytes The swap usage
# TYPE container_memory_swap_usage_bytes gauge
container_memory_swap_usage_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_total_active_anon_bytes The total_active_anon amount
# TYPE container_memory_total_active_anon_bytes gauge
container_memory_total_active_anon_bytes{container_id="foo1",namespace="default"} 2.666496e+06
# HELP container_memory_total_active_file_bytes The total_active_file amount
# TYPE container_memory_total_active_file_bytes gauge
container_memory_total_active_file_bytes{container_id="foo1",namespace="default"} 7.671808e+06
# HELP container_memory_total_cache_bytes The total_cache amount used
# TYPE container_memory_total_cache_bytes gauge
container_memory_total_cache_bytes{container_id="foo1",namespace="default"} 5.0950144e+07
# HELP container_memory_total_dirty_bytes The total_dirty amount
# TYPE container_memory_total_dirty_bytes gauge
container_memory_total_dirty_bytes{container_id="foo1",namespace="default"} 380928
# HELP container_memory_total_inactive_anon_bytes The total_inactive_anon amount
# TYPE container_memory_total_inactive_anon_bytes gauge
container_memory_total_inactive_anon_bytes{container_id="foo1",namespace="default"} 1.0752e+07
# HELP container_memory_total_inactive_file_bytes The total_inactive_file amount
# TYPE container_memory_total_inactive_file_bytes gauge
container_memory_total_inactive_file_bytes{container_id="foo1",namespace="default"} 3.2526336e+07
# HELP container_memory_total_mapped_file_bytes The total_mapped_file amount used
# TYPE container_memory_total_mapped_file_bytes gauge
container_memory_total_mapped_file_bytes{container_id="foo1",namespace="default"} 1.071104e+07
# HELP container_memory_total_pgfault_bytes The total_pgfault amount
# TYPE container_memory_total_pgfault_bytes gauge
container_memory_total_pgfault_bytes{container_id="foo1",namespace="default"} 32935
# HELP container_memory_total_pgmajfault_bytes The total_pgmajfault amount
# TYPE container_memory_total_pgmajfault_bytes gauge
container_memory_total_pgmajfault_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_total_pgpgin_bytes The total_pgpgin amount
# TYPE container_memory_total_pgpgin_bytes gauge
container_memory_total_pgpgin_bytes{container_id="foo1",namespace="default"} 36285
# HELP container_memory_total_pgpgout_bytes The total_pgpgout amount
# TYPE container_memory_total_pgpgout_bytes gauge
container_memory_total_pgpgout_bytes{container_id="foo1",namespace="default"} 23195
# HELP container_memory_total_rss_bytes The total_rss amount used
# TYPE container_memory_total_rss_bytes gauge
container_memory_total_rss_bytes{container_id="foo1",namespace="default"} 2.666496e+06
# HELP container_memory_total_rss_huge_bytes The total_rss_huge amount used
# TYPE container_memory_total_rss_huge_bytes gauge
container_memory_total_rss_huge_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_total_unevictable_bytes The total_unevictable amount
# TYPE container_memory_total_unevictable_bytes gauge
container_memory_total_unevictable_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_total_writeback_bytes The total_writeback amount
# TYPE container_memory_total_writeback_bytes gauge
container_memory_total_writeback_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_unevictable_bytes The unevictable amount
# TYPE container_memory_unevictable_bytes gauge
container_memory_unevictable_bytes{container_id="foo1",namespace="default"} 0
# HELP container_memory_usage_failcnt_total The usage failcnt
# TYPE container_memory_usage_failcnt_total gauge
container_memory_usage_failcnt_total{container_id="foo1",namespace="default"} 0
# HELP container_memory_usage_limit_bytes The memory limit
# TYPE container_memory_usage_limit_bytes gauge
container_memory_usage_limit_bytes{container_id="foo1",namespace="default"} 9.223372036854772e+18
# HELP container_memory_usage_max_bytes The memory maximum usage
# TYPE container_memory_usage_max_bytes gauge
container_memory_usage_max_bytes{container_id="foo1",namespace="default"} 7.4551296e+07
# HELP container_memory_usage_usage_bytes The memory usage
# TYPE container_memory_usage_usage_bytes gauge
container_memory_usage_usage_bytes{container_id="foo1",namespace="default"} 6.3070208e+07
# HELP container_memory_writeback_bytes The writeback amount
# TYPE container_memory_writeback_bytes gauge
container_memory_writeback_bytes{container_id="foo1",namespace="default"} 0
# HELP container_per_cpu_nanoseconds The total cpu time per cpu
# TYPE container_per_cpu_nanoseconds gauge
container_per_cpu_nanoseconds{container_id="foo1",cpu="0",namespace="default"} 3.64542053e+08
container_per_cpu_nanoseconds{container_id="foo1",cpu="1",namespace="default"} 4.5808741e+07
container_per_cpu_nanoseconds{container_id="foo1",cpu="2",namespace="default"} 1.8118069e+07
container_per_cpu_nanoseconds{container_id="foo1",cpu="3",namespace="default"} 1.91873454e+08
container_per_cpu_nanoseconds{container_id="foo1",cpu="4",namespace="default"} 3.94664351e+08
container_per_cpu_nanoseconds{container_id="foo1",cpu="5",namespace="default"} 2.3973698e+07
container_per_cpu_nanoseconds{container_id="foo1",cpu="6",namespace="default"} 3.7524589e+07
container_per_cpu_nanoseconds{container_id="foo1",cpu="7",namespace="default"} 1.5480086e+07
# HELP container_pids_current The current number of pids
# TYPE container_pids_current gauge
container_pids_current{container_id="foo1",namespace="default"} 6
# HELP container_pids_limit The limit to the number of pids allowed
# TYPE container_pids_limit gauge
container_pids_limit{container_id="foo1",namespace="default"} 0
@Rucknar

Rucknar commented Sep 2, 2017

Awesome, thanks guys. The above is near enough exactly what I'm using now through the API, so it won't take much tweaking. Just found ticket #34662, so I'll follow it there 👍

@Vad1mo

Vad1mo commented Feb 7, 2018

There hasn't been any update on this subject for more than a year. Maybe it needs a different approach?

Why not create a monitoring plugin for docker, like the ones for storage or logging? I think this underdeveloped area would then gain more traction and attention, and it would create an even bigger ecosystem.

@cpuguy83

Contributor

cpuguy83 commented Feb 7, 2018

@Vad1mo

Container stats are fully integrated into containerd 1.0.
These should also come with significantly improved performance when collecting stats from containers, which docker/moby is already able to take advantage of via the /containers/<id>/stats endpoint. I haven't personally benchmarked this, but it should be much better since docker 17.11.

The only thing that remains to be done in docker is to expose these on a metrics endpoint. I'm not sure if this should be the main daemon metrics endpoint or a separate one; there are probably a few ways to look at it.
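
For anyone who wants those per-container stats today, a one-shot pull from that API endpoint looks roughly like this (`<id>` is a placeholder for a real container id):

```
curl --unix-socket /var/run/docker.sock \
  "http://localhost/containers/<id>/stats?stream=false"
```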

Why not creating a monitoring plugin for docker

Can you be more specific as to what you are looking for?

Note that there is already a "metrics" plugin which currently just exposes the daemon's /metrics endpoint to a plugin over a unix socket.
You can read about implementing one here: https://docs.docker.com/engine/extend/plugins_metrics/
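
As a minimal, hedged sketch of the consuming side of such a plugin: inside the plugin's rootfs the daemon exposes its metrics over a unix socket, which can be scraped with a plain HTTP client. The socket path follows the linked documentation but should be treated as an assumption to verify:

```go
package main

import (
	"context"
	"io"
	"log"
	"net"
	"net/http"
	"os"
)

func main() {
	// Socket path per the metrics-plugin docs (assumption; verify against
	// the documentation linked above).
	const sock = "/run/docker/metrics.sock"

	httpc := &http.Client{
		Transport: &http.Transport{
			DialContext: func(ctx context.Context, _, _ string) (net.Conn, error) {
				return (&net.Dialer{}).DialContext(ctx, "unix", sock)
			},
		},
	}

	// The host part of the URL is ignored when dialing a unix socket.
	resp, err := httpc.Get("http://docker/metrics")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// Dump the Prometheus text exposition to stdout.
	if _, err := io.Copy(os.Stdout, resp.Body); err != nil {
		log.Fatal(err)
	}
}
```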
