Add api for stats all containers #25361

Closed
WeiZhang555 wants to merge 6 commits from docker-stats-all

WeiZhang555
Contributor

@WeiZhang555 WeiZhang555 commented Aug 3, 2016

Fixes #22052

Add a new API that returns stats for all containers, and refactor the docker
client to make use of it.

This commit has some obvious benefits:

  1. It saves many TCP connections when fetching stats for a large number of containers.
  2. The client side is expected to perform better because it uses fewer goroutines.
  3. The client code is simpler and easier to maintain.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>

ping @runcom @thaJeztah
/cc @vincentwoo @jeanpralo Is this what you want?

TODO:

  • api-tests and integration tests
  • docs

@thaJeztah
Member

nice!!

@WeiZhang555 WeiZhang555 force-pushed the docker-stats-all branch 2 times, most recently from 7dab337 to 4fa3ee9 Compare August 3, 2016 15:26
@jeanpralo

Thanks for the work @WeiZhang555 :) Much appreciated; that will save a bit of CPU on those servers with 100+ containers :)

I was wondering if we could add the container names to the response. The ID is fine, but names are more convenient if you have a script polling this endpoint regularly: otherwise you have to rely on some external mapping or make a second API call to /containers/json.

I guess for most people the usage of that API endpoint is to get metrics in order to present them nicely, since a container in itself has a short lifetime you most likely graph the name of your container rather than its id.
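
For illustration, the extra lookup described above can be sketched in Python. This is only a sketch: it assumes the stats arrive as a mapping keyed by container ID (a hypothetical shape, not something this PR defines) and that the container list comes from GET /containers/json, whose entries carry an "Id" string and a "Names" list. The function names and sample data are made up for the example.

```python
def index_names(containers):
    """Map each container ID to its primary name.

    `containers` is the parsed JSON list from GET /containers/json, where
    each entry carries an "Id" string and a "Names" list such as ["/web"].
    """
    names = {}
    for c in containers:
        raw = c.get("Names") or []
        # Names are reported with a leading slash; strip it for display.
        # Fall back to the short ID when a container has no name listed.
        names[c["Id"]] = raw[0].lstrip("/") if raw else c["Id"][:12]
    return names


def label_stats(stats_by_id, containers):
    """Re-key a {container_id: stats} mapping by container name."""
    names = index_names(containers)
    return {names.get(cid, cid[:12]): s for cid, s in stats_by_id.items()}


# Hypothetical sample data standing in for the two API responses.
containers = [{"Id": "2cb984ad9837aaaa", "Names": ["/web"]}]
stats_by_id = {"2cb984ad9837aaaa": {"cpu_percent": 0.0}}
print(label_stats(stats_by_id, containers))
```

A script that polls stats regularly would have to refresh the name index whenever containers come and go, which is the extra call a name field in the response would avoid.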

@vincentwoo
Contributor

@rileysaurus if you have some time can you kick the tires on this PR?

@WeiZhang555
Contributor Author

@jeanpralo

I think it's inappropriate to put names in the response: the docker client has no space to show container names, and the ID is already a unique key for identifying a container.
But I can change my mind if the maintainers think it's a good idea; technically it's not hard to add a new "name" field.

@thaJeztah
Member

@WeiZhang555 there's an open pull request to add --format to stats. I haven't yet looked at how that will conflict with this PR, but that PR adds a name field, so that it can be used in a custom format; #24987

@thaJeztah
Member

ping @bfirsh; this adds a /containers/stats endpoint; does that fit in your API suggestions? #25015; guess that makes stats a reserved keyword for container names?

@jeanpralo

@WeiZhang555 @thaJeztah that PR (#24987) is interesting, but it looks like it does the formatting on the client side, not in the response to the API call.
So I will still need to make an extra call to match the ID with the name.

@thaJeztah
Member

@jeanpralo correct; the two PRs solve different issues. This PR implements the current stats (all containers) server side, but doesn't change the output. I think that changing the output belongs in #24987, which could build on top of this PR (and add extra fields to the API response). I think having both merged / implemented could be realistic for the next release (famous last words).

@WeiZhang555
Contributor Author

@jeanpralo Just as @thaJeztah said, these two PRs solve different issues. Combining the two will give you what you want, but we need to resolve the code conflicts, and I believe there will be a lot of them.

@bfirsh
Contributor

bfirsh commented Aug 5, 2016

Yes, this would conflict with #25015.

If there is a valid reason for doing this instead of multiple calls in parallel (which this does seem to be), then we need to figure out a good way of doing batching. There is no agreed upon way of doing batch operations with a REST API, and it's often a bit messy. See also #24724 where I explain in some more detail.

Having more endpoints at /object/X is going to paint our URL design into a corner, so I think that is a bad idea. But – I think there are some good ways we can do it:

  1. Build a generic batch operation API. Facebook have a good example of how to do this, but it's also an example of how complicated it can get...
  2. Have a special name to represent all containers, perhaps using a special character that can't be in a container name. e.g. /containers/.all/stats

(1) sounds like a big chunk of work, but (2) seems workable. Anybody got any other ideas?

@justincormack
Contributor

Also /allcontainers (or a nicer name) could be at the top level? Maybe that's all that's needed, stats is its inspect?

@hhcauldwell

hhcauldwell commented Aug 5, 2016

I tested this PR out with 250 containers. I see only a negligible change in CPU usage, and just a slight shifting of the CPU usage from dockerd to docker-containerd.

I've tested both requesting a stream of stats and requesting just a single sample of the stats (once a second).

I'm running an Ubuntu 14.04 VM on a mid-2015 MacBook Pro.

Here is the output of docker info for both versions I've tested.

1.12.0:
Containers: 250
Running: 250
Paused: 0
Stopped: 0
Images: 123
Server Version: 1.12.0
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: host bridge null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor
Kernel Version: 4.2.0-27-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.937 GiB
Name: coderpad
ID: QC3J:K2Q5:B6PQ:36IH:IAKK:BH4T:G6G4:3OUK:6YRC:CCC2:LIU5:AZ46
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8

1.13.0-dev:
Containers: 250
Running: 0
Paused: 0
Stopped: 250
Images: 123
Server Version: 1.13.0-dev
Storage Driver: overlay
Backing Filesystem: extfs
Logging Driver: none
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: overlay null bridge host
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Security Options: apparmor seccomp
Kernel Version: 4.2.0-27-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 4
Total Memory: 1.937 GiB
Name: coderpad
ID: QC3J:K2Q5:B6PQ:36IH:IAKK:BH4T:G6G4:3OUK:6YRC:CCC2:LIU5:AZ46
Docker Root Dir: /data/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false

@WeiZhang555 WeiZhang555 force-pushed the docker-stats-all branch 2 times, most recently from be7a30c to 95d0c31 Compare August 6, 2016 15:03
@WeiZhang555
Contributor Author

WeiZhang555 commented Aug 11, 2016

> I tested this PR out with 250 containers. I see only a negligible change in CPU usage, and just a slight shifting of the CPU usage from dockerd to docker-containerd.

@rileysaurus I just tested this with 1000 containers, I agree with you that the CPU usage didn't change much on daemon side, but I didn't see the shifting of the CPU usage from dockerd to docker-containerd.

Interestingly, I noticed that the client CPU usage is much lower with this PR.

hardware for client: 12 CPU, 125G Mem
hardware for daemon: 8 CPU, 15G Mem
<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<
client1 (generated with this PR):

Client:
 Version:      1.13.0-dev
 API version:  1.25
 Go version:   go1.6.3
 Git commit:   95d0c31
 Built:        Wed Aug 10 08:59:39 2016
 OS/Arch:      linux/amd64
 Experimental: true

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

client2(from upstream master):

Client:
 Version:      1.13.0-dev
 API version:  1.25
 Go version:   go1.6.3
 Git commit:   b2b41b2
 Built:        Tue Aug  9 10:02:46 2016
 OS/Arch:      linux/amd64
 Experimental: true

<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<

Below is the performance data I got:

1. Reduced CPU usage on the client side

client1(from this PR):

Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.3 us,  1.1 sy,  0.0 ni, 98.5 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13170728+total, 17432720 free,   967036 used, 11330752+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13031072+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
 75537 root      20   0  753024  26964  11796 S   6.0  0.0   0:01.67 docker-1.13.0-d

client2(from upstream):

Tasks:   1 total,   0 running,   1 sleeping,   0 stopped,   0 zombie
%Cpu(s):  0.6 us,  1.9 sy,  0.0 ni, 97.4 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem : 13170728+total, 17372616 free,  1025468 used, 11330920+buff/cache
KiB Swap:        0 total,        0 free,        0 used. 13025072+avail Mem 

   PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND                                                                                        
 75559 root      20   0  671868  86280  12292 S  13.5  0.1   0:05.04 docker-client 

2. Less resource consumption on the daemon side

This PR saves TCP connections, file descriptors, and goroutines, all of which affect the number of clients the daemon can serve.

client1(from this PR):

# docker info
...
Debug Mode (server): true
 File Descriptors: 4017
 Goroutines: 6026
 System Time: 2016-08-11T10:16:22.361121791+08:00
 EventsListeners: 0

client2(from upstream):

Debug Mode (server): true
 File Descriptors: 5017
 Goroutines: 9026
 System Time: 2016-08-11T10:18:43.145383775+08:00
 EventsListeners: 0

This PR saves 999 TCP connections, 3000 goroutines, and 1000 file descriptors when fetching stats for 1000 containers!
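
The file descriptor and goroutine savings follow directly from the two `docker info` outputs above. A short Python check of that arithmetic, using only the figures quoted in this comment (the variable names are just for the example):

```python
# Figures copied from the two `docker info` outputs above (1000 containers).
with_pr = {"File Descriptors": 4017, "Goroutines": 6026}
upstream = {"File Descriptors": 5017, "Goroutines": 9026}
containers = 1000

# Absolute savings: upstream client minus the client built from this PR.
saved = {k: upstream[k] - with_pr[k] for k in upstream}

# Daemon-side cost that the old approach adds per container being watched.
per_container = {k: v / containers for k, v in saved.items()}

print(saved)
print(per_container)
```

In this measurement the per-container approach costs the daemon roughly one file descriptor and three goroutines for every container being watched.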

3. This PR guarantees that the client always gets all the stats numbers.

client1: (from this PR)

# docker stats
<part of results...>
2cb984ad9837        0.00%               52 KiB / 15.57 GiB    0.00%               0 B / 0 B           0 B / 0 B           1
2cd3f12491c7        0.00%               52 KiB / 15.57 GiB    0.00%               0 B / 0 B           0 B / 0 B           1
2cf77cf17545        0.00%               52 KiB / 15.57 GiB    0.00%               0 B / 0 B           0 B / 0 B           1
2d7fb0aa0aa4        0.00%               56 KiB / 15.57 GiB    0.00%               0 B / 0 B           0 B / 0 B           1
2db85a985917        0.00%               52 KiB / 15.57 GiB    0.00%               0 B / 0 B           0 B / 0 B           1

client2:(from upstream)

# docker stats
<part of result...>
543b3823e24c        0.00%               56 KiB / 15.57 GiB   0.00%               0 B / 0 B           0 B / 0 B           1
0330db77418f        0.00%               60 KiB / 15.57 GiB   0.00%               0 B / 0 B           0 B / 0 B           1
5550702b0eef        --                  -- / --              --                  -- / --             -- / --             --
5f5b6429d7ca        --                  -- / --              --                  -- / --             -- / --             --
20b200f2ddf6        --                  -- / --              --                  -- / --             -- / --             --

This is because the old client tries to get data for containers one by one and can't guarantee that it has correct numbers at display time. The client from this PR makes only one request, and the daemon batches all the data and sends it to the client at once, which guarantees that the client always gets all the data without having to care about synchronization.

That's my whole point, correct me if I'm wrong @rileysaurus 😄

Also ping @docker-maintainers, please tell me what you think of this change.

@thaJeztah
Member

In general, I'm +1 for this PR; I think this would also make it easier to (for example) implement a --filter for stats to only show a limited set of containers, without introducing a race between client obtaining containers to filter, and requesting those containers (but that's another topic)

w.r.t. design, I think we need to have a look at the endpoint (#25361 (comment)) to prevent future conflicts. All of /containers/-/stats, /containers/.all/stats or /stats/containers/ could work; I have no real preference at this moment, so better suggestions are welcome.

@thaJeztah
Member

ping @bfirsh @WeiZhang555 any preference for one of those options?

@WeiZhang555
Contributor Author

WeiZhang555 commented Aug 19, 2016

I vote for GET /allcontainers/stats or GET /all/containers/stats; then if we need another consolidated API for (say) images, we can add /allimages/xxx or /all/images/xxx, also /allservices/xxx, /all/services/, and so on.

I'm not really used to the special-name format of /containers/-/stats or /containers/.all/stats; does any popular website apply this rule/format widely?

What do you think? @thaJeztah @bfirsh @justincormack

@bfirsh
Contributor

bfirsh commented Aug 19, 2016

I can't think of any examples of this off the top of my head, but we do have some pretty clear conventions in place that we can follow.

I am -1 on /all/containers/stats because it breaks the /resource/id/verb convention in a really confusing way.

I am -0 about /allcontainers/stats because it also breaks that convention, but not as much and not in a confusing way.

I prefer /containers/SOMETHING/stats. No strong feelings about what SOMETHING is, but I think a character that can't be in a container name is better than having reserved container names.
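
To make the "character that can't be in a container name" point concrete, here is a small Python check of the candidates raised in this thread. It assumes Docker restricts container names to the pattern in the code; that pattern is recalled from the Docker source rather than stated in this conversation, so treat it as an assumption. The helper name is hypothetical.

```python
import re

# Assumption: Docker validates container names against this pattern
# (first character alphanumeric, followed by one or more of [a-zA-Z0-9_.-]).
NAME_PATTERN = re.compile(r"^[a-zA-Z0-9][a-zA-Z0-9_.-]+$")


def could_be_container_name(segment):
    """Return True if a URL path segment is also a legal container name."""
    return NAME_PATTERN.match(segment) is not None


# Path segments proposed in this thread for "all containers".
for candidate in ["stats", ".all", "-", "*"]:
    print(candidate, could_be_container_name(candidate))
```

Under that assumption, "stats" is a legal container name, so /containers/stats would turn it into a reserved name, while ".all", "-" and "*" can never collide with a real container.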

@WeiZhang555
Contributor Author

@bfirsh Sounds reasonable. Then it looks like GET /containers/.all/stats is the right option.

Any more suggestions from anyone?

@thaJeztah
Member

I don't expect multiple entities to be included in a single stats stream (e.g. containers, and services in the same stream), so having it under /containers (for now) seems like the best thing to do.

I can't really put my finger on why, but I'm not fond of .all. Perhaps it's because .all may put too much meaning into it; - may be slightly clearer to indicate "URL component omitted", so if at some point we have "deeper" URL structures, the same can be used, e.g. (just for illustration):

/services/-/tasks/-/stats
/services/*/tasks/*/stats

It looks like * is actually allowed in URLs, see http://www.w3.org/2002/11/dbooth-names/rfc2396-numbered_clean.htm (line 460), so it could be a candidate (unless we're worried about people using it on the command line and running into issues with the shell trying to expand it).

@WeiZhang555
Contributor Author

WeiZhang555 commented Aug 21, 2016

The reason I picked GET /containers/stats is that it's consistent with existing APIs; for example, we have /images/json and also /images/{name: .*}/json. As we still don't have an agreed plan for a cleaner consolidated-stats API, I would say we keep using /containers/stats for consistency.

I've looked into #25015. The suggestion is good, but there are some hard backward-compatibility problems, so I don't expect the API renaming to take place soon. Considering this, I'll re-propose /containers/stats.

Involving @calavera in the API name discussion: I think we need his approval to merge the new API into engine-api, so I'd like to hear your suggestions early. ping @calavera

@cpuguy83
Member

I don't think a new endpoint at /container/stats is a good idea.

@duglin
Contributor

duglin commented Jan 15, 2017

Overall it seems good. One question, if you look at the output you've shown (e.g. #25361 (comment) ) you'll see that some containers have -- instead of 0 for the data. Before this PR it was 0 so this will break existing clients. Was this change discussed?

@WeiZhang555
Contributor Author

WeiZhang555 commented Jan 16, 2017

@duglin Thanks 😄

I don't think that's the case. -- indicates an error or a timeout when fetching stats. Before this PR, you often get -- results when fetching stats for lots of containers; after applying these commits, you get the right data or 0. See #25361 (comment) for a client comparison.

Note: only part of the client changed: docker stats uses the new API, but docker stats `docker ps -q` still uses the old API. I hope to move the latter to the new API too, but that requires more discussion; I can do that later in a follow-up PR after this is accepted.

I'll do another rebase to resolve conflicts.

@duglin
Contributor

duglin commented Jan 16, 2017

When I run the old stuff I don't see any --'s, but with this PR I get a lot of them. That doesn't seem right.

@WeiZhang555
Contributor Author

WeiZhang555 commented Jan 16, 2017

@duglin Weird, I get the opposite result. How did you test it? Maybe I'm missing something...

This is my test result with 1000 running containers (run top in busybox):

Test with master branch # docker stats:
[screenshot 1, not preserved]

Test with this PR # docker stats:
[screenshot 2, not preserved]

The result of # docker stats is what I expected. But docker stats `docker ps -aq` looks like a mess; maybe you are testing with this command? I need to fix this.

@duglin
Contributor

duglin commented Jan 16, 2017

Perhaps it's related to the stopped containers. For me, the old stuff shows me:

# docker stats -a --no-stream
CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O
056af2957acb        0.00%               0 B / 0 B           0.00%               0 B / 0 B           0 B / 0 B
059f67920fa9        0.00%               0 B / 0 B           0.00%               0 B / 0 B           0 B / 0 B
1b012c67dd16        0.00%               0 B / 0 B           0.00%               0 B / 0 B           0 B / 0 B
...

and the new stuff shows me:

# docker stats -a --no-stream
CONTAINER           CPU %               MEM USAGE / LIMIT   MEM %               NET I/O             BLOCK I/O           PIDS
20005af60f0d        --                  -- / --             --                  --                  --                  --
f5648bd8d226        --                  -- / --             --                  --                  --                  --
91ab3be3134a        --                  -- / --             --                  --                  --                  --
829d1821d0b9        --                  -- / --             --                  --                  --                  --
...

All zeros vs all --'s

@WeiZhang555
Contributor Author

@duglin I see, you are right 👍 I'll fix it during the rebasing~

@thaJeztah
Member

@WeiZhang555 one thing I just realised we should take into account is that, starting with docker 1.13, the client is able to talk to older daemon versions; that means that when talking to a daemon that does not support the "all" stats endpoint, it should still use the old approach.
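
The fallback could be decided by comparing the daemon's API version against the first version that exposes the bulk endpoint. A minimal Python sketch, assuming a hypothetical cut-off of "1.26" (the thread does not say which API version would carry the endpoint) and hypothetical function names:

```python
def parse_version(v):
    """Turn an API version string such as "1.25" into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))


# Hypothetical: the first API version that would expose the bulk endpoint.
BULK_STATS_MIN_API = "1.26"


def stats_strategy(daemon_api_version, min_version=BULK_STATS_MIN_API):
    """Pick how the client should collect stats from this daemon."""
    if parse_version(daemon_api_version) >= parse_version(min_version):
        return "bulk"           # one request for all containers
    return "per-container"      # old approach: one request per container


print(stats_strategy("1.25"))
print(stats_strategy("1.30"))
```

Comparing integer tuples rather than raw strings matters here: as strings, "1.9" would sort after "1.26".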

Add a new API that returns stats for all containers, and refactor the docker
client to make use of it.

This commit has some obvious benefits:

1. It saves many TCP connections when fetching stats for a large number of containers.
2. The client side is expected to perform better.
3. The client code is simpler and easier to maintain.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>

This commit does several things:

1. Adds filter support to the `/containers/-/stats` API. We reuse the `ps`
filters for this, so the container stats API currently supports all
filters from the `ps` command.
2. `docker stats --all` binds to the default `/containers/-/stats`
without filters.
3. `docker stats` is implemented on top of `/containers/-/stats` with the
filter `status=running`.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
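
As a rough illustration of the commit above, a client could build the request URL like this. The sketch assumes the proposed `/containers/-/stats` endpoint takes the same JSON-encoded `filters` query parameter that the container list endpoint uses; the commit message only says the `ps` filters are reused, so the exact wire format is an assumption, and `stats_url` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode, parse_qs, urlsplit


def stats_url(filters=None, base="/containers/-/stats"):
    """Build the request path for the proposed bulk stats endpoint.

    `filters` maps a filter name to a list of values, e.g.
    {"status": ["running"]}; it is sent JSON-encoded (an assumption).
    """
    if not filters:
        return base  # `docker stats --all`: no filters
    query = urlencode({"filters": json.dumps(filters, sort_keys=True)})
    return base + "?" + query


print(stats_url())                           # all containers
print(stats_url({"status": ["running"]}))    # plain `docker stats`
```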

This commit redirects the `/containers/{name:*}/stats` API to the handler of
`/containers/-/stats`, so fetching stats for one container and for all
containers is now handled by the same backend function.
To fetch stats for one container, we first get the stats data for all
containers, then filter by name to get the specific one.

This also fixes a concurrent map write bug.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
Signed-off-by: Zhang Wei <zhangwei555@huawei.com>

With the new `/containers/{name:*}/stats` handler, one Windows test case
hangs; this commit fixes the hang.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>

For consistency, we need to show "0" instead of "--" for
non-running containers.

Signed-off-by: Zhang Wei <zhangwei555@huawei.com>
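
The display rule agreed on in the discussion (zeros for stopped containers, `--` only when stats could not be fetched) can be sketched as follows. The `sample` dictionary and its keys are hypothetical and are not the client's real data structure.

```python
def render_cpu(sample):
    """Format the CPU % column for one container.

    `sample` is a hypothetical dict with optional keys:
    "error" (bool), "running" (bool), "cpu_percent" (float).
    """
    if sample.get("error"):
        return "--"        # fetching stats failed or timed out
    if not sample.get("running"):
        return "0.00%"     # stopped container: show zero, as the old client did
    return "{:.2f}%".format(sample["cpu_percent"])


print(render_cpu({"error": True}))
print(render_cpu({"running": False}))
print(render_cpu({"running": True, "cpu_percent": 6.0}))
```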
@WeiZhang555
Contributor Author

WeiZhang555 commented Jan 18, 2017

@duglin Your issue is addressed.

@thaJeztah Dammit, this is really a terrible problem. I can fix it, but it requires lots of work and lots of duplicated code. I'm afraid I need to refactor again, and the client implementation will be ugly afterwards. 😢

@cpuguy83
Member

Ok, so I have to say I've pretty much never been +1 for this, just a -0.... basically I see the need, but I don't like the design and can't see anything better.

Is this solving a real, urgent problem? I think stats in general has a ton of room for improvement as even the stats collection is extremely inefficient. So is this really improving things?

At some point we'll have container stats exposed in the metrics API, which I think is much more suitable for this kind of thing.

I think adding this will lead to tremendous regret later (even for reasons beyond the metrics API).
Because of that I would have to say I'm pretty firmly -1 on this.

@stevvooe
Contributor

I'm against expanding the scope of the stats API. It is very inefficient and I am not sure it could be streamlined. I'd prefer we focus our efforts on expanding Prometheus support.

@LK4D4
Contributor

LK4D4 commented Jan 27, 2017

I see that the overall decision is -1 now. Bringing in a lot of complexity doesn't sound very cool (for stats in particular). Let's start with refactoring and maybe return to this later.
@WeiZhang555 Thanks for your work! We appreciate it.

@aryamsft

Bumping this issue. Has this issue been revisited?

There is a feature I'm working on where I need to pull stats for all the containers. I would rather not make a RESTful call for each and every container, as there could be hundreds of containers running on a particular machine.

Can we add a parameter to the RESTful call for pulling all container stats (or add a new RESTful API)? I'm sure this would be useful for other people, as I don't think I'm the only person running into this issue.

@aryamsft

@thaJeztah ping on this?

@thaJeztah
Member

^^ conversation related to the above ongoing in #22052

leweafan pushed a commit to leweafan/beats that referenced this pull request Apr 28, 2023
Currently fetching container stats is very slow, as each request takes up to 2 seconds. To improve the fetching time when lots of containers are around, this creates the requests in parallel. The main downside is that this opens lots of connections. This fix should only be temporary until the bulk API is available: moby/moby#25361
(cherry picked from commit f1ecd31)
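
The parallel workaround described in that commit can be sketched in Python as below. This is not the referenced project's code: `fetch_all`, `fake_fetch`, and the worker count are made up for the example, and the real per-container HTTP request is replaced by a stub so the sketch runs without a daemon.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(container_ids, fetch_one, max_workers=16):
    """Fetch stats for every container concurrently.

    Returns {container_id: stats}. A container whose fetch raised is mapped
    to None, which a client could render as "--".
    """
    def safe(cid):
        try:
            return cid, fetch_one(cid)
        except Exception:
            return cid, None

    # One in-flight request per worker: faster than fetching serially, but
    # it still costs one connection per container being fetched.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(safe, container_ids))


def fake_fetch(cid):
    """Stand-in for the real per-container stats request."""
    if cid == "bad":
        raise RuntimeError("timeout")
    return {"id": cid, "cpu_percent": 0.0}


print(fetch_all(["a", "b", "bad"], fake_fetch))
```

Capping `max_workers` bounds the number of simultaneous connections, trading some latency for less load on the daemon.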
Successfully merging this pull request may close these issues.

Docker stats for all container