Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docker top fails when large PIDs exceed 5 figures #34282

Closed
mbentley opened this issue Jul 27, 2017 · 32 comments · May be fixed by #36621
Closed

docker top fails when large PIDs exceed 5 figures #34282

mbentley opened this issue Jul 27, 2017 · 32 comments · May be fixed by #36621
Labels
area/daemon kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/claimed version/17.06

Comments

@mbentley
Copy link
Contributor

mbentley commented Jul 27, 2017

Description
On a system that supports greater than 5 figures, docker top on the container will fail due to what appears to be a parsing failure of the returned values:

docker run -it --rm --name health health
$ docker top health
Error response from daemon: Unexpected pid '103005root': strconv.Atoi: parsing "103005root": invalid syntax
time="2017-07-27T07:50:58.385629919-04:00" level=error msg="Handler for GET /v1.30/containers/health/top returned error: Unexpected pid '103005root': strconv.Atoi: parsing \"103005root\": invalid syntax"

Steps to reproduce the issue:

  1. Check pid_max to see if your kernel supports a 6 figures number:

    $ cat /proc/sys/kernel/pid_max
    131072
    # if not, change it:
    $ echo 131072 > /proc/sys/kernel/pid_max
    
  2. Run enough processes to get to large PIDs; I'd expect some sort of for loop should do this pretty quickly

  3. Attempt to run docker top:
    Docker client output:

    $ docker top health
    Error response from daemon: Unexpected pid '103005root': strconv.Atoi: parsing "103005root": invalid syntax
    

Describe the results you received:
Did not give me top output

Describe the results you expected:
I should get docker top output

Additional information you deem important (e.g. issue happens only occasionally):

Output of docker version:

$ docker version
Client:
 Version:      17.06.0-ce
 API version:  1.30
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:15:15 2017
 OS/Arch:      linux/amd64

Server:
 Version:      17.06.0-ce
 API version:  1.30 (minimum version 1.12)
 Go version:   go1.8.3
 Git commit:   02c1d87
 Built:        Fri Jun 23 21:51:55 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker info:

$ docker info
Containers: 0
 Running: 0
 Paused: 0
 Stopped: 0
Images: 290
Server Version: 17.06.0-ce
Storage Driver: overlay2
 Backing Filesystem: extfs
 Supports d_type: true
 Native Overlay Diff: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: cfb82a876ecc11b5ca0977d1733adbe58599088a
runc version: 2d41c047c83e09a6d61d464906feb2a2f3c52aa4
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.33
Operating System: Alpine Linux v3.6
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 1.94GiB
Name: alpine
ID: ANNE:4FUU:XCPW:GM7X:3OBN:E4KZ:VY63:ZB6H:NS3W:MAOI:54XZ:RBHL
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: mbentley
Registry: https://index.docker.io/v1/
Labels:
 foo=bar
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Additional environment details (AWS, VirtualBox, physical, etc.):
VMware Fusion VM

@allencloud
Copy link
Contributor

Actually I tried to reproduce this issue, but I failed with the following output:

root@ubuntu:~# docker run -d ubuntu:14.04 sleep 1000000
84ed0462e5fa727a54e997f85099151c89dea90d97f72401775d40272f45d76b
root@ubuntu:~# docker top 84ed0462
UID                 PID                 PPID                C                   STIME               TTY                 TIME                CMD
root                103771              103751              0                   17:36               ?                   00:00:00            sleep 1000000

The PID is 103771.

And I am wondering why your data is 103005root with a root in it. This is something I cannot understand. @mbentley

@mbentley
Copy link
Contributor Author

I'm not really sure. I can see this by hitting the API directly:

When I first start my system where the PIDs are below 6 digits:

$ curl -s --unix-socket /var/run/docker.sock "http:/v1.30/containers/718/top" | jq .
{
  "Processes": [
    [
      "5080",
      "root",
      "0:00",
      "nginx: master process nginx -g daemon off;"
    ],
    [
      "5155",
      "openvpn",
      "0:00",
      "nginx: worker process"
    ]
  ],
  "Titles": [
    "PID",
    "USER",
    "TIME",
    "COMMAND"
  ]
}

After running a quick for loop to increase the pids:

$ curl -s --unix-socket /var/run/docker.sock "http:/v1.30/containers/718/top" | jq .
{
  "message": "Unexpected pid '118604mbentley': strconv.Atoi: parsing \"118604mbentley\": invalid syntax"
}

I'm not sure how Docker is exactly getting the list of PIDs but I'm assuming that whatever my system is returning is different than what you're seeing.

@thaJeztah
Copy link
Member

A quick update, if someone wants to look into this (don't have time myself at this moment);

@mbentley sent me the output of ps -ef on the host, which is what docker is running behind the scenes to collect the information (see daemon/top_unix.go#L118-L155, and daemon/top_unix.go#L61-L111)

PID   USER     TIME   COMMAND
    1 root       0:00 /sbin/init
    2 root       0:00 [kthreadd]

...

100676root       0:00 [kworker/0:1]
100785root       0:00 [kworker/0:3]
100787root       0:00 docker-containerd-shim 605da7454597d536380f61dabc9e6d5fc9fbf99a65c9357140035ab46b350a0a /var/run/docker/libcontainerd/605da7454597d536380f61dabc9e6d5fc9fbf99a65c9357140035ab46b350a0a docker-runc
100805root       0:00 nginx: master process nginx -g daemon off;
100848openvpn    0:00 nginx: worker process
100998mbentley   0:00 ps -ef

The problem here is that that output (at least by default) glues together the PID and USER columns, therefore docker fails to split them (which causes the error).

I did a quick search if there's a portable way to change the format, so that we can guarantee that the columns are always separated. This ServerFault discussion provides some possible options; https://serverfault.com/a/157618

However, if there's an option, we need to be sure it's supported on all platforms/distros, given that docker shells out to the ps binary on the host, and we don't know what that ps binary supports on each distro.

The docker top command is known to be problematic for the same reason, as the output (and options) are not standardised; ideally docker wouldn't shell out, and obtain the same information other ways, but that will be a bigger change.

@thaJeztah thaJeztah added area/daemon kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. labels Aug 17, 2017
@dccurtis
Copy link
Contributor

@thaJeztah I can take a look at this. I just reproduced the problem.

@thaJeztah
Copy link
Member

@dccurtis thanks! I'm currently on vacation myself and limited internet access, so feel free to work on it 👍

@dccurtis
Copy link
Contributor

Quick update on this.

I can reproduce the problem with functional and unit testing. I've spent some time experimenting with changing the output format of the ps command with using the -o option but haven't found a way that works (change order of columns, change width, etc) for all host OSs.

If there is no universal way to re-fromat the host 'ps' output to avoid this problem then the other way to fix this is to detect the [PID][USER] problem in parsePSOutput and fix the line before fieldsASCII(line) is called.

Detecting the problem is easy enough: if len(fieldsASCII(line)) != len(procList.Titles) By doing this before altering the ps line then the change would only come into affect when the host OS has PID-USER concatenation issue. So far it's only known on Alpine linux but theres no reason to think it's limited to Alpine. This should mitigate some risk on regressing on other distros.

I'm going to start working on the change to add a space between the two columns. If anyone sees any issues with this approach please let me know.

@yummypeng
Copy link
Contributor

@dccurtis How do you split [PID] and [USER] ? Take 118604mbentley for example, how do you judge which part is PID or which part is USER?

@dccurtis
Copy link
Contributor

Yes, this is where things could get complicated.

Assumptions:

  • PID is always digits
  • We can't guarantee which column will be after PID. Next column could start with a digit.
    -ps -o pid so far seems to be dependable way to get the pids on the few distros I've tried so far.

For this case, the delimiter withUSER can be found with unicode.IsLetter(). That doesn't help if the user specified arguments that would have values starting with a digit for the column after PID (for example TIME, CPU, etc).

One idea that I'm still working with is that ps -o pid will return only the pids. This will give a something to compare the value at fieldsASCII(line)[pidIndex] to. If there is a match from ps -o pid with whats in the broken line we will be able to figure out at what index the PID field ends.

@dccurtis
Copy link
Contributor

While I would have much preferred to handle this problem to correct the PID field regardless of what the next column is, I don't think there is a practical way to accomplish this. I haven't been able to find a deterministic way of knowing where the PID ends and the next column's field begins when the next column starts with a number.

Instead, I'm focusing on this specific problem, when the next column starts with a letter and the two columns are concatenated. We will always know that this is wrong and can find the delimitation point since the PID will always be a numeric value.

I will have a PR out soon to address this.

@jinkai
Copy link

jinkai commented Mar 6, 2018

Are there any updates for this? I ran into this issue a couple days ago with my Jenkins Pipeline using Docker set up. One of the plugins being used is executing docker top 9a10de4a083e -eo pid,comm and failing on the PID, shown as 111612slirp-proxy. I'm running the latest versions of these plugins. There's no workaround that I've been able to find.

@dccurtis
Copy link
Contributor

dccurtis commented Mar 6, 2018

Sorry for the delay on this. I’ll sync my branch and retest my fix. If all goes well I’ll have it out for review soon.

@dccurtis
Copy link
Contributor

dccurtis commented Mar 8, 2018

I'm wrapping up the re-running my functional test but am working through some issues I'm having around running Docker on Alpine linux. Once I complete the testing I'll get this out for review.

One caveat about my change. I can only correct PID values where the next column starts with a non-numeric character. If the PID+1 column value starts with a number we resort to the pre-existing behavior and it errors out.

I couldn't find a generic way of getting only the PID values for all hosts (linux distributions). If anyone has any ideas on how to do that please let me know. That way we could always figure out the correct values for PID and PID+1.

@Hazok
Copy link

Hazok commented Apr 29, 2018

Two quick questions:

  1. Is there a target release for this fix? I'm hitting this issue regularly with Docker for Windows 18.03.1-ce-win64 (17438) using the official Jenkins image (i.e. jenkins/jenkins:lts-slim) when leveraging Docker containers in the pipeline.
  2. For a "universal" solution, did you look into using the position of the "USER" or "UID" label in the column headers to determine where to cutoff the PID? I'm not sure if it is the same for all distros, but seems like in most that I've used the label and values are usually both left-aligned with the same starting horizontal position.

@thaJeztah
Copy link
Member

For a "universal" solution, did you look into using the position of the "USER" or "UID" label in the column headers to determine where to cutoff the PID? I'm not sure if it is the same for all distros, but seems like in most that I've used the label and values are usually both left-aligned with the same starting horizontal position.

I think that may actually be a good solution; one other option that may work is to customize the COLUMN-headers; not all variations of ps seem to support -o column:<width>, but custom names seem to be supported (even by busybox); from https://unix.stackexchange.com/a/313470

Column width will increase as needed for wide headers; this may be used to widen up columns such as WCHAN (ps -o pid,wchan=WIDE-WCHAN-COLUMN -o comm)

So doing something like;

 ps -o pid=PID______________________ -o user=USER______________________ -o time -o comm

Will make the columns wider, accommodating the underscores;

PID______________________ USER______________________ TIME   COMMAND
                        1 root                         0:00 sh
                        9 root                         0:00 ps

@thxmasj
Copy link

thxmasj commented Jun 28, 2018

Seems like Jenkins on Docker for Windows is the use case that triggers this bug for most people, according to the comments here. I experience the same and it is a pain. Would love to see some priority put into this.

@dccurtis
Copy link
Contributor

I apologize for my silence. I haven't had a lot of cycles lately. :(. I'll dust off my old branch and see if I can incorporate what @thaJeztah suggested, and give it a try on a few distos as a sanity check.

If anyone has cycles and experience on the Windows side please feel free to jump in. I don't want the fact that I have a PR out signal to others that this issue is taken. Happy to help; but don't want to turn away others.

@thaJeztah
Copy link
Member

For "Docker for Windows" in this case, it's likely "Linux" containers on Docker for Windows (which run in a LinuxKit VM and use busybox, so no Windows-specific fox should be needed for that)

@katanacrimson
Copy link

Until there's a proper fix for the parsing in place, can't the Linux VM (mobylinuxvm) just set (on boot) /proc/sys/kernel/pid_max to 99999 each time to avoid this issue? I've been doing this by hand, it does work but the fix has to be manually reapplied every time the host or the VM restarts.

@thaJeztah
Copy link
Member

@rn ^^ would that be an option?

@rn
Copy link
Member

rn commented Jul 14, 2018

Weirdly for Docker for Mac pid_max defaults to 32768 while on Windows it defaults to 131072.

I've created a PR to set the limit to 99999 for both. Not sure when this will land (ie which release will pick it up)

@natalie-o-perret
Copy link

natalie-o-perret commented Jul 14, 2018

On windows...

docker container run --detach --publish 8080:80 --name david nginx
b4cf3526c28bb179f57ad331db47e7269d1308ce58bd76aea225574c8f3be3ca

docker container top walter
Error response from daemon: Unexpected pid '100404root': strconv.Atoi: parsing "100404root": invalid syntax

@rn pidmax to 131072 on Window? How come the atoi call failed then? Is it antoher issue?

On Docker playground which runs against ubuntu, the same stuff as above works just fine =/

@thaJeztah
Copy link
Member

It's very likely due to a bug/presentation issue in busybox; docker uses the output of ps on the host (#34282 (comment)), which has this issue if the host is using busybox (which is the case on docker for mac/linux), but not if it's running on Ubuntu

@rn
Copy link
Member

rn commented Jul 15, 2018

@thaJeztah we could also install procps if that version of ps does not have the issue

@katanacrimson
Copy link

katanacrimson commented Jul 20, 2018

For the time being - those who wish to work around it much like I had, this should work...

Start a new privileged throwaway container:

docker run --privileged --rm -it ubuntu

Identify if your pid_max is over five digits long:

cat /proc/sys/kernel/pid_max

Change it to 99999:

echo 99999 > /proc/sys/kernel/pid_max

This should allow use cases, such as Jenkins running via Docker-in-Docker (on Windows) to work reliably afterwards.

@rn
Copy link
Member

rn commented Jul 20, 2018

Todays Docker Desktop Edge release should have the fix. Should also be in the next stable release

@thaJeztah
Copy link
Member

Thanks Rolf!

@natalie-o-perret
Copy link

@rn how long before the next stable release?

@ijc
Copy link
Contributor

ijc commented Jul 25, 2018

Is there some way we can just avoid the fragile parsing of the ps output, using something like https://github.com/prometheus/procfs/ or some other suitalble library perhaps?

@thaJeztah
Copy link
Member

@ijc yes; that's been discussed a few times; i.e., don't use an external binary, but implement our own way to get this information.

There's one issue with that, and that's that currently the command accepts options to be passed to ps (bad choice from the past, as which options work, and what effect they have, is also dependent on "what version of ps is available on the host)

@rn
Copy link
Member

rn commented Jul 25, 2018

@ehouarn-perret earlier today, i think

@katanacrimson
Copy link

Workaround is going smoothly currently. Thanks for getting that in place @rn!

@mbentley
Copy link
Contributor Author

mbentley commented Jul 7, 2021

I'm guessing this is fixed by now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/claimed version/17.06
Projects
None yet
Development

Successfully merging a pull request may close this issue.