
Ghost container can't be stopped #30927

Closed
Foorack opened this issue Feb 11, 2017 · 21 comments


Foorack commented Feb 11, 2017

Hello,

Basically, I have got a "ghost" container, if that is the right word. It is completely unstoppable: neither docker-compose stop nor docker-compose kill works. They both say "done", but docker-compose ps says the service is still up. The service is an nginx container binding to ports 80 and 443.
docker stop also says it stopped the container, but docker ps says it is still running. Same with docker kill. docker-compose top throws a TypeError when it reaches the problematic container.

Traceback (most recent call last):
  File "bin/docker-compose", line 3, in <module>
  File "compose/cli/main.py", line 88, in main
  File "compose/cli/main.py", line 140, in perform_command
  File "compose/cli/main.py", line 825, in top
TypeError: 'NoneType' object is not iterable
Failed to execute script docker-compose
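A quick way to cross-check the daemon's recorded state against the actual process table (a minimal sketch; the container name comes from the compose project shown further down):

$ docker inspect -f '{{.State.Status}} {{.State.Pid}}' ckdc_nginx_1
$ ps -p <pid printed above> -o pid,ppid,cmd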

Steps to reproduce the issue:
It is unknown how to reproduce this, as I don't know what caused it. The container is a normal nginx:1.11.8-alpine that had been running for 2 days with a normal configuration.

Describe the results you received:
A shim process with no child process (the shim process for container 0585...). Every other container's shim has a child process except the problematic one, which suggests it might be stuck in a frozen state.

[screenshot of the process list]

Describe the results you expected:
I expected the container to be stopped... 🤔

Additional information you deem important (e.g. issue happens only occasionally):
The issue happens completely at random. I have read that something similar happened in 2015, and the common theme seems to be that both involve network-heavy processes. docker-compose logs and docker logs just get stuck without printing anything.

Output of docker version:

Client:
 Version:      1.13.1
 API version:  1.26
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:57:21 2017
 OS/Arch:      linux/amd64

Server:
 Version:      1.13.1
 API version:  1.26 (minimum version 1.12)
 Go version:   go1.7.5
 Git commit:   092cba3
 Built:        Wed Feb  8 06:57:21 2017
 OS/Arch:      linux/amd64
 Experimental: false

Output of docker-compose version:

docker-compose version 1.11.1, build 7c5d5e4
docker-py version: 2.0.2
CPython version: 2.7.13
OpenSSL version: OpenSSL 1.0.1t  3 May 2016

Output of docker info:

Containers: 12
 Running: 10
 Paused: 0
 Stopped: 2
Images: 13
Server Version: 1.13.1
Storage Driver: aufs
 Root Dir: /var/lib/docker/aufs
 Backing Filesystem: extfs
 Dirs: 132
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins: 
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: aa8187dbd3b7ad67d8e5e3a15115d3eef43a7ed1
runc version: 9df8b306d01f59d3a8029be411de015b7304dd8f
init version: 949e6fa
Security Options:
 apparmor
 seccomp
  Profile: default
Kernel Version: 4.8.0-37-generic
Operating System: Ubuntu 16.10
OSType: linux
Architecture: x86_64
CPUs: 8
Total Memory: 31.12 GiB
Name: ns*******.ip-***-**-**.eu
ID: WXWS:RBMK:E24V:WOHE:YQTF:USGR:RSDV:SW6C:5B3H:PRQP:DLR5:3WCY
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Username: foorack
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):
Physical box hosted at OVH (SoYouStart).


Foorack commented Feb 11, 2017

Could this fix itself if I kill the problematic shim process, or will that make the main docker service go crazy? I have seen a few people write "just reboot the machine" or "just restart the docker service"; however, there are a few database containers running, and taking them down would cause complete downtime.


Foorack commented Feb 11, 2017

Additional information, don't know if it is helpful or not:
[screenshot of the shim process and its threads]
The shim process doesn't have a normal child process like the other containers do, but there are a lot of docker-containe userland threads running under the shim process. The config.v2.json also says the state is "Running": true even though the nginx process is non-existent.

Output of cat /var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/config.v2.json:

{  
  "StreamConfig":{  

  },
  "State":{  
    "Running":true,
    "Paused":false,
    "Restarting":false,
    "OOMKilled":false,
    "RemovalInProgress":false,
    "Dead":false,
    "Pid":9628,
    "ExitCode":0,
    "Error":"",
    "StartedAt":"2017-02-10T08:19:37.822843356Z",
    "FinishedAt":"0001-01-01T00:00:00Z",
    "Health":null
  },
  "ID":"0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67",
  "Created":"2017-02-10T08:19:37.415973424Z",
  "Managed":false,
  "Path":"nginx",
  "Args":[  
    "-g",
    "daemon off;"
  ],
  "Config":{  
    "Hostname":"0589585b7d9e",
    "Domainname":"",
    "User":"",
    "AttachStdin":false,
    "AttachStdout":false,
    "AttachStderr":false,
    "ExposedPorts":{  
      "443/tcp":{  

      },
      "80/tcp":{  

      }
    },
    "Tty":false,
    "OpenStdin":false,
    "StdinOnce":false,
    "Env":[  
      "PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin",
      "NGINX_VERSION=1.11.8"
    ],
    "Cmd":[  
      "nginx",
      "-g",
      "daemon off;"
    ],
    "ArgsEscaped":true,
    "Image":"nginx:1.11.8-alpine",
    "Volumes":{  
      "/etc/letsencrypt":{  

      },
      "/etc/nginx/nginx.conf":{  

      },
      "/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web":{  

      },
      "/var/www":{  

      }
    },
    "WorkingDir":"",
    "Entrypoint":null,
    "OnBuild":null,
    "Labels":{  
      "com.docker.compose.config-hash":"07911b3e1f03e79c3ec63637fd9d1618d8f48483be05bc820dc287ea254038da",
      "com.docker.compose.container-number":"1",
      "com.docker.compose.oneoff":"False",
      "com.docker.compose.project":"ckdc",
      "com.docker.compose.service":"nginx",
      "com.docker.compose.version":"1.10.0"
    }
  },
  "Image":"sha256:f1fae62370b48b33cbe68e0c7f251194ff7566587cdcb406336484a9b35824ed",
  "NetworkSettings":{  
    "Bridge":"",
    "SandboxID":"e9e14f101ab1753ff481e86911aa332ee38e768c2681838003c1488c31955e08",
    "HairpinMode":false,
    "LinkLocalIPv6Address":"",
    "LinkLocalIPv6PrefixLen":0,
    "Networks":{  
      "ckdc_website":{  
        "IPAMConfig":null,
        "Links":null,
        "Aliases":[  
          "nginx",
          "0589585b7d9e"
        ],
        "NetworkID":"127c536f22632a3713a3b1790e758edf68e1851bc2d32a9e4b6370bfa1a5732f",
        "EndpointID":"0ae734c2c97449b40c73891a3db085f2cd623a3b813dc39573586d2742f3669f",
        "Gateway":"172.19.0.1",
        "IPAddress":"172.19.0.7",
        "IPPrefixLen":16,
        "IPv6Gateway":"",
        "GlobalIPv6Address":"",
        "GlobalIPv6PrefixLen":0,
        "MacAddress":"02:42:ac:13:00:07",
        "IPAMOperational":false
      }
    },
    "Service":null,
    "Ports":{  
      "443/tcp":[  
        {  
          "HostIp":"0.0.0.0",
          "HostPort":"443"
        }
      ],
      "80/tcp":[  
        {  
          "HostIp":"0.0.0.0",
          "HostPort":"80"
        }
      ]
    },
    "SandboxKey":"/var/run/docker/netns/e9e14f101ab1",
    "SecondaryIPAddresses":null,
    "SecondaryIPv6Addresses":null,
    "IsAnonymousEndpoint":false,
    "HasSwarmEndpoint":false
  },
  "LogPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67-json.log",
  "Name":"/ckdc_nginx_1",
  "Driver":"aufs",
  "MountLabel":"",
  "ProcessLabel":"",
  "RestartCount":0,
  "HasBeenStartedBefore":true,
  "HasBeenManuallyStopped":true,
  "MountPoints":{  
    "/etc/letsencrypt":{  
      "Source":"/opt/letsencrypt",
      "Destination":"/etc/letsencrypt",
      "RW":true,
      "Name":"",
      "Driver":"",
      "Type":"bind",
      "Relabel":"rw",
      "Spec":{  
        "Type":"bind",
        "Source":"/opt/letsencrypt",
        "Target":"/etc/letsencrypt"
      }
    },
    "/etc/nginx/nginx.conf":{  
      "Source":"/opt/nginx/nginx.conf",
      "Destination":"/etc/nginx/nginx.conf",
      "RW":false,
      "Name":"",
      "Driver":"",
      "Type":"bind",
      "Relabel":"ro",
      "Spec":{  
        "Type":"bind",
        "Source":"/opt/nginx/nginx.conf",
        "Target":"/etc/nginx/nginx.conf",
        "ReadOnly":true
      }
    },
    "/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web":{  
      "Source":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
      "Destination":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
      "RW":true,
      "Name":"",
      "Driver":"",
      "Type":"bind",
      "Relabel":"rw",
      "Spec":{  
        "Type":"bind",
        "Source":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web",
        "Target":"/home/xxxxxxxxxx/xxxxxxxxx/xxxxxxxxx/web"
      }
    },
    "/var/www":{  
      "Source":"/var/www",
      "Destination":"/var/www",
      "RW":true,
      "Name":"",
      "Driver":"",
      "Type":"bind",
      "Relabel":"rw",
      "Spec":{  
        "Type":"bind",
        "Source":"/var/www",
        "Target":"/var/www"
      }
    }
  },
  "SecretReferences":null,
  "AppArmorProfile":"",
  "HostnamePath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/hostname",
  "HostsPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/hosts",
  "ShmPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/shm",
  "ResolvConfPath":"/var/lib/docker/containers/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/resolv.conf",
  "SeccompProfile":"",
  "NoNewPrivileges":false
}

(The reason for all the mounted volumes is that we are transitioning to containerisation.)

I see now that the "Pid" in the config.v2.json file does not correspond to a process on the machine... There is no process running with PID 9628. 🤔
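A way to double-check whether anything is still running inside the container's cgroup (a sketch assuming the cgroupfs driver and default cgroup layout shown in docker info above):

$ cat /sys/fs/cgroup/memory/docker/0589585b7d9e4424a105db31ada41d77775f15454349e572abc2f44c8e1e1d67/cgroup.procs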

Foorack changed the title from "Impossible to stop ghost container" to "Ghost container can't be stopped" on Feb 11, 2017

Foorack commented Feb 11, 2017

Killing the shim process made the whole dockerd unresponsive. I took a quick bit of downtime and restarted the docker service. Everything was back within a minute and the ghost problem is now gone. I find this a dirty fix, but it works. Given the inability to reproduce the bug this can be closed; however, I still think the stack trace from docker-compose top might be worth looking into.


Foorack commented Feb 11, 2017

This just happened again. Nginx again. Can't stop container. Will try using the non-alpine version to see if that changes anything.


Foorack commented Feb 11, 2017

This seems to be a duplicate of #10589; however, that issue is closed and we are still experiencing the same problem.

@thaJeztah (Member) commented:

The linked issue has some similarities, but is definitely not the same (the whole runtime for Docker has been replaced since Docker 1.5).

What seems to have happened in your case is that either:

  • the container died for some reason, but the docker daemon did not receive the exit event, so the reported container status does not match the actual status, or
  • something went wrong when deleting the container, leaving only some parts of the container behind.

The docker-compose traceback won't help with debugging here, because that error is only a consequence of this situation. What could help narrow down what's happening:

  • the steps that led to this situation, if you know them (e.g. did you docker-compose up the application, or perform any other actions?)
  • daemon log files from around the time this happened (preferably with the daemon in "debug" mode, although be aware that this will result in a lot more logging); see the sketch below
  • logs of the faulty container, which may reveal why it quit (docker logs <container id>); if that fails, there could be json-log files inside the container's directory

/cc @mlaventure
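A minimal sketch of the debug-mode and json-log suggestions above (assuming the default /etc/docker/daemon.json location and the json-file logging driver shown in docker info):

# /etc/docker/daemon.json
{
  "debug": true
}

$ sudo kill -SIGHUP $(pidof dockerd)    # debug is one of the options dockerd reloads on SIGHUP

# raw json-file logs, readable even when `docker logs` hangs
$ sudo tail /var/lib/docker/containers/<container id>/<container id>-json.log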


Foorack commented Feb 11, 2017

Hello @thaJeztah ,

This has happened twice today, on a production machine. In both cases I had to restart the docker service, as this is the only known fix. Therefore I do not have any log files, and the docker logs command did not work, as explained below. These are the steps I took today:

  1. Created a docker-compose.yml file. Here is the nginx configuration:
  nginx:
    depends_on:
     - portainer
     - phpmyadmin
     - artifactory
     - nodebb
    image: nginx:1.11.8-alpine
    volumes:
     - /var/www:/var/www
     - /opt/nginx/nginx.conf:/etc/nginx/nginx.conf:ro
     - /opt/letsencrypt:/etc/letsencrypt
     - /home/**********/**********/**********/web:/home/**********/**********/**********/web
    ports:
     - "80:80"
     - "443:443"
    restart: always
    networks:
     - website
  2. This was tested on a dev machine to check that the compose file was working as expected. It did.
  3. The config was transferred to the production machine and I ran docker-compose -f /opt/ckdc/docker-compose.yml -p ckdc up -d.
  4. Everything ran perfectly, until a few hours later when the website just froze and didn't respond. The website serves plain HTML/JS/CSS (i.e. no server-side language). I tried docker-compose -f /opt/ckdc/docker-compose.yml -p ckdc logs nginx, however it got stuck at Attaching to ....
  5. I tried the normal docker command docker logs ckdc_nginx_1, but it didn't print anything and I had to cancel with CTRL+C.
  6. As docker was unable to get logs and the web server didn't respond, I tried to stop it and start it again. The frozen web server is strange in itself, but here it got even weirder. As reported above, neither docker-compose stop, docker-compose kill, docker stop nor docker kill worked. Docker said the container was stopped after all 4 commands, but docker ps showed it still running. Before I ran those 4 commands the docker-containerd-shim process did have the nginx subprocess; after I ran them the -shim process was still there but the nginx sub-process was gone. To me, this sounds like a "daemon did not get this signal" error.
  7. The first time this happened I ended up trying to kill just the shim process, but it caused the whole dockerd to go unresponsive, none of the docker commands worked and I had to restart the service.
  8. The second time I tried to remove the container, first with docker rm ckdc_nginx_1 and then with the -f flag. I do not remember the output of the first attempt, but the one with -f printed an error message saying it failed to stop the container. It also deleted all files in /var/lib/docker/containers/containerid000000/ except the shm folder, because of Device or resource busy. The -shim process and the docker-proxy process were still running and occupied the port. I enabled live-restore according to https://docs.docker.com/engine/admin/live-restore/ and sent a SIGHUP to the dockerd process (see the daemon.json sketch after this list). I then did "service docker restart", but all containers restarted; I don't know if that is intentional or not, and if it is not, then the dockerd process might not have received the SIGHUP signal.
  9. After these two crashes I switched the nginx image from 1.11.8-alpine to 1.11.9. The web server has been running for a few hours and it hasn't crashed yet; however, it is too early to declare this bug fixed. I am now holding off on moving anything else to Docker in case the freeze happens again.
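A minimal sketch of the live-restore setting mentioned in step 8 (assuming the default /etc/docker/daemon.json location; note that it only keeps containers running across a clean daemon restart and does not help with an already-wedged daemon):

# /etc/docker/daemon.json
{
  "live-restore": true
}

$ sudo kill -SIGHUP $(pidof dockerd)    # live-restore is one of the options dockerd can reload on SIGHUP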

Thanks for the help! Any comments or ideas on what could be causing these crashes?

@mlaventure (Contributor) commented:

@Foorack could you put your daemon into debug mode and provide the daemon logs when it happens? The daemon would have to have been in debug mode before the issue occurs.

Usually, the only reason for a shim to stay stuck is that it is waiting for IOs to be consumed. It could also be an issue with the communication with containerd; sending a SIGUSR1 to docker-containerd could help diagnose this (the output will appear in the docker daemon logs).

The docker-container.. sub-processes under the shim are quite unexpected though; would it be possible to get the full command line for those (cat /proc/$pid/cmdline)?
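A rough sketch of both suggestions ($pid is a placeholder for one of those sub-process PIDs):

$ sudo kill -SIGUSR1 $(pidof docker-containerd)    # state dump ends up in the docker daemon logs
$ sudo tr '\0' ' ' < /proc/$pid/cmdline; echo      # /proc cmdline is NUL-separated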

@thinkhard-j-park commented:

Hi,

I have encountered the same problem. I was running mongodb and kafka images on Ubuntu 16.04 in swarm mode.

Please see below:
docker stop broker02.1.1rf7j2a6njxycwjbc6hc069k2
Error response from daemon: Cannot stop container broker02.1.1rf7j2a6njxycwjbc6hc069k2: Cannot kill container e1da77e2e13b681312d8bd6bab019c607cc16b88a0696826b2593dc093d34047: rpc error: code = 14 desc = grpc: the connection is unavailable

docker version
Client:
Version: 1.13.0
API version: 1.25
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:26 2017
OS/Arch: linux/amd64

Server:
Version: 1.13.0
API version: 1.25 (minimum version 1.12)
Go version: go1.7.3
Git commit: 49bf474
Built: Tue Jan 17 09:58:26 2017
OS/Arch: linux/amd64
Experimental: false

docker info
Containers: 35
Running: 6
Paused: 0
Stopped: 29
Images: 13
Server Version: 1.13.0
Storage Driver: aufs
Root Dir: /var/lib/docker/aufs
Backing Filesystem: extfs
Dirs: 151
Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
Volume: local
Network: bridge host macvlan null overlay
Swarm: active
NodeID: mvimcgosd3ajnul07i7s23puw
Is Manager: false
Node Address: 192.168.0.29
Manager Addresses:
192.168.0.27:2377
192.168.0.32:2377
192.168.0.33:2377
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: N/A (expected: 03e5862ec0d8d3b3f750e19fca3ee367e13c090e)
runc version: 2f7393a47307a16f8cee44a37b262e8b81021e3e
init version: 949e6fa
Security Options:
apparmor
seccomp
Profile: default
Kernel Version: 4.4.0-62-generic
Operating System: Ubuntu 16.04.1 LTS
OSType: linux
Architecture: x86_64
CPUs: 2
Total Memory: 7.797 GiB
Name: worker01
ID: 3BUL:LI3A:ZTCA:RKJE:VGDN:FNVG:5ROB:ZPFV:UYIN:W5AU:HMLK:VOA2
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): false
Registry: https://index.docker.io/v1/
WARNING: No swap limit support
Experimental: false
Insecure Registries:
127.0.0.0/8
Live Restore Enabled: false


sgnn7 commented Feb 16, 2017

Having the same issue:

$ sudo docker ps --no-trunc  | grep 2a7
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e   <redacted>        "<redacted>"   44 hours ago        Up 44 hours         0.0.0.0:31239->8080/tcp   mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a
$ sudo docker inspect 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e | grep Pid
            "Pid": 24402,
            "PidMode": "",
            "PidsLimit": 0,
$ sudo ps -ef | grep 24402
$ 

$ sudo docker exec -it 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e /bin/bash
rpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

$ sudo docker start 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e 
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
$ sudo docker exec -it 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e /bin/bashrpc error: code = 2 desc = oci runtime error: exec failed: cannot exec a container that has run and stopped

$ sudo docker kill 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e 
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
$ sudo docker ps --no-trunc | grep 49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e
49b7d2d640d2ec3c9b40bcf49621e514a7eaa02c7fdad10a2fdef0af216fe25e   <redacted>       "<redacted>"   45 hours ago        Up 45 hours         0.0.0.0:31239->8080/tcp   mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a

Docker version 1.13.1, build 092cba3 on Mesos/Marathon

Some files do still seem to be open:

$ sudo lsof | grep 2a74cafe-cbaf-44e7-83af-c79d967dcf4a | wc -l
75

Edit:
The base docker process seems to be hanging around:

$ sudo lsof | grep 2a74cafe-cbaf-44e7-83af-c79d967dcf4a
<snip>
docker    23383 23727       root    2w      REG              202,1  64237922    1061162 <redacted>/stderr
$ ps -ef | grep 23383
root     23383 23023  0 Feb14 ?        00:00:09 docker -H unix:///var/run/docker.sock run --cpu-shares 102 --memory 134217728 --env-file /tmp/uBa2x8 <redacted>:/mnt/mesos/sandbox --net bridge -p 31239:8080/tcp --name mesos-1312479e-2636-44ff-94fa-5d62b4f87205-S0.2a74cafe-cbaf-44e7-83af-c79d967dcf4a <redacted>


sgnn7 commented Feb 16, 2017

Cont'd:

Trying to kill that process with docker kill:

$ sudo strace -f -p 23383
[pid 23383] futex(0x115f4d0, FUTEX_WAIT, 0, NULL <unfinished ...>                                                                                         [11/26]
[pid 23727] futex(0xc42043bd10, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23724] futex(0xc42043b110, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23722] futex(0xc42043a910, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23392] <... restart_syscall resumed> ) = -1 ETIMEDOUT (Connection timed out)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78132442}) = 0
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78167235}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265231, 275782449}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173906, 78423604}) = 0
[pid 23392] futex(0x115e838, FUTEX_WAIT, 0, {60, 0} <unfinished ...>
[pid 23725] <... epoll_wait resumed> [{EPOLLIN|EPOLLOUT, {u32=3767856896, u64=139976752033536}}], 128, -1) = 1
[pid 23725] clock_gettime(CLOCK_MONOTONIC, {173956, 483853076}) = 0
[pid 23725] futex(0x115e838, FUTEX_WAKE, 1) = 1
[pid 23392] <... futex resumed> )       = 0
[pid 23725] read(3,  <unfinished ...>
[pid 23392] sched_yield( <unfinished ...>
[pid 23725] <... read resumed> "1db\r\n{\"status\":\"kill\",\"id\":\"49b7"..., 4096) = 482
[pid 23392] <... sched_yield resumed> ) = 0
[pid 23725] futex(0xc42043bd10, FUTEX_WAKE, 1 <unfinished ...>
[pid 23392] futex(0x115e770, FUTEX_WAKE, 1 <unfinished ...>
[pid 23725] <... futex resumed> )       = 1
[pid 23727] <... futex resumed> )       = 0
[pid 23392] <... futex resumed> )       = 0
[pid 23727] epoll_wait(4, [], 128, 0)   = 0
[pid 23727] select(0, NULL, NULL, NULL, {0, 3} <unfinished ...>
[pid 23725] read(3,  <unfinished ...>
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 484611405}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682258878}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23727] <... select resumed> )      = 0 (Timeout)
[pid 23727] epoll_wait(4, [], 128, 0)   = 0
[pid 23727] epoll_wait(4,  <unfinished ...>
[pid 23392] <... select resumed> )      = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 484840215}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682460501}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20}) = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 485092474}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 682712173}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23725] <... read resumed> 0xc4203e8000, 4096) = -1 EAGAIN (Resource temporarily unavailable)
[pid 23392] <... select resumed> )      = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 485800021}) = 0
[pid 23392] clock_gettime(CLOCK_REALTIME, {1487265281, 683432263}) = 0
[pid 23392] select(0, NULL, NULL, NULL, {0, 20} <unfinished ...>
[pid 23725] futex(0xc42043b510, FUTEX_WAIT, 0, NULL <unfinished ...>
[pid 23392] <... select resumed> )      = 0 (Timeout)
[pid 23392] clock_gettime(CLOCK_MONOTONIC, {173956, 486084890}) = 0
[pid 23392] futex(0x115e838, FUTEX_WAIT, 0, {60, 0}^Cstrace: Process 23383 detached

sudo kill 23383 had no effect
sudo kill -9 23383 correctly ended the container process

@mlaventure (Contributor) commented:

@sgnn7 The daemon logs would be useful here; it would also help to know which docker-related files are still open.

When the daemon doesn't respond to SIGTERM, try sending it a SIGUSR1 instead; it will dump its stack traces so we can try to figure out where it is stuck.
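For reference, a minimal sketch of that suggestion (assuming dockerd runs under systemd, as in the journalctl output later in this thread):

$ sudo kill -SIGUSR1 $(pidof dockerd)
$ journalctl -u docker --since "5 minutes ago"    # the goroutine stack dump (or the path to a file containing it) shows up in the daemon log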

@mlaventure (Contributor) commented:

@thinkhard-j-park your issue seems to be different; your error indicates that containerd is not running or not responding. Could you confirm this?


sgnn7 commented Feb 16, 2017

@mlaventure Sadly I can't provide the daemon logs (security issues), and I can't send SIGUSR1 since I "fixed" it with a kill -9 to the parent process. The open files were mounted by Mesos as the stdout sandbox, and the problem seems very similar to this one, with the same inability to clean up properly.

@sudo-bmitch commented:

We're seeing a very similar error with jwilder/nginx-proxy: lots of logs, docker logs hanging on the container, and the container going into a ghost state when we try to stop or remove it. Here are some of our (censored) logs.

We first saw the issue when promoting to production from our Jenkins server; here are the relevant logs from that:

+ docker-compose down
Stopping nginx-proxy-backend ...
Stopping nginx-proxy-frontend ...
Stopping nginx-proxy-frontend ... done
Removing nginx-proxy-backend ...

Removing nginx-proxy-backend ... done
Network frontend is external, skipping
Network backend is external, skipping

+ docker-compose up -d --remove-orphans
Recreating nginx-proxy-frontend
Creating nginx-proxy-backend
ERROR: for nginx-proxy-frontend  Error response from daemon: watch null for network "frontend"
Encountered errors while bringing up the project.

While debugging directly on the host, I couldn't kill or remove the container until the daemon was reloaded to turn on debugging:

[root@***host1 ~]# docker ps -a | grep frontend
ce7d9ba75846        1824b7cdcef5                "/app/docker-entry..."   5 weeks ago         Up 2 weeks               0.0.0.0:80->80/tcp, 443/tcp   nginx-proxy-frontend
[root@***host1 ~]# docker logs --since 5m ce7
^C
[root@***host1 ~]# cd /var/lib/docker/
[root@***host1 docker]# cd containers/
[root@***host1 containers]# cd ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/
[root@***host1 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30]# ls -al
total 209256
drwx------. 4 root root      4096 Feb 17 12:59 .
drwx------. 8 root root      4096 Feb 17 12:06 ..
-rw-r-----. 1 root root 214239548 Feb  3 14:05 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30-json.log
drwx------. 2 root root         6 Jan 12 10:21 checkpoints
-rw-r--r--. 1 root root      3698 Feb 17 12:59 config.v2.json
-rw-r--r--. 1 root root      1210 Feb 17 12:59 hostconfig.json
-rw-r--r--. 1 root root        13 Jan 30 12:06 hostname
-rw-r--r--. 1 root root       172 Jan 30 12:06 hosts
-rw-r--r--. 1 root root        60 Jan 30 12:06 resolv.conf
-rw-r--r--. 1 root root        71 Jan 30 12:06 resolv.conf.hash
drwxrwxrwt. 2 root root        40 Jan 30 12:06 shm
[root@***host1 ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30]# cd
[root@***host1 ~]# docker kill ce7
ce7
[root@***host1 ~]# docker ps -a | grep frontend
ce7d9ba75846        1824b7cdcef5                "/app/docker-entry..."   5 weeks ago         Up 2 weeks               0.0.0.0:80->80/tcp, 443/tcp   nginx-proxy-frontend
[root@***host1 ~]# docker rm -f ce7
Error response from daemon: Unable to remove filesystem for ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: remove /var/lib/docker/containers/ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/shm: device or resource busy
[root@***host1 ~]# vi /etc/docker/daemon.json
[root@***host1 ~]# systemctl reload docker
[root@***host1 ~]# docker kill ce7
Error response from daemon: Cannot kill container ce7: No such container: ce7
[root@***host1 ~]# docker ps -a | grep frontend
[root@***host1 ~]# 

Our logs from journalctl include the following; unfortunately, turning on debugging also eliminated the issue:

Feb 17 12:59:07 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:07.953293414-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 12:59:07 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:07.953525238-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 12:59:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:17.953757486-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 12:59:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:18.169064177-07:00" level=error msg="Handler for POST /v1.25/containers/ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30/rename returned error: watch null for network \"frontend\""
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.014949896-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.175071431-07:00" level=warning msg="cannot find target task in store" forwarder.id=qqmusvubg0z4ooqkysj8u5gta method="(*Dispatcher).UpdateTaskStatus" node.id=qqmusvubg0z4ooqkysj8u5gta node.session=ac7ejcds3chpr211pz5ybxqtu task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.175634006-07:00" level=warning msg="cannot find target task in store" forwarder.id=qqmusvubg0z4ooqkysj8u5gta method="(*Dispatcher).UpdateTaskStatus" node.id=qqmusvubg0z4ooqkysj8u5gta node.session=ac7ejcds3chpr211pz5ybxqtu task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 12:59:58 ***host1.****.corp dockerd[7887]: time="2017-02-17T12:59:58.202512892-07:00" level=error msg="task unavailable" method="(*Dispatcher).processUpdates" module=dispatcher task.id=hgrk2zcwnyqt3bfmz63xiqqf4
Feb 17 13:00:08 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:08.015193075-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 13:00:08 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:08.015862054-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:18.016063087-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 13:00:39 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:39.694613460-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:49 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:49.694829630-07:00" level=info msg="Container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 failed to exit within 10 seconds of signal 15 - using the force"
Feb 17 13:00:49 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:49.695054174-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
Feb 17 13:00:59 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:00:59.695285744-07:00" level=info msg="Container ce7d9ba75846 failed to exit within 10 seconds of kill - trying direct SIGKILL"
Feb 17 13:02:05 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:02:05.396756518-07:00" level=warning msg="container kill failed because of 'container not found' or 'no such process': Cannot kill container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30: rpc error: code = 2 desc = no such process"
....
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.510806160-07:00" level=info msg="Got signal to reload configuration, reloading from: /etc/docker/daemon.json"
Feb 17 13:05:13 ***host1.****.corp systemd[1]: Reloaded Docker Application Container Engine.
....
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656533223-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating bulk sync with nodes [***host1.****.corp-820f2ede70b5]"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656558292-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating unsolicited bulk sync for networks [m94essemfcik8h9fbjqbsthnx tus35vbtw5zq155pwa90yl87a] with node ***host1.****.corp-820f2ede70b5"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656572447-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating bulk sync with nodes [***host2.****.corp-2a899fe772fc]"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.656581353-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating unsolicited bulk sync for networks [m94essemfcik8h9fbjqbsthnx tus35vbtw5zq155pwa90yl87a un66divtohdyb0kwb7wfgagfh osghxmb47ofeackn2q2ndm3hz] with node ***host2.****.corp-2a899fe772fc"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.658081017-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44614"
Feb 17 13:05:13 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:13.658687830-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Bulk sync to node ***host2.****.corp-2a899fe772fc took 1.449645ms"
Feb 17 13:05:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:17.043882197-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44616"
Feb 17 13:05:17 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:17.044355955-07:00" level=debug msg="***host1.****.corp-5b36d322a50d: Initiating  bulk sync for networks [m94essemfcik8h9fbjqbsthnx un66divtohdyb0kwb7wfgagfh tus35vbtw5zq155pwa90yl87a osghxmb47ofeackn2q2ndm3hz] with node ***host2.****.corp-2a899fe772fc"
Feb 17 13:05:18 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:18.550826379-07:00" level=debug msg="memberlist: TCP connection from=10.155.1.54:44618"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.188956135-07:00" level=debug msg="Calling GET /_ping"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.189383775-07:00" level=debug msg="Calling POST /v1.25/containers/ce7/kill?signal=KILL"
Feb 17 13:05:21 ***host1.****.corp dockerd[7887]: time="2017-02-17T13:05:21.189422271-07:00" level=error msg="Handler for POST /v1.25/containers/ce7/kill returned error: Cannot kill container ce7:No such container: ce7"

From the above logs, once we reloaded the daemon to turn on debugging, the ghost container finally died and we were able to deploy again using compose.

Our environment consists of two docker hosts (host1 and host2), set up with swarm mode to act as a k/v store for our attachable overlay networks (we'll finish the transition to services later), with classic swarm being used to spin up the containers on these hosts. There is a logspout container on host1 that ships logs off to another system.

One of my next steps will be to set up docker container log rotation to see if it prevents this issue. This container probably has more logs than any other due to frequent polling from a monitoring tool.
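For anyone trying the same, a minimal sketch of json-file log rotation as daemon-wide defaults (in /etc/docker/daemon.json; the options only apply to containers created after the change):

{
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "50m",
    "max-file": "5"
  }
}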

@mlaventure (Contributor) commented:

@bmitch3020 it looks like docker never received the exit event for that container or failed to process it.

If containerd generated the event and docker received it, there should be a message like:

libcontainerd: received containerd event: &types.Event{Type:"exit", Id:"ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30",...

Could you check if they are present in your log (assuming the daemon was already in debug mode)?

@sudo-bmitch commented:

@mlaventure I have some logs from after we got debugging turned on, but that is from after the container had already failed to stop or be removed:

[root@***host1 ~]# journalctl -u docker | grep containerd | grep exit | grep ce7d
Jan 30 12:06:38 ***host1.****.corp dockerd[186450]: time="2017-01-30T12:06:38.559080774-07:00" level=error msg="containerd: deleting container" error="exit status 1: \"container ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 does not exist\\none or more of the container deletions failed\\n\""
Feb 17 13:14:30 ***host1.****.corp dockerd[36555]: time="2017-02-17T13:14:30.743988657-07:00" level=debug msg="containerd: process exited" id=ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30 pid=init status=2 systemPid=8212
Feb 17 13:14:31 ***host1.****.corp dockerd[36555]: time="2017-02-17T13:14:31.728078304-07:00" level=debug msg="libcontainerd: received containerd event: &types.Event{Type:\"exit\", Id:\"ce7d9ba75846e96655f43f4dfa97aa591573d649b6cdbb5c2bf13322f32efb30\", Status:0x2, Pid:\"init\", Timestamp:(*timestamp.Timestamp)(0xc4201ee110)}"

@crosbymichael (Contributor) commented:

If you are binding ports like 80:80 and 443:443, it would be interesting to see whether these failures are networking related.

A quick test would be to run the container in host networking mode, so it does not go through NAT or the docker proxy; the container has access to the host's network interfaces directly. This is just an idea, because you suggested this seems to happen under heavy network use, and this test would be a way for us to rule networking out of the things that could be causing it.

If anyone is comfortable with running this test, that would help a lot. The docker run command for doing this would be docker run --net host; I don't know what the compose syntax would be, and the container will bind to the default ports on your host interfaces. It's a little less secure, as the container would have access to your loopback interface, but if you trust the container you are good to go. Plus you will see a nice performance bump during this test for your networked applications ;)
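For the compose side, the equivalent of docker run --net host would be something along these lines (a sketch; network_mode: host in the v2 file format, or net: host in v1, with the ports:/networks: entries dropped because they don't apply with host networking):

  nginx:
    image: nginx:1.11.8-alpine
    network_mode: host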

crosbymichael self-assigned this Feb 27, 2017

sgnn7 commented Mar 2, 2017

@mlaventure I think we found a commonality between two occurrences of this error in our case: the logs for the container had grown massive, so both times someone ran a docker logs on it that never exited before the container was killed from the Marathon UI.


xverges commented Mar 2, 2017

@sgnn7 ... and given #30135 (closed) your finding may be very good news for you!

@thaJeztah (Member) commented:

Let me close this ticket for now, as it looks like it went stale.

thaJeztah closed this as not planned (won't fix, can't repro, duplicate, stale) on Sep 16, 2023