
systemd cgroup scope somehow gets left behind not allowing containers to start #7015

Closed
ibuildthecloud opened this issue Jul 14, 2014 · 42 comments · Fixed by #7597

@ibuildthecloud (Contributor)

I ran into a situation in which I got:

[error] container.go:468 Error running container: Unit docker-d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc.scope already exists.

It seems the scope did not get cleaned up by systemd. The scope file was still present in /run/systemd/system/... but systemd-cgls did not show it. In the end I had to delete the container because I couldn't start it.

The way I got into this situation: I had a container that died on startup, and a script that was auto-restarting it. So it was starting the container every 5 seconds, and it was dying every 5 seconds. Eventually it must have hit some race condition, and from then on it always failed to start in docker because of this cgroup issue.

This is systemd 212 and docker 1.0.1 running on CoreOS 367.1.0.
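For anyone trying to confirm this state by hand, here is a minimal sketch using the container ID from this report. It assumes the transient unit file sits directly under /run/systemd/system/ with the unit's name, which matches what was described above but is an assumption about the exact path:

```shell
# Sketch: check for a leftover transient unit file for a container.
# The ID is the one from this report; the exact file location under
# /run/systemd/system/ is an assumption for illustration.
id="d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc"
unit="docker-${id}.scope"
if [ -e "/run/systemd/system/${unit}" ]; then
  echo "leftover unit file: ${unit}"
else
  echo "no leftover unit file for ${unit}"
fi
```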

@ibuildthecloud (Contributor, author)

Upon further inspection I found that the following cgroups still exist:

/sys/fs/cgroup/blkio/system.slice/docker-d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc.scope
/sys/fs/cgroup/memory/system.slice/docker-d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc.scope
/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc.scope
/sys/fs/cgroup/systemd/system.slice/docker-d3ca1668bff4e74d22113886ba1433ecae920ece02c76ce1a9344409b68903bc.scope

For all cgroups, cgroup.procs is empty. I suspect this is a systemd issue.
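The inspection above can be scripted. A sketch that scans the cgroup v1 hierarchies for leftover docker scopes and reports whether each still has member processes; the paths assume cgroup v1 mounted under /sys/fs/cgroup, as on the host above:

```shell
# Scan each v1 hierarchy for leftover docker container scope cgroups
# and report whether cgroup.procs still lists any processes.
found=0
for d in /sys/fs/cgroup/*/system.slice/docker-*.scope; do
  [ -d "$d" ] || continue      # glob may not match anything
  found=1
  if [ -s "$d/cgroup.procs" ]; then
    echo "still in use: $d"
  else
    echo "orphaned (cgroup.procs empty): $d"
  fi
done
if [ "$found" -eq 0 ]; then
  echo "no leftover docker scope cgroups found"
fi
```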

@crosbymichael (Contributor)

Yes, this is a systemd cgroup issue

@ibuildthecloud (Contributor, author)

Looking into this, the docker systemd cgroup code is cleaning up properly: it deletes all the cgroups it created for controllers that systemd doesn't manage. The issue seems to be that the cgroups systemd itself manages are supposed to be deleted automatically, and they are not. So this still looks like a systemd-specific issue.

Since affected systemd versions are out in the wild, it seems docker should reuse the cgroups if they already exist rather than fail. This does mean systemd will keep orphaning cgroups, but that will have to be addressed separately.

I will try to look into what the fix for this may be.

@joschi127

I have this issue with Manjaro Linux as the host system. When stopping a container, the cgroup cleanup does not seem to work; the following .scope directories still exist after stopping the container:

/sys/fs/cgroup/blkio/system.slice/docker-663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c.scope
/sys/fs/cgroup/memory/system.slice/docker-663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c.scope
/sys/fs/cgroup/cpu,cpuacct/system.slice/docker-663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c.scope
/sys/fs/cgroup/systemd/system.slice/docker-663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c.scope

As far as I can see this happens every time for me, not just occasionally.

@djmaze (Contributor)

djmaze commented Aug 6, 2014

Same problem on Arch Linux. Is there a way to fix this state, apart from rebooting?

@joschi127

@djmaze For me the only real fix is rebooting. As a workaround you can save the container to an image like this:

docker commit containername imagename

Then you can remove the old container and run a new container from the saved image:

docker rm containername
docker run --name containername [...] -t imagename [...]

@mastercactapus

workaround in the meantime:

#systemctl stop docker-FULL_CONTAINER_ID.scope
systemctl stop docker-663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c.scope
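The same workaround, sketched as a script that derives the unit name from a full container ID. The ID below is the one from this thread; the actual systemctl call needs root, so it is left commented out here:

```shell
#!/bin/sh
# The full ID of a named container can be obtained with:
#   docker inspect --format '{{.Id}}' <name>
container_id="663a7122c20d75e6c34b94bac7dafa144ac06e6ce1d9c09c71d6dbb4fe94be4c"
scope="docker-${container_id}.scope"
echo "would stop: ${scope}"
# systemctl stop "${scope}"
```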

SergeiDiachenko added a commit to smilart/smilart.os.distribution that referenced this issue Sep 4, 2014
@gegere

gegere commented Dec 9, 2014

@mastercactapus great, thank you for this post; it solved my issue, for now!

@guilhermebr

Hi, this has happened many times.

core@coreos03 ~ $ docker start containername
Error response from daemon: Cannot start container containername: Unit docker-93408c6f3c1bb4a944b04d39bb06509aa1b5159f715d3ce157b34a4874cb5109.scope already exists.

$ docker --version
Docker version 1.1.2, build d84a070

@gegere

gegere commented Dec 12, 2014

@guilhermebr ok... run the following command.

$ systemctl stop docker-93408c6f3c1bb4a944b04d39bb06509aa1b5159f715d3ce157b34a4874cb5109.scope

@defunctzombie

@crosbymichael I am still seeing this issue come up on one of my machines

Docker version 1.3.2, build 50b8feb
core@localtunnel ~ $ docker restart localtunnel
Error response from daemon: Cannot restart container localtunnel: Unit docker-9e57a7bb9f04302a5cc19f85536fff4b82d12fb4a735a45870689d680c78132c.scope already exists.
2015/01/04 19:14:31 Error: failed to restart one or more containers

I have the container launched with --restart=always and --net host. It seems that sometimes it crashes in a bad way that prevents docker from restarting it automatically, causing downtime for the service until I go in and apply the fix manually.

@LK4D4 (Contributor)

LK4D4 commented Jan 4, 2015

I have this error on master too.

@jokesterfr

Same issue here on Arch Linux with systemd. A systemctl stop docker-xxxxxxxxxx does the trick, but it's not easy to figure this out if you are a complete beginner.

@LK4D4 (Contributor)

LK4D4 commented Feb 6, 2015

I'm not sure why we closed this, because it's still an issue.
ping @crosbymichael

@Seth-Miller

I just hit this issue with a brand new installation of Oracle Linux 7.0.

Here is the docker version I am running.
[root@localhost ~]# docker version
Client version: 1.3.3
Client API version: 1.15
Go version (client): go1.3.3
Git commit (client): 4e9bbfa/1.3.3
OS/Arch (client): linux/amd64
Server version: 1.3.3
Server API version: 1.15
Go version (server): go1.3.3
Git commit (server): 4e9bbfa/1.3.3

@agusgr

agusgr commented Feb 17, 2015

Hi all,

The same issue in a new installation or CoreOs 557.2.0 (current stable channel):

CoreOS stable (557.2.0)
core@core-01 ~ $ docker version
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8-dirty
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8-dirty
core@core-01 ~ $

@defunctzombie

@LK4D4 @crosbymichael what is it going to take to re-open this? This sounds like too hard an issue for just anyone to fix with a PR; maybe someone with knowledge of the matter can take a quick look? Restart functionality is untrustworthy as a result.

@stefwalter

In the Cockpit project our integration tests hit this race (?) issue routinely, for example when running on Fedora 21 with docker-io-1.4.1-8.fc21.x86_64.

@Malet

Malet commented Feb 19, 2015

Also coming across this on CoreOS beta (584.0.0) with

core@localhost ~ $ docker version
Client version: 1.4.1
Client API version: 1.16
Go version (client): go1.3.3
Git commit (client): 5bc2ff8-dirty
OS/Arch (client): linux/amd64
Server version: 1.4.1
Server API version: 1.16
Go version (server): go1.3.3
Git commit (server): 5bc2ff8-dirty

It does not happen for all containers, and the system can be brought back to a working state using:

sudo systemctl stop docker-<container_hash>.scope

@Hokutosei

systemctl stop docker-*

did the trick
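A slightly safer variant of the wildcard is to ask systemd itself for the matching units instead of relying on shell globbing, so only docker-<id>.scope units are touched. This is a sketch: it assumes a systemctl whose list-units accepts a unit-name pattern, and it only prints what it would stop (the real call needs root):

```shell
# List (and optionally stop) every leftover docker container scope.
systemctl list-units --no-legend 'docker-*.scope' \
  | awk '{print $1}' \
  | while read -r unit; do
      echo "would stop: $unit"   # swap the echo for: systemctl stop "$unit"
    done
```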

@wptad

wptad commented Feb 26, 2015

This helped for me too, thanks:

systemctl stop docker-*

@dreibh

dreibh commented Feb 26, 2015

The issue looks very similar to the bug reported here: https://bugzilla.redhat.com/show_bug.cgi?id=1066926 . I have an issue with left-behind scope files when destroying LXC containers via libvirt (I am not using Docker).

@dreibh

dreibh commented Feb 26, 2015

In my case, the problem is with Fedora 20 and libvirt-1.2.11.

@fxposter

Debian 8
Linux hostname 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt2-1 (2014-12-08) x86_64 GNU/Linux
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.4.1
Git commit (client): a8a31ef
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.4.1
Git commit (server): a8a31ef

Same issue...

@agusgr

agusgr commented Apr 7, 2015

Hi all,

In case it helps someone: after upgrading my environment I no longer hit the error on docker restart.
My new environment:

CoreOS stable (607.0.0)

core@core-01 ~ $ docker version
Client version: 1.5.0
Client API version: 1.17
Go version (client): go1.3.3
Git commit (client): a8a31ef-dirty
OS/Arch (client): linux/amd64
Server version: 1.5.0
Server API version: 1.17
Go version (server): go1.3.3
Git commit (server): a8a31ef-dirty

The bad news is that now I get an error on docker stop:

Apr 07 16:38:41 core-01 systemd[1]: xx.service: main process exited, code=exited, status=143/n/a
Apr 07 16:38:41 core-01 bash[6595]: Failed to stop docker-345029b9c94588f154a9e517e0b480f18c6488f42c72d84fcc04f6f476d51406.scope: Interactive authentication required.
Apr 07 16:38:41 core-01 systemd[1]: xx.service: control process exited, code=exited status=1

I will continue looking for some light on this topic and keep you posted

Agus

@LK4D4 (Contributor)

LK4D4 commented Apr 7, 2015

ping @philips
Can you imagine where the "Interactive authentication required." error could come from?

@philips (Contributor)

philips commented Apr 7, 2015

@LK4D4 Looking at the systemd code, it is around PolicyKit. I am going to have to dig into it; very bizarre.

@LK4D4 (Contributor)

LK4D4 commented Apr 7, 2015

@philips Thank you! Looks really weird.

@philips (Contributor)

philips commented Apr 7, 2015

@agusgr I filed an issue on the CoreOS bug tracker so we can hunt this down without making more noise on this already confusing issue. Can you follow up over there with what you were doing to cause this: coreos/bugs#321

@defunctzombie

Can this issue be reopened? Plenty of people have stated it is still a problem for them, and until it is verified as fixed, every currently deployed system continues to suffer from it. My own services fail to restart daily and require manual intervention as a result.

@LK4D4 (Contributor)

LK4D4 commented Apr 17, 2015

@defunctzombie Let's create a new one then. I actually can't reproduce it anymore.

@defunctzombie

@LK4D4 which version of docker stopped failing for you?

@onorua

onorua commented May 19, 2015

Client version: 1.6.2
Client API version: 1.18
Go version (client): go1.4.2
Git commit (client): 7c8fca2
OS/Arch (client): linux/amd64
Server version: 1.6.2
Server API version: 1.18
Go version (server): go1.4.2
Git commit (server): 7c8fca2
OS/Arch (server): linux/amd64

Same issue:
docker start 374af5c9a5c5
Error response from daemon: Cannot start container 374af5c9a5c5: [8] System error: Unit docker-374af5c9a5c57e749c0ed137455a67e1006a7fbeb603adccaac8d33a7ae8fb9b.scope already exists.
FATA[0000] Error: failed to start one or more containers

@runcom (Member)

runcom commented Jun 3, 2015

1.7.0-RC1: just hit this on Ubuntu 15.04 with systemd.
The only workaround for me is to systemctl stop the scope unit.

@runcom (Member)

runcom commented Jun 3, 2015

ping @LK4D4

@harobed

harobed commented Jun 7, 2015

Same issue with Debian Jessie

@LK4D4 (Contributor)

LK4D4 commented Jun 8, 2015

@runcom I'd propose creating a new issue there with a reproduction case.

@runcom (Member)

runcom commented Jun 8, 2015

@LK4D4 I'm trying to reproduce this again, but it's intermittent (it happens to me when using compose).

@thaJeztah (Member)

A new issue has been created to track this in #13802.

Please continue the discussion on that issue if you're still encountering this.

@lucaspottersky

It's just happened to me.

Docker version 1.7.0, build 0baf609
Lubuntu/Ubuntu 15.04

@thaJeztah (Member)

@lucaspottersky please see my comment above yours
