Image remove blocked by a container lock #37072

Open

dnephin opened this issue May 15, 2018 · 10 comments
Labels
area/runtime kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. status/needs-attention Calls for a collective discussion during a review session version/18.03

Comments

@dnephin
Member

dnephin commented May 15, 2018

This issue occurs when trying to deploy new hosts using 18.03.1-ce. Previously, workloads ran correctly on ~17.06.

Trying to remove any image with docker image rm blocks and never completes. There were 5 containers running on the host; trying to inspect or rm two of them showed the same behaviour (blocked and never completed).

Full output from SIGUSR1:
goroutine-stacks-2018-05-15T151422Z.log

There are a bunch of goroutines stuck on image remove. There is also one stuck on container remove that seems to be holding the lock:

1: select [728 minutes] [Created by http.(*Server).Serve @ server.go:2720]
    transport     transport.go:256                (*Stream).Header(*Stream(#502), #659, #264, #689)
    grpc          call.go:64                      recvResponse(Context(#721), dialOptions(#189), ClientTransport(#685), *callInfo(0x0), *Stream(0x0), interface{}(0x0), ...)
    grpc          call.go:279                     invoke(Context(#721), string(#642, len=40), interface{}(#651), interface{}(#654), *ClientConn(#150), CallOption(0x0), ...)
    containerd    grpc.go:18                      namespaceInterceptor.unary(Context(#636), string(#720, len=842350551064), interface{}(#642), interface{}(#651), *ClientConn(#654), UnaryInvoker(#693), ...)
    containerd    grpc.go:34                      unary)-fm(string(#720, len=842350551064), #642, 0x28, #651, #517, #654, #693, #150, #658, ...)
    grpc          call.go:141                     Invoke(Context(#720), string(#642, len=40), interface{}(#651), interface{}(#654), *ClientConn(#150), CallOption(0x0), ...)
    v1            tasks.pb.go:468                 (*tasksClient).Kill(*tasksClient(#429), Context(#720), *KillRequest(#517), CallOption(0x0), CallOption(0x0), CallOption(#396))
    containerd    task.go:181                     (*task).Kill(*task(#533), Context(#687), Signal(0x9), KillOpts(0x0), KillOpts(0x0))
    libcontainerd client_daemon.go:389            (*client).SignalProcess(*client(#132), Context(#687), string(#316, len=64), string(#635, len=4), 9, #305, #434)
    daemon        kill.go:179                     (*Daemon).kill(*Daemon(#33), *Container(#548), 9, 0x2, 0x2)
    daemon        kill.go:99                      (*Daemon).killWithSignal(*Daemon(#33), *Container(#548), 9, 0x0, 0x0)
    daemon        kill.go:169                     (*Daemon).killPossiblyDeadProcess(*Daemon(#33), *Container(#548), 9, #265, #615)
    daemon        kill.go:129                     (*Daemon).Kill(*Daemon(#33), *Container(#548), 0x0, 0x0)
    daemon        delete.go:93                    (*Daemon).cleanupContainer(*Daemon(#33), *Container(#548), bool(0x101), bool(0x0))
    daemon        delete.go:48                    (*Daemon).ContainerRm(*Daemon(#33), string(#567, len=64), *ContainerRmConfig(#237), 0x0, 0x0)
    container     container_routes.go:494         (*containerRouter).deleteContainers(*containerRouter(#208), Context(#721), ResponseWriter(#684), *Request(#315), map[string]string(#514), 0x5)
    container     container.go:68                 deleteContainers)-fm(*containerRouter(#721), #516, #684, #274, #315, #514, #721, #516)
    middleware    experimental.go:27              ExperimentalMiddleware.WrapHandler.func1(func(#721), #684, #274, #315, #514, #721, #516)
    middleware    version.go:62                   VersionMiddleware.WrapHandler.func1(func(#721), #684, #274, #315, #514, #606, 0x40)
    authorization middleware.go:59                (*Middleware).WrapHandler.func1(*Middleware(#721), func(#515), #274, #315, #514, #721, #515)
    server        server.go:137                   (*Server).makeHTTPHandler.func1(*Server(#684), APIFunc(#274))
    http          server.go:1918                  HandlerFunc.ServeHTTP(ReadCloser(#203), func(#274))
    mux           mux.go:103                      (*Router).ServeHTTP(*Router(#42), ResponseWriter(#684), *Request(#315))
    server        router_swapper.go:29            (*routerSwapper).ServeHTTP(*routerSwapper(#238), ResponseWriter(#684), *Request(#315))
    http          server.go:2619                  serverHandler.ServeHTTP(*Server(#43), #684, #274, #315)
    http          server.go:1801                  (*conn).serve(#508, #686, #466)
1: semacquire [16 minutes] [Created by http.(*Server).Serve @ server.go:2720]
    sync          sema.go:71                      runtime_SemacquireMutex(*uint32(#491), bool(0x0))
    sync          mutex.go:134                    (*Mutex).Lock(*Mutex(#490))
    container     state.go:241                    (*State).IsRunning(*State(#490), 0x47)
    daemon        image_delete.go:378             (*Daemon).checkImageDeleteConflict.func1(*Daemon(#548), ID(#579))
    container     memory_store.go:62              (*memoryStore).First(*memoryStore(#188), StoreFilter(#547))
    daemon        image_delete.go:380             (*Daemon).checkImageDeleteConflict(*Daemon(#33), ID(#538), conflictType(0xf))
    daemon        image_delete.go:313             (*Daemon).imageDeleteHelper(*Daemon(#33), ID(#538), *<unknown>(#523), bool(0x10100), bool(0x1))
    daemon        image_delete.go:178             (*Daemon).ImageDelete(*Daemon(#33), string(#504, len=11), bool(0x100), bool(#616), #644, #554)
    image         image_routes.go:199             (*imageRouter).deleteImages(*imageRouter(#227), Context(#721), ResponseWriter(#684), *Request(#79), map[string]string(#476), 0x5)
    image         image.go:42                     deleteImages)-fm(*imageRouter(#721), #478, #684, #266, #79, #476, #721, #478)
    middleware    experimental.go:27              ExperimentalMiddleware.WrapHandler.func1(func(#721), #684, #266, #79, #476, #721, #478)
    middleware    version.go:62                   VersionMiddleware.WrapHandler.func1(func(#721), #684, #266, #79, #476, #700, 0x40)
    authorization middleware.go:59                (*Middleware).WrapHandler.func1(*Middleware(#721), func(#477), #266, #79, #476, #721, #477)
    server        server.go:137                   (*Server).makeHTTPHandler.func1(*Server(#684), APIFunc(#266))
    http          server.go:1918                  HandlerFunc.ServeHTTP(ReadCloser(#74), func(#266))
    mux           mux.go:103                      (*Router).ServeHTTP(*Router(#42), ResponseWriter(#684), *Request(#79))
    server        router_swapper.go:29            (*routerSwapper).ServeHTTP(*routerSwapper(#238), ResponseWriter(#684), *Request(#79))
    http          server.go:2619                  serverHandler.ServeHTTP(*Server(#43), #684, #266, #79)
    http          server.go:1801                  (*conn).serve(#503, #686, #568)
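
To make the interaction concrete, here is a minimal sketch of what I think these two traces show (the types and function names below are simplified stand-ins, not the actual daemon code): the container-remove path holds the container's state mutex while it blocks on the Kill call out to containerd, and the image-delete path then blocks trying to take that same mutex in IsRunning.

```go
// Hypothetical, simplified stand-in for the lock interaction in the two
// traces above; the real types live in daemon/ and container/.
package main

import (
	"fmt"
	"sync"
	"time"
)

type containerState struct {
	mu      sync.Mutex
	running bool
}

// isRunning mirrors container.(*State).IsRunning: it needs the state mutex.
func (s *containerState) isRunning() bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.running
}

// containerRm mirrors the ContainerRm path: it holds the mutex while making a
// blocking call out to containerd (the Kill RPC in the first trace).
func containerRm(s *containerState, killViaContainerd func()) {
	s.mu.Lock()
	defer s.mu.Unlock()
	killViaContainerd() // never returns in the bad case, so the mutex is never released
}

// imageDelete mirrors checkImageDeleteConflict: it only calls isRunning, but
// that blocks forever because containerRm still holds the mutex.
func imageDelete(s *containerState) {
	fmt.Println("running:", s.isRunning())
}

func main() {
	s := &containerState{running: true}
	go containerRm(s, func() { select {} }) // simulate the Kill RPC that never returns
	time.Sleep(100 * time.Millisecond)
	imageDelete(s) // blocks on the mutex, like the semacquire goroutine above
	// The Go runtime reports "all goroutines are asleep - deadlock!" here.
}
```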

I looked around for related issues, but couldn't find anything recent.

$ docker version
Client:
 Version:      18.03.1-ce
 API version:  1.37
 Go version:   go1.9.5
 Git commit:   9ee9f40
 Built:        Thu Apr 26 07:18:46 2018
 OS/Arch:      linux/amd64
 Experimental: false
 Orchestrator: swarm

Server:
 Engine:
  Version:      18.03.1-ce
  API version:  1.37 (minimum version 1.12)
  Go version:   go1.9.5
  Git commit:   9ee9f40
  Built:        Thu Apr 26 07:16:59 2018
  OS/Arch:      linux/amd64
  Experimental: false
$ docker info
Containers: 40
 Running: 34
 Paused: 0
 Stopped: 6
Images: 392
Server Version: 18.03.1-ce
Storage Driver: aufs
 Backing Filesystem: extfs
 Dirs: 4383
 Dirperm1 Supported: true
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
Swarm: inactive
Runtimes: runc
Default Runtime: runc
Init Binary: docker-init
containerd version: 773c489c9c1b21a6d78b5c538cd395416ec50f88
runc version: 4fc53a81fb7c994640722ac585fa9ca548971871
init version: 949e6fa
Kernel Version: 4.4.0-124-generic
Operating System: Ubuntu 14.04.5 LTS
OSType: linux
Architecture: x86_64
CPUs: 32
Total Memory: 58.97GiB
@dnephin dnephin added area/runtime kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. version/18.03 labels May 15, 2018
@cpuguy83
Member

Any chance you can get a containerd stack trace?

@dnephin
Member Author

dnephin commented May 15, 2018

Ya, I'll do that now.

@thaJeztah thaJeztah added this to backlog in maintainers-session May 16, 2018
@dnephin
Member Author

dnephin commented May 16, 2018

Attaching the containerd stack trace: containerd-stack-dump.log

And the shim stack trace (pulled from strace -e write -s 100000 since stderr is /dev/null):
shim.log

I can see that the containerd goroutine stuck on kill is in a select, communicating with the shim. Running ctr task ls also does not respond.

Looking at the process list, this is the containerd-shim for the container that is locked:

root           \_ docker-containerd-shim -namespace moby -workdir ...
root               \_ [docker-runc] <defunct>

From the shim stack trace it seems like the shim is stuck waiting on a channel from runc, but runc is gone so it's going to be waiting a while. I guess there needs to be some timeout so that it can detect that runc is gone?
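
To sketch what I mean (purely illustrative; the channel and function names are made up, not the shim's actual code), the bare channel receive could be bounded with a select and a timeout:

```go
package main

import (
	"fmt"
	"time"
)

// waitForExit is a hypothetical sketch of the kind of bound I mean: instead of
// blocking forever on a channel that the (already gone) runc will never feed,
// give up after a timeout.
func waitForExit(exitCh <-chan int, timeout time.Duration) (int, error) {
	select {
	case status := <-exitCh:
		return status, nil
	case <-time.After(timeout):
		return 0, fmt.Errorf("timed out after %s waiting for runc exit status", timeout)
	}
}

func main() {
	exitCh := make(chan int) // nothing ever sends: simulates the defunct runc
	if _, err := waitForExit(exitCh, 2*time.Second); err != nil {
		fmt.Println(err) // prints the timeout error instead of hanging forever
	}
}
```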

@thaJeztah thaJeztah added the status/needs-attention Calls for a collective discussion during a review session label May 17, 2018
@cpuguy83
Member

Weird issue indeed.

The shim has issued a runc kill command and is waiting for the exit code.
For some reason there is no goroutine actually waiting on the process (as in wait4 or waitid), but that wait is supposed to be started immediately after the process is started (in a goroutine): https://github.com/containerd/go-runc/blob/master/monitor.go#L55
The goroutine couldn't have already exited; otherwise docker-runc would not be defunct.
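
For reference, this is roughly the shape of that pattern, as a simplified stand-in rather than go-runc's actual API: start the process, then immediately wait on it from a goroutine, which both collects the exit status and reaps the child.

```go
package main

import (
	"fmt"
	"os/exec"
)

// startAndMonitor is a simplified stand-in for the pattern described above
// (not go-runc's actual API): start the process, then immediately wait on it
// from a goroutine. The Wait is what issues wait4 and reaps the child, so it
// cannot be left <defunct>.
func startAndMonitor(cmd *exec.Cmd) (<-chan int, error) {
	if err := cmd.Start(); err != nil {
		return nil, err
	}
	exitCh := make(chan int, 1)
	go func() {
		_ = cmd.Wait() // collects the exit status and reaps the child
		exitCh <- cmd.ProcessState.ExitCode()
	}()
	return exitCh, nil
}

func main() {
	ch, err := startAndMonitor(exec.Command("true"))
	if err != nil {
		panic(err)
	}
	fmt.Println("exit status:", <-ch)
}
```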

@thaJeztah
Member

@dnephin @cpuguy83 what's the status on this one; is this still an issue?

@dnephin
Member Author

dnephin commented Jul 21, 2018

I don't think it has been fixed, so it is still an issue.

@XiaolongHu

We hit a similar problem using 18.04 on CentOS. A container has exited, but docker ps shows it as Up. When we run "docker inspect <id>", the command blocks and never returns.

@dnephin
Member Author

dnephin commented Jan 4, 2019

I took another look at the stack trace from version 18.06.1-ce which still has the problem. The shim stack trace looks exactly like the problem described here: containerd/containerd#2709 (comment)

This may have been fixed by containerd/containerd#2743, which appears to be in containerd 1.2.1. I see that docker-archive#129 backports containerd 1.2.1 to 18.09.1.

I will test again with that version to see if it is resolved.

@thaJeztah
Member

Thanks @dnephin - sounds hopeful!

@dnephin
Member Author

dnephin commented Jan 4, 2019

Also, @cpuguy83, in case you were curious (since you looked into this issue a while ago): I think https://github.com/containerd/go-runc/blob/master/monitor.go#L55 was a red herring. That ProcessMonitor is not used; it is overridden by reaper/reaper.go:Monitor in cmd/containerd-shim/shim_linux.go. This bug isn't possible with the monitor in go-runc, but it is possible with reaper.Monitor, because it blocks reading from a channel before calling cmd.Wait().
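
To illustrate the shape of the failure mode (with made-up names; this is not the real reaper package): the monitor subscribes to a per-pid channel that a SIGCHLD-driven reaper is supposed to feed, and blocks on that channel before it ever calls cmd.Wait(), so a lost notification leaves the waiter blocked forever and the child defunct.

```go
package main

import (
	"fmt"
	"os/exec"
	"sync"
)

// A loose illustration of the failure mode described above: exit statuses are
// delivered by a separate reaper goroutine over per-subscriber channels, and
// the caller blocks on its channel instead of calling cmd.Wait() itself. If
// the notification is lost, the receive in wait() blocks forever and the
// child stays <defunct>. Names are made up; this is not the real reaper code.
type reaperMonitor struct {
	mu   sync.Mutex
	subs map[int]chan int // pid -> exit status
}

func (m *reaperMonitor) subscribe(pid int) chan int {
	m.mu.Lock()
	defer m.mu.Unlock()
	ch := make(chan int, 1)
	m.subs[pid] = ch
	return ch
}

// notify would be driven by a SIGCHLD handler in the real shim.
func (m *reaperMonitor) notify(pid, status int) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if ch, ok := m.subs[pid]; ok {
		ch <- status
	}
	// If pid was never subscribed (the race), the status is silently dropped.
}

func (m *reaperMonitor) wait(cmd *exec.Cmd, ch chan int) int {
	status := <-ch // blocks here; cmd.Wait() is only called afterwards
	_ = cmd.Wait()
	return status
}

func main() {
	m := &reaperMonitor{subs: make(map[int]chan int)}
	cmd := exec.Command("true")
	if err := cmd.Start(); err != nil {
		panic(err)
	}
	ch := m.subscribe(cmd.Process.Pid)
	// In the bad case the SIGCHLD-driven notify never fires for this pid and
	// wait() blocks forever. Here we fire it by hand to show the happy path.
	m.notify(cmd.Process.Pid, 0)
	fmt.Println("exit status:", m.wait(cmd, ch))
}
```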
