Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Docker(containerd) miss container rm command #39047

Open
fdingiit opened this issue Apr 10, 2019 · 2 comments
Open

Docker(containerd) miss container rm command #39047

fdingiit opened this issue Apr 10, 2019 · 2 comments

Comments

@fdingiit
Copy link

fdingiit commented Apr 10, 2019

Description

Test group report docker rm command failed for about a half year, version from 17.03 to 18.09.0.

Steps to reproduce the issue:
No way to reproduce, but the problem happen nearly every day

Describe the results you received:
docker stuck in rm, like:

[root@71be7210-5445-11e9-b7b6-ecd68a2ef3bb docker]# ps ux | grep docker
root      3604  1.3  0.1 2007820 54912 ?       Ssl  Apr09  18:06 /usr/bin/dockerd -b=none --storage-driver=overlay -D
root      3624  0.9  0.1 2360104 54108 ?       Ssl  Apr09  12:51 containerd --config /var/run/docker/containerd/containerd.toml --log-level debug
root      3764  0.1  0.0 107984  9712 ?        Sl   Apr09   1:46 containerd-shim -namespace moby -workdir /var/lib/docker/containerd/daemon/io.containerd.runtime.v1.linux/moby/a2e79a761f20d607d4950dff24e6323a99a51fdac07c5d35f410f5a5c81f32f4 -address /var/run/docker/containerd/containerd.sock -containerd-binary /opt/midware/docker/containerd -runtime-root /var/run/docker/runtime-runc -debug
root      8263  0.0  0.0 112652   980 pts/7    S+   11:53   0:00 grep --color=auto docker
root     26907  0.0  0.0  47492 17340 pts/15   Sl+  09:30   0:00 docker rm -f de1382ef7869 4c9342f2a94c c8a42d84346a a4b64d7dbb60 a2e79a761f20 25ad3b49de5f d69eb06aa331
[root@71be7210-5445-11e9-b7b6-ecd68a2ef3bb docker]# 

and docker ps like:

[root@71be7210-5445-11e9-b7b6-ecd68a2ef3bb docker]# docker ps
CONTAINER ID        IMAGE               COMMAND             CREATED             STATUS              PORTS               NAMES
a2e79a761f20        kedacom_umm:v1      "/usr/bin/bash"     25 hours ago        Up 22 hours                             core_TnBAclHU
[root@71be7210-5445-11e9-b7b6-ecd68a2ef3bb docker]# 

Describe the results you expected:
rm container correctly

Additional information you deem important (e.g. issue happens only occasionally):
I've add some log info in docker/dockerd/containerd/containerd-shim/runc, here is what I got for now:

NOTE: the log msg which contains [fd], is added by myself.

1. docker:

correctly receive 7 docker rm commands, but only get 6 returns from func runRm(dockerCli command.Cli, opts *rmOptions) error:

time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: a4b64d7dbb60, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: d69eb06aa331, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: c8a42d84346a, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: a2e79a761f20, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: de1382ef7869, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: 4c9342f2a94c, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in runRm, will call ContainerRemove: args: ctx: context.Background, container: 25ad3b49de5f, options%!(EXTRA types.ContainerRemoveOptions={false false true})"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] runRm return err: <nil>"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] runRm return err: <nil>"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] runRm return err: <nil>"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] runRm return err: <nil>"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] runRm return err: <nil>"
time="2019-04-10T09:30:11+08:00" level=debug msg="[fd] runRm return err: <nil>"

2. dockerd:

get 7 calls of func (daemon *Daemon) Kill(container *containerpkg.Container) error, 6 returns will nil at the end of the fuction, 1 returns nil at here because of a failed KILL -9 command, as:

time="2019-04-10T09:30:20.436596506+08:00" level=info msg="Container a2e79a761f20 failed to exit within 10 seconds of kill - trying direct SIGKILL"
time="2019-04-10T09:30:20.437152328+08:00" level=debug msg="[fd] the pid is: 3892"
time="2019-04-10T09:30:20.437577519+08:00" level=debug msg="[fd] unix kill failed: no such process"
time="2019-04-10T09:30:20.438110802+08:00" level=debug msg="Cannot kill process (pid=3892) with signal 9: no such process."
time="2019-04-10T09:30:20.438557027+08:00" level=debug msg="[fd][a2e79a761f20d607d4950dff24e6323a99a51fdac07c5d35f410f5a5c81f32f4] killProcessDirectly failed: Cannot kill process (pid=3892) with signal 9: no such process."
time="2019-04-10T09:30:20.438992836+08:00" level=debug msg="Sending kill signal 15 to container a2e79a761f20d607d4950dff24e6323a99a51fdac07c5d35f410f5a5c81f32f4"
time="2019-04-10T09:30:20.439378328+08:00" level=debug msg="[fd] in func (daemon *Daemon) kill, c: Up 19 hours, sig: 15"

and, I got 6 events of both topic=/tasks/exit and topic=/tasks/delete:

time="2019-04-10T09:30:10.458303945+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/exit

x 6, and:

time="2019-04-10T09:30:10.623740835+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/delete
time="2019-04-10T09:30:10.624326860+08:00" level=info msg="ignoring event" module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"

x 6

3. containerd:

containerd get only 6 calls of func (l *local) Delete(ctx context.Context, req *api.DeleteContainerRequest, _ ...grpc.CallOption) (*ptypes.Empty, error) , all return successfully:

time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:4c9342f2a94c697e875540e0b32e0aeefcca86aabfafab75a8e00d1aa3f3f016,}"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:d69eb06aa331c60baefb08287c73f7641af866a42f430808514fd450bb42ede6,}"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:a4b64d7dbb606fa01e7845b1aa73571830a65bc618a8ecd4efe3f92238dd6a48,}"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:de1382ef7869fa773f69a5b6344bd46b547fb0d936abcea21c33f209e106a1cb,}"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:c8a42d84346ac753c3ec0f501cf1770da7d8949e3eee10c2fcd3d45fdd6528be,}"
time="2019-04-10T09:30:10+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"
time="2019-04-10T09:30:11+08:00" level=debug msg="[fd] in func (l *local) Delete: req: &DeleteContainerRequest{ID:25ad3b49de5f1338b15d03a6d0e970731dd967c89ce128e02dc03e7233f312fc,}"
time="2019-04-10T09:30:11+08:00" level=debug msg="[fd] no err, out func (l *local) Delete"

4. containerd-shim:

same as containerd, only get 6 calls of func (s *Service) Delete(ctx context.Context, r *ptypes.Empty) (*shimapi.DeleteResponse, error), all succeed.

log skipped...

5. runc

get serval kill calls to that failing container.

{"level":"debug","msg":"[fd][a2e79a761f20d607d4950dff24e6323a99a51fdac07c5d35f410f5a5c81f32f4] in Signal(s os.Signal, all bool): s: killed, all: true","time":"2019-04-10T09:30:10+08:00"}
{"level":"debug","msg":"[fd][c8a42d84346ac753c3ec0f501cf1770da7d8949e3eee10c2fcd3d45fdd6528be] in Signal(s os.Signal, all bool): s: killed, all: true","time":"2019-04-10T09:30:10+08:00"}
{"level":"debug","msg":"[fd][a2e79a761f20d607d4950dff24e6323a99a51fdac07c5d35f410f5a5c81f32f4] out Signal(s os.Signal, all bool)","time":"2019-04-10T09:30:10+08:00"}

Output of docker version:

docker version is 18.09.0, but modified by myself for logging

Output of docker info:

docker info is modified by myself for logging

Additional environment details (AWS, VirtualBox, physical, etc.):

@fdingiit
Copy link
Author

My point of view and questions:

I think the function which surprise me is this:

func (daemon *Daemon) Kill(container *containerpkg.Container) error {
	if !container.IsRunning() {
		return errNotRunning(container.ID)
	}

	// 1. Send SIGKILL
	if err := daemon.killPossiblyDeadProcess(container, int(syscall.SIGKILL)); err != nil {
		// While normally we might "return err" here we're not going to
		// because if we can't stop the container by this point then
		// it's probably because it's already stopped. Meaning, between
		// the time of the IsRunning() call above and now it stopped.
		// Also, since the err return will be environment specific we can't
		// look for any particular (common) error that would indicate
		// that the process is already dead vs something else going wrong.
		// So, instead we'll give it up to 2 more seconds to complete and if
		// by that time the container is still running, then the error
		// we got is probably valid and so we return it to the caller.
		if isErrNoSuchProcess(err) {
			return nil
		}

		ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
		defer cancel()

		if status := <-container.Wait(ctx, containerpkg.WaitConditionNotRunning); status.Err() != nil {
			return err
		}
	}

	// 2. Wait for the process to die, in last resort, try to kill the process directly
	if err := killProcessDirectly(container); err != nil {
		if isErrNoSuchProcess(err) {
			return nil
		}
		return err
	}

	// Wait for exit with no timeout.
	// Ignore returned status.
	<-container.Wait(context.Background(), containerpkg.WaitConditionNotRunning)

	return nil
}

There are 2 things I cannot understand:

1.

For the failed docker rm command, it calls killPossiblyDeadProcess without error, but containerd miss (or whatever) the request according to its log.

2.

It continued to call killProcessDirectly, but killed the wrong pid, return nil at line 155

@fdingiit
Copy link
Author

Continued:

After this issue happened, any command of docker rm on that failed container will stuck, the log shows docker and dockerd received the rm instruction, while containerd/containd-shim/runc didn't. Last function dockerd called is in ContainerRm at engine/daemon/delete.go, but never returned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants