Docker restart failed on rare cases and then container never start again. #47549

zhuyuxn · 2024-03-12T05:22:17Z

Description

In rare cases, docker restart fails and the container is never started again.

Reproduce

Run docker restart on a container.
In rare cases, SIGKILL takes a long time to take effect.

Expected behavior

The container should be brought up.

docker version

Client:
 Version:           20.10.25
 API version:       1.41
 Go version:        go1.20.10
 Git commit:        b82b9f3a0e763304a250531cb9350aa6d93723c9
 Built:             Wed Oct 18 08:30:50 2023
 OS/Arch:           linux/amd64
 Context:           default
 Experimental:      true

Server:
 Engine:
  Version:          20.10.25
  API version:      1.41 (minimum version 1.12)
  Go version:       go1.20.10
  Git commit:       5df983c
  Built:            Wed Oct 18 08:32:37 2023
  OS/Arch:          linux/amd64
  Experimental:     false
 containerd:
  Version:          1.6.22
  GitCommit:        8165feabfdfe38c65b599c4993d227328c231fca
 runc:
  Version:          1.1.9
  GitCommit:        ccaecfcbc907d70a7aa870a6650887b901b25b82
 docker-init:
  Version:          0.19.0
  GitCommit:

docker info

Client:
 Context:    default
 Debug Mode: false

Server:
 Containers: 7
  Running: 7
  Paused: 0
  Stopped: 0
 Images: 16
 Server Version: 20.10.25
 Storage Driver: overlay2
  Backing Filesystem: extfs
  Supports d_type: true
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 1
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local logentries splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 io.containerd.runtime.v1.linux runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: 8165feabfdfe38c65b599c4993d227328c231fca
 runc version: ccaecfcbc907d70a7aa870a6650887b901b25b82
 init version:
 Security Options:
  apparmor
  seccomp
   Profile: default
 Kernel Version: 5.15.138.1-4.cm2
 Operating System: CBL-Mariner/Linux
 OSType: linux
 Architecture: x86_64
 CPUs: 2
 Total Memory: 7.815GiB
 Name: d02224a04ed0
 ID: ICZ5:M2ZI:PJYL:PVGE:722V:XWCT:QLFR:SKFZ:JF5K:VXFS:BWGT:PQPY
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: true

Additional Info

Log

time="2024-03-02T15:58:19.549568691Z" level=info msg="Container failed to exit within 1m30s of signal 15 - using the force" container=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f
time="2024-03-02T15:58:29.579026954Z" level=error msg="Container failed to exit within 10 seconds of kill - trying direct SIGKILL" container=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f error="context deadline exceeded"
time="2024-03-02T15:58:33.580895749Z" level=error msg="Error killing the container" container=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f error="tried to kill container, but did not receive an exit event"
time="2024-03-02T15:58:33.588485246Z" level=error msg="Handler for POST /containers/MySQL/restart returned error: Cannot restart container MySQL: tried to kill container, but did not receive an exit event"
time="2024-03-02T15:58:51.564412478Z" level=info msg="ignoring event" container=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f module=libcontainerd namespace=moby topic=/tasks/delete type="*events.TaskDelete"
time="2024-03-02T15:58:51.564448059Z" level=info msg="shim disconnected" id=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f
time="2024-03-02T15:58:51.564496272Z" level=warning msg="cleaning up after shim disconnected" id=38d614925484c56efcba34aa3a3f25259e2f2acc4fbb00b6a550051942c7505f namespace=moby
time="2024-03-02T15:58:51.564504721Z" level=info msg="cleaning up dead shim"
time="2024-03-02T15:58:51.570746324Z" level=warning msg="cleanup warnings time="2024-03-02T15:58:51Z" level=info msg="starting signal loop" namespace=moby pid=3996640 runtime=io.containerd.runc.v2\n"
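To make the timing in this log explicit, here is a minimal Go sketch that parses two of the timestamps above and computes how long after the "Error killing the container" message the /tasks/delete event arrived. The helper name `gap` is hypothetical and exists only for this illustration.

```go
package main

import (
	"fmt"
	"time"
)

// gap returns the time elapsed between two RFC 3339 timestamps
// in the form they take in the daemon log.
func gap(from, to string) (time.Duration, error) {
	a, err := time.Parse(time.RFC3339Nano, from)
	if err != nil {
		return 0, err
	}
	b, err := time.Parse(time.RFC3339Nano, to)
	if err != nil {
		return 0, err
	}
	return b.Sub(a), nil
}

func main() {
	// "Error killing the container" vs. the later "/tasks/delete" event.
	d, err := gap("2024-03-02T15:58:33.580895749Z", "2024-03-02T15:58:51.564412478Z")
	if err != nil {
		panic(err)
	}
	fmt.Println(d) // just under 18 seconds
}
```

The exit event arrived almost 18 seconds after the kill was reported as failed, well past the 2-second grace period seen in the code path below.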

Code path

In

// daemon/restart.go:35
func (daemon *Daemon) containerRestart(ctx context.Context, daemonCfg *configStore, container *container.Container, options containertypes.StopOptions) error {

It called

// daemon/restart.go:35
err := daemon.containerStop(ctx, container, options)

Then it called

//daemon/stop.go:48
func (daemon *Daemon) containerStop(ctx context.Context, ctr *container.Container, options containertypes.StopOptions) (retErr error) {

Then it called

// daemon/stop.go:113-124
    // Stop either failed or container didn't exit, so fallback to kill.
    if err := daemon.Kill(ctr); err != nil {
        // got a kill error, but give container 2 more seconds to exit just in case
        subCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
        defer cancel()
        status := <-ctr.Wait(subCtx, container.WaitConditionNotRunning)
        if status.Err() != nil {
            log.G(ctx).WithError(err).WithField("container", ctr.ID).Errorf("error killing container: %v", status.Err())
            return err
        }
        // container did exit, so ignore previous errors and continue
    }

Then it called

//daemon/kill.go:148
func (daemon *Daemon) Kill(container *containerpkg.Container) error {

Then it returns an error

//daemon/kill.go:187-189
    if status := <-container.Wait(ctx2, containerpkg.WaitConditionNotRunning); status.Err() != nil {
        return errors.New("tried to kill container, but did not receive an exit event")
    }

The restart error handling then returns that error without starting the container.

//daemon/restart.go:63-66
        err := daemon.containerStop(ctx, container, options)
        if err != nil {
            return err
        }
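The effect of this chain can be reproduced with a small standalone model. The sketch below is not moby code: `fakeContainer`, `kill`, `restart`, and `demoSlowExit` are hypothetical stand-ins, and the durations are scaled down (20 ms stands in for the grace period). It shows that when the exit arrives after the grace period, `restart` returns the kill error and the start step is never reached.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errNoExitEvent = errors.New("tried to kill container, but did not receive an exit event")

// fakeContainer is a hypothetical stand-in for a container whose process
// takes exitDelay to disappear after SIGKILL.
type fakeContainer struct {
	exitDelay time.Duration
	exited    chan struct{}
	started   bool
}

// kill sends the simulated signal and waits up to grace for the exit event.
// It must be called at most once per fakeContainer.
func kill(c *fakeContainer, grace time.Duration) error {
	go func() {
		time.Sleep(c.exitDelay)
		close(c.exited)
	}()
	ctx, cancel := context.WithTimeout(context.Background(), grace)
	defer cancel()
	select {
	case <-c.exited:
		return nil
	case <-ctx.Done():
		return errNoExitEvent
	}
}

// restart mirrors the shape of the error handling quoted above:
// any stop error is returned and the start step is skipped.
func restart(c *fakeContainer, grace time.Duration) error {
	if err := kill(c, grace); err != nil {
		return err
	}
	c.started = true
	return nil
}

// demoSlowExit restarts a container that needs 200 ms to exit
// while the caller only waits 20 ms.
func demoSlowExit() (bool, error) {
	c := &fakeContainer{exitDelay: 200 * time.Millisecond, exited: make(chan struct{})}
	err := restart(c, 20*time.Millisecond)
	return c.started, err
}

func main() {
	started, err := demoSlowExit()
	fmt.Println(started, err)
}
```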

Proposal

The restart can fail because the kill step fails. Worse, once that happens the container is never brought up again.

So one solution may be:

When restart finds that the error is "tried to kill container, but did not receive an exit event", let it wait for the container to be killed and then start the container.
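A minimal sketch of this proposal follows. It assumes nothing about the real daemon internals: `slowContainer`, `errNoExitEvent`, `killWithGrace`, `restartWithExtendedWait`, and `demo` are hypothetical names, and the durations are scaled down. The extended wait is bounded by the caller's context rather than waiting indefinitely; that bound is an assumption added here, not part of the proposal.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errNoExitEvent = errors.New("tried to kill container, but did not receive an exit event")

// slowContainer is a hypothetical model of a container that exits late.
type slowContainer struct {
	exited  chan struct{} // closed when the process is finally gone
	started bool
}

// killWithGrace simulates SIGKILL: the process exits after exitDelay,
// but the caller only waits for grace before giving up.
func killWithGrace(c *slowContainer, exitDelay, grace time.Duration) error {
	go func() {
		time.Sleep(exitDelay)
		close(c.exited)
	}()
	select {
	case <-c.exited:
		return nil
	case <-time.After(grace):
		return errNoExitEvent
	}
}

// restartWithExtendedWait starts the container even when the kill step
// timed out, provided the exit event arrives before ctx expires.
func restartWithExtendedWait(ctx context.Context, c *slowContainer, exitDelay, grace time.Duration) error {
	err := killWithGrace(c, exitDelay, grace)
	if errors.Is(err, errNoExitEvent) {
		select {
		case <-c.exited: // the kill landed late; safe to continue
			err = nil
		case <-ctx.Done():
			return fmt.Errorf("container still running after extended wait: %w", err)
		}
	}
	if err != nil {
		return err
	}
	c.started = true
	return nil
}

// demo runs a restart where the exit arrives after the grace period
// but within the extended wait.
func demo() (bool, error) {
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	c := &slowContainer{exited: make(chan struct{})}
	err := restartWithExtendedWait(ctx, c, 100*time.Millisecond, 10*time.Millisecond)
	return c.started, err
}

func main() {
	started, err := demo()
	fmt.Println(started, err)
}
```

In this model the kill step still reports the timeout, but the restart only fails if the container has not exited by the time the outer context expires.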

elemount · 2024-03-12T05:57:31Z

SIGKILL may not complete within 2 seconds in some expected cases. For example, if the exiting process was using a lot of swap space, the kernel needs time to free that space during exit.
In such cases, sending SIGKILL and waiting 2 seconds is not enough.

vvoland · 2024-03-12T12:52:06Z

You're using quite a dated Docker version now. Can you reproduce this issue on the latest v25.0.4?

zhuyuxn · 2024-03-13T02:27:44Z

We can reproduce the issue on the latest version.

zhuyuxn added the kind/bug and status/0-triage labels Mar 12, 2024

vvoland added the version/20.10 label Mar 12, 2024
