Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

After Windows Server 2019 boot containers not starting #39623

Closed
robertmuehsig opened this issue Jul 29, 2019 · 5 comments · Fixed by #39703
Closed

After Windows Server 2019 boot containers not starting #39623

robertmuehsig opened this issue Jul 29, 2019 · 5 comments · Fixed by #39703
Labels
area/daemon kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows version/19.03

Comments

@robertmuehsig
Copy link

Description

Windows Server 2019 does not start containers automatically on boot, even though the containers have been started with --restart always parameter.

Steps to reproduce the issue:

  1. Open up elevated PowerShell.
  2. Create a custom Docker network (driver nat), named blaa.
  3. Run a container like this: docker run -d --restart always --name testcontainer -p 8080:80 mcr.microsoft.com/windows/servercore:ltsc2019.
  4. Execute Restart-Computer.
    When the machine starts up, execute docker ps.

Describe the results you received:
docker ps shows that no containers are running.

Describe the results you expected:
docker ps lists the container named testcontainer.

Additional information you deem important (e.g. issue happens only occasionally):
Happens all the time.

Windows Event Log shows errors like this:
Failed to delete container 80e440a586087ed6dd1bf14bb152061eadf15d5343a2dcf6b5b2dd14e0ad98fd from containerd [error=no such container]

Output of docker version:

Client: Docker Engine - Enterprise
 Version:           19.03.1
 API version:       1.40
 Go version:        go1.12.5
 Git commit:        f660560464
 Built:             07/25/2019 20:59:52
 OS/Arch:           windows/amd64
 Experimental:      false

Server: Docker Engine - Enterprise
 Engine:
  Version:          19.03.0
  API version:      1.40 (minimum version 1.24)
  Go version:       go1.12.5
  Git commit:       87b1f470ad
  Built:            07/16/2019 23:39:21
  OS/Arch:          windows/amd64
  Experimental:     false

Output of docker info:

PS C:\Users\Administrator> docker info
Client:
 Debug Mode: false
 Plugins:
  cluster: Manage Docker clusters (Docker Inc., v1.0.1)

Server:
 Containers: 1
  Running: 1
  Paused: 0
  Stopped: 0
 Images: 5
 Server Version: 19.03.0
 Storage Driver: windowsfilter
  Windows:
 Logging Driver: json-file
 Plugins:
  Volume: local
  Network: ics l2bridge l2tunnel nat null overlay transparent
  Log: awslogs etwlogs fluentd gcplogs gelf json-file local logentries splunk syslog
 Swarm: inactive
 Default Isolation: process
 Kernel Version: 10.0 17763 (17763.1.amd64fre.rs5_release.180914-1434)
 Operating System: Windows Server 2019 Standard Version 1809 (OS Build 17763.615)
 OSType: windows
 Architecture: x86_64
 CPUs: 16
 Total Memory: 31.95GiB
 Name: 14450hostserveu
 ID: IJUZ:NS4Z:QY3W:DGOK:TTC4:6ZYV:Z4GW:QWS4:OFDV:XWCF:JUGI:PWW7
 Docker Root Dir: C:\ProgramData\docker
 Debug Mode: false
 Username: sevitechq
 Registry: https://index.docker.io/v1/
 Labels:
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

Additional environment details (AWS, VirtualBox, physical, etc.):

Windows Server 2019 (Version 1809)

The same stuff worked in a older Windows Server 2016 environment (with a much older docker version).

I guess this is related to this issue:
#38911

I also tried this, but this should already be fixed :/
#33542

My workaround (which only work in some scenarios): Use a scheduled task to start all stopped contains like this:

docker start $(docker ps -a -q -f status=exited)
@olljanat
Copy link
Contributor

olljanat commented Aug 5, 2019

We can also see this on all our Windows Server 2019 + Docker 19.03.1 servers.

I took debug logs from one of the by disabling Docker service auto start and started it manually with command dockerd -D:

  • Log when container was created using command: dockerd_create_container.log
    docker run -d --restart always --name testcontainer --network nat -p 8080:80 mcr.microsoft.com/windows/servercore:ltsc2019 ping -t 127.0.0.1
  • Log from daemon start after server reboot: dockerd_try_restart.log

On this case Docker was actually saying that container was running:

C:\>docker ps -a
CONTAINER ID        IMAGE                                           COMMAND               CREATED             STATUS              PORTS                  NAMES
a0e61ed95ad0        mcr.microsoft.com/windows/servercore:ltsc2019   "ping -t 127.0.0.1"   39 seconds ago      Up 35 seconds       0.0.0.0:8080->80/tcp   testcontainer

but in reality it was not. It can be seen by trying to execute something inside of it:

C:\>docker exec -it a0e61ed95ad0 cmd
no such container

which sounds same than #39354 (comment)

Interesting part of log is:

time="2019-08-05T15:54:51.836283600+02:00" level=debug msg="restoring container" container=a0e61ed95ad094995a1ab1a043ab603dd24ab15f870654279fdae254ae8d9443 paused=false running=true
time="2019-08-05T15:54:51.836283600+02:00" level=debug msg="restore()" container=a0e61ed95ad094995a1ab1a043ab603dd24ab15f870654279fdae254ae8d9443 module=libcontainerd namespace=moby
time="2019-08-05T15:54:51.836283600+02:00" level=debug msg="hcsshim::OpenComputeSystem - Begin Operation" cid=a0e61ed95ad094995a1ab1a043ab603dd24ab15f870654279fdae254ae8d9443
time="2019-08-05T15:54:51.837405100+02:00" level=debug msg="hcsshim::OpenComputeSystem - End Operation - Success" cid=a0e61ed95ad094995a1ab1a043ab603dd24ab15f870654279fdae254ae8d9443
time="2019-08-05T15:54:51.837405100+02:00" level=error msg="Failed to delete container a0e61ed95ad094995a1ab1a043ab603dd24ab15f870654279fdae254ae8d9443 from containerd" error="no such container"

because if it looks to be that containerd.Restore() gets called and it does not return any error but it also does not return alive flag:

moby/daemon/daemon.go

Lines 318 to 342 in bad0b4e

logrus.WithFields(logrus.Fields{
"container": c.ID,
"running": c.IsRunning(),
"paused": c.IsPaused(),
}).Debug("restoring container")
var (
err error
alive bool
ec uint32
exitedAt time.Time
process libcontainerdtypes.Process
)
alive, _, process, err = daemon.containerd.Restore(context.Background(), c.ID, c.InitializeStdio)
if err != nil && !errdefs.IsNotFound(err) {
logrus.Errorf("Failed to restore container %s with containerd: %s", c.ID, err)
return
}
if !alive && process != nil {
ec, exitedAt, err = process.Delete(context.Background())
if err != nil {
logrus.WithError(err).Errorf("Failed to delete container %s from containerd", c.ID)
return
}

so can it be that restore actually fails on containerd side but it does not get written to log?

@jhowardmsft any ideas?

cc @thaJeztah

@olljanat
Copy link
Contributor

olljanat commented Aug 6, 2019

And I just tested that same issue happens also on Windows Server 2016 + Docker 19.03.1 combination.

@derek derek bot added kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. area/daemon and removed area/networking labels Aug 6, 2019
@thaJeztah
Copy link
Member

thaJeztah commented Aug 6, 2019

ping @ddebroy @vikramhh @simonferquel PTAL

opened internal ticket ENGCORE-929 for this

@ktmitton
Copy link

ktmitton commented Aug 7, 2019

As mentioned in my ticket, I noticed the issue with restarts of the docker service in Windows 10, and these logs stood out to me during the stop portion of restart:

[07:07:52.895][WindowsDaemon     ][Info   ] debug: Sending kill signal 15 to container 6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11
[07:07:52.895][WindowsDaemon     ][Info   ] debug: Signal() [process=init pid=306 signal=15 module=libcontainerd namespace=moby container=6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11]
[07:07:52.895][WindowsDaemon     ][Info   ] debug: hcsshim::ComputeSystem::Shutdown - Begin Operation [cid=6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11]
[07:07:52.908][WindowsDaemon     ][Info   ] debug: hcsshim::ComputeSystem::Shutdown - End Operation - Success [cid=6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11]
[07:08:22.908][WindowsDaemon     ][Info   ] Container 6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11 failed to exit within 30 seconds of signal 15 - using the force
[07:08:22.908][WindowsDaemon     ][Info   ] debug: Sending kill signal 9 to container 6baa8ec6084e8f9379bece4c0e8356012a7cfbb0f606cac32350e1ae27dc5b11

@thaJeztah
Copy link
Member

Fix is in the works; #39703

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/daemon kind/bug Bugs are bugs. The cause may or may not be known at triage time so debugging may be needed. platform/windows version/19.03
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants