
Add --live-restore flag #23213

Merged
1 commit merged into moby:master from crosbymichael:restore-option on Jun 14, 2016

Conversation

@crosbymichael (Member) commented Jun 2, 2016

This flag enables full support of daemonless containers in Docker. It
ensures that Docker does not stop containers on shutdown or restore and
properly reconnects to the container when restarted.

This is not the default because of backwards compatibility, but it should be the
desired outcome for people running containers in production.

Signed-off-by: Michael Crosby crosbymichael@gmail.com

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from afce6a2 to 0f59209 Jun 2, 2016

@thaJeztah (Member) commented Jun 2, 2016

Can we make this option live-reloadable (via daemon.json)?

@crosbymichael (Member) commented Jun 2, 2016

@thaJeztah is there something special that I should be doing for the config file?

@thaJeztah (Member) commented Jun 2, 2016

@crosbymichael yeah, it's still a bit hacky, but you need to explicitly allow it to be set in the Reload() function, and it needs a mention in the docs/reference/commandline/dockerd.md section about reloadable options.

Here's a PR that added an option that's reloadable at runtime: 7368e41

@thaJeztah (Member) commented Jun 2, 2016

Oh, and the validateConfiguration() function should check that the option is not set both on the command line (through a flag) and in daemon.json, because that combination is invalid.
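For reference, daemon options like this one live in `/etc/docker/daemon.json`; a minimal sketch, assuming the flag maps to a `live-restore` key the way other daemon flags map to their dashed names:

```json
{
    "live-restore": true
}
```

With reload support wired up in Reload(), sending SIGHUP to the daemon would pick up a change to this file; the validation described above would reject setting the option both here and via the command-line flag.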

```go
clnt.unlock(cont.Id)

if err := clnt.Signal(containerID, int(syscall.SIGTERM)); err != nil {
	logrus.Errorf("error sending sigterm to %v: %v", containerID, err)
```

@AkihiroSuda (Member) Jun 3, 2016

perhaps %v --> %s for containerID?

```go
select {
case <-time.After(10 * time.Second):
	if err := clnt.Signal(containerID, int(syscall.SIGKILL)); err != nil {
		logrus.Errorf("error sending sigkill to %v: %v", containerID, err)
```

on restart leaving the containers running.

While the Docker daemon is down logging will still be captured, however, it will be capped at 64kb before
the buffer fills up, blocking the process. Docker will need to be restarted to flush these buffers.

@cyphar (Contributor) Jun 3, 2016

This seems ... messy, and not very "production ready". Maybe we should add some way to re-dup the file descriptors of containerd-shim to point to some file that we can load and restore in Docker?

@crosbymichael (Member) Jun 3, 2016

You cannot do that, because of hard crashes or the daemon getting SIGKILL'd. Also, I put that point in there because it's not meant for people to start containers and then stop Docker; it is meant so that containers are not affected by crashes, upgrades, etc., where the daemon is down for small amounts of time.

```
@@ -445,12 +447,43 @@ func (clnt *client) restore(cont *containerd.Container, options ...CreateOption)
}

func (clnt *client) Restore(containerID string, options ...CreateOption) error {
	if clnt.liveRestore {
```

@cyphar (Contributor) Jun 3, 2016

I'm confused about why we kill all container processes if they are still hanging around when we start. I understand that we need --live-restore in order to maintain backwards compatibility when the daemon dies. But it seems odd that we go from the old code (which did "live restore" by default) to new code that will kill all of the containers if they're still lying around. Does this play nicely when you have more than one consumer of containerd?

@crosbymichael (Member) Jun 3, 2016

You are reading it wrong; it was always like that. I just added the original code back from my previous PR because of backwards-compat concerns.

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from 0f59209 to a5113ed Jun 4, 2016

@icecrime (Contributor) commented Jun 4, 2016

```
11:21:33 ---> Making bundle: validate-lint (in bundles/1.12.0-dev/validate-lint)
11:21:33 Errors from golint:
11:21:33 libcontainerd/remote_solaris.go:31:1: exported function WithLiveRestore should have comment or be unexported
11:21:33 libcontainerd/remote_windows.go:33:1: exported function WithLiveRestore should have comment or be unexported
```

@LK4D4 (Contributor) commented Jun 13, 2016

The experimental failures might be related. I'll restart CI.

launching `dockerd`. This will ensure that Docker does not kill containers on graceful shutdown or
on restart leaving the containers running.

While the Docker daemon is down logging will still be captured, however, it will be capped at 64kb before

@mlaventure (Contributor) Jun 13, 2016

Can we just say it'll be capped at the internal kernel buffer size?

```go
clnt.lock(cont.Id)
container := clnt.newContainer(cont.BundlePath)
container.systemPid = systemPid(cont)
clnt.appendContainer(container)
```

@tonistiigi (Member) Jun 13, 2016

Need to make sure the container is also removed from client and that fifo data is discarded. See 1e36e34

@tonistiigi (Member) Jun 13, 2016

It's for the case where wait doesn't return and you just call setExited on line 488.

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from ae9dd00 to 07edd7c Jun 13, 2016

launching `dockerd`. This will ensure that Docker does not kill containers on graceful shutdown or
on restart leaving the containers running.

While the Docker daemon is down logging will still be captured, however, it will be capped at the kernel's pipe buffere size before the buffer fills up, blocking the process.

@tiborvass (Collaborator) Jun 13, 2016

typo: s/buffere/buffer/

@thaJeztah thaJeztah added this to the 1.12.0 milestone Jun 13, 2016

```go
f := ctr.fifo(i)
c := make(chan struct{})
go func() {
	close(c) // this channel is used to not close the writer too early, before readonly open has been called.
```

@tiborvass (Collaborator) Jun 13, 2016

is this a hack?

@crosbymichael (Member) Jun 13, 2016

It signals to the calling code that the goroutine has been scheduled, to get rid of races where the call returns before the goroutine's code has started executing.

@orivej Jun 27, 2016

closeReaderFifo may still run before openReaderFromFifo (even though it is less likely with this synchronization point). Maybe this is better:

```go
r := openReaderFromFifo(f)
go io.Copy(ioutil.Discard, r)
closeReaderFifo(f)
```

@mlaventure (Contributor) Jun 27, 2016

@orivej

openReaderFromFifo is a blocking operation: as we're opening it with O_RDONLY, it will block until another process opens the FIFO for writing.

@orivej Jun 27, 2016

In fact, openReaderFromFifo spawns a goroutine that opens the pipe, and immediately returns.

This means that my code sample would work, but it still would not guarantee that closeReaderFifo runs after openReaderFromFifo opens the file.

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from 07edd7c to d15a3d4 Jun 13, 2016

@icecrime (Contributor) commented Jun 13, 2016

LGTM

```go
}

clnt.deleteContainer(containerID)
```

@mlaventure (Contributor) Jun 13, 2016

Nitpick: this should be within the above if.

@mlaventure (Contributor) commented Jun 13, 2016

Aside from my nit, LGTM (IANAM).

@icecrime (Contributor) commented Jun 14, 2016

@crosbymichael I see a few failures on experimental:

```
14:03:11 FAIL: docker_cli_daemon_experimental_test.go:64: DockerDaemonSuite.TestCleanupMountsAfterDaemonCrash
14:03:11
14:03:11 [d90042189] waiting for daemon to start
14:03:11 [d90042189] daemon started
14:03:11 [d90042189] exiting daemon
14:03:11 [d90042189] waiting for daemon to start
14:03:11 [d90042189] daemon started
14:03:11 docker_cli_daemon_experimental_test.go:90:
14:03:11     c.Fatalf("Container %s expected to stay alive after daemon restart", id)
14:03:11 ... Error: Container 1f90a3892d1280613d9a7221eefb640ae56e35e5c37fda7d6638e06690bba402 expected to stay alive after daemon restart
14:03:11
14:03:11 [d90042189] exiting daemon
```

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from d15a3d4 to cb80dcd Jun 14, 2016

Add --live-restore flag
This flag enables full support of daemonless containers in Docker. It
ensures that Docker does not stop containers on shutdown or restore and
properly reconnects to the container when restarted.

This is not the default because of backwards compatibility, but it should be the
desired outcome for people running containers in production.

Signed-off-by: Michael Crosby <crosbymichael@gmail.com>

@crosbymichael crosbymichael force-pushed the crosbymichael:restore-option branch from cb80dcd to d705dab Jun 14, 2016

@tonistiigi (Member) commented Jun 14, 2016

LGTM

@icecrime (Contributor) commented Jun 14, 2016

Ping @thaJeztah: please give the docs a read, and we'll fix what needs fixing in a follow-up PR.

@crosbymichael crosbymichael merged commit 3020081 into moby:master Jun 14, 2016

5 of 7 checks passed

- experimental: Jenkins build Docker-PRs-experimental 19902 is running
- gccgo: Jenkins build Docker-PRs-gccgo 6769 is running
- docker/dco-signed: All commits signed
- janky: Jenkins build Docker-PRs 28696 has succeeded
- userns: Jenkins build Docker-PRs-userns 10858 has succeeded
- win2lin: Jenkins build Docker-PRs-Win2Lin 27281 has succeeded
- windowsTP5: Jenkins build Docker-PRs-WoW-TP5 3187 has succeeded

@crosbymichael crosbymichael deleted the crosbymichael:restore-option branch Jun 14, 2016

@innocentme1 commented Jul 27, 2016

@crosbymichael @thaJeztah

  1. Are there any updated docs regarding this feature? I could not find any. Can you help me find the doc?
  2. So, this is a flag to the daemon; is it enabled by default? Do users have the option of changing this default setting?
  3. Do we have this at the per-container level?

@thaJeztah (Member) commented Jul 27, 2016

@innocentme1 basic docs are part of this PR; see https://github.com/docker/docker/pull/23213/files#diff-fc5c21bb9d1d48db7d37c003bb62b566R283. It's a daemon-level flag, and it is not enabled by default. It's not something that can be set per container.

@HackToday (Contributor) commented Apr 28, 2017

Hi @thaJeztah and @crosbymichael,
Yes, with that, the Docker daemon can stop while containers keep running.

From my usage, I found that since the containerd daemon was introduced, if containerd dies, all containers still become unavailable (in stopped status). That means containerd hits the same issue as the old Docker daemon.

Is that right? Thanks

@HackToday (Contributor) commented Apr 30, 2017
