
docker start from checkpoint is slow due to unnecessary image file copies #40644

Open · RELOAD22 opened this issue Mar 9, 2020 · 2 comments
Labels: area/checkpoint, area/performance, kind/experimental

Comments

RELOAD22 commented Mar 9, 2020

Description
docker start from a checkpoint is slow due to unnecessary copies of the checkpoint image files.
Starting from a checkpoint took 34s, but the actual restore took less than 2s; the unnecessary file copies accounted for the remaining 32s. In the daemon debug logs below, note the ~32s gap between the content-store Write at 21:00:34 and the shim start at 21:01:06:

Mar 05 21:00:34 k8s03 dockerd[228561]: time="2020-03-05T21:00:34.182584079+08:00" level=debug msg="event published" ns=moby topic="/containers/create" type=containerd.events.ContainerCreate
Mar 05 21:00:34 k8s03 dockerd[228561]: time="2020-03-05T21:00:34.184864540+08:00" level=debug msg="Using single walk diff for /var/lib/docker/containers/aba0d7b536ae0c3cd1308dee1fee68bb6095e0bad2f6e6f39eb703a600a8481c/checkpoints/dumpimages"
Mar 05 21:00:34 k8s03 dockerd[228561]: time="2020-03-05T21:00:34.185187083+08:00" level=debug msg="(*service).Write started" ref="/var/lib/docker/containers/aba0d7b536ae0c3cd1308dee1fee68bb6095e0bad2f6e6f39eb703a600a8481c/checkpoints/dumpimages"
Mar 05 21:01:06 k8s03 dockerd[228561]: time="2020-03-05T21:01:06.891921255+08:00" level=info msg="shim containerd-shim started" address="/containerd-shim/moby/aba0d7b536ae0c3cd1308dee1fee68bb6095e0bad2f6e6f39eb703a600a8481c/shim.sock" debug=true pid=230167
Mar 05 21:01:06 k8s03 dockerd[228561]: time="2020-03-05T21:01:06.936584694+08:00" level=debug msg="event published" ns=moby topic="/tasks/create" type=containerd.events.TaskCreate
Mar 05 21:01:06 k8s03 dockerd[228561]: time="2020-03-05T21:01:06.937001470+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/create
Mar 05 21:01:07 k8s03 dockerd[228561]: time="2020-03-05T21:01:07.277540630+08:00" level=debug msg="sandbox set key processing took 120.34559ms for container aba0d7b536ae0c3cd1308dee1fee68bb6095e0bad2f6e6f39eb703a600a8481c"
Mar 05 21:01:08 k8s03 dockerd[228561]: time="2020-03-05T21:01:08.840972430+08:00" level=debug msg="event published" ns=moby topic="/tasks/start" type=containerd.events.TaskStart
Mar 05 21:01:08 k8s03 dockerd[228561]: time="2020-03-05T21:01:08.841320486+08:00" level=debug msg=event module=libcontainerd namespace=moby topic=/tasks/start
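
For reference, a minimal reproduction sketch using the Docker Go client (the container name "demo" and checkpoint ID "cp1" are placeholders; the daemon must run with experimental features enabled and CRIU installed):

package main

import (
	"context"
	"log"
	"time"

	"github.com/docker/docker/api/types"
	"github.com/docker/docker/client"
)

func main() {
	cli, err := client.NewClientWithOpts(client.FromEnv)
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()

	// Checkpoint the running container, stopping it in the process.
	if err := cli.CheckpointCreate(ctx, "demo", types.CheckpointCreateOptions{
		CheckpointID: "cp1",
		Exit:         true,
	}); err != nil {
		log.Fatal(err)
	}

	// Restore from the checkpoint; timing this call shows the copy overhead.
	start := time.Now()
	if err := cli.ContainerStart(ctx, "demo", types.ContainerStartOptions{
		CheckpointID: "cp1",
	}); err != nil {
		log.Fatal(err)
	}
	log.Printf("start from checkpoint took %s", time.Since(start))
}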

I read the source code and found the problem. The first copy happens in dockerd's libcontainerd client: before creating the task, it tars the whole checkpoint directory and writes it into the containerd content store:

if checkpointDir != "" {
	// write checkpoint to the content store
	tar := archive.Diff(ctx, "", checkpointDir)
	cp, err = c.writeContent(ctx, images.MediaTypeContainerd1Checkpoint, checkpointDir, tar)
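
As far as I can tell, a rough sketch of what this first copy amounts to (the helper name copyCheckpointToStore is mine, not dockerd's):

import (
	"context"
	"io"

	"github.com/containerd/containerd/archive"
	"github.com/containerd/containerd/content"
)

// Rough sketch: every byte of the CRIU dump is re-read from disk, tarred,
// and streamed into the containerd content store before the task is even
// created.
func copyCheckpointToStore(ctx context.Context, cs content.Store, checkpointDir string) error {
	tar := archive.Diff(ctx, "", checkpointDir) // tar stream over the whole dump directory
	defer tar.Close()

	w, err := cs.Writer(ctx, content.WithRef(checkpointDir))
	if err != nil {
		return err
	}
	defer w.Close()

	if _, err := io.Copy(w, tar); err != nil { // byte-for-byte copy #1
		return err
	}
	return w.Commit(ctx, 0, "") // the store computes size and digest
}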

Second, when containerd creates the new task, it extracts the checkpoint files from the content store into yet another temporary directory. This is the second unnecessary copy:
https://github.com/containerd/containerd/blob/92cfc5b1fb91e0e09bd9ef18de082cd648e2bdc0/services/tasks/local.go#L144

	// jump get checkpointPath from checkpoint image
	if checkpointPath == "" && r.Checkpoint != nil {
		checkpointPath, err = ioutil.TempDir(os.Getenv("XDG_RUNTIME_DIR"), "ctrd-checkpoint")
		if err != nil {
			return nil, err
		}
		if r.Checkpoint.MediaType != images.MediaTypeContainerd1Checkpoint {
			return nil, fmt.Errorf("unsupported checkpoint type %q", r.Checkpoint.MediaType)
		}
		reader, err := l.store.ReaderAt(ctx, ocispec.Descriptor{
			MediaType:   r.Checkpoint.MediaType,
			Digest:      r.Checkpoint.Digest,
			Size:        r.Checkpoint.Size_,
			Annotations: r.Checkpoint.Annotations,
		})
		if err != nil {
			return nil, err
		}
		_, err = archive.Apply(ctx, checkpointPath, content.NewReader(reader))
		reader.Close()
		if err != nil {
			return nil, err
		}
	}

Moreover, the files are read and written through containerd's own content-store framework rather than copied directly, which makes the process slower still. As far as I understand, both copies are unnecessary: the restore could read the checkpoint files from their original location, which would save a lot of time.
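
One possible direction, sketched under the assumption that dockerd can pass runc v2 runtime options through task creation: getRestorePath in containerd already returns CriuImagePath from the runc options, so pointing it at the original checkpoint directory would let CRIU restore in place and skip both copies (the withRestoreFromDir helper is hypothetical):

import (
	"context"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/runtime/v2/runc/options"
)

// Hypothetical sketch: point the runc runtime directly at the on-disk
// dump directory instead of round-tripping it through the content store.
// With CriuImagePath set, getRestorePath returns it and the temp-dir
// extraction shown above is skipped.
func withRestoreFromDir(checkpointDir string) containerd.NewTaskOpts {
	return func(ctx context.Context, client *containerd.Client, info *containerd.TaskInfo) error {
		info.Options = &options.Options{ // note: replaces any existing runc options
			CriuImagePath: checkpointDir,
		}
		return nil
	}
}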

DerMistkaefer commented

@thaJeztah is there anything to expect on this issue in the future?

thaJeztah added the area/checkpoint label Mar 4, 2024

ayushr2 commented Mar 8, 2024

Any updates on this? When the generated checkpoint image is large (say >8 GB), the checkpoint and restore durations are far too long (sometimes >5 minutes).
