runsc restore: "inconsistent private memory files" for a pod-shared overlay mounted by multiple containers #13608

Description

@mayur-tolexo

Problem

When a tmpfs mount is shared across multiple containers in a Pod (mount hint share=pod, or share=container) and the whole sandbox is checkpointed and then restored, restore fails for every container past the first. The first container restores and runs; the second dies at start:

OCI runtime restore failed: starting container: starting sub-container
[/counter --tick=1s --state-file=/workspace/b.state]:
inconsistent private memory files on restore:
savedMFOwners = [writer-a:/  writer-a:/workspace  writer-b:/],
mfmap = map[writer-a:/ ... writer-a:/workspace ... writer-b:/ ... writer-b:/workspace ...]

savedMFOwners has one entry for the shared /workspace overlay (owned by the first container), but mfmap on restore has one per mounting container, so the counts disagree and loadPrivateMemoryFiles (pkg/sentry/kernel/kernel_restore.go) aborts.

Root cause

A pod-shared overlay is backed by a single MemoryFile. At runtime the first container mounts the master via getSharedMount (runsc/boot/vfs.go); peer containers reuse that master and their extra filestore FD is closed, so only one private MemoryFile exists and PrepareSave registers exactly one owner.

On restore, configureRestore (runsc/boot/vfs.go) doesn't account for the sharing — it creates and registers a private MemoryFile for every container's submount that carries a filestore FD. So a Pod with the overlay mounted in two containers restores with two MemoryFile entries against one saved owner, and the counts mismatch.
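The mismatch can be reproduced in a toy model. The following is only a sketch under stated assumptions: the `mount` type and the `ownersAtSave`, `keysAtRestore`, and `examplePod` helpers are hypothetical names invented for illustration and are not gVisor code. It models only the counting behavior described above, using the container and mount names from the error message.

```go
package main

import "fmt"

// mount is a simplified, hypothetical model of one container mount backed by
// a filestore FD. It is not a gVisor type.
type mount struct {
	container string // container name, e.g. "writer-a"
	dest      string // mount destination inside the container
	sharedSrc string // pod-global source for a shared mount, "" if private
}

func (m mount) key() string { return m.container + ":" + m.dest }

// ownersAtSave models the runtime behavior described above: the first
// container to mount a shared source owns its MemoryFile, and peers reuse it.
func ownersAtSave(mounts []mount) []string {
	seen := map[string]bool{}
	var owners []string
	for _, m := range mounts {
		if m.sharedSrc != "" {
			if seen[m.sharedSrc] {
				continue // peer reuses the master; no new MemoryFile
			}
			seen[m.sharedSrc] = true
		}
		owners = append(owners, m.key())
	}
	return owners
}

// keysAtRestore models the restore path described above: one private
// MemoryFile per mount with a filestore FD, regardless of sharing.
func keysAtRestore(mounts []mount) []string {
	keys := make([]string, 0, len(mounts))
	for _, m := range mounts {
		keys = append(keys, m.key())
	}
	return keys
}

// examplePod mirrors the two-container Pod from the error message.
func examplePod() []mount {
	return []mount{
		{container: "writer-a", dest: "/"},
		{container: "writer-a", dest: "/workspace", sharedSrc: "workspace"},
		{container: "writer-b", dest: "/"},
		{container: "writer-b", dest: "/workspace", sharedSrc: "workspace"},
	}
}

func main() {
	pod := examplePod()
	saved, restored := ownersAtSave(pod), keysAtRestore(pod)
	fmt.Println("saved owners: ", saved)
	fmt.Println("restored keys:", restored)
	fmt.Println("counts match: ", len(saved) == len(restored))
}
```

In this model the save side yields three owners and the restore side yields four keys, which is the same shape as the savedMFOwners / mfmap pair in the error.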

Reproduce

Create a two-container Pod whose containers share a disk-backed tmpfs overlay via the mount hints dev.gvisor.spec.mount.<name>.share=pod and type=tmpfs (the OCI mount is kept as bind, so the overlay is on disk). Have both containers write to the mount, run runsc checkpoint on the sandbox, then restore it (e.g. with a fresh Pod carrying the restore annotation). The first container resumes; the second fails with the error above. The same overlay mounted by a single container checkpoints and restores cleanly, which points at the per-container duplicate MemoryFile rather than the overlay itself.

Fix

Mirror getSharedMount in configureRestore: track the shared-overlay sources already registered and, for a peer container's shared mount, close the extra filestore FD and skip creating a duplicate MemoryFile, so the restored map holds exactly one entry per shared overlay and matches the saved owners. With that, both containers restore, the shared workspace comes back on a fresh emptyDir, and writes from either container stay coherent after restore.
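A minimal sketch of that deduplication, assuming a simplified model rather than the actual configureRestore code: `restoreMount`, `fakeFD`, and `registerMemoryFiles` are hypothetical names, and the filestore FD is represented by an `io.Closer` so the "close the extra FD" step is observable.

```go
package main

import (
	"fmt"
	"io"
)

// restoreMount is a hypothetical, simplified view of a submount seen during
// restore. It is not a gVisor type.
type restoreMount struct {
	container string
	dest      string
	sharedSrc string    // pod-global source for a shared mount, "" if private
	filestore io.Closer // stand-in for the filestore FD, nil if none
}

// fakeFD records whether it was closed, so the example can be checked.
type fakeFD struct{ closed bool }

func (f *fakeFD) Close() error { f.closed = true; return nil }

// registerMemoryFiles registers one entry per private mount and one entry per
// shared source. For a peer's shared mount it closes the extra filestore FD
// and skips registration, mirroring what the runtime path is described to do.
func registerMemoryFiles(mounts []restoreMount) (map[string]io.Closer, error) {
	mfmap := map[string]io.Closer{}
	seenShared := map[string]bool{}
	for _, m := range mounts {
		if m.filestore == nil {
			continue // nothing to back a private MemoryFile
		}
		if m.sharedSrc != "" {
			if seenShared[m.sharedSrc] {
				if err := m.filestore.Close(); err != nil {
					return nil, fmt.Errorf("closing extra filestore FD for %s:%s: %w",
						m.container, m.dest, err)
				}
				continue
			}
			seenShared[m.sharedSrc] = true
		}
		mfmap[m.container+":"+m.dest] = m.filestore
	}
	return mfmap, nil
}

func main() {
	master, peer := &fakeFD{}, &fakeFD{}
	mfmap, err := registerMemoryFiles([]restoreMount{
		{container: "writer-a", dest: "/", filestore: &fakeFD{}},
		{container: "writer-a", dest: "/workspace", sharedSrc: "workspace", filestore: master},
		{container: "writer-b", dest: "/", filestore: &fakeFD{}},
		{container: "writer-b", dest: "/workspace", sharedSrc: "workspace", filestore: peer},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println("entries:", len(mfmap))
	fmt.Println("peer FD closed:", peer.closed, "master FD closed:", master.closed)
}
```

With the peer skipped, the model's map has three entries for the example Pod, matching the three saved owners in the error message.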

One caveat: keying the restored MemoryFile by {container, mount-destination} makes correctness depend on containers being restored in creation order (fine for the guaranteed unnamed-container case, and for a kubelet restoring in Pod-spec order). A more robust variant keys the shared overlay's MemoryFile on the pod-global mount source so ordering stops mattering.
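The ordering dependence in this caveat can be illustrated with a small sketch. The `sharedMount` type, both key functions, and the "shared:" key prefix are hypothetical and exist only for illustration; the real key format in gVisor may differ.

```go
package main

import "fmt"

// sharedMount is a hypothetical model of one container's view of a
// pod-shared overlay. It is not a gVisor type.
type sharedMount struct {
	container string
	dest      string
	source    string // pod-global mount source
}

// keyByFirstMounter keys the shared MemoryFile by {container, destination}
// of whichever container is restored first, so it depends on restore order.
func keyByFirstMounter(restoreOrder []sharedMount) string {
	if len(restoreOrder) == 0 {
		return ""
	}
	first := restoreOrder[0]
	return first.container + ":" + first.dest
}

// keyBySource keys the shared MemoryFile by the pod-global source, which is
// the same no matter which container is restored first.
func keyBySource(restoreOrder []sharedMount) string {
	if len(restoreOrder) == 0 {
		return ""
	}
	return "shared:" + restoreOrder[0].source
}

func main() {
	a := sharedMount{container: "writer-a", dest: "/workspace", source: "workspace"}
	b := sharedMount{container: "writer-b", dest: "/workspace", source: "workspace"}
	creationOrder := []sharedMount{a, b}
	reversed := []sharedMount{b, a}

	fmt.Println("by first mounter:", keyByFirstMounter(creationOrder), "vs", keyByFirstMounter(reversed))
	fmt.Println("by source:       ", keyBySource(creationOrder), "vs", keyBySource(reversed))
}
```

In this model, a key derived from the first-restored container only matches the saved owner when restore order equals creation order, while a source-derived key is identical for both orders.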

Environment

runsc built from the #13326 branch (Kubernetes pod checkpoint/restore), linux/arm64, containerd v2.x, single-node kind cluster.
