runsc restore: "inconsistent private memory files" for a pod-shared overlay mounted by multiple containers

### Problem

When a tmpfs mount is shared across multiple containers in a Pod (mount hint `share=pod` or `share=container`) and the whole sandbox is checkpointed and then restored, restore fails for every container after the first. The first container restores and runs; the second fails at start:

```
OCI runtime restore failed: starting container: starting sub-container
[/counter --tick=1s --state-file=/workspace/b.state]:
inconsistent private memory files on restore:
savedMFOwners = [writer-a:/  writer-a:/workspace  writer-b:/],
mfmap = map[writer-a:/ ... writer-a:/workspace ... writer-b:/ ... writer-b:/workspace ...]
```

`savedMFOwners` has one entry for the shared `/workspace` overlay (owned by the first container), but `mfmap` on restore has one per mounting container, so the counts disagree and `loadPrivateMemoryFiles` (`pkg/sentry/kernel/kernel_restore.go`) aborts.
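To make the mismatch concrete, here is a minimal Go sketch of a count-based consistency check. It is a simplified model, not the code in `kernel_restore.go`: the function name `checkOwners` and the string-keyed map are assumptions for illustration, and the real check may compare more than the counts.

```go
package main

import (
	"fmt"
	"sort"
)

// checkOwners is a hypothetical stand-in for the consistency check described
// above. It only compares how many owners were saved with how many private
// MemoryFiles the restore path registered.
func checkOwners(savedMFOwners []string, mfmap map[string]struct{}) error {
	if len(savedMFOwners) == len(mfmap) {
		return nil
	}
	// Sort the restored keys so the error message is deterministic.
	restored := make([]string, 0, len(mfmap))
	for owner := range mfmap {
		restored = append(restored, owner)
	}
	sort.Strings(restored)
	return fmt.Errorf("inconsistent private memory files on restore: savedMFOwners = %v, mfmap = %v",
		savedMFOwners, restored)
}
```

Fed the three saved owners and the four restored keys from the log above, this model returns an error; dropping the `writer-b:/workspace` key makes it pass.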

### Root cause

A pod-shared overlay is backed by a single MemoryFile. At runtime the first container mounts the master via `getSharedMount` (`runsc/boot/vfs.go`); peer containers reuse that master and their extra filestore FD is closed, so only one private MemoryFile exists and `PrepareSave` registers exactly one owner.

On restore, `configureRestore` (`runsc/boot/vfs.go`) doesn't account for the sharing: it creates and registers a private MemoryFile for *every* container's submount that carries a filestore FD. So a Pod with the overlay mounted in two containers restores with two MemoryFile entries against one saved owner, and the counts mismatch.
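The asymmetry between the two paths can be modeled in a few lines of Go. This is a sketch under assumptions, not the gVisor code: the `mount` type, its fields, and both function names are hypothetical, and every mount in the model is assumed to carry a filestore FD.

```go
package main

// mount is a hypothetical, simplified view of one container submount.
type mount struct {
	container string // e.g. "writer-a"
	dest      string // mount destination inside the container
	source    string // pod-global source; identical for peers sharing an overlay
	shared    bool   // true when the mount hint marks the mount as pod-shared
}

// ownersAtSave models the runtime behaviour: a shared overlay gets a single
// private MemoryFile, owned by the first container that mounts it.
func ownersAtSave(mounts []mount) []string {
	seen := make(map[string]bool)
	var owners []string
	for _, m := range mounts {
		if m.shared {
			if seen[m.source] {
				continue // peer container reuses the master mount
			}
			seen[m.source] = true
		}
		owners = append(owners, m.container+":"+m.dest)
	}
	return owners
}

// entriesAtRestore models the restore behaviour described above: one
// MemoryFile per submount, with no check for sharing.
func entriesAtRestore(mounts []mount) []string {
	var entries []string
	for _, m := range mounts {
		entries = append(entries, m.container+":"+m.dest)
	}
	return entries
}
```

For two containers that each have a root mount plus the shared `/workspace`, `ownersAtSave` yields three owners and `entriesAtRestore` yields four entries, which is the mismatch in the error message.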

### Reproduce

Create a two-container Pod whose containers share a disk-backed tmpfs overlay via the mount hints `dev.gvisor.spec.mount.<name>.share=pod` and `type=tmpfs` (the OCI mount is kept as `bind`, so the overlay is on disk). Have both containers write to the mount, run `runsc checkpoint` on the sandbox, then restore it (e.g. with a fresh Pod carrying the restore annotation). The first container resumes; the second fails with the error above. The same overlay mounted by a *single* container checkpoints and restores cleanly, which points at the per-container duplicate rather than the overlay itself.

### Fix

Mirror `getSharedMount` in `configureRestore`: track the shared-overlay sources already registered and, for a peer container's shared mount, close the extra filestore FD and skip creating a duplicate MemoryFile, so the restored map holds exactly one entry per shared overlay and matches the saved owners. With that, both containers restore, the shared workspace comes back on a fresh emptyDir, and writes from either container stay coherent after restore.
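A rough Go sketch of that deduplication, assuming a simplified data model rather than the real `configureRestore` signature: `restoreMount`, `registerForRestore`, and `fakeFD` are hypothetical names, and `io.Closer` stands in for the filestore FD.

```go
package main

import "io"

// restoreMount is a hypothetical view of one submount seen during restore.
type restoreMount struct {
	owner     string    // "container:destination", as in the error message
	source    string    // pod-global source of the mount
	shared    bool      // true for a pod-shared overlay
	filestore io.Closer // stand-in for the filestore FD; nil if there is none
}

// fakeFD is a test double that records whether it was closed.
type fakeFD struct{ closed bool }

func (f *fakeFD) Close() error {
	f.closed = true
	return nil
}

// registerForRestore registers one entry per private MemoryFile. For a shared
// overlay, only the first mount seen is registered; peers have their extra
// filestore FD closed and are skipped.
func registerForRestore(mounts []restoreMount) (map[string]io.Closer, error) {
	mfmap := make(map[string]io.Closer)
	registered := make(map[string]bool) // shared sources already handled
	for _, m := range mounts {
		if m.filestore == nil {
			continue // nothing to register for this mount
		}
		if m.shared {
			if registered[m.source] {
				if err := m.filestore.Close(); err != nil {
					return nil, err
				}
				continue
			}
			registered[m.source] = true
		}
		mfmap[m.owner] = m.filestore
	}
	return mfmap, nil
}
```

With two containers mounting the same shared source, the returned map holds a single entry keyed by the first container, and the second container's FD ends up closed.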

One caveat: keying the restored MemoryFile by `{container, mount-destination}` makes correctness depend on containers being restored in creation order (fine for the guaranteed unnamed-container case, and for a kubelet restoring in Pod-spec order). A more robust variant keys the shared overlay's MemoryFile on the pod-global mount source so ordering stops mattering.
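The order-independent variant could look roughly like the following Go sketch. The key format, the `podMount` type, and the function names are assumptions for illustration only; the point is that a shared overlay's key no longer mentions any container.

```go
package main

import "sort"

// podMount is a hypothetical, simplified view of one container submount.
type podMount struct {
	container string
	dest      string
	source    string // pod-global mount source
	shared    bool
}

// memoryFileKey keys a shared overlay on its pod-global source and every
// other mount on {container, destination}.
func memoryFileKey(m podMount) string {
	if m.shared {
		return "pod:" + m.source
	}
	return m.container + ":" + m.dest
}

// keySet returns the distinct MemoryFile keys for a set of mounts, sorted so
// that two sets can be compared directly.
func keySet(mounts []podMount) []string {
	seen := make(map[string]bool)
	var keys []string
	for _, m := range mounts {
		k := memoryFileKey(m)
		if !seen[k] {
			seen[k] = true
			keys = append(keys, k)
		}
	}
	sort.Strings(keys)
	return keys
}
```

Because the shared key is the same whichever container is processed first, `keySet` returns the same result for any restore order.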

### Environment

runsc built from the #13326 branch (Kubernetes pod checkpoint/restore), linux/arm64, containerd v2.x, single-node kind cluster.


Issue #13608