
daemon: retry secrets dir remount on transient EBUSY #52235

Merged

thaJeztah merged 1 commit into moby:master from firecow:fix-secrets-remount-ebusy on Apr 16, 2026


@firecow (Contributor) commented Mar 26, 2026

- What I did

Added retry logic to `remountSecretDir` so that transient `EBUSY` errors are retried when remounting the secrets/configs tmpfs as read-only.

- How I did it

Wrapped the `mount.Mount("tmpfs", dir, "tmpfs", "remount,ro,...")` call in a retry loop (up to 5 attempts, 100ms apart). Only `EBUSY` triggers a retry; other errors fail immediately. A debug-level log is emitted on each retry for observability.

This follows the existing retry-on-`EBUSY` pattern from `EnsureRemoveAll` in `daemon/internal/containerfs/rm.go`.

- How to verify it

  1. Set up a Docker Swarm node running many services that use configs or secrets
  2. Deploy a service with `update_config: { order: start-first }` and Docker configs
  3. Trigger repeated service updates; the `EBUSY` race during the secrets remount should now be retried instead of failing the container start

The issue is a race condition that occurs intermittently on busy hosts, so verification requires a loaded environment. See #48783 for reproduction details.

- Description

The race condition in remountSecretDir has existed for a long time, but became noticeably worse after the containerd v1.7 → v2.x bump (Docker 28.x/29.x). The major containerd upgrade appears to have changed container lifecycle timing enough to widen the window where the tmpfs mount is held during secret/config setup, making the EBUSY collision significantly more likely on busy hosts. We observe this consistently on Docker 29.3.0 (containerd v2.2.1) but not on Docker 27.5.0 (containerd v1.7.25) with comparable workloads (~100 containers).

- Human readable description for the release notes

Fix intermittent container start failures (`EBUSY` on secrets/configs remount) on busy Swarm nodes by retrying the read-only remount.

Fixes #48783

Signed-off-by: Mads Jon Nielsen <madsjon@gmail.com>

github-actions bot added the area/daemon (Core Engine) label on Mar 26, 2026
thaJeztah added this to the 29.4.1 milestone on Apr 8, 2026
@vvoland (Contributor) left a comment

Thanks!
Looks good, some input on the timing though

Comment thread on daemon/container_operations_unix.go (outdated)

@vvoland (Contributor) commented Apr 10, 2026

Thanks @firecow!

Looks good, one thing remaining though - could you please rebase your branch on top of master and squash your commit into one?

On busy Docker hosts, the read-only remount of the secrets/configs
tmpfs can fail with EBUSY when the mount is momentarily held by
another process immediately after writing secret/config files.

This is particularly common with Swarm services using start-first
update order, where old and new containers coexist on the same node.

Retry the remount up to 5 times with progressive backoff
(10ms + retry * 25ms). Only EBUSY triggers a retry; other errors
fail immediately.

Fixes moby#48783

Signed-off-by: Mads Jon Nielsen <madsjon@gmail.com>
firecow force-pushed the fix-secrets-remount-ebusy branch from 5677a8d to 3b22c50 on April 12, 2026 18:06

@firecow (Contributor, Author) commented Apr 12, 2026

> Thanks @firecow!
>
> Looks good, one thing remaining though - could you please rebase your branch on top of master and squash your commit into one?

Done

@vvoland (Contributor) left a comment

LGTM, thanks!

@thaJeztah (Member) left a comment

LGTM

@thaJeztah (Member) commented:

Windows failures are unrelated.

thaJeztah merged commit 48924f0 into moby:master on Apr 16, 2026
339 of 350 checks passed

Successfully merging this pull request may close these issues:

Create service fail / starting container failed: unable to remount [secrets] dir as readonly