
Docker compose won't attach to network when monitoring is enabled during install #16

Merged
shayanb merged 3 commits into logosnode:main from strudelPi:strudelpi/fix-err-handling on May 27, 2026

@strudelPi
Contributor

The problem is this code. If the user enables monitoring during installation, the monitoring stack is brought up before the node stack. The network therefore doesn't exist yet, and this code creates it manually.

But when docker-compose later tries to bring up the node, it won't attach to that network, because the network isn't compose-managed (it lacks the compose labels). Compose refuses it and the bring-up fails. The second commit lets compose always decide whether to create or attach to the network, so the network is always compose-managed. (Another option would be to declare the network as external in both stacks, but then it would have to be created in several different places in the code, which I like less.)
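For reference, a minimal sketch of how the two kinds of network can be told apart, assuming the Docker CLI is available. Compose labels the networks it creates, among others with `com.docker.compose.network`; a network from a bare `docker network create` has no such label, which is why compose refuses to adopt it. `network_is_compose_managed` is a hypothetical helper, not part of the PR.

```shell
#!/bin/sh
# Hypothetical helper (not from the PR).
# Returns 0 if network $1 carries compose's network label,
# 1 if it exists without the label, 2 if it cannot be inspected
# (e.g. it does not exist).
network_is_compose_managed() {
    local label
    # `index` on the Labels map yields "" when the label is absent.
    label=$(docker network inspect \
        --format '{{index .Labels "com.docker.compose.network"}}' \
        "$1" 2>/dev/null) || return 2
    [ -n "$label" ]
}
```

On an install that hit the race described above, `network_is_compose_managed logosnode-net` would return 1: the network exists but compose did not create it.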

The first commit only makes the errors that can occur surface correctly. Originally, when I tried this on main, all I got was the 120 s timeout for bringing up docker compose, with no specific error:

  • the health check doesn't account for the container not being alive at all (which it isn't, because docker-compose fails on the network problem);
  • the network error isn't surfaced either, because docker_up() returns the status of its last call (the legacy cleanup) and ignores earlier errors.
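To illustrate the second bullet: a shell function's exit status is that of the last command it runs, so a cleanup call placed after the compose bring-up swallows compose's failure. A minimal sketch, assuming the installer is a POSIX shell script; `compose_up`, `cleanup_legacy_network`, and the `up_*` functions are hypothetical stand-ins, not the PR's code.

```shell
#!/bin/sh
# Hypothetical stand-ins (not the PR's functions): compose_up simulates
# `docker compose up` failing on the network label, and
# cleanup_legacy_network simulates a cleanup step that succeeds.
compose_up() { echo "compose up failed" >&2; return 1; }
cleanup_legacy_network() { return 0; }

# Masked: a shell function returns the status of its last command,
# so the successful cleanup hides the compose failure.
up_masked() {
    compose_up
    cleanup_legacy_network
}

# Reordered (the approach the first commit describes): cleanup runs
# first, so the function's status is compose's status.
up_fixed() {
    cleanup_legacy_network
    compose_up
}

# Alternative: keep the order but capture compose's status explicitly.
up_captured() {
    local rc=0
    compose_up || rc=$?
    cleanup_legacy_network
    return "$rc"
}
```

Here `up_masked` reports success even though `compose_up` failed, while `up_fixed` and `up_captured` both return 1.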

Tested this by installing both with the monitoring stack enabled during install and without it (enabling the monitoring stack later).

strudelPi and others added 3 commits May 27, 2026 16:04
  - docker_health_wait() - accounts for the container exiting early
  - docker_up() - docker_cleanup_legacy_network() no longer hides the error
    returned by the docker up command (its call was just moved above the
    docker-compose up)
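The first bullet can be sketched as follows. This is a minimal illustration, not the PR's `docker_health_wait()`: it assumes the container defines a healthcheck and polls `docker inspect` (the thread suggests the real function looks at `docker ps`). `wait_healthy` and its return codes (0 healthy, 1 timeout, 2 container exited or missing) are assumptions chosen to echo the rc=1/rc=2 split mentioned later in the thread.

```shell
#!/bin/sh
# Hypothetical health wait that also notices a container that has died.
# Returns 0 once healthy, 1 after $2 seconds (default 120), and 2 as soon
# as the container is exited/dead or cannot be inspected at all.
# Assumes the container defines a HEALTHCHECK.
wait_healthy() {
    local name limit elapsed state
    name=$1
    limit=${2:-120}
    elapsed=0
    while [ "$elapsed" -lt "$limit" ]; do
        # A container that can't be inspected was never created (e.g. compose
        # failed on the network), so there is nothing to wait for.
        state=$(docker inspect --format \
            '{{.State.Status}} {{if .State.Health}}{{.State.Health.Status}}{{end}}' \
            "$name" 2>/dev/null) || return 2
        case $state in
            "running healthy") return 0 ;;
            exited*|dead*) return 2 ;;   # died early: don't wait out the timeout
        esac
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}
```

A container crash-looping under a restart policy reports `restarting` here, so this sketch also falls through to the timeout (1) rather than returning 2 early.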
…ually and docker-compose for node later won't attach (since it's missing compose labels)
…vive the update

Installs that hit the pre-fix network race created a 'logosnode-net' via a
bare `docker network create` (no compose labels), with the monitoring
containers already attached. Once both stacks declare the network with
`name: logosnode-net`, compose refuses to adopt that unlabeled network
("found but has incorrect label") and `start` breaks — and because the
network isn't orphaned, the legacy-network cleanup can't remove it.

Add docker_repair_unmanaged_network(): if a 'logosnode-net' exists without
compose's label, bring both stacks down to detach, drop it, and let the
normal bring-up recreate it labeled. Wired into both docker_up() and
monitoring_up() so it runs whichever stack starts first (start, install,
update, monitor start). No-op on healthy/compose-managed installs.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@shayanb
Member

shayanb commented May 27, 2026

Pushed one follow-up commit (af1d400) to cover existing installs that already broke on the network race before this fix landed.

The forward path here is solid: once both stacks use `name: logosnode-net`, whichever comes up first creates it labeled and the other adopts it. But a box that already failed on current main has a `logosnode-net` created by the old `docker network create` (no compose labels), with the monitoring containers attached to it. That network isn't orphaned, so it can't simply be removed in place, and the new labeled `name:` declaration would keep hitting `found but has incorrect label` after they update.

`docker_repair_unmanaged_network()`: if a `logosnode-net` exists without compose's `com.docker.compose.network` label, bring both stacks down to detach, drop it, and let the normal bring-up recreate it labeled. It's wired into both `docker_up()` and `monitoring_up()` so it runs whichever stack starts first (start, install, update, monitor start), and it's a no-op on healthy/compose-managed installs (a single `network inspect`). Foreign workloads holding the name aren't force-disconnected; `network rm` just fails, and the now-surfaced compose error explains it.
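A minimal sketch of the repair flow described in this comment, assuming the Compose v2 CLI (`docker compose`). `repair_unmanaged_network`, `NODE_COMPOSE`, and `MON_COMPOSE` (and their default paths) are hypothetical; the real `docker_repair_unmanaged_network()` may differ in names, flags, and error handling.

```shell
#!/bin/sh
# Hypothetical names and paths; adjust to the real installer layout.
NET=logosnode-net
NODE_COMPOSE=${NODE_COMPOSE:-./docker-compose.yml}
MON_COMPOSE=${MON_COMPOSE:-./monitoring/docker-compose.yml}

repair_unmanaged_network() {
    local label
    # Single inspect: if the network doesn't exist, there is nothing to repair.
    label=$(docker network inspect \
        --format '{{index .Labels "com.docker.compose.network"}}' \
        "$NET" 2>/dev/null) || return 0
    # Compose-managed network: healthy install, no-op.
    if [ -n "$label" ]; then
        return 0
    fi

    # Unlabeled network: take both stacks down to detach their containers...
    docker compose -f "$MON_COMPOSE" down
    docker compose -f "$NODE_COMPOSE" down
    # ...then drop the network so the next `up` recreates it with labels.
    # If a foreign container still holds it, `rm` fails and that non-zero
    # status is returned instead of force-disconnecting anything.
    if docker network inspect "$NET" >/dev/null 2>&1; then
        docker network rm "$NET"
    fi
}
```

In the flow described above, the equivalent check runs at the top of both `docker_up()` and `monitoring_up()`, before either stack's own bring-up, so the repair happens whichever stack starts first.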

I left the crash-loop detection alone (a `restart: unless-stopped` container shows as Restarting in `docker ps`, so it lands on the rc=1 timeout path rather than rc=2); that's diagnostics polish, not a breakage, so it can wait. LGTM to merge once this commit looks right to you.

shayanb merged commit 098dc41 into logosnode:main on May 27, 2026