fix: docker socket proxy + release infrastructure#132
Conversation
Add TODO comments to identify areas for future refactoring: - Consider replacing direct service dependencies with ports - Evaluate need for lazy imports in command handlers - Use SRN class for building SRN strings instead of string formatting - Consider using convention ID instead of SRN for better semantics - Evaluate using Pydantic objects for better type safety
…wn validators The published OSA image runs as `appuser`, which can't read the host's /var/run/docker.sock (root:docker 0660). Direct socket mount fails with EACCES on every ingester/validator spawn. Fix is purely deploy-layer: add a tecnativa/docker-socket-proxy sidecar that holds the socket bind mount and exposes a locked-down Docker API (CONTAINERS, IMAGES, POST only) over TCP. The OSA server reaches it via DOCKER_HOST, no longer mounts the socket, no longer needs filesystem access to the daemon. The proxy is on an internal-only Docker network with no external egress. The compose v2 environment-map merge means dev (-f docker-compose.dev.yml) inherits DOCKER_HOST and the proxy dependency without further changes. Verified end-to-end against the cultivarium pilot using the published sha-a747fc1 image: ingester containers spawn cleanly, "Cannot connect to Docker Engine" log line is gone. Closes #131
|
Greptile SummaryThis PR delivers two bundled changes: a Docker socket proxy sidecar that fixes an EACCES permission failure for the
Confidence Score: 5/5Safe to merge. The socket-proxy wiring is correct, the healthcheck + service_healthy dependency closes the cold-start race, and the OCI label plumbing is straightforward. The compose changes are well-scoped: the proxy is confined to an internal network, the server's direct socket mount is removed, and startup ordering is enforced via a TCP healthcheck. The Dockerfile ARG/LABEL additions and CI build-arg plumbing are mechanical and low-risk. The only outstanding behavioural question — whether container cleanup DELETE calls reach the daemon — was raised in a prior review thread and is a known, logged degradation rather than a silent failure after this PR. No files require special attention.
|
| Filename | Overview |
|---|---|
| deploy/docker-compose.yml | Adds docker-socket-proxy sidecar with healthcheck and service_healthy dependency; server loses direct socket mount and gains DOCKER_HOST. Network topology and internal network restriction look correct. |
| .github/workflows/image.yml | Passes OSA_VERSION (release tag or SHA) and OSA_REVISION (full SHA) as build-args; expression correctly falls back to github.sha for non-release triggers. |
| server/Dockerfile | Adds ARG/LABEL block in the runtime stage for six OCI image labels; variable substitution from ARG in LABEL is valid Docker syntax. |
| server/osa/infrastructure/oci/runner.py | Replaces silent exception swallow on container.delete() with a structured warning log; no logic change. |
| server/osa/infrastructure/oci/ingester_runner.py | Same silent-exception-to-warning-log change as runner.py for the ingester cleanup path. |
| server/pyproject.toml | Version bumped down from 0.1.0 to 0.0.1 to align with first semver release; uv.lock updated to match. |
Sequence Diagram
sequenceDiagram
participant H as Docker Host
participant P as docker-socket-proxy<br/>(internal network)
participant S as osa-server<br/>(default + docker-proxy nets)
participant D as Docker Daemon<br/>(/var/run/docker.sock)
Note over H,D: Startup sequence
H->>P: start container
P->>D: connect via /var/run/docker.sock:ro
P-->>H: healthcheck: GET /_ping → 200
H->>S: start container (service_healthy satisfied)
Note over S,D: Ingest / hook container lifecycle
S->>P: POST /containers/create (tcp://docker-socket-proxy:2375)
P->>D: forward POST /containers/create
D-->>P: 201 container ID
P-->>S: 201 container ID
S->>P: "POST /containers/{id}/start"
P->>D: "forward POST /containers/{id}/start"
S->>P: "DELETE /containers/{id}?force=true"
P->>D: forward DELETE (requires DELETE:1 env var)
Reviews (5): Last reviewed commit: "fix: log container cleanup failures inst..." | Re-trigger Greptile
…er start Greptile review flagged a race: depends_on with service_started only waits for the container process, not for haproxy to be listening on 2375. A cold boot with a queued ingest could hit the proxy before it's ready. Add a wget-based healthcheck against the proxy's /_ping endpoint (PING is on by default in the proxy's permission model) and gate the server on service_healthy. Verified live: server now blocks until proxy reports healthy in the compose boot sequence. wget is available in the haproxy:alpine base via busybox.
|
Want your agent to iterate on Greptile's feedback? Try greploops. |
Sets up the path for cutting GitHub Releases that produce semver-tagged images at ghcr.io/opensciencearchive/osa:vX.Y.Z. The release-publish trigger in image.yml already existed; this adds the surrounding polish so the resulting images are auditable and the process is documented. Changes: - server/Dockerfile: add OSA_VERSION + OSA_REVISION build args plus six OCI image labels (title, description, source, licenses, version, revision). Verified locally: docker inspect shows all labels populated. - .github/workflows/image.yml: pass OSA_VERSION (release tag or commit SHA) and OSA_REVISION (full SHA) as build args to populate the labels. - server/pyproject.toml: bump 0.1.0 -> 0.0.1 to match the chosen first release version. No registry tag exists for 0.1.0, so safe to align. - RELEASING.md: document the flow — release-from-green-main, gh release create vX.Y.Z, image lands at ghcr within ~5min. No floating tags (:latest, :vX, :vX.Y) — OSA is consumed via compose files with pinned OSA_IMAGE_TAG, not ad-hoc docker pull. First release (v0.0.1) will be cut after this PR merges.
|
@greptile |
| environment: | ||
| CONTAINERS: 1 | ||
| IMAGES: 1 | ||
| POST: 1 |
There was a problem hiding this comment.
Missing
DELETE permission causes silent container leaks on every hook run. In server/osa/infrastructure/oci/runner.py (line 228), _run_container calls await container.delete(force=True) in its finally block to clean up each hook container. That call maps to DELETE /containers/{id}?force=true — a DELETE HTTP method that the proxy blocks by default (only GET/HEAD are allowed without explicit opt-in). The except Exception: pass around the call silently swallows the resulting 403, so every completed hook container accumulates on the host until Docker runs out of resources.
| environment: | |
| CONTAINERS: 1 | |
| IMAGES: 1 | |
| POST: 1 | |
| environment: | |
| CONTAINERS: 1 | |
| IMAGES: 1 | |
| POST: 1 | |
| DELETE: 1 |
There was a problem hiding this comment.
I think POST meant "any non-GET" method, no? cc @greptile
There was a problem hiding this comment.
You're right. POST: 1 in the haproxy config maps to this rule:
http-request deny unless METH_GET || { env(POST) -m bool }
It's a master switch for all non-GET methods — setting it to 1 allows GET, POST, PUT, DELETE, PATCH, etc. through the initial filter. DELETE /containers/{id}?force=true would pass with POST: 1 set.
My original comment was wrong. No container leak risk here — withdrawing it.
Tip: You can customize Greptile's behavior for this repo with .greptile/rules.md and .greptile/config.json.
The OCI runners had a bare `except Exception: pass` around the finally-block container.delete(force=True) cleanup. Any failure during cleanup was invisible — including, hypothetically, a misconfigured Docker socket proxy that blocked DELETE requests, which would leak hook or ingester containers indefinitely with no log trace. Today's proxy config (POST=1 unlocks all non-GET methods including DELETE) works correctly, but this defensive logging means any future misconfiguration surfaces immediately rather than as disk pressure. Verified empirically that the proxy currently passes DELETE through: `DELETE /containers/nonexistent` returns 404 (from Docker, "no such container") rather than 403 (from the proxy itself). Also syncs uv.lock to the 0.0.1 version bump from the previous commit. Left runner.py:183 alone — that one's a deliberate best-effort log enrichment before raising OOMError; empty-string fallback is the intended behaviour there.
Closes #131.
This PR bundles the socket-proxy fix with the first cut of release infrastructure, so the resulting
v0.0.1semver-tagged image carries the proxy fix from day one.Part 1 — Docker socket proxy (the original fix)
The published OSA image runs as
appuser, which can't read the host's/var/run/docker.sock(root:docker 0660). Direct socket mount fails with EACCES on every ingester/validator spawn. Fix is purely deploy-layer: add atecnativa/docker-socket-proxysidecar (v0.4.2) that holds the socket bind mount and exposes a locked-down Docker API (CONTAINERS, IMAGES, POST only) over TCP. The OSA server reaches it viaDOCKER_HOST, no longer mounts the socket, no longer needs filesystem access to the daemon.The proxy lives on an internal-only Docker network with no external egress. The compose v2 environment-map merge means dev (
-f docker-compose.dev.yml) inheritsDOCKER_HOSTand the proxy dependency without further changes.The server's
depends_onwaits forservice_healthyon the proxy (added on Greptile's review), so a fast worker spawning an ingest at cold-start can't race the proxy. Verified live — server boot blocks ondocker-socket-proxy Healthy.Verified end-to-end against the pilot using the published
sha-a747fc1image: ingester containers spawn cleanly, "Cannot connect to Docker Engine" log line is gone.Part 2 — Release infrastructure
Sets up the path for cutting GitHub Releases that produce semver-tagged images at
ghcr.io/opensciencearchive/osa:vX.Y.Z. Therelease: publishedtrigger inimage.ymlalready existed; this adds the surrounding polish so the resulting images are auditable and the process is documented.server/DockerfilegainsOSA_VERSION+OSA_REVISIONbuild args plus six OCI image labels (title, description, source, licenses, version, revision). Verified locally:docker inspectshows all labels populated..github/workflows/image.ymlpassesOSA_VERSION(release tag or commit SHA) andOSA_REVISION(full SHA) as build args.server/pyproject.tomlbumps0.1.0→0.0.1to match the chosen first release version. No registry tag exists for0.1.0, so safe to align.RELEASING.mddocuments the flow.No floating
:latest/:vX/:vX.Ytags — OSA is consumed via compose files with pinnedOSA_IMAGE_TAG, not ad-hocdocker pull.After merge
Cut
v0.0.1:Then bump pilot
.envtoOSA_IMAGE_TAG=v0.0.1and verify ingest end-to-end against the semver-pinned image.Verification
docker compose -f deploy/docker-compose.yml configand the same with-f docker-compose.dev.ymloverlaid: server hasDOCKER_HOST, no socket mount, both files inherit the proxy via env-map merge — no dev-side changes needed.ruff checkclean,pytest tests/unit1072 passed.docker build --build-arg OSA_VERSION=v0.0.1-test ...: all six OCI labels populated correctly underdocker inspect.Notes for reviewers
baf5fe4— TODO comments on four server files) authored before I started. Happy to drop it via interactive rebase if you'd rather land cleanly; let me know.Follow-ups (separate issues)
osa init/osa startCLI flow — tracked at CLI:osa initandosa startfor one-command local OSA setup osa-py#5.target: buildershortcut (the reason this bug was invisible in dev) — out of scope here, worth a separate issue.CLAUDE.mdopens with "Open Scientific Archive" — should be "Open Science Archive"; one-line fix worth a separate small PR.