fix: docker socket proxy + release infrastructure #132

Merged: rorybyrne merged 5 commits into main from rory-fix-docker-socket-proxy on Jun 2, 2026

@rorybyrne rorybyrne commented Jun 2, 2026

Contributor

Closes #131.

This PR bundles the socket-proxy fix with the first cut of release infrastructure, so the resulting v0.0.1 semver-tagged image carries the proxy fix from day one.

Part 1 — Docker socket proxy (the original fix)

The published OSA image runs as appuser, which can't read the host's /var/run/docker.sock (root:docker 0660). Direct socket mount fails with EACCES on every ingester/validator spawn. Fix is purely deploy-layer: add a tecnativa/docker-socket-proxy sidecar (v0.4.2) that holds the socket bind mount and exposes a locked-down Docker API (CONTAINERS, IMAGES, POST only) over TCP. The OSA server reaches it via DOCKER_HOST, no longer mounts the socket, no longer needs filesystem access to the daemon.
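
As a rough illustration of the server-side contract, the sketch below parses a DOCKER_HOST value and rejects anything that is not a TCP endpoint. The helper name `docker_endpoint` and the port fallback are hypothetical, introduced only for this example; this is not the OSA implementation, which is not shown in this PR.

```python
from urllib.parse import urlsplit


def docker_endpoint(docker_host: str) -> tuple[str, int]:
    """Return (host, port) for a tcp:// DOCKER_HOST value.

    Hypothetical helper: rejects unix:// sockets, since the point of the
    proxy setup is that the server no longer touches the socket file.
    """
    parts = urlsplit(docker_host)
    if parts.scheme != "tcp" or not parts.hostname:
        raise ValueError(f"expected tcp://host:port, got {docker_host!r}")
    # 2375 is the conventional unencrypted Docker API port.
    return parts.hostname, parts.port or 2375
```

With the value used in this PR, `docker_endpoint("tcp://docker-socket-proxy:2375")` yields the proxy's service name and port, while a `unix:///var/run/docker.sock` value raises `ValueError`.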

The proxy lives on an internal-only Docker network with no external egress. The compose v2 environment-map merge means dev (-f docker-compose.dev.yml) inherits DOCKER_HOST and the proxy dependency without further changes.

The server's depends_on waits for service_healthy on the proxy (added on Greptile's review), so a fast worker spawning an ingest at cold-start can't race the proxy. Verified live — server boot blocks on docker-socket-proxy Healthy.
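
Compose handles the service_healthy gate natively, so no application code is needed for it. Purely as a conceptual model of what that gate does, here is a minimal polling sketch. The function name, the `probe` callable and the retry defaults are assumptions for illustration, not values taken from the compose file.

```python
import time
from typing import Callable


def wait_until_healthy(
    probe: Callable[[], bool],
    retries: int = 10,
    interval: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll `probe` until it succeeds or `retries` attempts are used up."""
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            sleep(interval)  # back off before the next attempt
    return False
```

In a real check, `probe` would be something like an HTTP request to the proxy's /_ping endpoint. The dependent service starts only once the function returns True.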

Verified end-to-end against the pilot using the published sha-a747fc1 image: ingester containers spawn cleanly, "Cannot connect to Docker Engine" log line is gone.

Part 2 — Release infrastructure

Sets up the path for cutting GitHub Releases that produce semver-tagged images at ghcr.io/opensciencearchive/osa:vX.Y.Z. The release: published trigger in image.yml already existed; this adds the surrounding polish so the resulting images are auditable and the process is documented.

  • server/Dockerfile gains OSA_VERSION + OSA_REVISION build args plus six OCI image labels (title, description, source, licenses, version, revision). Verified locally: docker inspect shows all labels populated.
  • .github/workflows/image.yml passes OSA_VERSION (release tag or commit SHA) and OSA_REVISION (full SHA) as build args.
  • server/pyproject.toml changes the version from 0.1.0 to 0.0.1 to match the chosen first release version. No registry tag exists for 0.1.0, so it is safe to align.
  • RELEASING.md documents the flow.
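
The "release tag or commit SHA" fallback described above can be modelled in a few lines. This is a sketch under assumptions: the function names are hypothetical, and the `org.opencontainers.image.*` keys are the standard OCI annotation names, which the Dockerfile is assumed (not confirmed here) to use for its labels.

```python
from typing import Optional

OCI_PREFIX = "org.opencontainers.image"


def build_args(release_tag: Optional[str], commit_sha: str) -> dict[str, str]:
    """Mirror the `release tag or commit SHA` fallback for OSA_VERSION."""
    return {
        # An empty or missing tag (non-release trigger) falls back to the SHA.
        "OSA_VERSION": release_tag or commit_sha,
        "OSA_REVISION": commit_sha,
    }


def dynamic_labels(args: dict[str, str]) -> dict[str, str]:
    """Map the two build args onto the OCI version/revision label keys."""
    return {
        f"{OCI_PREFIX}.version": args["OSA_VERSION"],
        f"{OCI_PREFIX}.revision": args["OSA_REVISION"],
    }
```

For a release build the version label carries the tag (for example v0.0.1). For a push to main, both labels carry the commit SHA.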

No floating :latest / :vX / :vX.Y tags — OSA is consumed via compose files with pinned OSA_IMAGE_TAG, not ad-hoc docker pull.

After merge

Cut v0.0.1:

gh release create v0.0.1 --target main --title "v0.0.1" \
  --notes "First tagged release. Includes the Docker socket proxy fix (#131) and release infrastructure setup."

Then bump pilot .env to OSA_IMAGE_TAG=v0.0.1 and verify ingest end-to-end against the semver-pinned image.

Verification

  • docker compose -f deploy/docker-compose.yml config and the same with -f docker-compose.dev.yml overlaid: server has DOCKER_HOST, no socket mount, both files inherit the proxy via env-map merge — no dev-side changes needed.
  • ruff check clean, pytest tests/unit 1072 passed.
  • Local docker build --build-arg OSA_VERSION=v0.0.1-test ...: all six OCI labels populated correctly under docker inspect.
  • End-to-end smoke test against published image via pilot (see Part 1 above).
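
The first verification bullet could be automated along these lines. The sketch assumes the merged compose config has already been parsed into a dict (for instance from JSON output of docker compose config). The service name `server` and the function name are assumptions for illustration.

```python
SOCKET_PATH = "/var/run/docker.sock"


def check_server(compose: dict, server: str = "server",
                 proxy: str = "docker-socket-proxy") -> list[str]:
    """Return a list of problems found in a parsed compose config (empty if none)."""
    problems = []
    svc = compose["services"][server]

    env = svc.get("environment", {})
    if isinstance(env, list):  # list form: ["KEY=value", ...]
        env = dict(item.split("=", 1) for item in env if "=" in item)
    if not str(env.get("DOCKER_HOST", "")).startswith("tcp://"):
        problems.append("server has no tcp:// DOCKER_HOST")

    for vol in svc.get("volumes", []):
        # Volumes may be short-form strings or long-form mappings.
        source = vol.split(":")[0] if isinstance(vol, str) else vol.get("source", "")
        if source == SOCKET_PATH:
            problems.append("server still mounts the Docker socket")

    dep = svc.get("depends_on", {})
    condition = dep.get(proxy, {}).get("condition") if isinstance(dep, dict) else None
    if condition != "service_healthy":
        problems.append("server does not wait for a healthy proxy")
    return problems
```

An empty result means the server has DOCKER_HOST set, no socket mount, and a service_healthy dependency on the proxy.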

Notes for reviewers

  • Branch also contains a pre-existing unrelated commit (baf5fe4 — TODO comments on four server files) authored before I started. Happy to drop it via interactive rebase if you'd rather land cleanly; let me know.

Follow-ups (separate issues)

  • Productize as osa init / osa start CLI flow, tracked at osa-py#5 ("CLI: osa init and osa start for one-command local OSA setup").
  • Dev compose's target: builder shortcut (the reason this bug was invisible in dev) — out of scope here, worth a separate issue.
  • CLAUDE.md opens with "Open Scientific Archive" — should be "Open Science Archive"; one-line fix worth a separate small PR.

rorybyrne added 2 commits June 1, 2026 16:00
Add TODO comments to identify areas for future refactoring:
- Consider replacing direct service dependencies with ports
- Evaluate need for lazy imports in command handlers
- Use SRN class for building SRN strings instead of string formatting
- Consider using convention ID instead of SRN for better semantics
- Evaluate using Pydantic objects for better type safety
…wn validators

The published OSA image runs as `appuser`, which can't read the host's
/var/run/docker.sock (root:docker 0660). Direct socket mount fails with
EACCES on every ingester/validator spawn. Fix is purely deploy-layer:
add a tecnativa/docker-socket-proxy sidecar that holds the socket bind
mount and exposes a locked-down Docker API (CONTAINERS, IMAGES, POST
only) over TCP. The OSA server reaches it via DOCKER_HOST, no longer
mounts the socket, no longer needs filesystem access to the daemon.

The proxy is on an internal-only Docker network with no external egress.
The compose v2 environment-map merge means dev (-f docker-compose.dev.yml)
inherits DOCKER_HOST and the proxy dependency without further changes.

Verified end-to-end against the cultivarium pilot using the published
sha-a747fc1 image: ingester containers spawn cleanly, "Cannot connect
to Docker Engine" log line is gone.

Closes #131
@rorybyrne rorybyrne linked an issue Jun 2, 2026 that may be closed by this pull request
4 tasks

github-actions Bot commented Jun 2, 2026

Code Coverage

| Package | Line Rate | Complexity |
| --- | --- | --- |
| . | 81% | 0 |
| application | 100% | 0 |
| application.api | 100% | 0 |
| application.api.rest | 85% | 0 |
| application.api.v1 | 82% | 0 |
| application.api.v1.routes | 64% | 0 |
| application.event | 100% | 0 |
| domain | 100% | 0 |
| domain.auth | 100% | 0 |
| domain.auth.command | 90% | 0 |
| domain.auth.event | 100% | 0 |
| domain.auth.model | 93% | 0 |
| domain.auth.port | 99% | 0 |
| domain.auth.query | 93% | 0 |
| domain.auth.service | 91% | 0 |
| domain.auth.util | 100% | 0 |
| domain.auth.util.di | 79% | 0 |
| domain.curation | 100% | 0 |
| domain.curation.adapter | 100% | 0 |
| domain.curation.command | 100% | 0 |
| domain.curation.event | 100% | 0 |
| domain.curation.handler | 92% | 0 |
| domain.curation.model | 100% | 0 |
| domain.curation.port | 100% | 0 |
| domain.curation.query | 100% | 0 |
| domain.curation.service | 100% | 0 |
| domain.deposition | 100% | 0 |
| domain.deposition.adapter | 100% | 0 |
| domain.deposition.command | 87% | 0 |
| domain.deposition.event | 100% | 0 |
| domain.deposition.handler | 100% | 0 |
| domain.deposition.model | 98% | 0 |
| domain.deposition.port | 100% | 0 |
| domain.deposition.query | 86% | 0 |
| domain.deposition.service | 97% | 0 |
| domain.deposition.util.di | 94% | 0 |
| domain.discovery | 100% | 0 |
| domain.discovery.model | 96% | 0 |
| domain.discovery.port | 100% | 0 |
| domain.discovery.query | 100% | 0 |
| domain.discovery.service | 84% | 0 |
| domain.discovery.util | 100% | 0 |
| domain.discovery.util.di | 95% | 0 |
| domain.export | 100% | 0 |
| domain.export.adapter | 100% | 0 |
| domain.export.command | 100% | 0 |
| domain.export.event | 100% | 0 |
| domain.export.model | 100% | 0 |
| domain.export.port | 100% | 0 |
| domain.export.query | 100% | 0 |
| domain.export.service | 100% | 0 |
| domain.feature | 100% | 0 |
| domain.feature.event | 0% | 0 |
| domain.feature.handler | 68% | 0 |
| domain.feature.model | 0% | 0 |
| domain.feature.port | 100% | 0 |
| domain.feature.service | 96% | 0 |
| domain.feature.util | 100% | 0 |
| domain.feature.util.di | 100% | 0 |
| domain.index | 100% | 0 |
| domain.index.event | 100% | 0 |
| domain.index.handler | 75% | 0 |
| domain.index.model | 84% | 0 |
| domain.index.service | 100% | 0 |
| domain.ingest | 100% | 0 |
| domain.ingest.command | 79% | 0 |
| domain.ingest.event | 100% | 0 |
| domain.ingest.handler | 52% | 0 |
| domain.ingest.model | 100% | 0 |
| domain.ingest.port | 100% | 0 |
| domain.ingest.service | 76% | 0 |
| domain.metadata | 100% | 0 |
| domain.metadata.event | 100% | 0 |
| domain.metadata.handler | 100% | 0 |
| domain.metadata.model | 0% | 0 |
| domain.metadata.port | 100% | 0 |
| domain.metadata.service | 93% | 0 |
| domain.metadata.util | 100% | 0 |
| domain.metadata.util.di | 100% | 0 |
| domain.record | 100% | 0 |
| domain.record.adapter | 100% | 0 |
| domain.record.command | 100% | 0 |
| domain.record.event | 100% | 0 |
| domain.record.handler | 100% | 0 |
| domain.record.model | 100% | 0 |
| domain.record.port | 100% | 0 |
| domain.record.query | 100% | 0 |
| domain.record.service | 66% | 0 |
| domain.search | 100% | 0 |
| domain.search.adapter | 100% | 0 |
| domain.search.command | 100% | 0 |
| domain.search.event | 100% | 0 |
| domain.search.model | 100% | 0 |
| domain.search.port | 100% | 0 |
| domain.search.query | 100% | 0 |
| domain.search.service | 100% | 0 |
| domain.semantics | 100% | 0 |
| domain.semantics.command | 94% | 0 |
| domain.semantics.event | 100% | 0 |
| domain.semantics.handler | 100% | 0 |
| domain.semantics.model | 100% | 0 |
| domain.semantics.port | 100% | 0 |
| domain.semantics.query | 90% | 0 |
| domain.semantics.service | 100% | 0 |
| domain.semantics.util | 100% | 0 |
| domain.semantics.util.di | 93% | 0 |
| domain.shared | 94% | 0 |
| domain.shared.authorization | 87% | 0 |
| domain.shared.model | 93% | 0 |
| domain.shared.port | 100% | 0 |
| domain.validation | 100% | 0 |
| domain.validation.adapter | 100% | 0 |
| domain.validation.command | 0% | 0 |
| domain.validation.event | 100% | 0 |
| domain.validation.handler | 90% | 0 |
| domain.validation.model | 91% | 0 |
| domain.validation.port | 100% | 0 |
| domain.validation.query | 100% | 0 |
| domain.validation.service | 90% | 0 |
| domain.validation.util.di | 93% | 0 |
| infrastructure | 80% | 0 |
| infrastructure.auth | 56% | 0 |
| infrastructure.event | 75% | 0 |
| infrastructure.http | 92% | 0 |
| infrastructure.index | 82% | 0 |
| infrastructure.ingest | 83% | 0 |
| infrastructure.k8s | 76% | 0 |
| infrastructure.messaging | 100% | 0 |
| infrastructure.oci | 56% | 0 |
| infrastructure.persistence | 68% | 0 |
| infrastructure.persistence.adapter | 52% | 0 |
| infrastructure.persistence.mappers | 62% | 0 |
| infrastructure.persistence.repository | 31% | 0 |
| infrastructure.s3 | 26% | 0 |
| infrastructure.storage | 100% | 0 |
| sdk | 100% | 0 |
| sdk.index | 100% | 0 |
| util | 100% | 0 |
| util.di | 77% | 0 |
| Summary | 75% (6956 / 9273) | 0 |

greptile-apps Bot commented Jun 2, 2026

Contributor

Greptile Summary

This PR delivers two bundled changes: a Docker socket proxy sidecar that fixes an EACCES permission failure for the appuser-run server, and release infrastructure (OCI labels in the Dockerfile, build-arg plumbing in the CI workflow, and RELEASING.md).

  • Socket proxy: Replaces the direct /var/run/docker.sock bind-mount with a tecnativa/docker-socket-proxy sidecar on an internal: true network; server reaches it via DOCKER_HOST=tcp://docker-socket-proxy:2375. The service_healthy dependency condition on the proxy closes the cold-start race.
  • Release infrastructure: server/Dockerfile gains OSA_VERSION/OSA_REVISION build args and six OCI labels; image.yml passes those args at build time using github.event.release.tag_name || github.sha; pyproject.toml is realigned to 0.0.1 to match the first planned semver tag.
  • Logging improvement: Silent except Exception: pass around container cleanup is replaced with log.warning(...) in both runner.py and ingester_runner.py.

Confidence Score: 5/5

Safe to merge. The socket-proxy wiring is correct, the healthcheck + service_healthy dependency closes the cold-start race, and the OCI label plumbing is straightforward.

The compose changes are well-scoped: the proxy is confined to an internal network, the server's direct socket mount is removed, and startup ordering is enforced via a TCP healthcheck. The Dockerfile ARG/LABEL additions and CI build-arg plumbing are mechanical and low-risk. The only outstanding behavioural question — whether container cleanup DELETE calls reach the daemon — was raised in a prior review thread and is a known, logged degradation rather than a silent failure after this PR.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| deploy/docker-compose.yml | Adds docker-socket-proxy sidecar with healthcheck and service_healthy dependency; server loses direct socket mount and gains DOCKER_HOST. Network topology and internal network restriction look correct. |
| .github/workflows/image.yml | Passes OSA_VERSION (release tag or SHA) and OSA_REVISION (full SHA) as build-args; expression correctly falls back to github.sha for non-release triggers. |
| server/Dockerfile | Adds ARG/LABEL block in the runtime stage for six OCI image labels; variable substitution from ARG in LABEL is valid Docker syntax. |
| server/osa/infrastructure/oci/runner.py | Replaces silent exception swallow on container.delete() with a structured warning log; no logic change. |
| server/osa/infrastructure/oci/ingester_runner.py | Same silent-exception-to-warning-log change as runner.py for the ingester cleanup path. |
| server/pyproject.toml | Version bumped down from 0.1.0 to 0.0.1 to align with first semver release; uv.lock updated to match. |

Sequence Diagram

sequenceDiagram
    participant H as Docker Host
    participant P as docker-socket-proxy<br/>(internal network)
    participant S as osa-server<br/>(default + docker-proxy nets)
    participant D as Docker Daemon<br/>(/var/run/docker.sock)

    Note over H,D: Startup sequence
    H->>P: start container
    P->>D: connect via /var/run/docker.sock:ro
    P-->>H: healthcheck: GET /_ping → 200
    H->>S: start container (service_healthy satisfied)

    Note over S,D: Ingest / hook container lifecycle
    S->>P: POST /containers/create (tcp://docker-socket-proxy:2375)
    P->>D: forward POST /containers/create
    D-->>P: 201 container ID
    P-->>S: 201 container ID
    S->>P: "POST /containers/{id}/start"
    P->>D: "forward POST /containers/{id}/start"
    S->>P: "DELETE /containers/{id}?force=true"
    P->>D: forward DELETE (permitted by POST: 1)

Reviews (5): Last reviewed commit: "fix: log container cleanup failures inst..."

…er start

Greptile review flagged a race: depends_on with service_started only waits
for the container process, not for haproxy to be listening on 2375. A cold
boot with a queued ingest could hit the proxy before it's ready.

Add a wget-based healthcheck against the proxy's /_ping endpoint (PING is
on by default in the proxy's permission model) and gate the server on
service_healthy. Verified live: server now blocks until proxy reports
healthy in the compose boot sequence.

wget is available in the haproxy:alpine base via busybox.

Sets up the path for cutting GitHub Releases that produce semver-tagged
images at ghcr.io/opensciencearchive/osa:vX.Y.Z. The release-publish
trigger in image.yml already existed; this adds the surrounding polish
so the resulting images are auditable and the process is documented.

Changes:
- server/Dockerfile: add OSA_VERSION + OSA_REVISION build args plus
  six OCI image labels (title, description, source, licenses, version,
  revision). Verified locally: docker inspect shows all labels populated.
- .github/workflows/image.yml: pass OSA_VERSION (release tag or commit
  SHA) and OSA_REVISION (full SHA) as build args to populate the labels.
- server/pyproject.toml: bump 0.1.0 -> 0.0.1 to match the chosen first
  release version. No registry tag exists for 0.1.0, so safe to align.
- RELEASING.md: document the flow — release-from-green-main, gh release
  create vX.Y.Z, image lands at ghcr within ~5min.

No floating tags (:latest, :vX, :vX.Y) — OSA is consumed via compose
files with pinned OSA_IMAGE_TAG, not ad-hoc docker pull.

First release (v0.0.1) will be cut after this PR merges.
@rorybyrne rorybyrne changed the title from "fix: route Docker API calls through socket proxy" to "fix: docker socket proxy + release infrastructure" Jun 2, 2026
@rorybyrne

Contributor Author

@greptile

Comment thread deploy/docker-compose.yml
Comment on lines +50 to +53
    environment:
      CONTAINERS: 1
      IMAGES: 1
      POST: 1

Contributor

P1 Missing DELETE permission causes silent container leaks on every hook run. In server/osa/infrastructure/oci/runner.py (line 228), _run_container calls await container.delete(force=True) in its finally block to clean up each hook container. That call maps to DELETE /containers/{id}?force=true — a DELETE HTTP method that the proxy blocks by default (only GET/HEAD are allowed without explicit opt-in). The except Exception: pass around the call silently swallows the resulting 403, so every completed hook container accumulates on the host until Docker runs out of resources.

Suggested change
-    environment:
-      CONTAINERS: 1
-      IMAGES: 1
-      POST: 1
+    environment:
+      CONTAINERS: 1
+      IMAGES: 1
+      POST: 1
+      DELETE: 1

Contributor Author

I think POST meant "any non-GET" method, no? cc @greptile

Contributor

You're right. POST: 1 in the haproxy config maps to this rule:

http-request deny unless METH_GET || { env(POST) -m bool }

It's a master switch for all non-GET methods — setting it to 1 allows GET, POST, PUT, DELETE, PATCH, etc. through the initial filter. DELETE /containers/{id}?force=true would pass with POST: 1 set.

My original comment was wrong. No container leak risk here — withdrawing it.
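
The rule quoted above can be modelled as a small predicate. This is a simplified sketch, not haproxy's actual evaluation. The `env_flag` truthiness rule is an assumption, and only the initial method filter is covered; the path-based permissions (CONTAINERS, IMAGES) are applied separately and are not modelled here.

```python
from typing import Mapping

# haproxy's predefined METH_GET ACL matches both GET and HEAD.
SAFE_METHODS = {"GET", "HEAD"}


def env_flag(env: Mapping[str, str], name: str) -> bool:
    """Simplified truthiness: unset, empty, or "0" is false; anything else is true."""
    return env.get(name, "0") not in ("", "0")


def passes_method_filter(method: str, env: Mapping[str, str]) -> bool:
    """Model of: http-request deny unless METH_GET || { env(POST) -m bool }"""
    return method.upper() in SAFE_METHODS or env_flag(env, "POST")
```

Under this model, `passes_method_filter("DELETE", {"POST": "1"})` is true and `passes_method_filter("DELETE", {})` is false, which matches the conclusion reached in this thread.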

The OCI runners had a bare `except Exception: pass` around the
finally-block container.delete(force=True) cleanup. Any failure during
cleanup was invisible — including, hypothetically, a misconfigured
Docker socket proxy that blocked DELETE requests, which would leak hook
or ingester containers indefinitely with no log trace.

Today's proxy config (POST=1 unlocks all non-GET methods including
DELETE) works correctly, but this defensive logging means any future
misconfiguration surfaces immediately rather than as disk pressure.

Verified empirically that the proxy currently passes DELETE through:
`DELETE /containers/nonexistent` returns 404 (from Docker, "no such
container") rather than 403 (from the proxy itself).

Also syncs uv.lock to the 0.0.1 version bump from the previous commit.

Left runner.py:183 alone — that one's a deliberate best-effort log
enrichment before raising OOMError; empty-string fallback is the
intended behaviour there.
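
The shape of the change described in this commit message can be sketched as follows. `FakeContainer` and `run_then_cleanup` are hypothetical stand-ins written for this example. The real code lives in runner.py and ingester_runner.py and is not reproduced here.

```python
import asyncio
import logging

log = logging.getLogger("oci_cleanup_sketch")


class FakeContainer:
    """Stand-in for a Docker container handle; not the real client API."""

    def __init__(self, container_id: str, fail_delete: bool = False) -> None:
        self.id = container_id
        self.fail_delete = fail_delete
        self.deleted = False

    async def delete(self, force: bool = False) -> None:
        if self.fail_delete:
            raise RuntimeError("403 Forbidden")
        self.deleted = True


async def run_then_cleanup(container: FakeContainer) -> str:
    try:
        return "ok"  # placeholder for the real container run
    finally:
        try:
            await container.delete(force=True)
        except Exception as exc:
            # Previously `pass`; a warning makes a leaked container visible.
            log.warning("failed to delete container %s: %s", container.id, exc)
```

The run result is still returned when cleanup fails, so the behaviour is unchanged apart from the warning now being logged.
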
@rorybyrne rorybyrne merged commit 491fe47 into main Jun 2, 2026
9 checks passed
@rorybyrne rorybyrne deleted the rory-fix-docker-socket-proxy branch June 2, 2026 20:07

Development

Successfully merging this pull request may close these issues.

fix: route Docker API calls through socket proxy instead of raw socket mount
