SRE-728: Mirror images to GHCR alongside ECR #8781
Codecov Report
✅ All modified and coverable lines are covered by tests.

Coverage diff for #8781 against base `t/sre-726-overhaul-external-services-dockerize-apps-for-self-hosting`:

| Metric | Base | #8781 |
|---|---|---|
| Coverage | ? | 59.01% |
| Files | ? | 1342 |
| Lines | ? | 129455 |
| Branches | ? | 5849 |
| Hits | ? | 76395 |
| Misses | ? | 52159 |
| Partials | ? | 901 |
Two latent bugs in the runner stage that surfaced once compose started pulling the published image (SRE-726):

- ENTRYPOINT invoked `yarn --cache-folder X --global-folder Y`. These are yarn-classic flags that yarn berry (the repo is on yarn@4.12.0) silently rejects, so `start` was parsed as a command name and yarn errored with "Unknown Syntax Error: Command not found". Skip yarn at runtime and invoke `next` directly from the hoisted workspace-root `node_modules`. WORKDIR still points at the app so next picks up its own next.config.js.
- `install -d` on `.next` was a no-op because the directory already exists from the builder stage's `next build`, leaving the root-owned contents (including `.next/trace`) unwritable by the runtime `frontend` user; next start aborted with EACCES on the trace file. Replace with an explicit `chown -R`.
…tion

Image scan flagged a security footgun in `hash-ai-worker-ts`, a 2 GB build-cache leak in `hash-frontend`, and a stray debug knob in `hash-integration-worker`. None caused a live secret leak in the currently published images, but the patterns were "wrong build args away from leaking".

hash-ai-worker-ts: Drop the `ARG GOOGLE_CLOUD_WORKLOAD_IDENTITY_FEDERATION_CONFIG_JSON` / `ENV` pair from the runner stage along with the `if [ -n … ]` block that wrote the JSON to `/tmp/...` and `export`-ed `GOOGLE_APPLICATION_CREDENTIALS`. The `export` was shell-scoped (never reached the ENTRYPOINT) and the ARG/ENV combination would bake the JSON into both `docker history` and the final image as soon as any caller passed `--build-arg ...=$(cat creds.json)`. GCP credentials are runtime concerns: mount the file and set `GOOGLE_APPLICATION_CREDENTIALS` at runtime instead.

hash-frontend: `rm -rf /usr/local/src/apps/hash-frontend/.next/cache` after `turbo build`. The webpack/swc/fetch caches are runtime artefacts and shipped ~2 GB of dead bytes per architecture.

hash-integration-worker: Drop `ENV MISE_VERBOSE=1` (leftover from Dockerfile iteration).

Plus delete `.github/actions/build-docker-images/`: the last caller (the `test.yml` build matrix) was removed in the same change set, and the action's `--build-arg GOOGLE_CLOUD_...` invocation was the structural counterpart to the footgun above.
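The runtime-credentials approach described in this commit can be sketched as a dry-run helper that prints the `docker run` invocation instead of executing it. This is only an illustration: the function name `gcp_run_cmd`, the in-container mount path, and the image reference are hypothetical and do not come from the PR.

```shell
#!/bin/sh
# gcp_run_cmd HOST_CREDS_FILE IMAGE
# Prints (does not run) a docker command that mounts the credentials file
# read-only and points GOOGLE_APPLICATION_CREDENTIALS at it, so the JSON
# never passes through a build arg or an image layer.
gcp_run_cmd() {
  host_file=$1
  image=$2
  # Hypothetical mount point inside the container.
  container_file=/run/secrets/gcp-credentials.json
  printf 'docker run --rm -v %s:%s:ro -e GOOGLE_APPLICATION_CREDENTIALS=%s %s\n' \
    "$host_file" "$container_file" "$container_file" "$image"
}

# Placeholder host path and image name, for illustration only.
gcp_run_cmd /srv/creds.json registry.example.invalid/hash-ai-worker-ts:latest
```

Because the file only exists in the container at run time, it appears neither in `docker history` nor in the image's ENV.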
Reverts the direct-`next` invocation from 54dd725. Running next straight from `node_modules/.bin/next` started the server but broke SSR: request handling hit JSX-runtime / module-resolution mismatches (`jsxDEV is not a function` from `@hashintel/design-system`, `EvalError` in the edge runtime). `yarn start` keeps yarn's resolution pipeline, which the build apparently relies on, and serves requests correctly.

The original `--cache-folder` / `--global-folder` classic-yarn flags that caused the "Unknown Syntax Error: Command not found" on yarn berry stay gone; they're replaced with the berry equivalents `YARN_CACHE_FOLDER` and `YARN_GLOBAL_FOLDER` as ENV.

The `chown -R .next` from the previous commit stays. That was an orthogonal permission fix needed regardless of which binary launches next.
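The permission half of this fix rests on one behaviour: `install -d` on a directory that already exists does not touch the files inside it, whereas a recursive change does. A rootless sketch of that behaviour follows. It uses file modes and `chmod -R` as a stand-in for ownership and `chown -R` (changing owners would need root), and the temporary directory layout is invented for the demonstration.

```shell
#!/bin/sh
# Build a throwaway ".next" directory containing a read-only "trace" file.
demo_dir=$(mktemp -d)
mkdir "$demo_dir/.next"
: > "$demo_dir/.next/trace"
chmod 400 "$demo_dir/.next/trace"   # stand-in for "not writable by the runtime user"

# install -d on the existing directory: the file inside keeps mode 400.
install -d -m 755 "$demo_dir/.next"
if [ -n "$(find "$demo_dir/.next/trace" -perm 400)" ]; then
  trace_untouched=yes
else
  trace_untouched=no
fi

# A recursive change reaches the contents (analogue of chown -R).
chmod -R u+w "$demo_dir/.next"
if [ -n "$(find "$demo_dir/.next/trace" -perm 600)" ]; then
  trace_writable=yes
else
  trace_writable=no
fi

rm -rf "$demo_dir"
echo "untouched by install -d: $trace_untouched; writable after recursive fix: $trace_writable"
```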
Fold all backend image build/manifest/deploy jobs into the existing `Deploy` workflow alongside the sourcemaps flow, so one workflow owns the post-PR deployment surface end-to-end.

- Add `workflow_dispatch:` trigger.
- Apply the backend-cd concurrency model to the whole workflow: PR / merge_group cancel-in-progress on the per-ref group, push and workflow_dispatch share a flat `publish` group. The previous unconditional `cancel-in-progress: true` is incompatible with publishing: concurrent main pushes would race on the moving GHCR `:latest-<arch>` tags.
- Pull in the `build` / `manifest` / `deploy` jobs from hash-backend-cd.yml verbatim (Compute-targets policy, `push: [ecr, ghcr]` per-service, native per-arch runners, manifest merge).
- Consolidate the collector: `Deployments passed` aggregates `setup` / `sourcemaps` / `build` / `manifest` / `deploy`. Stable name suitable as a `Required` branch-protection check.
- Slack notification split by failure type: the standalone `notify-slack-deploy` fires on backend deploy failures on main; the in-passed step fires on merge-queue failures. Different audiences, different conditions.

`hash-backend-cd.yml` removed; the merged workflow replaces it.

A turbo-driven affected-services matrix for the backend build (similar to the existing sourcemaps pattern) is the natural follow-up; the current static include lays the groundwork.
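A rough model of what a Compute-targets step could decide per (service, arch) job is sketched below. It is not the workflow's actual script. The policy is pieced together from statements elsewhere in this PR (no pushes in PR mode, amd64 skipped on `pull_request`, ECR short-circuited off `main`, GHCR images under `ghcr.io/hashintel/hash/<service>`). The function name, argument order, ECR registry host, and the assumption that ECR uses the same per-arch tag scheme are all hypothetical.

```shell
#!/bin/sh
# compute_targets EVENT REF SERVICE ARCH SHA PUSH_LIST ECR_REGISTRY
# Prints should_build, push, ecr_tags and ghcr_tags as key=value lines.
compute_targets() {
  event=$1; ref=$2; service=$3; arch=$4; sha=$5; push_list=$6; ecr_registry=$7
  should_build=true; push=false; ecr_tags=""; ghcr_tags=""

  # Pull requests validate on arm64 only.
  if [ "$event" = pull_request ] && [ "$arch" = amd64 ]; then
    should_build=false
  fi

  # Assumption: only push and workflow_dispatch runs publish anything.
  publishes=false
  case "$event" in
    push|workflow_dispatch) publishes=true ;;
  esac

  if [ "$should_build" = true ] && [ "$publishes" = true ]; then
    for target in $push_list; do
      case "$target" in
        ghcr)
          repo="ghcr.io/hashintel/hash/$service"
          ghcr_tags="$repo:latest-$arch,$repo:sha-$sha-$arch" ;;
        ecr)
          # ECR is short-circuited on anything other than main.
          if [ "$ref" = refs/heads/main ]; then
            repo="$ecr_registry/$service"
            ecr_tags="$repo:latest-$arch,$repo:sha-$sha-$arch"
          fi ;;
      esac
    done
  fi

  if [ -n "$ecr_tags" ] || [ -n "$ghcr_tags" ]; then push=true; fi
  printf 'should_build=%s\npush=%s\necr_tags=%s\nghcr_tags=%s\n' \
    "$should_build" "$push" "$ecr_tags" "$ghcr_tags"
}

# Feature-branch dispatch: GHCR tags only, ECR left empty.
compute_targets workflow_dispatch refs/heads/feature frontend arm64 abc123 \
  "ecr ghcr" registry.example.invalid
```

Keeping the policy in one place means each job only reads four outputs, and a service that lists only `ecr` in its push targets simply ends up with an empty `ghcr_tags`.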
PR Summary (Cursor Bugbot): High Risk
Reviewed by Cursor Bugbot for commit e55a185.
🤖 Augment PR Summary
Summary: This PR mirrors HASH backend Docker images to GHCR alongside the existing ECR publishing to improve self-hosting support.
Technical Notes: GHCR images are published under …
@@ -93,20 +106,293 @@ jobs:
- name: Build sourcemaps
The sourcemaps job’s if: condition references github.event.pull_request..., which evaluates false on push / workflow_dispatch events; that means sourcemaps won’t run on main publishes (see .github/workflows/deploy.yml around the sourcemaps job if: at ~L68). Consider guarding the fork-only check so it still runs on non-PR events.
Severity: medium
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ad008ba.
Two unrelated tweaks on the merged deploy workflow:

- Skip amd64 builds on `pull_request` events. arm64 + native runner catches almost all real bugs; arch-specific divergence in Rust/Node code is rare. merge_group still builds amd64 before any main push, so the safety gate stays intact. Halves PR runtime / runner cost.
- The `sourcemaps` job's `if:` referenced `github.event.pull_request.head.repo.full_name`, which is null on non-PR events, meaning the sourcemaps upload was silently skipped on main pushes (and dispatch / merge_group). Guard the fork-only check behind `github.event_name == 'pull_request'` so push/dispatch/merge_group always run; PRs still apply the fork filter.
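The guarded condition from the second bullet can be modelled as a small shell predicate. This is a sketch of the logic only, not the workflow's real `if:` expression. It assumes the fork filter lets through PRs whose head repository equals the base repository, and the function name and arguments are hypothetical.

```shell
#!/bin/sh
# should_run_sourcemaps EVENT_NAME HEAD_REPO BASE_REPO
# Returns 0 (run) or 1 (skip).
should_run_sourcemaps() {
  event_name=$1
  head_repo=$2
  base_repo=$3
  # Non-PR events carry no pull_request payload, so the fork check is
  # bypassed instead of comparing against an empty value.
  if [ "$event_name" != pull_request ]; then
    return 0
  fi
  # Assumed fork filter: run only when the PR head lives in the base repo.
  [ "$head_repo" = "$base_repo" ]
}

for case_args in "push  hashintel/hash" \
                 "pull_request hashintel/hash hashintel/hash" \
                 "pull_request someone/hash hashintel/hash"; do
  # Intentional word splitting; the push case has only two fields, so its
  # base repo lands in the (ignored) head-repo slot.
  if should_run_sourcemaps $case_args; then
    echo "run:  $case_args"
  else
    echo "skip: $case_args"
  fi
done
```

Before the fix, the unguarded comparison behaved like the last branch for every event, which is why main pushes were skipped.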
Benchmark results
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2002 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1001 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 3314 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 1526 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 2078 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 1033 | Flame Graph |
policy_resolution_medium
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 102 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 51 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 269 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 107 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 133 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 63 | Flame Graph |
policy_resolution_none
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 8 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 3 | Flame Graph |
policy_resolution_small
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 52 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 25 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 94 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 26 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 66 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 29 | Flame Graph |
read_scaling_complete
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id;one_depth | 1 entities | Flame Graph | |
| entity_by_id;one_depth | 10 entities | Flame Graph | |
| entity_by_id;one_depth | 25 entities | Flame Graph | |
| entity_by_id;one_depth | 5 entities | Flame Graph | |
| entity_by_id;one_depth | 50 entities | Flame Graph | |
| entity_by_id;two_depth | 1 entities | Flame Graph | |
| entity_by_id;two_depth | 10 entities | Flame Graph | |
| entity_by_id;two_depth | 25 entities | Flame Graph | |
| entity_by_id;two_depth | 5 entities | Flame Graph | |
| entity_by_id;two_depth | 50 entities | Flame Graph | |
| entity_by_id;zero_depth | 1 entities | Flame Graph | |
| entity_by_id;zero_depth | 10 entities | Flame Graph | |
| entity_by_id;zero_depth | 25 entities | Flame Graph | |
| entity_by_id;zero_depth | 5 entities | Flame Graph | |
| entity_by_id;zero_depth | 50 entities | Flame Graph |
read_scaling_linkless
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | 1 entities | Flame Graph | |
| entity_by_id | 10 entities | Flame Graph | |
| entity_by_id | 100 entities | Flame Graph | |
| entity_by_id | 1000 entities | Flame Graph | |
| entity_by_id | 10000 entities | Flame Graph |
representative_read_entity
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 | Flame Graph | |
representative_read_entity_type
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| get_entity_type_by_id | Account ID: bf5a9ef5-dc3b-43cf-a291-6210c0321eba | Flame Graph | |
representative_read_multiple_entities
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_property | traversal_paths=0 | 0 | |
| entity_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=0 | 0 | |
| link_by_source_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true |
scenarios
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| full_test | query-limited | Flame Graph | |
| full_test | query-unlimited | Flame Graph | |
| linked_queries | query-limited | Flame Graph | |
| linked_queries | query-unlimited | Flame Graph |

🌟 What is the purpose of this PR?
Publish HASH backend images to GHCR alongside the existing ECR push so self-hosters can `docker pull` directly. The build matrix is also restructured to build natively per-arch (arm64 + amd64), the backend CD jobs are merged into the existing `deploy.yml` workflow, and a handful of latent Docker-image fixes that surfaced once compose started actually pulling and running the published images are folded in.

🔗 Related links
- `hash-frontend` GHCR image is not yet origin-portable (see Known issues)

🚫 Blocked by
🔍 What does this change?
`deploy.yml` workflow (merged from `hash-backend-cd.yml`)
- `Deploy` workflow: sourcemaps + backend image build/manifest/deploy now share one workflow, one `setup` job, one collector. `hash-backend-cd.yml` deleted.
- Triggers include `workflow_dispatch` and `merge_group`.
- `push` and `workflow_dispatch` share a flat `publish` group so all publishing runs serialize regardless of ref. The previous `cancel-in-progress: true` on every event was incompatible with multi-arch GHCR pushes: concurrent main pushes would race on the moving `:latest-<arch>` tags.
- Build matrix is `service × arch` (cartesian, native runners per arch: `ubuntu-24.04-arm` and `ubuntu-24.04`; no QEMU). Per-service config (paths, push targets, build args) lives in `include:` with `push: [ecr, ghcr]` declaring registries explicitly. A `Compute targets` step centralises the push policy and emits per-job `should_build`, `push`, `ecr_tags`, `ghcr_tags` outputs, eliminating the inline `${{ ... && ... || '' }}` ternaries.
- Per-arch tags (`:latest-<arch>`, `:sha-<sha>-<arch>`); a follow-up `manifest` job joins them into multi-arch `:latest` / `:sha-<sha>` via `docker buildx imagetools create`. Image namespace: `ghcr.io/hashintel/hash/<service>` (canonical `<owner>/<repo>/<image>` pattern). Kratos and Hydra publish to ECR only.
- `Deployments passed` collector aggregates `setup` / `sourcemaps` / `build` / `manifest` / `deploy`. Stable name suitable as a `Required` branch-protection check.
- Slack notification split by failure type: the standalone `notify-slack-deploy` fires on backend deploy failures on main; an in-passed step fires on merge-queue failures (audience: @devops vs @infra).
- `469596578827` account ID.

`docker-build-push` composite action
- `ECR_TAGS` / `GHCR_TAGS` (each may be empty) and a `PLATFORM`. Conditional ECR / GHCR login on tag-set non-emptiness.
- `BUILD_ARGS` input restored for non-secret values (e.g. `NEXT_PUBLIC_*` URLs), with a description warning against using it for secrets.

Dockerfile fixes
Latent bugs uncovered once compose actually ran the published images:
- `hash-frontend`: ENTRYPOINT used yarn-classic flags (`--cache-folder` / `--global-folder`) that the repo's yarn berry (4.12.0) rejected; `yarn start` errored with `Unknown Syntax Error: Command not found`. Swap to the yarn-berry equivalents `YARN_CACHE_FOLDER` / `YARN_GLOBAL_FOLDER` as ENV. Plus `chown -R .next`, because `install -d` was a no-op on the pre-existing directory (created by `next build` as root) and `.next/trace` was unwritable by the runtime user. Plus `rm -rf .next/cache` after `turbo build` (~2 GB of webpack cache that doesn't belong in the published image).
- `hash-ai-worker-ts`: drop the `ARG GOOGLE_CLOUD_WORKLOAD_IDENTITY_FEDERATION_CONFIG_JSON` / `ENV` / `if [ -n … ]` block. The pattern would bake the JSON into image history and ENV the moment any caller passed it as a build-arg; the in-`RUN` `export GOOGLE_APPLICATION_CREDENTIALS=` was shell-scoped and never reached runtime anyway. GCP creds are now a runtime concern (mount file + set `GOOGLE_APPLICATION_CREDENTIALS`).
- `hash-integration-worker`: drop stray `ENV MISE_VERBOSE=1`.

Cleanup
- `test.yml` no longer needs its docker-build matrix; the merged `deploy.yml` validates the same images on every PR. Removes the `build` job, the `dockers` setup output, and the related affected-package query.
- Deletes the `.github/actions/build-docker-images/` action; it was the structural counterpart to the ai-worker-ts secret footgun above.

Pre-Merge Checklist 🚀
🚢 Has this modified a publishable library?
📜 Does this require a change to the docs?
🕸️ Does this require a change to the Turbo Graph?
⚠️ Known issues
- `hash-frontend` image still bakes `NEXT_PUBLIC_*` at build time (FE-752). Today the image hard-codes `http://localhost:5001` / `http://localhost:3000`, which is fine for the staging/dev origin and the GHCR mirror's primary audience (self-hosters running it via compose). Proper origin portability is tracked in FE-752.

🐾 Next steps
- A turbo-driven affected-services matrix for the backend build, similar to the existing `sourcemaps` pattern in the same workflow. The current `Compute targets` step is structured so a future prep-job can feed it a pre-filtered matrix list.

🛡 What tests cover this?
No new tests; the workflow itself is the test surface. PR-mode runs all 12 (service, arch) build jobs without pushing; the `Deployments passed` collector aggregates the results.

❓ How to test this?
This PR (build-only validation):
- `Deployments passed` collector should be green; `manifest`, `deploy`, and the standalone deploy-notify job skipped.

Ad-hoc validation from any branch (feature-branch dispatch):
ECR push is short-circuited on non-main (matches the OIDC trust policy). GHCR push and manifest creation run end-to-end. Verify:
    docker pull ghcr.io/hashintel/hash/frontend:latest
    docker manifest inspect ghcr.io/hashintel/hash/frontend:latest
    # → arm64 + amd64 entries in the manifest list

Or via compose (after #8758 merges):

    docker compose --profile hash up -d --pull always

After merge to `main`:
- `:staging` / `:latest` / `:sha-<sha>` tags. ECS rolling-redeploys the eight backend services. Slack-notify fires on deploy failure.
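For context on what `docker manifest inspect` is expected to show, the manifest-merge step described under "What does this change?" can be approximated as follows. The helper only prints the `docker buildx imagetools create` command for a given image and tag. The function name `manifest_cmd` and the `abc123` SHA are placeholders; the tag shapes follow the `:latest-<arch>` / `:sha-<sha>-<arch>` scheme from the description.

```shell
#!/bin/sh
# manifest_cmd IMAGE TAG
# Prints the command that would join the two per-arch tags
# (TAG-arm64, TAG-amd64) into one multi-arch TAG.
manifest_cmd() {
  image=$1
  tag=$2
  printf 'docker buildx imagetools create -t %s:%s %s:%s-arm64 %s:%s-amd64\n' \
    "$image" "$tag" "$image" "$tag" "$image" "$tag"
}

manifest_cmd ghcr.io/hashintel/hash/frontend latest
manifest_cmd ghcr.io/hashintel/hash/frontend sha-abc123
```

Since the multi-arch tag is assembled from images that were each built on a native runner, no emulated build is involved at this stage.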