SRE-739: Fix backend Deploy concurrency (dropped builds and out-of-order mutable tags) #8801
PR Summary (Cursor Bugbot)
High Risk
Overview: Concurrency is now one group per run […]
Reviewed by Cursor Bugbot for commit 4c62534.
Augment PR Summary
Summary: This PR hardens the backend Deploy workflow against concurrency-related dropped builds and mutable-tag races.
Technical Notes: Images are labeled with […]
Codecov Report
All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #8801 +/- ##
=======================================
Coverage 59.07% 59.07%
=======================================
Files 1344 1344
Lines 129750 129750
Branches 5868 5868
=======================================
Hits 76653 76653
Misses 52193 52193
Partials 904 904
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 16728da.
…sient instead of advancing
Pull request overview
Reworks the backend Deploy GitHub Actions workflow to prevent dropped publishing runs and races around mutable Docker tags by switching to per-run concurrency at the workflow level and per-service serialization at the job level, assembling GHCR multi-arch manifests from pushed digests, and adding a "move mutable tag only if newer" guard.
Changes:
- Update deploy workflow concurrency and job structure (new `stage` job for ECR `:staging`, per-service serialization for mutable tags and ECS redeploys).
- Push GHCR images by digest and assemble `:sha-<sha>` manifests from collected digests; advance `:latest` via a guard action.
- Move build cache to GHCR registry cache refs and introduce a new `tag-if-newer` composite action.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `.github/workflows/deploy.yml` | Changes concurrency model; adds digest-based manifest assembly, a staging tag job, and guarded mutable-tag advancement. |
| `.github/actions/tag-if-newer/action.yml` | New composite action to advance mutable tags only when the candidate commit is a strict descendant of the currently-tagged commit. |
| `.github/actions/docker-build-push/action.yml` | Updates build/push behavior for GHCR digest pushes and GHCR registry-backed caching; adds digest output for downstream manifest assembly. |
…he private cache)
Benchmark results
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2002 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1001 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 3314 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 1526 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 2078 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 1033 | Flame Graph |
policy_resolution_medium
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 102 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 51 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 269 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 107 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 133 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 63 | Flame Graph |
policy_resolution_none
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 2 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 8 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 3 | Flame Graph |
policy_resolution_small
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| resolve_policies_for_actor | user: empty, selectivity: high, policies: 52 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: empty, selectivity: medium, policies: 25 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: high, policies: 94 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: seeded, selectivity: medium, policies: 26 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: high, policies: 66 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: low, policies: 1 | Flame Graph | |
| resolve_policies_for_actor | user: system, selectivity: medium, policies: 29 | Flame Graph |
read_scaling_complete
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id;one_depth | 1 entities | Flame Graph | |
| entity_by_id;one_depth | 10 entities | Flame Graph | |
| entity_by_id;one_depth | 25 entities | Flame Graph | |
| entity_by_id;one_depth | 5 entities | Flame Graph | |
| entity_by_id;one_depth | 50 entities | Flame Graph | |
| entity_by_id;two_depth | 1 entities | Flame Graph | |
| entity_by_id;two_depth | 10 entities | Flame Graph | |
| entity_by_id;two_depth | 25 entities | Flame Graph | |
| entity_by_id;two_depth | 5 entities | Flame Graph | |
| entity_by_id;two_depth | 50 entities | Flame Graph | |
| entity_by_id;zero_depth | 1 entities | Flame Graph | |
| entity_by_id;zero_depth | 10 entities | Flame Graph | |
| entity_by_id;zero_depth | 25 entities | Flame Graph | |
| entity_by_id;zero_depth | 5 entities | Flame Graph | |
| entity_by_id;zero_depth | 50 entities | Flame Graph |
read_scaling_linkless
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | 1 entities | Flame Graph | |
| entity_by_id | 10 entities | Flame Graph | |
| entity_by_id | 100 entities | Flame Graph | |
| entity_by_id | 1000 entities | Flame Graph | |
| entity_by_id | 10000 entities | Flame Graph |
representative_read_entity
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/block/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/book/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/building/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/organization/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/page/v/2 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/person/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/playlist/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/song/v/1 | Flame Graph | |
| entity_by_id | entity type ID: https://blockprotocol.org/@alice/types/entity-type/uk-address/v/1 | Flame Graph | |
representative_read_entity_type
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| get_entity_type_by_id | Account ID: bf5a9ef5-dc3b-43cf-a291-6210c0321eba | Flame Graph | |
representative_read_multiple_entities
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| entity_by_property | traversal_paths=0 | 0 | |
| entity_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| entity_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=0 | 0 | |
| link_by_source_by_property | traversal_paths=255 | 1,resolve_depths=inherit:1;values:255;properties:255;links:127;link_dests:126;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:0;link_dests:0;type:false | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:0;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:0;properties:2;links:1;link_dests:0;type:true | |
| link_by_source_by_property | traversal_paths=2 | 1,resolve_depths=inherit:0;values:2;properties:2;links:1;link_dests:0;type:true |
scenarios
| Function | Value | Mean | Flame graphs |
|---|---|---|---|
| full_test | query-limited | Flame Graph | |
| full_test | query-unlimited | Flame Graph | |
| linked_queries | query-limited | Flame Graph | |
| linked_queries | query-unlimited | Flame Graph |

What is the purpose of this PR?
The backend `Deploy` workflow was intermittently cancelling Docker builds on `main` (e.g. `Build kratos (arm64)`: "The operation was canceled"). Root cause: the shared `…-publish` concurrency group dropped pending main-push runs (GitHub keeps only one pending run per group). Combined with the per-commit `turbo affected` change detection and the promote walk-back, a dropped run for a service-changing commit could let `promote` ship a stale image to production with no error.

This reworks the workflow's concurrency, manifest assembly, mutable-tag handling, and build cache so that no run is dropped and mutable tags only ever move forward.
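The dropped-run failure mode can be illustrated with a toy model. This is only a sketch under stated assumptions: `ConcurrencyGroup`, `shared_group`, and `per_run_groups` are hypothetical names, the model assumes in-progress runs are not cancelled, and it encodes only the rule quoted above (one running run plus one pending run per group). It is not the workflow's actual code.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class ConcurrencyGroup:
    """Toy model of one concurrency group: one running slot, one pending slot."""

    running: Optional[str] = None
    pending: Optional[str] = None
    completed: List[str] = field(default_factory=list)
    dropped: List[str] = field(default_factory=list)

    def submit(self, run: str) -> None:
        if self.running is None:
            self.running = run
            return
        # Only one run may wait; a newer arrival displaces the older pending run.
        if self.pending is not None:
            self.dropped.append(self.pending)
        self.pending = run

    def finish_running(self) -> None:
        if self.running is not None:
            self.completed.append(self.running)
        # Promote the pending run (if any) into the running slot.
        self.running, self.pending = self.pending, None


def shared_group(commits: List[str]) -> Tuple[List[str], List[str]]:
    """All runs share one group and arrive while the first is still running."""
    group = ConcurrencyGroup()
    for commit in commits:
        group.submit(commit)
    while group.running is not None:
        group.finish_running()
    return group.completed, group.dropped


def per_run_groups(commits: List[str]) -> Tuple[List[str], List[str]]:
    """Each run gets its own group, so no run waits in another run's pending slot."""
    completed: List[str] = []
    dropped: List[str] = []
    for commit in commits:
        group = ConcurrencyGroup()
        group.submit(commit)
        group.finish_running()
        completed += group.completed
        dropped += group.dropped
    return completed, dropped


if __name__ == "__main__":
    pushes = ["commit-a", "commit-b", "commit-c"]
    print(shared_group(pushes))
    print(per_run_groups(pushes))
```

With three pushes arriving back to back, the shared-group model completes the first and last runs and drops the middle one, while the per-run model completes all three. If the middle commit was the one that changed a service, its image is never built, which is the stale-promote scenario described above.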
Related links
Blocked by
What does this change?
- […] `github.run_id` (unique per run, never evicted); serialization moved to job-level `queue: max` per resource. Builds run unserialized (their pushes are immutable), so the slow Rust builds no longer block each other. Only the mutable tags/rollouts serialize: `:staging` in a dedicated `stage` job, `:latest` in `manifest`, ECS in `deploy` (which runs `needs: stage`, so the redeploy pulls the freshly-advanced `:staging` rather than the previous image).
- […] `:latest-<arch>` tags. Removes the "Frankenstein manifest" race (arm64 from one commit + amd64 from another) and drops the per-arch GHCR tags.
- […] `tag-if-newer` action): `:staging`/`:latest` advance onto a commit only if it is a strict git-descendant of the currently-tagged commit (resolved via the `org.opencontainers.image.revision` label). Closes the "`queue: max` FIFO not guaranteed" gap and protects a manual `:staging` rollback against older/parallel runs.
- […] `type=gha,mode=max` was 504-ing on the large Rust builds) to a dedicated private `…/<service>/cache:<arch>` package. Written on `push`/`workflow_dispatch` (trusted events) only; PR/merge_group read it but never write it.
- […] `nick-fields/retry`, repo convention): wraps the `imagetools create` calls (manifest assemble + guard) against transient GHCR 5xx.

Pre-Merge Checklist
Has this modified a publishable library?
This PR:
Does this require a change to the docs?
The changes in this PR:
Does this require a change to the Turbo Graph?
The changes in this PR:
- The `is_main` path is not exercised by CI before merge. ECR push (`:sha`/`:run`/`:staging`), the `:staging` guard, the `stage` job, and the ECS `deploy` job all require `is_main`, so they are skipped on `pull_request` (`PUSH=false`) and `workflow_dispatch` (non-main). They run for the first time on the first `main` push; that run should be watched. The riskiest unverified assumption is `imagetools create` against ECR for `:staging` (ECR's image-index media-type quirk); the fallback is `aws ecr put-image`.
- […] `:latest` guard only ran single-run).
- […] `main` run: existing `:staging`/`:latest` images have no `revision` label yet; the guard advances them once unconditionally (by design).

Next steps
- […] `:staging` forward, so a manual rollback isn't durable on its own (the next merge overtakes it). Needs a separate pin/freeze mechanism if "freeze staging" is ever required.
- […] `:staging` tagging into `internal-infra` via `workflow_call` (keep the `:latest` guard in this OSS repo; it can't go closed-source).

What tests cover this?
[…] `workflow_dispatch` run on this branch: all 12 builds parallel, GHCR by-digest + registry cache (no 504), 5 multi-arch manifests assembled from digests, `:latest` guard. A separate throwaway probe confirmed job-level `queue: max` serializes without dropping runs. The `is_main` path is covered only by the first `main` run after merge.

How to test this?
- `gh workflow run deploy.yml --ref <this branch>`: exercises the GHCR path (build-parallel, by-digest, cache, manifest, `:latest` guard). `stage`/`deploy` are skipped (non-main).
- […] `main` run for: ECR push, the `stage` job (`:staging` guard), and the ECS redeploy; these cannot be exercised beforehand.

Demo
n/a (CI / infra change).
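For reviewers who want to reason about the forward-only rule offline, the decision made by the `tag-if-newer` guard (advance only onto a strict git-descendant of the currently-tagged commit) can be sketched as follows. This is an illustration under assumptions, not the action's implementation: commit history is modeled as an in-memory child-to-parents mapping, and `is_strict_descendant` and `should_advance` are hypothetical names. The real action resolves the current commit from the `org.opencontainers.image.revision` label and works against actual git history, where a check such as `git merge-base --is-ancestor` would play the role of the graph walk here.

```python
from typing import Dict, List, Optional

# Hypothetical history model: each commit maps to the list of its parents.
History = Dict[str, List[str]]


def is_strict_descendant(history: History, candidate: str, current: str) -> bool:
    """True if `current` is reachable from `candidate` via parents and they differ."""
    if candidate == current:
        return False  # same commit: not a *strict* descendant
    stack = list(history.get(candidate, []))
    seen = set()
    while stack:
        commit = stack.pop()
        if commit == current:
            return True
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(history.get(commit, []))
    return False


def should_advance(history: History, candidate: str, current: Optional[str]) -> bool:
    """Decide whether a mutable tag may move onto `candidate`."""
    # Assumption taken from the PR text: an image without a revision label
    # is advanced once unconditionally.
    if current is None:
        return True
    return is_strict_descendant(history, candidate, current)


if __name__ == "__main__":
    # Linear main history a <- b <- c, plus a parallel commit x branching from a.
    history: History = {"a": [], "b": ["a"], "c": ["b"], "x": ["a"]}
    for candidate, current in [("c", "b"), ("b", "c"), ("x", "b"), ("c", "c"), ("b", None)]:
        print(candidate, current, should_advance(history, candidate, current))
```

In this model an older run (`b` while the tag is on `c`), a parallel run (`x` while the tag is on `b`), and a re-run of the same commit all leave the tag where it is; only a genuinely newer commit, or a tag with no recorded revision, moves it.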