feat(phase-24b): ACI hybrid burst overflow (dormant)#5
Merged
Conversation
When concurrent runner sessions exceed the local Docker cap of 14, the 15th+ session routes to Azure Container Instances instead of returning 503. Local stays for the first 14 (sub-second spawn); ACI absorbs overflow at 5–15s cold start. Default-off; admin flips master toggle via Settings → Admin → Project caps post-deploy. Backend: AciExecutionBackend (full ExecutionBackend impl + sidecar HTTP agent for exec/writeFiles/etc), HybridBackend wrapper that dispatches on handle.__kind, dynamic effectiveCap that collapses to local on kill-switch trip, per-session event-based cost tracker with UTC midnight rollover, warm pool (default-disabled, hysteresis at local=12/10), 4 admin-editable system_config keys for runtime tuning. Sidecar agent: 305-LOC ESM in the runner image, bearer-token auth piped via stdin (token never enters /proc/<pid>/environ), bash -c ulimit wrap for PidsLimit + nofile parity with LocalDocker, full symlink defense + 1MB output cap + SIGKILL timeout. Infra: dedicated ACI subnet (10.20.2.0/24, delegated, NSG-locked egress-deny), custom "Codetutor ACI Executor" role (no Contributor fanout), cost-cap scheduled-query alert. 81 new Phase 24B tests (792 backend total, 298 frontend, 13 Bicep modules build clean). All security-suite claims (C1-C13) preserved on ACI via container spec or agent-level enforcement; C2 (network=none) via subnet NSG egress-deny. Operator activation runbook in ops/PHASE-24B-RUNBOOK.md (gitignored). Code is safe to merge ahead of Azure-side IAM + KV seeding — factory falls back to local-only when config is incomplete (logs aci_overflow_off, reason:missing_config; behavior identical to today).
Staff-level Security + SRE audit pass on the ACI hybrid burst overflow. Closes every P0/P1 in code; P2/P3 either landed or deferred with rationale (post-launch metrics, runbook-only, or e2e infra needed). P0 — block activation - P0-1 persist aciCostTracker to a singleton Postgres row (boot hydrate + write-through with coalescing). Hydrated active sessions are rolled into completedTodayUsd at init since they're orphans by definition post-restart. - P0-2 atomic tryReserve/cancelReservation closes the concurrent-spawn cap-bypass race. Admin upper-bound on aci_max_overflow lowered to 50. - P0-3 tryDelete now awaits ARM via beginDeleteAndWait (30s budget); boot-time orphan reaper deletes codetutor-ai-aci-* groups, with a race-protection check against the live active map + warm pool. - P0-4 awaitFirstRefresh blocks boot until the operational-config mirror has DB-confirmed values; stale-refresh watchdog forces enabled=false. aciConfigRefreshAgeMs surfaces on deep-health. .env.production default flipped to ENABLE_ACI_OVERFLOW=0. - P0-5 NSG-drift Activity Log alert (Bicep). Deleted the redundant AllowReplyToVmSubnet egress rule + 4 static NSG-invariant tests as the CI counterpart of the requested S1f runtime probe. P1 — must fix before sustained activation - P1-1 BEFORE trigger guard_system_config_writes + dedicated app_system_config_writer role. Admin route opts in via SET LOCAL inside a transaction. - P1-2 bearer-token compare uses SHA-256 hash on both sides — closes the constant-time length-leak side channel. - P1-3 agent safeResolveAbs aligned with backend's safeResolve: NUL byte reject, backslash normalization, segment-leading-`-` reject. - P1-4 post-mkdir realpath assertion closes the parent-dir TOCTOU window. WORKSPACE canonicalized at boot for symmetric comparison. - P1-5 split into RG-scoped Executor + subnet-scoped Joiner roles; networkProfiles drops delete; subnets/join scoped to ACI subnet not RG-wide. - P1-6/P3-6 aciCostSampler reads live operational cap, not env. - P1-7 HybridBackend counter race fix (increment before await); aci_counter_drift gauge surfaces lifecycle-invariant breakage. - P1-8 getAciStatus 30s cache + 3s ARM timeout. - P1-9 aci_spawn_attempts counter + aci_spawn_duration histogram; isAlive enriched with agent /health probe (1s budget). P2/P3 — landed - P2-2 warm-pool watermarks (high/low/max) as system_config keys. - P2-3 aci_op_config_loaded log on first refresh shows db|env source. - P2-5 warm-pool tick exponential backoff on ARM 429. - P2-6 RG monthly budget $80 → $400 to absorb ACI overflow envelope. - P3-1 per-route body caps in agent + JSON reviver against prototype pollution + body.env capped at 64×4KB. - P3-2 alive-cache for ACI handles (30s TTL) reduces ARM call pile-on under kickReap fan-out. - P3-5 warm-pool tick gates on overflow_enabled. P2/P3 — deferred (with rationale) - P2-1 cold-start budget tuning, P2-4 LA daily-ingest cap — need post-launch P99 metrics from the histogram landed in P1-9. - P2-7 runbook math fix — runbook is gitignored. - P3-3 environ regression test, P3-4 webtest body-shape — need Azure-backed e2e or post-deploy verification. New migrations to apply alongside this commit: - 20260430050000_aci_cost_state.sql - 20260430060000_system_config_writer_gate.sql Backend tests 798 → 820 (+22). All green; typecheck + Bicep + agent smoke clean.
Second staff-level Security + SRE audit pass on the post-fix state of the ACI hybrid burst overflow. Closes every P0–P3 finding either with code or with documentation rationale. P0 — block activation - P0-1 absolute /usr/bin/timeout + denylist PATH/IFS/LD_*/DYLD_* in body.env so a learner shim under /workspace can't hijack the spawn-target lookup. C12 (wall-clock kill) parity restored. - P0-2 hydrationState: "hydrated" | "degraded" on AciCostTracker. init() failure flips to "degraded"; exceedsDailyCap + tryReserve fail-closed when not hydrated. Pre-fix, a DB outage at boot left spend at $0 with the gate open, letting overflow spawn unbounded. Surfaced as `aciCostTrackerState` on /api/health/deep. P1 — must fix before sustained activation - P1-1 process.stdin.destroy() after readTokenFromStdin so fd 0 is closed and the heredoc-backed token can't be recovered via /proc/<agent_pid>/fd/0 by same-UID learner code. - P1-2 BEFORE TRUNCATE statement-level trigger on system_config (row triggers don't fire on TRUNCATE) + REVOKE TRUNCATE from PUBLIC and service_role. Migration 20260430070000. - P1-3 watchdog now also zeros dailyUsdCap + maxOverflow when stale, not just enabled. Sampler/factory/admin reads stay aligned during refresh-stale windows. - P1-4 hard 8 s timeout on refreshOnce so a wedged DB query can't pin inFlightRefresh and block subsequent invalidate() calls (operator emergency kill switch). - P1-5 SET LOCAL statement_timeout = 5s inside system_config writes + outer Promise.race 5 s budget so admin route can't hang on pool stall. - P1-6 cross-midnight billing math fix in init(): pre-midnight portion of orphan elapsed time correctly stays in yesterday's closed bucket, post-midnight portion credited to today. - P1-7 dropped the dead static absoluteSessionCap return path in the hybrid-wired case; the hybrid backend's effectiveCap() is the source of truth post-admin-tune. P2 — should fix in follow-up - P2-1 agent-side /tmp size cap (64 MB) precheck on every /exec — C10 parity backfill since ACI emptyDir has no per-mount byte cap. - P2-2 admin route rejects watermark writes that would invert low > high; the warm-pool tick would otherwise flap every cycle. - P2-3 isAlive's /health probe no longer sends Authorization header to the unauth endpoint (forward-looking risk if agent ever logs request headers). - P2-4 periodic prune of aliveCache (5-min sweep, 60 s TTL × 2 cutoff) + clear-on-reap so the map doesn't grow monotonically across long-lived processes. - P2-5 reapOrphans delayed 30 s post-boot + adaptive concurrency (1-wide if active sessions, 5-wide if idle) so the reaper's ARM delete pressure doesn't overlap initial traffic. - P2-6 listen-first bootstrap order: app.listen() binds before the init awaits; /api/health returns 503 with `ready: false` until bootReady flips. Webtest probes get a graceful answer instead of connection-refused during the 1–5 s init window. P3 — backlog - P3-1 metrics endpoint comment updated to reflect the existing routes/metrics.ts gateMetrics (token or loopback). Label-based fingerprint is bounded by the existing access gate. - P3-2 sanitizeForArm: lowercase + underscore→hyphen on session ids used in container-group names. Azure ACI requires [a-z0-9-]+; pre-fix nanoid uppercase IDs would have been rejected by ARM as InvalidContainerGroupName. - P3-3 + P3-5 new aciHealthSampler emits aci_counter_drift and aci_op_config_watchdog_engaged log lines per minute when active. - P3-4 structured `evt:aci_spawn_failure` log on every cold-spawn fail (alongside the existing prom counter). - New Bicep alerts: aci_counter_drift, aci_spawn_failure_rate, aci_op_config_watchdog (all sev-2, scheduled-query rules). New migration to deploy alongside this commit: - 20260430070000_system_config_truncate_gate.sql — TRUNCATE gate + REVOKE on system_config. Backend tests 820 → 826 (+6 across hydration states, watchdog zeroing, refresh hard-timeout, cross-midnight billing). All green; typecheck + Bicep + agent smoke clean.
Third-pass zero-trust audit produced 19 findings; landing the two
that are quick code wins. The other 17 (TLS for agent traffic, dual
DB pool, HMAC-with-nonce on /exec, etc.) are architectural and
deferred to a product-wide zero-trust audit later.
Z-P1-1 — Eliminate dash heredoc temp-file path in runner entrypoint.
- runner-image/agent/runner-entrypoint.sh: shebang #!/bin/sh →
#!/bin/bash. dash backs heredocs with a real /tmp temp file
(mkstemp), creating a brief filesystem-accessible path during the
unset → exec window. bash 5+ uses anonymous pipes for heredocs (no
filesystem path). The /proc/<sh_pid>/fd/<n> window still exists for
microseconds during exec setup but is unreachable in our deployment
(no concurrent same-UID processes at entrypoint stage).
- runner-image/Dockerfile already installs bash; runner user shell is
/bin/bash, so the runtime dependency is already there.
Z-P2-1 — Drop bearer-bearing /fileExists auth probe from waitForAgent.
- aci.ts:874-913: previously polled unauth /health, then once a 200
landed, fired an authed GET to /fileExists?path=__health_probe to
prove the bearer was wired correctly. The plaintext-HTTP bearer
shipped at the moment the container's network namespace is most
likely to be in a transitional state (DNS race during cold-start,
potentially visible to ARM diagnostic infrastructure).
- Drop the auth probe entirely: spawnContainerGroup returns once the
agent's unauth /health is 200. Bearer-mismatch detection moves from
cold-start to the first real /exec, which surfaces as a sidecarFetch
401 + spawn-failure log line. Acceptable trade — token-mismatch is a
regression-class issue caught in CI before deploy, not an expected
runtime condition.
Test fix (unrelated to zero-trust): the existing exceedsDailyCap test
called the gate with no explicit `now`, so it used Date.now(). Once
the wall-clock UTC date crossed past the forced currentDateUtc
("2026-04-30"), maybeRollover zeroed completedTodayUsd and the test
flapped. Pass an explicit `now` so the assertion is date-stable.
All 826 backend tests + 298 frontend tests + agent smoke + bicep
clean.
Phase 24B's bicep + backend code shipped in the merge, but refresh-env.sh didn't know about the new env vars. Without these fetch_optional lines, KV-seeded values for ENABLE_ACI_OVERFLOW, AZURE_*, ACI_* would never propagate into /opt/codetutor/.env, so the operator's eventual KV-seeding step would silently no-op. All keys are fetch_optional: missing → empty string → backend's factory short-circuits to LocalDocker. Seeding the secrets early (with ENABLE-ACI-OVERFLOW=0) is now safe; activation flips that single secret to 1 alongside the IAM bootstrap step.
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://gentle-flower-093ba7e0f-5.eastus2.7.azurestaticapps.net |
Azure scheduledQueryRules reject windowSize > 2880 minutes (48 hours). The previous `P3D` value caused deploy-infra.yml apply to fail with InvalidRequestContent. Lower to `P2D` and adjust the evaluationFrequency comment to match — PT12H × P2D = 4 fires max, threshold=3 still represents "persistent catch-up" semantically. Pre-existing bug surfaced when Phase 24B's deploy-infra workflow ran for the first time in a while; this alert was likely never applied because of the same error in earlier deploys.
|
Azure Static Web Apps: Your stage site is ready! Visit it here: https://gentle-flower-093ba7e0f-5.eastus2.7.azurestaticapps.net |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
.env.production.exampledefaultsENABLE_ACI_OVERFLOW=0. Factory falls back to plain LocalDocker; HybridBackend never constructed when flag off or Azure config missing.manageRoleAssignments=true(currently MFA-gated for the operator); KV secrets already seeded.Test plan
deploy.ymlbuilds + redeploys VM/api/health/deepreturnsaci: "disabled",aciCostTrackerState: "hydrated"deploy-infra.ymlwithmanageRoleAssignments=true, seedACI-SUBNET-IDin KV, flipENABLE-ACI-OVERFLOW=1, refresh-env, admin-panel toggle