Skip to content

feat(phase-24b): ACI hybrid burst overflow (dormant)#5

Merged
msrivas-7 merged 6 commits intomainfrom
phase-24b/aci-hybrid-overflow
May 1, 2026
Merged

feat(phase-24b): ACI hybrid burst overflow (dormant)#5
msrivas-7 merged 6 commits intomainfrom
phase-24b/aci-hybrid-overflow

Conversation

@msrivas-7
Copy link
Copy Markdown
Owner

@msrivas-7 msrivas-7 commented May 1, 2026

Summary

  • ACI overflow feature: 14 local sessions → spillover to Azure Container Instances. Three audit passes (28 + 20 + 2 zero-trust findings, all addressed).
  • Ships dormant: .env.production.example defaults ENABLE_ACI_OVERFLOW=0. Factory falls back to plain LocalDocker; HybridBackend never constructed when flag off or Azure config missing.
  • Activation requires one-time Owner-level Bicep apply with manageRoleAssignments=true (currently MFA-gated for the operator); KV secrets already seeded.

Test plan

  • CI green (826 backend + 298 frontend + agent smoke + bicep)
  • Apply 3 new Supabase migrations to dev → smoke admin route
  • Apply 3 new Supabase migrations to prod
  • Squash-merge → deploy.yml builds + redeploys VM
  • Verify /api/health/deep returns aci: "disabled", aciCostTrackerState: "hydrated"
  • When IAM unblocks: run deploy-infra.yml with manageRoleAssignments=true, seed ACI-SUBNET-ID in KV, flip ENABLE-ACI-OVERFLOW=1, refresh-env, admin-panel toggle

When concurrent runner sessions exceed the local Docker cap of 14, the
15th+ session routes to Azure Container Instances instead of returning
503. Local stays for the first 14 (sub-second spawn); ACI absorbs
overflow at 5–15s cold start. Default-off; admin flips master toggle
via Settings → Admin → Project caps post-deploy.

Backend: AciExecutionBackend (full ExecutionBackend impl + sidecar HTTP
agent for exec/writeFiles/etc), HybridBackend wrapper that dispatches
on handle.__kind, dynamic effectiveCap that collapses to local on
kill-switch trip, per-session event-based cost tracker with UTC
midnight rollover, warm pool (default-disabled, hysteresis at
local=12/10), 4 admin-editable system_config keys for runtime tuning.

Sidecar agent: 305-LOC ESM in the runner image, bearer-token auth
piped via stdin (token never enters /proc/<pid>/environ), bash -c
ulimit wrap for PidsLimit + nofile parity with LocalDocker, full
symlink defense + 1MB output cap + SIGKILL timeout.

Infra: dedicated ACI subnet (10.20.2.0/24, delegated, NSG-locked
egress-deny), custom "Codetutor ACI Executor" role (no Contributor
fanout), cost-cap scheduled-query alert.

81 new Phase 24B tests (792 backend total, 298 frontend, 13 Bicep
modules build clean). All security-suite claims (C1-C13) preserved
on ACI via container spec or agent-level enforcement; C2 (network=none)
via subnet NSG egress-deny.

Operator activation runbook in ops/PHASE-24B-RUNBOOK.md (gitignored).
Code is safe to merge ahead of Azure-side IAM + KV seeding — factory
falls back to local-only when config is incomplete (logs
aci_overflow_off, reason:missing_config; behavior identical to today).
Staff-level Security + SRE audit pass on the ACI hybrid burst overflow.
Closes every P0/P1 in code; P2/P3 either landed or deferred with
rationale (post-launch metrics, runbook-only, or e2e infra needed).

P0 — block activation
- P0-1 persist aciCostTracker to a singleton Postgres row (boot
  hydrate + write-through with coalescing). Hydrated active sessions
  are rolled into completedTodayUsd at init since they're orphans
  by definition post-restart.
- P0-2 atomic tryReserve/cancelReservation closes the concurrent-spawn
  cap-bypass race. Admin upper-bound on aci_max_overflow lowered to 50.
- P0-3 tryDelete now awaits ARM via beginDeleteAndWait (30s budget);
  boot-time orphan reaper deletes codetutor-ai-aci-* groups, with a
  race-protection check against the live active map + warm pool.
- P0-4 awaitFirstRefresh blocks boot until the operational-config
  mirror has DB-confirmed values; stale-refresh watchdog forces
  enabled=false. aciConfigRefreshAgeMs surfaces on deep-health.
  .env.production default flipped to ENABLE_ACI_OVERFLOW=0.
- P0-5 NSG-drift Activity Log alert (Bicep). Deleted the redundant
  AllowReplyToVmSubnet egress rule + 4 static NSG-invariant tests
  as the CI counterpart of the requested S1f runtime probe.

P1 — must fix before sustained activation
- P1-1 BEFORE trigger guard_system_config_writes + dedicated
  app_system_config_writer role. Admin route opts in via SET LOCAL
  inside a transaction.
- P1-2 bearer-token compare uses SHA-256 hash on both sides — closes
  the constant-time length-leak side channel.
- P1-3 agent safeResolveAbs aligned with backend's safeResolve: NUL
  byte reject, backslash normalization, segment-leading-`-` reject.
- P1-4 post-mkdir realpath assertion closes the parent-dir TOCTOU
  window. WORKSPACE canonicalized at boot for symmetric comparison.
- P1-5 split into RG-scoped Executor + subnet-scoped Joiner roles;
  networkProfiles drops delete; subnets/join scoped to ACI subnet
  not RG-wide.
- P1-6/P3-6 aciCostSampler reads live operational cap, not env.
- P1-7 HybridBackend counter race fix (increment before await);
  aci_counter_drift gauge surfaces lifecycle-invariant breakage.
- P1-8 getAciStatus 30s cache + 3s ARM timeout.
- P1-9 aci_spawn_attempts counter + aci_spawn_duration histogram;
  isAlive enriched with agent /health probe (1s budget).

P2/P3 — landed
- P2-2 warm-pool watermarks (high/low/max) as system_config keys.
- P2-3 aci_op_config_loaded log on first refresh shows db|env source.
- P2-5 warm-pool tick exponential backoff on ARM 429.
- P2-6 RG monthly budget $80 → $400 to absorb ACI overflow envelope.
- P3-1 per-route body caps in agent + JSON reviver against prototype
  pollution + body.env capped at 64×4KB.
- P3-2 alive-cache for ACI handles (30s TTL) reduces ARM call pile-on
  under kickReap fan-out.
- P3-5 warm-pool tick gates on overflow_enabled.

P2/P3 — deferred (with rationale)
- P2-1 cold-start budget tuning, P2-4 LA daily-ingest cap — need
  post-launch P99 metrics from the histogram landed in P1-9.
- P2-7 runbook math fix — runbook is gitignored.
- P3-3 environ regression test, P3-4 webtest body-shape — need
  Azure-backed e2e or post-deploy verification.

New migrations to apply alongside this commit:
- 20260430050000_aci_cost_state.sql
- 20260430060000_system_config_writer_gate.sql

Backend tests 798 → 820 (+22). All green; typecheck + Bicep + agent
smoke clean.
Second staff-level Security + SRE audit pass on the post-fix state of
the ACI hybrid burst overflow. Closes every P0–P3 finding either with
code or with documentation rationale.

P0 — block activation
- P0-1 absolute /usr/bin/timeout + denylist PATH/IFS/LD_*/DYLD_* in
  body.env so a learner shim under /workspace can't hijack the
  spawn-target lookup. C12 (wall-clock kill) parity restored.
- P0-2 hydrationState: "hydrated" | "degraded" on AciCostTracker.
  init() failure flips to "degraded"; exceedsDailyCap + tryReserve
  fail-closed when not hydrated. Pre-fix, a DB outage at boot left
  spend at $0 with the gate open, letting overflow spawn unbounded.
  Surfaced as `aciCostTrackerState` on /api/health/deep.

P1 — must fix before sustained activation
- P1-1 process.stdin.destroy() after readTokenFromStdin so fd 0 is
  closed and the heredoc-backed token can't be recovered via
  /proc/<agent_pid>/fd/0 by same-UID learner code.
- P1-2 BEFORE TRUNCATE statement-level trigger on system_config (row
  triggers don't fire on TRUNCATE) + REVOKE TRUNCATE from PUBLIC and
  service_role. Migration 20260430070000.
- P1-3 watchdog now also zeros dailyUsdCap + maxOverflow when stale,
  not just enabled. Sampler/factory/admin reads stay aligned during
  refresh-stale windows.
- P1-4 hard 8 s timeout on refreshOnce so a wedged DB query can't
  pin inFlightRefresh and block subsequent invalidate() calls
  (operator emergency kill switch).
- P1-5 SET LOCAL statement_timeout = 5s inside system_config writes
  + outer Promise.race 5 s budget so admin route can't hang on
  pool stall.
- P1-6 cross-midnight billing math fix in init(): pre-midnight
  portion of orphan elapsed time correctly stays in yesterday's
  closed bucket, post-midnight portion credited to today.
- P1-7 dropped the dead static absoluteSessionCap return path in the
  hybrid-wired case; the hybrid backend's effectiveCap() is the
  source of truth post-admin-tune.

P2 — should fix in follow-up
- P2-1 agent-side /tmp size cap (64 MB) precheck on every /exec —
  C10 parity backfill since ACI emptyDir has no per-mount byte cap.
- P2-2 admin route rejects watermark writes that would invert
  low > high; the warm-pool tick would otherwise flap every cycle.
- P2-3 isAlive's /health probe no longer sends Authorization header
  to the unauth endpoint (forward-looking risk if agent ever logs
  request headers).
- P2-4 periodic prune of aliveCache (5-min sweep, 60 s TTL × 2
  cutoff) + clear-on-reap so the map doesn't grow monotonically
  across long-lived processes.
- P2-5 reapOrphans delayed 30 s post-boot + adaptive concurrency
  (1-wide if active sessions, 5-wide if idle) so the reaper's ARM
  delete pressure doesn't overlap initial traffic.
- P2-6 listen-first bootstrap order: app.listen() binds before the
  init awaits; /api/health returns 503 with `ready: false` until
  bootReady flips. Webtest probes get a graceful answer instead of
  connection-refused during the 1–5 s init window.

P3 — backlog
- P3-1 metrics endpoint comment updated to reflect the existing
  routes/metrics.ts gateMetrics (token or loopback). Label-based
  fingerprint is bounded by the existing access gate.
- P3-2 sanitizeForArm: lowercase + underscore→hyphen on session ids
  used in container-group names. Azure ACI requires
  [a-z0-9-]+; pre-fix nanoid uppercase IDs would have been rejected
  by ARM as InvalidContainerGroupName.
- P3-3 + P3-5 new aciHealthSampler emits aci_counter_drift and
  aci_op_config_watchdog_engaged log lines per minute when active.
- P3-4 structured `evt:aci_spawn_failure` log on every cold-spawn
  fail (alongside the existing prom counter).
- New Bicep alerts: aci_counter_drift, aci_spawn_failure_rate,
  aci_op_config_watchdog (all sev-2, scheduled-query rules).

New migration to deploy alongside this commit:
- 20260430070000_system_config_truncate_gate.sql — TRUNCATE gate +
  REVOKE on system_config.

Backend tests 820 → 826 (+6 across hydration states, watchdog
zeroing, refresh hard-timeout, cross-midnight billing). All green;
typecheck + Bicep + agent smoke clean.
Third-pass zero-trust audit produced 19 findings; landing the two
that are quick code wins. The other 17 (TLS for agent traffic, dual
DB pool, HMAC-with-nonce on /exec, etc.) are architectural and
deferred to a product-wide zero-trust audit later.

Z-P1-1 — Eliminate dash heredoc temp-file path in runner entrypoint.
- runner-image/agent/runner-entrypoint.sh: shebang #!/bin/sh →
  #!/bin/bash. dash backs heredocs with a real /tmp temp file
  (mkstemp), creating a brief filesystem-accessible path during the
  unset → exec window. bash 5+ uses anonymous pipes for heredocs (no
  filesystem path). The /proc/<sh_pid>/fd/<n> window still exists for
  microseconds during exec setup but is unreachable in our deployment
  (no concurrent same-UID processes at entrypoint stage).
- runner-image/Dockerfile already installs bash; runner user shell is
  /bin/bash, so the runtime dependency is already there.

Z-P2-1 — Drop bearer-bearing /fileExists auth probe from waitForAgent.
- aci.ts:874-913: previously polled unauth /health, then once a 200
  landed, fired an authed GET to /fileExists?path=__health_probe to
  prove the bearer was wired correctly. The plaintext-HTTP bearer
  shipped at the moment the container's network namespace is most
  likely to be in a transitional state (DNS race during cold-start,
  potentially visible to ARM diagnostic infrastructure).
- Drop the auth probe entirely: spawnContainerGroup returns once the
  agent's unauth /health is 200. Bearer-mismatch detection moves from
  cold-start to the first real /exec, which surfaces as a sidecarFetch
  401 + spawn-failure log line. Acceptable trade — token-mismatch is a
  regression-class issue caught in CI before deploy, not an expected
  runtime condition.

Test fix (unrelated to zero-trust): the existing exceedsDailyCap test
called the gate with no explicit `now`, so it used Date.now(). Once
the wall-clock UTC date crossed past the forced currentDateUtc
("2026-04-30"), maybeRollover zeroed completedTodayUsd and the test
flapped. Pass an explicit `now` so the assertion is date-stable.

All 826 backend tests + 298 frontend tests + agent smoke + bicep
clean.
Phase 24B's bicep + backend code shipped in the merge, but
refresh-env.sh didn't know about the new env vars. Without these
fetch_optional lines, KV-seeded values for ENABLE_ACI_OVERFLOW,
AZURE_*, ACI_* would never propagate into /opt/codetutor/.env, so
the operator's eventual KV-seeding step would silently no-op.

All keys are fetch_optional: missing → empty string → backend's
factory short-circuits to LocalDocker. Seeding the secrets early
(with ENABLE-ACI-OVERFLOW=0) is now safe; activation flips that
single secret to 1 alongside the IAM bootstrap step.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://gentle-flower-093ba7e0f-5.eastus2.7.azurestaticapps.net

Azure scheduledQueryRules reject windowSize > 2880 minutes
(48 hours). The previous `P3D` value caused
deploy-infra.yml apply to fail with InvalidRequestContent. Lower
to `P2D` and adjust the evaluationFrequency comment to match —
PT12H × P2D = 4 fires max, threshold=3 still represents
"persistent catch-up" semantically.

Pre-existing bug surfaced when Phase 24B's deploy-infra workflow
ran for the first time in a while; this alert was likely never
applied because of the same error in earlier deploys.
@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 1, 2026

Azure Static Web Apps: Your stage site is ready! Visit it here: https://gentle-flower-093ba7e0f-5.eastus2.7.azurestaticapps.net

@msrivas-7 msrivas-7 merged commit dbe7023 into main May 1, 2026
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant