v0.3.0: Fix runtime protocol, add production features and API complet… #2

Merged
ajit-zer07 merged 1 commit into main from fix-runtime-protocol
Mar 21, 2026

@ajit-zer07
Contributor


PR description:

Summary

  • Fix fundamental StreamSession protocol mismatch (P0): The control plane was using separate
    unary Send + read-only StreamSession RPCs, but the runtime expects a single bidirectional stream.
    Introduced openSession() which opens one bidi stream for session creation, kickoff messages,
    and event consumption.
  • Fix Signal rejection (P0): Runtime requires empty session_id and mode for Signal messages.
    Removed unsupported POST /runs/:id/context endpoint.
  • Add production correctness fixes (P1): Stream health tracking, schema versioning on events,
    timer cleanup, session TTL persistence, runtime capability storage.
  • Add API completeness (P2): Pagination metadata, run deletion/archival, audit log read endpoint,
    circuit breaker admin reset, run cloning.
  • Add observability & notifications (P2-P3): Webhook notifications with HMAC-SHA256 signing and
    retry on run state changes, batch cancel/export, JSONL export format, Prometheus circuit breaker
    metrics.
  • Clean up tech debt (P3): Proto sync script, fix artifact retrieval, improve event normalizer
    entity ID extraction.
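
The JSONL export format mentioned above writes one JSON object per line. A minimal sketch of that serialization, assuming a hypothetical `RunExportRow` shape and `toJsonl` helper (neither name comes from this PR):

```typescript
// Hypothetical row shape for illustration; the real export fields are not shown in this PR.
interface RunExportRow {
  id: string;
  status: string;
}

// Serialize each row as a single JSON line, terminated by a newline.
function toJsonl(rows: readonly RunExportRow[]): string {
  return rows.map((row) => JSON.stringify(row) + "\n").join("");
}

const jsonl = toJsonl([
  { id: "run-1", status: "completed" },
  { id: "run-2", status: "failed" },
]);
```

Because every line is a complete JSON document, a consumer can process the export as a stream without loading the whole file.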

Changes

42 files changed, 1249 insertions, 140 deletions

New files (12)

| File | Purpose |
| --- | --- |
| drizzle/0004_session_ttl.sql | Add expires_at to runtime_sessions |
| drizzle/0005_webhooks.sql | Create webhooks table |
| scripts/proto-sync.sh | Sync proto files from runtime repo |
| src/controllers/admin.controller.ts | Circuit breaker reset endpoint |
| src/controllers/audit.controller.ts | Audit log read endpoint |
| src/controllers/webhook.controller.ts | Webhook CRUD endpoints |
| src/dto/clone-run.dto.ts | Clone run request DTO |
| src/dto/list-audit-query.dto.ts | Audit list query DTO |
| src/dto/paginated-response.dto.ts | Generic paginated response DTO |
| src/dto/webhook.dto.ts | Webhook creation DTO |
| src/webhooks/webhook.repository.ts | Webhook persistence |
| src/webhooks/webhook.service.ts | Webhook dispatch with HMAC + retry |

Key modified files

| File | Changes |
| --- | --- |
| src/contracts/runtime.ts | Add RuntimeSessionHandle, openSession(), RuntimeCapabilities |
| src/runtime/rust-runtime.provider.ts | Implement openSession(), add ack error checking |
| src/runs/run-executor.service.ts | Use openSession() flow, fix signal sending, add clone |
| src/runs/stream-consumer.service.ts | Accept session handle, track connected state |
| src/runs/run-manager.service.ts | Add delete/archive, pagination, webhook firing |
| src/storage/run.repository.ts | Add listCount, delete, addTag, includeArchived filter |
| src/controllers/runs.controller.ts | Remove context endpoint, add delete/archive/clone |

Test plan

  • npx tsc --noEmit — 0 errors
  • npm run lint — 0 errors (98 pre-existing warnings)
  • npm test — 247 tests passed across 20 suites (up from 244)
  • Integration test with actual Rust runtime (Phase 1 — critical path)
  • Verify SSE stream receives events end-to-end
  • Verify signal delivery with empty session_id/mode
  • Run database migrations (0004, 0005) against staging
  • Test webhook delivery with a real endpoint

…eness

  Phase 1 (P0) — Fix StreamSession integration:
  - Add RuntimeSessionHandle and openSession() for unified bidirectional gRPC streams
  - Rewrite RunExecutor to create session + send kickoff through single stream
  - Update StreamConsumer to accept session handle, fall back to streamSession() on reconnect
  - Update MockRuntimeProvider with openSession() implementation
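
The Phase 1 flow (session creation, kickoff, and event consumption over one stream) can be sketched with an in-memory stand-in for the runtime. The frame and event shapes below are hypothetical, and this `openSession()` is a simplified callback-based illustration, not the signature defined in src/contracts/runtime.ts:

```typescript
// Hypothetical frame and event shapes; the real proto messages may differ.
type ClientFrame =
  | { kind: "session_start"; runId: string }
  | { kind: "kickoff"; payload: string };

interface RuntimeEvent {
  type: string;
  runId: string;
  detail?: string;
}

interface RuntimeSessionHandle {
  send(frame: ClientFrame): void;
  close(): void;
}

// In-memory stand-in for the runtime end of a bidirectional stream:
// every frame sent on the handle produces events on the same handle.
function openSession(onEvent: (event: RuntimeEvent) => void): RuntimeSessionHandle {
  let activeRunId: string | null = null;
  let open = true;
  return {
    send(frame: ClientFrame): void {
      if (!open) throw new Error("stream closed");
      if (frame.kind === "session_start") {
        const runId = frame.runId;
        activeRunId = runId;
        onEvent({ type: "session.created", runId });
        return;
      }
      const runId = activeRunId;
      if (runId === null) throw new Error("kickoff sent before session_start");
      onEvent({ type: "message.received", runId, detail: frame.payload });
    },
    close(): void {
      open = false;
    },
  };
}

// One handle carries session creation, the kickoff message, and the events.
const received: RuntimeEvent[] = [];
const session = openSession((event) => received.push(event));
session.send({ kind: "session_start", runId: "run-1" });
session.send({ kind: "kickoff", payload: "hello" });
session.close();
```

The point of the single handle is ordering: the kickoff cannot race ahead of session creation, which was possible when a unary Send and a read-only stream were opened separately.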

  Phase 2 (P0) — Fix Signal & ContextUpdate:
  - Send signals with empty session_id/mode per runtime validation rules
  - Remove POST /runs/:id/context (runtime doesn't support ContextUpdate)
  - Add Ack error checking in send() and startSession()
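
A minimal sketch of the Phase 2 rules, assuming hypothetical `Envelope` and `Ack` shapes (the real proto fields are not shown in this PR). `fakeRuntimeSend` only imitates the validation rule described above so the ack check has something to reject:

```typescript
// Hypothetical shapes, for illustration only.
interface Envelope {
  messageType: string;
  sessionId: string;
  mode: string;
  payload: string;
}

interface Ack {
  ok: boolean;
  error?: string;
}

// Signals are built with empty session_id and mode, per the runtime rule.
function buildSignal(payload: string): Envelope {
  return { messageType: "Signal", sessionId: "", mode: "", payload };
}

// Surface a negative ack as an error instead of silently ignoring it.
function assertAck(ack: Ack): void {
  if (!ack.ok) {
    throw new Error(`runtime rejected message: ${ack.error ?? "unknown error"}`);
  }
}

// Stand-in for the runtime's validation of Signal messages.
function fakeRuntimeSend(envelope: Envelope): Ack {
  const isSignal = envelope.messageType === "Signal";
  if (isSignal && (envelope.sessionId !== "" || envelope.mode !== "")) {
    return { ok: false, error: "Signal must have empty session_id and mode" };
  }
  return { ok: true };
}

// Accepted: a signal built with the empty fields.
assertAck(fakeRuntimeSend(buildSignal("pause")));
```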

  Phase 3 (P1) — Production correctness:
  - Fix StreamConsumer.isHealthy() to track per-stream connected state
  - Set schemaVersion on control-plane-emitted events
  - Add .unref() to StreamHub cleanup timer
  - Add RuntimeCapabilities storage in provider registry
  - Add expires_at column to runtime_sessions (migration 0004)
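
Per-stream connected state, as in the isHealthy() fix above, could be tracked roughly as follows. The `StreamHealthTracker` class and its method names are hypothetical and only illustrate the idea of keying health by run rather than holding one global flag:

```typescript
// Hypothetical tracker; the actual StreamConsumer internals are not shown in this PR.
class StreamHealthTracker {
  private readonly connected = new Map<string, boolean>();

  onConnect(runId: string): void {
    this.connected.set(runId, true);
  }

  onDisconnect(runId: string): void {
    this.connected.set(runId, false);
  }

  // Forget streams that ended normally so they do not count as unhealthy.
  onEnd(runId: string): void {
    this.connected.delete(runId);
  }

  isHealthy(runId: string): boolean {
    return this.connected.get(runId) === true;
  }

  unhealthyStreams(): string[] {
    return Array.from(this.connected.entries())
      .filter(([, up]) => !up)
      .map(([runId]) => runId);
  }
}

const tracker = new StreamHealthTracker();
tracker.onConnect("run-1");
tracker.onConnect("run-2");
tracker.onDisconnect("run-2");
```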

  Phase 4 (P2) — API completeness:
  - Add pagination metadata (data/total/limit/offset) to GET /runs and GET /audit
  - Add DELETE /runs/:id (terminal only), POST /runs/:id/archive
  - Add GET /audit with filtering
  - Add POST /admin/circuit-breaker/reset
  - Add POST /runs/:id/clone
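
The pagination metadata (data/total/limit/offset) can be illustrated with a small generic helper. `PaginatedResponse` mirrors the field names listed above; the `paginate` function is a hypothetical in-memory stand-in, whereas a real implementation would presumably take `total` from a count query such as listCount:

```typescript
// Field names follow the PR description; the generic wrapper itself is a sketch.
interface PaginatedResponse<T> {
  data: T[];
  total: number;
  limit: number;
  offset: number;
}

// In-memory pagination, for illustration only.
function paginate<T>(items: readonly T[], limit: number, offset: number): PaginatedResponse<T> {
  return {
    data: items.slice(offset, offset + limit),
    total: items.length,
    limit,
    offset,
  };
}

const page = paginate(["run-1", "run-2", "run-3", "run-4", "run-5"], 2, 2);
```

Returning `total` alongside `limit` and `offset` lets a client compute whether more pages exist without issuing a second request.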

  Phase 5 (P2-P3) — Observability & notifications:
  - Add webhook system with HMAC-SHA256 signing, retry, and management endpoints
  - Fire webhooks on run.started/completed/failed/cancelled
  - Add batch cancel/export endpoints
  - Add JSONL export format (?format=jsonl)
  - Add Prometheus circuit breaker metrics (failures_total, success_total)
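
HMAC-SHA256 signing of a webhook payload can be sketched with Node's built-in crypto module. The function names, the hex encoding, and the exponential backoff schedule are assumptions made for illustration; this PR does not state the header format or the retry policy that webhook.service.ts uses:

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Sign the raw request body with the webhook's shared secret (hex digest assumed).
function signPayload(secret: string, body: string): string {
  return createHmac("sha256", secret).update(body).digest("hex");
}

// Receiver side: recompute the signature and compare in constant time.
function verifySignature(secret: string, body: string, signature: string): boolean {
  const expected = Buffer.from(signPayload(secret, body), "hex");
  const actual = Buffer.from(signature, "hex");
  return expected.length === actual.length && timingSafeEqual(expected, actual);
}

// Assumed retry schedule: delay doubles after each failed attempt.
function retryDelaysMs(attempts: number, baseMs: number): number[] {
  return Array.from({ length: attempts }, (_, i) => baseMs * 2 ** i);
}

const body = JSON.stringify({ event: "run.completed", runId: "run-1" });
const signature = signPayload("shared-secret", body);
```

Signing the exact bytes that are sent matters: if the receiver re-serializes the JSON before verifying, key order or whitespace differences will change the digest.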

  Phase 6 (P3) — Tech debt:
  - Add npm run proto:sync script
  - Fix InlineArtifactStorageProvider.retrieve() to read from DB
  - Improve normalizer deriveSubject() to extract entity IDs from decoded payloads
  - Update normalizer schemaVersion to use PROJECTION_SCHEMA_VERSION

  Migrations: 0004_session_ttl.sql, 0005_webhooks.sql

ajit-zer07 merged commit 176b836 into main on Mar 21, 2026
5 checks passed
ajit-zer07 deleted the fix-runtime-protocol branch on March 21, 2026 00:55
ajit-zer07 restored the fix-runtime-protocol branch on March 21, 2026 01:23
ajit-zer07 deleted the fix-runtime-protocol branch on March 22, 2026 18:44