docs(adr): ECS Control Plane — CRD+Operator pattern #850
Proposes a Kubernetes-style reconciliation loop targeting ECS:
- Declarative YAML manifests in S3 as desired state
- Controller reconciles against ECS API
- DynamoDB for status tracking and leader election
- Phased MVP approach starting with poll-based single instance
4e3137f to 83c3d31
All PRs must reference a prior Discord discussion to ensure community alignment before implementation. Please edit the PR description to include a link to that discussion. This PR will be automatically closed in 3 days if the link is not added.
OpenAB PR Screening
This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Screening report
Done. Posted screening comment and moved the project item to `PR-Screening`.
GitHub comment: #850 (comment)
Intent
This PR proposes an ADR for an ECS-native control plane that gives teams a Kubernetes-style declarative deployment model without requiring EKS. The operator-visible problem is that ECS deployments currently lack a first-class CRD/operator reconciliation pattern for desired state, drift detection, and self-healing behavior.
Feat
Docs / architecture work. It adds …
Who It Serves
Deployers and agent runtime operators. It also serves maintainers by creating a concrete architecture target before implementation work begins.
Rewritten Prompt
Add an ADR for an ECS control plane that models services as declarative manifests and reconciles them into ECS resources. Define manifest shape, storage model, reconciliation loop, observed-state sources, status persistence, leader election, failure handling, rollout phases, and alternatives considered. Clearly mark MVP scope versus later work.
Merge Pitch
This should move forward as an ADR if the team wants ECS to be a serious deployment substrate for OpenAB workloads. Code risk is low because this is documentation only, but architecture-direction risk is medium.
Best-Practice Comparison
OpenClaw and Hermes Agent both apply. The ADR aligns with durable state, reconciliation, leader ownership, and poll-based daemon behavior, but should be sharper on retry/backoff, run logs, atomic state transitions, delivery routing, and restart-safe reconcile work units.
Implementation Options
Comparison Table
Recommendation
Take the balanced path. Advance the ADR, but ask for a focused revision adding reconcile boundaries, idempotency rules, failure/status states, retry/backoff expectations, and minimal audit/run-log requirements.
- Remove DynamoDB dependency from Phase 1 (S3 strong consistency is sufficient)
- Add oabctl CLI section (apply, get, delete, diff, logs, wait)
- Clarify S3 bucket layout: manifests/ + status/ prefixes
- Restructure phases: Phase 1 pure S3, Phase 2 adds DDB for HA
…startup
OAuth tokens, config, steering, and memory all live in the bootstrap archive. No need for controller to provision secrets via external APIs.
… HOME
Controller only manages infra (container, compute, networking). Backend/model/channel config is inside the bootstrap archive's config.toml.
No abstraction layer — values map 1:1 to ECS task definition. Users already know Fargate cpu/memory combos.
Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager references into ECS TaskDefinition native secrets field. IAM scoped per-agent path. No token sharing between agents.
…rates config.toml
Controller understands config schema (channels, backend, steering, features). spec.config takes precedence over bootstrap archive's config.toml.
Addresses feedback from 擺渡/普渡/口渡:
- Fixed API identity: oab.dev/v1 OABService (no more inconsistency)
- Clear separation: bootstrapFrom=state, spec.config=config, secrets=SSM
- Tombstone deletion pattern (not raw S3 delete)
- Entrypoint wrapper (not init container — ECS has no such thing)
- Controller renders config to S3 artifact (not volume mount)
- Explicit generation/observedGeneration (not just S3 VersionId)
- replicas:1 enforced for bot-type agents
- Schema versioning & validation section
- Tightened Phase 1 scope with explicit out-of-scope list
385fd65 to 7b0dcd8
Full structured config in spec.config — controller renders to TOML. Enables schema validation at apply time, clean diffs, and controller decisions based on config (e.g. adapter ports).
Major changes:
- Unified API identity: oab.dev/v1 / OABService throughout
- bootstrapFrom = mutable state only (memory, KB); secrets never in archive
- Config delivery: controller renders config.toml to S3 artifact, startup wrapper downloads
- Replicas: reject >1 for WebSocket adapters (validation error)
- Secret rotation lifecycle with failure handling
- Controller upgrade strategy (Phase 1: brief gap; Phase 2: leader election)
- Explicit metadata.generation / status.observedGeneration (not S3 VersionId)
- Narrowed Phase 1 scope per review feedback
- Removed answered items from Open Questions
Scale horizontally by deploying more agents (each with own token), not by replicating one agent.
Enterprise fleet provisioning: one YAML → N agents auto-provisioned with Discord Bot registration, token storage, and ECS service creation. User only needs to paste OAuth URL to add bots to server.
2a7eba5 to 5a8dcda
Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic; platform-specific config lives in optional platform: overlay. Each controller reads only its own key, ignores the other.
ec36b0a to e9b25e1
<@845835116920307722> <@1490365068863606784> Review findings for PR #850:
The previous API identity, bootstrap/config split, and generation fixes look much better. I would keep this ADR focused on the ECS Phase 1 contract and move …
e9b25e1 to ef7beb2
- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)
a7db20c to 2754de2
…lidation
- Config artifacts now immutable per generation: artifacts/{ns}/{name}/{gen}/config.toml
- TaskDefinition pins CONFIG_ARTIFACT_PATH to exact generation (safe rolling updates)
- Platform overlay: strict-validate own keys, ignore other platform keys
2754de2 to a5e5f7b
- Fix startup order: bootstrap first, then config.toml overlay (prevents stale config)
- Discord auto-register: marked as future research (API doesn't exist publicly)
- Fleet uses pre-created bot credentials from SSM
- Controller observes ECS status and writes back (enables oabctl get/wait)
- oabctl snapshot must exclude config.toml
* docs(adr): add ECS Control Plane (CRD+Operator pattern)
Proposes a Kubernetes-style reconciliation loop targeting ECS:
- Declarative YAML manifests in S3 as desired state
- Controller reconciles against ECS API
- DynamoDB for status tracking and leader election
- Phased MVP approach starting with poll-based single instance
* docs(adr): simplify to S3-only state store, add oabctl CLI UX
- Remove DynamoDB dependency from Phase 1 (S3 strong consistency is sufficient)
- Add oabctl CLI section (apply, get, delete, diff, logs, wait)
- Clarify S3 bucket layout: manifests/ + status/ prefixes
- Restructure phases: Phase 1 pure S3, Phase 2 adds DDB for HA
* docs(adr): add capacityProvider (FARGATE/FARGATE_SPOT) and instance size selection
* docs(adr): add per-agent secrets via SSM Parameter Store reference
* docs(adr): controller provisions Discord bot token via API, stores in Secrets Manager
* docs(adr): add bootstrapFrom — restore agent HOME from S3 archive on startup
OAuth tokens, config, steering, and memory all live in the bootstrap
archive. No need for controller to provision secrets via external APIs.
* docs(adr): remove backend field — agent config lives in bootstrapFrom HOME
Controller only manages infra (container, compute, networking).
Backend/model/channel config is inside the bootstrap archive's config.toml.
* docs(adr): use direct cpu/memory instead of named sizes
No abstraction layer — values map 1:1 to ECS task definition.
Users already know Fargate cpu/memory combos.
* docs(adr): add multi-agent fleet example (5 Kiro + 3 CC + 2 Codex)
* docs(adr): add per-agent secret injection section
Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager
references into ECS TaskDefinition native secrets field. IAM scoped
per-agent path. No token sharing between agents.
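The per-agent secret wiring described above could look roughly like the following. This is a sketch under assumptions: the `/oab/agents/<agent>/` parameter path layout and both function names are hypothetical and not taken from the ADR. Only the `name`/`valueFrom` shape of the ECS container definition `secrets` field and the SSM parameter ARN format come from AWS itself.

```python
from __future__ import annotations


def _agent_prefix(agent: str, region: str, account_id: str) -> str:
    # Hypothetical path layout: one SSM subtree per agent.
    return f"arn:aws:ssm:{region}:{account_id}:parameter/oab/agents/{agent}"


def build_container_secrets(
    agent: str, secret_names: list[str], region: str, account_id: str
) -> list[dict[str, str]]:
    """Build the `secrets` list for an ECS container definition.

    Each entry tells ECS to inject the referenced parameter as an
    environment variable; the secret value never appears in the manifest.
    """
    prefix = _agent_prefix(agent, region, account_id)
    return [{"name": n, "valueFrom": f"{prefix}/{n}"} for n in secret_names]


def agent_secret_policy(agent: str, region: str, account_id: str) -> dict:
    """IAM statement scoped to one agent's subtree, so agents cannot read each other's secrets."""
    return {
        "Effect": "Allow",
        "Action": ["ssm:GetParameters"],
        "Resource": _agent_prefix(agent, region, account_id) + "/*",
    }
```

For example, `build_container_secrets("kiro-1", ["DISCORD_BOT_TOKEN"], "us-east-1", "123456789012")` yields a single entry whose `valueFrom` points at that agent's own parameter.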
* docs(adr): add structured spec.config — controller validates and generates config.toml
Controller understands config schema (channels, backend, steering, features).
spec.config takes precedence over bootstrap archive's config.toml.
* docs(adr): rewrite ECS Control Plane ADR incorporating team review
Addresses feedback from 擺渡/普渡/口渡:
- Fixed API identity: oab.dev/v1 OABService (no more inconsistency)
- Clear separation: bootstrapFrom=state, spec.config=config, secrets=SSM
- Tombstone deletion pattern (not raw S3 delete)
- Entrypoint wrapper (not init container — ECS has no such thing)
- Controller renders config to S3 artifact (not volume mount)
- Explicit generation/observedGeneration (not just S3 VersionId)
- replicas:1 enforced for bot-type agents
- Schema versioning & validation section
- Tightened Phase 1 scope with explicit out-of-scope list
* docs(adr): no LB for agents — they are outbound-only connections
* docs(adr): model config.toml as structured YAML in spec
Full structured config in spec.config — controller renders to TOML.
Enables schema validation at apply time, clean diffs, and controller
decisions based on config (e.g. adapter ports).
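Rendering structured config to TOML might be sketched as below. The ADR does not specify the renderer; `render_toml` and the example field names are hypothetical, and this version deliberately handles only bare keys, nested tables, strings, integers, booleans, and flat lists.

```python
import json
import re

_BARE_KEY = re.compile(r"^[A-Za-z0-9_-]+$")


def _key(k: str) -> str:
    # Sketch limitation: only TOML bare keys are accepted.
    if not _BARE_KEY.match(k):
        raise ValueError(f"unsupported key: {k!r}")
    return k


def _value(v) -> str:
    if isinstance(v, bool):  # check bool before int: bool is a subclass of int
        return "true" if v else "false"
    if isinstance(v, int):
        return str(v)
    if isinstance(v, str):
        # JSON string escaping is used as an approximation of TOML basic strings.
        return json.dumps(v, ensure_ascii=False)
    if isinstance(v, list):
        return "[" + ", ".join(_value(x) for x in v) + "]"
    raise TypeError(f"unsupported value type: {type(v).__name__}")


def render_toml(config: dict) -> str:
    """Render a nested dict as TOML: scalars first, then sub-tables."""
    out: list[str] = []

    def emit(table: dict, path: list[str]) -> None:
        if path:
            out.append(f"[{'.'.join(path)}]")
        for k, v in table.items():
            if not isinstance(v, dict):
                out.append(f"{_key(k)} = {_value(v)}")
        for k, v in table.items():
            if isinstance(v, dict):
                emit(v, path + [_key(k)])

    emit(config, [])
    return "\n".join(out) + "\n"
```

Because the output is deterministic for a given input, two generations of the same `spec.config` diff cleanly line by line.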
* docs(adr): incorporate all review feedback from 法師團隊
Major changes:
- Unified API identity: oab.dev/v1 / OABService throughout
- bootstrapFrom = mutable state only (memory, KB); secrets never in archive
- Config delivery: controller renders config.toml to S3 artifact, startup wrapper downloads
- Replicas: reject >1 for WebSocket adapters (validation error)
- Secret rotation lifecycle with failure handling
- Controller upgrade strategy (Phase 1: brief gap; Phase 2: leader election)
- Explicit metadata.generation / status.observedGeneration (not S3 VersionId)
- Narrowed Phase 1 scope per review feedback
- Removed answered items from Open Questions
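The apply-time validation rules listed above could be sketched as follows. The manifest field layout (`spec.replicas`, `spec.config.channels`) and the contents of `WEBSOCKET_ADAPTERS` are assumptions made for illustration; only the `oab.dev/v1` / `OABService` identity and the rule of rejecting more than one replica for WebSocket adapters come from the commit message.

```python
from __future__ import annotations

# Assumption: which channel adapters hold a single WebSocket session per token.
WEBSOCKET_ADAPTERS = {"discord"}


def validate_manifest(manifest: dict) -> list[str]:
    """Return a list of validation errors; an empty list means the manifest is accepted."""
    errors: list[str] = []
    if manifest.get("apiVersion") != "oab.dev/v1":
        errors.append("apiVersion must be 'oab.dev/v1'")
    if manifest.get("kind") != "OABService":
        errors.append("kind must be 'OABService'")

    spec = manifest.get("spec", {})
    replicas = spec.get("replicas", 1)
    channels = spec.get("config", {}).get("channels", {})
    websocket_channels = sorted(WEBSOCKET_ADAPTERS & set(channels))
    if replicas > 1 and websocket_channels:
        errors.append(
            f"spec.replicas={replicas} rejected: WebSocket adapter(s) "
            f"{websocket_channels} allow only one instance per agent"
        )
    return errors
```

Returning all errors at once, rather than raising on the first, lets a CLI report every problem in a single apply attempt.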
* docs(adr): agents are always single-instance, no LB
Scale horizontally by deploying more agents (each with own token),
not by replicating one agent.
* docs(adr): add OABFleet kind and Discord auto-registration flow
Enterprise fleet provisioning: one YAML → N agents auto-provisioned
with Discord Bot registration, token storage, and ECS service creation.
User only needs to paste OAuth URL to add bots to server.
* docs(adr): add Multi-Runtime Support section (ECS + K8s)
Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic;
platform-specific config lives in optional platform: overlay. Each
controller reads only its own key, ignores the other.
* docs(adr): address 普渡 review feedback
- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)
* docs(adr): address 口渡 review - immutable config artifacts + strict validation
- Config artifacts now immutable per generation: artifacts/{ns}/{name}/{gen}/config.toml
- TaskDefinition pins CONFIG_ARTIFACT_PATH to exact generation (safe rolling updates)
- Platform overlay: strict-validate own keys, ignore other platform keys
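A minimal sketch of the two rules above, assuming hypothetical helper names. The artifact key format is the one quoted in the commit message; the set of allowed `platform.ecs` keys is an assumption assembled from fields mentioned elsewhere in this PR (`executionRole`, `taskRole`, `capacityProvider`) and may not match the ADR's actual schema.

```python
from __future__ import annotations

# Assumption: keys this controller accepts under platform.ecs.
ECS_OVERLAY_KEYS = {"executionRole", "taskRole", "capacityProvider"}


def artifact_key(namespace: str, name: str, generation: int) -> str:
    """S3 key of the rendered config for one exact generation (never overwritten)."""
    return f"artifacts/{namespace}/{name}/{generation}/config.toml"


def validate_platform_overlay(platform: dict, own: str = "ecs") -> list[str]:
    """Strictly validate this controller's overlay and ignore every other platform's keys."""
    overlay = platform.get(own, {})
    return [
        f"platform.{own}: unknown key '{key}'"
        for key in sorted(overlay)
        if key not in ECS_OVERLAY_KEYS
    ]
```

Pinning `CONFIG_ARTIFACT_PATH` to `artifact_key(ns, name, gen)` means tasks started from an older task definition keep reading the config they were deployed with during a rolling update.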
* docs(adr): address 擺渡 review feedback
- Fix startup order: bootstrap first, then config.toml overlay (prevents stale config)
- Discord auto-register: marked as future research (API doesn't exist publicly)
- Fleet uses pre-created bot credentials from SSM
- Controller observes ECS status and writes back (enables oabctl get/wait)
- oabctl snapshot must exclude config.toml
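The corrected startup order can be illustrated with a small sketch. It assumes the bootstrap archive is a tar file and that the wrapper already holds both the archive bytes and the rendered config text; `prepare_home` is a hypothetical name, and the real wrapper would download both from S3.

```python
import io
import tarfile
from pathlib import Path


def prepare_home(home: Path, bootstrap_archive: bytes, rendered_config: str) -> None:
    """Restore mutable state first, then overlay the controller-rendered config.

    Writing config.toml last guarantees that a stale copy inside the
    archive can never win over the current generation's config.
    """
    home.mkdir(parents=True, exist_ok=True)

    # Step 1: restore agent state (memory, knowledge base) from the archive.
    with tarfile.open(fileobj=io.BytesIO(bootstrap_archive), mode="r:*") as tar:
        tar.extractall(home)

    # Step 2: overwrite any archived config.toml with the rendered artifact.
    (home / "config.toml").write_text(rendered_config, encoding="utf-8")
```

Excluding config.toml from snapshots, as the last bullet requires, removes the stale copy at the source; the ordering here is the second line of defense.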
---------
Co-authored-by: Kiro <kiro@openab.dev>
Co-authored-by: 超渡法師 <chaodu@openab.dev>
Summary
Proposes an Architecture Decision Record for running a Kubernetes-style CRD + Operator reconciliation loop on Amazon ECS.
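One reconciliation pass might be sketched as follows, with in-memory dictionaries standing in for the S3 manifest store and the ECS API. `Manifest`, `Observed`, and `reconcile_once` are hypothetical names for illustration and are not defined by the ADR.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Manifest:
    """Desired state, as it might be parsed from a YAML manifest (hypothetical shape)."""
    name: str
    generation: int
    image: str


@dataclass
class Observed:
    """What is currently running (stand-in for an ECS describe call)."""
    image: str
    observed_generation: int


def reconcile_once(desired: dict[str, Manifest], actual: dict[str, Observed]) -> list[str]:
    """Converge actual state toward desired state and return the actions taken."""
    actions: list[str] = []
    for name, manifest in desired.items():
        current = actual.get(name)
        if current is None:
            actual[name] = Observed(manifest.image, manifest.generation)
            actions.append(f"create {name}")
        elif (current.observed_generation < manifest.generation
              or current.image != manifest.image):
            # Either a new generation was applied or the running state drifted.
            actual[name] = Observed(manifest.image, manifest.generation)
            actions.append(f"update {name}")
    # Anything running without a manifest is removed.
    for name in sorted(set(actual) - set(desired)):
        del actual[name]
        actions.append(f"delete {name}")
    return actions
```

Running the pass twice in a row with no changes returns an empty action list the second time, which is the idempotency property a poll-based controller depends on.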
What's in this ADR
Motivation
Give ECS-native teams the same declarative, self-healing deployment experience that K8s operators provide — without requiring a Kubernetes cluster.
Open Questions
cc @pahud