docs(adr): ECS Control Plane — CRD+Operator pattern by chaodu-agent · Pull Request #850 · openabdev/openab

chaodu-agent · 2026-05-18T22:54:31Z

Summary

Proposes an Architecture Decision Record for running a Kubernetes-style CRD + Operator reconciliation loop on Amazon ECS.

What's in this ADR

Manifest schema — declarative YAML (OABService kind) stored in S3
Reconciler design — poll/event-driven controller that diffs desired vs observed ECS state
State store — S3 for manifests, DynamoDB for status + leader election
Phased MVP — starts with single-instance poll-based controller, evolves to event-driven multi-replica
Alternatives considered — Proton, Copilot, CDK Pipelines, Step Functions, EKS
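
As a rough illustration of the reconciler item above, the "diff desired vs observed" step can be sketched as a pure function. This is a minimal sketch, not the ADR's design: `ServiceSpec`, its fields, and `plan_actions` are hypothetical names, and a real controller would read desired state from S3 manifests and observed state from the ECS API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical subset of an OABService manifest (field names are assumptions)."""
    image: str
    cpu: int
    memory: int


def plan_actions(
    desired: Dict[str, ServiceSpec],
    observed: Dict[str, ServiceSpec],
) -> List[Tuple[str, str]]:
    """Return (action, service_name) pairs that would move observed toward desired."""
    actions: List[Tuple[str, str]] = []
    for name in sorted(desired.keys() | observed.keys()):
        want, have = desired.get(name), observed.get(name)
        if have is None:
            actions.append(("create", name))
        elif want is None:
            actions.append(("delete", name))
        elif want != have:
            actions.append(("update", name))
        # Equal specs produce no action, so repeated passes are idempotent.
    return actions


desired = {
    "agent-a": ServiceSpec("example/agent:2", 512, 1024),
    "agent-b": ServiceSpec("example/agent:1", 256, 512),
}
observed = {
    "agent-a": ServiceSpec("example/agent:1", 512, 1024),
    "agent-c": ServiceSpec("example/agent:1", 256, 512),
}
print(plan_actions(desired, observed))
```

Keeping the diff free of side effects means the same function could back both a poll-based loop and a dry-run style command.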

Motivation

Give ECS-native teams the same declarative, self-healing deployment experience that K8s operators provide — without requiring a Kubernetes cluster.

Open Questions

Secrets management strategy
Multi-region topology
Controller self-upgrade path

cc @pahud

github-actions · 2026-05-18T22:54:42Z

⚠️ This PR is missing a Discord Discussion URL in the body.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

This PR will be automatically closed in 3 days if the link is not added.

shaun-agent · 2026-05-18T23:00:30Z

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Title: docs(adr): ECS Control Plane — CRD+Operator pattern
Source: docs(adr): ECS Control Plane — CRD+Operator pattern #850
Status: moved to PR-Screening
Generated at: 2026-05-18T23:00:57.649Z
Discord thread: not available

Screening report

Done. Posted the screening comment and moved the project item to `PR-Screening`.

GitHub comment: #850 (comment)
Project action: openabdev/1 item PVTI_lADOEFbZWM4BUUALzgtHdKA moved Incoming -> PR-Screening

Intent

This PR proposes an ADR for an ECS-native control plane that gives teams a Kubernetes-style declarative deployment model without requiring EKS. The operator-visible problem is that ECS deployments currently lack a first-class CRD/operator reconciliation pattern for desired state, drift detection, and self-healing behavior.

Feat

Docs / architecture work. It adds docs/adr/ecs-control-plane.md, describing an OABService YAML manifest stored in S3, a reconciler that compares desired state with observed ECS state, DynamoDB-backed status and leader election, and a phased path from single-controller polling to event-driven multi-replica operation.

Who It Serves

Deployers and agent runtime operators. It also serves maintainers by creating a concrete architecture target before implementation work begins.

Rewritten Prompt

Add an ADR for an ECS control plane that models services as declarative manifests and reconciles them into ECS resources. Define manifest shape, storage model, reconciliation loop, observed-state sources, status persistence, leader election, failure handling, rollout phases, and alternatives considered. Clearly mark MVP scope versus later work.

Merge Pitch

This should move forward as an ADR if the team wants ECS to be a serious deployment substrate for OpenAB workloads. Code risk is low because this is documentation only, but architecture-direction risk is medium.

Best-Practice Comparison

OpenClaw and Hermes Agent both apply. The ADR aligns with durable state, reconciliation, leader ownership, and poll-based daemon behavior, but should be sharper on retry/backoff, run logs, atomic state transitions, delivery routing, and restart-safe reconcile work units.

Implementation Options

Conservative: merge as design direction, then open follow-up issues.
Balanced: add an MVP contract section before merge.
Ambitious: expand into a full control-plane spec before merge.

Comparison Table

Option	Speed	Complexity	Reliability	Maintainability	User Impact	Fit for OpenAB Now
Conservative	High	Low	Medium	Medium	Medium	Good if only capturing direction
Balanced	Medium	Medium	High	High	High	Best fit
Ambitious	Low	High	High	Medium	High	Useful later, heavy now

Recommendation

Take the balanced path. Advance the ADR, but ask for a focused revision adding reconcile boundaries, idempotency rules, failure/status states, retry/backoff expectations, and minimal audit/run-log requirements.
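
One way the retry/backoff expectation mentioned above could be pinned down is capped exponential backoff with full jitter. This is a minimal sketch under assumed values: `backoff_delay`, the 1 second base, and the 60 second cap are illustrative and do not come from the ADR.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Seconds to wait before retry number `attempt` (0-based), with full jitter."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    # Clamp the exponent so very large attempt counts cannot overflow a float.
    ceiling = min(cap, base * (2 ** min(attempt, 30)))
    source = rng if rng is not None else random
    # Full jitter: pick uniformly between 0 and the current ceiling.
    return source.uniform(0.0, ceiling)


rng = random.Random(0)  # seeded only to make the demonstration repeatable
delays = [backoff_delay(n, rng=rng) for n in range(8)]
print([round(d, 2) for d in delays])
```

The jitter spreads retries out so that several failing reconcile passes do not hit the ECS API at the same instant.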

chaodu-agent · 2026-05-19T00:46:55Z

<@845835116920307722> <@1490365068863606784> Review findings for PR #850:

docs/adr/ecs-control-plane.md:200-210 defines Discord auto-registration as POST /applications and POST /applications/{id}/bot, but Discord's public Application Resource docs only expose current-application management endpoints such as Get/Edit Current Application; app/bot creation is still a Developer Portal flow. This makes OABFleet auto-provisioning unreleasable as written. Please either remove auto-registration from this ADR, or scope it to pre-created application credentials and make fleet expansion bind existing bot tokens from SSM/Secrets Manager. Source checked: https://docs.discord.com/developers/resources/application
docs/adr/ecs-control-plane.md:353-369 downloads the controller-rendered config.toml first, then extracts bootstrapFrom into /home/agent. Any snapshot/archive containing an older config.toml will overwrite the desired config, violating the ADR's own rule that spec.config owns live config. Please reverse the order so bootstrap is restored first and rendered config is written last, and specify that oabctl snapshot excludes generated config artifacts.
docs/adr/ecs-control-plane.md:245 says the controller does not monitor agent health and only acts on manifest changes, but the same ADR promises oabctl get, wait, status.phase, Available, and secret-rollout failure states based on ECS task/service state. ECS can restart tasks, but the control plane still needs to observe ECS service deployments/tasks and refresh status on each poll. Please change this to “controller does not restart tasks directly; ECS owns restart, controller observes and reports health/drift.”

The previous API identity, bootstrap/config split, and generation fixes look much better. I would keep this ADR focused on the ECS Phase 1 contract and move OABFleet/multi-runtime to follow-up ADRs unless they are tightened to the same level.
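
The startup ordering requested in the second finding can be sketched as follows. This is an assumption-laden illustration, not the ADR's wrapper: `restore_home` and `make_archive` are hypothetical helpers, the archive is assumed to be a tar file, and a real wrapper would download both inputs from S3.

```python
import io
import tarfile
import tempfile
from pathlib import Path


def restore_home(home: Path, bootstrap_archive: bytes, rendered_config: str) -> None:
    """Restore state first, then write the controller-rendered config last."""
    # Step 1: unpack the bootstrap archive (mutable state such as memory).
    with tarfile.open(fileobj=io.BytesIO(bootstrap_archive), mode="r:*") as tar:
        for member in tar.getmembers():
            parts = Path(member.name).parts
            if not member.isfile() or member.name.startswith("/") or ".." in parts:
                continue  # Skip directories, links, and unsafe paths.
            target = home / member.name
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
    # Step 2: the rendered config overwrites any stale copy from the archive.
    (home / "config.toml").write_text(rendered_config, encoding="utf-8")


def make_archive(files: dict) -> bytes:
    """Build an in-memory tar.gz for the demonstration below."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


archive = make_archive({"config.toml": b"stale = true\n", "memory/notes.md": b"hello\n"})
with tempfile.TemporaryDirectory() as tmp:
    home = Path(tmp)
    restore_home(home, archive, "stale = false\n")
    final_config = (home / "config.toml").read_text(encoding="utf-8")
    restored_note = (home / "memory" / "notes.md").read_text(encoding="utf-8")
print(final_config, restored_note)
```

Even if a snapshot accidentally contains an old config.toml, writing the rendered file last keeps spec.config authoritative.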

* docs(adr): add ECS Control Plane (CRD+Operator pattern)

Proposes a Kubernetes-style reconciliation loop targeting ECS:
- Declarative YAML manifests in S3 as desired state
- Controller reconciles against ECS API
- DynamoDB for status tracking and leader election
- Phased MVP approach starting with poll-based single instance

* docs(adr): simplify to S3-only state store, add oabctl CLI UX

- Remove DynamoDB dependency from Phase 1 (S3 strong consistency is sufficient)
- Add oabctl CLI section (apply, get, delete, diff, logs, wait)
- Clarify S3 bucket layout: manifests/ + status/ prefixes
- Restructure phases: Phase 1 pure S3, Phase 2 adds DDB for HA

* docs(adr): add capacityProvider (FARGATE/FARGATE_SPOT) and instance size selection

* docs(adr): add per-agent secrets via SSM Parameter Store reference

* docs(adr): controller provisions Discord bot token via API, stores in Secrets Manager

* docs(adr): add bootstrapFrom — restore agent HOME from S3 archive on startup

OAuth tokens, config, steering, and memory all live in the bootstrap archive. No need for controller to provision secrets via external APIs.

* docs(adr): remove backend field — agent config lives in bootstrapFrom HOME

Controller only manages infra (container, compute, networking). Backend/model/channel config is inside the bootstrap archive's config.toml.

* docs(adr): use direct cpu/memory instead of named sizes

No abstraction layer — values map 1:1 to ECS task definition. Users already know Fargate cpu/memory combos.

* docs(adr): add multi-agent fleet example (5 Kiro + 3 CC + 2 Codex)

* docs(adr): add per-agent secret injection section

Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager references into ECS TaskDefinition native secrets field. IAM scoped per-agent path. No token sharing between agents.

* docs(adr): add structured spec.config — controller validates and generates config.toml

Controller understands config schema (channels, backend, steering, features). spec.config takes precedence over bootstrap archive's config.toml.

* docs(adr): rewrite ECS Control Plane ADR incorporating team review

Addresses feedback from 擺渡/普渡/口渡:
- Fixed API identity: oab.dev/v1 OABService (no more inconsistency)
- Clear separation: bootstrapFrom=state, spec.config=config, secrets=SSM
- Tombstone deletion pattern (not raw S3 delete)
- Entrypoint wrapper (not init container — ECS has no such thing)
- Controller renders config to S3 artifact (not volume mount)
- Explicit generation/observedGeneration (not just S3 VersionId)
- replicas:1 enforced for bot-type agents
- Schema versioning & validation section
- Tightened Phase 1 scope with explicit out-of-scope list

* docs(adr): no LB for agents — they are outbound-only connections

* docs(adr): model config.toml as structured YAML in spec

Full structured config in spec.config — controller renders to TOML. Enables schema validation at apply time, clean diffs, and controller decisions based on config (e.g. adapter ports).

* docs(adr): incorporate all review feedback from the 法師 team

Major changes:
- Unified API identity: oab.dev/v1 / OABService throughout
- bootstrapFrom = mutable state only (memory, KB); secrets never in archive
- Config delivery: controller renders config.toml to S3 artifact, startup wrapper downloads
- Replicas: reject >1 for WebSocket adapters (validation error)
- Secret rotation lifecycle with failure handling
- Controller upgrade strategy (Phase 1: brief gap; Phase 2: leader election)
- Explicit metadata.generation / status.observedGeneration (not S3 VersionId)
- Narrowed Phase 1 scope per review feedback
- Removed answered items from Open Questions

* docs(adr): agents are always single-instance, no LB

Scale horizontally by deploying more agents (each with own token), not by replicating one agent.

* docs(adr): add OABFleet kind and Discord auto-registration flow

Enterprise fleet provisioning: one YAML → N agents auto-provisioned with Discord Bot registration, token storage, and ECS service creation. User only needs to paste OAuth URL to add bots to server.

* docs(adr): add Multi-Runtime Support section (ECS + K8s)

Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic; platform-specific config lives in optional platform: overlay. Each controller reads only its own key, ignores the other.

* docs(adr): address 普渡 review feedback

- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)

* docs(adr): address 口渡 review - immutable config artifacts + strict validation

- Config artifacts now immutable per generation: artifacts/{ns}/{name}/{gen}/config.toml
- TaskDefinition pins CONFIG_ARTIFACT_PATH to exact generation (safe rolling updates)
- Platform overlay: strict-validate own keys, ignore other platform keys

* docs(adr): address 擺渡 review feedback

- Fix startup order: bootstrap first, then config.toml overlay (prevents stale config)
- Discord auto-register: marked as future research (API doesn't exist publicly)
- Fleet uses pre-created bot credentials from SSM
- Controller observes ECS status and writes back (enables oabctl get/wait)
- oabctl snapshot must exclude config.toml

---------

Co-authored-by: Kiro <kiro@openab.dev>
Co-authored-by: 超渡法師 <chaodu@openab.dev>
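
Two items from the squash message above, the per-generation artifact path and the generation/observedGeneration pair, can be illustrated together. The key layout `artifacts/{ns}/{name}/{gen}/config.toml` is quoted from the commit message; the function names and the integer comparison are assumptions made for this sketch.

```python
def config_artifact_key(namespace: str, name: str, generation: int) -> str:
    """S3 key for the rendered config of one manifest generation."""
    # Each generation gets its own key, so an artifact is never overwritten.
    return f"artifacts/{namespace}/{name}/{generation}/config.toml"


def needs_reconcile(generation: int, observed_generation: int) -> bool:
    """True when the controller has not yet acted on the latest manifest."""
    return observed_generation < generation


key_old = config_artifact_key("default", "agent-a", 3)
key_new = config_artifact_key("default", "agent-a", 4)
print(key_old, key_new, needs_reconcile(4, 3))
```

Because a task definition would reference one exact key, tasks from an older revision keep reading the config they were started with during a rolling update.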

chaodu-agent requested a review from thepagent as a code owner May 18, 2026 22:54

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 4e3137f to 83c3d31 Compare May 18, 2026 22:54

github-actions Bot added the pending-screening and closing-soon (PR missing Discord Discussion URL — will auto-close in 3 days) labels May 18, 2026

chaodu-agent changed the title ~~docs(adr): ECS Control Plane~~ docs(adr): ECS Control Plane — CRD+Operator pattern May 18, 2026

Kiro and others added 11 commits May 18, 2026 23:06

docs(adr): add capacityProvider (FARGATE/FARGATE_SPOT) and instance size selection

96b6dc0

docs(adr): add per-agent secrets via SSM Parameter Store reference

0223b3c

docs(adr): controller provisions Discord bot token via API, stores in Secrets Manager

0935f19

docs(adr): add bootstrapFrom — restore agent HOME from S3 archive on startup

158e5b4

OAuth tokens, config, steering, and memory all live in the bootstrap archive. No need for controller to provision secrets via external APIs.

docs(adr): remove backend field — agent config lives in bootstrapFrom HOME

73ce754

Controller only manages infra (container, compute, networking). Backend/model/channel config is inside the bootstrap archive's config.toml.

docs(adr): use direct cpu/memory instead of named sizes

cddb434

No abstraction layer — values map 1:1 to ECS task definition. Users already know Fargate cpu/memory combos.

docs(adr): add multi-agent fleet example (5 Kiro + 3 CC + 2 Codex)

ec7624f

docs(adr): add per-agent secret injection section

fff7bb3

Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager references into ECS TaskDefinition native secrets field. IAM scoped per-agent path. No token sharing between agents.
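
A minimal sketch of the per-agent wiring described in this commit. The `name`/`valueFrom` pair is the shape ECS container definitions use for secrets; the `/oab/agents` path prefix, the agent name, and the helper `container_secrets` are hypothetical.

```python
from typing import Dict, List


def container_secrets(agent: str, secret_names: List[str],
                      prefix: str = "/oab/agents") -> List[Dict[str, str]]:
    """Build the `secrets` list for one ECS container definition."""
    # Each agent reads only from its own parameter path (1:1 ownership).
    return [
        {"name": name, "valueFrom": f"{prefix}/{agent}/{name}"}
        for name in secret_names
    ]


container_definition = {
    "name": "agent",
    "image": "example/agent:latest",  # placeholder image, not from the ADR
    "secrets": container_secrets("kiro-1", ["DISCORD_BOT_TOKEN"]),
}
print(container_definition["secrets"])
```

Scoping the task's IAM policy to the same per-agent path prefix is what would stop one agent from reading another agent's token.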

docs(adr): add structured spec.config — controller validates and generates config.toml

6152806

Controller understands config schema (channels, backend, steering, features). spec.config takes precedence over bootstrap archive's config.toml.

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 385fd65 to 7b0dcd8 Compare May 19, 2026 00:27

Kiro and others added 5 commits May 19, 2026 00:27

docs(adr): no LB for agents — they are outbound-only connections

5ba269f

docs(adr): model config.toml as structured YAML in spec

a5dfedf

Full structured config in spec.config — controller renders to TOML. Enables schema validation at apply time, clean diffs, and controller decisions based on config (e.g. adapter ports).
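
To make "controller renders to TOML" concrete, here is a deliberately small renderer. It is a sketch with stated limits: it handles only top-level scalars and one level of tables, assumes keys are valid bare TOML keys, and the `backend`/`features` values are invented for the example. The ADR's actual renderer and schema may look quite different.

```python
import json
from typing import Any, Dict, List


def _value(v: Any) -> str:
    """Format one scalar or flat list as a TOML value."""
    if isinstance(v, bool):  # bool must be checked before int
        return "true" if v else "false"
    if isinstance(v, (int, float)):
        return repr(v)
    if isinstance(v, str):
        # JSON string escaping is close enough to TOML basic strings here.
        return json.dumps(v, ensure_ascii=False)
    if isinstance(v, list):
        return "[" + ", ".join(_value(x) for x in v) + "]"
    raise TypeError(f"unsupported value: {v!r}")


def render_toml(config: Dict[str, Any]) -> str:
    """Render top-level scalars first, then each dict as a [table]."""
    lines: List[str] = []
    for key, value in config.items():
        if not isinstance(value, dict):
            lines.append(f"{key} = {_value(value)}")
    for name, table in config.items():
        if isinstance(table, dict):
            if lines:
                lines.append("")
            lines.append(f"[{name}]")
            for key, value in table.items():
                lines.append(f"{key} = {_value(value)}")
    return "\n".join(lines) + "\n"


spec_config = {
    "backend": {"kind": "kiro", "model": "example-model"},
    "features": {"memory": True, "max_threads": 4},
}
rendered = render_toml(spec_config)
print(rendered)
```

Rendering from a structured value is what allows validation to reject unsupported shapes at apply time instead of at agent startup.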

docs(adr): agents are always single-instance, no LB

e773a4f

Scale horizontally by deploying more agents (each with own token), not by replicating one agent.

docs(adr): add OABFleet kind and Discord auto-registration flow

2a7eba5

Enterprise fleet provisioning: one YAML → N agents auto-provisioned with Discord Bot registration, token storage, and ECS service creation. User only needs to paste OAuth URL to add bots to server.

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 2a7eba5 to 5a8dcda Compare May 19, 2026 00:38

docs(adr): add Multi-Runtime Support section (ECS + K8s)

e9b25e1

Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic; platform-specific config lives in optional platform: overlay. Each controller reads only its own key, ignores the other.
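
The overlay rule in this commit, where each controller reads only its own key, might look like the sketch below from the ECS controller's side. `executionRole` and `taskRole` are named in a later commit on this PR; the `apiVersion`/`kind` field names, the `k8s` key, and the strict check on unknown ECS keys are assumptions for illustration.

```python
from typing import Any, Dict

# Keys accepted under platform.ecs. The complete key set is an assumption.
ECS_OVERLAY_KEYS = {"executionRole", "taskRole"}


def ecs_overlay(manifest: Dict[str, Any]) -> Dict[str, Any]:
    """Return the validated platform.ecs overlay, ignoring other platforms."""
    overlay = (manifest.get("platform") or {}).get("ecs") or {}
    unknown = sorted(set(overlay) - ECS_OVERLAY_KEYS)
    if unknown:
        # Strict on our own keys; keys for other platforms are never inspected.
        raise ValueError(f"unknown platform.ecs keys: {unknown}")
    return dict(overlay)


manifest = {
    "apiVersion": "oab.dev/v1",
    "kind": "OABService",
    "platform": {
        "ecs": {"taskRole": "arn:aws:iam::123456789012:role/example-task"},
        "k8s": {"anythingHere": "is ignored by the ECS controller"},
    },
}
print(ecs_overlay(manifest))
```

A typo under `platform.ecs` fails fast, while content under another platform's key passes through untouched.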

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from ec36b0a to e9b25e1 Compare May 19, 2026 00:43

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from e9b25e1 to ef7beb2 Compare May 19, 2026 00:47

docs(adr): address 普渡 review feedback

a7db20c

- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch 2 times, most recently from a7db20c to 2754de2 Compare May 19, 2026 00:48

chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 2754de2 to a5e5f7b Compare May 19, 2026 00:48

chaodu-agent enabled auto-merge (squash) May 19, 2026 01:18

thepagent approved these changes May 19, 2026

thepagent disabled auto-merge May 19, 2026 01:57

thepagent merged commit 6afe59a into main May 19, 2026
2 checks passed

thepagent merged 21 commits into main from docs/adr-ecs-control-plane
