
docs(adr): ECS Control Plane — CRD+Operator pattern #850

Merged
thepagent merged 21 commits into main from docs/adr-ecs-control-plane on May 19, 2026


@chaodu-agent
Collaborator

@chaodu-agent chaodu-agent commented May 18, 2026

Summary

Proposes an Architecture Decision Record for running a Kubernetes-style CRD + Operator reconciliation loop on Amazon ECS.

What's in this ADR

  • Manifest schema — declarative YAML (OABService kind) stored in S3
  • Reconciler design — poll/event-driven controller that diffs desired vs observed ECS state
  • State store — S3 for manifests, DynamoDB for status + leader election
  • Phased MVP — starts with single-instance poll-based controller, evolves to event-driven multi-replica
  • Alternatives considered — Proton, Copilot, CDK Pipelines, Step Functions, EKS
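
The reconciler bullet above describes diffing desired against observed ECS state. A minimal sketch of that diff, assuming a hypothetical `ServiceSpec` with only three compared fields (the ADR's real manifest schema is richer and is not reproduced here):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical subset of an OABService spec that the controller compares."""
    image: str
    cpu: int
    memory: int


def plan_actions(desired: dict[str, ServiceSpec],
                 observed: dict[str, ServiceSpec]) -> list[tuple[str, str]]:
    """Return (action, service_name) pairs that move observed state toward desired."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in observed:
            actions.append(("create", name))   # in the manifest, missing in ECS
        elif observed[name] != spec:
            actions.append(("update", name))   # present in both, but drifted
    for name in sorted(observed):
        if name not in desired:
            actions.append(("delete", name))   # running in ECS, no manifest
    return actions
```

Keeping the planner a pure function means the same code could run on a poll tick or in response to an event, which fits the phased approach listed above.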

Motivation

Give ECS-native teams the same declarative, self-healing deployment experience that K8s operators provide — without requiring a Kubernetes cluster.

Open Questions

  • Secrets management strategy
  • Multi-region topology
  • Controller self-upgrade path

cc @pahud

Proposes a Kubernetes-style reconciliation loop targeting ECS:
- Declarative YAML manifests in S3 as desired state
- Controller reconciles against ECS API
- DynamoDB for status tracking and leader election
- Phased MVP approach starting with poll-based single instance
@chaodu-agent chaodu-agent requested a review from thepagent as a code owner May 18, 2026 22:54
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 4e3137f to 83c3d31 Compare May 18, 2026 22:54
@github-actions github-actions Bot added the pending-screening and closing-soon labels May 18, 2026
@github-actions

⚠️ This PR is missing a Discord Discussion URL in the body.

All PRs must reference a prior Discord discussion to ensure community alignment before implementation.

Please edit the PR description to include a link like:

Discord Discussion URL: https://discord.com/channels/...

This PR will be automatically closed in 3 days if the link is not added.

@chaodu-agent chaodu-agent changed the title docs(adr): ECS Control Plane docs(adr): ECS Control Plane — CRD+Operator pattern May 18, 2026
@shaun-agent
Contributor

shaun-agent commented May 18, 2026

OpenAB PR Screening

This is auto-generated by the OpenAB project-screening flow for context collection and reviewer handoff.
Click 👍 if you find this useful. Human review will be done within 24 hours. We appreciate your support and contribution 🙏

Screening report done. Posted the screening comment and moved the project item to `PR-Screening`.

GitHub comment: #850 (comment)
Project action: openabdev/1 item PVTI_lADOEFbZWM4BUUALzgtHdKA moved Incoming -> PR-Screening

Intent

This PR proposes an ADR for an ECS-native control plane that gives teams a Kubernetes-style declarative deployment model without requiring EKS. The operator-visible problem is that ECS deployments currently lack a first-class CRD/operator reconciliation pattern for desired state, drift detection, and self-healing behavior.

Feat

Docs / architecture work. It adds docs/adr/ecs-control-plane.md, describing an OABService YAML manifest stored in S3, a reconciler that compares desired state with observed ECS state, DynamoDB-backed status and leader election, and a phased path from single-controller polling to event-driven multi-replica operation.

Who It Serves

Deployers and agent runtime operators. It also serves maintainers by creating a concrete architecture target before implementation work begins.

Rewritten Prompt

Add an ADR for an ECS control plane that models services as declarative manifests and reconciles them into ECS resources. Define manifest shape, storage model, reconciliation loop, observed-state sources, status persistence, leader election, failure handling, rollout phases, and alternatives considered. Clearly mark MVP scope versus later work.

Merge Pitch

This should move forward as an ADR if the team wants ECS to be a serious deployment substrate for OpenAB workloads. Code risk is low because this is documentation only, but architecture-direction risk is medium.

Best-Practice Comparison

OpenClaw and Hermes Agent both apply. The ADR aligns with durable state, reconciliation, leader ownership, and poll-based daemon behavior, but should be sharper on retry/backoff, run logs, atomic state transitions, delivery routing, and restart-safe reconcile work units.

Implementation Options

  1. Conservative: merge as design direction, then open follow-up issues.
  2. Balanced: add an MVP contract section before merge.
  3. Ambitious: expand into a full control-plane spec before merge.

Comparison Table

| Option | Speed | Complexity | Reliability | Maintainability | User Impact | Fit for OpenAB Now |
| --- | --- | --- | --- | --- | --- | --- |
| Conservative | High | Low | Medium | Medium | Medium | Good if only capturing direction |
| Balanced | Medium | Medium | High | High | High | Best fit |
| Ambitious | Low | High | High | Medium | High | Useful later, heavy now |

Recommendation

Take the balanced path. Advance the ADR, but ask for a focused revision adding reconcile boundaries, idempotency rules, failure/status states, retry/backoff expectations, and minimal audit/run-log requirements.
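
To make the retry/backoff expectation concrete, here is one possible shape for it: exponential backoff with full jitter. The function name, the 1-second base, and the 60-second cap are illustrative assumptions, not values taken from the ADR.

```python
import random
from typing import Optional


def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0,
                  rng: Optional[random.Random] = None) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    if attempt < 0:
        raise ValueError("attempt must be >= 0")
    rng = rng or random.Random()
    ceiling = min(cap, base * (2 ** attempt))  # exponential growth, capped
    return rng.uniform(0.0, ceiling)
```

A reconcile step that fails on attempt `n` would wait `backoff_delay(n)` seconds before retrying. Because the delay is randomized, several controllers that fail at the same moment are unlikely to retry in lockstep.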

Kiro and others added 11 commits May 18, 2026 23:06
- Remove DynamoDB dependency from Phase 1 (S3 strong consistency is sufficient)
- Add oabctl CLI section (apply, get, delete, diff, logs, wait)
- Clarify S3 bucket layout: manifests/ + status/ prefixes
- Restructure phases: Phase 1 pure S3, Phase 2 adds DDB for HA
…startup

OAuth tokens, config, steering, and memory all live in the bootstrap
archive. No need for controller to provision secrets via external APIs.
… HOME

Controller only manages infra (container, compute, networking).
Backend/model/channel config is inside the bootstrap archive's config.toml.
No abstraction layer — values map 1:1 to ECS task definition.
Users already know Fargate cpu/memory combos.
Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager
references into ECS TaskDefinition native secrets field. IAM scoped
per-agent path. No token sharing between agents.
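
A sketch of the per-agent wiring described in this commit, under stated assumptions: the `/oab/agents/{agent}/` parameter path is a hypothetical naming convention, and both helper names are invented for illustration. The `{"name", "valueFrom"}` entry shape is the one ECS container definitions use for `secrets`.

```python
def build_container_secrets(agent: str, names: list[str],
                            region: str, account_id: str) -> list[dict[str, str]]:
    """Build the `secrets` list for one agent's ECS container definition."""
    # Hypothetical path convention: one SSM prefix per agent, never shared.
    prefix = f"arn:aws:ssm:{region}:{account_id}:parameter/oab/agents/{agent}"
    return [{"name": n, "valueFrom": f"{prefix}/{n}"} for n in names]


def build_agent_policy(agent: str, region: str, account_id: str) -> dict:
    """IAM policy document that allows reads under this agent's prefix only."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["ssm:GetParameters"],
            "Resource": (f"arn:aws:ssm:{region}:{account_id}"
                         f":parameter/oab/agents/{agent}/*"),
        }],
    }
```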
…rates config.toml

Controller understands config schema (channels, backend, steering, features).
spec.config takes precedence over bootstrap archive's config.toml.
Addresses feedback from 擺渡/普渡/口渡:
- Fixed API identity: oab.dev/v1 OABService (no more inconsistency)
- Clear separation: bootstrapFrom=state, spec.config=config, secrets=SSM
- Tombstone deletion pattern (not raw S3 delete)
- Entrypoint wrapper (not init container — ECS has no such thing)
- Controller renders config to S3 artifact (not volume mount)
- Explicit generation/observedGeneration (not just S3 VersionId)
- replicas:1 enforced for bot-type agents
- Schema versioning & validation section
- Tightened Phase 1 scope with explicit out-of-scope list
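
The generation/observedGeneration item can be illustrated with a small check. This is a sketch that assumes the manifest and status are plain dictionaries and that a missing `observedGeneration` means the controller has never acted. The helper names are hypothetical.

```python
def needs_reconcile(manifest: dict, status: dict) -> bool:
    """True when the controller has not yet acted on the current generation."""
    generation = manifest["metadata"]["generation"]
    observed = status.get("observedGeneration", 0)  # 0: never reconciled
    return observed < generation


def mark_reconciled(manifest: dict, status: dict) -> dict:
    """Return a new status recording that this generation has been handled."""
    return {**status, "observedGeneration": manifest["metadata"]["generation"]}
```

Comparing explicit counters rather than S3 VersionIds keeps the check independent of how the object store versions its writes.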
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 385fd65 to 7b0dcd8 Compare May 19, 2026 00:27
Kiro and others added 5 commits May 19, 2026 00:27
Full structured config in spec.config — controller renders to TOML.
Enables schema validation at apply time, clean diffs, and controller
decisions based on config (e.g. adapter ports).
Major changes:
- Unified API identity: oab.dev/v1 / OABService throughout
- bootstrapFrom = mutable state only (memory, KB); secrets never in archive
- Config delivery: controller renders config.toml to S3 artifact, startup wrapper downloads
- Replicas: reject >1 for WebSocket adapters (validation error)
- Secret rotation lifecycle with failure handling
- Controller upgrade strategy (Phase 1: brief gap; Phase 2: leader election)
- Explicit metadata.generation / status.observedGeneration (not S3 VersionId)
- Narrowed Phase 1 scope per review feedback
- Removed answered items from Open Questions
Scale horizontally by deploying more agents (each with own token),
not by replicating one agent.
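
The single-instance rule could be enforced at apply time roughly as follows. The field layout (`spec.replicas`, `spec.config.channels`) and the set of WebSocket-based adapters are assumptions made for this sketch. Only the rule itself (reject more than one replica for WebSocket adapters) comes from the commits above.

```python
class ValidationError(ValueError):
    """Raised when a manifest violates a schema rule."""


# Hypothetical: adapters assumed to hold one long-lived WebSocket session.
WEBSOCKET_ADAPTERS = frozenset({"discord"})


def validate_replicas(spec: dict) -> None:
    """Reject replicas > 1 when any configured channel uses a WebSocket adapter."""
    replicas = spec.get("replicas", 1)
    channels = set(spec.get("config", {}).get("channels", {}))
    websocket_channels = sorted(channels & WEBSOCKET_ADAPTERS)
    if replicas > 1 and websocket_channels:
        raise ValidationError(
            f"replicas={replicas} is not allowed with WebSocket adapters "
            f"{websocket_channels}; deploy more agents instead"
        )
```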
Enterprise fleet provisioning: one YAML → N agents auto-provisioned
with Discord Bot registration, token storage, and ECS service creation.
User only needs to paste OAuth URL to add bots to server.
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 2a7eba5 to 5a8dcda Compare May 19, 2026 00:38
Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic;
platform-specific config lives in optional platform: overlay. Each
controller reads only its own key, ignores the other.
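
One way a controller might read only its own overlay is sketched below. The allowed-key set is an assumption (only `executionRole` and `taskRole` are named elsewhere in this PR), and `select_overlay` is a hypothetical helper.

```python
# Assumed key set for platform.ecs; the ADR may define more.
ECS_OVERLAY_KEYS = frozenset({"executionRole", "taskRole"})


def select_overlay(manifest: dict, platform: str, allowed_keys: frozenset) -> dict:
    """Return this controller's overlay; other platforms' keys are never read."""
    overlays = manifest.get("platform", {})
    own = overlays.get(platform, {})
    # Validate our own keys strictly so typos fail at apply time.
    unknown = sorted(set(own) - allowed_keys)
    if unknown:
        raise ValueError(f"unknown platform.{platform} keys: {unknown}")
    return dict(own)
```

With this shape, an ECS controller calling `select_overlay(manifest, "ecs", ECS_OVERLAY_KEYS)` never inspects whatever a K8s controller expects under its own key.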
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from ec36b0a to e9b25e1 Compare May 19, 2026 00:43
@chaodu-agent
Collaborator Author

Review findings for PR #850:

  1. docs/adr/ecs-control-plane.md:200-210 defines Discord auto-registration as POST /applications and POST /applications/{id}/bot, but Discord's public Application Resource docs only expose current-application management endpoints such as Get/Edit Current Application; app/bot creation is still a Developer Portal flow. This makes OABFleet auto-provisioning unreleasable as written. Please either remove auto-registration from this ADR, or scope it to pre-created application credentials and make fleet expansion bind existing bot tokens from SSM/Secrets Manager. Source checked: https://docs.discord.com/developers/resources/application

  2. docs/adr/ecs-control-plane.md:353-369 downloads the controller-rendered config.toml first, then extracts bootstrapFrom into /home/agent. Any snapshot/archive containing an older config.toml will overwrite the desired config, violating the ADR's own rule that spec.config owns live config. Please reverse the order so bootstrap is restored first and rendered config is written last, and specify that oabctl snapshot excludes generated config artifacts.

  3. docs/adr/ecs-control-plane.md:245 says the controller does not monitor agent health and only acts on manifest changes, but the same ADR promises oabctl get, wait, status.phase, Available, and secret-rollout failure states based on ECS task/service state. ECS can restart tasks, but the control plane still needs to observe ECS service deployments/tasks and refresh status on each poll. Please change this to “controller does not restart tasks directly; ECS owns restart, controller observes and reports health/drift.”

The previous API identity, bootstrap/config split, and generation fixes look much better. I would keep this ADR focused on the ECS Phase 1 contract and move OABFleet/multi-runtime to follow-up ADRs unless they are tightened to the same level.
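
The ordering requested in finding 2 can be sketched as a startup step that restores the bootstrap archive first and writes the rendered config last. This assumes the archive is a tar file already fetched into memory. `restore_home` is a hypothetical name, and the S3 download is left out.

```python
import io
import tarfile
from pathlib import Path


def restore_home(home: Path, bootstrap_archive: bytes, rendered_config: str) -> None:
    """Restore bootstrap state, then overlay the controller-rendered config."""
    home.mkdir(parents=True, exist_ok=True)
    root = home.resolve()
    # Step 1: extract regular files from the bootstrap archive.
    with tarfile.open(fileobj=io.BytesIO(bootstrap_archive), mode="r:*") as tar:
        for member in tar.getmembers():
            if not member.isfile():
                continue
            target = (root / member.name).resolve()
            if root not in target.parents:
                continue  # skip entries that would escape HOME
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_bytes(tar.extractfile(member).read())
    # Step 2: write config.toml last so a stale copy in the archive cannot win.
    (root / "config.toml").write_text(rendered_config)
```

Writing the config last is a second line of defence. Excluding generated config from `oabctl snapshot`, as the finding also asks, prevents the stale copy from entering the archive at all.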

@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from e9b25e1 to ef7beb2 Compare May 19, 2026 00:47
- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch 2 times, most recently from a7db20c to 2754de2 Compare May 19, 2026 00:48
…lidation

- Config artifacts now immutable per generation: artifacts/{ns}/{name}/{gen}/config.toml
- TaskDefinition pins CONFIG_ARTIFACT_PATH to exact generation (safe rolling updates)
- Platform overlay: strict-validate own keys, ignore other platform keys
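
A sketch of how the per-generation artifact path and the pinned environment variable might be derived. The key layout and the `CONFIG_ARTIFACT_PATH` name come from the commit above. The bucket parameter, the `s3://` URI form of the value, and both helper names are assumptions.

```python
def config_artifact_key(namespace: str, name: str, generation: int) -> str:
    """S3 key for the immutable config rendered at one generation."""
    return f"artifacts/{namespace}/{name}/{generation}/config.toml"


def task_environment(bucket: str, namespace: str, name: str,
                     generation: int) -> list[dict[str, str]]:
    """ECS `environment` entries that pin a task to one exact config artifact."""
    key = config_artifact_key(namespace, name, generation)
    # Assumption: the value is a full s3:// URI rather than a bare key.
    return [{"name": "CONFIG_ARTIFACT_PATH", "value": f"s3://{bucket}/{key}"}]
```

Because each generation gets its own key, tasks from the old and new task definition can run side by side during a rolling update without reading each other's config.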
@chaodu-agent chaodu-agent force-pushed the docs/adr-ecs-control-plane branch from 2754de2 to a5e5f7b Compare May 19, 2026 00:48
- Fix startup order: bootstrap first, then config.toml overlay (prevents stale config)
- Discord auto-register: marked as future research (API doesn't exist publicly)
- Fleet uses pre-created bot credentials from SSM
- Controller observes ECS status and writes back (enables oabctl get/wait)
- oabctl snapshot must exclude config.toml
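
Observing ECS and writing status back could reduce to a mapping like the one below. The input is a dictionary shaped like one service entry from the ECS DescribeServices response (`desiredCount`, `runningCount`, `deployments[].rolloutState`). Apart from `Available`, which the review mentions, the phase names here are hypothetical.

```python
def derive_phase(service: dict) -> str:
    """Map one observed ECS service description to a status.phase value."""
    desired = service.get("desiredCount", 0)
    running = service.get("runningCount", 0)
    primary = next(
        (d for d in service.get("deployments", []) if d.get("status") == "PRIMARY"),
        None,
    )
    rollout = primary.get("rolloutState") if primary else None
    if rollout == "FAILED":
        return "Failed"        # hypothetical phase name
    if rollout == "IN_PROGRESS":
        return "Progressing"   # hypothetical phase name
    if desired > 0 and running >= desired:
        return "Available"
    return "Degraded"          # hypothetical phase name
```

The controller only reports here. Restarting tasks stays with ECS, matching the wording the review asked for.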
@chaodu-agent chaodu-agent enabled auto-merge (squash) May 19, 2026 01:18
@thepagent thepagent disabled auto-merge May 19, 2026 01:57
@thepagent thepagent merged commit 6afe59a into main May 19, 2026
2 checks passed
apple8409 pushed a commit to apple8409/openab that referenced this pull request May 19, 2026
* docs(adr): add ECS Control Plane (CRD+Operator pattern)

Proposes a Kubernetes-style reconciliation loop targeting ECS:
- Declarative YAML manifests in S3 as desired state
- Controller reconciles against ECS API
- DynamoDB for status tracking and leader election
- Phased MVP approach starting with poll-based single instance

* docs(adr): simplify to S3-only state store, add oabctl CLI UX

- Remove DynamoDB dependency from Phase 1 (S3 strong consistency is sufficient)
- Add oabctl CLI section (apply, get, delete, diff, logs, wait)
- Clarify S3 bucket layout: manifests/ + status/ prefixes
- Restructure phases: Phase 1 pure S3, Phase 2 adds DDB for HA

* docs(adr): add capacityProvider (FARGATE/FARGATE_SPOT) and instance size selection

* docs(adr): add per-agent secrets via SSM Parameter Store reference

* docs(adr): controller provisions Discord bot token via API, stores in Secrets Manager

* docs(adr): add bootstrapFrom — restore agent HOME from S3 archive on startup

OAuth tokens, config, steering, and memory all live in the bootstrap
archive. No need for controller to provision secrets via external APIs.

* docs(adr): remove backend field — agent config lives in bootstrapFrom HOME

Controller only manages infra (container, compute, networking).
Backend/model/channel config is inside the bootstrap archive's config.toml.

* docs(adr): use direct cpu/memory instead of named sizes

No abstraction layer — values map 1:1 to ECS task definition.
Users already know Fargate cpu/memory combos.

* docs(adr): add multi-agent fleet example (5 Kiro + 3 CC + 2 Codex)

* docs(adr): add per-agent secret injection section

Each agent owns its secrets (1:1). Controller wires SSM/Secrets Manager
references into ECS TaskDefinition native secrets field. IAM scoped
per-agent path. No token sharing between agents.

* docs(adr): add structured spec.config — controller validates and generates config.toml

Controller understands config schema (channels, backend, steering, features).
spec.config takes precedence over bootstrap archive's config.toml.

* docs(adr): rewrite ECS Control Plane ADR incorporating team review

Addresses feedback from 擺渡/普渡/口渡:
- Fixed API identity: oab.dev/v1 OABService (no more inconsistency)
- Clear separation: bootstrapFrom=state, spec.config=config, secrets=SSM
- Tombstone deletion pattern (not raw S3 delete)
- Entrypoint wrapper (not init container — ECS has no such thing)
- Controller renders config to S3 artifact (not volume mount)
- Explicit generation/observedGeneration (not just S3 VersionId)
- replicas:1 enforced for bot-type agents
- Schema versioning & validation section
- Tightened Phase 1 scope with explicit out-of-scope list

* docs(adr): no LB for agents — they are outbound-only connections

* docs(adr): model config.toml as structured YAML in spec

Full structured config in spec.config — controller renders to TOML.
Enables schema validation at apply time, clean diffs, and controller
decisions based on config (e.g. adapter ports).

* docs(adr): incorporate all review feedback from 法師團隊

Major changes:
- Unified API identity: oab.dev/v1 / OABService throughout
- bootstrapFrom = mutable state only (memory, KB); secrets never in archive
- Config delivery: controller renders config.toml to S3 artifact, startup wrapper downloads
- Replicas: reject >1 for WebSocket adapters (validation error)
- Secret rotation lifecycle with failure handling
- Controller upgrade strategy (Phase 1: brief gap; Phase 2: leader election)
- Explicit metadata.generation / status.observedGeneration (not S3 VersionId)
- Narrowed Phase 1 scope per review feedback
- Removed answered items from Open Questions

* docs(adr): agents are always single-instance, no LB

Scale horizontally by deploying more agents (each with own token),
not by replicating one agent.

* docs(adr): add OABFleet kind and Discord auto-registration flow

Enterprise fleet provisioning: one YAML → N agents auto-provisioned
with Discord Bot registration, token storage, and ECS service creation.
User only needs to paste OAuth URL to add bots to server.

* docs(adr): add Multi-Runtime Support section (ECS + K8s)

Same YAML deploys to both ECS and K8s. Core spec is platform-agnostic;
platform-specific config lives in optional platform: overlay. Each
controller reads only its own key, ignores the other.

* docs(adr): address 普渡 review feedback

- Add executionRole/taskRole to platform.ecs overlay
- Clarify Discord auto-register requires user OAuth2 bearer token
- Mark OABFleet + autoRegister as Phase 2
- Add IAM requirements for startup wrapper (s3:GetObject)

* docs(adr): address 口渡 review - immutable config artifacts + strict validation

- Config artifacts now immutable per generation: artifacts/{ns}/{name}/{gen}/config.toml
- TaskDefinition pins CONFIG_ARTIFACT_PATH to exact generation (safe rolling updates)
- Platform overlay: strict-validate own keys, ignore other platform keys

* docs(adr): address 擺渡 review feedback

- Fix startup order: bootstrap first, then config.toml overlay (prevents stale config)
- Discord auto-register: marked as future research (API doesn't exist publicly)
- Fleet uses pre-created bot credentials from SSM
- Controller observes ECS status and writes back (enables oabctl get/wait)
- oabctl snapshot must exclude config.toml

---------

Co-authored-by: Kiro <kiro@openab.dev>
Co-authored-by: 超渡法師 <chaodu@openab.dev>

Labels

closing-soon PR missing Discord Discussion URL — will auto-close in 3 days


3 participants