proposal: agents map + S3/MinIO auth backup for multi-agent Helm chart #51

##### Background

The current `agent.preset` approach only supports one agent per Helm release and couples agent selection to hidden template logic. After deeper analysis (including multi-AZ EKS behaviour), this proposal has evolved to address two problems together: **multi-agent support** and **auth persistence across pod restarts**.

---

##### Problem 1 — Single-agent limitation (`preset`)

Current design:
```yaml
agent:
  preset: "kiro"
```

- Only one agent per release
- Adding a new preset requires editing `_helpers.tpl`
- Cannot run kiro + claude simultaneously on different Discord channels

---

##### Problem 2 — PVC + StatefulSet is wrong for auth-only persistence

The obvious fix for stable storage across pod restarts is a StatefulSet with `volumeClaimTemplates`. However, this creates a **cross-AZ problem on EKS** (and on any multi-AZ cluster that uses zonal volumes):

- EBS volumes are AZ-scoped
- If a node fails and the pod reschedules to a different AZ, the EBS volume cannot be attached → pod stuck in `Pending`
- EFS avoids this but is overkill for a few KB of auth token files

The data that actually needs persistence is tiny and changes rarely. The last two rows are runtime data that can be rebuilt and do not need to be backed up:

| Path | Content | Changes when | Persist |
|---|---|---|---|
| `.kiro/settings/` | OAuth login token | Only on re-login | Yes |
| `.kiro/steering/` | Steering config | User changes | Yes |
| `~/.claude/` | Claude OAuth token | Only on re-login | Yes |
| `~/.codex/` | Codex auth | Only on re-login | Yes |
| `.kiro/sessions/` | ACP session cache | Every restart (ephemeral) | No |
| `.semantic_search/` | Search index | Rebuild on restart is fine | No |

A generic **S3/MinIO backup** approach handles all CLI types without AZ coupling.

---

##### Proposed design

**`values.yaml` structure — `agents` map (one entry = one Deployment)**

```yaml
agents:
  kiro:
    image:
      repository: ghcr.io/thepagent/agent-broker
      tag: ""
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: ""
      allowedChannels:
        - "YOUR_CHANNEL_ID"
    workingDir: /home/agent
    env: {}
    envFrom: []
    pool:
      maxSessions: 10
      sessionTtlHours: 24
    reactions:
      enabled: true
      removeAfterReply: false
    persistence:
      s3:
        enabled: false
        bucket: ""
        prefix: "agents/kiro"
        credentialsSecret: ""   # Secret with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
        endpoint: ""            # for MinIO: http://minio.default.svc:9000
      paths:
        - .kiro/settings        # paths relative to $HOME to backup/restore
        - .kiro/steering
    resources: {}
    nodeSelector: {}
    tolerations: []
    affinity: {}
```
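For the multi-agent case from Problem 1, a values file might carry two entries side by side. This is a minimal sketch: the channel IDs are placeholders, the empty `args` for the Claude entry is an assumption, and the fields omitted here are assumed to fall back to defaults in the chart templates.

```yaml
agents:
  kiro:
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: ""               # supplied per agent
      allowedChannels:
        - "KIRO_CHANNEL_ID"      # placeholder
    persistence:
      paths:
        - .kiro/settings
        - .kiro/steering
  claude:
    command: claude-agent-acp
    args: []                     # assumption: no extra arguments needed
    discord:
      botToken: ""               # a separate bot token for this agent
      allowedChannels:
        - "CLAUDE_CHANNEL_ID"    # placeholder
    persistence:
      paths:
        - .claude
```

Each top-level key under `agents` would then render as its own Deployment, ConfigMap, and Secret.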

**Resources created per agent entry:**
- `Deployment/<fullname>-<name>` — replaces StatefulSet (no PVC needed)
- `ConfigMap/<fullname>-<name>` — config.toml + optional AGENTS.md
- `Secret/<fullname>-<name>` — discord bot token
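The per-agent resources could be rendered by ranging over the `agents` map. The template below is only a sketch under several assumptions: the file name, the `agent-broker.fullname` helper (named in the style of `helm create`), the `DISCORD_BOT_TOKEN` variable name, the fallback of an empty `tag` to the chart's `appVersion`, and `replicas: 1` are all illustrative rather than taken from the existing chart. The ConfigMap is omitted for brevity.

```yaml
# templates/agents.yaml (hypothetical file name)
{{- range $name, $agent := .Values.agents }}
---
apiVersion: v1
kind: Secret
metadata:
  # Inside range, "$" refers to the root context.
  name: {{ include "agent-broker.fullname" $ }}-{{ $name }}
type: Opaque
stringData:
  discord-bot-token: {{ $agent.discord.botToken | quote }}
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ include "agent-broker.fullname" $ }}-{{ $name }}
spec:
  replicas: 1                      # assumption: one pod per agent
  selector:
    matchLabels:
      app.kubernetes.io/instance: {{ $.Release.Name }}
      app.kubernetes.io/component: {{ $name }}
  template:
    metadata:
      labels:
        app.kubernetes.io/instance: {{ $.Release.Name }}
        app.kubernetes.io/component: {{ $name }}
    spec:
      containers:
        - name: agent-broker
          image: "{{ $agent.image.repository }}:{{ $agent.image.tag | default $.Chart.AppVersion }}"
          env:
            - name: DISCORD_BOT_TOKEN   # assumed variable name
              valueFrom:
                secretKeyRef:
                  name: {{ include "agent-broker.fullname" $ }}-{{ $name }}
                  key: discord-bot-token
{{- end }}
```

Because the agent name is part of both the resource name and the selector labels, two entries in the same release do not collide.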

**Auth persistence via init container + preStop hook:**

```
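Pod startup:
  init container → for each <path> in persistence.paths:
                     aws s3 sync s3://<bucket>/<prefix>/<path>/ /home/agent/<path>/ (restore)
  main container → agent-broker starts with auth already in place

Pod shutdown:
  preStop hook → for each <path> in persistence.paths:
                     aws s3 sync /home/agent/<path>/ s3://<bucket>/<prefix>/<path>/ (backup)
  main container → receives SIGTERM once the hook has finished, then terminates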

Runtime data (sessions, search index) uses `emptyDir` — ephemeral, no AZ dependency.
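The restore and backup flow above can be sketched as a pod spec fragment. Everything beyond the `aws s3 sync` calls is an assumption made for illustration: the `amazon/aws-cli` init image, the bucket and Secret names, the `S3_BUCKET` / `S3_PREFIX` variable names, the grace period, and in particular that the agent-broker image ships the AWS CLI so the preStop hook can run it. It also assumes nothing in the image's own `/home/agent` needs to survive being shadowed by the `emptyDir` mount.

```yaml
# Fragment of spec.template.spec for one agent's Deployment (not a full manifest).
terminationGracePeriodSeconds: 60   # assumed; must leave time for the backup to finish
volumes:
  - name: home
    emptyDir: {}
initContainers:
  - name: restore-auth
    image: amazon/aws-cli           # assumed image with the AWS CLI and a shell
    command: ["/bin/sh", "-c"]
    args:
      - |
        # For MinIO, add --endpoint-url "<endpoint>" to the aws call.
        for p in .kiro/settings .kiro/steering; do
          aws s3 sync "s3://${S3_BUCKET}/${S3_PREFIX}/${p}/" "/home/agent/${p}/"
        done
    env:
      - name: S3_BUCKET
        value: "my-agent-auth"      # hypothetical bucket
      - name: S3_PREFIX
        value: "agents/kiro"
    envFrom:
      - secretRef:
          name: agent-s3-credentials   # hypothetical; holds AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
    volumeMounts:
      - name: home
        mountPath: /home/agent
containers:
  - name: agent-broker
    image: ghcr.io/thepagent/agent-broker   # assumed here to also ship the AWS CLI
    env:
      - name: S3_BUCKET
        value: "my-agent-auth"
      - name: S3_PREFIX
        value: "agents/kiro"
    envFrom:
      - secretRef:
          name: agent-s3-credentials
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            - |
              for p in .kiro/settings .kiro/steering; do
                # Skip paths that were never created (for example, no login yet).
                [ -d "/home/agent/${p}" ] || continue
                aws s3 sync "/home/agent/${p}/" "s3://${S3_BUCKET}/${S3_PREFIX}/${p}/"
              done
    volumeMounts:
      - name: home
        mountPath: /home/agent
```

A preStop hook only runs on graceful termination. If a node fails abruptly, the hook does not execute, so the next pod restores whatever the last successful backup wrote.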

**Why Deployment instead of StatefulSet:**
- No `volumeClaimTemplates` needed → no EBS AZ binding
- Simpler — no headless Service requirement
- `emptyDir` for runtime data is sufficient

---

##### Generic across all CLI types

Because each CLI stores its auth files under a different path in the home directory, the `persistence.paths` field lets users specify exactly what to back up:

| CLI | Paths to back up |
|---|---|
| kiro-cli | `.kiro/settings`, `.kiro/steering` |
| claude-agent-acp | `.claude/` |
| codex | `.codex/` |
| gemini | `.config/gemini/` |

No chart changes are needed when a new CLI is added; it only needs a new entry under `agents` with its own `persistence.paths`.
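As an illustration, adding the gemini CLI with a MinIO backend would be a values-only change. The bucket and Secret names below are hypothetical, and the omitted fields are assumed to take chart defaults.

```yaml
agents:
  gemini:
    command: gemini
    discord:
      botToken: ""
      allowedChannels:
        - "YOUR_CHANNEL_ID"
    persistence:
      s3:
        enabled: true
        bucket: "my-agent-auth"                      # hypothetical bucket
        prefix: "agents/gemini"
        credentialsSecret: "agent-s3-credentials"    # hypothetical Secret name
        endpoint: "http://minio.default.svc:9000"    # MinIO endpoint, as in the values example
      paths:
        - .config/gemini
```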

---

##### Breaking change notice

⚠️ This is a **breaking change** from `agent.preset`.

| Before | After |
|---|---|
| `agent.preset: kiro` | `agents.kiro.command: kiro-cli` |
| `discord.botToken` | `agents.kiro.discord.botToken` |
| `discord.allowedChannels` | `agents.kiro.discord.allowedChannels` |
| Single Deployment | One Deployment per agent |
| PVC / StatefulSet | S3/MinIO init container + preStop |

Recommend releasing as `0.4.0` with migration notes.
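The migration notes could include a before/after pair along these lines. This sketch follows the mapping in the table above; `<token>` is a placeholder, and the `args` value is copied from the proposed `kiro` entry.

```yaml
# Before: single agent selected by preset
agent:
  preset: "kiro"
discord:
  botToken: "<token>"
  allowedChannels:
    - "YOUR_CHANNEL_ID"
---
# After: one entry per agent under the agents map
agents:
  kiro:
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: "<token>"
      allowedChannels:
        - "YOUR_CHANNEL_ID"
```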

---

##### CI/CD: Helm chart testing

As part of this change, add automated Helm chart testing to CI:

```yaml
- name: Lint chart
  run: helm lint charts/agent-broker

- name: Template test (each agent type)
  run: |
    # Fail the step if helm template fails, not only if kubectl does
    set -euo pipefail
    for cmd in kiro-cli claude-agent-acp codex-acp gemini; do
      helm template test charts/agent-broker \
        --set "agents.test.command=$cmd" \
        --set agents.test.discord.botToken=test \
        --set "agents.test.discord.allowedChannels={123}" \
        --set "agents.test.args={}" \
        --set agents.test.pool.maxSessions=5 \
        --set agents.test.pool.sessionTtlHours=24 \
        --set agents.test.reactions.enabled=true \
        --set agents.test.reactions.removeAfterReply=false \
        | kubectl apply --dry-run=client -f -
    done
```
```
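A further check could assert the core property of the design, one Deployment per `agents` entry. The workflow below is a sketch with several assumptions: the workflow and job names are hypothetical, the chart templates are assumed to default any per-agent field not set here, and the chart's default values are assumed to define no agent other than `kiro`.

```yaml
name: helm-chart-test            # hypothetical workflow name
on:
  pull_request:
    paths:
      - "charts/**"
jobs:
  chart-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Template test (two agents, one release)
        run: |
          set -euo pipefail
          # Build the same minimal --set flags for each agent entry.
          args=()
          for name in kiro claude; do
            args+=(
              --set "agents.$name.command=kiro-cli"
              --set "agents.$name.discord.botToken=test"
              --set "agents.$name.discord.allowedChannels={123}"
            )
          done
          # Count rendered Deployments; two entries should yield two.
          count=$(helm template test charts/agent-broker "${args[@]}" \
            | grep -c '^kind: Deployment$')
          test "$count" -eq 2
```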

cc @thepagent
