proposal: agents map + S3/MinIO auth backup for multi-agent Helm chart #51

@neilkuan

Background

The current agent.preset approach only supports one agent per Helm release and couples agent selection to hidden template logic. After deeper analysis (including multi-AZ EKS behaviour), this proposal has evolved to address two problems together: multi-agent support and auth persistence across pod restarts.


Problem 1 — Single-agent limitation (preset)

Current design:

agent:
  preset: "kiro"

  • Only one agent per release
  • Adding a new preset requires editing _helpers.tpl
  • Cannot run kiro + claude simultaneously on different Discord channels

Problem 2 — PVC + StatefulSet is wrong for auth-only persistence

The obvious fix for stable storage across pod restarts is a StatefulSet with volumeClaimTemplates. However, this creates a cross-AZ problem on EKS (and on any multi-AZ cluster):

  • EBS volumes are AZ-scoped
  • If a node fails and the pod reschedules to a different AZ, the EBS volume cannot be attached → pod stuck in Pending
  • EFS avoids this but is overkill for a few KB of auth token files

The data that actually needs persistence is tiny and changes infrequently. In the table below, only the first four rows need to survive a restart; the last two are runtime data that can be rebuilt:

| Path | Content | Changes when |
| --- | --- | --- |
| .kiro/settings/ | OAuth login token | Only on re-login |
| .kiro/steering/ | Steering config | User changes |
| ~/.claude/ | Claude OAuth token | Only on re-login |
| ~/.codex/ | Codex auth | Only on re-login |
| .kiro/sessions/ | ACP session cache | Every restart (ephemeral) |
| .semantic_search/ | Search index | Rebuild on restart is fine |

A generic S3/MinIO backup approach handles all CLI types without AZ coupling.


Proposed design

values.yaml structure — agents map (one entry = one Deployment)

agents:
  kiro:
    image:
      repository: ghcr.io/thepagent/agent-broker
      tag: ""
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: ""
      allowedChannels:
        - "YOUR_CHANNEL_ID"
    workingDir: /home/agent
    env: {}
    envFrom: []
    pool:
      maxSessions: 10
      sessionTtlHours: 24
    reactions:
      enabled: true
      removeAfterReply: false
    persistence:
      s3:
        enabled: false
        bucket: ""
        prefix: "agents/kiro"
        credentialsSecret: ""   # Secret with AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY
        endpoint: ""            # for MinIO: http://minio.default.svc:9000
      paths:
        - .kiro/settings        # paths relative to $HOME to backup/restore
        - .kiro/steering
    resources: {}
    nodeSelector: {}
    tolerations: []
    affinity: {}
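
To illustrate how the map addresses Problem 1, here is a minimal values sketch that runs kiro and claude side by side in one release. The channel IDs, bucket name, and Secret name are placeholders invented for this example, and it assumes that keys omitted from an entry fall back to chart defaults, which the templates would have to implement.

```yaml
# Illustrative values.yaml: two agents, one Helm release.
agents:
  kiro:
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: ""                      # supply at install time
      allowedChannels:
        - "KIRO_CHANNEL_ID"             # placeholder
    persistence:
      s3:
        enabled: true
        bucket: "my-agent-auth"         # hypothetical bucket name
        prefix: "agents/kiro"
        credentialsSecret: "agent-s3-credentials"   # hypothetical Secret name
      paths:
        - .kiro/settings
        - .kiro/steering
  claude:
    command: claude-agent-acp
    args: []                            # assumption: no extra arguments required
    discord:
      botToken: ""
      allowedChannels:
        - "CLAUDE_CHANNEL_ID"           # placeholder
    persistence:
      s3:
        enabled: true
        bucket: "my-agent-auth"
        prefix: "agents/claude"         # separate prefix keeps each agent's auth apart
        credentialsSecret: "agent-s3-credentials"
      paths:
        - .claude
```

Each top-level key under agents would render its own Deployment, ConfigMap, and Secret, so the two bots can serve different Discord channels independently.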

Resources created per agent entry:

  • Deployment/<fullname>-<name> — replaces StatefulSet (no PVC needed)
  • ConfigMap/<fullname>-<name> — config.toml + optional AGENTS.md
  • Secret/<fullname>-<name> — discord bot token

Auth persistence via init container + preStop hook:

Pod startup:
  init container → for each entry in persistence.paths: aws s3 sync s3://<bucket>/<prefix>/<path>/ /home/agent/<path>/ (restore)
  main container → agent-broker starts with auth already in place

Pod shutdown:
  preStop hook → for each entry in persistence.paths: aws s3 sync /home/agent/<path>/ s3://<bucket>/<prefix>/<path>/ (backup)
  main container → terminates

Runtime data (sessions, search index) uses emptyDir — ephemeral, no AZ dependency.
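
The restore and backup flow could render to a pod spec roughly like the sketch below. This is not the chart's actual output. It assumes the amazon/aws-cli image for the init container, assumes the agent-broker image also ships the AWS CLI and a shell (a preStop exec hook runs inside the main container), and uses a hypothetical bucket and Secret name. One caveat worth weighing: Kubernetes runs preStop hooks only on graceful termination, so a token refreshed shortly before a node failure or OOM kill would not be backed up by this hook alone.

```yaml
# Sketch of a rendered pod spec for agents.kiro (names marked hypothetical are illustrative).
spec:
  terminationGracePeriodSeconds: 60     # assumption: long enough for the backup sync
  volumes:
    - name: home
      emptyDir: {}                      # ephemeral home directory, no AZ binding
  initContainers:
    - name: restore-auth
      image: amazon/aws-cli             # assumption: any image with the AWS CLI and a shell
      command: ["/bin/sh", "-c"]
      args:
        - |
          set -eu
          # The path list would be rendered from persistence.paths.
          for p in .kiro/settings .kiro/steering; do
            aws ${S3_ENDPOINT:+--endpoint-url "$S3_ENDPOINT"} s3 sync \
              "s3://$S3_BUCKET/$S3_PREFIX/$p/" "/home/agent/$p/"
          done
      envFrom:
        - secretRef:
            name: agent-s3-credentials  # hypothetical; persistence.s3.credentialsSecret
      env:
        - {name: S3_BUCKET, value: "my-agent-auth"}   # hypothetical bucket
        - {name: S3_PREFIX, value: "agents/kiro"}
        - {name: S3_ENDPOINT, value: ""}              # set for MinIO, empty for AWS S3
      volumeMounts:
        - {name: home, mountPath: /home/agent}
  containers:
    - name: agent
      image: ghcr.io/thepagent/agent-broker
      envFrom:
        - secretRef:
            name: agent-s3-credentials
      env:
        - {name: S3_BUCKET, value: "my-agent-auth"}
        - {name: S3_PREFIX, value: "agents/kiro"}
        - {name: S3_ENDPOINT, value: ""}
      volumeMounts:
        - {name: home, mountPath: /home/agent}
      lifecycle:
        preStop:
          exec:
            command:
              - /bin/sh
              - -c
              - |
                for p in .kiro/settings .kiro/steering; do
                  # Skip paths that were never created (e.g. no login yet).
                  [ -d "/home/agent/$p" ] || continue
                  aws ${S3_ENDPOINT:+--endpoint-url "$S3_ENDPOINT"} s3 sync \
                    "/home/agent/$p/" "s3://$S3_BUCKET/$S3_PREFIX/$p/"
                done
```

Syncing per path rather than the whole home directory keeps sessions and the search index out of the bucket. File ownership between the init container and the main container is not addressed in this sketch.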

Why Deployment instead of StatefulSet:

  • No volumeClaimTemplates needed → no EBS AZ binding
  • Simpler — no headless Service requirement
  • emptyDir for runtime data is sufficient

Generic across all CLI types

Since each CLI stores auth in a different home directory path, the persistence.paths field lets users specify exactly what to back up:

| CLI | Paths to back up |
| --- | --- |
| kiro-cli | .kiro/settings, .kiro/steering |
| claude-agent-acp | .claude/ |
| codex | .codex/ |
| gemini | .config/gemini/ |
No chart changes are needed when a new CLI is added.


Breaking change notice

⚠️ This is a breaking change from agent.preset.

| Before | After |
| --- | --- |
| agent.preset: kiro | agents.kiro.command: kiro-cli |
| discord.botToken | agents.kiro.discord.botToken |
| discord.allowedChannels | agents.kiro.discord.allowedChannels |
| Single Deployment | One Deployment per agent |
| PVC / StatefulSet | S3/MinIO init container + preStop |

Recommend releasing as 0.4.0 with migration notes.
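
For the migration notes, a before/after values pair along these lines may help users map their existing settings. It is a sketch derived from the key mapping above; the token and channel values are placeholders, and the args line is copied from the proposed kiro defaults rather than from any existing preset logic.

```yaml
# Before: values for the agent.preset layout
agent:
  preset: "kiro"
discord:
  botToken: "<token>"            # placeholder
  allowedChannels:
    - "YOUR_CHANNEL_ID"
---
# After: the same agent expressed in the agents map
agents:
  kiro:
    command: kiro-cli
    args: [acp, --trust-all-tools]
    discord:
      botToken: "<token>"        # placeholder
      allowedChannels:
        - "YOUR_CHANNEL_ID"
```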


CI/CD: Helm chart testing

As part of this change, add automated Helm chart testing:

- name: Lint chart
  run: helm lint charts/agent-broker

- name: Template test (each agent type)
  run: |
    # Fail the step when helm template fails, not only when kubectl does.
    set -euo pipefail
    for cmd in kiro-cli claude-agent-acp codex-acp gemini; do
      helm template test charts/agent-broker \
        --set "agents.test.command=$cmd" \
        --set agents.test.discord.botToken=test \
        --set "agents.test.discord.allowedChannels={123}" \
        --set "agents.test.args={}" \
        --set agents.test.pool.maxSessions=5 \
        --set agents.test.pool.sessionTtlHours=24 \
        --set agents.test.reactions.enabled=true \
        --set agents.test.reactions.removeAfterReply=false | kubectl apply --dry-run=client -f -
    done

cc @thepagent
