MWI: Spike running the AWS RA service when auth server unavailable #55070

boxofrad · 2025-05-22T18:16:07Z

This is a spike of a possible solution for #53711.

How it works

Rather than depending on *authclient.Client or client.Client directly, services now receive our own wrapper struct. If the bot identity service's initialization fails but there is an old identity on disk, initialization returns "successfully", but the client wrapper is rigged to return an error on every RPC.

This effectively allows tbot to run in a "degraded" state without a functioning API client, while the bot identity service's Run method retries the initialization in the background.

The Roles Anywhere service now writes its SVID to a "cache" destination (in-memory by default, but configurable to be on-disk). If the client returns an error because initialization failed, the service falls back to using the cached SVID for the exchange.

I've tested this manually by running tbot once with the following configuration, killing it, stopping the auth server, restarting tbot, and then restarting the auth server.

version: v2
proxy_server: <address>
onboarding:
  join_method: token
  token: <token>
storage:
  type: directory
  path: /Users/dan/Desktop/tbot/internal
services:
  - type: workload-identity-aws-roles-anywhere
    destination:
      type: directory
      path: /Users/dan/Desktop/tbot/aws-ra
    cache:
      type: directory
      path: /Users/dan/Desktop/tbot/aws-ra-cache
    role_arn: <arn>
    profile_arn: <arn>
    trust_anchor_arn: <arn>
    region: eu-west-2
    selector:
      name: aws-ra-workload-identity

Commit cbf818c: MWI: Spike running the AWS RA service when auth server unavailable

boxofrad added the do-not-merge label May 28, 2025

boxofrad mentioned this pull request Jun 10, 2025

MWI: Implement RFD 216 - tbot resiliency improvements #55609

Merged

boxofrad closed this Jun 11, 2025
