
fix(agent-iam): add route53:GetChange for DNS propagation polling #289

Closed
jcastiarena wants to merge 1 commit into main from fix/agent-iam-route53-getchange


@jcastiarena

Summary

The default IAM policy attached by infrastructure/aws/iam/agent already grants ChangeResourceRecordSets, ListResourceRecordSets, and the Get/List HostedZone* actions on arn:aws:route53:::hostedzone/*, but omits route53:GetChange on arn:aws:route53:::change/*.

The AWS Terraform provider polls GetChange while waiting for a new record to reach INSYNC after creation. Without the permission, scopes that create DNS records fail with AccessDenied after the record is already visible in the console. The failure mode is confusing: the change visibly succeeds and then the deploy rolls back.
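
For reference, a minimal sketch of the kind of resource that exercises this path. The zone ID variable, hostname, and target below are placeholders for illustration, not taken from any scope in this repo.

```hcl
# Hypothetical input: the hosted zone the record is written to.
variable "hosted_zone_id" {
  type = string
}

# Any record like this is enough to trigger the polling described above.
resource "aws_route53_record" "example" {
  zone_id = var.hosted_zone_id
  name    = "app.example.com" # placeholder hostname
  type    = "CNAME"
  ttl     = 300
  records = ["target.example.net"] # placeholder target
}

# On apply, the provider first calls route53:ChangeResourceRecordSets,
# which is authorized against arn:aws:route53:::hostedzone/<zone-id>.
# It then polls route53:GetChange, which is authorized against
# arn:aws:route53:::change/<change-id>, until the status is INSYNC.
# A role that has only the hostedzone/* statement passes the first call
# and is denied on the second.
```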

This is a cross-scope footgun, not specific to static-files — anything that writes DNS records through the provider hits it (k8s scopes with external-dns, frontend CloudFront distributions, anything with a CNAME/A).

Found during a customer POC installing the static-files scope. Related downstream documentation PR: nullplatform/scopes-static-files#8.

Change

One new statement added to aws_iam_policy.nullplatform_route53_policy:

{
  "Sid" : "Route53GetChange",
  "Effect" : "Allow",
  "Action" : [
    "route53:GetChange"
  ],
  "Resource" : [
    "arn:aws:route53:::change/*"
  ]
}

Resource is change/* (not hostedzone/*) — GetChange lives in a different ARN namespace. Style matches the existing Route 53 statement (jsonencode block, same indentation).

Test plan

  • tofu fmt -recursive clean
  • tofu init -backend=false && tofu test — 5/5 pass (route53_policy_naming, elb_policy_naming, eks_policy_naming, avp_policy_naming, all_policies_valid_json)
  • Attach the updated role to an agent in a test account and run start-initial for a scope that creates a DNS record (e.g. static-files); confirm deploy reaches finalized without AccessDenied on GetChange.
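
A regression check along these lines could sit next to the existing tests. This is only a sketch: the run block name is hypothetical, and it assumes the policy document is known at plan time and that the test file reuses whatever provider setup lets the current five tests run without credentials.

```hcl
# Hypothetical addition to the module's *.tftest.hcl file.
run "route53_policy_includes_getchange" {
  command = plan

  assert {
    # Collect every Action across all statements and look for GetChange.
    # flatten() copes with Action being either a string or a list.
    condition = contains(
      flatten([
        for s in jsondecode(aws_iam_policy.nullplatform_route53_policy.policy).Statement : s.Action
      ]),
      "route53:GetChange"
    )
    error_message = "Route 53 policy must allow route53:GetChange for change polling."
  }
}
```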

🤖 Generated with Claude Code

The default agent IAM policy already grants ChangeResourceRecordSets and
the Get/List HostedZone actions, but omits route53:GetChange on
`arn:aws:route53:::change/*`. The AWS Terraform provider polls GetChange
while waiting for a record to reach INSYNC after creation; without it,
any scope that creates DNS records (static-files, k8s with external-dns,
anything with a CNAME/A record) fails with AccessDenied *after* the
record is successfully created. The failure mode is confusing because
the record is visible in the console but the deploy rolls back.

Found while installing the static-files scope at a customer POC.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@jcastiarena
Author

Cross-cloud audit note — this fix lives correctly at the AWS layer (infrastructure/aws/iam/agent/), but the class of issue (IAM policy missing a permission the Terraform provider needs for async-operation polling after the primary change succeeds) is not AWS-specific.

Worth auditing analogous modules in this repo:

  • infrastructure/azure/iam/agent/ (or equivalent path) — does the Azure DNS module grant the permissions needed for record-propagation / async-operation polling? The AzureRM provider has its own wait-for-completion primitives, so the failure mode would look different but the gap would be symmetric.
  • infrastructure/gcp/iam/agent/ (if present) — same question for Cloud DNS.

Not in scope for this PR. Flagging here so it doesn't fall through the cracks when an operator tries to install a DNS-creating scope on Azure or GCP and hits the same wall we hit on AWS.

Related downstream PR documenting the AWS operator experience: nullplatform/scopes-static-files#8.

@jcastiarena
Author

Closing — permission is scope-specific, not vanilla agent.

After discussion with the team (cc @agustincelentano), we concluded that route53:GetChange is needed by Terraform-based scopes that create DNS records (static-files, and potentially lambda/networking), not by the vanilla agent which uses external-dns (SDK-based, doesn't poll GetChange).

The right place for this permission is each customer's additional_policies block when they install a scope that needs it, not the default agent IAM. This is already documented in the downstream PR, nullplatform/scopes-static-files#8.

The cert_manager module in this same repo (infrastructure/aws/iam/cert_manager/main.tf:27-29) also has route53:GetChange as a standalone statement — that's the same pattern: opt-in module, not vanilla default.

Architectural principle: the vanilla agent IAM (infrastructure/aws/iam/agent) should only include permissions the base platform needs (EKS, Route53 record management, ELB, AVP). Scope-specific permissions are "extra" and belong in additional_policies per-setup.
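
As a rough sketch of the opt-in approach, a setup that installs a DNS-creating scope could define a small standalone policy like the one below and attach it through additional_policies. The resource label and policy name are hypothetical, and the exact shape of the additional_policies input is not shown here because it depends on the agent module's variables.

```hcl
# Hypothetical per-setup policy; only needed when a Terraform-based scope
# creates Route 53 records.
resource "aws_iam_policy" "scope_route53_getchange" {
  name        = "scope-route53-getchange" # placeholder name
  description = "Allows Terraform-based scopes to poll Route 53 change status"

  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [
      {
        Sid      = "Route53GetChange"
        Effect   = "Allow"
        Action   = ["route53:GetChange"]
        Resource = ["arn:aws:route53:::change/*"]
      }
    ]
  })
}

# The resulting ARN (aws_iam_policy.scope_route53_getchange.arn) is what
# would be handed to the agent role via additional_policies.
```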
