feat: dynamic assume-role support, configurable placeholder image & install tofu consolidation #25

Draft
davidf-null wants to merge 17 commits into main from feature/assume-role-support

@davidf-null davidf-null commented Jun 3, 2026

Summary

Adds dynamic assume-role support and a configurable placeholder image to the
AWS Lambda scope, introduces a requirements module with the IAM policies the scope
needs, and bundles the deploy/state/IAM fixes found while testing — plus a
security/docs hardening pass to keep the published scope account-agnostic.

Changes

Dynamic assume role

  • Resolve the assume-role ARN from the scope-configurations provider
    (assume_role.arn) with ASSUME_ROLE_ARN_DEFAULT as an env-var fallback; when set,
    AWS operations run under the assumed role, otherwise the agent's pod credentials are
    used.
  • Surface sts:AssumeRole errors to stdout so they are visible in NP logs.
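
The resolution order and the error surfacing described above can be sketched roughly as follows. This is an illustration, not the PR's utils/assume_role: the function names resolve_assume_role_arn and assume_role and the default session name are hypothetical, and the sketch assumes the AWS CLI is on the PATH.

```shell
#!/bin/bash
# Hypothetical sketch; the function names are not taken from the PR.

# Prefer the provider value, then ASSUME_ROLE_ARN_DEFAULT, else empty.
resolve_assume_role_arn() {
  local provider_arn="${1:-}"
  printf '%s\n' "${provider_arn:-${ASSUME_ROLE_ARN_DEFAULT:-}}"
}

# Export temporary credentials for the role; do nothing when no ARN is given,
# so the caller keeps using the pod credentials.
assume_role() {
  local role_arn="${1:-}" session_name="${2:-np-lambda-scope}"
  if [ -z "$role_arn" ]; then
    return 0
  fi

  local creds err_file
  err_file=$(mktemp)
  if ! creds=$(aws sts assume-role \
      --role-arn "$role_arn" \
      --role-session-name "$session_name" \
      --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
      --output text 2>"$err_file"); then
    # Print the STS error on stdout so a stdout-only log collector sees it.
    echo "ERROR: sts:AssumeRole failed for $role_arn"
    cat "$err_file"
    rm -f "$err_file"
    return 1
  fi
  rm -f "$err_file"

  # --output text yields the three values separated by tabs on one line.
  read -r AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN <<<"$creds"
  export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN
}
```

A caller would run something like `assume_role "$(resolve_assume_role_arn "$arn_from_provider")" || exit 1` before any other AWS call, where `arn_from_provider` is again a placeholder name.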

Installation tofu (lambda/setup/) + IAM policies

  • IAM policies required for Lambda scope operations (Lambda core, IAM roles,
    networking, storage/observability).
  • Prefix the Lambda execution role with np-lambda- to match the policy resource
    constraint.
  • Add the modern CloudWatch Logs tagging actions to the policy.
  • Consolidation (reviewer feedback): all installation-time tofu now lives under
    lambda/setup/ — the scope-registration module (formerly lambda/specs/tofu/) and
    the IAM policies (formerly the standalone lambda/requirements/ module) are merged
    there. A single tofu apply in lambda/setup/ registers the scope type and
    provisions the IAM policies. name is now a required setup variable; attaching the
    policies stays optional via create_role / role_name.

Configurable placeholder image

  • PLACEHOLDER_IMAGE_URI_DEFAULT env-var fallback for the placeholder image, with
    precedence: scope-config deployment.placeholder_image_uri >
    PLACEHOLDER_IMAGE_URI_DEFAULT > the script's public default.
  • Use the exact PLACEHOLDER_IMAGE_URI when explicitly set and stop appending an
    automatic architecture suffix (publish -amd64 / -arm64 tags instead).
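
The precedence above amounts to "first non-empty value wins". A minimal sketch, assuming a hypothetical function name resolve_placeholder_image_uri and a caller that passes in the script's public default (its real value is not shown in this PR):

```shell
#!/bin/bash
# Hypothetical sketch of the three-level precedence.
# $1: deployment.placeholder_image_uri from the scope config (may be empty)
# $2: the script's public default (placeholder value supplied by the caller)
resolve_placeholder_image_uri() {
  local scope_config_uri="${1:-}"
  local public_default="${2:-}"
  # Nested ":-" expansions fall through on unset or empty values.
  # The chosen URI is printed as-is; no architecture suffix is appended.
  printf '%s\n' "${scope_config_uri:-${PLACEHOLDER_IMAGE_URI_DEFAULT:-$public_default}}"
}
```

Because the URI is used verbatim, an operator who needs an architecture-specific image would publish a tag such as `-arm64` and set the full tag explicitly.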

Deploy / state / tofu fixes

  • Ensure the Lambda pull policy on the image's ECR repo before update.
  • Add the missing diagnose.yaml workflow for the diagnose-deployment action.
  • Read TOFU_STATE_BUCKET from .provider.aws_state_bucket as a fallback.
  • Surface tofu apply stderr to stdout for visibility in NP logs.
  • Correct the nullplatform provider version constraint in specs/tofu.

Security & docs hardening

  • Remove the account-specific defaults committed for testing
    (ASSUME_ROLE_ARN_DEFAULT / PLACEHOLDER_IMAGE_URI_DEFAULT carried a real account
    ARN/URI) so the product repo stays account-agnostic.
  • Re-add PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented,
    account-agnostic template so operators can pick their own image without a
    hardcoded value.
  • Normalize a stray real-looking account ID in a publish comment to the dummy
    123456789012.
  • README: new Placeholder Image (Scope Bootstrap) section explaining why
    Image-based scopes need a private-ECR placeholder, the resolution precedence, how to
    publish one, and a troubleshooting entry.

Note: the real account IDs (235494813897, 688720756067) appear in the branch
history (52cba87 and earlier commits); they are removed from the working tree but
not scrubbed from history (account IDs are low-sensitivity).

Test plan

  • Scope create/update/delete with no assume role → pod credentials used end-to-end
  • Scope ops with assume_role.arn set → run under the assumed role; errors surface
    in NP logs on failure
  • create_role requirements module applies with all IAM policies attached
  • Zip-package scope create → uses the embedded placeholder, no image config needed
  • Image-package scope create → uses PLACEHOLDER_IMAGE_URI_DEFAULT from a private
    ECR (single-arch tag matching the scope architecture)
  • diagnose-deployment action runs via the new diagnose.yaml workflow

🤖 Generated with Claude Code

David Fernandez and others added 17 commits June 1, 2026 15:28
When assume_role.arn is set in the scope-configurations provider, the agent's
base credentials (IRSA) are used only to call sts:AssumeRole; all subsequent
AWS calls (CLI + Tofu) run under the target role. Falls back to ASSUME_ROLE_ARN_DEFAULT
in values.yaml if the provider key is absent. When neither is set, behavior is
unchanged — pod credentials (IRSA) are used directly.

- New utils/assume_role: sourceable helper that exports temporary credentials
- fetch_scope_configuration: reads assume_role.arn from scope-configurations
  provider and applies the role immediately after config is fetched
- diagnose/build_context: explicit assume_role sourcing (only build_context
  that bypasses fetch_scope_configuration)
- values.yaml: documents ASSUME_ROLE_ARN_DEFAULT as fallback config option

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ations

Creates 4 IAM policies covering all AWS operations needed by the lambda scope:
- lambda_policy: Lambda CRUD, versions, aliases, concurrency
- lambda_iam_policy: execution role management (nullplatform-* and np-lambda-*)
- lambda_networking_policy: API Gateway, ALB, Route53
- lambda_storage_policy: ECR, Secrets Manager, CloudWatch, S3 tfstate

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uffix

When PLACEHOLDER_IMAGE_URI is set in values.yaml the operator has already
chosen the exact tag — no architecture suffix should be appended.
Sets the default to :latest (no arch suffix) for this deployment.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The public ECR image only exists as :latest without architecture-specific
tags. Remove the -arm64/-amd64 append logic from the default path.
Users who publish arch-specific images can set PLACEHOLDER_IMAGE_URI
explicitly to the full tag they need.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The existing scope-configurations provider in this account uses a different
schema (.provider.aws_state_bucket) than our Lambda spec (.state.tofu_state_bucket).
Add fallback to support both schemas without requiring a new provider instance.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
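
The two-schema fallback from the commit above can be sketched as below. The jq expression is the one that appears in this PR's diff; the wrapper name read_tofu_state_bucket and the sample payloads in the usage note are made up for illustration, and jq is assumed to be installed.

```shell
#!/bin/bash
# Hypothetical wrapper around the jq expression from the diff.
read_tofu_state_bucket() {
  local scope_config="$1"
  # "//" is jq's alternative operator: when the left side is null or false,
  # the right side is evaluated; "empty" yields no output if neither key exists.
  printf '%s' "$scope_config" |
    jq -r '.state.tofu_state_bucket // .provider.aws_state_bucket // empty'
}
```

With made-up payloads, `read_tofu_state_bucket '{"state":{"tofu_state_bucket":"a"}}'` reads the Lambda spec key, while `read_tofu_state_bucket '{"provider":{"aws_state_bucket":"b"}}'` falls through to the older schema.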
…rements policy

The scope execution role was named "${function}-role", which didn't match the
iam:CreateRole/PassRole Resource constraint (arn:aws:iam::*:role/np-lambda-*) in
lambda/requirements, causing AccessDenied at tofu apply. Prefixing aligns the
role name with the policy the assumed role already grants.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
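
To make the naming constraint from the commit above concrete, here is a small sketch. Both function names are hypothetical; the prefix and the "-role" suffix mirror the assignment shown in this PR's diff.

```shell
#!/bin/bash
# Hypothetical helpers illustrating the role-name constraint.

# Build the execution role name the way the diff does.
lambda_execution_role_name() {
  printf 'np-lambda-%s-role\n' "$1"
}

# True when the name falls under arn:aws:iam::*:role/np-lambda-*
role_matches_policy_constraint() {
  [[ "$1" == np-lambda-* ]]
}
```

Under the old scheme a name such as `orders-role` fails the check, which is consistent with the AccessDenied seen at tofu apply; `np-lambda-orders-role` passes.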
OpenTofu writes its "Error:" block to stderr, but the NP workflow executor only
captures stdout — so the real failure reason (e.g. AWS AccessDenied) never showed
in the logs, leaving only a generic "scope creation failed". Redirect stderr to
stdout on the apply and stop sending the script's own error message to stderr.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
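
The redirect described in the commit above can be illustrated with a tiny wrapper. The name run_with_visible_errors is hypothetical, and the sketch only assumes that the executor captures stdout, as the commit message states.

```shell
#!/bin/bash
# Hypothetical wrapper: run a command with stderr merged into stdout,
# preserving the command's exit status.
run_with_visible_errors() {
  "$@" 2>&1
}
```

For example, `run_with_visible_errors tofu apply -auto-approve` would put OpenTofu's "Error:" block on stdout, while the caller can still branch on the exit status.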
…ements policy

The AWS provider (v5) reads log group tags via logs:ListTagsForResource and
manages them via logs:TagResource/UntagResource — the generic resource-tagging
API — but the policy only granted the deprecated logs:TagLogGroup. Creating a
scope's aws_cloudwatch_log_group failed with AccessDenied on ListTagsForResource.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R_IMAGE_URI_DEFAULT

Adds an env-var fallback for the Lambda placeholder image, mirroring the existing
ASSUME_ROLE_ARN_DEFAULT pattern. Precedence: scope-config
deployment.placeholder_image_uri > PLACEHOLDER_IMAGE_URI_DEFAULT (values.yaml) >
script's hardcoded default. Lets operators point the placeholder at a private ECR
mirror per account without a scope-configuration value or code changes.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… update

Container-image Lambdas require the source ECR repo to grant lambda.amazonaws.com
pull access; without it update-function-code fails with "Lambda does not have
permission to access the ECR image". update_function_code now sets the standard
LambdaECRImageRetrievalPolicy on the image's repo (idempotent, best-effort), and
the requirements role gains ecr:Get/SetRepositoryPolicy. Removes the need to set
the policy by hand per application repo.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
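
A rough sketch of the best-effort step from the commit above. The function names and the regular expression are hypothetical, and the policy document is an assumption based on the pull permissions AWS documents for container-image Lambdas; the PR's actual policy text is not shown here.

```shell
#!/bin/bash
# Hypothetical sketch; assumes the AWS CLI is on the PATH.

# Print "<region> <repo>" for a private ECR image URI, or return 1 otherwise.
parse_ecr_image_uri() {
  local uri="$1"
  local re='^[0-9]+\.dkr\.ecr\.([a-z0-9-]+)\.amazonaws\.com/([^:@]+)'
  [[ "$uri" =~ $re ]] || return 1
  printf '%s %s\n' "${BASH_REMATCH[1]}" "${BASH_REMATCH[2]}"
}

# Assumed policy: let the Lambda service pull image layers from the repo.
lambda_pull_policy='{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "LambdaECRImageRetrievalPolicy",
    "Effect": "Allow",
    "Principal": {"Service": "lambda.amazonaws.com"},
    "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]
  }]
}'

# Best-effort: warn and continue when the policy cannot be set.
ensure_lambda_pull_policy() {
  local image_uri="$1" parsed region repo
  parsed=$(parse_ecr_image_uri "$image_uri") || return 0  # not private ECR: skip
  region=${parsed%% *}
  repo=${parsed#* }
  if aws ecr set-repository-policy --repository-name "$repo" --region "$region" \
      --policy-text "$lambda_pull_policy" >/dev/null 2>&1; then
    echo "ensured Lambda pull policy on ECR repo $repo"
  else
    echo "could not set Lambda pull policy on ECR repo $repo (continuing)"
  fi
}
```

One design caveat: set-repository-policy replaces the repository's entire policy text, so a repo that already carries other statements would likely need a read-and-merge step first.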
…nt action

The diagnose-deployment action mapped to deployment/workflows/diagnose.yaml,
which did not exist, so every auto-diagnose after a failed deployment errored
with "failed to read workflow file". Adds the workflow mirroring the scope
diagnose flow: lean diagnose/build_context + executor over diagnose/checks.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ASSUME_ROLE_ARN_DEFAULT and PLACEHOLDER_IMAGE_URI_DEFAULT carried a real AWS
account ARN/URI committed for testing. The product repo must stay account-agnostic:
both are now documented as account-specific and provided per-installation via the
scope-configurations provider or the agent's extra_envs (Helm), not hardcoded here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…URI_DEFAULT knob

Document why Image-based scopes need a private-ECR placeholder and how the
URI is resolved (provider key > PLACEHOLDER_IMAGE_URI_DEFAULT > public default),
including how to publish one and a troubleshooting entry.

Also re-add PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented,
account-agnostic template so operators can pick their own image, and normalize
a stray real-looking account ID in a publish comment to the dummy 123456789012.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e requirements

Reviewer feedback: the standalone requirements/ folder should not sit at the
lambda/ root — all installation-time tofu should live together under a setup module.

- Move lambda/specs/tofu/ -> lambda/setup/ (the operator-applied install module).
- Merge lambda/requirements/ into lambda/setup/ (requirements.tf + outputs.tf, and
  its variables folded into setup/variables.tf); remove the requirements/ folder.
- A single 'tofu apply' in lambda/setup now registers the scope type AND provisions
  the IAM policies. The 4 policies are always created; attaching them stays optional
  via create_role / role_name.
- Add the aws provider (~> 5.0) + provider block to setup/provider.tf and a nullable
  aws_region var (IAM is global). 'name' is now a required setup variable.
- Update backend key to lambda/setup/terraform.tfstate.
- Refresh references: installation.md (cd path + IAM vars table), prerequisites.md
  (setup/main.tf), and the iam/setup comment.

Verified with 'tofu validate' (Success).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@davidf-null davidf-null changed the title from "feat: dynamic assume-role support, configurable placeholder image & requirements module" to "feat: dynamic assume-role support, configurable placeholder image & install tofu consolidation" Jun 4, 2026
# update-function-code fails with "Lambda does not have permission to access
# the ECR image". Idempotent and best-effort (cross-account repos may not be
# writable from here — Lambda would then need the policy set on the source side).
if [[ "$IMAGE_URI" == *.dkr.ecr.*.amazonaws.com/* ]]; then
Contributor

I don't understand why we need this.

What happens if the URI is not from amazonaws.com?

Author

Lambda with Docker only works with ECR images: https://docs.aws.amazon.com/es_es/lambda/latest/dg/images-create.html

if aws ecr set-repository-policy --repository-name "$ecr_repo" --region "$ecr_region" --policy-text "$lambda_pull_policy" >/dev/null 2>&1; then
log debug " ✅ ensured Lambda pull policy on ECR repo $ecr_repo"
else
log warn " ⚠️ could not set Lambda pull policy on ECR repo $ecr_repo (continuing; pull may fail if not already allowed)"
Contributor

Is it confirmed that this can actually work afterwards? If it is certain to fail later, I would raise an error.

Comment on lines +43 to +45
# Use the image URI as-is. If PLACEHOLDER_IMAGE_URI is not set, the default
# :latest tag is used without any architecture suffix — publish arch-specific
# tags and set PLACEHOLDER_IMAGE_URI explicitly if needed.
Contributor

Let's remove this comment.

Comment on lines -7 to +9
- iam_role_name="${LAMBDA_FUNCTION_NAME}-role"
+ # Prefix with "np-lambda-" so the role name matches the iam:CreateRole/PassRole
+ # Resource constraint in lambda/setup (arn:aws:iam::*:role/np-lambda-*).
+ iam_role_name="np-lambda-${LAMBDA_FUNCTION_NAME}-role"
Contributor

This is a breaking change. Also, why are we forcing the role name to end with -role? I would use only what comes from the variable and respect that convention.

Author

Do you mean we should validate the variable's value? In any case, this goes away with the provider change.

Comment thread lambda/scope/tofu/do_tofu
fi

# Run tofu action
# Redirect stderr to stdout: OpenTofu writes its "Error:" block to stderr, and the
Contributor

Remove this comment.

Comment thread lambda/setup/outputs.tf
@@ -0,0 +1,29 @@
output "lambda_policy_arn" {
Contributor

Move all these files back to lambda/spec.

The convention is that everything lives in that folder; it is where people know to look for the tofu modules, and it is how all the scopes are laid out.

Contributor

The setup folder should not exist; put everything in the existing folders.

Comment thread lambda/utils/assume_role
@@ -0,0 +1,41 @@
#!/bin/bash
# Sourceable helper — do NOT execute directly.
Contributor

@fedemaleh fedemaleh Jun 5, 2026

This comment should only say what the script does; let's remove the part about it being sourceable and the requirements.

Comment thread lambda/utils/assume_role
fi
}

if [ -n "${ASSUME_ROLE_ARN:-}" ]; then
Contributor

I would use a more specific name such as "SCOPE_LAMBDA_ASSUME_ROLE_ARN".

In general, a single agent runs different scopes and services. If you use generic variable names (which can be set as env vars on the agent), collisions are likely.

Comment thread lambda/utils/assume_role
# Expects: ASSUME_ROLE_ARN (exported by fetch_scope_configuration or values.yaml)
# SCOPE_ID (optional, used for the session name)

_ar_log() {
Contributor

I would remove this and assume log exists, as in the rest of the scripts. We build the workflows, so we can make sure it is exported.

Comment thread lambda/utils/assume_role
Comment on lines +27 to +30
_ar_log info "ERROR: sts:AssumeRole failed for $ASSUME_ROLE_ARN"
_ar_log info "$(cat "$_ar_sts_error")"
rm -f "$_ar_sts_error"
return 1
Contributor

This sounds like it should be an exit so that it fails. For the logs, let's use level error or warn.


# From scope-configurations category
- TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // empty')
+ TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // .provider.aws_state_bucket // empty')
Contributor

Where did .provider.aws_state_bucket come from?

I downloaded the payload of a real config and it looks like this:

{
  "attributes": {
    "deployment": {
      "placeholder_image_uri": "855647970243.dkr.ecr.us-east-1.amazonaws.com/aws-lambda/nullplatform-lambda-placeholder:latest"
    },
    "state": {
      "tofu_state_bucket": "gal3-scopes-tfstate-galicia-3-68bb45dd"
    }
  },
  "created_at": "2026-05-19T16:44:54.180Z",
  "dimensions": {},
  "groups": [],
  "id": "70a0a1fa-dea0-4db1-9d6c-fe71b1843186",
  "nrn": "organization=1636958496:account=1807223679",
  "specification_id": "80fc7026-7164-4c09-8a4f-424dc3b6aa50",
  "tags": [],
  "updated_at": "2026-05-19T16:44:54.180Z"
}


PLACEHOLDER_IMAGE_URI=$(echo "$SCOPE_CONFIG" | jq -r '.deployment.placeholder_image_uri // empty')
log debug " ✅ placeholder_image_uri=$PLACEHOLDER_IMAGE_URI"
# Fallback to env var set in values.yaml when the provider does not supply it.
Contributor

Remove this comment, and the ones on lines 81 and 100 as well.

Comment on lines +82 to +83
ASSUME_ROLE_ARN=$(echo "$SCOPE_CONFIG" | jq -r '.assume_role.arn // empty')
ASSUME_ROLE_ARN="${ASSUME_ROLE_ARN:-${ASSUME_ROLE_ARN_DEFAULT:-}}"
Contributor

This is not going to come from the scope config; it is going to come from the new provider, right?


2 participants