feat: dynamic assume-role support, configurable placeholder image & install tofu consolidation #25
davidf-null wants to merge 17 commits into
Conversation
When assume_role.arn is set in the scope-configurations provider, the agent's base credentials (IRSA) are used only to call sts:AssumeRole; all subsequent AWS calls (CLI + Tofu) run under the target role. Falls back to ASSUME_ROLE_ARN_DEFAULT in values.yaml if the provider key is absent. When neither is set, behavior is unchanged: pod credentials (IRSA) are used directly.
- New utils/assume_role: sourceable helper that exports temporary credentials
- fetch_scope_configuration: reads assume_role.arn from scope-configurations provider and applies the role immediately after config is fetched
- diagnose/build_context: explicit assume_role sourcing (only build_context that bypasses fetch_scope_configuration)
- values.yaml: documents ASSUME_ROLE_ARN_DEFAULT as fallback config option
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
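The flow in this commit message can be sketched roughly as follows. This is a minimal illustration, not the PR's actual utils/assume_role: the function name assume_role_if_configured and the session-name format are assumptions, and it presumes the AWS CLI is on PATH with base credentials allowed to call sts:AssumeRole.

```shell
#!/bin/bash
# Hypothetical sketch of a sourceable assume-role helper (not the PR's code).

assume_role_if_configured() {
  # Resolution order from the commit message: provider value first, then the
  # values.yaml default, otherwise keep the pod credentials untouched.
  local role_arn="${ASSUME_ROLE_ARN:-${ASSUME_ROLE_ARN_DEFAULT:-}}"
  if [ -z "$role_arn" ]; then
    return 0
  fi

  # Session name format is an assumption made for this sketch.
  local session_name="scope-${SCOPE_ID:-unknown}"

  # Ask STS for temporary credentials as one tab-separated line.
  local creds
  creds=$(aws sts assume-role \
    --role-arn "$role_arn" \
    --role-session-name "$session_name" \
    --query 'Credentials.[AccessKeyId,SecretAccessKey,SessionToken]' \
    --output text) || return 1

  local key secret token
  read -r key secret token <<EOF
$creds
EOF
  if [ -z "$key" ] || [ -z "$secret" ] || [ -z "$token" ]; then
    return 1
  fi

  # Every later AWS CLI / tofu call in this shell now runs as the target role.
  export AWS_ACCESS_KEY_ID="$key"
  export AWS_SECRET_ACCESS_KEY="$secret"
  export AWS_SESSION_TOKEN="$token"
}
```

A caller would source the file and then run assume_role_if_configured before its first AWS call, checking the return status.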
…ations
Creates 4 IAM policies covering all AWS operations needed by the lambda scope:
- lambda_policy: Lambda CRUD, versions, aliases, concurrency
- lambda_iam_policy: execution role management (nullplatform-* and np-lambda-*)
- lambda_networking_policy: API Gateway, ALB, Route53
- lambda_storage_policy: ECR, Secrets Manager, CloudWatch, S3 tfstate
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…uffix When PLACEHOLDER_IMAGE_URI is set in values.yaml the operator has already chosen the exact tag — no architecture suffix should be appended. Sets the default to :latest (no arch suffix) for this deployment. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The public ECR image only exists as :latest without architecture-specific tags. Remove the -arm64/-amd64 append logic from the default path. Users who publish arch-specific images can set PLACEHOLDER_IMAGE_URI explicitly to the full tag they need. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The existing scope-configurations provider in this account uses a different schema (.provider.aws_state_bucket) than our Lambda spec (.state.tofu_state_bucket). Add fallback to support both schemas without requiring a new provider instance. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rements policy
The scope execution role was named "${function}-role", which didn't match the
iam:CreateRole/PassRole Resource constraint (arn:aws:iam::*:role/np-lambda-*) in
lambda/requirements, causing AccessDenied at tofu apply. Prefixing aligns the
role name with the policy the assumed role already grants.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
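To illustrate why the prefix matters, here is a small sketch. IAM evaluates the Resource pattern server-side; the shell glob below only mirrors the np-lambda-* pattern locally. The function name matches_role_constraint and the example function name are assumptions for illustration.

```shell
#!/bin/bash
# Local illustration of the arn:aws:iam::*:role/np-lambda-* name constraint.

matches_role_constraint() {
  # Succeeds only when the role name starts with "np-lambda-".
  case "$1" in
    np-lambda-*) return 0 ;;
    *) return 1 ;;
  esac
}

LAMBDA_FUNCTION_NAME="${LAMBDA_FUNCTION_NAME:-my-function}"  # example value
old_name="${LAMBDA_FUNCTION_NAME}-role"
new_name="np-lambda-${LAMBDA_FUNCTION_NAME}-role"

if ! matches_role_constraint "$old_name"; then
  echo "$old_name: outside the policy constraint"
fi
if matches_role_constraint "$new_name"; then
  echo "$new_name: covered by the policy constraint"
fi
```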
OpenTofu writes its "Error:" block to stderr, but the NP workflow executor only captures stdout — so the real failure reason (e.g. AWS AccessDenied) never showed in the logs, leaving only a generic "scope creation failed". Redirect stderr to stdout on the apply and stop sending the script's own error message to stderr. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
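The effect of the redirect can be shown with a stand-in command. fake_tofu_apply is a hypothetical substitute for the real tofu apply call so the sketch runs without OpenTofu installed; the command substitution plays the role of an executor that captures only stdout.

```shell
#!/bin/bash
# Sketch: an executor that captures stdout only loses anything written to stderr.

fake_tofu_apply() {
  echo "Applying..."
  echo "Error: AccessDenied (example message)" >&2
  return 1
}

# Without the redirect, the "Error:" block never reaches the captured output.
stdout_only=$(fake_tofu_apply 2>/dev/null)

# With 2>&1, the error text travels on stdout and ends up in the capture.
merged=$(fake_tofu_apply 2>&1)

echo "stdout only: $stdout_only"
echo "merged: $merged"
```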
…ements policy The AWS provider (v5) reads log group tags via logs:ListTagsForResource and manages them via logs:TagResource/UntagResource — the generic resource-tagging API — but the policy only granted the deprecated logs:TagLogGroup. Creating a scope's aws_cloudwatch_log_group failed with AccessDenied on ListTagsForResource. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…R_IMAGE_URI_DEFAULT Adds an env-var fallback for the Lambda placeholder image, mirroring the existing ASSUME_ROLE_ARN_DEFAULT pattern. Precedence: scope-config deployment.placeholder_image_uri > PLACEHOLDER_IMAGE_URI_DEFAULT (values.yaml) > script's hardcoded default. Lets operators point the placeholder at a private ECR mirror per account without a scope-configuration value or code changes. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
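The precedence described here maps onto nested shell parameter defaults. A minimal sketch, assuming a hypothetical function name resolve_placeholder_image and a made-up public default URI (the script's real default is not shown in this excerpt):

```shell
#!/bin/bash
# Sketch of the precedence: scope-config value > PLACEHOLDER_IMAGE_URI_DEFAULT > hardcoded default.

resolve_placeholder_image() {
  # $1: value read from scope-config deployment.placeholder_image_uri (may be empty)
  local from_scope_config="$1"
  # Hypothetical stand-in for the script's hardcoded default.
  local hardcoded_default="public.ecr.aws/example/placeholder:latest"
  echo "${from_scope_config:-${PLACEHOLDER_IMAGE_URI_DEFAULT:-$hardcoded_default}}"
}
```

Calling resolve_placeholder_image "" with PLACEHOLDER_IMAGE_URI_DEFAULT exported returns the env-var value; passing a non-empty argument always wins.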
… update Container-image Lambdas require the source ECR repo to grant lambda.amazonaws.com pull access; without it update-function-code fails with "Lambda does not have permission to access the ECR image". update_function_code now sets the standard LambdaECRImageRetrievalPolicy on the image's repo (idempotent, best-effort), and the requirements role gains ecr:Get/SetRepositoryPolicy. Removes the need to set the policy by hand per application repo. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
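A rough sketch of the steps this commit describes: detect a private ECR image URI, derive the region and repository name from it, and build the pull policy. The function names are hypothetical, and the policy document is the standard statement from the AWS Lambda container-image documentation; the PR's actual $lambda_pull_policy is not shown in this excerpt, so treat the JSON as an assumption.

```shell
#!/bin/bash
# Hypothetical helpers; not the PR's update_function_code implementation.

parse_ecr_image_uri() {
  # Sets ecr_region and ecr_repo from a private ECR image URI; returns 1 otherwise.
  local uri="$1" host path
  case "$uri" in
    *.dkr.ecr.*.amazonaws.com/*) ;;
    *) return 1 ;;
  esac
  host="${uri%%/*}"                # e.g. 123456789012.dkr.ecr.us-east-1.amazonaws.com
  path="${uri#*/}"                 # e.g. team/app:latest
  ecr_region="${host#*.dkr.ecr.}"  # us-east-1.amazonaws.com
  ecr_region="${ecr_region%%.*}"   # us-east-1
  ecr_repo="${path%%@*}"           # drop an @sha256:... digest if present
  ecr_repo="${ecr_repo%%:*}"       # drop a :tag if present
}

lambda_pull_policy_json() {
  # Grants the Lambda service permission to pull images from the repository.
  cat <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "LambdaECRImageRetrievalPolicy",
      "Effect": "Allow",
      "Principal": { "Service": "lambda.amazonaws.com" },
      "Action": ["ecr:BatchGetImage", "ecr:GetDownloadUrlForLayer"]
    }
  ]
}
EOF
}

# Usage, mirroring the call visible in the diff further down:
#   if parse_ecr_image_uri "$IMAGE_URI"; then
#     aws ecr set-repository-policy --repository-name "$ecr_repo" \
#       --region "$ecr_region" --policy-text "$(lambda_pull_policy_json)"
#   fi
```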
…nt action The diagnose-deployment action mapped to deployment/workflows/diagnose.yaml, which did not exist, so every auto-diagnose after a failed deployment errored with "failed to read workflow file". Adds the workflow mirroring the scope diagnose flow: lean diagnose/build_context + executor over diagnose/checks. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
ASSUME_ROLE_ARN_DEFAULT and PLACEHOLDER_IMAGE_URI_DEFAULT carried a real AWS account ARN/URI committed for testing. The product repo must stay account-agnostic: both are now documented as account-specific and provided per-installation via the scope-configurations provider or the agent's extra_envs (Helm), not hardcoded here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…URI_DEFAULT knob Document why Image-based scopes need a private-ECR placeholder and how the URI is resolved (provider key > PLACEHOLDER_IMAGE_URI_DEFAULT > public default), including how to publish one and a troubleshooting entry. Also re-add PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented, account-agnostic template so operators can pick their own image, and normalize a stray real-looking account ID in a publish comment to the dummy 123456789012. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e requirements
Reviewer feedback: the standalone requirements/ folder should not sit at the lambda/ root; all installation-time tofu should live together under a setup module.
- Move lambda/specs/tofu/ -> lambda/setup/ (the operator-applied install module).
- Merge lambda/requirements/ into lambda/setup/ (requirements.tf + outputs.tf, and its variables folded into setup/variables.tf); remove the requirements/ folder.
- A single 'tofu apply' in lambda/setup now registers the scope type AND provisions the IAM policies. The 4 policies are always created; attaching them stays optional via create_role / role_name.
- Add the aws provider (~> 5.0) + provider block to setup/provider.tf and a nullable aws_region var (IAM is global). 'name' is now a required setup variable.
- Update backend key to lambda/setup/terraform.tfstate.
- Refresh references: installation.md (cd path + IAM vars table), prerequisites.md (setup/main.tf), and the iam/setup comment.
Verified with 'tofu validate' (Success).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
| # update-function-code fails with "Lambda does not have permission to access | ||
| # the ECR image". Idempotent and best-effort (cross-account repos may not be | ||
| # writable from here — Lambda would then need the policy set on the source side). | ||
| if [[ "$IMAGE_URI" == *.dkr.ecr.*.amazonaws.com/* ]]; then |
I don't understand why we need this. What happens if the URI is not from amazonaws.com?

Lambda with Docker only works with ECR images: https://docs.aws.amazon.com/es_es/lambda/latest/dg/images-create.html
| if aws ecr set-repository-policy --repository-name "$ecr_repo" --region "$ecr_region" --policy-text "$lambda_pull_policy" >/dev/null 2>&1; then | ||
| log debug " ✅ ensured Lambda pull policy on ECR repo $ecr_repo" | ||
| else | ||
| log warn " ⚠️ could not set Lambda pull policy on ECR repo $ecr_repo (continuing; pull may fail if not already allowed)" |
Is it confirmed that this can still work? If it is certain to fail later, I would raise an error.
| # Use the image URI as-is. If PLACEHOLDER_IMAGE_URI is not set, the default | ||
| # :latest tag is used without any architecture suffix — publish arch-specific | ||
| # tags and set PLACEHOLDER_IMAGE_URI explicitly if needed. |
| iam_role_name="${LAMBDA_FUNCTION_NAME}-role" | ||
| # Prefix with "np-lambda-" so the role name matches the iam:CreateRole/PassRole | ||
| # Resource constraint in lambda/setup (arn:aws:iam::*:role/np-lambda-*). | ||
| iam_role_name="np-lambda-${LAMBDA_FUNCTION_NAME}-role" |
This is a breaking change. Also, why are we forcing the role name to end with -role? I would use only what comes from the variable and let that convention be respected.

Do you mean we should validate the variable's value? Either way, this goes away with the provider change.
| fi | ||
|
|
||
| # Run tofu action | ||
| # Redirect stderr to stdout: OpenTofu writes its "Error:" block to stderr, and the |
| @@ -0,0 +1,29 @@ | |||
| output "lambda_policy_arn" { | |||
Move all these files back into lambda/spec.
The convention is that everything lives in that folder; it is where people know to look for the tofu modules, and it is how all the scopes are built.

The setup folder should not exist; put everything into the existing folders.
| @@ -0,0 +1,41 @@ | |||
| #!/bin/bash | |||
| # Sourceable helper — do NOT execute directly. | |||
This comment should only say what the script does; let's remove the part about it being sourceable and the requirements.
| fi | ||
| } | ||
|
|
||
| if [ -n "${ASSUME_ROLE_ARN:-}" ]; then |
I would use a more specific name such as "SCOPE_LAMBDA_ASSUME_ROLE_ARN".
In general, a single agent runs several scopes and services. If you use generic variable names (which can be set as env vars on the agent), collisions are likely.
| # Expects: ASSUME_ROLE_ARN (exported by fetch_scope_configuration or values.yaml) | ||
| # SCOPE_ID (optional, used for the session name) | ||
|
|
||
| _ar_log() { |
I would remove this and assume log exists, as in the rest of the scripts. We build the workflows, so we can make sure it is exported.
| _ar_log info "ERROR: sts:AssumeRole failed for $ASSUME_ROLE_ARN" | ||
| _ar_log info "$(cat "$_ar_sts_error")" | ||
| rm -f "$_ar_sts_error" | ||
| return 1 |
I think this should be an exit so the run fails. For the logs, let's use the error or warn level.
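The difference the reviewer is pointing at can be sketched as follows. All names are hypothetical; the subshells stand in for a caller script that invokes the helper and then carries on.

```shell
#!/bin/bash
# Sketch contrasting `return 1` and `exit 1` inside a helper function.

fail_with_return() {
  echo "error: sts:AssumeRole failed (simulated)"
  return 1
}

fail_with_exit() {
  echo "error: sts:AssumeRole failed (simulated)"
  exit 1
}

# return: the caller keeps running unless it checks the status itself.
return_case=$( fail_with_return; echo "caller continued" )

# exit: the calling shell stops at the failure, so nothing after it runs.
exit_case=$( fail_with_exit; echo "caller continued" )

echo "--- return ---"
echo "$return_case"
echo "--- exit ---"
echo "$exit_case"
```

With return, the failure is only fatal if every caller checks the status; with exit, the workflow step fails without relying on the caller.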
|
|
||
| # From scope-configurations category | ||
| TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // empty') | ||
| TOFU_STATE_BUCKET=$(echo "$SCOPE_CONFIG" | jq -r '.state.tofu_state_bucket // .provider.aws_state_bucket // empty') |
Where did .provider.aws_state_bucket come from?
I downloaded the payload of a real config and it looks like this:
{
"attributes": {
"deployment": {
"placeholder_image_uri": "855647970243.dkr.ecr.us-east-1.amazonaws.com/aws-lambda/nullplatform-lambda-placeholder:latest"
},
"state": {
"tofu_state_bucket": "gal3-scopes-tfstate-galicia-3-68bb45dd"
}
},
"created_at": "2026-05-19T16:44:54.180Z",
"dimensions": {},
"groups": [],
"id": "70a0a1fa-dea0-4db1-9d6c-fe71b1843186",
"nrn": "organization=1636958496:account=1807223679",
"specification_id": "80fc7026-7164-4c09-8a4f-424dc3b6aa50",
"tags": [],
"updated_at": "2026-05-19T16:44:54.180Z"
}
|
||
| PLACEHOLDER_IMAGE_URI=$(echo "$SCOPE_CONFIG" | jq -r '.deployment.placeholder_image_uri // empty') | ||
| log debug " ✅ placeholder_image_uri=$PLACEHOLDER_IMAGE_URI" | ||
| # Fallback to env var set in values.yaml when the provider does not supply it. |
Remove this comment, and the ones on lines 81 and 100 as well.
| ASSUME_ROLE_ARN=$(echo "$SCOPE_CONFIG" | jq -r '.assume_role.arn // empty') | ||
| ASSUME_ROLE_ARN="${ASSUME_ROLE_ARN:-${ASSUME_ROLE_ARN_DEFAULT:-}}" |
This won't come from the scope config; it will come from the new provider, right?
Summary
Adds dynamic assume-role support and a configurable placeholder image to the
AWS Lambda scope, introduces a requirements module with the IAM policies the scope
needs, and bundles the deploy/state/IAM fixes found while testing — plus a
security/docs hardening pass to keep the published scope account-agnostic.
Changes
Dynamic assume role
- … scope-configurations provider (assume_role.arn) with ASSUME_ROLE_ARN_DEFAULT as an env-var fallback; when set, AWS operations run under the assumed role, otherwise the agent's pod credentials are used.
- … sts:AssumeRole errors to stdout so they are visible in NP logs.

Installation tofu (lambda/setup/) + IAM policies
- … networking, storage/observability).
- … np-lambda- to match the policy resource constraint.
- … lambda/setup/: the scope-registration module (formerly lambda/specs/tofu/) and the IAM policies (formerly the standalone lambda/requirements/ module) are merged there. A single tofu apply in lambda/setup/ registers the scope type and provisions the IAM policies. name is now a required setup variable; attaching the policies stays optional via create_role / role_name.

Configurable placeholder image
- … PLACEHOLDER_IMAGE_URI_DEFAULT env-var fallback for the placeholder image, with precedence: scope-config deployment.placeholder_image_uri > PLACEHOLDER_IMAGE_URI_DEFAULT > the script's public default.
- … PLACEHOLDER_IMAGE_URI when explicitly set and stop appending an automatic architecture suffix (publish -amd64/-arm64 tags instead).

Deploy / state / tofu fixes
- … diagnose.yaml workflow for the diagnose-deployment action.
- … TOFU_STATE_BUCKET from .provider.aws_state_bucket as a fallback.
- … tofu apply stderr to stdout for visibility in NP logs.
- … specs/tofu.

Security & docs hardening
- … (ASSUME_ROLE_ARN_DEFAULT / PLACEHOLDER_IMAGE_URI_DEFAULT carried a real account ARN/URI) so the product repo stays account-agnostic.
- … PLACEHOLDER_IMAGE_URI_DEFAULT to values.yaml as a commented, account-agnostic template so operators can pick their own image without a hardcoded value.
- … publish comment to the dummy 123456789012.
- … Image-based scopes need a private-ECR placeholder, the resolution precedence, how to publish one, and a troubleshooting entry.

Test plan
- … assume_role.arn set → run under the assumed role; errors surface in NP logs on failure
- … create_role requirements module applies with all IAM policies attached
- … PLACEHOLDER_IMAGE_URI_DEFAULT from a private ECR (single-arch tag matching the scope architecture)
- … diagnose-deployment action runs via the new diagnose.yaml workflow

🤖 Generated with Claude Code