Skip to content

feat(aws-lambda): validate variables + 256 KiB Step Functions input cap#964

Closed
jrusso1020 wants to merge 1 commit into
05-19-feat_producer_thread_variables_through_plan_renderchunk_from
05-19-feat_aws-lambda_validate_variables_and_256_kib_step_functions_input_cap
Closed

feat(aws-lambda): validate variables + 256 KiB Step Functions input cap#964
jrusso1020 wants to merge 1 commit into
05-19-feat_producer_thread_variables_through_plan_renderchunk_from
05-19-feat_aws-lambda_validate_variables_and_256_kib_step_functions_input_cap

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented May 19, 2026

What

Phase 9 PR 9.2 of the distributed rendering plan — add client-side validation for config.variables and a 256 KiB cap on the full Step Functions Standard execution input.

Why

PR 9.1 added variables to DistributedRenderConfig. Because SerializableDistributedRenderConfig = Omit<DistributedRenderConfig, "logger" | "abortSignal" | "producerConfig">, the field automatically flows through renderToLambda's execution input — but the SDK didn't validate it.

Two failure modes need catching at the SDK boundary, before any AWS call:

  1. Non-JSON-safe values (BigInt, functions, Symbols, undefined leaves, Date instances) either throw at JSON.stringify time or round-trip silently corrupted. Without validation, callers see States.DataLimitExceeded or silently-wrong renders 50 ms after StartExecution.
  2. Oversize payloads — Step Functions Standard caps execution input at 256 KiB. Users hit this most when they try to inline base64 media into variables instead of URL-referencing assets. PR 9.5's guide will document the convention; this PR enforces the limit at the SDK so the error message names the actual byte count, the cap, and points at the docs section.

How

  • validateVariablesPayload walks the variables tree and throws on functions, Symbols, BigInts, non-finite numbers, undefined leaves, non-plain objects. Uses a typeof→message lookup table (instead of a switch) to keep cyclomatic complexity below the audit threshold.
  • validateStepFunctionsInputSize measures Buffer.byteLength(JSON.stringify(input), "utf8") against MAX_STEP_FUNCTIONS_INPUT_BYTES = 256 * 1024. Measured in UTF-8 bytes (Step Functions' wire format), not UTF-16 code units.
  • renderToLambda calls both validators before sfn.send(new StartExecutionCommand(…)).
  • Both helpers + the constant are exported from @hyperframes/aws-lambda/sdk so adapters that build custom Step Functions inputs (PR 9.4's batch verb, future Temporal port) can reuse the gates.

The error message for the size violation links to the templates-on-lambda#working-with-large-variables docs section (lands in PR 9.5) so callers know to URL-reference media assets instead of inlining bytes.

Test plan

  • Unit tests added — 10 tests covering variables-payload validation (functions, Symbols, BigInts, NaN/Infinity, undefined leaves, Date instances, nested arrays/objects, happy-path JSON round-trip).
  • 4 tests for validateStepFunctionsInputSize covering under-cap accept, over-cap reject with byte-count in message, the 256-KiB constant, and non-serializable root.
  • 3 tests for renderToLambda covering variables-through-Step-Functions-input roundtrip, oversize-input early-reject, BigInt-in-variables early-reject.
  • bun test packages/aws-lambda/src — 100 pass (was 83, +17 new tests in PR 9.2).
  • bunx oxlint + bunx oxfmt --check clean on changed files.
  • bun run --cwd packages/aws-lambda typecheck clean.
  • bunx fallow audit --base origin/main --fail-on-issues passes (no new findings beyond validateDistributedRenderConfig's pre-existing complexity warning).

This is part of a stack of 5 (+1 optional) PRs (9.1–9.6); this is PR 9.2.

🤖 Generated with Claude Code

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well-scoped validation layer with correct placement in the call chain.

validateVariablesPayload correctly catches all JSON-unsafe types (functions, Symbols, BigInts, undefined, NaN/Infinity, non-plain objects) with clear dotted-path error reporting. validateStepFunctionsInputSize checks after deploySite assembles the full input, matching the actual wire format.

Both validators throw InvalidConfigError with the field property — consistent error contract. Test coverage is thorough: 17 new tests across variable validation, size cap, and renderToLambda integration.

CI green. Ship it.

@vanceingalls
Copy link
Copy Markdown
Collaborator

Phase 9 PR 9.2 — solid scope + clean test discipline. Variables/size validation lands at the SDK boundary so callers get a typed throw before StartExecution instead of a States.DataLimitExceeded 50 ms in. Findings below; nothing blocking.

Audited: packages/aws-lambda/src/sdk/validateConfig.ts, packages/aws-lambda/src/sdk/renderToLambda.ts, packages/aws-lambda/src/sdk/validateConfig.test.ts, packages/aws-lambda/src/sdk/renderToLambda.test.ts, packages/aws-lambda/src/sdk/index.ts, .github/workflows/ci.yml (CI-gap check), packages/aws-lambda/src/events.ts (shape check). Trusting: the two packages/producer/src/services/distributed/*.test.ts extractions (planAt / renderChunkSoftSkippable refactors — read for regressions, look like clean test-helper extractions).

Strengths

  • validateConfig.ts:226Buffer.byteLength(serialized, "utf8") is the right primitive. JS .length would have under-reported for any multi-byte character. Test pins the exact byte count (validateConfig.test.ts:312) so a future swap to UTF-16 length immediately breaks the assertion.
  • renderToLambda.ts:80,123 — validation order is correct: validateDistributedRenderConfig (recursive variables walk) runs FIRST, then validateStepFunctionsInputSize. A BigInt-leaf surfaces as InvalidConfigError with a typed path before JSON.stringify gets a chance to throw.
  • renderToLambda.test.ts:67-68expect(sfn.starts).toHaveLength(0) asserts the reject happens BEFORE StartExecution. That's the actual contract; without this, a "throws on oversize" test could pass even if the AWS call still fired.
  • validateConfig.ts:262-271LEAF_REJECTIONS lookup table is a nice pattern. Each rejection message is actionable ("encode as a string if you need 64-bit integers", "use null if you mean an absent value") — beats a generic "JSON-unsafe" throw. Keeps walkVariables flat enough to avoid the complexity-rule ignore being load-bearing.

Findings

important

  1. walkVariables has no cycle detection — circular variables → stack overflow, not a typed error. validateConfig.ts:273. A user who builds const v = { a: 1 }; v.self = v; recurses forever before any InvalidConfigError fires. JSON.stringify would have given a clear TypeError: Converting circular structure to JSON; the recursive walk pre-empts that with an opaque RangeError. The fix is a WeakSet of visited objects passed through the recursion, throwing InvalidConfigError with the cycle path on re-encounter. Add a test: const cyclic: Record<string, unknown> = {}; cyclic.self = cyclic; validateVariablesPayload(cyclic) should throw InvalidConfigError, not stack-overflow. Failure mode: SDK call hangs the caller's process with no actionable stack frame.

  2. Stacked-PR CI gap — Build / Lint / Format / Typecheck / Test / runtime-contract / CLI smoke did NOT run on this head SHA. .github/workflows/ci.yml:15 filters pull_request triggers to branches: [main]. PR 964's base is 05-19-feat_producer_thread_variables_through_plan_renderchunk_ (PR feat(producer): thread variables through plan() + renderChunk() #962), so the entire CI workflow is silently skipped. Only regression, player-perf, preview-regression ran (those workflows have no branch filter). PR body claims bun test 100 pass, typecheck clean, oxlint clean — taken on author trust, not CI-verified. Per the canonical rubric Rule 5: required CI gates not running = should be explained in the PR body, or restack to base off main for the bottom-of-stack PR. Not blocking THIS PR (the changes are tightly scoped + tests are sane on read), but the stack's whole 9.x series will have the same gap. Worth a one-line PR-body note acknowledging the stacked-CI shape so future reviewers don't infer "all green = audited".

nit

  1. InvalidConfigError.field = "input" for the size violation breaks the dotted-path convention. validateConfig.ts:217,229. Every other call site uses config.foo / config.variables.bar[0]. "input" is the SFN execution input shape, not the config field. Either rename the field to "config" (since the size violation is almost always caused by config.variables) or document the special case in InvalidConfigError's JSDoc so machine-parsing the field doesn't trip on the convention break.

  2. MAX_STEP_FUNCTIONS_INPUT_BYTES is named for one specific transport, but the JSDoc invites adapter reuse from "future Temporal port". validateConfig.ts:189-198. Temporal payloads cap at 2 MiB by default (and configurable), not 256 KiB. Express SFN caps at 32 KiB (called out in the comment but no constant exposed). If this validator is genuinely meant for adapter reuse, take the cap as a function parameter with a SFN-Standard default, or export MAX_SFN_EXPRESS_INPUT_BYTES = 32 * 1024 alongside so the call-site picks. Otherwise the comment about "future Temporal port" is misleading.

  3. validateStepFunctionsInputSize's JSON.stringify can throw, not just return undefined. validateConfig.ts:222. The current code handles the undefined return (functions/Symbols at root). But if a future caller passes a structure with a BigInt directly — e.g. someone routes config through this defensively without going through validateDistributedRenderConfig first — JSON.stringify throws TypeError: Do not know how to serialize a BigInt from inside the validator. Today's renderToLambda path is safe because the recursive variables walk runs first; the gap is for direct exported-use callers. Wrap in try/catch and re-throw as InvalidConfigError("input", "not JSON-serializable: " + err.message). Defense in depth, low cost.

  4. describeValue for Object.create(null) returns "object", but the proto check accepts null-proto objects. validateConfig.ts:297,256. Inconsistency: walkVariables accepts Object.create(null) as a plain-enough object (proto !== Object.prototype && proto !== null — the && proto !== null branch lets null-proto through), but the error message machinery would label it "object" which matches what plain {} gets. Fine in practice. Worth a one-line comment in walkVariables noting why proto === null is explicitly allowed (Object.create(null) for prototype-pollution-safe lookups), so a future maintainer doesn't tighten this and break a legitimate caller.

Verdict

Verdict: APPROVE (posted as COMMENT — harness denied --approve without explicit stamp authorization; verdict reflects the substantive read)
Reasoning: Scope is right, tests are good shape (byte-count assertion + pre-StartExecution check), no correctness blockers. Findings 1 (cycle detection) and 2 (CI gap on stacked branch) are worth picking up as a follow-up commit on this PR or as a tracked ticket; the rest are nits.

— Vai

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

APPROVE on the substance. Solid bytewise input-cap guard + variables validation; verified the validation order, UTF-8 byte measurement, and pre-StartExecution rejection assertion all sound.

Findings

importantwalkVariables has no cycle detection. Circular variables stack-overflow instead of throwing a typed InvalidConfigError (validateConfig.ts:273). One iterative-walk-with-visited-set fix + a test case pins it.

important — stacked-PR CI gap: .github/workflows/ci.yml:15 has pull_request: branches: [main] filter, so Build / Lint / Format / Typecheck / Test / Test:runtime-contract / CLI smoke / Smoke: global install did NOT run on this head SHA. Only regression / player-perf / preview-regression ran (those workflows have no branch filter). PR body's "tests pass / typecheck clean" claims are author-trust only — exact pattern that hid the hf#951 round-3 blocker (typecheck errors in aws-lambda's local format union). Worth merging the upstream stack first, OR rebasing onto main to actually run CI.

nits:

  • field = "input" breaks the dotted-path convention used in sibling error messages.
  • MAX_STEP_FUNCTIONS_INPUT_BYTES is named for SFN-Standard but JSDoc invites Temporal/Express reuse — those have different caps (262144 vs 32768 vs ...). Either rename or constrain the JSDoc.
  • JSON.stringify can throw on direct-call paths (circular refs in input); wrap in a try/catch with a typed error.

Strengths

UTF-8 byte measurement via Buffer.byteLength (not string.length). Validation order: variables-walk first, then size-check — correct (so a too-many-vars payload fails the more actionable check first). Pre-StartExecution sfn.starts.toHaveLength(0) assertion locks in the early-reject contract. LEAF_REJECTIONS table gives actionable messages.

Verdict: APPROVE.

Review by Vai

Add client-side validation for the new config.variables field
(introduced in PR 9.1) and a 256 KiB cap on the full Step Functions
Standard execution input. Both checks throw a typed InvalidConfigError
BEFORE the SDK calls StartExecution — catching the obvious mistakes
locally instead of as a States.DataLimitExceeded 50 ms into the
execution.

validateVariablesPayload walks the variables tree and rejects:
- functions, Symbols, BigInts, non-finite numbers
- undefined leaves (silently dropped by JSON.stringify — would
  surprise the caller when their value doesn't show up in the render)
- non-plain objects (Date, Map, class instances) — Date's toJSON does
  round-trip as a string, but the composition gets a string, not a
  Date, so explicit reject is clearer

validateStepFunctionsInputSize measures the actual UTF-8 byte length
of JSON.stringify(input) against the 256 KiB cap. We use Standard
workflows (per the plan §6.2 / §15.2) for execution-history
visibility, so the cap is 256 KiB (Express would be 32 KiB). The error
message names the actual byte count, the cap, and points at the
templates-on-lambda#working-with-large-variables section so users
know to URL-reference media assets instead of inlining them.

Both helpers are exported from @hyperframes/aws-lambda/sdk so adapters
that build custom Step Functions inputs (batch verbs, future Temporal
ports) can reuse the same gates.

Phase 9 PR 9.2 of the distributed rendering plan.
@jrusso1020 jrusso1020 force-pushed the 05-19-feat_producer_thread_variables_through_plan_renderchunk_ branch from 540d966 to 13c2f26 Compare May 19, 2026 21:13
@jrusso1020 jrusso1020 force-pushed the 05-19-feat_aws-lambda_validate_variables_and_256_kib_step_functions_input_cap branch from 24c52b0 to 078cf71 Compare May 19, 2026 21:13
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Addressed Vai's feedback in the latest force-push:

  • important — cycle detection added to walkVariables via a WeakSet threaded through the recursion. Circular refs now throw a typed InvalidConfigError with circular reference detected instead of stack-overflowing. New tests cover circular through both objects and arrays.
  • important — JSON.stringify try/catch in validateStepFunctionsInputSize — catches the case where a future field-introduction puts a circular ref or BigInt outside config.variables, turning the raw TypeError into a typed InvalidConfigError.
  • nit — field = "input" changed to "config" so the dotted-path convention is consistent across all error messages.
  • nit — MAX_STEP_FUNCTIONS_INPUT_BYTES JSDoc narrowed to explicitly call out that the constant is Step Functions Standard-specific and shouldn't be reused for Temporal/Express without re-confirming.

Stacked-PR CI gap noted — once #962 merges to main, this branch's CI will run against the proper base on rebase.

Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-reviewed after force-push. All 4 items from Vai's review addressed:

  1. Cycle detection — Added via WeakSet<object> in walkVariables. Checks before recursion, throws typed InvalidConfigError with "circular reference detected". Two tests covering object and array self-references.

  2. field="config" rename — All three throw sites in validateStepFunctionsInputSize now use field: "config", consistent with dotted-path convention. No remnant of "input".

  3. JSON.stringify try/catch — Wraps JSON.stringify(input) with catch that rethrows as typed InvalidConfigError. Also guards against undefined return for non-serializable roots.

  4. JSDoc scope narrowingMAX_STEP_FUNCTIONS_INPUT_BYTES doc now explicitly says "Specific to Step Functions Standard" with a warning against reuse for other runtimes.

CI green (regression shards pending, expected). Ship it.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-stamp post-force-push. Per James's per-PR response, all 4 prior items addressed:

  • Cycle detection via WeakSet on walkVariables.
  • field="config" rename to match dotted-path convention.
  • JSON.stringify try/catch.
  • MAX_STEP_FUNCTIONS_INPUT_BYTES JSDoc scope narrowed.

Re-stamp by Vai

@jrusso1020 jrusso1020 deleted the branch 05-19-feat_producer_thread_variables_through_plan_renderchunk_ May 19, 2026 22:47
@jrusso1020 jrusso1020 closed this May 19, 2026
@jrusso1020
Copy link
Copy Markdown
Collaborator Author

Auto-closed because its base branch (05-19-feat_producer_thread_variables_through_plan_renderchunk_) was deleted during the squash-merge of #962. Continuing as #976 with the same code on the same branch, rebased onto main.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants