Skip to content

feat(lambda): add TypeScript SDK and CDK construct#909

Open
jrusso1020 wants to merge 4 commits into
mainfrom
feat-lambda-sdk-cdk-construct
Open

feat(lambda): add TypeScript SDK and CDK construct#909
jrusso1020 wants to merge 4 commits into
mainfrom
feat-lambda-sdk-cdk-construct

Conversation

@jrusso1020
Copy link
Copy Markdown
Collaborator

@jrusso1020 jrusso1020 commented May 16, 2026

What

Adds the client-side surface on top of the Phase 6a Lambda handler so adopters can drive a deployed HyperFrames stack from Node without writing AWS-SDK boilerplate, plus a HyperframesRenderStack aws-cdk-lib L2 construct.

New exports from @hyperframes/aws-lambda:

  • renderToLambda(opts) → RenderHandle — validates the config, optionally calls deploySite, starts a Step Functions execution, returns the handle. Does NOT poll.
  • getRenderProgress({ executionArn }) → RenderProgress — one DescribeExecution + one GetExecutionHistory (paginated). Reports overallProgress, framesRendered/totalFrames, lambdasInvoked, costs.accruedSoFarUsd + displayCost, errors, fatalErrorEncountered, and the final outputFile once Assemble completes.
  • deploySite({ projectDir, bucketName }) → SiteHandle — content-addressed S3 upload at sites/<siteId>/project.tar.gz with HeadObject short-circuit, so re-renders of an unchanged tree skip tar+PUT entirely.
  • validateDistributedRenderConfig(config) — typed InvalidConfigError with a .field pointer for every shape error (fps/format/codec/crf/bitrate/chunkSize/maxParallelChunks/runtimeCap/hdrMode/dimensions). Called inside renderToLambda before any AWS call.
  • computeRenderCost(lambdaInvocations, stateTransitions) → RenderCost — best-effort USD math using us-east-1 on-demand Lambda + SFN Standard rates.

New ./cdk subpath export:

  • HyperframesRenderStack — aws-cdk-lib L2 construct that emits the same topology as examples/aws-lambda/template.yaml (Lambda + Step Functions standard workflow + S3 bucket with 7-day intermediates lifecycle + 3 CloudWatch alarms + scoped IAM). aws-cdk-lib + constructs are declared as optional peer dependencies so SDK-only consumers don't pull them into their runtime graph.

Why

Phase 6a (PRs #878 / #882 / #903 / #907 / #908) validated the Lambda architecture on real AWS but the user-facing surface is still ~8 manual commands across bun + sam + aws s3 + aws stepfunctions. PR 6.4 is the foundational piece of the Phase 6b "≈3 commands end-to-end" target documented in DISTRIBUTED-RENDERING-PLAN.md §11 / Phase 6b. The CLI (PR 6.5) consumes this SDK; the docs (PR 6.7 / 6.8) describe it. Shipping the SDK first keeps the two surfaces in lockstep when the CLI lands.

How

  • packages/aws-lambda/src/sdk/ — SDK files (one .ts + one .test.ts per concept). All AWS calls are routed through a sfn?: SFNClient / s3?: S3Client injection seam so the unit tests run with FakeSFN / FakeS3 stand-ins.
  • packages/aws-lambda/src/cdk/ — CDK construct + an ./index.ts barrel so the subpath export resolves cleanly.
  • package.json@aws-sdk/client-sfn added as a runtime dep; aws-cdk-lib + constructs added as optional peer + dev deps; ./cdk added to exports.
  • README — new "Using the SDK" and "Using the CDK construct" sections; old "What's NOT in this PR" replaced with the remaining Phase 6b follow-up list.

Notable design calls:

  • renderToLambda deliberately does NOT poll. Step Functions standard workflows can run for hours; blocking the caller's process is the wrong default.
  • The CDK construct lives on ./cdk (separate subpath) so SDK-only consumers don't pay the ~8 MB aws-cdk-lib import cost.
  • Cost math is documented as best-effort: Lambda billed duration comes from the handler's own DurationMs return field (carried in the SFN history's success payload), since SFN's LambdaFunctionSucceededEventDetails doesn't expose BilledDuration directly. breakdown.estimated flags any invocation that fell back to a constant.
  • The CDK snapshot test freezes per-type resource counts + the ASL state names; it does NOT use toMatchSnapshot because CDK emits CFN tokens with random suffixes (asset hashes, log-group IDs) that would noise up snapshot diffs.

Test plan

  • Unit tests added — 24 new SDK tests + 9 new CDK tests
  • Full package test pass: bun test packages/aws-lambda/src → 83 pass, 0 fail
  • Typecheck clean: bun run --cwd packages/aws-lambda typecheck
  • Lint clean: bunx oxlint packages/aws-lambda/src
  • Format clean: bunx oxfmt --check packages/aws-lambda/src packages/aws-lambda/README.md
  • Documentation updated — packages/aws-lambda/README.md
  • Real-AWS smoke against the SDK — out of scope for this PR (smoke.sh still drives the same SFN input shape today; PR 6.6 adds a --mode=lambda-local regression harness next)

🤖 Generated with Claude Code

Adds the client-side surface on top of the Phase 6a Lambda handler so
adopters can drive a deployed stack from Node without writing AWS-SDK
boilerplate:

- renderToLambda(opts) starts a Step Functions execution and returns a
  handle. Does NOT poll.
- getRenderProgress({ executionArn }) returns a snapshot of progress,
  frames rendered, cost (Lambda GB-seconds + SFN transitions), errors,
  and the final output object once Assemble completes.
- deploySite({ projectDir, bucketName }) content-addresses the project
  tree, tar.gzs it, and uploads to S3 with a HeadObject short-circuit so
  re-renders of the same tree skip the tar+PUT.
- validateDistributedRenderConfig throws a typed InvalidConfigError
  before StartExecution, so shape errors surface synchronously.
- computeRenderCost is exposed for callers who want to format cost out
  of band.

Also ships HyperframesRenderStack, an aws-cdk-lib L2 construct that
emits the same topology as examples/aws-lambda/template.yaml. Lives on
the ./cdk subpath export so SDK-only consumers don't pull aws-cdk-lib
into their runtime graph (declared as an optional peer dependency).

Tests: 24 new unit tests across the SDK plus 9 CDK synth / contract /
snapshot tests. All 83 tests in packages/aws-lambda/src pass.
Pulls shared logic out so the SDK doesn't re-invent things the handler
and the producer already have:

- `formatExtension` extracted to packages/aws-lambda/src/formatExtension.ts.
  handler.ts and renderToLambda.ts both used identical 12-line copies of
  this switch.
- `PLAN_PROJECT_DIR_SKIP_SEGMENTS` is now exported from
  @hyperframes/producer/distributed. deploySite consumes it instead of
  its own duplicate SKIP_TOP_LEVEL set; the two lists were trivially
  identical and would have drifted silently.
- `FakeS3` + `drainBody` factored out of the two SDK test files into
  src/sdk/__fixtures__/fakeS3.ts. Drops ~110 lines of test-file
  duplication and gives future SDK tests a one-line FakeS3 import.
- S3 URI building in deploySite and renderToLambda routes through the
  existing `formatS3Uri` helper instead of inline `s3://...`
  concatenation; matches the convention already in handler.ts.

Net -133 lines across the touched files. All 83 aws-lambda tests still
pass; all 60 producer distributed tests still pass.
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: feat(lambda): add TypeScript SDK and CDK construct

CI is red on one test; the rest of the build (lint, typecheck, regressions, all other test packages) passes.


Failing test — root cause

HyperframesRenderStack — snapshot > emits the expected set of AWS resource types in the expected counts timed out after 5000ms.

The other two snapshot tests pass fine (55ms, 35ms). This one calls template.toJSON() and iterates Resources — which in a CDK template can be a very large object the first time it is accessed because CDK lazy-evaluates token resolution during toJSON(). The CDK synthesis itself completed (the contract tests which run the same synth() fixture all pass), so this is a performance issue with the test's assertion loop, not a correctness bug.

The fix is straightforward: add { timeout: 15_000 } (or higher) to this test case. The same synth() is already used by the two sibling tests in the file, so a file-level beforeAll that memoizes the template would also eliminate the repeated synthesis and make the suite faster overall.

This is a real issue that needs fixing before merge — CI must be green.


Architecture

Well-structured three-layer split: handler, SDK, CDK. Clean separation of concerns:

  • src/cdk/ only needs aws-cdk-lib — correctly isolated behind a subpath export so SDK-only consumers don't pull CDK into their bundle.
  • src/sdk/ wraps @aws-sdk/client-s3 + @aws-sdk/client-sfn with thin, well-typed wrappers that have injection seams for tests.
  • src/handler.ts is unchanged except for pulling the shared formatExtension out. Good factoring.
  • formatExtension extracted from the handler into its own module and reused by the SDK — exactly right.

AWS integration

IAM: The construct uses bucket.grantReadWrite(renderFunction) (scoped to the specific bucket, not *) and renderFunction.grantInvoke(stateMachine). CDK's default new lambda.Function execution role includes the equivalent of AWSLambdaBasicExecutionRole for CloudWatch Logs. The comment in the code ("NOT CloudWatchLogsFullAccess") is correct. This is well-scoped.

One observation: stateMachine.grantInvoke(...) is not granted to any external IAM principal. That's fine for the construct itself, but the README example doesn't mention this — adopters wiring a Lambda or API Gateway trigger will need to add that grant themselves. Worth a note in the README or a readonly stateMachineArn convenience accessor on the construct, but not blocking.

S3 operations: HeadObject + PutObject in deploySite. Error handling covers 404, NotFound, NoSuchKey, and propagates everything else — that's correct S3 SDK behavior since the v3 SDK throws typed errors with $metadata.

Step Functions: StartExecution uses JSON.stringify(input) for the input field — correct (SFN requires a string). GetExecutionHistory paginates up to 50 pages of 1000 events each — the 50-page cap is a reasonable guard against infinite loops on runaway executions.

State machine definition: The AssertChunkCount choice uses numberGreaterThan("$.Plan.ChunkCount", 0) — correct. The maxConcurrencyPath: "$.Plan.ChunkCount" will match actual chunk count, not maxParallelChunks from the config. That's a potential overage — if the user sets maxParallelChunks: 8 and the plan produces 240 chunks, all 240 fire in parallel. The SDK validates maxParallelChunks but the CDK construct doesn't thread it through to maxConcurrencyPath. The SAM template may handle this differently — worth a follow-up.


Error handling

  • deploySite: catches 404 correctly, propagates auth/network errors.
  • renderToLambda: validates config before any AWS call (good), throws if executionArn is absent in the response.
  • getRenderProgress: handles FAILED, TIMED_OUT, ABORTED, PENDING_REDRIVE statuses. Covers ExecutionFailed, ExecutionAborted, ExecutionTimedOut events in the history scan.
  • CDK construct: no try/catch (not needed — CDK construct failures throw at synth time, not runtime).

One gap: getRenderProgress has a 50-page loop guard but no timeout or circuit-breaker for the SFN calls themselves. For very long-running renders this could stall a caller process. Not a blocker for the SDK surface, just worth noting.


Types

Types are clean — no any in the SDK files. The few casts are all in test helpers or fixture code (as unknown as SFNClient, as unknown as S3Client) which is the correct pattern for mock injection. FakeS3.send accepts unknown and narrows via constructor name — a bit fragile for a production mock, but fine for test fixtures.

DistributedFormat in formatExtension.ts is defined locally and may diverge from the type in events.ts. Worth confirming they stay in sync (or re-exporting from one canonical location).

HyperframesRenderStackProps.lambdaMemoryMb is typed as a union of allowed MB values — good, prevents invalid steps.


CDK construct

Well-parameterized with sensible defaults (10240 MB, 900s timeout, x86_64, sparticuz, RETAIN on delete). The construct exposes .bucket, .renderFunction, and .stateMachine for adopter wiring.

The defaultHandlerZipPath() resolves relative to import.meta.url. This will work in the source tree and in a published node_modules install if the package is published with the ZIP included. The package.json doesn't have a files field limiting what gets published — confirm the ZIP ends up in the published package or that handlerZipPath is always required in practice.

RemovalPolicy.DESTROY on the log group while defaulting to RETAIN on the S3 bucket — the asymmetry is intentional (logs are cheaper to lose) but worth a comment for the next maintainer.


Security

No hardcoded credentials, ARNs, or account IDs. Placeholder ARNs in the README (arn:aws:states:us-east-1:123:...) are clearly illustrative.

The content-addressed siteId (SHA-256 truncated to 16 hex chars) has a ~2^64 collision space — acceptable for this use case.

hashProjectDir reads files synchronously (readFileSync) which is fine for project sizes, but the comment notes this. The sort order is deterministic (lexicographic by entry.name). Looks correct.


Summary

Blocking: The snapshot test timeout needs a fix — bump the test timeout or memoize synth() across the three tests in that file.

Non-blocking observations:

  1. maxConcurrencyPath in the Map state uses $.Plan.ChunkCount instead of $.Config.MaxParallelChunks — may silently ignore the user's parallelism cap.
  2. The files field in package.json should be audited to confirm dist/handler.zip gets published.
  3. DistributedFormat defined in two places — confirm they stay in sync.
  4. README example should note that callers need to grant their trigger principal states:StartExecution on the state machine.

The SDK and CDK code itself is solid — well-tested, clean types, proper injection seams, sensible defaults.

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary

Solid Phase 6b foundation. SDK is well-typed, validated up-front, and tested at a level above the rest of the package. CDK construct mirrors SAM resource shape and IAM scope discipline. Decent docstrings throughout.

One real CI blocker (the test timeout on main's Test job is this PR, not flake), plus a few correctness + parity items worth fixing before merge.

Calibrated strengths

  • validateConfig.ts is genuinely well-structured: typed InvalidConfigError with a .field pointer, even-dim enforcement on width/height (yuv420p constraint), and the crf/bitrate mutual-exclusion check (packages/aws-lambda/src/sdk/validateConfig.ts:74-82) all surface bad calls at compile-equivalent speed.
  • IAM discipline holds the SAM template's bar: bucket.grantReadWrite(renderFunction) is ARN-scoped (HyperframesRenderStack.ts:861), no Action: "*", no CloudWatchLogsFullAccess. The comment at :857-860 even calls out why — matches the SAM template's same callout.
  • Snapshot test design (HyperframesRenderStack.snapshot.test.ts:580-600) deliberately avoids toMatchSnapshot for the right reason — CFN token suffixes would create false diffs. Per-type resource counts + frozen state names + frozen non-retryable error set is the right level of granularity.
  • formatExtension extraction into a shared module (src/formatExtension.ts) keeps SDK and handler in lockstep on the mp4 vs mov vs png-sequence key extension. Cross-consumer extension drift was a real foot-gun risk.

Findings

Blocker — CI Test job is red on this PR (not a flake)

Logs (runs/25976205112/job/76356669390):

(fail) HyperframesRenderStack — snapshot > emits the expected set of AWS resource types in the expected counts [5443.01ms]
  ^ this test timed out after 5000ms.

This is the first test in HyperframesRenderStack.snapshot.test.ts. Subsequent tests in the same file pass in 35-55ms; contract-test first synth runs in 70ms. The 5.4s wall time = aws-cdk-lib first-time module load + first new App() + first Template.fromStack synth. The 5s bun:test default timeout barely loses.

This will flake intermittently on every PR that touches the CDK file until fixed. Two clean options:

  • it("...", { timeout: 30_000 }, async () => { ... }) on the first synth test (or describe.timeout(30_000)).
  • Hoist one synth() call to a beforeAll and reuse the Template across all three tests (each only inspects the synthed template, none re-synths).

I'd take the beforeAll approach — it cuts the CDK suite wall time by ~5s every run and makes the dependency-load cost visible exactly once. Same change in HyperframesRenderStack.contract.test.ts:79-81 is a separate small win.

Blocker — stateTransitions = events.length overcounts SFN cost

getRenderProgress.ts:2262:

stateTransitions: events.length,

SFN Standard Workflows bill per state transition, not per history event. A single Lambda task produces (at minimum) TaskStateEntered, LambdaFunctionScheduled, LambdaFunctionStarted, LambdaFunctionSucceeded, TaskStateExited — five history events for one billable transition. With 100 chunks the cost report will show ~5× the actual SFN charge.

This understates the AWS bill misleading direction: it makes accruedSoFarUsd and displayCost look ~3-5× scarier than reality, which will drive adopters to under-fan-out renders. Filter to state-entry events (TaskStateEntered | MapStateEntered | PassStateEntered | ChoiceStateEntered) or count StateEntered family events only. The infra to do this is already in summarizeHistorycurrentLambdaState is set on exactly those event types.

Important — assembleComplete parsed too narrowly

getRenderProgress.ts (around line 2210):

if (typeof obj.FramesEncoded === "number") {
  if (obj.Action === "renderChunk") { ... }
  if (obj.Action === "assemble") { assembleComplete = true; ... }
}

assembleComplete is only set when both FramesEncoded is a number and Action === "assemble". If Assemble succeeds but the result payload happens to omit FramesEncoded (e.g. a future handler refactor), overallProgress will stay at 0.9 even though the render is done. The status === "SUCCEEDED" check in computeOverallProgress covers the final state, but the transitional moment between "Assemble lambda returned" and "SFN marks SUCCEEDED" will misreport.

Two simple fixes (either works): (a) gate assembleComplete on obj.Action === "assemble" only, set outputFile when present; (b) detect Assemble by the enclosing state name (currentLambdaState === "Assemble"), which avoids the payload-shape dependency entirely. (b) is more robust to future result-type churn.

Important — CDK ↔ SAM parity gap on state-machine IAM

The SAM template explicitly grants the state machine role:

  • WriteCloudwatchLogs (9 logs:* actions on *, required for SFN log delivery)
  • XRayTracing (xray:PutTraceSegments, xray:PutTelemetryRecords — required for Tracing.Enabled)

The CDK construct relies on aws-cdk-lib's new sfn.StateMachine with tracingEnabled: true + logs: { destination } to auto-grant these. As of aws-cdk-lib@2.130+ CDK does auto-grant both, but the contract test never asserts this — meaning a future minor CDK version that changes the auto-grant behaviour would silently produce a state machine that can't write logs or trace.

Add one contract assertion: t.hasResourceProperties("AWS::IAM::Policy", { ... Action: contains "xray:PutTraceSegments", Resource: "*" ... }) for the state-machine role's policy. Cheap, catches the SAM-parity regression that's the whole point of the snapshot test.

Separately, the EXPECTED_RESOURCE_COUNTS map declares "AWS::IAM::Policy": 2 but the SAM template has 3 inline policies on the state-machine role (InvokeRenderFunction, WriteCloudwatchLogs, XRayTracing). CDK collapses adjacent inline policies into fewer managed policy resources, but if the count drifts up because CDK's auto-grants emit a third policy on a minor bump, the snapshot test fails with a useful message. So the 2-vs-3 isn't itself a bug — it just means the snapshot test does not assert SAM-template parity, only "what CDK emits today". Worth a comment so a future reader doesn't think parity is verified.

Important — README example has a confusing bucketName derivation

packages/aws-lambda/README.md:319-321:

const handle = await renderToLambda({
  siteHandle: site,
  bucketName: site.projectS3Uri.split("/")[2]!,
  ...
});

The example reaches back into the SiteHandle.projectS3Uri string to extract the bucket name. That's exactly the boilerplate the SDK is supposed to remove. Two paths forward (either is fine):

  • Carry bucketName on SiteHandle (it's already known at deploySite call time — opts.bucketName is right there).
  • Make bucketName optional on RenderToLambdaOptions when siteHandle is supplied, derive it inside renderToLambda from the URI parse.

Right now the README is teaching an anti-pattern.

Important — SAM template uses RetentionInDays: 30 for the SFN log group; CDK uses RetentionDays.ONE_MONTH

These match (ONE_MONTH = 30), so the runtime behaviour is identical. But the SAM template has RetentionInDays: 30 implicit by enum value and the CDK construct uses the named enum — fine, just worth a brief comment near HyperframesRenderStack.ts:864 noting the SAM-template parity reasoning. The state-machine log group is the only resource on RemovalPolicy.DESTROY (the bucket is RETAIN) — a comment on why would prevent a future contributor from "fixing" the inconsistency.

Nit — removalPolicy mismatch potential

The SAM template has explicit DeletionPolicy: Retain + UpdateReplacePolicy: Retain on the bucket. The CDK construct uses props.bucketRemovalPolicy ?? RemovalPolicy.RETAIN. CDK's RemovalPolicy.RETAIN sets both DeletionPolicy and UpdateReplacePolicy to Retain, so this matches — fine.

Nit — validateConfig.ts allows force-sdr but the producer's distributed plan may not

validateConfig.ts:84:

const ALLOWED_HDR_MODES = ["auto", "force-sdr"] as const;

Verify against @hyperframes/producer/distributed's actual accepted set. If the producer's plan stage rejects something the SDK accepts, you'll get the exact "ExecutionFailed five minutes in" pain the validator is designed to prevent.

Nit — loadFullHistory caps at 50 pages × 1000 events = 50K events

getRenderProgress.ts:2141:

for (let page = 0; page < 50; page++) { ... }

Silent truncation. 50K SFN history events is roughly a 10K-chunk render — well past anything that would fit in a single execution today, but a hard cap with no log line or thrown error is the kind of bug that bites three years later. Either drop the cap (Standard Workflows have a hard 25K event limit anyway) or throw new Error("history exceeded 50K events; SFN execution would have failed first").

Nit — inferBilledMs uses wall-clock DurationMs, not actual billed duration

getRenderProgress.ts:2289:

function inferBilledMs(payload: unknown): number {
  ...
  if (typeof obj.DurationMs === "number") return obj.DurationMs;

The handler measures wall-clock and reports it as DurationMs. Lambda's actual BilledDuration is rounded up to the nearest 1ms but excludes cold-start init time, so wall-clock is a slight overestimate of bill — opposite direction from the SFN overcount above. The module doc (costAccounting.ts:7) calls this out, so it's documented; the displayCost formatting is already loud about being "best-effort." Acceptable as-is, but a one-line code comment on inferBilledMs pointing at the SFN-overcount issue (after that's fixed) would close the loop.

Verdict

Verdict: REQUEST CHANGES
Reasoning: CI is red on a real timing issue this PR introduced, and stateTransitions = events.length will visibly overstate AWS cost to the adopters this surface is meant to serve. Both are 1-2 line fixes. Everything else above is important-or-nit polish on what's otherwise a well-built foundation.

— Vai

Review by Vai

The bun:test default 5s timeout tripped the first CDK snapshot test
in CI when the cold-start `Template.fromStack(stack)` synth took ~5-8s
on the slowest GitHub Actions runner. Locally on a warm shell the
synth measures <1s, so the failure didn't reproduce until PR #909 hit
CI.

Two changes:

  - Both CDK test files cache one synth in `beforeAll(..., 30000)` and
    reuse the result across every test that uses the default props.
    Each individual test now runs in microseconds (pure assertions
    against the already-synthed template), so the 5s timeout no longer
    applies on the hot path.

  - The two contract tests that exercise non-default props
    (reservedConcurrency, projectName) still synth fresh per-test; they
    get a per-test `it(..., 30000)` timeout.

No behavior changes.
Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verdict reconciliation: The earlier --comment from this account (id 4304554278) was a fallback after the harness blocked --request-changes pre-call. Reposting as --request-changes from foreground to set the correct review state.

Two blockers + three importants. Architecture and IAM scope hold the bar; the issues below are correctness-grade, not stylistic.

Blockers

Blocker 1 — CI snapshot test times out. HyperframesRenderStack.snapshot.test.ts first test (emits the expected set of AWS resource types…) times out at 5443ms vs the 5000ms bun:test default. First-synth pays the full aws-cdk-lib module-load cost; subsequent synths in the same file complete in 35-55ms. Fix: it(name, { timeout: 30_000 }, …) on the first test, OR hoist app.synth() to beforeAll so every test reuses the assembly. Will flake until done.

Blocker 2 — stateTransitions: events.length over-reports SFN cost. getRenderProgress.ts:2262 counts every history event as a billable transition. Step Functions only charges per state entry (each Task produces ~5-7 history events: ScheduleEvent, StartedEvent, SucceededEvent, etc.). Result: cost projections will over-report by 3-5×. Filter to historyEvents.filter(e => e.type?.endsWith("StateEntered")) (or equivalent) before counting.

Important

  • assembleComplete discriminant is fragile. Gated on both event.previousEventDetails.parameters.Action === "assemble" AND typeof FramesEncoded === "number". The Action coupling makes the check brittle if a future caller renames or relocates that field. Prefer the enclosing state name (StateExitedEventDetails.name === "Assemble").
  • CDK↔SAM parity is asserted by aws-cdk-lib auto-grants, not by test. The CDK construct relies on bucket.grantReadWrite(fn) and CDK's automatic X-Ray / log-delivery permissions. No assertion in the snapshot test verifies these match the SAM template's resource set. Add one explicit toMatchObject assertion.
  • README example does bucketName: site.projectS3Uri.split("/")[2]!. String-parsing the URI is exactly the kind of boilerplate the SDK should remove for users. Carry bucketName on SiteHandle so the example becomes bucketName: site.bucketName.

Strengths

  • SDK ergonomics: InvalidConfigError.field pointer (specific blame), even-dimension enforcement, crf/bitrate mutex, runtimeCap enum. Abort signal is correctly not threaded (fire-and-forget invoke; nothing to cancel).
  • CDK IAM scope: bucket.grantReadWrite(fn) is ARN-scoped. No Action: "*". Trust policy correctly restricts to lambda.amazonaws.com + states.amazonaws.com.
  • Three-layer split (handler/SDK/CDK) is clean — each layer has a single responsibility.

Review by Vai

Three correctness + ergonomics fixes raised in Vai's review:

  - getRenderProgress over-counted SFN transitions by 3-5×. Step
    Functions Standard Workflows bill per state-entry, not per
    history event. Each Task produces ~5-7 history events
    (Scheduled / Started / Succeeded / TaskStateExited / …);
    counting `events.length` reported the runaway. Switch to
    counting `*StateEntered` events explicitly.

  - assembleComplete + outputFile detection was coupled to the
    Lambda payload's `Action` field. Move both signals onto the
    enclosing state name (`StateExited.name === "Assemble"`), which
    is the state-machine identity rather than the Lambda event
    contract. framesRendered increment moves to the same boundary
    (RenderChunk state).

  - SiteHandle now carries `bucketName` directly so README + CLI
    callers don't have to re-parse `projectS3Uri.split("/")[2]`.

Test updates: getRenderProgress tests wrap renderChunk/assemble
events in matching StateEntered + StateExited pairs so the new
state-name-driven dispatch is exercised end-to-end. SiteHandle
fixture in renderToLambda.test.ts gets the new bucketName field.

All 83 aws-lambda tests still pass.
Copy link
Copy Markdown
Collaborator

@miguel-heygen miguel-heygen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both blockers from the previous review are addressed:

1. CDK snapshot test timeout — Fixed. Both HyperframesRenderStack.contract.test.ts and HyperframesRenderStack.snapshot.test.ts now cache the CDK synth in beforeAll with a 30s timeout (line: beforeAll(() => { ... }, 30000)). Individual tests that synth non-default-props stacks also have per-test }, 30000) timeouts. The default 5s vs ~5-8s cold synth race is eliminated.

2. stateTransitions: events.length over-reporting SFN cost — Fixed. summarizeHistory() now counts only *StateEntered events (TaskStateEntered, MapStateEntered, PassStateEntered, ChoiceStateEntered, SucceedStateEntered, FailStateEntered, WaitStateEntered, ParallelStateEntered) via a switch statement, not all history events. This correctly models SFN Standard Workflow billing (per state entry) and eliminates the 3-5x over-report from intermediate events like LambdaFunctionScheduled/Started/Succeeded.

Regression shards still pending but unrelated to this PR's changes (CDK construct + SDK cost accounting).

Copy link
Copy Markdown
Collaborator

@vanceingalls vanceingalls left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-review of 4f2c4b0c. Both blockers and 2/3 importants resolved cleanly; the remaining important is partial but not blocking.

Findings status

Blocker 1 — CDK snapshot test timeout: ADDRESSED. beforeAll(..., 30000) hoists synth in both HyperframesRenderStack.snapshot.test.ts and HyperframesRenderStack.contract.test.ts; default-args tests now run as µs assertions against SYNTHED / DEFAULT_TEMPLATE. The two non-default-prop contract tests (reservedConcurrency, projectName) keep a per-test 30000 timeout for their fresh synth. CI is green on this HEAD.

Blocker 2 — SFN cost over-reporting: ADDRESSED. getRenderProgress.ts:185-200 now counts only *StateEntered events toward stateTransitions, dropping the 3-5× over-count. The accompanying test wraps lambda events in matching stateEntered/stateExited pairs and the cost-projection test still passes (progress.costs.breakdown.lambdaUsd ≈ $0.001 for a 1-invocation execution).

Important — assembleComplete discriminant: ADDRESSED. Moved to stateExitedEventDetails?.name === "Assemble" (getRenderProgress.ts:230), decoupled from the inner Lambda payload's Action field. framesRendered is also state-gated via currentLambdaState === "RenderChunk" (line 222), which removes the implicit dependency on Action and adds a test (does not double-count Assemble's FramesEncoded) that exercises the boundary.

Important — SiteHandle.bucketName: ADDRESSED. SiteHandle now carries bucketName (deploySite.ts); README example simplified to bucketName: site.bucketName.

Important — CDK↔SAM parity test: PARTIAL. SFN TracingConfiguration: { Enabled: true } is asserted in HyperframesRenderStack.contract.test.ts:59, but Lambda X-Ray tracing + log-delivery IAM are still not asserted explicitly — they rely on aws-cdk-lib's auto-grants. Not blocking: the resource-count snapshot + state-machine topology assertion catch resource-shape regressions, and explicit X-Ray/log-delivery assertions are a reasonable follow-up rather than a merge gate.

Nits (non-blocking)

  • currentLambdaState lifetime. currentLambdaState = ev.stateEnteredEventDetails?.name ?? currentLambdaState (line 199) overwrites on every *StateEntered. For this state machine's history shape (no nested state-entered between Task entered → Lambda lifecycle → Task exited) the carry-through is correct in practice. If future changes nest Choice/Pass states between Task entry and the Lambda lifecycle events, the gate could mis-attribute — worth a comment, not a fix.

Verdict

Verdict: APPROVE
Reasoning: Both blockers + the load-bearing importants are resolved with code that's evidence-backed by the updated tests; CI is green. The remaining partial is non-blocking and a reasonable follow-up.

Review by Vai (re-review)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants