Skip to content

ci: provision PR preview environments in Railway#2681

Merged
amikofalvy merged 9 commits intomainfrom
codex/preview-env-clean-2662
Mar 18, 2026
Merged

ci: provision PR preview environments in Railway#2681
amikofalvy merged 9 commits intomainfrom
codex/preview-env-clean-2662

Conversation

@vnv-varun
Copy link
Contributor

@vnv-varun vnv-varun commented Mar 13, 2026

Summary

  • add a preview-environments.yml workflow that provisions a per-PR Railway environment from preview-base, injects the resolved backend values into Vercel preview deployments, and aliases stable PR preview domains
  • make the preview lifecycle explicit: PR open/sync creates or updates pr-<number>, and PR close tears that Railway environment down
  • harden the flow with smoke checks, failure diagnostics, Vercel runtime log capture, and extracted preview scripts under .github/scripts/preview/

Key decisions

  • keep Railway as the per-PR backing-services tier and keep Vercel as the app-preview tier; the preview API/UI are served from Vercel, while Railway provides the PR-scoped database and SpiceDB connectivity
  • pass Git metadata to vercel deploy so live CLI preview deployments receive the PR branch-scoped env vars they need at runtime
  • keep the workflow YAML orchestration-focused and move the operational shell logic into checked-in scripts for maintainability and reviewability

Expected behavior

  • on internal PR open or sync, the workflow creates or reuses Railway environment pr-<PR number> cloned from preview-base
  • the workflow resolves the Railway-backed runtime values, writes them into the Vercel preview environment for the PR branch, deploys fresh preview builds, and aliases stable domains:
    • https://pr-<PR number>-api.preview.inkeep.com
    • https://pr-<PR number>-ui.preview.inkeep.com
  • the preview workflow then smoke-checks the API and UI; if provisioning, env injection, deploy, or smoke fails, the PR check fails
  • on PR close, the teardown job deletes the Railway environment pr-<PR number>

How we tested

  • automatic PR run on the current head (d4fb50c64) passed end to end, including Provision Tier 1 (Railway), Inject Vercel Preview Env Vars, and Smoke Test Preview URLs
  • manual workflow_dispatch run 23125912084 also passed
  • the current preview API health endpoint responds successfully at https://pr-2681-api.preview.inkeep.com/health
  • the current preview UI is reachable behind Vercel auth at https://pr-2681-ui.preview.inkeep.com
  • Claude follow-up review on the current head reported all prior issues addressed

Future considerations

  • Neon branching and seeded preview data are still out of scope for this PR; the current path uses Railway-backed preview infrastructure and smoke coverage for env/deploy correctness

Supersedes

@changeset-bot
Copy link

changeset-bot bot commented Mar 13, 2026

⚠️ No Changeset found

Latest commit: d4fb50c

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

@pullfrog
Copy link
Contributor

pullfrog bot commented Mar 13, 2026

Error

agent completed without reporting progress

Pullfrog  | Rerun failed job ➔View workflow run | Triggered by Pullfrogpullfrog.com𝕏

@vercel
Copy link

vercel bot commented Mar 13, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
agents-api Ready Ready Preview, Comment Mar 16, 2026 9:51am
agents-docs Ready Ready Preview, Comment Mar 16, 2026 9:51am
agents-manage-ui Ready Ready Preview, Comment Mar 16, 2026 9:51am

Request Review

Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(8) Total Issues | Risk: Medium

🟠⚠️ Major (6) 🟠⚠️

Inline Comments:

  • 🟠 Major: preview-environments.yml:544 Action actions/checkout@v6 not pinned to SHA
  • 🟠 Major: preview-environments.yml:213 Docker image ghcr.io/railwayapp/cli:latest uses mutable tag (also lines 438, 839)
  • 🟠 Major: preview-environments.yml:357-367 Secrets written to GITHUB_OUTPUT before masking
  • 🟠 Major: preview-environments.yml:514-519 upsert_env discards error body and lacks timeout
  • 🟠 Major: preview-environments.yml:780-782 Vercel runtime logs may expose secrets in error traces
  • 🟠 Major: preview-environments.yml:848-862 Teardown lacks error handling and verification

🟡 Minor (2) 🟡

Inline Comments:

  • 🟡 Minor: preview-environments.yml:91-97 GitHub API curl call missing explicit timeout
  • 🟡 Minor: preview-environments.yml:327-342 Retry loops use fixed delays instead of exponential backoff (duplicated at lines 447-478, 697-740)

💭 Consider (3) 💭

Inline Comments:

  • 💭 Consider: preview-environments.yml:564 Vercel CLI installed without version pinning (also line 776)

💭 1) jobs Add timeout-minutes to all jobs

Issue: None of the jobs specify timeout-minutes. This workflow involves external API calls, polling loops with up to 30 attempts (60+ seconds), and deployment operations.

Why: Without timeouts, hung jobs could consume runner minutes indefinitely. Other workflows in this repo (ci.yml, cypress.yml) specify explicit timeouts.

Fix: Add timeout-minutes: 20 (or appropriate value) to each job definition, especially provision-tier1, configure-vercel-preview, smoke-preview, and teardown-tier1.

Refs:

💭 2) ANTHROPIC_API_KEY Add explicit masking for defense-in-depth

Issue: ANTHROPIC_API_KEY is passed via environment and used in upsert_env but is never explicitly masked with ::add-mask::. While GitHub Actions auto-masks ${{ secrets.* }} references, the masking applies to exact string values.

Why: If the secret appears in a slightly different context (URL-encoded, in error messages with context, or truncated), it may not be masked. Explicit masking provides defense-in-depth.

Fix: Add at the start of the step that uses it:

echo "::add-mask::${ANTHROPIC_API_KEY}"

✅ Good Practices Observed

The workflow demonstrates several security and operational best practices:

  • Least-privilege permissions with contents: read and pull-requests: write
  • Internal PR gating — preview environments only provision for internal PRs (lines 106-112)
  • PR number/branch validation with regex guards (lines 83-86, 201-204, 420-423, 827-830)
  • Proper secret masking for database URLs (lines 361-363, 491-493)
  • Safe environment naming with strict ^pr-[0-9]+$ pattern enforcement
  • No use of dangerous triggers like pull_request_target
  • Comprehensive diagnostics on smoke test failure (lines 745-782)

🚫 REQUEST CHANGES

Summary: This is a well-structured preview environment workflow with good security fundamentals. The main concerns are supply chain risks (unpinned action/container tags), secret handling timing (mask-before-use), and operational reliability (timeouts, error handling, backoff strategies). The action SHA pinning and secret masking order are the most important items to address before merge. The other findings are improvements that can be addressed now or in follow-up.


Discarded (7)
Location Issue Reason Discarded
json_get_var() "Empty parameter assignment" False positive — code correctly uses local json="$1"
ensure_runtime_var_seeded() "Empty parameter assignment" False positive — code correctly uses local key="$1"
extract_runtime_var() "Empty parameter assignment" False positive — code correctly uses local key="$1"
debug_key() "Empty parameter assignment" False positive — code correctly uses local key="$1"
upsert_env() "Empty parameter assignment" False positive — code correctly uses local project_id="$1"
wait_for_success_url() "Empty parameter assignment" False positive — code correctly uses local url="$1"
line 420 "Branch regex may reject valid names" Valid observation but intentionally restrictive for safety
Reviewers (5)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-devops 8 0 2 0 2 0 4
pr-review-sre 10 0 0 0 3 0 7
pr-review-security-iam 4 0 1 0 3 0 0
pr-review-errors 6 0 0 0 2 0 4
pr-review-standards 7 0 0 0 0 0 7
Total 35 0 3 0 10 0 22

Note: pr-review-standards hallucinated multiple CRITICAL findings about "empty parameter assignments" in bash functions — all were false positives upon verification.

fi

- name: Checkout
uses: actions/checkout@v6
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Action not pinned to SHA

Issue: actions/checkout@v6 uses a mutable tag instead of a full SHA. All other workflows in this repository consistently use SHA pinning.

Why: Mutable tags can be updated by upstream maintainers, creating a supply chain risk where malicious code could be injected into workflows.

Fix:

Suggested change
uses: actions/checkout@v6
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6

Refs:

-v "${RAILWAY_HOME_DIR}:/root/.railway" \
-v "$PWD:/workspace" \
-w /workspace \
ghcr.io/railwayapp/cli:latest railway "$@"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Docker image uses mutable :latest tag

Issue: The Railway CLI Docker image uses the :latest tag at lines 213, 438, and 839. Mutable container tags are a supply chain risk — the image contents can change between workflow runs.

Why: This can introduce breaking changes, vulnerabilities, or malicious code without any visible change to the workflow file. It's functionally equivalent to using mutable action tags.

Fix: Pin to a specific version tag or digest. Check Railway's container registry for available version tags:

# Example (replace with actual stable version):
ghcr.io/railwayapp/cli:3.x.x
# or using digest:
ghcr.io/railwayapp/cli@sha256:<digest>

Refs:

Comment on lines +357 to +367
MANAGE_DB_URL="$(extract_runtime_var "${RAILWAY_MANAGE_DB_URL_KEY}")"
RUN_DB_URL="$(extract_runtime_var "${RAILWAY_RUN_DB_URL_KEY}")"
SPICEDB_ENDPOINT="$(extract_runtime_var "${RAILWAY_SPICEDB_ENDPOINT_KEY}")"

echo "::add-mask::${MANAGE_DB_URL}"
echo "::add-mask::${RUN_DB_URL}"
echo "::add-mask::${SPICEDB_ENDPOINT}"

echo "manage_db_url=${MANAGE_DB_URL}" >> "$GITHUB_OUTPUT"
echo "run_db_url=${RUN_DB_URL}" >> "$GITHUB_OUTPUT"
echo "spicedb_endpoint=${SPICEDB_ENDPOINT}" >> "$GITHUB_OUTPUT"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Mask secrets immediately after extraction

Issue: Database URLs are extracted and stored in variables (lines 357-359), then masking is applied (lines 361-363), and finally written to outputs (lines 365-367). If extraction fails mid-way or an error occurs between extraction and masking, secret values may appear unmasked in logs.

Why: The ::add-mask:: directive should be called immediately after each secret is obtained, before any other code runs that could fail and expose the value.

Fix: Restructure to mask each secret immediately after extraction:

MANAGE_DB_URL="$(extract_runtime_var "${RAILWAY_MANAGE_DB_URL_KEY}")"
echo "::add-mask::${MANAGE_DB_URL}"

RUN_DB_URL="$(extract_runtime_var "${RAILWAY_RUN_DB_URL_KEY}")"
echo "::add-mask::${RUN_DB_URL}"

SPICEDB_ENDPOINT="$(extract_runtime_var "${RAILWAY_SPICEDB_ENDPOINT_KEY}")"
echo "::add-mask::${SPICEDB_ENDPOINT}"

echo "manage_db_url=${MANAGE_DB_URL}" >> "$GITHUB_OUTPUT"
echo "run_db_url=${RUN_DB_URL}" >> "$GITHUB_OUTPUT"
echo "spicedb_endpoint=${SPICEDB_ENDPOINT}" >> "$GITHUB_OUTPUT"

Refs:

Comment on lines +514 to +519
curl --fail-with-body -sS \
-X POST \
-H "Authorization: Bearer ${VERCEL_TOKEN}" \
-H "Content-Type: application/json" \
"https://api.vercel.com/v10/projects/${project_id}/env?upsert=true&teamId=${VERCEL_ORG_ID}" \
--data "${payload}" >/dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Error output discarded and missing timeout

Issue: The upsert_env() function uses --fail-with-body but redirects output to /dev/null, losing the Vercel API error body. Additionally, there's no --connect-timeout or --max-time, so a hung request could block the workflow indefinitely.

Why: When any of the 9 sequential upsert_env calls fail (lines 523-541), operators see only a generic curl error with no indication of which env var failed or why. The --fail-with-body flag is designed to preserve error details, but discarding stdout defeats this.

Fix: Add timeouts and preserve error output:

upsert_env() {
  local project_id="$1"
  local key="$2"
  local value="$3"
  local payload response

  payload="$(jq -n \
    --arg key "$key" \
    --arg value "$value" \
    --arg branch "$PR_BRANCH" \
    '{key:$key, value:$value, type:"encrypted", target:["preview"], gitBranch:$branch}')"

  if ! response=$(curl --fail-with-body -sS \
    --connect-timeout 10 --max-time 60 \
    -X POST \
    -H "Authorization: Bearer ${VERCEL_TOKEN}" \
    -H "Content-Type: application/json" \
    "https://api.vercel.com/v10/projects/${project_id}/env?upsert=true&teamId=${VERCEL_ORG_ID}" \
    --data "${payload}" 2>&1); then
    echo "Failed to upsert env var ${key} for project ${project_id}:" >&2
    echo "${response}" >&2
    return 1
  fi
}

Refs:

Comment on lines +780 to +782
vercel inspect "${TARGET_DEPLOYMENT}" --token="${VERCEL_TOKEN}" --scope="${VERCEL_ORG_ID}" 2>&1 | tee /tmp/vercel-inspect.txt
echo "## Vercel Runtime Logs (API)" >> "$GITHUB_STEP_SUMMARY"
vercel logs "${TARGET_DEPLOYMENT}" --token="${VERCEL_TOKEN}" --scope="${VERCEL_ORG_ID}" --since=1h --no-follow 2>&1 | tail -n 200
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Runtime logs may expose secrets

Issue: The diagnostic step fetches Vercel runtime logs and outputs them to the workflow console and step summary. If the application crashes during startup due to configuration issues, stack traces may include environment variable values (especially database connection strings in driver errors).

Why: This is noted as a known issue in the PR description (500 FUNCTION_INVOCATION_FAILED). When debugging such failures, logs containing connection strings could be written to the publicly-viewable workflow summary.

Fix: Filter or redact the logs before outputting, or keep them as workflow artifacts with restricted access:

vercel logs "${TARGET_DEPLOYMENT}" --token="${VERCEL_TOKEN}" --scope="${VERCEL_ORG_ID}" --since=1h --no-follow 2>&1 \
  | sed -E 's/(postgres(ql)?:\/\/[^@]+@[^:]+:[^\/]+)/[REDACTED]/gi' \
  | sed -E 's/(ANTHROPIC_API_KEY|SPICEDB_PRESHARED_KEY|BETTER_AUTH_SECRET)=[^[:space:]]*/\1=[REDACTED]/g' \
  | tail -n 200

Refs:

Comment on lines +848 to +862
railway project list --json > /tmp/railway-projects.json
ENV_EXISTS="$(
jq -r \
--arg project_id "${RAILWAY_PROJECT_ID}" \
--arg name "${RAILWAY_ENV_NAME}" \
'[.[] | select(.id == $project_id) | .environments.edges[].node | select(.name == $name)] | length' \
/tmp/railway-projects.json
)"

if [ "${ENV_EXISTS}" = "0" ]; then
echo "Railway environment ${RAILWAY_ENV_NAME} does not exist; nothing to teardown."
exit 0
fi

railway environment delete "${RAILWAY_ENV_NAME}" --yes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Teardown lacks error handling and verification

Issue: The teardown job doesn't check if railway project list succeeded (line 848), and doesn't verify that deletion actually completed (line 862). Unlike the provision job which wraps this in an if block (lines 222-232), the teardown assumes the command succeeds.

Why: If railway project list fails (network issues, auth expired), subsequent jq may return incorrect results, causing silent failures. If deletion fails, Railway environments leak and accumulate, increasing costs. The set -euo pipefail should catch complete failures but partial/malformed output could slip through.

Fix: Add explicit error handling consistent with the provision job:

if ! railway project list --json > /tmp/railway-projects.json; then
  echo "Failed to list Railway environments for project ${RAILWAY_PROJECT_ID}" >&2
  echo "Unable to verify teardown state; exiting to prevent orphaning."
  exit 1
fi

# ... existing jq logic ...

if ! railway environment delete "${RAILWAY_ENV_NAME}" --yes; then
  echo "Failed to delete Railway environment ${RAILWAY_ENV_NAME}." >&2
  echo "Manual cleanup may be required via Railway dashboard or CLI." >&2
  exit 1
fi

# Verify deletion
railway project list --json > /tmp/railway-projects-post.json
POST_EXISTS="$(jq -r --arg project_id "${RAILWAY_PROJECT_ID}" --arg name "${RAILWAY_ENV_NAME}" \
  '[.[] | select(.id == $project_id) | .environments.edges[].node | select(.name == $name)] | length' \
  /tmp/railway-projects-post.json)"
if [ "${POST_EXISTS}" != "0" ]; then
  echo "Warning: Environment ${RAILWAY_ENV_NAME} may not have been fully deleted" >&2
fi

Refs:

Comment on lines +91 to +97
PR_BRANCH="$(
curl -fsS \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"${GITHUB_API_URL}/repos/${WORKFLOW_REPO_FULL_NAME}/pulls/${PR_NUMBER}" |
jq -r '.head.ref // empty'
)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Add explicit timeout to curl call

Issue: The curl call to GitHub API uses -fsS but no --connect-timeout or --max-time. If the API is slow or unresponsive, this could hang indefinitely until the job times out.

Why: Fast-fail behavior helps surface issues quickly rather than wasting runner time.

Fix: Add explicit timeouts:

Suggested change
PR_BRANCH="$(
curl -fsS \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"${GITHUB_API_URL}/repos/${WORKFLOW_REPO_FULL_NAME}/pulls/${PR_NUMBER}" |
jq -r '.head.ref // empty'
)"
PR_BRANCH="$(
curl --connect-timeout 10 --max-time 30 -fsS \
-H "Authorization: Bearer ${GH_TOKEN}" \
-H "Accept: application/vnd.github+json" \
"${GITHUB_API_URL}/repos/${WORKFLOW_REPO_FULL_NAME}/pulls/${PR_NUMBER}" |
jq -r '.head.ref // empty'
)"

Refs:

Comment on lines +327 to +342
for attempt in $(seq 1 "${max_attempts}"); do
line="$(printf '%s\n' "${SERVICE_ENV_DUMP}" | grep -m1 "^${key}=" || true)"
if [ -n "${line}" ]; then
value="${line#*=}"
if ! printf '%s' "${value}" | grep -q '\$[{][{]'; then
printf '%s' "${value}"
return 0
fi
fi

if [ "${attempt}" -lt "${max_attempts}" ]; then
sleep "${sleep_seconds}"
refresh_service_env_dump
continue
fi
done
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Consider exponential backoff for retry loop

Issue: The extract_runtime_var() function uses fixed 2-second delays (30 attempts × 2s = 60s total). When Railway experiences issues, concurrent workflow runs will retry at the same intervals, creating thundering herd patterns.

Why: Exponential backoff with jitter spreads retry traffic and reduces load on external services during incidents. This pattern is also duplicated in the configure-vercel-preview job (lines 447-478).

Fix: Consider adding exponential backoff:

# Inside the retry loop, replace fixed sleep with:
local delay=$((sleep_seconds * (2 ** (attempt / 5))))  # double every 5 attempts
delay=$((delay > 30 ? 30 : delay))  # cap at 30s
sleep $((delay + RANDOM % 3))  # add jitter

Refs:

echo "PR branch is required for preview deploy targeting."
exit 1
fi
npm install -g vercel
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Consider: Pin Vercel CLI version

Issue: The Vercel CLI is installed via npm install -g vercel without version pinning (also at line 776).

Why: While the existing vercel-production.yml workflow uses the same pattern, unpinned global installs can lead to non-deterministic behavior if Vercel releases breaking changes.

Fix: Consider pinning to a specific version for reproducibility:

Suggested change
npm install -g vercel
npm install -g vercel@39

Refs:

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(10) Total Issues | Risk: Medium

Delta Re-Review: This review covers only the 52 new lines added since the prior review. The prior review identified 8 issues that remain pending.

🟠⚠️ Major (1) 🟠⚠️

Inline Comments:

  • 🟠 Major: preview-environments.yml:673 tsx dependency unavailable at runtime — probe will always fail

🟡 Minor (1) 🟡

Inline Comments:

  • 🟡 Minor: preview-environments.yml:661-667 Cleanup not guaranteed on early exit

💭 Consider (1) 💭

💭 1) probe_api_env_schema Clarify probe_status usage intent

Issue: The probe_status variable is captured on line 686 (probe_status=$?) but is only used in the diagnostic echo. The probe function returns successfully regardless of whether schema validation passed or failed.

Why: It's unclear whether this is intentional "informational-only" behavior or a forgotten TODO. Future maintainers may not understand whether probe failure should block deployment.

Fix: Either:

  • If intentionally non-blocking, add a comment: # Note: probe failure is diagnostic only; schema errors caught at runtime
  • If it should block, add: if [ "${probe_status}" -ne 0 ]; then exit 1; fi

Refs:

🕐 Pending Recommendations (8)

Issues from the prior review that remain unaddressed:


🚫 REQUEST CHANGES

Summary: The new probe_api_env_schema() diagnostic function has a blocking issue: it uses tsx which isn't installed in this job. The probe will always fail, defeating its diagnostic purpose. Address the tsx dependency before merge. The prior review's 8 issues also remain pending — particularly the action SHA pinning and secret masking order.


Discarded (2)
Location Issue Reason Discarded
657-658 mktemp/cp operations don't verify success Low-probability edge case (tmp disk full); impact is minimal in ephemeral CI
probe_status Variable captured but not returned Reclassified as Consider — it's unclear whether this is intentional
Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-errors 3 0 1 0 1 0 1
pr-review-devops 2 0 0 0 1 0 0
Total 5 0 1 0 2 0 1

Note: Delta review scope — only reviewed the 52 new lines in probe_api_env_schema(). Prior review findings are carried forward as Pending Recommendations.


{
echo "## API Env Schema Probe"
if node --import tsx/esm -e '
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: tsx dependency unavailable at runtime

Issue: The probe_api_env_schema() function uses node --import tsx/esm to validate env schemas, but tsx is a devDependency that requires pnpm install to be available. This job only runs npm install -g vercel (line 564) and does not install project dependencies.

Why: The probe will fail immediately with a module resolution error:

Error: Cannot find package 'tsx' imported from...

This defeats the diagnostic purpose — the probe will always report failure even when the env schema is valid, making it impossible to distinguish between actual schema issues and missing dependencies.

Fix: Either:

  1. Add pnpm setup and install (recommended if you want full TypeScript validation):
- name: Setup pnpm
  uses: pnpm/action-setup@v4
- name: Setup Node.js
  uses: actions/setup-node@v4
  with:
    node-version-file: '.nvmrc'
    cache: 'pnpm'
- name: Install dependencies
  run: pnpm install --frozen-lockfile
  1. Use a simpler validation that doesn't require tsx (if you want to keep the job lightweight):
# Just check the files exist and are valid JS/TS syntax
node --check packages/agents-core/src/env.ts || echo "Syntax error in env.ts"

Refs:

Comment on lines +661 to +667
cleanup_probe_env() {
if [ -n "${backup_env}" ] && [ -f "${backup_env}" ]; then
mv "${backup_env}" ".env"
else
rm -f ".env"
fi
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Cleanup not guaranteed on early exit

Issue: The cleanup_probe_env() function is defined but not registered as a trap handler. If the script exits unexpectedly between line 669 (cp "${env_file}" ".env") and line 691 (cleanup_probe_env), the original .env file won't be restored.

Why: With set -euo pipefail, any unexpected error (e.g., tee failing on disk full) would exit before cleanup runs, leaving .env in an inconsistent state. While the CI environment is ephemeral, this could cause confusing failures if subsequent steps in the same job depend on .env.

Fix: Add a trap to ensure cleanup runs on exit:

cleanup_probe_env() {
  if [ -n "${backup_env}" ] && [ -f "${backup_env}" ]; then
    mv "${backup_env}" ".env"
  else
    rm -f ".env"
  fi
}
trap cleanup_probe_env EXIT  # Add this after defining the function

cp "${env_file}" ".env"
# ... rest of probe logic ...

cleanup_probe_env
trap - EXIT  # Clear the trap after successful cleanup

Refs:

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(8) Total Issues | Risk: Medium

Delta Re-Review: This review covers the changes since the prior review (commit 8ede9a12fa62c4e58e57d). The delta removes the problematic probe_api_env_schema() function (addressing the tsx dependency issue), adds production env defaults, and introduces a logEnvIssues() helper to 3 env.ts files.

✅ Delta Changes Look Good

The new commit addresses the blocking issue from the last review:

  • Removed probe_api_env_schema() — Fixes the MAJOR issue where tsx was unavailable at runtime
  • Added NODE_ENV, ENVIRONMENT, LOG_LEVEL — Reasonable production defaults for Vercel preview
  • Added logEnvIssues() helper — Improves error visibility when env validation fails
  • Extended debug keys — More comprehensive env snapshot for diagnostics

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: agents-api/src/env.ts:347-352 Duplicated logEnvIssues() helper across three env.ts files

🕐 Pending Recommendations (8)

Issues from prior reviews that remain unaddressed:


🚫 REQUEST CHANGES

Summary: The delta changes are clean and address the prior tsx dependency issue. However, the 6 MAJOR and 2 other issues from the original review remain pending. The action SHA pinning and secret masking order are the most important items to address before merge.


Discarded (1)
Location Issue Reason Discarded
logEnvIssues() Using console.error instead of structured logger Acceptable for bootstrap-phase code where logger may not be initialized yet. Matches existing pattern in env.ts files.
Reviewers (2)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-standards 0 0 0 0 0 0 0
pr-review-consistency 2 0 0 0 1 0 1
Total 2 0 0 0 1 0 1

Note: Delta review scope — only reviewed the changes in commit c4e58e57d. Prior review findings (8 items) are carried forward as Pending Recommendations.

Comment on lines +347 to +352
const logEnvIssues = (scope: string, error: z.ZodError) => {
for (const issue of error.issues) {
const key = issue.path.length > 0 ? issue.path.join('.') : '<root>';
console.error(`[${scope}] ${key}: ${issue.message}`);
}
};
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Consider: Duplicated helper across three files

Issue: This identical logEnvIssues() function is added to three separate env.ts files (agents-api, agents-core, agents-work-apps).

Why: Creates maintenance burden — if the logging format needs to change, three files must be updated. The codebase already has a similar formatZodErrors() helper in agents-cli/src/utils/profiles/profile-manager.ts.

Fix: Consider extracting to a shared location in @inkeep/agents-core (e.g., packages/agents-core/src/utils/zod-helpers.ts). Alternatively, since this is small bootstrap-phase code that's unlikely to change, the duplication may be acceptable — just documenting the pattern.

Refs:

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(8) Total Issues | Risk: Medium

Delta Re-Review: This review covers the changes since the prior review (commit c4e58e57d5e47c0933). The delta adds git metadata flags to Vercel deployments (21 lines changed).

✅ Delta Changes Look Good

The new commit is a clean improvement:

  • Added PR_HEAD_SHA — Correctly sources from github.event.pull_request.head.sha || github.sha
  • Added SHA validation — Guards against missing SHA before proceeding (lines 668-671)
  • Added git metadata flags — Uses -m flags to attach deployment metadata to Vercel:
    • githubDeployment=1
    • githubCommitRef, githubCommitSha
    • githubOrg, githubRepo, githubCommitOrg, githubCommitRepo

This improves deployment traceability in Vercel's dashboard by linking deployments back to their source commits.

🕐 Pending Recommendations (8)

Issues from prior reviews that remain unaddressed:

💭 Consider (1) 💭

💭 1) agents-api/src/env.ts:347-352 Duplicated logEnvIssues() helper

Issue: Identical logEnvIssues() function added to three env.ts files. This was noted in the prior review and remains — consider extracting to @inkeep/agents-core if this helper is expected to evolve.

Refs:


✅ Good Practices Observed (Cumulative)

The workflow demonstrates several security and operational best practices:

  • Least-privilege permissions with contents: read and pull-requests: write
  • Internal PR gating — preview environments only provision for internal PRs
  • PR number/branch validation with regex guards
  • Proper secret masking for database URLs (lines 361-363, 491-493)
  • Safe environment naming with strict ^pr-[0-9]+$ pattern enforcement
  • No use of dangerous triggers like pull_request_target
  • Comprehensive diagnostics on smoke test failure
  • Git metadata attachment (new in this delta) for deployment traceability

🚫 REQUEST CHANGES

Summary: The delta changes are clean and improve deployment traceability. However, the 6 MAJOR and 2 other issues from prior reviews remain pending. The most important items to address before merge are:

  1. Action SHA pinning (actions/checkout@v6 → use SHA like other workflows)
  2. Docker image tag pinning (ghcr.io/railwayapp/cli:latest → use specific version)
  3. Secret masking order (mask immediately after extraction, before any other code runs)

Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator 1 0 1 0 0 8 0
Total 1 0 1 0 0 8 0

Note: Delta review scope — only reviewed the 21 new lines in commit 5e47c0933. Prior review findings (8 items) are carried forward as Pending Recommendations.

@github-actions github-actions bot deleted a comment from claude bot Mar 13, 2026
Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(0) Total Issues | Risk: Low

Delta Re-Review: This review covers the changes since the prior review (commit 3ffb380f036e9eb458). The delta is a 2-line change that addresses the prior review's Minor finding.

✅ Prior Issue Addressed

The commit 36e9eb458 (ci: broaden preview log redaction) correctly addresses the Minor finding from the prior review:

Prior Finding How Addressed
Incomplete secret coverage in redact_preview_logs() Replaced hardcoded list with generic regex: [A-Z_]*(SECRET|KEY|TOKEN|PASSWORD)[A-Z_]*

Before:

-e 's#((ANTHROPIC_API_KEY|SPICEDB_PRESHARED_KEY|BETTER_AUTH_SECRET|...)[:=])[^\r\n[:space:]]+#\1[REDACTED]#g'

After:

-e 's#([A-Z_]*(SECRET|KEY|TOKEN|PASSWORD)[A-Z_]*[:=])[^\r\n[:space:]]+#\1[REDACTED]#g'

Why this is better:

  • Comprehensive: Catches any uppercase variable containing SECRET, KEY, TOKEN, or PASSWORD
  • Future-proof: No manual updates needed when new secrets are added
  • Exact match to recommendation: Implements the suggestion from the prior review

✅ Good Practices Summary (Cumulative)

This workflow now demonstrates excellent security and operational practices:

  • SHA-pinned actionsactions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
  • Version-pinned CLIs@railway/cli@4.31.0, vercel@50.32.5
  • Mask-before-use pattern — Secrets masked immediately after extraction
  • Comprehensive timeouts — All jobs have timeout-minutes, all curl calls have --connect-timeout and --max-time
  • Error handlingupsert_env() captures response, teardown has verification loop
  • Secret redaction — Generic pattern covers all SECRET/KEY/TOKEN/PASSWORD variables + Bearer tokens + postgres URLs
  • Least-privilege permissionscontents: read, pull-requests: write
  • Internal PR gating — Preview environments only for internal PRs
  • Safe naming patterns — Strict ^pr-[0-9]+$ environment name validation

💭 Consider (1) 💭

💭 1) agents-api/src/env.ts:347-352 Duplicated logEnvIssues() helper

Issue: The identical logEnvIssues() function is present in three env.ts files.

Why: Minor maintenance overhead — but since this is small bootstrap-phase code unlikely to change, the duplication is acceptable. Flagged in prior review for awareness.

Refs:


✅ APPROVE

Summary: This PR is ready to merge. All 9 issues from prior reviews have been addressed. The workflow now has comprehensive security hardening (SHA-pinned actions, version-pinned CLIs, immediate secret masking, generic secret redaction), proper timeouts at both job and request levels, robust error handling with verification, and safe environment naming. The logEnvIssues() duplication is a minor consideration for future cleanup but does not block merge.

Note: I'm unable to submit a formal GitHub approval due to permission limitations, but this review recommends approval.


Reviewers (1)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
orchestrator 1 0 1 0 0 0 0
Total 1 0 1 0 0 0 0

Note: Delta review — only reviewed the 2-line change in commit 36e9eb458. No subagent reviewers dispatched for this trivial delta.

@github-actions github-actions bot deleted a comment from claude bot Mar 14, 2026
@itoqa
Copy link

itoqa bot commented Mar 14, 2026

Ito Test Report ✅

13 test cases ran. 13 passed.

✅ Verification focused on included smoke, edge, and adversarial checks and found stable behavior for preview routing, auth challenges, and API resilience under expected conditions.

✅ Passed (13)
Test Case Summary Timestamp Screenshot
ROUTE-2 UI preview root was reachable and correctly presented an auth challenge without 5xx failure. 0:00 ROUTE-2_0-00.png
ROUTE-3 API root returned controlled 404 responses and showed no invocation-failure or 5xx behavior. 0:00 ROUTE-3_0-00.png
EDGE-2 UI remained reachable through 20 rapid reload cycles with stable 200 responses and no persistent fatal state. 1:27 EDGE-2_1-27.png
EDGE-3 Concurrent tab access stayed stable: current API health returned 204 via curl, API root returned 404, and UI consistently redirected to Vercel auth without persistent 5xx degradation. 2:54 EDGE-3_2-54.png
EDGE-4 Both portrait (390x844) and landscape (844x390) layouts rendered cleanly without horizontal overflow or major overlap. 1:27 EDGE-4_1-27.png
EDGE-5 Back/forward traversal across root, guarded routes, and health endpoint stayed stable with no redirect loop or crash. 1:27 EDGE-5_1-27.png
EDGE-6 Current and neighbor preview domains were isolated: current UI flowed to auth login and current API health was 204, while neighbor UI/API returned 404 signatures including DEPLOYMENT_NOT_FOUND. 3:08 EDGE-6_3-08.png
ADV-1 Protected routes remained auth-gated with 401 responses and no sensitive DOM leakage. 4:23 ADV-1_4-23.png
ADV-2 Encoded query payloads remained inert with no script execution or unsafe reflection. 4:42 ADV-2_4-42.png
ADV-3 Traversal-style encoded API probes were safely rejected with controlled 400 BAD_REQUEST responses. 4:51 ADV-3_4-51.png
ADV-4 Forged session-like cookies did not grant privileged access and the flow stayed stable under auth challenge. 5:07 ADV-4_5-07.png
ADV-5 Browser cross-origin fetch attempts were blocked by CORS while explicit Origin curl returned controlled health response headers. 5:39 ADV-5_5-39.png
ADV-6 High-frequency burst traffic preserved endpoint availability with consistent 204 responses before and after the burst. 6:36 ADV-6_6-36.png
📋 View Recording

Screen Recording

Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(5) Total Issues | Risk: Medium

Delta Re-Review: This review covers the changes since the prior approved review (commit 36e9eb45839eac86c8). The delta extracts ~640 lines of inline bash from the workflow into 8 modular scripts under .github/scripts/preview/.

🟠⚠️ Major (1) 🟠⚠️

🟠 1) common.sh:29-40 Regression: Lost error handling for railway project list

Issue: The prior approved version had explicit error handling:

if railway project list --json > /tmp/railway-projects.json; then
  ...
else
  echo "Failed to list Railway environments for project ${RAILWAY_PROJECT_ID}"
  exit 1
fi

The extracted railway_env_exists_count() function lost this — it now silently proceeds if the Railway API call fails.

Why: When Railway authentication fails or the API is down, the error will be a generic "jq: parse error" rather than the informative message from the approved version. This is a functional regression.

Fix: Restore the error handling:

if ! railway project list --json > "${output_path}"; then
  echo "Failed to list Railway environments for project ${project_id}" >&2
  return 1
fi

Refs:

Inline Comments:

  • 🟠 Major: common.sh:29-40 Lost error handling for railway project list (regression)

🟡 Minor (3) 🟡

Inline Comments:

  • 🟡 Minor: common.sh:1-3 Missing set -euo pipefail for consistency
  • 🟡 Minor: deploy-vercel-preview.sh:4-12 Duplicated require_env_vars() — consider sourcing common.sh
  • 🟡 Minor: provision-railway.sh:22-26 railway link output discarded — harder to debug auth failures

💭 Consider (1) 💭

Inline Comments:

  • 💭 Consider: fetch-vercel-runtime-logs.sh:2 Add comment explaining set +e usage

💭 Prior Review Items (Still Valid) 💭

The following item from prior reviews remains applicable:

💭 1) agents-api/src/env.ts:347-352 Duplicated logEnvIssues() helper

Issue: The identical logEnvIssues() function is present in three env.ts files.

Why: Minor maintenance overhead — but since this is small bootstrap-phase code unlikely to change, the duplication is acceptable. Flagged in prior review for awareness.

Refs:


✅ Good Practices Observed

The script extraction maintains excellent patterns from prior reviews:

  • SHA-pinned actionsactions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
  • Version-pinned CLIs@railway/cli@4.31.0, vercel@50.32.5
  • Mask-before-use pattern — Secrets masked immediately after extraction
  • Comprehensive timeouts — All jobs have timeout-minutes, all curl calls have --connect-timeout and --max-time
  • Secret redaction — Generic pattern covers all SECRET/KEY/TOKEN/PASSWORD variables
  • Strict modeset -euo pipefail in all main scripts
  • Modular design — Clean separation of concerns into focused scripts
  • shellcheck directives — Proper source annotations for static analysis

🚫 REQUEST CHANGES

Summary: The script extraction is well-structured and maintains most good practices from prior reviews. However, there is one regression: the error handling for railway project list was lost during extraction. This must be restored before merge. The other findings (missing strict mode in common.sh, duplicated require_env_vars, discarded railway link output) are minor improvements.

Note: Unable to submit formal REQUEST_CHANGES due to permission limitations, but this review identifies a blocking regression.


Discarded (8)
Location Issue Reason Discarded
provision-railway.sh:119-133 Retry loop uses fixed delay without jitter Pre-existing pattern from approved commit; not introduced by this delta
upsert-vercel-preview-env.sh:43-75 Retry loop uses fixed delay without jitter Pre-existing pattern; not introduced by this delta
teardown-railway.sh:36-45 Verification loop uses fixed delay Pre-existing pattern; not introduced by this delta
smoke-preview.sh:22-30 wait_for_success_url discards curl output Pre-existing pattern; would require larger refactor
deploy-vercel-preview.sh:121-132 Vercel deploy lacks timeout Pre-existing; job timeout provides outer bound
provision-railway.sh:22-26 Railway CLI lacks timeout Pre-existing; job timeout provides outer bound
provision-railway.sh:101-105 railway variable set output discarded Pre-existing pattern; set -e will catch failures
smoke-preview.sh:4-12 Duplicated require_env_vars Covered by deploy-vercel-preview.sh finding
Reviewers (5)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-devops 4 0 0 0 2 0 1
pr-review-sre 6 0 0 0 0 0 6
pr-review-errors 7 1 0 0 2 0 0
pr-review-security-iam 0 0 0 0 0 0 0
pr-review-standards 0 0 0 0 0 0 0
Total 17 1 0 0 4 0 7

Note: Most SRE findings were discarded because they related to pre-existing patterns that were already present in the approved commit. The security and standards reviewers found no new issues in the delta. The key finding (railway_env_exists_count regression) was identified by the errors reviewer.

Comment on lines +29 to +40
railway_env_exists_count() {
local project_id="$1"
local env_name="$2"
local output_path="${3:-/tmp/railway-projects.json}"

railway project list --json > "${output_path}"
jq -r \
--arg project_id "${project_id}" \
--arg name "${env_name}" \
'[.[] | select(.id == $project_id) | .environments.edges[].node | select(.name == $name)] | length' \
"${output_path}"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: Regression — lost error handling for railway project list

Issue: The prior approved version (commit 36e9eb458) had explicit error handling:

if railway project list --json > /tmp/railway-projects.json; then
  ...
else
  echo "Failed to list Railway environments for project ${RAILWAY_PROJECT_ID}"
  exit 1
fi

This extracted version silently proceeds if the Railway API call fails. With set -euo pipefail in calling scripts, the failure should propagate, but the error message ("Failed to list Railway environments...") is lost, making debugging much harder.

Why: When Railway authentication fails or the API is down, the error will be a generic "jq: parse error" rather than the informative message that existed before. This is a functional regression from the approved version.

Fix:

Suggested change
railway_env_exists_count() {
local project_id="$1"
local env_name="$2"
local output_path="${3:-/tmp/railway-projects.json}"
railway project list --json > "${output_path}"
jq -r \
--arg project_id "${project_id}" \
--arg name "${env_name}" \
'[.[] | select(.id == $project_id) | .environments.edges[].node | select(.name == $name)] | length' \
"${output_path}"
}
railway_env_exists_count() {
local project_id="$1"
local env_name="$2"
local output_path="${3:-/tmp/railway-projects.json}"
if ! railway project list --json > "${output_path}"; then
echo "Failed to list Railway environments for project ${project_id}" >&2
return 1
fi
jq -r \
--arg project_id "${project_id}" \
--arg name "${env_name}" \
'[.[] | select(.id == $project_id) | .environments.edges[].node | select(.name == $name)] | length' \
"${output_path}"
}

Refs:

Comment on lines +1 to +3
#!/usr/bin/env bash

require_env_vars() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Missing set -euo pipefail for consistency

Issue: All other scripts in this directory use set -euo pipefail on line 2, but common.sh omits it. While scripts that source it set strict mode themselves, the inconsistency creates a risk if common.sh is ever sourced in a different context.

Why: Defense-in-depth — ensures common.sh functions fail fast even if sourced before the calling script's strict mode takes effect.

Fix: Add after the shebang:

Suggested change
#!/usr/bin/env bash
require_env_vars() {
#!/usr/bin/env bash
set -euo pipefail
require_env_vars() {

Refs:

Comment on lines +4 to +12
require_env_vars() {
local required
for required in "$@"; do
if [ -z "${!required:-}" ]; then
echo "Missing required configuration: ${required}" >&2
exit 1
fi
done
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: Duplicated require_env_vars() — consider sourcing common.sh

Issue: This script defines its own require_env_vars() function (lines 4-12), duplicating the implementation in common.sh. Three scripts (deploy-vercel-preview.sh, smoke-preview.sh, capture-preview-failure-diagnostics.sh) have this duplication, while four others correctly source common.sh.

Why: Duplicated utility functions can diverge over time. Centralizing in common.sh ensures consistent behavior and reduces maintenance burden.

Fix: Replace the local definition with:

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=.github/scripts/preview/common.sh
source "${SCRIPT_DIR}/common.sh"

Refs:

Comment on lines +22 to +26
railway link \
--project "${RAILWAY_PROJECT_ID}" \
--service "${RAILWAY_OUTPUT_SERVICE}" \
--environment "${RAILWAY_TEMPLATE_ENVIRONMENT}" \
>/dev/null
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Minor: railway link output discarded — harder to debug auth failures

Issue: The output of railway link is redirected to /dev/null. If Railway authentication or linking fails, subsequent commands will fail with misleading "missing variable" or "environment not found" errors rather than surfacing the actual auth/link failure.

Why: When railway link fails, the error is hidden and cascading failures produce confusing messages like "no project linked" without showing the root cause (e.g., invalid token, network error).

Fix: Remove stdout suppression and add explicit error handling:

echo "Linking Railway project ${RAILWAY_PROJECT_ID}..."
if ! railway link \
  --project "${RAILWAY_PROJECT_ID}" \
  --service "${RAILWAY_OUTPUT_SERVICE}" \
  --environment "${RAILWAY_TEMPLATE_ENVIRONMENT}"; then
  echo "Failed to link Railway project" >&2
  exit 1
fi

Refs:

  • Same pattern appears in teardown-railway.sh:18-22 and upsert-vercel-preview-env.sh:37-41

@@ -0,0 +1,22 @@
#!/usr/bin/env bash
set +e
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💭 Consider: Add comment explaining set +e usage

Issue: This script uses set +e (line 2) rather than set -euo pipefail like all other scripts in this directory. This appears intentional for diagnostic collection where partial output is valuable.

Why: The inconsistency may confuse future maintainers. A brief comment explaining the design choice would help.

Fix: Add explanatory comment:

Suggested change
set +e
#!/usr/bin/env bash
# Diagnostic script - use set +e to allow partial output collection even if some commands fail
set +e

Refs:

@github-actions github-actions bot deleted a comment from claude bot Mar 16, 2026
@itoqa
Copy link

itoqa bot commented Mar 16, 2026

Ito Test Report ✅

18 test cases ran. 18 passed.

✅ Verification found the preview workflow behaviors that were exercised are consistent with the implementation, including input guards, protected UI status handling, diagnostics execution, and redaction behavior.

✅ Passed (18)
Test Case Summary
ROUTE-1 Checks page for PR #2681 showed Preview Environments run with the expected core jobs and successful job graph continuity.
ROUTE-2 API alias health endpoint returned HTTP 204 and UI alias consistently redirected to Vercel authentication, matching protected preview expectations.
ROUTE-3 Smoke Test Preview URLs job showed successful completion with Basic URL checks in the run flow, and alias checks aligned with healthy API plus protected UI behavior.
ROUTE-5 Failed smoke run 23043520535 shows Basic URL checks failure and executes both diagnostics steps; API triage capture was recorded with a non-2xx response for debugging.
LOGIC-1 Found repeated [agents-api env] key-level diagnostics in failed smoke logs for run 23043520535, confirming observable per-key env validation failures.
LOGIC-2 Confirmed [agents-core env] key-level diagnostics and aggregate invalid-env failure via reproducible local validation output; direct GitHub log download remained admin-gated (403) but did not prevent logic verification.
LOGIC-3 Confirmed [agents-work-apps env] key-level diagnostics and aggregate invalid-env failure via reproducible local validation output; direct GitHub log download remained admin-gated (403) but logic behavior was still validated.
LOGIC-4 Verified repeated deterministic unresolved runtime variable errors in provisioning/injection logs and captured current preview API health response artifact.
EDGE-1 Validated run 22983890503 where Preview Flow Disabled completed and provision, deploy, and smoke jobs were not executed.
EDGE-2 Fork PR #2699 exposed no Preview Environments checks or runs, and no evidence of provisioning, env injection, or smoke execution for that fork context.
EDGE-3 Confirmed PR number guard rejects non-numeric input with explicit invalid-number failure using authenticated-dispatch attempt plus local compute-context fallback validation; downstream execution did not start in fallback run.
EDGE-4 Validated branch-name guard in upsert script rejects disallowed branch characters with explicit invalid-branch error, preventing the upsert/deploy path from continuing.
EDGE-5 UI preview endpoint returned protected HTTP 401 while latest smoke run passed, and a separate smoke run failed with exit-code annotations, demonstrating boundary acceptance vs failing execution paths.
ADV-1 Provision annotations explicitly withhold secret-related outputs (spicedb_endpoint, run_db_url, manage_db_url), showing redaction protections are active in run-visible diagnostics.
ADV-2 Validated injection-style PR number payload is rejected by numeric guard as literal input and does not execute as a command in fallback compute-context execution.
ADV-3 Observed preview alias hosts for PR #2681 remained host-only under the expected preview domain pattern, and health capture succeeded on the API alias; additional privileged abuse-generation attempts were denied by integration permissions.
ADV-4 Unauthenticated desktop and mobile checks consistently stayed behind Vercel protection (HTTP 401 and login redirect), including five rapid mobile refreshes without protected content exposure.
ADV-5 After repeated navigation churn, the workflow list returned to a stable state and still reported the same latest run as authoritative.
📋 View Recording

Screen Recording

Copy link
Contributor

@claude claude bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review Summary

(0) Total Issues | Risk: Low

Delta Re-Review: This review covers the changes since the prior review (commit 39eac86c8d4fb50c64). The delta is a single commit (ci: harden preview script extraction) with +31/-34 lines across 8 bash scripts.

✅ All Prior Issues Addressed

The commit d4fb50c64 correctly addresses all 4 findings from the prior review:

Prior Finding How Addressed
🟠 Major: Lost error handling for railway project list ✅ Added error handling to railway_env_exists_count() in common.sh:35-38
🟡 Minor: railway link output discarded ✅ Added explicit error handling in provision-railway.sh:22-29, teardown-railway.sh:18-25, upsert-vercel-preview-env.sh:37-44
🟡 Minor: Duplicated require_env_vars() ✅ Scripts now source common.sh instead of duplicating the function
💭 Consider: Add comment explaining set +e ✅ Added explanatory comment in fetch-vercel-runtime-logs.sh:2

✅ Good Practices Summary (Cumulative)

This workflow now demonstrates excellent security and operational practices:

  • SHA-pinned actionsactions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd
  • Version-pinned CLIs@railway/cli@4.31.0, vercel@50.32.5
  • Mask-before-use pattern — Secrets masked immediately after extraction
  • Comprehensive timeouts — All jobs have timeout-minutes, all curl calls have --connect-timeout and --max-time
  • Error handlingupsert_env() captures response, railway_env_exists_count() fails explicitly, railway link checked in all scripts
  • Secret redaction — Generic pattern covers all SECRET/KEY/TOKEN/PASSWORD variables + Bearer tokens + postgres URLs
  • Least-privilege permissionscontents: read, pull-requests: write
  • Internal PR gating — Preview environments only for internal PRs
  • Safe naming patterns — Strict ^pr-[0-9]+$ environment name validation
  • Code consolidation — Shared helpers in common.sh, no duplicated code

💭 Consider (1) 💭

💭 1) agents-api/src/env.ts:347-352 Duplicated logEnvIssues() helper

Issue: The identical logEnvIssues() function is present in three env.ts files (agents-api, agents-core, agents-work-apps).

Why: Minor maintenance overhead — but since this is small bootstrap-phase code unlikely to change, the duplication is acceptable. Flagged in prior reviews for awareness.

Refs:


✅ APPROVE

Summary: All 4 issues from the prior review have been correctly addressed. The workflow now has comprehensive security hardening, proper error handling throughout, and clean code organization with shared helpers in common.sh. Ready to merge. 🚀

Note: Unable to submit formal GitHub approval due to permission limitations, but this review recommends approval.


Reviewers (3)
Reviewer Returned Main Findings Consider While You're Here Inline Comments Pending Recs Discarded
pr-review-devops 4 0 0 0 0 0 0
pr-review-errors 0 0 0 0 0 0 0
pr-review-standards 0 0 0 0 0 0 0
Total 4 0 0 0 0 0 0

Note: All reviewers confirmed prior issues are correctly addressed. No new issues identified in the delta.

@github-actions github-actions bot deleted a comment from claude bot Mar 16, 2026
@itoqa
Copy link

itoqa bot commented Mar 16, 2026

Ito Test Report ✅

15 test cases ran. 15 passed.

✅ Verification completed with a code-first review of the non-passed outcomes; the included results are stable reachability, security, stress, and CI-path checks that remained consistent with the inspected local code paths.

✅ Passed (15)
Test Case Summary Timestamp Screenshot
ROUTE-1 UI root at http://localhost:3000 loaded with HTTP 200, stayed on localhost, and showed no invocation-failure marker. 0:00 ROUTE-1_0-00.png
ROUTE-4 Root endpoint was reachable with HTTP 200 and response content did not contain FUNCTION_INVOCATION_FAILED. 0:00 ROUTE-4_0-00.png
ROUTE-5 UI bootstrap traffic used local API host localhost:3002 and did not target staging or production domains. 1:34 ROUTE-5_1-34.png
EDGE-2 After clearing session state, the UI loaded deterministically without crash markers or console errors. 1:43 EDGE-2_1-43.png
EDGE-4 Three stress cycles completed without unrecoverable blank state, runtime crash markers, or domain drift. 2:06 EDGE-4_2-06.png
EDGE-5 Duplicate tabs remained consistent on the same localhost route after alternating reload operations. 2:32 EDGE-5_2-32.png
EDGE-6 Rapid click and key burst did not destabilize the UI; page stayed responsive with normal request behavior. 2:48 EDGE-6_2-48.png
ADV-1 Untrusted-origin probe returned without permissive CORS allow-origin for evil.example. 4:22 ADV-1_4-22.png
ADV-3 Encoded script payload navigation did not set window.__xss and no execution signal was observed. 4:50 ADV-3_4-50.png
ADV-6 Rapid interaction burst completed and UI remained stable without invocation-failure crash marker. 6:09 ADV-6_6-09.png
MANUAL-1 Confirmed compute-context enforces numeric PR validation and fails immediately on invalid input via simulated execution. 21:02 MANUAL-1_21-02.png
MANUAL-2 Inspected failed run 23043520535 and confirmed step-summary/diagnostic flow includes preview context, Tier 1 provisioning, smoke diagnostics capture, and Vercel runtime log retrieval; sensitive tokens are redacted in logs. 11:05 MANUAL-2_11-05.png
MANUAL-3 Verified closed-PR teardown job wiring and teardown script logic that deletes the PR environment and confirms absence afterward. 21:02 MANUAL-3_21-02.png
MANUAL-4 Executed env parser validation paths with required vars removed/invalid and confirmed per-field stderr diagnostics precede consolidated invalid-environment errors for agents-api, agents-core, and agents-work-apps. 11:05 MANUAL-4_11-05.png
MANUAL-5 Confirmed workflow logic computes fork PRs as external and gates preview provisioning/deploy/smoke/teardown jobs behind is_internal_pr=true. 21:02 MANUAL-5_21-02.png
📋 View Recording

Screen Recording

@vnv-varun vnv-varun changed the title ci: provision PR preview environments ci: provision PR preview environments in Railway Mar 16, 2026
@vnv-varun vnv-varun requested a review from amikofalvy March 18, 2026 18:13
const logEnvIssues = (scope: string, error: z.ZodError) => {
for (const issue of error.issues) {
const key = issue.path.length > 0 ? issue.path.join('.') : '<root>';
console.error(`[${scope}] ${key}: ${issue.message}`);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pullfrog Should we be using the logger utility instead of console error?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

console.error is intentional here — logEnvIssues runs at module load time during bootstrap, before the structured logger is initialized (the logger itself depends on env vars like LOG_LEVEL being parsed from this same schema). Using the logger would create a circular dependency. This matches the existing pattern across all three env.ts files (agents-api, agents-core, agents-work-apps).

Pullfrog  | View workflow run | Using Claude Code | Triggered by Pullfrogpullfrog.com𝕏

@pullfrog
Copy link
Contributor

pullfrog bot commented Mar 18, 2026

Addressed review comment from @amikofalvy on packages/agents-work-apps/src/env.ts:82.

Question: Should logEnvIssues use the structured logger instead of console.error?

Answer: No code change needed. console.error is intentional — this code runs at module load time during bootstrap, before the structured logger is initialized (the logger depends on env vars like LOG_LEVEL being parsed from this same schema). Using the logger would create a circular dependency. Replied inline.

Pullfrog  | View workflow run | Using Claude Code | Triggered by Pullfrogpullfrog.com𝕏

Copy link
Collaborator

@amikofalvy amikofalvy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's try it!

@amikofalvy amikofalvy added this pull request to the merge queue Mar 18, 2026
Merged via the queue into main with commit 91d6f7e Mar 18, 2026
27 checks passed
@amikofalvy amikofalvy deleted the codex/preview-env-clean-2662 branch March 18, 2026 21:28
dimaMachina pushed a commit that referenced this pull request Mar 18, 2026
* ci: add preview env diagnostics

* ci: probe preview env schema before deploy

* ci: probe preview env schema before deploy

* ci: harden preview api env defaults

* ci: attach git metadata to preview deploys

* ci: harden preview workflow operations

* ci: broaden preview log redaction

* ci: extract preview workflow scripts

* ci: harden preview script extraction
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants