ci: harden Railway preview provisioning by vnv-varun · Pull Request #2900 · inkeep/agents

vnv-varun · 2026-03-30T19:07:16Z

Summary

Make preview Railway automation use a single GraphQL control plane and reduce unnecessary Railway API churn. This removes the Railway CLI dependency from preview provisioning, bootstrap, env injection, and teardown, makes runtime-var repair deterministic, and makes the workflow compatible with the workspace-token API path we validated live.

Changes

replace preview Railway CLI operations with GraphQL helpers for:
- environment lookup
- environment create and delete
- variable reads and writes
- TCP proxy creation and readiness checks
remove Railway CLI installation from the preview workflow
make RAILWAY_API_TOKEN the primary CI secret path, with fallback to RAILWAY_TOKEN during migration
keep explicit preview runtime-var templates authoritative so stale inherited values get rewritten
add rate-limit-aware retries and Retry-After handling for Railway API calls
reduce Railway polling by using narrower environment lookups and a single runtime-var interpolation loop
cancel superseded same-PR preview runs and serialize Railway-mutating jobs across PRs
surface GraphQL helper errors instead of silently returning empty values
treat tcpProxies: null as an empty list during proxy readiness polling
harden teardown so it exits cleanly if the preview env was already deleted between lookup and delete

Why

The preview workflow was doing too much unnecessary Railway control-plane work and splitting that work across two different integration surfaces:

Railway CLI for some operations
direct GraphQL for others

That made auth behavior inconsistent, made rate-limit behavior harder to reason about, and forced preview CI to keep using the broader account token because the workspace token did not work with the CLI path.

This PR fixes that by moving preview Railway automation onto one API surface with one retry model and one token model.

Test Plan

bash -n .github/scripts/preview/*.sh
workflow YAML parse
git diff --check
live smoke through the refactored shell helpers using the Railway workspace token:
- resolve preview-base environment ID
- resolve doltgres service ID
- create a temporary preview env from preview-base
- upsert a service variable through GraphQL
- read that variable back through GraphQL
- delete the temporary env through GraphQL
confirm the full preview workflow reruns cleanly on the latest PR head

Notes

This PR hardens teardown behavior, but it does not add a scheduled stale-preview janitor.
There are still stale Railway PR envs from already-merged PRs in the live project today. This PR does not retroactively delete them.

changeset-bot · 2026-03-30T19:07:22Z

⚠️ No Changeset found

Latest commit: 53598e8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

Click here to learn what changesets are, and how to add one.

Click here if you're a maintainer who wants to add a changeset to this PR

vercel · 2026-03-30T19:07:24Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
agents-api	Ready	Preview, Comment	Mar 30, 2026 8:39pm
agents-docs	Ready	Preview, Comment	Mar 30, 2026 8:39pm
agents-manage-ui	Ready	Preview, Comment	Mar 30, 2026 8:39pm

pullfrog · 2026-03-30T20:27:14Z

TL;DR — Replaces all Railway CLI usage in preview environment workflows with direct GraphQL API calls, eliminating the split control plane (CLI + GraphQL) that caused two auth behaviors, two retry models, and workspace-token incompatibility. All GraphQL helpers now surface application-level errors (not just HTTP-level) through a shared railway_require_graphql_success gate, and the teardown polling loop uses jittered sleep to reduce thundering-herd pressure.

Key changes

Unified railway_graphql client with retry and rate-limit handling — Replaces the old fire-and-forget curl wrapper with a robust GraphQL client that handles HTTP status codes, Retry-After headers (including decimal values), and exponential backoff with jitter.
Structured GraphQL error surface via railway_require_graphql_success — Adds railway_graphql_has_errors, railway_graphql_first_error_message, and railway_require_graphql_success helpers to detect and report application-level GraphQL errors, wired into every query/mutation helper.
New GraphQL helper functions in common.sh — Adds railway_environment_create_from_source, railway_environment_delete_by_id, railway_project_service_id, railway_variables_json, and railway_variable_collection_upsert to replace CLI equivalents.
Remove Railway CLI from all workflow jobs — Drops four npm install -g @railway/cli@4.31.0 steps across provision, bootstrap, env-inject, and teardown jobs.
Migrate auth secret to RAILWAY_API_TOKEN — Uses secrets.RAILWAY_API_TOKEN as primary with fallback to secrets.RAILWAY_TOKEN during migration.
Batch runtime variable resolution in resolve_runtime_vars — Replaces three sequential extract_runtime_var calls with a single loop that fetches rendered variables once per attempt and checks all three keys atomically.
Cross-PR serialization for Railway-mutating jobs — Adds concurrency groups to provision and teardown jobs so Railway mutations serialize across PRs while same-PR runs cancel in-progress.
Idempotent teardown with race-condition guard — Handles the case where a preview environment disappears between existence check and delete by re-checking and exiting cleanly.
Null-safe jq filters and jittered polling — Uses // [] fallback so tcpProxies null responses don't crash the proxy check/create loop, and replaces fixed sleep 2 in teardown polling with sleep_with_jitter to avoid synchronized retries.

_{Summary ｜ 6 files ｜ 7 commits ｜ base: main ← varun/fix-preview-runtime-var-sync}

Unified GraphQL control plane

Before: Preview automation split across Railway CLI (railway link, railway variable list/set, railway environment new/delete) and a minimal railway_graphql curl wrapper with no retries, browser-spoofing headers, and no variable support.
After: A single railway_graphql function handles all Railway API communication with proper Authorization: Bearer headers, per-attempt HTTP status inspection, Retry-After respect (including decimal seconds), and exponential backoff with jitter for 429/5xx errors.

The old railway_graphql was a bare curl -fsS call with User-Agent: Mozilla/5.0 and Origin: https://railway.com headers — artifacts of working around an earlier API restriction. The new version uses a clean User-Agent: inkeep-preview-ci, captures response bodies and headers to temp files, and distinguishes retryable errors (429, 5xx, network failures) from permanent ones.

What retry strategy does the new client use?

On success (HTTP 200): return immediately. On Retry-After header: sleep for the server-specified duration with jitter. On 429/5xx/network error: exponential backoff via sleep_with_backoff_and_jitter (base × 2^attempt, capped at 30s). On other 4xx: fail immediately with the response body on stderr. Max 6 attempts by default, configurable via RAILWAY_GRAPHQL_MAX_ATTEMPTS.

common.sh

Structured GraphQL error handling

Before: GraphQL helpers silently passed through responses containing { "errors": [...] } — callers only saw HTTP-level failures from the transport layer.
After: Three new helpers (railway_graphql_has_errors, railway_graphql_first_error_message, railway_require_graphql_success) check for application-level GraphQL errors and surface the first error message with caller-provided context.

Every query/mutation helper — railway_project_service_id, railway_variables_json, railway_environment_id, and the TCP proxy flow in railway_ensure_tcp_proxy — now calls railway_require_graphql_success or uses railway_graphql_has_errors directly. The TCP proxy query loop retries on GraphQL errors rather than crashing, and the create-proxy path uses the shared error extraction instead of inline jq -e '.errors' checks.

common.sh

CLI removal and secret migration

Before: Four workflow jobs each ran npm install -g @railway/cli@4.31.0 and used secrets.RAILWAY_TOKEN as the sole auth path.
After: All CLI install steps are removed. Auth uses secrets.RAILWAY_API_TOKEN || secrets.RAILWAY_TOKEN so the workspace-scoped token works immediately while the old secret name remains as a fallback.

This is the change that makes the workflow compatible with the workspace-token API path — the CLI required the broader account token, while the GraphQL endpoint works with either.

preview-environments.yml

Batch runtime variable resolution

Before: Three separate extract_runtime_var calls each polled Railway's CLI independently, grepping a flat env dump for one key at a time.
After: A single resolve_runtime_vars function fetches the rendered variable JSON once per attempt via railway_variables_json and checks all three keys (MANAGE_DB_URL, RUN_DB_URL, SPICEDB_ENDPOINT) atomically.

The ensure_runtime_var_seeded function also gained early-exit logic — if an explicit template value already matches the current value, it skips the upsert entirely and logs "Updated" vs "Seeded" to distinguish overwrites from first writes.

provision-railway.sh · common.sh

Cross-PR serialization and same-PR cancellation

Before: Workflow-level concurrency was set to cancel-in-progress: false, meaning overlapping runs for the same PR would queue rather than cancel, while Railway mutations across different PRs could race.
After: Workflow-level concurrency allows same-PR cancellation (cancel-in-progress: true). Provision and teardown jobs each have their own concurrency group keyed to preview-railway-${{ vars.RAILWAY_PROJECT_ID }} with cancel-in-progress: false, serializing Railway mutations across all PRs.

preview-environments.yml

Idempotent teardown

Before: Teardown required railway link and called railway environment delete — if the environment vanished between the existence check and the delete call, the job would fail.
After: Teardown resolves the environment ID via railway_wait_for_environment_id, and if that fails, re-checks existence. If the environment disappeared in the interim, it exits cleanly instead of failing. The post-delete polling loop now uses sleep_with_jitter instead of a fixed sleep 2 to avoid synchronized retries across concurrent teardowns.

teardown-railway.sh

^{｜ View workflow run ｜ Triggered by Pullfrog ｜ Using Claude Opus ｜ 𝕏}

pullfrog

Solid consolidation — one API surface, one retry model, one token path. Two issues worth addressing before merging: a jq null-iteration crash in the TCP proxy retry loop, and a minor Retry-After header parsing gap for fractional values. Everything else looks correct.

^{｜ Fix all ➔ ｜ Fix 👍s ➔ ｜ View workflow run ｜ Using Claude Opus ｜ 𝕏}

pullfrog · 2026-03-30T20:28:10Z

.github/scripts/preview/common.sh

+}' "$(jq -nc --arg environment_id "${env_id}" --arg service_id "${service_id}" '{environmentId: $environment_id, serviceId: $service_id}')"
    )"

+    count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"


If the Railway API returns null for tcpProxies (e.g. the service instance isn't provisioned yet in a freshly created environment), iterating .data.tcpProxies[] will crash under set -e instead of retrying.

Use the // [] fallback to coerce null into an empty array:

Suggested change

count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"

count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"

pullfrog · 2026-03-30T20:28:11Z

.github/scripts/preview/common.sh

+    count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
    active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"


Same null crash risk on the active line:

Suggested change

count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"

active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"

count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"

active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"

pullfrog · 2026-03-30T20:28:11Z

.github/scripts/preview/common.sh

+
    if [ "${attempt}" -lt "${max_attempts}" ]; then
-      sleep_with_jitter "${sleep_seconds}"
+      if [ -n "${retry_after}" ] && [[ "${retry_after}" =~ ^[0-9]+$ ]]; then


Minor: Retry-After can be a decimal (e.g. 1.5). The ^[0-9]+$ regex rejects it, falling through to exponential backoff even though the server gave an explicit delay. Not a correctness bug — backoff is a safe fallback — but worth noting for a future pass if Railway starts sending fractional values.

pullfrog · 2026-03-30T20:28:11Z

.github/scripts/preview/common.sh

+        -w '%{http_code}' \
+        -H "Content-Type: application/json" \
+        -H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \
+        -H "User-Agent: inkeep-preview-ci" \


Good change from User-Agent: Mozilla/5.0 plus the spoofed Origin/Referer headers to an honest User-Agent: inkeep-preview-ci. Much better for traceability.

pullfrog · 2026-03-30T20:28:11Z

.github/scripts/preview/provision-railway.sh

@@ -52,22 +54,18 @@ else
  echo "Railway environment ${RAILWAY_ENV_NAME} already exists"
 fi

+RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 20 4)"


This calls railway_env_exists_count (line 43) then railway_wait_for_environment_id (line 57) — two separate GraphQL round-trips that both resolve the environment list. Not a bug, and the second one does polling, but if you wanted to trim one API call you could extract the env ID directly from railway_environment_id at line 43 and skip railway_env_exists_count. Low priority — just noting the redundancy.

.github/workflows/preview-environments.yml

pullfrog · 2026-03-30T20:28:11Z

.github/scripts/preview/teardown-railway.sh

+if ! RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 10 2)"; then
+  if [ "$(railway_env_exists_count "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}")" = "0" ]; then
+    echo "Railway environment ${RAILWAY_ENV_NAME} disappeared before teardown; nothing to do."
+    exit 0
+  fi
+  echo "Failed to resolve Railway environment ID for ${RAILWAY_ENV_NAME} during teardown." >&2
+  exit 1
+fi


The race-condition handling here is well done — re-checking railway_env_exists_count after railway_wait_for_environment_id fails accounts for the environment disappearing between the initial existence check and the ID lookup.

claude

PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) common.sh GraphQL helper functions silently swallow errors

files:

common.sh:193-202 — railway_project_service_id
common.sh:220-224 — railway_variables_json
common.sh:246-267 — railway_environment_id

Issue: Multiple GraphQL helper functions extract data from responses without checking for the .errors field. When a GraphQL query fails (invalid IDs, permission denied, server error), the response contains errors but these functions return empty strings/objects instead of failing.

Why: GraphQL returns HTTP 200 with an .errors array for application-level failures. The current pattern of extracting .data.xxx // empty masks these errors completely. This leads to cascading failures with misleading error messages. For example:

An auth error returns empty service ID → empty variables JSON → empty database URLs
The eventual failure is "missing RUN_DB_URL" when the root cause was "invalid Railway token"

Fix: Add error checking to each GraphQL helper function:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error: $(jq -r '.errors[0].message // \"unknown\"' <<< "${response}")" >&2
  return 1
fi

Consider creating a shared railway_check_graphql_response() helper to DRY this up across all functions.

Refs:

GraphQL error handling best practices

Inline Comments:

🟠 Major: common.sh:202 GraphQL errors silently return empty string
🟠 Major: common.sh:224 GraphQL errors silently return empty object

🟡 Minor (2) 🟡

🟡 1) preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job serialization

Issue: The workflow uses cancel-in-progress: true at workflow level while Railway-mutating jobs use cancel-in-progress: false at job level. This creates a race where workflow cancellation can abort Railway jobs mid-operation.
Why: Could leave preview environments in inconsistent state.
Fix: Consider setting workflow-level cancel-in-progress: false or documenting the expected behavior.
Refs: preview-environments.yml:171-173

Inline Comments:

🟡 Minor: preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job-level serialization
🟡 Minor: provision-railway.sh:183 Dead code - refresh_service_env_dump is unused here

💭 Consider (1) 💭

💭 1) teardown-railway.sh:46 Use jitter for consistency
Issue: Fixed sleep 2 while other polling loops use sleep_with_jitter.
Why: Minor inconsistency; low impact due to job-level serialization.
Fix: Replace with sleep_with_jitter 2.

Inline Comments:

💭 Consider: teardown-railway.sh:46 Use jitter for consistency

💡 APPROVE WITH SUGGESTIONS

Summary: This is a solid refactor that consolidates Railway automation onto a single GraphQL control plane. The architecture is cleaner and the rate-limit handling is well-designed. The main gap is GraphQL error handling — the helper functions silently swallow API errors, which could make debugging Railway issues difficult. Adding explicit error checks would make this more robust for production CI.

Discarded (6)

Location	Issue	Reason Discarded
`common.sh:107-152`	Temp files not cleaned up via trap	Minor hygiene; CI runners are ephemeral, cleanup happens on happy paths
`common.sh:144-151`	GraphQL errors may log sensitive data	Railway API responses don't typically contain secrets; existing redact_preview_logs covers other outputs
`preview-environments.yml:191`	Token fallback may persist indefinitely	Valid migration pattern; comment documents intent; low priority to track cleanup
`common.sh:168`	railway_environment_create_from_source doesn't check errors	Caller handles failure via fallback check on line 48
`common.sh:243`	railway_variable_collection_upsert doesn't check errors	Lower severity; provision script validates interpolation result downstream
`common.sh:65-75`	railway_env_exists_count conflates errors with non-existence	Low confidence; defensive checks exist in callers

Reviewers (2)

Reviewer	Returned	Main Findings	Consider	Inline Comments	Discarded
`pr-review-devops`	6	1	1	2	3
`pr-review-errors`	8	1	0	2	3
Total	14	2	1	4	6

claude · 2026-03-30T20:29:39Z

.github/scripts/preview/common.sh

  )"

-  jq -r --arg env_name "${env_name}" '.data.environments.edges[] | select(.node.name == $env_name) | .node.id' <<< "${response}"
+  jq -c '.data.variables // {}' <<< "${response}"


🟠 MAJOR: GraphQL errors silently return empty object

Issue: When the GraphQL query fails (invalid environment ID, permission error, or server error), the response contains an .errors field. This function extracts .data.variables // {}, which returns an empty object {} on error, masking the failure.

Why: A GraphQL error results in an empty object being returned. Downstream callers extract variables from this empty object, producing empty strings for database URLs without any error indication. These propagate to later steps, causing hard-to-diagnose failures.

Fix: Check for GraphQL errors before extracting data:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then echo "GraphQL error fetching variables: $(jq -r '.errors[0].message // \"unknown\"' <<< "${response}")" >&2 return 1 fi

Refs:

bootstrap-preview-auth.sh:30 — caller that would benefit

upsert-vercel-preview-env.sh:41 — caller that would benefit

claude · 2026-03-30T20:29:43Z

.github/scripts/preview/common.sh

+    | .node
+    | select(.id == $service_ref or .name == $service_ref or .name == ("@inkeep/" + $service_ref))
+    | .id
+  ' <<< "${response}" | head -n 1


🟠 MAJOR: GraphQL errors silently return empty string

Issue: This function captures the GraphQL response but does not check for .errors. If the query fails (invalid project ID, permission denied), the jq filter returns an empty string and the function returns successfully.

Why: Callers in provision-railway.sh validate for empty, but bootstrap-preview-auth.sh and upsert-vercel-preview-env.sh don't validate the result before passing to railway_variables_json. An empty service ID leads to downstream failures with misleading error messages.

Fix: Add error checking:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then echo "GraphQL error querying services: $(jq -r '.errors[0].message // \"unknown\"' <<< "${response}")" >&2 return 1 fi

claude · 2026-03-30T20:29:49Z

.github/workflows/preview-environments.yml

 concurrency:
  group: preview-environments-${{ github.event.pull_request.number || inputs.pr_number || github.ref }}
-  cancel-in-progress: false
+  cancel-in-progress: true


🟡 Minor: Workflow-level cancel-in-progress may conflict with job-level serialization

Issue: The top-level concurrency group uses cancel-in-progress: true, which cancels in-progress runs when a new commit is pushed. However, the provision-tier1 and teardown-tier1 jobs use cancel-in-progress: false to serialize Railway mutations.

Why: Workflow-level cancellation can abort Railway jobs mid-operation before job-level serialization kicks in, potentially leaving environments in an inconsistent state. The PR description mentions "same-PR cancellation and cross-PR serialization" but this interaction may not work as intended.

Fix: Consider either:

Set workflow-level cancel-in-progress: false and rely solely on job-level concurrency

Document that preview environments may be left orphaned and rely on the stale-preview janitor mentioned in the PR notes

claude

PR Review Summary

(0) New Issues | Risk: Low

Delta Review (since last review)

The new commit 434979032 ("ci: surface Railway GraphQL helper errors") directly addresses the Major findings from the previous review:

Prior Finding	Status
🟠 GraphQL helper functions silently swallow errors	✅ Fixed — Added `railway_require_graphql_success()` helper and integrated error checks into `railway_project_service_id`, `railway_variables_json`, `railway_environment_id`
🟠 TCP proxy null-iteration crash (`tcpProxies: null`)	✅ Fixed — Lines 360-361 now use `.data.tcpProxies // [] \| .[]` pattern
🟡 `Retry-After` fractional value parsing	✅ Fixed — Regex updated to `^[0-9]+([.][0-9]+)?$`

New error handling design is clean:

Shared helpers (railway_graphql_has_errors, railway_graphql_first_error_message, railway_require_graphql_success) provide consistent error checking across all GraphQL helpers
TCP proxy loop now properly retries on GraphQL errors and surfaces the error message on final failure
Error messages include context (e.g., "GraphQL error querying Railway services") for easier debugging

🕐 Pending Recommendations (3)

These items from prior reviews are on unchanged code and remain addressable:

🟡 preview-environments.yml:56 — Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
🟡 provision-railway.sh:108 — refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern
💭 teardown-railway.sh:46 — Fixed sleep 2 while other polling loops use sleep_with_jitter

✅ APPROVE

Summary: The delta cleanly addresses all Major findings from the previous review. The new error handling is well-designed with shared helpers that surface GraphQL errors clearly. The remaining items are Minor/Consider-level issues on unchanged code that can be addressed in a follow-up if desired. Good work hardening this CI infrastructure! 🎉

Reviewers (1)

Reviewer	Returned	Main Findings	Consider	While You're Here	Inline Comments	Pending Recs	Discarded
`orchestrator`	0	0	0	0	0	3	0
Total	0	0	0	0	0	3	0

Note: Delta-only review — no sub-reviewers dispatched since the single changed file was already reviewed in the prior run.

claude

PR Review Summary

(0) New Issues | Risk: Low

Delta Review (since last review)

The new commit 53598e8fe ("ci: add jitter to preview teardown polling") addresses a previously raised "Consider" item:

Prior Finding	Status
💭 `teardown-railway.sh:46` — Fixed `sleep 2` while other polling loops use `sleep_with_jitter`	✅ Fixed — Now uses `sleep_with_jitter 2` for consistency

Change is correct and minimal: The teardown post-delete polling loop now uses jittered sleep like other polling loops in the codebase (railway_wait_for_environment_id, railway_ensure_tcp_proxy, etc.). This prevents thundering herd scenarios if multiple teardown jobs poll Railway simultaneously.

🕐 Pending Recommendations (2)

These items from prior reviews are on unchanged code and remain addressable:

🟡 preview-environments.yml:56 — Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
🟡 provision-railway.sh:108 — refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern

✅ APPROVE

Summary: The delta cleanly addresses the remaining "Consider" item from the previous review. All Major findings from earlier reviews have been addressed. The PR is ready to merge — solid consolidation of Railway automation onto a single GraphQL control plane with proper error handling, retry logic, and consistent polling patterns. 🚀

Reviewers (1)

Reviewer	Returned	Main Findings	Consider	While You're Here	Inline Comments	Pending Recs	Discarded
`orchestrator`	0	0	0	0	0	2	0
Total	0	0	0	0	0	2	0

Note: Delta-only review — no sub-reviewers dispatched since the single changed line addresses a prior "Consider" item.

pullfrog · 2026-03-30T20:44:45Z

Addressed review feedback on PR #2900.

Thread: .github/workflows/preview-environments.yml:56 — cancel-in-progress: true trade-off

The author confirmed this is intentional: self-healing provisioning (idempotent env creation, deterministic var repair, retried proxy setup, tolerant teardown) makes mid-flight cancellation safe. Avoiding stale runs reduces Railway control-plane pressure.
Acknowledged and resolved the thread. No code changes needed.

^{｜ View workflow run ｜ Triggered by Pullfrog ｜ Using Claude Opus ｜ 𝕏}

vnv-varun · 2026-03-30T20:45:40Z

A note on the main design choice in this PR.

The important change here is not just more retries. It is moving preview Railway automation onto a single GraphQL control plane and treating same-PR cancellation as an intentional tradeoff.

Why we made that tradeoff:

older runs on the same PR were continuing to hit Railway even after a newer push made them obsolete
that stale work was part of the control-plane pressure problem we were seeing
leaving workflow-level cancel-in-progress: false would keep wasting Railway capacity on runs we no longer care about

Why cancel-in-progress: true is acceptable here:

preview provisioning is now designed to self-heal on rerun
environment creation is idempotent
runtime vars are repaired deterministically from the template/override source of truth
TCP proxy setup is retried until it converges
teardown now tolerates the env disappearing between lookup and delete

So the expected failure mode of a mid-flight cancellation is partial preview state, but the next run for that PR should reconcile it instead of compounding drift. That is the right tradeoff for keeping Railway pressure under control.

One thing this PR does not solve is stale preview cleanup for already-merged or already-closed PRs. There are still a few old pr-* Railway envs hanging around today. That is not because of this cancellation choice. It is because we still do not have a scheduled stale-preview janitor, which should be follow-up work.

itoqa · 2026-03-30T21:03:18Z

Ito Test Report ✅

14 test cases ran. 14 passed.

The unified run passed all 14 of 14 test cases with zero failures, confirming non-production smoke coverage across API and UI: API health stabilized immediately at HTTP 204 (empty body), unauthenticated UI root correctly landed on /login, valid admin sign-in consistently reached /default/projects, and authenticated default-tenant projects API access returned HTTP 200 with expected JSON. Key security and robustness checks also succeeded, including stable session behavior under rapid submits/refresh/back-forward navigation, correct mobile (iPhone 12) post-login rendering without horizontal overflow, denial of unauthorized or cross-tenant data access (401/403), resistance to invalid-credential bursts and injection-like inputs, and blocking of external returnUrl open-redirect attempts while preserving safe internal redirects.

✅ Passed (14)

Category	Summary	Screenshot
Adversarial	Five rapid invalid password submissions all failed with visible error and did not authenticate the user; UI stayed on login flow.
Adversarial	Injection-like payloads in email/password did not execute scripts or corrupt the page, and authentication was denied with safe error output.
Adversarial	Direct unauthenticated API access returned 401 Unauthorized and exposed no project data.
Adversarial	Authenticated cross-tenant API probe to not-default tenant returned 403, confirming tenant boundary enforcement.
Adversarial	External returnUrl payload did not redirect off-origin; browser remained on local UI origin and ended on /default/projects.
Edge	Rapid submit sequence stabilized to a single authenticated state; after reload the session persisted on the protected projects page with no auth bounce loop.
Edge	After refreshing during the login transition, the flow recovered and ended in a stable authenticated state without transient lockups.
Edge	Repeated history navigation did not prevent sign-in; fields/buttons stayed functional and authentication completed to the protected projects page.
Edge	iPhone-12 viewport login succeeded and the initial projects screen rendered without horizontal overflow while primary actions stayed reachable.
Edge	Login initiated from /login?returnUrl=%2F completed successfully and redirected to in-app /default/projects on localhost origin.
Happy-path	API health check reached HTTP 204 on the first polling attempt with an empty response body.
Happy-path	UI root is reachable and correctly routes to `/login` for unauthenticated state; earlier `/default/projects` observation aligns with existing local auth/dev session behavior.
Happy-path	Login via email+password succeeded and redirected to authenticated manage UI at /default/projects with project chrome visible and no auth error banner.
Happy-path	Credential bootstrap returned HTTP 200 with session cookie, and authenticated tenant projects endpoint returned HTTP 200 JSON payload in authorized default-tenant scope.

Commit: 53598e8

View Full Run

Tell us how we did: Give Ito Feedback

* ci: sync preview runtime vars from template overrides * ci: retry flaky Railway preview operations * ci: throttle Railway preview mutations * ci: reduce Railway preview polling * ci: move preview Railway automation to GraphQL * ci: surface Railway GraphQL helper errors * ci: add jitter to preview teardown polling

ci: sync preview runtime vars from template overrides

a4bfd97

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 19:09 View deployment

vercel bot deployed to Preview – agents-docs March 30, 2026 19:09 View deployment

vercel bot deployed to Preview – agents-api March 30, 2026 19:10 View deployment

ci: retry flaky Railway preview operations

1435e33

vnv-varun changed the title ~~ci: resync stale preview runtime vars~~ ci: harden Railway preview provisioning Mar 30, 2026

ci: throttle Railway preview mutations

f78aae6

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 19:25 View deployment

vercel bot deployed to Preview – agents-docs March 30, 2026 19:26 View deployment

vercel bot deployed to Preview – agents-api March 30, 2026 19:26 View deployment

ci: reduce Railway preview polling

20d65b5

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 19:53 View deployment

vercel bot deployed to Preview – agents-docs March 30, 2026 19:53 View deployment

vercel bot deployed to Preview – agents-api March 30, 2026 19:54 View deployment

ci: move preview Railway automation to GraphQL

3649b0f

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 20:11 View deployment

vercel bot deployed to Preview – agents-docs March 30, 2026 20:11 View deployment

vercel bot deployed to Preview – agents-api March 30, 2026 20:13 View deployment

vnv-varun marked this pull request as ready for review March 30, 2026 20:25

pullfrog bot reviewed Mar 30, 2026

View reviewed changes

claude bot reviewed Mar 30, 2026

View reviewed changes

github-actions bot deleted a comment from claude bot Mar 30, 2026

ci: surface Railway GraphQL helper errors

4349790

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 20:34 View deployment

vercel bot deployed to Preview – agents-docs March 30, 2026 20:35 View deployment

claude bot reviewed Mar 30, 2026

View reviewed changes

ci: add jitter to preview teardown polling

53598e8

github-actions bot deleted a comment from claude bot Mar 30, 2026

vercel bot deployed to Preview – agents-manage-ui March 30, 2026 20:37 View deployment

claude bot reviewed Mar 30, 2026

View reviewed changes

vercel bot deployed to Preview – agents-docs March 30, 2026 20:38 View deployment

github-actions bot deleted a comment from claude bot Mar 30, 2026

vercel bot deployed to Preview – agents-api March 30, 2026 20:39 View deployment

amikofalvy added this pull request to the merge queue Mar 30, 2026

Merged via the queue into main with commit 21311f8 Mar 30, 2026
27 of 28 checks passed

amikofalvy deleted the varun/fix-preview-runtime-var-sync branch March 30, 2026 21:31

claude bot mentioned this pull request Mar 30, 2026

Support nested files and folders for Skills #2719

Merged

vnv-varun mentioned this pull request Mar 31, 2026

ci: add preview janitor and recreate control #2930

Open

8 tasks

	count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] \| select(.applicationPort == $application_port)] \| length' <<< "${response}")"
	count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] \| .[] \| select(.applicationPort == $application_port)] \| length' <<< "${response}")"

		count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] \| select(.applicationPort == $application_port)] \| length' <<< "${response}")"
		active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] \| select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] \| length' <<< "${response}")"

Conversation

vnv-varun commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Changes

Why

Test Plan

Notes

Uh oh!

changeset-bot bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

⚠️ No Changeset found

Uh oh!

vercel bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

pullfrog bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Key changes

Unified GraphQL control plane

Structured GraphQL error handling

CLI removal and secret migration

Batch runtime variable resolution

Cross-PR serialization and same-PR cancellation

Idempotent teardown

Uh oh!

pullfrog bot left a comment

Choose a reason for hiding this comment

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

pullfrog bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

PR Review Summary

🟠⚠️ Major (2) 🟠⚠️

🟡 Minor (2) 🟡

💭 Consider (1) 💭

💡 APPROVE WITH SUGGESTIONS

Uh oh!

claude bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot Mar 30, 2026

Choose a reason for hiding this comment

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

PR Review Summary

Delta Review (since last review)

🕐 Pending Recommendations (3)

✅ APPROVE

Uh oh!

claude bot left a comment

Choose a reason for hiding this comment

PR Review Summary

Delta Review (since last review)

🕐 Pending Recommendations (2)

✅ APPROVE

Uh oh!

pullfrog bot commented Mar 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

vnv-varun commented Mar 30, 2026 •

edited

Loading

changeset-bot bot commented Mar 30, 2026 •

edited

Loading

vercel bot commented Mar 30, 2026 •

edited

Loading

pullfrog bot commented Mar 30, 2026 •

edited

Loading

pullfrog bot commented Mar 30, 2026 •

edited

Loading