ci: harden Railway preview provisioning#2900
TL;DR: Replaces all Railway CLI usage in preview environment workflows with direct GraphQL API calls, eliminating the split control plane (CLI + GraphQL) that caused two auth behaviors, two retry models, and workspace-token incompatibility. All GraphQL helpers now surface application-level errors (not just HTTP-level) through a shared
Key changes
Summary | 6 files | 7 commits | base:
Unified GraphQL control plane
The old
Structured GraphQL error handling
Every query/mutation helper
CLI removal and secret migration
This is the change that makes the workflow compatible with the workspace-token API path: the CLI required the broader account token, while the GraphQL endpoint works with either.
Batch runtime variable resolution
The
Cross-PR serialization and same-PR cancellation
Idempotent teardown
Solid consolidation — one API surface, one retry model, one token path. Two issues worth addressing before merging: a jq null-iteration crash in the TCP proxy retry loop, and a minor Retry-After header parsing gap for fractional values. Everything else looks correct.
Claude Opus | 𝕏
.github/scripts/preview/common.sh
Outdated
| }' "$(jq -nc --arg environment_id "${env_id}" --arg service_id "${service_id}" '{environmentId: $environment_id, serviceId: $service_id}')" | ||
| )" | ||
|
|
||
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" |
If the Railway API returns null for tcpProxies (e.g. the service instance isn't provisioned yet in a freshly created environment), iterating .data.tcpProxies[] will crash under set -e instead of retrying.
Use the // [] fallback to coerce null into an empty array:
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" | |
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" |
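To see the effect of the fallback in isolation, here is a minimal sketch. The function name `count_tcp_proxies` and the sample payloads are illustrative assumptions, not code from this PR; only the jq filter mirrors the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the PR): count proxies bound to a port,
# treating tcpProxies: null the same as an empty list.
count_tcp_proxies() {
  local response="$1" application_port="$2"
  printf '%s' "${response}" | jq -r --argjson application_port "${application_port}" \
    '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length'
}
```

Because `//` binds tighter than `|`, the filter reads as `(.data.tcpProxies // []) | .[]`, so a null field yields zero matches and the caller's retry loop gets a count it can act on.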
.github/scripts/preview/common.sh
Outdated
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" | ||
| active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")" |
Same null crash risk on the active line:
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" | |
| active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")" | |
| count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")" | |
| active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")" |
.github/scripts/preview/common.sh
Outdated
|
|
||
| if [ "${attempt}" -lt "${max_attempts}" ]; then | ||
| sleep_with_jitter "${sleep_seconds}" | ||
| if [ -n "${retry_after}" ] && [[ "${retry_after}" =~ ^[0-9]+$ ]]; then |
Minor: Retry-After can be a decimal (e.g. 1.5). The ^[0-9]+$ regex rejects it, falling through to exponential backoff even though the server gave an explicit delay. Not a correctness bug — backoff is a safe fallback — but worth noting for a future pass if Railway starts sending fractional values.
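A minimal sketch of accepting both integer and fractional delay values, assuming the header value has already been extracted into a variable. The function name `pick_retry_delay` and its fallback argument are hypothetical, and the check uses `grep -E` instead of `[[ =~ ]]` purely to keep the sketch portable.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prefer a numeric Retry-After value, otherwise use the
# caller's computed backoff.
pick_retry_delay() {
  local retry_after="$1" fallback_seconds="$2"
  # Accept "3" or "1.5"; anything else (empty, HTTP-date form) falls back.
  if printf '%s\n' "${retry_after}" | grep -Eq '^[0-9]+([.][0-9]+)?$'; then
    printf '%s\n' "${retry_after}"
  else
    printf '%s\n' "${fallback_seconds}"
  fi
}
```

Passing a fractional result to `sleep` works with GNU coreutils, but POSIX only guarantees integer operands, so a runner with a different `sleep` may need the value rounded up first.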
| -w '%{http_code}' \ | ||
| -H "Content-Type: application/json" \ | ||
| -H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \ | ||
| -H "User-Agent: inkeep-preview-ci" \ |
Good change from User-Agent: Mozilla/5.0 plus the spoofed Origin/Referer headers to an honest User-Agent: inkeep-preview-ci. Much better for traceability.
| @@ -52,22 +54,18 @@ else | |||
| echo "Railway environment ${RAILWAY_ENV_NAME} already exists" | |||
| fi | |||
|
|
|||
| RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 20 4)" | |||
This calls railway_env_exists_count (line 43) then railway_wait_for_environment_id (line 57) — two separate GraphQL round-trips that both resolve the environment list. Not a bug, and the second one does polling, but if you wanted to trim one API call you could extract the env ID directly from railway_environment_id at line 43 and skip railway_env_exists_count. Low priority — just noting the redundancy.
| if ! RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 10 2)"; then | ||
| if [ "$(railway_env_exists_count "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}")" = "0" ]; then | ||
| echo "Railway environment ${RAILWAY_ENV_NAME} disappeared before teardown; nothing to do." | ||
| exit 0 | ||
| fi | ||
| echo "Failed to resolve Railway environment ID for ${RAILWAY_ENV_NAME} during teardown." >&2 | ||
| exit 1 | ||
| fi |
The race-condition handling here is well done — re-checking railway_env_exists_count after railway_wait_for_environment_id fails accounts for the environment disappearing between the initial existence check and the ID lookup.
PR Review Summary
(4) Total Issues | Risk: Medium
🟠⚠️ Major (2) 🟠⚠️
🟠 1) common.sh GraphQL helper functions silently swallow errors
files:
- common.sh:193-202 (railway_project_service_id)
- common.sh:220-224 (railway_variables_json)
- common.sh:246-267 (railway_environment_id)
Issue: Multiple GraphQL helper functions extract data from responses without checking for the .errors field. When a GraphQL query fails (invalid IDs, permission denied, server error), the response contains errors but these functions return empty strings/objects instead of failing.
Why: GraphQL returns HTTP 200 with an .errors array for application-level failures. The current pattern of extracting .data.xxx // empty masks these errors completely. This leads to cascading failures with misleading error messages. For example:
- An auth error returns empty service ID → empty variables JSON → empty database URLs
- The eventual failure is "missing RUN_DB_URL" when the root cause was "invalid Railway token"
Fix: Add error checking to each GraphQL helper function:
# GraphQL reports application-level failures in .errors, even on HTTP 200
if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi
Consider creating a shared railway_check_graphql_response() helper to DRY this up across all functions.
Refs:
Inline Comments:
- 🟠 Major: common.sh:202 GraphQL errors silently return empty string
- 🟠 Major: common.sh:224 GraphQL errors silently return empty object
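One way the shared helper proposed in this finding could look. This is a sketch only: the body, the optional `context` argument, and the handling of an empty `.errors` array are assumptions; the name is simply the one suggested in the review.

```shell
#!/usr/bin/env bash
# Sketch of the proposed helper: returns non-zero when the response carries
# GraphQL errors, and prints the first message with some caller context.
railway_check_graphql_response() {
  local response="$1" context="${2:-Railway GraphQL request}"
  # A missing, null, or empty .errors array counts as success.
  if printf '%s' "${response}" | jq -e '(.errors // []) | length > 0' >/dev/null 2>&1; then
    local message
    message="$(printf '%s' "${response}" | jq -r '.errors[0].message // "unknown"')"
    echo "GraphQL error (${context}): ${message}" >&2
    return 1
  fi
}
```

A helper such as `railway_variables_json` could call this right after the request and before any `.data` extraction. Note that this sketch does not detect a response body that is not valid JSON; that case would need a separate check.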
🟡 Minor (2) 🟡
🟡 1) preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job serialization
Issue: The workflow uses cancel-in-progress: true at workflow level while Railway-mutating jobs use cancel-in-progress: false at job level. This creates a race where workflow cancellation can abort Railway jobs mid-operation.
Why: Could leave preview environments in inconsistent state.
Fix: Consider setting workflow-level cancel-in-progress: false or documenting the expected behavior.
Refs: preview-environments.yml:171-173
Inline Comments:
- 🟡 Minor: preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job-level serialization
- 🟡 Minor: provision-railway.sh:183 Dead code - refresh_service_env_dump is unused here
💭 Consider (1) 💭
💭 1) teardown-railway.sh:46 Use jitter for consistency
Issue: Fixed sleep 2 while other polling loops use sleep_with_jitter.
Why: Minor inconsistency; low impact due to job-level serialization.
Fix: Replace with sleep_with_jitter 2.
Inline Comments:
- 💭 Consider: teardown-railway.sh:46 Use jitter for consistency
💡 APPROVE WITH SUGGESTIONS
Summary: This is a solid refactor that consolidates Railway automation onto a single GraphQL control plane. The architecture is cleaner and the rate-limit handling is well-designed. The main gap is GraphQL error handling — the helper functions silently swallow API errors, which could make debugging Railway issues difficult. Adding explicit error checks would make this more robust for production CI.
Discarded (6)
| Location | Issue | Reason Discarded |
|---|---|---|
| common.sh:107-152 | Temp files not cleaned up via trap | Minor hygiene; CI runners are ephemeral, cleanup happens on happy paths |
| common.sh:144-151 | GraphQL errors may log sensitive data | Railway API responses don't typically contain secrets; existing redact_preview_logs covers other outputs |
| preview-environments.yml:191 | Token fallback may persist indefinitely | Valid migration pattern; comment documents intent; low priority to track cleanup |
| common.sh:168 | railway_environment_create_from_source doesn't check errors | Caller handles failure via fallback check on line 48 |
| common.sh:243 | railway_variable_collection_upsert doesn't check errors | Lower severity; provision script validates interpolation result downstream |
| common.sh:65-75 | railway_env_exists_count conflates errors with non-existence | Low confidence; defensive checks exist in callers |
Reviewers (2)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-devops | 6 | 1 | 1 | 0 | 2 | 0 | 3 |
| pr-review-errors | 8 | 1 | 0 | 0 | 2 | 0 | 3 |
| Total | 14 | 2 | 1 | 0 | 4 | 0 | 6 |
| )" | ||
|
|
||
| jq -r --arg env_name "${env_name}" '.data.environments.edges[] | select(.node.name == $env_name) | .node.id' <<< "${response}" | ||
| jq -c '.data.variables // {}' <<< "${response}" |
🟠 MAJOR: GraphQL errors silently return empty object
Issue: When the GraphQL query fails (invalid environment ID, permission error, or server error), the response contains an .errors field. This function extracts .data.variables // {}, which returns an empty object {} on error, masking the failure.
Why: A GraphQL error results in an empty object being returned. Downstream callers extract variables from this empty object, producing empty strings for database URLs without any error indication. These propagate to later steps, causing hard-to-diagnose failures.
Fix: Check for GraphQL errors before extracting data:
if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error fetching variables: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi
Refs:
- bootstrap-preview-auth.sh:30 — caller that would benefit
- upsert-vercel-preview-env.sh:41 — caller that would benefit
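How a caller benefits from the helper returning non-zero can be sketched as follows. Both function names here are hypothetical stand-ins (the real callers are not shown in this thread), and `fetch_variables` only simulates a helper that fails on a GraphQL error.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a helper that returns 1 on a GraphQL error.
fetch_variables() {
  local simulate_failure="$1"
  if [ "${simulate_failure}" = "yes" ]; then
    echo "GraphQL error fetching variables: Not Authorized" >&2
    return 1
  fi
  echo '{"RUN_DB_URL":"postgres://example"}'
}

# Hypothetical caller: stop at the root cause instead of passing {} downstream.
load_variables() {
  local vars
  # Declare first, assign inside the condition: `local vars="$(...)"` would
  # report the status of `local`, hiding the helper's failure.
  if ! vars="$(fetch_variables "$1")"; then
    echo "Aborting: could not load Railway variables" >&2
    return 1
  fi
  printf '%s\n' "${vars}"
}
```

With this shape, an auth failure stops at the variable fetch with the GraphQL message still in the log, rather than surfacing later as a missing database URL.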
| | .node | ||
| | select(.id == $service_ref or .name == $service_ref or .name == ("@inkeep/" + $service_ref)) | ||
| | .id | ||
| ' <<< "${response}" | head -n 1 |
🟠 MAJOR: GraphQL errors silently return empty string
Issue: This function captures the GraphQL response but does not check for .errors. If the query fails (invalid project ID, permission denied), the jq filter returns an empty string and the function returns successfully.
Why: Callers in provision-railway.sh validate for empty, but bootstrap-preview-auth.sh and upsert-vercel-preview-env.sh don't validate the result before passing to railway_variables_json. An empty service ID leads to downstream failures with misleading error messages.
Fix: Add error checking:
if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error querying services: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi
| concurrency: | ||
| group: preview-environments-${{ github.event.pull_request.number || inputs.pr_number || github.ref }} | ||
| cancel-in-progress: false | ||
| cancel-in-progress: true |
🟡 Minor: Workflow-level cancel-in-progress may conflict with job-level serialization
Issue: The top-level concurrency group uses cancel-in-progress: true, which cancels in-progress runs when a new commit is pushed. However, the provision-tier1 and teardown-tier1 jobs use cancel-in-progress: false to serialize Railway mutations.
Why: Workflow-level cancellation can abort Railway jobs mid-operation before job-level serialization kicks in, potentially leaving environments in an inconsistent state. The PR description mentions "same-PR cancellation and cross-PR serialization" but this interaction may not work as intended.
Fix: Consider either:
- Set workflow-level cancel-in-progress: false and rely solely on job-level concurrency
- Document that preview environments may be left orphaned and rely on the stale-preview janitor mentioned in the PR notes
PR Review Summary
(0) New Issues | Risk: Low
Delta Review (since last review)
The new commit 434979032 ("ci: surface Railway GraphQL helper errors") directly addresses the Major findings from the previous review:
| Prior Finding | Status |
|---|---|
| 🟠 GraphQL helper functions silently swallow errors | ✅ Fixed: Added railway_require_graphql_success() helper and integrated error checks into railway_project_service_id, railway_variables_json, railway_environment_id |
| 🟠 TCP proxy null-iteration crash (tcpProxies: null) | ✅ Fixed: Lines 360-361 now use the .data.tcpProxies // [] \| .[] pattern |
| 🟡 Retry-After fractional value parsing | ✅ Fixed: Regex updated to ^[0-9]+([.][0-9]+)?$ |
New error handling design is clean:
- Shared helpers (railway_graphql_has_errors, railway_graphql_first_error_message, railway_require_graphql_success) provide consistent error checking across all GraphQL helpers
- TCP proxy loop now properly retries on GraphQL errors and surfaces the error message on final failure
- Error messages include context (e.g., "GraphQL error querying Railway services") for easier debugging
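The "retry, then surface the error on final failure" behavior described in the list above can be sketched generically. The function name `retry_keep_last_error` and its arguments are hypothetical; the PR's actual loop lives in the TCP proxy helper and is not reproduced here.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: run a probe up to max_attempts times, remember the
# last error text, and print it only if every attempt fails.
retry_keep_last_error() {
  local max_attempts="$1" delay_seconds="$2"
  shift 2
  local attempt=1 last_error=""
  while [ "${attempt}" -le "${max_attempts}" ]; do
    # Capture the probe's stderr only; its stdout is discarded.
    if last_error="$("$@" 2>&1 >/dev/null)"; then
      return 0
    fi
    # No pause after the final attempt.
    if [ "${attempt}" -lt "${max_attempts}" ]; then
      sleep "${delay_seconds}"
    fi
    attempt=$(( attempt + 1 ))
  done
  echo "Giving up after ${max_attempts} attempts: ${last_error}" >&2
  return 1
}
```

The probe runs inside a command substitution, so it cannot set variables in the caller; it should report readiness through its exit status alone.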
🕐 Pending Recommendations (3)
These items from prior reviews are on unchanged code and remain addressable:
- 🟡 preview-environments.yml:56 Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
- 🟡 provision-railway.sh:108 refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern
- 💭 teardown-railway.sh:46 Fixed sleep 2 while other polling loops use sleep_with_jitter
✅ APPROVE
Summary: The delta cleanly addresses all Major findings from the previous review. The new error handling is well-designed with shared helpers that surface GraphQL errors clearly. The remaining items are Minor/Consider-level issues on unchanged code that can be addressed in a follow-up if desired. Good work hardening this CI infrastructure! 🎉
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| orchestrator | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 3 | 0 |
Note: Delta-only review — no sub-reviewers dispatched since the single changed file was already reviewed in the prior run.
PR Review Summary
(0) New Issues | Risk: Low
Delta Review (since last review)
The new commit 53598e8fe ("ci: add jitter to preview teardown polling") addresses a previously raised "Consider" item:
| Prior Finding | Status |
|---|---|
| 💭 teardown-railway.sh:46: Fixed sleep 2 while other polling loops use sleep_with_jitter | ✅ Fixed: Now uses sleep_with_jitter 2 for consistency |
Change is correct and minimal: The teardown post-delete polling loop now uses jittered sleep like other polling loops in the codebase (railway_wait_for_environment_id, railway_ensure_tcp_proxy, etc.). This prevents thundering herd scenarios if multiple teardown jobs poll Railway simultaneously.
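For readers unfamiliar with the pattern, a jittered sleep can be sketched like this. The repository's own `sleep_with_jitter` is not shown in this thread, so both functions below are hypothetical stand-ins, and the one-second default jitter window is an assumption.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: compute base_seconds plus a random 0..max_jitter_ms
# offset, printed as seconds with millisecond precision.
jittered_delay() {
  local base_seconds="$1" max_jitter_ms="${2:-1000}"
  local raw jitter_ms
  # Two random bytes read as one unsigned integer (0..65535).
  raw="$(od -An -N2 -tu2 /dev/urandom | tr -d ' \n')"
  jitter_ms=$(( raw % (max_jitter_ms + 1) ))
  printf '%d.%03d\n' "$(( base_seconds + jitter_ms / 1000 ))" "$(( jitter_ms % 1000 ))"
}

# Sleep for the jittered duration (fractional sleep assumes GNU coreutils).
sleep_with_jitter_sketch() {
  sleep "$(jittered_delay "$1")"
}
```

Two jobs that both call `sleep_with_jitter_sketch 2` will usually wake at slightly different times, which spreads their next poll across the jitter window instead of hitting the API in the same instant.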
🕐 Pending Recommendations (2)
These items from prior reviews are on unchanged code and remain addressable:
- 🟡 preview-environments.yml:56 Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
- 🟡 provision-railway.sh:108 refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern
✅ APPROVE
Summary: The delta cleanly addresses the remaining "Consider" item from the previous review. All Major findings from earlier reviews have been addressed. The PR is ready to merge — solid consolidation of Railway automation onto a single GraphQL control plane with proper error handling, retry logic, and consistent polling patterns. 🚀
Reviewers (1)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| orchestrator | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
| Total | 0 | 0 | 0 | 0 | 0 | 2 | 0 |
Note: Delta-only review — no sub-reviewers dispatched since the single changed line addresses a prior "Consider" item.
Addressed review feedback on PR #2900. Thread:
A note on the main design choice in this PR. The important change here is not just more retries. It is moving preview Railway automation onto a single GraphQL control plane and treating same-PR cancellation as an intentional tradeoff.
Why we made that tradeoff:
Why
So the expected failure mode of a mid-flight cancellation is partial preview state, but the next run for that PR should reconcile it instead of compounding drift. That is the right tradeoff for keeping Railway pressure under control.
One thing this PR does not solve is stale preview cleanup for already-merged or already-closed PRs. There are still a few old
Ito Test Report ✅
14 test cases ran. 14 passed.
The unified run passed all 14 of 14 test cases with zero failures, confirming non-production smoke coverage across API and UI: API health stabilized immediately at HTTP 204 (empty body), unauthenticated UI root correctly landed on /login, valid admin sign-in consistently reached /default/projects, and authenticated default-tenant projects API access returned HTTP 200 with expected JSON.
Key security and robustness checks also succeeded, including stable session behavior under rapid submits/refresh/back-forward navigation, correct mobile (iPhone 12) post-login rendering without horizontal overflow, denial of unauthorized or cross-tenant data access (401/403), resistance to invalid-credential bursts and injection-like inputs, and blocking of external returnUrl open-redirect attempts while preserving safe internal redirects.
✅ Passed (14)
Commit:
* ci: sync preview runtime vars from template overrides
* ci: retry flaky Railway preview operations
* ci: throttle Railway preview mutations
* ci: reduce Railway preview polling
* ci: move preview Railway automation to GraphQL
* ci: surface Railway GraphQL helper errors
* ci: add jitter to preview teardown polling















Summary
Make preview Railway automation use a single GraphQL control plane and reduce unnecessary Railway API churn. This removes the Railway CLI dependency from preview provisioning, bootstrap, env injection, and teardown, makes runtime-var repair deterministic, and makes the workflow compatible with the workspace-token API path we validated live.
Changes
- RAILWAY_API_TOKEN the primary CI secret path, with fallback to RAILWAY_TOKEN during migration
- Retry-After handling for Railway API calls
- tcpProxies: null as an empty list during proxy readiness polling
Why
The preview workflow was doing too much unnecessary Railway control-plane work and splitting that work across two different integration surfaces:
That made auth behavior inconsistent, made rate-limit behavior harder to reason about, and forced preview CI to keep using the broader account token because the workspace token did not work with the CLI path.
This PR fixes that by moving preview Railway automation onto one API surface with one retry model and one token model.
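The "one token model" with a migration fallback can be sketched in shell as follows. Where the fallback is actually resolved in this PR (workflow expression or script) is not shown here, so the function name `resolve_railway_token` and its placement are assumptions; only the two secret names come from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: prefer RAILWAY_API_TOKEN, fall back to RAILWAY_TOKEN
# during migration, and fail loudly when neither is set.
resolve_railway_token() {
  local token="${RAILWAY_API_TOKEN:-${RAILWAY_TOKEN:-}}"
  if [ -z "${token}" ]; then
    echo "Neither RAILWAY_API_TOKEN nor RAILWAY_TOKEN is set" >&2
    return 1
  fi
  printf '%s\n' "${token}"
}
```

Using `:-` rather than `-` means an empty `RAILWAY_API_TOKEN` (for example, an unset secret that expands to an empty string) also falls through to the legacy variable.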
Test Plan
- bash -n .github/scripts/preview/*.sh
- git diff --check
- preview-base environment ID
- doltgres service ID
- preview-base
Notes