
ci: harden Railway preview provisioning#2900

Merged
amikofalvy merged 7 commits into main from varun/fix-preview-runtime-var-sync
Mar 30, 2026

Conversation

@vnv-varun
Contributor

@vnv-varun vnv-varun commented Mar 30, 2026

Summary

Make preview Railway automation use a single GraphQL control plane and reduce unnecessary Railway API churn. This removes the Railway CLI dependency from preview provisioning, bootstrap, env injection, and teardown, makes runtime-var repair deterministic, and makes the workflow compatible with the workspace-token API path we validated live.

Changes

  • replace preview Railway CLI operations with GraphQL helpers for:
    • environment lookup
    • environment create and delete
    • variable reads and writes
    • TCP proxy creation and readiness checks
  • remove Railway CLI installation from the preview workflow
  • make RAILWAY_API_TOKEN the primary CI secret path, with fallback to RAILWAY_TOKEN during migration
  • keep explicit preview runtime-var templates authoritative so stale inherited values get rewritten
  • add rate-limit-aware retries and Retry-After handling for Railway API calls
  • reduce Railway polling by using narrower environment lookups and a single runtime-var interpolation loop
  • cancel superseded same-PR preview runs and serialize Railway-mutating jobs across PRs
  • surface GraphQL helper errors instead of silently returning empty values
  • treat tcpProxies: null as an empty list during proxy readiness polling
  • harden teardown so it exits cleanly if the preview env was already deleted between lookup and delete

Why

The preview workflow was doing too much unnecessary Railway control-plane work and splitting that work across two different integration surfaces:

  • Railway CLI for some operations
  • direct GraphQL for others

That made auth behavior inconsistent, made rate-limit behavior harder to reason about, and forced preview CI to keep using the broader account token because the workspace token did not work with the CLI path.

This PR fixes that by moving preview Railway automation onto one API surface with one retry model and one token model.

Test Plan

  • bash -n .github/scripts/preview/*.sh
  • workflow YAML parse
  • git diff --check
  • live smoke through the refactored shell helpers using the Railway workspace token:
    • resolve preview-base environment ID
    • resolve doltgres service ID
    • create a temporary preview env from preview-base
    • upsert a service variable through GraphQL
    • read that variable back through GraphQL
    • delete the temporary env through GraphQL
  • confirm the full preview workflow reruns cleanly on the latest PR head

Notes

  • This PR hardens teardown behavior, but it does not add a scheduled stale-preview janitor.
  • There are still stale Railway PR envs from already-merged PRs in the live project today. This PR does not retroactively delete them.

@changeset-bot

changeset-bot bot commented Mar 30, 2026

⚠️ No Changeset found

Latest commit: 53598e8

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types


@vercel

vercel bot commented Mar 30, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project | Status | Actions | Updated (UTC)
agents-api | Ready | Preview, Comment | Mar 30, 2026 8:39pm
agents-docs | Ready | Preview, Comment | Mar 30, 2026 8:39pm
agents-manage-ui | Ready | Preview, Comment | Mar 30, 2026 8:39pm


@vnv-varun vnv-varun changed the title from "ci: resync stale preview runtime vars" to "ci: harden Railway preview provisioning" on Mar 30, 2026
@pullfrog
Contributor

pullfrog bot commented Mar 30, 2026

TL;DR — Replaces all Railway CLI usage in preview environment workflows with direct GraphQL API calls, eliminating the split control plane (CLI + GraphQL) that caused two auth behaviors, two retry models, and workspace-token incompatibility. All GraphQL helpers now surface application-level errors (not just HTTP-level) through a shared railway_require_graphql_success gate, and the teardown polling loop uses jittered sleep to reduce thundering-herd pressure.

Key changes

  • Unified railway_graphql client with retry and rate-limit handling — Replaces the old fire-and-forget curl wrapper with a robust GraphQL client that handles HTTP status codes, Retry-After headers (including decimal values), and exponential backoff with jitter.
  • Structured GraphQL error surface via railway_require_graphql_success — Adds railway_graphql_has_errors, railway_graphql_first_error_message, and railway_require_graphql_success helpers to detect and report application-level GraphQL errors, wired into every query/mutation helper.
  • New GraphQL helper functions in common.sh — Adds railway_environment_create_from_source, railway_environment_delete_by_id, railway_project_service_id, railway_variables_json, and railway_variable_collection_upsert to replace CLI equivalents.
  • Remove Railway CLI from all workflow jobs — Drops four npm install -g @railway/cli@4.31.0 steps across provision, bootstrap, env-inject, and teardown jobs.
  • Migrate auth secret to RAILWAY_API_TOKEN — Uses secrets.RAILWAY_API_TOKEN as primary with fallback to secrets.RAILWAY_TOKEN during migration.
  • Batch runtime variable resolution in resolve_runtime_vars — Replaces three sequential extract_runtime_var calls with a single loop that fetches rendered variables once per attempt and checks all three keys atomically.
  • Cross-PR serialization for Railway-mutating jobs — Adds concurrency groups to provision and teardown jobs so Railway mutations serialize across PRs while same-PR runs cancel in-progress.
  • Idempotent teardown with race-condition guard — Handles the case where a preview environment disappears between existence check and delete by re-checking and exiting cleanly.
  • Null-safe jq filters and jittered polling — Uses // [] fallback so tcpProxies null responses don't crash the proxy check/create loop, and replaces fixed sleep 2 in teardown polling with sleep_with_jitter to avoid synchronized retries.

Summary | 6 files | 7 commits | base: main | compare: varun/fix-preview-runtime-var-sync


Unified GraphQL control plane

Before: Preview automation was split across the Railway CLI (railway link, railway variable list/set, railway environment new/delete) and a minimal railway_graphql curl wrapper that had no retries, used browser-spoofing headers, and offered no variable support.
After: A single railway_graphql function handles all Railway API communication with proper Authorization: Bearer headers, per-attempt HTTP status inspection, Retry-After respect (including decimal seconds), and exponential backoff with jitter for 429/5xx errors.

The old railway_graphql was a bare curl -fsS call with User-Agent: Mozilla/5.0 and Origin: https://railway.com headers — artifacts of working around an earlier API restriction. The new version uses a clean User-Agent: inkeep-preview-ci, captures response bodies and headers to temp files, and distinguishes retryable errors (429, 5xx, network failures) from permanent ones.

What retry strategy does the new client use?

On success (HTTP 200): return immediately. On Retry-After header: sleep for the server-specified duration with jitter. On 429/5xx/network error: exponential backoff via sleep_with_backoff_and_jitter (base × 2^attempt, capped at 30s). On other 4xx: fail immediately with the response body on stderr. Max 6 attempts by default, configurable via RAILWAY_GRAPHQL_MAX_ATTEMPTS.
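The branch logic above can be sketched as a small bash helper. This is an illustrative reconstruction, not the actual common.sh code: railway_retry_action and sleep_with_backoff_and_jitter mirror the PR's naming, but their bodies are assumptions based on this description.

```shell
# Classify an HTTP status into the retry decision described above.
# Illustrative sketch only; the real common.sh logic may differ.
railway_retry_action() {
  local status="$1"
  case "${status}" in
    200) echo "return" ;;   # success: hand the body back to the caller
    429) echo "backoff" ;;  # rate limited: exponential backoff
    5??) echo "backoff" ;;  # server error: retryable
    000) echo "backoff" ;;  # curl network failure (no HTTP status)
    *)   echo "fail" ;;     # other 4xx etc.: permanent, fail with body on stderr
  esac
}

# base * 2^attempt seconds, capped at 30s, plus sub-second jitter.
sleep_with_backoff_and_jitter() {
  local base="$1" attempt="$2"
  local delay=$(( base * (2 ** attempt) ))
  [ "${delay}" -gt 30 ] && delay=30
  sleep "${delay}.$(( RANDOM % 10 ))"
}
```

A driver loop would call railway_retry_action once per attempt, up to RAILWAY_GRAPHQL_MAX_ATTEMPTS (default 6), honoring any Retry-After header before falling back to the backoff sleep.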

common.sh


Structured GraphQL error handling

Before: GraphQL helpers silently passed through responses containing { "errors": [...] } — callers only saw HTTP-level failures from the transport layer.
After: Three new helpers (railway_graphql_has_errors, railway_graphql_first_error_message, railway_require_graphql_success) check for application-level GraphQL errors and surface the first error message with caller-provided context.

Every query/mutation helper — railway_project_service_id, railway_variables_json, railway_environment_id, and the TCP proxy flow in railway_ensure_tcp_proxy — now calls railway_require_graphql_success or uses railway_graphql_has_errors directly. The TCP proxy query loop retries on GraphQL errors rather than crashing, and the create-proxy path uses the shared error extraction instead of inline jq -e '.errors' checks.
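A minimal sketch of what those three helpers could look like, reconstructed from the behavior described in this review (the real common.sh implementations are not shown in this thread; jq is assumed available):

```shell
# Detect application-level GraphQL errors in a response body.
railway_graphql_has_errors() {
  jq -e '.errors | length > 0' >/dev/null 2>&1 <<< "$1"
}

# Pull out the first error message, with a fallback when none is present.
railway_graphql_first_error_message() {
  jq -r '.errors[0].message // "unknown GraphQL error"' <<< "$1"
}

# Gate: fail loudly with caller-provided context instead of returning empty data.
railway_require_graphql_success() {
  local context="$1" response="$2"
  if railway_graphql_has_errors "${response}"; then
    echo "${context}: $(railway_graphql_first_error_message "${response}")" >&2
    return 1
  fi
}
```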

common.sh


CLI removal and secret migration

Before: Four workflow jobs each ran npm install -g @railway/cli@4.31.0 and used secrets.RAILWAY_TOKEN as the sole auth path.
After: All CLI install steps are removed. Auth uses secrets.RAILWAY_API_TOKEN || secrets.RAILWAY_TOKEN so the workspace-scoped token works immediately while the old secret name remains as a fallback.

This is the change that makes the workflow compatible with the workspace-token API path — the CLI required the broader account token, while the GraphQL endpoint works with either.
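As a workflow fragment, the fallback expression likely looks something like this (paraphrased; step names and placement in the real preview-environments.yml may differ):

```yaml
env:
  # Prefer the workspace-scoped token; keep the legacy secret as a
  # fallback until the migration completes.
  RAILWAY_API_TOKEN: ${{ secrets.RAILWAY_API_TOKEN || secrets.RAILWAY_TOKEN }}
```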

preview-environments.yml


Batch runtime variable resolution

Before: Three separate extract_runtime_var calls each polled Railway independently through the CLI, grepping a flat env dump for one key at a time.
After: A single resolve_runtime_vars function fetches the rendered variable JSON once per attempt via railway_variables_json and checks all three keys (MANAGE_DB_URL, RUN_DB_URL, SPICEDB_ENDPOINT) atomically.

The ensure_runtime_var_seeded function also gained early-exit logic — if an explicit template value already matches the current value, it skips the upsert entirely and logs "Updated" vs "Seeded" to distinguish overwrites from first writes.
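The atomic three-key check might be sketched like this. The key names come from the PR, but this simplified version (runtime_vars_complete is a hypothetical name) takes the rendered JSON as an argument rather than polling railway_variables_json, and assumes jq:

```shell
# Succeed only if every expected runtime var is present and non-empty
# in the rendered variables JSON. Illustrative sketch.
runtime_vars_complete() {
  local vars_json="$1" key
  for key in MANAGE_DB_URL RUN_DB_URL SPICEDB_ENDPOINT; do
    if [ -z "$(jq -r --arg k "${key}" '.[$k] // empty' <<< "${vars_json}")" ]; then
      return 1  # missing or empty: the caller retries the whole attempt
    fi
  done
}
```

The provisioning loop would fetch the JSON once per attempt and call this check, so a partially rendered environment fails the attempt as a unit instead of per key.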

provision-railway.sh · common.sh


Cross-PR serialization and same-PR cancellation

Before: Workflow-level concurrency was set to cancel-in-progress: false, meaning overlapping runs for the same PR would queue rather than cancel, while Railway mutations across different PRs could race.
After: Workflow-level concurrency allows same-PR cancellation (cancel-in-progress: true). Provision and teardown jobs each have their own concurrency group keyed to preview-railway-${{ vars.RAILWAY_PROJECT_ID }} with cancel-in-progress: false, serializing Railway mutations across all PRs.
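The two-level layout described above would look roughly like this in the workflow file (group names paraphrased from the review, not copied from the real YAML):

```yaml
# Workflow level: a newer push to the same PR cancels the older run.
concurrency:
  group: preview-environments-${{ github.event.pull_request.number || github.ref }}
  cancel-in-progress: true

jobs:
  provision:
    # Job level: Railway mutations serialize across all PRs and are
    # never cancelled mid-mutation.
    concurrency:
      group: preview-railway-${{ vars.RAILWAY_PROJECT_ID }}
      cancel-in-progress: false
```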

preview-environments.yml


Idempotent teardown

Before: Teardown required railway link and called railway environment delete — if the environment vanished between the existence check and the delete call, the job would fail.
After: Teardown resolves the environment ID via railway_wait_for_environment_id, and if that fails, re-checks existence. If the environment disappeared in the interim, it exits cleanly instead of failing. The post-delete polling loop now uses sleep_with_jitter instead of a fixed sleep 2 to avoid synchronized retries across concurrent teardowns.
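One plausible shape for sleep_with_jitter (illustrative; the real helper lives in common.sh and is not quoted in this thread):

```shell
# Sleep for the base duration plus up to one second of random jitter,
# so concurrent teardown jobs don't poll Railway in lockstep.
sleep_with_jitter() {
  local base="$1"
  local jitter_ms=$(( RANDOM % 1000 ))
  sleep "${base}.$(printf '%03d' "${jitter_ms}")"
}
```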

teardown-railway.sh

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

Contributor

@pullfrog pullfrog bot left a comment


Solid consolidation — one API surface, one retry model, one token path. Two issues worth addressing before merging: a jq null-iteration crash in the TCP proxy retry loop, and a minor Retry-After header parsing gap for fractional values. Everything else looks correct.

Pullfrog | Fix all | View workflow run | Using Claude Opus

}' "$(jq -nc --arg environment_id "${env_id}" --arg service_id "${service_id}" '{environmentId: $environment_id, serviceId: $service_id}')"
)"

count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
Contributor


If the Railway API returns null for tcpProxies (e.g. the service instance isn't provisioned yet in a freshly created environment), iterating .data.tcpProxies[] will crash under set -e instead of retrying.

Use the // [] fallback to coerce null into an empty array:

Suggested change
count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"

Comment on lines 326 to 327
count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"
Contributor


Same null crash risk on the active line:

Suggested change
count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"
count="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port)] | length' <<< "${response}")"
active="$(jq -r --argjson application_port "${application_port}" '[.data.tcpProxies // [] | .[] | select(.applicationPort == $application_port and .syncStatus == "ACTIVE")] | length' <<< "${response}")"


if [ "${attempt}" -lt "${max_attempts}" ]; then
sleep_with_jitter "${sleep_seconds}"
if [ -n "${retry_after}" ] && [[ "${retry_after}" =~ ^[0-9]+$ ]]; then
Contributor


Minor: Retry-After can be a decimal (e.g. 1.5). The ^[0-9]+$ regex rejects it, falling through to exponential backoff even though the server gave an explicit delay. Not a correctness bug — backoff is a safe fallback — but worth noting for a future pass if Railway starts sending fractional values.
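The fractional-tolerant pattern (a later commit in this PR adopts the regex ^[0-9]+([.][0-9]+)?$) can be checked like so:

```shell
# Accept both integer and decimal Retry-After values before honoring them.
retry_after="1.5"
if [[ "${retry_after}" =~ ^[0-9]+([.][0-9]+)?$ ]]; then
  echo "honoring Retry-After of ${retry_after}s"
fi
```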

-w '%{http_code}' \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ${RAILWAY_API_TOKEN}" \
-H "User-Agent: inkeep-preview-ci" \
Contributor


Good change from User-Agent: Mozilla/5.0 plus the spoofed Origin/Referer headers to an honest User-Agent: inkeep-preview-ci. Much better for traceability.

Comment on lines 43 to +57
@@ -52,22 +54,18 @@ else
echo "Railway environment ${RAILWAY_ENV_NAME} already exists"
fi

RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 20 4)"
Contributor


This calls railway_env_exists_count (line 43) then railway_wait_for_environment_id (line 57) — two separate GraphQL round-trips that both resolve the environment list. Not a bug, and the second one does polling, but if you wanted to trim one API call you could extract the env ID directly from railway_environment_id at line 43 and skip railway_env_exists_count. Low priority — just noting the redundancy.

Comment on lines +24 to +31
if ! RAILWAY_ENV_ID="$(railway_wait_for_environment_id "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}" 10 2)"; then
if [ "$(railway_env_exists_count "${RAILWAY_PROJECT_ID}" "${RAILWAY_ENV_NAME}")" = "0" ]; then
echo "Railway environment ${RAILWAY_ENV_NAME} disappeared before teardown; nothing to do."
exit 0
fi
echo "Failed to resolve Railway environment ID for ${RAILWAY_ENV_NAME} during teardown." >&2
exit 1
fi
Contributor


The race-condition handling here is well done — re-checking railway_env_exists_count after railway_wait_for_environment_id fails accounts for the environment disappearing between the initial existence check and the ID lookup.

Contributor

@claude claude bot left a comment


PR Review Summary

(4) Total Issues | Risk: Medium

🟠⚠️ Major (2) 🟠⚠️

🟠 1) common.sh GraphQL helper functions silently swallow errors

Issue: Multiple GraphQL helper functions extract data from responses without checking for the .errors field. When a GraphQL query fails (invalid IDs, permission denied, server error), the response contains errors but these functions return empty strings/objects instead of failing.

Why: GraphQL returns HTTP 200 with an .errors array for application-level failures. The current pattern of extracting .data.xxx // empty masks these errors completely. This leads to cascading failures with misleading error messages. For example:

  • An auth error returns empty service ID → empty variables JSON → empty database URLs
  • The eventual failure is "missing RUN_DB_URL" when the root cause was "invalid Railway token"

Fix: Add error checking to each GraphQL helper function:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi

Consider creating a shared railway_check_graphql_response() helper to DRY this up across all functions.

Inline Comments:

  • 🟠 Major: common.sh:202 GraphQL errors silently return empty string
  • 🟠 Major: common.sh:224 GraphQL errors silently return empty object

🟡 Minor (2) 🟡

🟡 1) preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job serialization

Issue: The workflow uses cancel-in-progress: true at workflow level while Railway-mutating jobs use cancel-in-progress: false at job level. This creates a race where workflow cancellation can abort Railway jobs mid-operation.
Why: Could leave preview environments in inconsistent state.
Fix: Consider setting workflow-level cancel-in-progress: false or documenting the expected behavior.
Refs: preview-environments.yml:171-173

Inline Comments:

  • 🟡 Minor: preview-environments.yml:56 Workflow-level cancel-in-progress may conflict with job-level serialization
  • 🟡 Minor: provision-railway.sh:183 Dead code - refresh_service_env_dump is unused here

💭 Consider (1) 💭

💭 1) teardown-railway.sh:46 Use jitter for consistency
Issue: Fixed sleep 2 while other polling loops use sleep_with_jitter.
Why: Minor inconsistency; low impact due to job-level serialization.
Fix: Replace with sleep_with_jitter 2.

Inline Comments:

  • 💭 Consider: teardown-railway.sh:46 Use jitter for consistency

💡 APPROVE WITH SUGGESTIONS

Summary: This is a solid refactor that consolidates Railway automation onto a single GraphQL control plane. The architecture is cleaner and the rate-limit handling is well-designed. The main gap is GraphQL error handling — the helper functions silently swallow API errors, which could make debugging Railway issues difficult. Adding explicit error checks would make this more robust for production CI.

Discarded (6)
Location | Issue | Reason Discarded
common.sh:107-152 | Temp files not cleaned up via trap | Minor hygiene; CI runners are ephemeral, cleanup happens on happy paths
common.sh:144-151 | GraphQL errors may log sensitive data | Railway API responses don't typically contain secrets; existing redact_preview_logs covers other outputs
preview-environments.yml:191 | Token fallback may persist indefinitely | Valid migration pattern; comment documents intent; low priority to track cleanup
common.sh:168 | railway_environment_create_from_source doesn't check errors | Caller handles failure via fallback check on line 48
common.sh:243 | railway_variable_collection_upsert doesn't check errors | Lower severity; provision script validates interpolation result downstream
common.sh:65-75 | railway_env_exists_count conflates errors with non-existence | Low confidence; defensive checks exist in callers
Reviewers (2)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
pr-review-devops | 6 | 1 | 1 | 0 | 2 | 0 | 3
pr-review-errors | 8 | 1 | 0 | 0 | 2 | 0 | 3
Total | 14 | 2 | 1 | 0 | 4 | 0 | 6

)"

jq -r --arg env_name "${env_name}" '.data.environments.edges[] | select(.node.name == $env_name) | .node.id' <<< "${response}"
jq -c '.data.variables // {}' <<< "${response}"
Contributor


🟠 MAJOR: GraphQL errors silently return empty object

Issue: When the GraphQL query fails (invalid environment ID, permission error, or server error), the response contains an .errors field. This function extracts .data.variables // {}, which returns an empty object {} on error, masking the failure.

Why: A GraphQL error results in an empty object being returned. Downstream callers extract variables from this empty object, producing empty strings for database URLs without any error indication. These propagate to later steps, causing hard-to-diagnose failures.

Fix: Check for GraphQL errors before extracting data:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error fetching variables: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi

| .node
| select(.id == $service_ref or .name == $service_ref or .name == ("@inkeep/" + $service_ref))
| .id
' <<< "${response}" | head -n 1
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟠 MAJOR: GraphQL errors silently return empty string

Issue: This function captures the GraphQL response but does not check for .errors. If the query fails (invalid project ID, permission denied), the jq filter returns an empty string and the function returns successfully.

Why: Callers in provision-railway.sh validate for empty, but bootstrap-preview-auth.sh and upsert-vercel-preview-env.sh don't validate the result before passing to railway_variables_json. An empty service ID leads to downstream failures with misleading error messages.

Fix: Add error checking:

if jq -e '.errors' >/dev/null 2>&1 <<< "${response}"; then
  echo "GraphQL error querying services: $(jq -r '.errors[0].message // "unknown"' <<< "${response}")" >&2
  return 1
fi

concurrency:
group: preview-environments-${{ github.event.pull_request.number || inputs.pr_number || github.ref }}
cancel-in-progress: false
cancel-in-progress: true
Contributor


🟡 Minor: Workflow-level cancel-in-progress may conflict with job-level serialization

Issue: The top-level concurrency group uses cancel-in-progress: true, which cancels in-progress runs when a new commit is pushed. However, the provision-tier1 and teardown-tier1 jobs use cancel-in-progress: false to serialize Railway mutations.

Why: Workflow-level cancellation can abort Railway jobs mid-operation before job-level serialization kicks in, potentially leaving environments in an inconsistent state. The PR description mentions "same-PR cancellation and cross-PR serialization" but this interaction may not work as intended.

Fix: Consider either:

  1. Set workflow-level cancel-in-progress: false and rely solely on job-level concurrency
  2. Document that preview environments may be left orphaned and rely on the stale-preview janitor mentioned in the PR notes

@github-actions github-actions bot deleted a comment from claude bot Mar 30, 2026
Contributor

@claude claude bot left a comment


PR Review Summary

(0) New Issues | Risk: Low

Delta Review (since last review)

The new commit 434979032 ("ci: surface Railway GraphQL helper errors") directly addresses the Major findings from the previous review:

Prior Finding Status
🟠 GraphQL helper functions silently swallow errors Fixed — Added railway_require_graphql_success() helper and integrated error checks into railway_project_service_id, railway_variables_json, railway_environment_id
🟠 TCP proxy null-iteration crash (tcpProxies: null) Fixed — Lines 360-361 now use .data.tcpProxies // [] | .[] pattern
🟡 Retry-After fractional value parsing Fixed — Regex updated to ^[0-9]+([.][0-9]+)?$

New error handling design is clean:

  • Shared helpers (railway_graphql_has_errors, railway_graphql_first_error_message, railway_require_graphql_success) provide consistent error checking across all GraphQL helpers
  • TCP proxy loop now properly retries on GraphQL errors and surfaces the error message on final failure
  • Error messages include context (e.g., "GraphQL error querying Railway services") for easier debugging

🕐 Pending Recommendations (3)

These items from prior reviews are on unchanged code and remain addressable:

  • 🟡 preview-environments.yml:56 — Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
  • 🟡 provision-railway.sh:108: refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern
  • 💭 teardown-railway.sh:46 — Fixed sleep 2 while other polling loops use sleep_with_jitter

✅ APPROVE

Summary: The delta cleanly addresses all Major findings from the previous review. The new error handling is well-designed with shared helpers that surface GraphQL errors clearly. The remaining items are Minor/Consider-level issues on unchanged code that can be addressed in a follow-up if desired. Good work hardening this CI infrastructure! 🎉

Reviewers (1)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
orchestrator | 0 | 0 | 0 | 0 | 0 | 3 | 0
Total | 0 | 0 | 0 | 0 | 0 | 3 | 0

Note: Delta-only review — no sub-reviewers dispatched since the single changed file was already reviewed in the prior run.

Contributor

@claude claude bot left a comment


PR Review Summary

(0) New Issues | Risk: Low

Delta Review (since last review)

The new commit 53598e8fe ("ci: add jitter to preview teardown polling") addresses a previously raised "Consider" item:

Prior Finding Status
💭 teardown-railway.sh:46 — Fixed sleep 2 while other polling loops use sleep_with_jitter Fixed — Now uses sleep_with_jitter 2 for consistency

Change is correct and minimal: The teardown post-delete polling loop now uses jittered sleep like other polling loops in the codebase (railway_wait_for_environment_id, railway_ensure_tcp_proxy, etc.). This prevents thundering herd scenarios if multiple teardown jobs poll Railway simultaneously.

🕐 Pending Recommendations (2)

These items from prior reviews are on unchanged code and remain addressable:

  • 🟡 preview-environments.yml:56 — Workflow-level cancel-in-progress: true may cancel Railway jobs mid-operation before job-level serialization kicks in
  • 🟡 provision-railway.sh:108: refresh_service_env_dump no longer assigns SERVICE_ENV_DUMP but is still called; consider removing the stale variable assignment pattern

✅ APPROVE

Summary: The delta cleanly addresses the remaining "Consider" item from the previous review. All Major findings from earlier reviews have been addressed. The PR is ready to merge — solid consolidation of Railway automation onto a single GraphQL control plane with proper error handling, retry logic, and consistent polling patterns. 🚀

Reviewers (1)
Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded
orchestrator | 0 | 0 | 0 | 0 | 0 | 2 | 0
Total | 0 | 0 | 0 | 0 | 0 | 2 | 0

Note: Delta-only review — no sub-reviewers dispatched since the single changed line addresses a prior "Consider" item.

@pullfrog
Copy link
Copy Markdown
Contributor

pullfrog bot commented Mar 30, 2026

Addressed review feedback on PR #2900.

Thread: .github/workflows/preview-environments.yml:56, the cancel-in-progress: true trade-off

  • The author confirmed this is intentional: self-healing provisioning (idempotent env creation, deterministic var repair, retried proxy setup, tolerant teardown) makes mid-flight cancellation safe. Avoiding stale runs reduces Railway control-plane pressure.
  • Acknowledged and resolved the thread. No code changes needed.

Pullfrog | View workflow run | Triggered by Pullfrog | Using Claude Opus

@vnv-varun
Contributor Author

A note on the main design choice in this PR.

The important change here is not just more retries. It is moving preview Railway automation onto a single GraphQL control plane and treating same-PR cancellation as an intentional tradeoff.

Why we made that tradeoff:

  • older runs on the same PR were continuing to hit Railway even after a newer push made them obsolete
  • that stale work was part of the control-plane pressure problem we were seeing
  • leaving workflow-level cancel-in-progress: false would keep wasting Railway capacity on runs we no longer care about

Why cancel-in-progress: true is acceptable here:

  • preview provisioning is now designed to self-heal on rerun
  • environment creation is idempotent
  • runtime vars are repaired deterministically from the template/override source of truth
  • TCP proxy setup is retried until it converges
  • teardown now tolerates the env disappearing between lookup and delete

So the expected failure mode of a mid-flight cancellation is partial preview state, but the next run for that PR should reconcile it instead of compounding drift. That is the right tradeoff for keeping Railway pressure under control.

One thing this PR does not solve is stale preview cleanup for already-merged or already-closed PRs. There are still a few old pr-* Railway envs hanging around today. That is not because of this cancellation choice. It is because we still do not have a scheduled stale-preview janitor, which should be follow-up work.

@itoqa

itoqa bot commented Mar 30, 2026

Ito Test Report ✅

14 test cases ran. 14 passed.

The unified run passed all 14 of 14 test cases with zero failures, confirming non-production smoke coverage across API and UI: API health stabilized immediately at HTTP 204 (empty body), unauthenticated UI root correctly landed on /login, valid admin sign-in consistently reached /default/projects, and authenticated default-tenant projects API access returned HTTP 200 with expected JSON. Key security and robustness checks also succeeded, including stable session behavior under rapid submits/refresh/back-forward navigation, correct mobile (iPhone 12) post-login rendering without horizontal overflow, denial of unauthorized or cross-tenant data access (401/403), resistance to invalid-credential bursts and injection-like inputs, and blocking of external returnUrl open-redirect attempts while preserving safe internal redirects.

✅ Passed (14)
Category | Summary | Screenshot
Adversarial | Five rapid invalid password submissions all failed with visible error and did not authenticate the user; UI stayed on login flow. | ADV-1
Adversarial | Injection-like payloads in email/password did not execute scripts or corrupt the page, and authentication was denied with safe error output. | ADV-2
Adversarial | Direct unauthenticated API access returned 401 Unauthorized and exposed no project data. | ADV-3
Adversarial | Authenticated cross-tenant API probe to not-default tenant returned 403, confirming tenant boundary enforcement. | ADV-4
Adversarial | External returnUrl payload did not redirect off-origin; browser remained on local UI origin and ended on /default/projects. | ADV-5
Edge | Rapid submit sequence stabilized to a single authenticated state; after reload the session persisted on the protected projects page with no auth bounce loop. | EDGE-1
Edge | After refreshing during the login transition, the flow recovered and ended in a stable authenticated state without transient lockups. | EDGE-2
Edge | Repeated history navigation did not prevent sign-in; fields/buttons stayed functional and authentication completed to the protected projects page. | EDGE-3
Edge | iPhone-12 viewport login succeeded and the initial projects screen rendered without horizontal overflow while primary actions stayed reachable. | EDGE-4
Edge | Login initiated from /login?returnUrl=%2F completed successfully and redirected to in-app /default/projects on localhost origin. | EDGE-5
Happy-path | API health check reached HTTP 204 on the first polling attempt with an empty response body. | ROUTE-1
Happy-path | UI root is reachable and correctly routes to /login for unauthenticated state; earlier /default/projects observation aligns with existing local auth/dev session behavior. | ROUTE-2
Happy-path | Login via email+password succeeded and redirected to authenticated manage UI at /default/projects with project chrome visible and no auth error banner. | ROUTE-3
Happy-path | Credential bootstrap returned HTTP 200 with session cookie, and authenticated tenant projects endpoint returned HTTP 200 JSON payload in authorized default-tenant scope. | ROUTE-4

Commit: 53598e8

View Full Run


Tell us how we did: Give Ito Feedback

@amikofalvy amikofalvy added this pull request to the merge queue Mar 30, 2026
Merged via the queue into main with commit 21311f8 Mar 30, 2026
27 of 28 checks passed
@amikofalvy amikofalvy deleted the varun/fix-preview-runtime-var-sync branch March 30, 2026 21:31
tim-inkeep pushed a commit that referenced this pull request Mar 31, 2026
* ci: sync preview runtime vars from template overrides

* ci: retry flaky Railway preview operations

* ci: throttle Railway preview mutations

* ci: reduce Railway preview polling

* ci: move preview Railway automation to GraphQL

* ci: surface Railway GraphQL helper errors

* ci: add jitter to preview teardown polling
