Conversation
TL;DR — Adds retry-with-jitter loops around Railway preview environment and service ID lookups so that transient API delays no longer break TCP proxy setup during CI preview deployments, and ensures discovery failures propagate correctly via `|| return 1`.
Summary | 1 file | 2 commits | base: Retry loop for Railway service discovery
Both helpers share the same pattern: loop up to `max_attempts` times (default 30), sleeping `sleep_seconds` plus jitter between tries, and return non-zero if the ID never resolves.
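A minimal sketch of that shared pattern (hedged: `railway_query_id` is a hypothetical stand-in for the real Railway API lookup in `common.sh`, and the defaults mirror the new functions' `max_attempts=30` / `sleep_seconds=2`):

```shell
# Sketch of the shared retry-with-jitter pattern; railway_query_id is a
# hypothetical stand-in for the real Railway API lookup in common.sh.
railway_wait_for_id() {
  local max_attempts="${1:-30}" sleep_seconds="${2:-2}"
  local attempt id
  for attempt in $(seq 1 "${max_attempts}"); do
    id="$(railway_query_id)" || id=""
    if [ -n "${id}" ]; then
      printf '%s\n' "${id}"
      return 0
    fi
    # fixed base interval plus sub-second jitter, matching the file's
    # existing railway_extract_runtime_var approach
    sleep "$(awk -v s="${sleep_seconds}" 'BEGIN { srand(); printf "%.2f", s + rand() }')"
  done
  return 1   # caller decides how to handle exhausted attempts
}
```

The function prints the resolved ID on success and returns non-zero on exhaustion, which is what makes the caller-side `|| return 1` guard discussed below meaningful.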
Clean change — the retry wrappers follow the same pattern as railway_extract_runtime_var and correctly address the Railway eventual-consistency flake. One observation about the retry budget worth considering.
.github/scripts/preview/common.sh
Outdated
```bash
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
```
Nit (non-blocking): railway_ensure_tcp_proxy forwards its full max_attempts/sleep_seconds to each of the two new wait functions, then runs its own polling loop with the same budget. Worst-case wall time tripled from ~60 s to ~180 s (30 × 2 s × 3 phases) compared to the previous single-phase version.
This is probably fine in practice — the environment and service IDs should resolve quickly — but it's worth being aware of. If CI timeout pressure becomes a concern later, consider giving the discovery phases a smaller attempt budget (e.g., 10) while keeping the TCP-proxy polling at 30.
Looks good — approving.
PR Review Summary
1 Total Issue | Risk: Medium
🟠 Major (1)
Inline Comments:
- 🟠 Major: common.sh:251-252 Missing error propagation from wait functions — failures waste 60s retrying with invalid input
💭 Consider (4)
💭 1) common.sh:196,221 Add progress logging during retry loops
Issue: When retry loops take up to 60 seconds, operators have no visibility into whether the script is progressing or stuck.
Why: Silent retry loops make incident debugging harder since there's no indication of which attempt failed or how long the wait has been.
Fix: Add progress logging to stderr after the first attempt:
```bash
if [ "${attempt}" -gt 1 ]; then
  echo "Waiting for environment ID (attempt ${attempt}/${max_attempts})..." >&2
fi
```
💭 2) common.sh:191,217 Default max_attempts differs from existing pattern
Issue: The new wait functions default to max_attempts=30 while the existing railway_extract_runtime_var uses max_attempts=20.
Why: This inconsistency could lead to different retry behavior expectations. 30 attempts × ~2s = ~60-90s max wait vs 20 × ~2s = ~40-60s for the existing function.
Fix: Consider aligning with the existing default (20) or documenting why service discovery needs more attempts.
💭 3) system Be aware of cascading retry latency
Issue: The railway_ensure_tcp_proxy function now chains two wait functions plus its own internal retry loop, resulting in worst-case latency of ~180 seconds.
Why: Environment wait (30 × 2s) + service wait (30 × 2s) + TCP proxy wait (30 × 2s) = 180s total. This is acceptable for CI but operators should be aware of the timeout budget.
Fix: Consider documenting the total timeout budget or adding a comment noting the worst-case latency.
💭 4) common.sh:191 Consider exponential backoff
Issue: The retry strategy uses fixed 2-second base interval with jitter rather than exponential backoff.
Why: Exponential backoff is the industry standard for retry patterns and would reduce load on Railway's API during extended outages. However, the current approach follows the existing railway_extract_runtime_var pattern in the file.
Fix: For consistency, keeping linear backoff is reasonable. If changing, consider: sleep_seconds * (2 ** min(attempt - 1, 4)) * jitter capped at 30s.
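If exponential backoff were ever adopted, the suggested formula could be sketched as below. This is an illustration only: `backoff_delay`, its argument order, and the 30 s cap are assumptions, not current `common.sh` behavior.

```shell
# Hypothetical helper implementing the suggested capped exponential backoff:
# sleep_seconds * 2^min(attempt - 1, 4), capped at 30s, plus jitter.
backoff_delay() {
  local base="${1:-2}" attempt="${2:-1}" cap=30
  local exp=$(( attempt - 1 < 4 ? attempt - 1 : 4 ))  # min(attempt - 1, 4)
  local delay=$(( base * (1 << exp) ))                # base * 2^exp
  if (( delay > cap )); then
    delay="${cap}"                                    # never wait longer than the cap
  fi
  # sub-second jitter keeps concurrent CI jobs from retrying in lockstep
  awk -v d="${delay}" 'BEGIN { srand(); printf "%.2f\n", d + rand() }'
}
```

For example, attempt 1 yields roughly the base interval, while attempt 5 and beyond saturate at the cap.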
🚫 REQUEST CHANGES
Summary: The retry logic approach is sound and addresses the Railway eventual consistency issue. However, the removal of explicit error handling creates a regression: if environment lookup fails, the script wastes ~60 seconds retrying service lookup with an empty environment ID before failing with a confusing error. Adding || return 1 to both wait function calls (see inline comment) restores defensive error handling without relying on caller shell strictness.
Discarded (0)
No findings were discarded.
Reviewers (3)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| pr-review-standards | 1 | 0 | 0 | 0 | 1 | 0 | 0 |
| pr-review-sre | 5 | 0 | 3 | 0 | 1 | 0 | 1 |
| pr-review-devops | 2 | 0 | 1 | 0 | 1 | 0 | 0 |
| Total | 8 | 0 | 4 | 0 | 1 | 0 | 1 |
Note: Error propagation finding was raised by all 3 reviewers and consolidated into single inline comment. SRE's exponential backoff suggestion was kept as Consider since it conflicts with existing file patterns.
.github/scripts/preview/common.sh
Outdated
```bash
env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
```
🟠 MAJOR: Missing error propagation from wait functions
Issue: If railway_wait_for_environment_id fails and returns exit code 1, the script continues with an empty env_id. The subsequent railway_wait_for_service_id_for_env call then retries 30 times with an empty environment ID before failing with a misleading error.
Why: Since this file is sourced and "callers own shell strictness" (per line 2), set -e cannot be assumed. The original code had explicit -z checks that prevented this cascading failure. Without explicit error handling, failures in the first lookup waste ~60 seconds retrying the second lookup with invalid input.
Fix: (1-click apply)
```diff
-env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
-service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")"
+env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
+service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
```
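To see why the guard matters in a sourced file that cannot assume `set -e`, here is a minimal repro with hypothetical function names (not from `common.sh`): a failing command substitution used in an assignment does not stop the caller on its own.

```shell
# Stand-in for a discovery helper that fails (e.g., Railway lookup timeout).
lookup_fails() { return 1; }

# Without the guard: the failing substitution leaves id empty,
# its exit status is dropped, and execution continues.
without_guard() {
  local id
  id="$(lookup_fails)"
  echo "kept going with id='${id}'"
}

# With the guard: the function aborts immediately, as in the suggested fix.
with_guard() {
  local id
  id="$(lookup_fails)" || return 1
  echo "unreachable"
}
```

Calling `without_guard` prints `kept going with id=''` and returns 0, which is exactly the cascading-failure mode described above; `with_guard` returns 1 before reaching its `echo`.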
Fixed in 54759a603 by restoring explicit error propagation on both wait-command substitutions with || return 1. That preserves the original fail-fast behavior when environment discovery fails.
PR Review Summary
0 Total Issues | Risk: Low
Re-review Delta Assessment
The previous review identified 1 MAJOR issue: missing error propagation from wait functions at lines 251-252. The author addressed this in commit 54759a603 by adding || return 1 to both wait function calls.
Verification:
```bash
# Before (missing error propagation)
env_id="$(railway_wait_for_environment_id ...)"
service_id="$(railway_wait_for_service_id_for_env ...)"

# After (correct error propagation)
env_id="$(railway_wait_for_environment_id ...)" || return 1
service_id="$(railway_wait_for_service_id_for_env ...)" || return 1
```
The fix is correct and complete:
- If `railway_wait_for_environment_id` fails, the function now returns immediately with exit code 1
- If `railway_wait_for_service_id_for_env` fails, the function returns immediately with exit code 1
- No wasted retry cycles with empty/invalid input
🕐 Pending Recommendations (4)
The following suggestions from the previous review remain open for the author's consideration. These are non-blocking improvements:
- 💭 Progress logging during retry loops — Would improve operator visibility during long waits
- 💭 Default max_attempts consistency — New functions default to 30 vs existing pattern of 20
- 💭 Cascading retry latency awareness — Worst-case ~180s total timeout
- 💭 Exponential backoff consideration — Industry standard, but current linear approach is consistent with existing file patterns
✅ APPROVE
Summary: The blocking issue from the previous review has been correctly addressed. Error propagation now works as expected — failures in environment or service discovery will immediately abort railway_ensure_tcp_proxy rather than wasting ~60 seconds retrying with invalid input. The retry logic cleanly handles Railway's eventual consistency without introducing silent failures. Ship it! 🚀
Reviewers (0)
| Reviewer | Returned | Main Findings | Consider | While You're Here | Inline Comments | Pending Recs | Discarded |
|---|---|---|---|---|---|---|---|
| Total | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
Note: No reviewers dispatched for this delta re-review. The change was a 2-line fix directly addressing the previous MAJOR finding. Manual verification confirmed correctness.
Ito Test Report ✅
15 test cases ran. 15 passed. All 15 test cases passed with zero failures, indicating no likely production-code defects across preview-environment provisioning, workflow routing, diagnostics/error handling, UI auth navigation, and security-critical API/auth paths.
Key validated behaviors included: successful and idempotent Tier 1 provisioning with TCP proxies converging to ACTIVE (including partial-readiness, transient create-failure recovery, and rapid concurrent dispatches without race issues); correct closed-PR teardown-only gating; failure diagnostics that redact secrets while preserving actionable context; explicit missing-service errors that block downstream jobs; and consistent rejection of unauthorized access patterns (wrong passwords, tampered/stale cookies, cross-tenant path tampering, and origin-mismatched sign-in), while mobile deep-link and refresh/back-forward flows maintained coherent protected-content access rules.
✅ Passed (15) | Commit:
Summary
Retry Railway environment and service ID discovery when provisioning per-PR preview infrastructure so preview setup does not fail on freshly cloned Railway environments.
Changes
- `railway_wait_for_environment_id()` in `.github/scripts/preview/common.sh`
- `railway_wait_for_service_id_for_env()` in `.github/scripts/preview/common.sh`
- `railway_ensure_tcp_proxy()` to wait for Railway to materialize cloned service instances before creating TCP proxies
- Discovery failures propagate via `|| return 1`
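Under stated assumptions (the argument order and variable names are guesses from the inline review diff, not the actual `common.sh` signature, and the proxy-polling body is elided), the updated flow reads roughly:

```shell
# Hedged sketch of the updated railway_ensure_tcp_proxy flow.
railway_ensure_tcp_proxy() {
  local project_id="$1" env_name="$2" service_name="$3"
  local max_attempts="${4:-30}" sleep_seconds="${5:-2}"
  local env_id service_id
  # Phase 1: wait for the cloned environment to appear
  env_id="$(railway_wait_for_environment_id "${project_id}" "${env_name}" \
    "${max_attempts}" "${sleep_seconds}")" || return 1
  # Phase 2: wait for serviceInstances to be populated for that environment
  service_id="$(railway_wait_for_service_id_for_env "${env_id}" "${service_name}" \
    "${env_name}" "${max_attempts}" "${sleep_seconds}")" || return 1
  # Phase 3 (elided here): poll until the TCP proxy reports ACTIVE
  printf '%s\n' "${service_id}"
}
```

Each phase fails fast, so a discovery error aborts the whole function instead of feeding empty IDs into the next phase.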
Preview provisioning can fail immediately after cloning
`preview-base` because Railway has created the environment but has not yet populated `serviceInstances` for the cloned services. In that state, `railway_ensure_tcp_proxy()` fails with `Unable to resolve Railway service ID for <service> in pr-<n>`. This is a Railway timing issue, not a PR-specific logic problem. Retrying environment and service discovery makes the preview workflow resilient to Railway's eventual consistency while still failing quickly if discovery never succeeds.
Test Plan
- `bash -n .github/scripts/preview/*.sh`
- `.github/workflows/preview-environments.yml` parsed with `yaml.safe_load`
- `git diff --check origin/main...HEAD`
- `Provision Tier 1 (Railway)` on this PR
- `pr-review`, `ci`, and `Cypress E2E Tests` pass on this PR
Notes
This is intentionally scoped to the pre-existing Railway provisioning flake that surfaced while validating the stable preview URL comment change in #2799.