fix(seed-forecasts): pipeline timeout 10s→45s + BATCH_SIZE 10→5 #3090
Merged
Root cause (validated via STRLEN probe + Railway log): readInputKeys()
batched GETs against Upstash REST /pipeline deterministically timed out
at the 10s budget. ~40 input keys totaling ~2.27 MB; top 5 keys (ucdp
657KB + chokepoints 500KB + cyber-threats 390KB + commodities 192KB +
gpsjam 174KB) = 90% of payload. Worst-case co-located batch at
BATCH_SIZE=10 was ~1.55 MB; at Upstash REST observed slow-spike floor
(~100 KB/s implied by failure pattern), 1.55 MB needs ~16s, exceeding
the 10s budget.
Production proof — Railway log 2026-04-14 10:01 UTC:
Reading input data from Redis...
Retry 1/2 in 1000ms: The operation was aborted due to timeout
... 12 consecutive abort-timeouts (4 outer × ~3 inner) ...
FETCH FAILED: The operation was aborted due to timeout
=== Failed gracefully (188070ms) ===
Fix:
BATCH_SIZE 10 → 5 (reduces probability of tail co-location)
timeout 10s → 45s (2.4× headroom at observed floor)
Round-trips 4 → 8; per-batch overhead ~150-500ms total amortized by
undici keep-alive. Negligible vs hourly cadence.
What this PR does NOT do (5-agent deepen-plan review caught these):
- Does NOT remove input keys. Initial draft proposed dropping 3
"stub" keys. All 3 are LIVE: producers traced to seed-insights,
seed-conflict-intel, and seed-forecasts itself (L14919 — self-
referential EMA windows state key). Zero-byte STRLEN snapshot
caught inter-cycle gaps, not dead keys. Removing reads would
break newsDigest, acledEvents, and EMA windows.
- Does NOT bump api/health.js maxStaleMin. Right fix = make read
succeed, not widen alarm.
- Does NOT extract shared batchedPipelineGet helper. Tracked.
Latent sibling bugs (separate PRs per feedback_no_pr_pollution):
- seed-cross-source-signals.mjs:163 (15s, 23 keys)
- seed-correlation.mjs:26 (10s, 9 market keys)
- seed-energy-spine.mjs:71 (30s, 300 cmds/batch)
- seed-resilience-scores.mjs:73 (30s, BATCH=50 writes)
Plan: docs/plans/2026-04-14-001-fix-seed-forecasts-pipeline-timeout-plan.md
Skill: ~/.claude/skills/upstash-pipeline-payload-timeout/SKILL.md
Tests: node --check + typecheck + typecheck:api clean.
Architecture-strategist review on PR #3090 caught the same anti-pattern flagged on PR #3088: the inline comment referenced ~/.claude/skills/... which only resolves on the author's machine. Replaced with a self-contained "diagnostic methodology" paragraph so the rationale is portable and contributors on CI / other machines see complete context. No code change.
Greptile Summary
This PR fixes a deterministic hourly seeder failure by reducing …
Confidence Score: 5/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant S as seed-forecasts.mjs
participant W as withRetry (max 2 retries)
participant U as Upstash REST /pipeline
Note over S: ~40 keys split into<br/>8 batches of 5 (was 4×10)
loop "Each batch (i = 0..7)"
S->>W: invoke batch fetch
W->>U: "POST /pipeline (5 GETs)<br/>AbortSignal.timeout(45 000ms)"
alt Success (≤45 s)
U-->>W: 200 JSON array
W-->>S: batchResults
else Timeout / error (retry ≤2×)
U-->>W: abort / HTTP error
W->>U: "retry with exponential backoff<br/>(1 s, 2 s)"
U-->>W: 200 JSON array
W-->>S: batchResults
end
S->>S: results.push(...batchResults)
end
Note over S: parse all results via parsedByKey
Reviews (1): Last reviewed commit: "docs(seed-forecasts): drop personal-mach..."
…te (Greptile P2)
Greptile review on PR #3090 caught: BATCH_SIZE divides the keys array deterministically by index, so the worst batch is FIXED by array order (not random co-location as my comment implied). Verified live with STRLEN: batch 2 (indices 5-9 = chokepoints + iran + ucdp + unrest + outages) is 1.17 MB, not the 1.9 MB worst-random-case I claimed.
Updated comment to reflect:
- Actual deterministic worst-case batch (1.17 MB).
- Headroom recalc: 1.17 MB at 100 KB/s = ~12s; 45s gives 3.7× margin.
- Architectural insight as follow-up: interleaving heavies (chokepoints + ucdp) with smalls in the keys array would split the deterministic worst-case across two batches, halving per-request payload. Tracked for a future PR (no PR pollution).
No code change; comment-only correction.
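The point about index-based batching can be illustrated with a small sketch. This is not the code in `seed-forecasts.mjs`; the helper names (`chunkByIndex`, `batchTotalsKB`, `interleavedBatches`) and the 4 KB "small" entries are hypothetical placeholders, and only the two heavy sizes (chokepoints 500 KB, ucdp 657 KB) come from the STRLEN figures quoted in this PR. The round-robin reordering is one possible way to interleave heavies, assumed here for illustration.

```javascript
// Split entries into consecutive batches by array index (deterministic).
function chunkByIndex(entries, size) {
  const batches = [];
  for (let i = 0; i < entries.length; i += size) {
    batches.push(entries.slice(i, i + size));
  }
  return batches;
}

// Total payload per batch, in KB.
function batchTotalsKB(batches) {
  return batches.map((batch) => batch.reduce((sum, e) => sum + e.sizeKB, 0));
}

// Hypothetical reordering: sort heaviest-first, then deal entries
// round-robin so the heavy keys land in different batches.
function interleavedBatches(entries, size) {
  const batchCount = Math.ceil(entries.length / size);
  const batches = Array.from({ length: batchCount }, () => []);
  [...entries]
    .sort((a, b) => b.sizeKB - a.sizeKB)
    .forEach((entry, i) => batches[i % batchCount].push(entry));
  return batches;
}

// Placeholder data: 4 KB "small" keys are illustrative, not measured.
const small = (n) => ({ key: `small-${n}`, sizeKB: 4 });
const sampleEntries = [
  small(0), small(1), small(2), small(3), small(4),
  { key: "chokepoints", sizeKB: 500 },
  small(5),
  { key: "ucdp", sizeKB: 657 },
  small(6), small(7),
];

const byIndex = batchTotalsKB(chunkByIndex(sampleEntries, 5));
const interleaved = batchTotalsKB(interleavedBatches(sampleEntries, 5));
console.log({ byIndex, interleaved });
```

With this placeholder data, both heavy keys fall into the second index-based batch, while the round-robin ordering puts them in separate batches, so the largest single request shrinks.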
Why this PR?
`/api/health` 2026-04-14 10:43 UTC reported `forecasts: STALE_SEED, seedAgeMin: 97, maxStaleMin: 90` — the hourly seeder missed exactly one cycle. Railway log for the 10:01 run shows the read failed deterministically:
```
=== forecast:predictions Seed ===
Reading input data from Redis...
Retry 1/2 in 1000ms: The operation was aborted due to timeout
... 12 consecutive abort-timeouts (4 outer × ~3 inner) ...
FETCH FAILED: The operation was aborted due to timeout
=== Failed gracefully (188070ms) ===
```
Inter-attempt gaps cluster at ~10–11s — exactly the `AbortSignal.timeout(10_000)` budget. Every retry hits the wall. Not transient.
Root cause (validated by direct STRLEN probe)
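`scripts/seed-forecasts.mjs:696-712` batches 10 GETs per Upstash REST `/pipeline` POST with a 10s timeout. A STRLEN probe of every input key on 2026-04-14 found ~40 keys totaling ~2.27 MB, dominated by the top 5: ucdp (657 KB), chokepoints (500 KB), cyber-threats (390 KB), commodities (192 KB), and gpsjam (174 KB).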
Worst-case batch payload at BATCH_SIZE=10 ≈ 1.55 MB (3 heavies + 7 small). At Upstash REST's observed slow-spike floor (~100 KB/s, implied by the failure), 1.55 MB needs ~16s — exceeding the 10s budget.
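The arithmetic behind that estimate can be checked with a short sketch. The sizes are the three heaviest keys from the STRLEN figures in the commit message, the ~100 KB/s floor is the value inferred above, and 1 MB is treated as 1000 KB; the function name `transferSeconds` is hypothetical, and the seven small keys in the batch are ignored.

```javascript
// Back-of-envelope: seconds needed to move a payload at a given throughput.
function transferSeconds(payloadKB, throughputKBps) {
  return payloadKB / throughputKBps;
}

const FLOOR_KBPS = 100;         // observed slow-spike floor (inferred, not measured directly)
const OLD_TIMEOUT_SECONDS = 10; // previous AbortSignal.timeout budget

// Three heaviest keys, sizes in KB (from the STRLEN probe).
const heaviesKB = { ucdp: 657, chokepoints: 500, "cyber-threats": 390 };
const worstBatchKB = Object.values(heaviesKB).reduce((a, b) => a + b, 0);

const neededSeconds = transferSeconds(worstBatchKB, FLOOR_KBPS);
console.log(
  `${worstBatchKB} KB at ${FLOOR_KBPS} KB/s needs ${neededSeconds.toFixed(1)}s ` +
    `(budget was ${OLD_TIMEOUT_SECONDS}s)`
);
```

Under these assumptions the heavies alone need about 15.5 s, which is already past the 10 s budget before the small keys are counted.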
Fix
```diff
...
```
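Since the diff itself is elided above, here is a minimal sketch of what a batched read with the new constants might look like. It is not the code in `seed-forecasts.mjs`: the function names, the `url`/`token` options, and the injectable `fetchImpl` are assumptions for illustration. It relies on the Upstash REST `/pipeline` endpoint accepting a JSON array of command arrays and returning an array of `{ result }` objects, and on `AbortSignal.timeout` being available (Node 17.3+).

```javascript
const BATCH_SIZE = 5;      // was 10
const TIMEOUT_MS = 45_000; // was 10_000

// Split keys into consecutive batches of at most `size`.
function chunk(items, size) {
  const out = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Build the /pipeline request body: one ["GET", key] command per key.
function buildPipelineBody(keys) {
  return keys.map((key) => ["GET", key]);
}

// Hypothetical reader: one POST per batch, each with its own timeout.
async function readKeysBatched(keys, { url, token, fetchImpl = fetch }) {
  const results = [];
  for (const batch of chunk(keys, BATCH_SIZE)) {
    const res = await fetchImpl(`${url}/pipeline`, {
      method: "POST",
      headers: {
        Authorization: `Bearer ${token}`,
        "Content-Type": "application/json",
      },
      body: JSON.stringify(buildPipelineBody(batch)),
      signal: AbortSignal.timeout(TIMEOUT_MS),
    });
    if (!res.ok) throw new Error(`pipeline HTTP ${res.status}`);
    const batchResults = await res.json(); // array of { result } objects
    results.push(...batchResults.map((r) => r.result));
  }
  return results;
}
```

Passing `fetchImpl` as an option keeps the sketch testable without a network; a real seeder would presumably also wrap each batch in its retry helper.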
What this PR does NOT do (5-agent deepen-plan review caught these)
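Does NOT remove input keys. Initial draft proposed dropping 3 "stub" keys (`news:digest:v1:full:en`, `conflict:acled:v1:all:0:0`, `conflict:ema-windows:v1`). Architecture review traced all 3 to LIVE producers: seed-insights, seed-conflict-intel, and seed-forecasts itself (the EMA windows state key is self-referential). The zero-byte STRLEN snapshot caught inter-cycle gaps, not dead keys.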
Removing those reads would have broken the forecaster. Classic zero-byte-snapshot misdiagnosis (per `feedback_audit_upstream_before_curating`).
Does NOT bump `api/health.js` maxStaleMin. Current 90 min for hourly cadence is borderline tight, but the right fix is to make the read succeed, not widen the alarm.
Does NOT extract a shared `batchedPipelineGet` helper. Architectural refactor; tracked as follow-up.
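To make the `maxStaleMin` point concrete, the sketch below shows why a single missed hourly cycle trips a 90-minute threshold. It is a simplification, not the logic in `api/health.js`: the function names, the `"OK"` label, and the strict `>` comparison are assumptions, and run duration is ignored.

```javascript
// Hypothetical staleness check: STALE_SEED once the seed is older than the threshold.
function seedStatus(seedAgeMin, maxStaleMin) {
  return seedAgeMin > maxStaleMin ? "STALE_SEED" : "OK";
}

// Oldest the seed can get if `missedCycles` consecutive runs fail
// before the next one succeeds (ignores how long a run takes).
function maxAgeAfterMisses(cadenceMin, missedCycles) {
  return cadenceMin * (missedCycles + 1);
}

console.log(seedStatus(97, 90));                       // the reading reported by /api/health
console.log(seedStatus(maxAgeAfterMisses(60, 0), 90)); // healthy hourly cadence
console.log(seedStatus(maxAgeAfterMisses(60, 1), 90)); // one missed cycle
```

Widening the threshold would only hide a missed cycle, which is why this PR fixes the read instead.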
Testing
`node --check`, `typecheck`, and `typecheck:api` are clean.
Post-Deploy Monitoring & Validation
Latent sibling bugs (NOT this PR — separate per `feedback_no_pr_pollution`)
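Architecture + scan review surfaced 4 other seeders with the same risk class:
- `seed-cross-source-signals.mjs:163` (15s, 23 keys)
- `seed-correlation.mjs:26` (10s, 9 market keys)
- `seed-energy-spine.mjs:71` (30s, 300 cmds/batch)
- `seed-resilience-scores.mjs:73` (30s, BATCH=50 writes)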
Each will be its own STRLEN-validated PR.
Architectural follow-ups (out of scope)
Related
🤖 Generated with Claude Opus 4.6 (1M context, high reasoning) via Claude Code