Skip to content

fix(seed-forecasts): pipeline timeout 10s→45s + BATCH_SIZE 10→5#3090

Merged
koala73 merged 3 commits into
mainfrom
fix/seed-forecasts-pipeline-timeout
Apr 14, 2026
Merged

fix(seed-forecasts): pipeline timeout 10s→45s + BATCH_SIZE 10→5#3090
koala73 merged 3 commits into
mainfrom
fix/seed-forecasts-pipeline-timeout

Conversation

@koala73
Copy link
Copy Markdown
Owner

@koala73 koala73 commented Apr 14, 2026

Why this PR?

`/api/health` 2026-04-14 10:43 UTC reported `forecasts: STALE_SEED, seedAgeMin: 97, maxStaleMin: 90` — the hourly seeder missed exactly one cycle. Railway log for the 10:01 run shows the read failed deterministically:

```
=== forecast:predictions Seed ===
Reading input data from Redis...
Retry 1/2 in 1000ms: The operation was aborted due to timeout
... 12 consecutive abort-timeouts (4 outer × ~3 inner) ...
FETCH FAILED: The operation was aborted due to timeout
=== Failed gracefully (188070ms) ===
```

Inter-attempt gaps cluster at ~10–11s — exactly the `AbortSignal.timeout(10_000)` budget. Every retry hits the wall. Not transient.

Root cause (validated by direct STRLEN probe)

`scripts/seed-forecasts.mjs:696-712` batches 10 GETs per Upstash REST `/pipeline` POST with a 10s timeout. STRLEN-probed every input key on 2026-04-14:

Key Size % of total
`conflict:ucdp-events:v1` 657 KB 29%
`supply_chain:chokepoints:v4` 500 KB 22%
`cyber:threats-bootstrap:v2` 390 KB 17%
`market:commodities-bootstrap:v1` 192 KB 8%
`intelligence:gpsjam:v2` 174 KB 8%
34 other keys ~220 KB 10%
Total (~40 keys) ~2.27 MB

Worst-case batch payload at BATCH_SIZE=10 ≈ 1.55 MB (3 heavies + 7 small). At Upstash REST's observed slow-spike floor (~100 KB/s, implied by the failure), 1.55 MB needs ~16s — exceeding the 10s budget.

Fix

```diff

  • const BATCH_SIZE = 10;
  • const BATCH_SIZE = 5; // reduces probability of tail co-location
    ...
  •   signal: AbortSignal.timeout(10_000),
    
  •   signal: AbortSignal.timeout(45_000),  // 2.4× headroom at observed floor
    

```

  • BATCH_SIZE=5 still allows ~1.9 MB co-located batch (3 heavies + 2 mediums); 45s gives ~30s headroom even at slow-spike rates.
  • Round-trips 4 → 8; per-batch overhead ~150–500ms total amortized by undici keep-alive. Negligible vs hourly cadence.

What this PR does NOT do (5-agent deepen-plan review caught these)

  1. Does NOT remove input keys. Initial draft proposed dropping 3 "stub" keys (`news:digest:v1:full:en`, `conflict:acled:v1:all:0:0`, `conflict:ema-windows:v1`). Architecture review traced all 3 to LIVE producers/consumers:

    • `news:digest:v1:full:en` → produced by `seed-insights.mjs:10`, consumed at `seed-forecasts.mjs:746` (`newsDigest`) AND `server/worldmonitor/intelligence/v1/chat-analyst-context.ts:666`. 0-byte probe = inter-cycle gap, not dead.
    • `conflict:acled:v1:all:0:0` → produced by `seed-conflict-intel.mjs:23` (TTL 900s), consumed at `:761` (`acledEvents`), core to EMA windows. 13-byte probe = 15-min cycle gap.
    • `conflict:ema-windows:v1` → produced by seed-forecasts itself at `:14919` (self-referential state). 2-byte probe = documented cold-start `{}` path. Removing the read would break EMA continuity.

    Removing those reads would have broken the forecaster. Classic zero-byte-snapshot misdiagnosis (per `feedback_audit_upstream_before_curating`).

  2. Does NOT bump `api/health.js` maxStaleMin. Current 90 min for hourly cadence is borderline tight, but the right fix is to make the read succeed, not widen the alarm.

  3. Does NOT extract a shared `batchedPipelineGet` helper. Architectural refactor; tracked as follow-up.

Testing

  • `node --check scripts/seed-forecasts.mjs` clean
  • `npm run typecheck` clean
  • `npm run typecheck:api` clean
  • No unit test possible without real Redis (the change is config tuning); verification is post-deploy.
  • `grep` confirms all 3 contested keys preserved in both `keys` array AND consumer accesses (`parsedByKey[...]`, `results[keys.indexOf(...)]`).

Post-Deploy Monitoring & Validation

  • Logs: Railway service `seed-forecasts` next hourly run — `Reading input data from Redis...` should be followed within seconds by clean `Done (Nms)`. Zero `Retry N/M ... aborted due to timeout` lines is the success signal. The 188s failure duration should drop to baseline ~30–60s.
  • Health endpoint: `curl -sL https://worldmonitor.app/api/health | jq '.checks.forecasts'` — should report `{ status: "OK", seedAgeMin < 60, ... }` across 24 consecutive hourly cycles. Baseline today: 1 STALE per ~24 h.
  • Failure signal / rollback trigger: if Railway log still shows `aborted due to timeout` on the 45s budget, the underlying issue is bigger than payload (Upstash regional throttling on this Railway IP). Escalate to Phase 2 (dedicated reads for the 3 heaviest keys outside the batch loop).
  • Validation window: 24 h post-deploy.
  • Owner: @koala73
  • Rollback: `git revert` this commit. Railway auto-redeploys on push to `main` (NIXPACKS watch on `scripts/**`). Zero data loss — read-only change.

Latent sibling bugs (NOT this PR — separate per `feedback_no_pr_pollution`)

Architecture + scan review surfaced 4 other seeders with the same risk class:

File Current timeout Batch / key count Priority
`scripts/seed-cross-source-signals.mjs:163` 15s unbatched, 23 keys (same heavy-key set) High follow-up
`scripts/seed-correlation.mjs:26` 10s unbatched, 9 market keys High follow-up
`scripts/seed-energy-spine.mjs:71` 30s 300 GET commands per batch (60 countries × 5 keys) Medium
`scripts/seed-resilience-scores.mjs:73` 30s BATCH=50 writes Medium

Each will be its own STRLEN-validated PR.

Architectural follow-ups (out of scope)

  • Phase 2: dedicated single-key reads for ucdp-events, chokepoints, cyber-threats outside the batch loop — caps tail at max(individual-key-size).
  • Phase 3: pre-compute a derived `forecast:inputs:v1` blob upstream → seed-forecasts reads ONE ~50 KB key. Highest payoff, biggest change.
  • Phase 4: switch to `ioredis` binary protocol for Node-side seeders (no docs-backed guidance either way; defensible engineering choice).
  • Architectural smell: seed-forecasts is coupled to 15+ producers' internal Redis schemas with no versioned contract. Add SEED_META assertion test.

Related


Compound Engineering v2.49.0
🤖 Generated with Claude Opus 4.6 (1M context, high reasoning) via Claude Code

Root cause (validated via STRLEN probe + Railway log): readInputKeys()
batched GETs against Upstash REST /pipeline deterministically timed out
at the 10s budget. ~40 input keys totaling ~2.27 MB; top 5 keys (ucdp
657KB + chokepoints 500KB + cyber-threats 390KB + commodities 192KB +
gpsjam 174KB) = 90% of payload. Worst-case co-located batch at
BATCH_SIZE=10 was ~1.55 MB; at Upstash REST observed slow-spike floor
(~100 KB/s implied by failure pattern), 1.55 MB needs ~16s, exceeding
the 10s budget.

Production proof — Railway log 2026-04-14 10:01 UTC:
  Reading input data from Redis...
  Retry 1/2 in 1000ms: The operation was aborted due to timeout
  ... 12 consecutive abort-timeouts (4 outer × ~3 inner) ...
  FETCH FAILED: The operation was aborted due to timeout
  === Failed gracefully (188070ms) ===

Fix:
  BATCH_SIZE 10 → 5    (reduces probability of tail co-location)
  timeout    10s → 45s (2.4× headroom at observed floor)

Round-trips 4 → 8; per-batch overhead ~150-500ms total amortized by
undici keep-alive. Negligible vs hourly cadence.

What this PR does NOT do (5-agent deepen-plan review caught these):
  - Does NOT remove input keys. Initial draft proposed dropping 3
    "stub" keys. All 3 are LIVE: producers traced to seed-insights,
    seed-conflict-intel, and seed-forecasts itself (L14919 — self-
    referential EMA windows state key). Zero-byte STRLEN snapshot
    caught inter-cycle gaps, not dead keys. Removing reads would
    break newsDigest, acledEvents, and EMA windows.
  - Does NOT bump api/health.js maxStaleMin. Right fix = make read
    succeed, not widen alarm.
  - Does NOT extract shared batchedPipelineGet helper. Tracked.

Latent sibling bugs (separate PRs per feedback_no_pr_pollution):
  - seed-cross-source-signals.mjs:163 (15s, 23 keys)
  - seed-correlation.mjs:26 (10s, 9 market keys)
  - seed-energy-spine.mjs:71 (30s, 300 cmds/batch)
  - seed-resilience-scores.mjs:73 (30s, BATCH=50 writes)

Plan: docs/plans/2026-04-14-001-fix-seed-forecasts-pipeline-timeout-plan.md
Skill: ~/.claude/skills/upstash-pipeline-payload-timeout/SKILL.md

Tests: node --check + typecheck + typecheck:api clean.
@vercel
Copy link
Copy Markdown

vercel Bot commented Apr 14, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
worldmonitor Ignored Ignored Preview Apr 14, 2026 11:24am

Request Review

Architecture-strategist review on PR #3090 caught the same anti-pattern
flagged on PR #3088: the inline comment referenced ~/.claude/skills/...
which only resolves on the author's machine. Replaced with a self-contained
"diagnostic methodology" paragraph so the rationale is portable and
contributors on CI / other machines see complete context.

No code change.
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented Apr 14, 2026

Greptile Summary

This PR fixes a deterministic hourly seeder failure by reducing BATCH_SIZE from 10→5 and raising the Upstash REST pipeline AbortSignal timeout from 10 s→45 s in readInputKeys(). The root cause — validated by a STRLEN probe — was that worst-case batches of ~1.2–1.5 MB exceeded the 10 s abort budget at Upstash's observed ~100 KB/s slow-spike floor, causing all three retry attempts to fail and the seed to miss its cycle.

Confidence Score: 5/5

  • Safe to merge — read-only config tuning with no data-path changes; both remaining findings are P2 comment quality issues.
  • The fix is narrowly scoped (two numeric literals), grounded in probe data, and the timeout math checks out. Both P2 findings are comment inaccuracies that don't affect runtime behaviour.
  • No files require special attention.

Important Files Changed

Filename Overview
scripts/seed-forecasts.mjs Reduces BATCH_SIZE 10→5 and raises AbortSignal timeout 10 s→45 s in the Upstash REST pipeline loop; two minor comment accuracy/portability issues, core logic is correct and well-justified by STRLEN probe data.

Sequence Diagram

sequenceDiagram
    participant S as seed-forecasts.mjs
    participant W as withRetry (max 2 retries)
    participant U as Upstash REST /pipeline

    Note over S: ~40 keys split into<br/>8 batches of 5 (was 4×10)

    loop "Each batch (i = 0..7)"
        S->>W: invoke batch fetch
        W->>U: "POST /pipeline (5 GETs)<br/>AbortSignal.timeout(45 000ms)"
        alt Success (≤45 s)
            U-->>W: 200 JSON array
            W-->>S: batchResults
        else Timeout / error (retry ≤2×)
            U-->>W: abort / HTTP error
            W->>U: "retry with exponential backoff<br/>(1 s, 2 s)"
            U-->>W: 200 JSON array
            W-->>S: batchResults
        end
        S->>S: results.push(...batchResults)
    end

    Note over S: parse all results via parsedByKey
Loading

Reviews (1): Last reviewed commit: "docs(seed-forecasts): drop personal-mach..." | Re-trigger Greptile

Comment thread scripts/seed-forecasts.mjs Outdated
Comment thread scripts/seed-forecasts.mjs Outdated
…te (Greptile P2)

Greptile review on PR #3090 caught: BATCH_SIZE divides the keys array
deterministically by index, so the worst batch is FIXED by array order
(not random co-location as my comment implied). Verified live with
STRLEN: batch 2 (indices 5-9 = chokepoints + iran + ucdp + unrest +
outages) is 1.17 MB, not the 1.9 MB worst-random-case I claimed.

Updated comment to reflect:
  - Actual deterministic worst-case batch (1.17 MB).
  - Headroom recalc: 1.17 MB at 100 KB/s = ~12s; 45s gives 3.7× margin.
  - Architectural insight as follow-up: interleaving heavies (chokepoints
    + ucdp) with smalls in the keys array would split the deterministic
    worst-case across two batches, halving per-request payload. Tracked
    for a future PR (no PR pollution).

No code change; comment-only correction.
@koala73 koala73 merged commit 51c9d2c into main Apr 14, 2026
10 checks passed
@koala73 koala73 deleted the fix/seed-forecasts-pipeline-timeout branch April 14, 2026 11:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant