fix(resilience-ranking): chunked warm SET, always-on rebuild, truthful meta (Slice B) #3124

Merged
koala73 merged 6 commits into main from fix/resilience-ranking-rebuild-always
Apr 16, 2026
Conversation

@koala73
Owner

@koala73 koala73 commented Apr 16, 2026

Summary

Follow-up to PR #3121. Live behavior post-merge confirmed three coupled failures that left resilience:ranking:v9 absent even after Slice A landed (per-country score persistence) — the EMPTY_ON_DEMAND symptom in /api/health persisted.

Manual cron run log captured the smoking gun:

[Resilience-Scores] 0/222 scores pre-warmed
[Resilience-Scores] Warming 222 missing via ranking endpoint...
[Resilience-Scores] Ranking: 0 ranked, 222 greyed out          ← handler dropped publish
[Resilience-Scores] Rebuilt resilience:ranking:v9 with 222 countries (bulk-call race left ranking:v9 null)

Root causes

  1. Pipeline body too big. warmMissingResilienceScores issued one 222-command pipeline SET (~600KB). It exceeded REDIS_PIPELINE_TIMEOUT_MS (5s) on Vercel Edge → runRedisPipeline returned [] → persistence guard correctly returned an empty warmed map → coverage 0/222 < 75% → handler skipped both ranking SET and meta SET. (Upstash actually persisted the writes a moment later — that's why the seeder's second HTTP call succeeded.)
  2. Rebuild path gated on missing>0. seed-resilience-scores.mjs only probed/rebuilt the ranking inside if (missing > 0). Once per-country scores were warm from a previous cron, every subsequent cron skipped rebuild — but the ranking aggregate has the same 6h TTL and could expire alone, leaving multi-hour gaps.
  3. Lying meta. writeRankingSeedMeta fired whenever recordCount > 0, with no check that resilience:ranking:v9 actually existed. Health endpoint then reported a fresh seedAge over a missing data key.

Fixes

  • server/worldmonitor/resilience/v1/_shared.ts: split the warm pipeline SETs into SET_BATCH = 30-command chunks so each pipeline body fits well under timeout. Failed batches pad with empty entries to preserve per-command alignment.
  • scripts/seed-resilience-scores.mjs: extract ensureRankingPresent({url, token, laggardsWarmed}) helper, call it from BOTH the warm and skip-warm branches so the rebuild fires every cron. Adds a post-rebuild STRLEN verification — rebuild HTTP can return 200 with a payload but still skip the SET internally.
  • main(): gate writeRankingSeedMeta on result.rankingPresent. If the rebuild didn't actually produce a key, log and let the next cron retry.
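The meta gating in the last bullet can be sketched as follows. This is a hypothetical stand-in, not the seeder's actual code: `writeRankingSeedMeta` and `result.rankingPresent` are the PR's names, the counter and log text are illustrative (and note that a later commit in this PR removes the seeder-side meta write entirely).

```javascript
// Illustrative sketch of gating the seed-meta write on rankingPresent.
// metaWrites is a test counter standing in for the real Redis SET.
let metaWrites = 0;
async function writeRankingSeedMeta() { metaWrites++; }

async function finalizeSeed(result) {
  if (result.rankingPresent === true) {
    await writeRankingSeedMeta();
    return true;
  }
  // No key → no meta claim; the next cron tick retries the rebuild.
  console.warn('[Resilience-Scores] Skipping seed-meta write: ranking key absent');
  return false;
}
```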

Test plan

  • Full resilience suite: 373/373 pass (3 new tests).
  • resilience-ranking.test.mts: pipeline-size assertion (no single SET pipeline >30 commands).
  • resilience-scores-seed.test.mjs: structural checks — ensureRankingPresent called from ≥2 sites, STRLEN-verify present, meta gated on result.rankingPresent.
  • After deploy: live probe TTL resilience:ranking:v9 should be positive within one cron tick; /api/health should flip resilienceRanking to status: OK; the [Resilience-Scores] Skipping seed-meta write… warning should be rare.

Builds on PR #3121 (now on main).

…l meta

Slice B follow-up to PR #3121. Three coupled production failures observed:

1. Per-country score persistence works (Slice A), but the 222-SET single
   pipeline body (~600KB) exceeds REDIS_PIPELINE_TIMEOUT_MS (5s) on Vercel
   Edge. runRedisPipeline returns []; persistence guard correctly returns
   empty; coverage = 0/222 < 75%; ranking publish silently dropped. Live
   Railway log: "Ranking: 0 ranked, 222 greyed out" → "Rebuilt … with 222
   countries (bulk-call race left ranking:v9 null)" — second call only
   succeeded because Upstash had finally caught up between attempts.

2. The seeder's probe + rebuild block lives inside `if (missing > 0)`. When
   per-country scores survive a cron tick (TTL 6h, cron every 6h), missing=0
   and the rebuild path is skipped. Ranking aggregate then expires alone and
   is never refreshed until scores also expire — multi-hour gaps where
   `resilience:ranking:v9` is gone while seed-meta still claims freshness.

3. `writeRankingSeedMeta` fires whenever finalWarmed > 0, regardless of
   whether the ranking key is actually present. Health endpoint sees fresh
   meta + missing data → EMPTY_ON_DEMAND with a misleading seedAge.

Fixes:
- _shared.ts: split the warm pipeline SET into SET_BATCH=30-command chunks
  so each pipeline body fits well under timeout. Pad missing-batch results
  with empty entries so the per-command alignment stays correct (failed
  batches stay excluded from `warmed`, no proof = no claim).
- seed-resilience-scores.mjs: extract `ensureRankingPresent` helper, call
  it from BOTH the missing>0 and missing===0 branches so the ranking gets
  refreshed every cron. Add a post-rebuild STRLEN verification — rebuild
  HTTP can return 200 with a payload but still skip the SET (coverage gate,
  pipeline failure).
- main(): only writeRankingSeedMeta when result.rankingPresent === true.
  Otherwise log and let the next cron retry.

Tests:
- resilience-ranking.test.mts: assert pipelines stay ≤30 commands.
- resilience-scores-seed.test.mjs: structural checks that the rebuild is
  hoisted (≥2 callsites of ensureRankingPresent), STRLEN verification is
  present, and meta write is gated on rankingPresent.

Full resilience suite: 373/373 pass (was 370 — 3 new tests).
@vercel

vercel bot commented Apr 16, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project: worldmonitor · Deployment: Ready · Actions: Preview, Comment · Updated (UTC): Apr 16, 2026 8:47am


@greptile-apps
Contributor

greptile-apps bot commented Apr 16, 2026

Greptile Summary

This PR addresses three coupled production failures that left resilience:ranking:v9 absent even after Slice A (per-country score persistence) landed: (1) a 600KB single-pipeline SET timing out on Vercel Edge, (2) the ranking rebuild only running inside the missing > 0 branch, and (3) seed-meta being written even when the ranking key was absent.

The fixes — chunked SET batches (SET_BATCH = 30) in warmMissingResilienceScores, an ensureRankingPresent helper called from both cron branches, and a post-rebuild STRLEN verification gating the meta write — are logically sound and well-tested.

Confidence Score: 5/5

Safe to merge — all three root causes are addressed correctly with no new P0/P1 issues introduced.

All remaining observations are P2. The core logic — chunked SETs with index-alignment padding, always-on ranking probe/rebuild, and truthful meta gating — is sound and covered by both behavioral and structural tests. No data loss, security concerns, or broken contracts.

No files require special attention.

Important Files Changed

Filename Overview
server/worldmonitor/resilience/v1/_shared.ts Adds SET_BATCH=30 chunking in warmMissingResilienceScores with correct index-alignment padding for failed batches; the persistence guard correctly excludes any command whose batch returned [].
scripts/seed-resilience-scores.mjs Extracts ensureRankingPresent helper (called from both warm and skip-warm branches), adds STRLEN post-rebuild verification, and gates writeRankingSeedMeta on result.rankingPresent. DEL-before-rebuild when laggardsWarmed>0 is an intentional tradeoff.
tests/resilience-ranking.test.mts Adds two new behavioral tests: pipeline-size assertion (SET_BATCH ≤ 30) and persistence-guard regression (ranking not published when SETs return non-OK).
tests/resilience-scores-seed.test.mjs Adds structural (source-text) assertions for the three invariants: ensureRankingPresent called ≥2 times, STRLEN verify present, and meta gated on rankingPresent.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A([Cron tick starts]) --> B[GET all country scores\nbulk pipeline]
    B --> C{preWarmed == total?}
    C -- missing > 0 --> D[Call /get-resilience-ranking\nto bulk-warm missing scores]
    D --> E[Re-check still-missing countries]
    E --> F{laggardsWarmed > 0?}
    F -- yes --> G[Warm laggards individually]
    G --> H[Final GET pipeline for intervals]
    F -- no --> H
    H --> I[computeAndWriteIntervals]
    I --> J[ensureRankingPresent\nlaggardsWarmed=N]
    C -- missing == 0 --> K[computeAndWriteIntervals\nwith preResults]
    K --> L[ensureRankingPresent\nlaggardsWarmed=0]
    J --> M{rankingExists != null\n&& laggardsWarmed == 0?}
    L --> M
    M -- yes --> P[return true]
    M -- no --> N{laggardsWarmed > 0?}
    N -- yes --> O[DEL resilience:ranking:v9]
    O --> Q[Rebuild via /get-resilience-ranking]
    N -- no --> Q
    Q --> R[STRLEN verify resilience:ranking:v9]
    R --> S{STRLEN > 0?}
    S -- yes --> T[return true]
    S -- no --> U[return false]
    P --> V{result.rankingPresent?}
    T --> V
    U --> V
    V -- true --> W[writeRankingSeedMeta]
    V -- false --> X[Log warning, next cron retries]


Comment on lines +499 to 511
for (let i = 0; i < allSetCommands.length; i += SET_BATCH) {
  const batch = allSetCommands.slice(i, i + SET_BATCH);
  const batchResults = await runRedisPipeline(batch);
  if (batchResults.length !== batch.length) {
    // runRedisPipeline returns [] on transport/HTTP failure. Pad with
    // empty entries so the per-command index alignment downstream stays
    // correct — those entries will fail the OK check and be excluded
    // from `warmed`, which is the safe behavior (no proof = no claim).
    for (let j = 0; j < batch.length; j++) persistResults.push({});
  } else {
    for (const result of batchResults) persistResults.push(result);
  }
}

P2 Sequential batch awaits add round-trip latency

The batches are awaited one at a time. For 222 countries (⌈222/30⌉ = 8 batches), this serializes 7 extra Upstash round-trips (~100–500 ms each on Vercel Edge) that could run in parallel. Promise.all would reduce warm time to a single wave, though you'd need to flatten results in batch-offset order and keep the same alignment-padding logic.

Not blocking — the sequential approach is safer for backpressure and the total latency is still well within the function timeout.
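The parallel variant the comment suggests could look like the sketch below. `SET_BATCH` and the `[]`-on-failure contract come from the PR; `runRedisPipeline` here is a mock transport (with an extra batch-index argument purely so the mock can simulate one failed batch), not the real Upstash client.

```javascript
const SET_BATCH = 30;

// Mock transport: every batch succeeds except batch index 2, which
// simulates a transport/HTTP failure by returning [].
async function runRedisPipeline(batch, batchIndex) {
  return batchIndex === 2 ? [] : batch.map(() => ({ result: 'OK' }));
}

async function runBatchesInParallel(allSetCommands) {
  const batches = [];
  for (let i = 0; i < allSetCommands.length; i += SET_BATCH) {
    batches.push(allSetCommands.slice(i, i + SET_BATCH));
  }
  // Promise.all resolves in input order, so flattening in batch order
  // preserves the per-command index alignment the padding relies on.
  const settled = await Promise.all(
    batches.map((batch, idx) => runRedisPipeline(batch, idx)),
  );
  const persistResults = [];
  settled.forEach((batchResults, idx) => {
    const batch = batches[idx];
    if (batchResults.length !== batch.length) {
      // Failed batch: pad so downstream indices stay aligned; padded
      // entries fail the OK check and are excluded from `warmed`.
      for (let j = 0; j < batch.length; j++) persistResults.push({});
    } else {
      persistResults.push(...batchResults);
    }
  });
  return persistResults;
}
```

The tradeoff is as the comment says: one slowest-batch wave instead of eight serialized round-trips, at the cost of concurrent pressure on Upstash.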

Comment thread scripts/seed-resilience-scores.mjs Outdated
Comment on lines +268 to +270
if (laggardsWarmed > 0) {
  await redisPipeline(url, token, [['DEL', RESILIENCE_RANKING_CACHE_KEY]]);
}

P2 DEL before rebuild confirmation creates a brief absence window

When laggardsWarmed > 0 and the ranking key currently exists, the DEL fires before the rebuild HTTP call completes. If the rebuild request returns non-OK (transient edge timeout, deploy in progress, etc.), the STRLEN verify returns 0, rankingPresent is false, and the ranking stays absent until the next cron tick. The meta write is correctly skipped so the health endpoint won't lie, but users will hit EMPTY_ON_DEMAND for up to one cron interval.

A safer sequencing would DEL only after receiving a successful 200 from the rebuild call — though this does mean the stale ranking briefly serves read traffic during the rebuild, which may be acceptable. Worth documenting the tradeoff explicitly if not already captured elsewhere.

…s sole writer)

Reviewer P1: ensureRankingPresent() returning true only means the ranking
key exists in Redis — not that THIS cron actually wrote it. The handler
skips both the ranking SET and the meta SET when coverage < 75%, so an
older ranking from a prior cron can linger while this cron's data didn't
land. Under that scenario, the previous commit still wrote a fresh
seed-meta:resilience:ranking, recreating the stale-meta-over-stale-data
failure this PR is meant to eliminate.

Fix: remove seeder-side seed-meta writes entirely. The ranking handler
already writes ranking + meta atomically in the same pipeline when (and
only when) coverage passes the gate. ensureRankingPresent() triggers the
handler every cron, which addresses the original rationale for the seeder
heartbeat (meta going stale during quiet Pro usage) without the seeder
needing to lie.

Consequence on failure:
- Coverage gate trips → handler writes neither ranking nor meta.
- seed-meta stays at its previous timestamp; api/health reports accurate
  staleness (STALE_SEED after maxStaleMin, then CRIT) instead of a fresh
  meta over stale/empty data.

Tests updated: the "meta gated on rankingPresent" assertion is replaced
with "seeder must not SET seed-meta:resilience:ranking" + "no
writeRankingSeedMeta". Comments may still reference the key name for
maintainer clarity — the assertion targets actual SET commands.

Full resilience suite: 373/373 pass.
koala73 added 2 commits April 16, 2026 11:17
Reviewer P1+P2:

- P1: ranking TTL == cron interval (both 6h) left a timing hole. If a cron
  wrote the key near the end of its run and the next cron fired near the
  start of its interval, the key was still alive at probe time →
  ensureRankingPresent() returned early → no rebuild → key expired a short
  while later and stayed absent until a cron eventually ran while the key
  was missing. Multi-hour EMPTY_ON_DEMAND gaps.

- P2: probing only the ranking data key (not seed-meta) meant a partial
  handler pipeline (ranking SET ok, meta SET missed) would self-heal only
  when the ranking itself expired — never during its TTL window.

Fix:

1. Bump RESILIENCE_RANKING_CACHE_TTL_SECONDS from 6h to 12h (2x cron
   interval). A single missed or slow cron no longer causes a gap.
   Server-side and seeder-side constants kept in sync.

2. Replace ensureRankingPresent() with refreshRankingAggregate(): drop the
   'if key present, skip' short-circuit. Rebuild every cron, unconditionally.
   One cheap HTTP call keeps ranking + seed-meta rolling forward together
   and self-heals the partial-pipeline case — handler retries the atomic
   pair every 6h regardless of whether the keys are currently live.

3. Update health.js comment to reflect the new TTL and refresh cadence
   (12h data TTL, 6h refresh, 12h staleness threshold = 2 missed ticks).

Tests:
- RESILIENCE_RANKING_CACHE_TTL_SECONDS asserts 12h (was 6h).
- New assertion: refreshRankingAggregate must NOT early-return on probe-
  hit, and the rebuild HTTP call must be unconditional in its body.
- DEL-guard test relaxed to allow comments between '{' and the DEL line
  (structural property preserved).

Full resilience suite: 375/375.
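The refreshRankingAggregate shape described in point 2 can be sketched as below. The function name is the commit's; the injected `rebuild`/`verifyStrlen` helpers are illustrative stand-ins for the seeder's HTTP call and STRLEN probe.

```javascript
// Sketch: no probe-and-skip short-circuit — rebuild fires every cron,
// and only the post-rebuild STRLEN verification decides success.
async function refreshRankingAggregate({ rebuild, verifyStrlen }) {
  // Unconditional: do NOT early-return even if the key is currently live.
  await rebuild();
  // Trust the verified key length, not the rebuild's HTTP status.
  const len = await verifyStrlen();
  return len > 0;
}
```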
…a ?refresh=1

Reviewer P2s:

- Warm path serialized the 8 batch pipelines with `await` in a for-loop,
  adding ~7 extra Upstash round-trips (100-500ms each on Edge) to the warm
  wall-clock. Batches are independent; Promise.all collapses them into one
  slowest-batch window.

- DEL+rebuild created a brief absence window: if the rebuild request failed
  transiently, the ranking stayed absent until the next cron. Now seeder
  calls `/api/resilience/v1/get-resilience-ranking?refresh=1` and the
  handler bypasses its cache-hit early-return, recomputing and SETting
  atomically. On rebuild failure, the existing (possibly stale-but-present)
  ranking is preserved instead of being nuked.

Handler: read ctx.request.url for the refresh query param; guard the URL
parse with try/catch so an unparseable url falls back to the cached-first
behavior.
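The guarded URL parse described above might look like this sketch (`wantsRefresh` is a hypothetical helper name; the base URL is a placeholder needed because ctx.request.url may be relative):

```javascript
// Parse ?refresh=1 defensively: an unparseable URL must fall back to the
// cached-first path rather than throw inside the handler.
function wantsRefresh(rawUrl) {
  try {
    const url = new URL(rawUrl, 'https://placeholder.invalid');
    return url.searchParams.get('refresh') === '1';
  } catch {
    return false; // unparseable → cached-first behavior
  }
}
```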

Tests:
- New: ?refresh=1 must bypass the cache-hit early-return (fails on old code,
  passes now).
- DEL-guard test replaced with 'does NOT DEL' + 'uses ?refresh=1'.
- Batch chunking still asserted at SET_BATCH=30.

Full resilience suite: 376/376.
koala73 added 2 commits April 16, 2026 12:19
…tric TTL hazard)

Reviewer P1: in the 6h-12h window, per-country score keys have expired
(TTL 6h) but the ranking aggregate is still alive (TTL 12h). The seeder's
bulk-warm call was hitting get-resilience-ranking without ?refresh=1, so
the handler's cache-hit early-return fired and the entire warm path was
skipped. Scores stayed missing; coverage degraded; the only recovery was
the per-country laggard loop (5-request batches) — which silently no-ops
when WM_KEY is absent. This defeated the whole point of the chunked bulk
warm introduced in this PR.

Fix: the bulk-warm fetch at scripts/seed-resilience-scores.mjs:167 now
appends ?refresh=1, matching the rebuild call. Every seeder-initiated hit
on the ranking endpoint forces the handler to route through
warmMissingResilienceScores and its chunked pipeline SET, regardless of
whether the aggregate is still cached.

Test extended: structural assertion now scans ALL occurrences of
get-resilience-ranking in the seeder and requires every one of them to
carry ?refresh=1. Fails the moment a future change adds a bare call.

Full resilience suite: 376/376.
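The extended structural assertion could be implemented along these lines; the helper name and regex are an illustrative sketch (the real test reads the seeder source from disk, here a string stands in):

```javascript
// Every occurrence of the endpoint name in the source must carry
// ?refresh=1; a single bare call fails the check.
function everyRankingCallRefreshes(source) {
  const calls = source.match(/get-resilience-ranking[^'"`\s]*/g) ?? [];
  return calls.length > 0 && calls.every((c) => c.includes('refresh=1'));
}
```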
… pipeline publish

Reviewer P1: ?refresh=1 was honored for any caller — including valid Pro
bearer tokens. A full warm is ~222 score computations + chunked pipeline
SETs; a Pro user looping on refresh=1 (or an automated client) could DoS
Upstash quota and Edge budget. Gate refresh behind
WORLDMONITOR_VALID_KEYS / WORLDMONITOR_API_KEY (X-WorldMonitor-Key
header) — the same allowlist the cron uses. Pro bearer tokens get the
standard cache-first path; refresh requires the seed service key.

Reviewer P2: the handler's atomic runRedisPipeline SET of ranking + meta
is non-transactional on Upstash REST — either SET can fail independently.
If the ranking landed but meta missed, the seeder's STRLEN verify would
pass (ranking present) while /api/health stays stuck on stale meta.

Two-part fix:
- Handler inspects pipelineResult[0] and [1] and logs a warning when
  either SET didn't return OK. Ops-greppable signal.
- Seeder's verify now checks BOTH keys in parallel: STRLEN on ranking
  data, and GET + fetchedAt freshness (<5min) on seed-meta. Partial
  publish logs a warning; next cron retries (SET is idempotent).

Tests:
- New: ?refresh=1 without/with-wrong X-WorldMonitor-Key must NOT trigger
  recompute (falls back to cached response). Existing bypass test updated
  to carry a valid seed key header.

Full resilience suite: 376/376 + 1 new = 377/377.
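The refresh gate described in the P1 fix might be sketched as follows. The env-var and header names (WORLDMONITOR_VALID_KEYS, WORLDMONITOR_API_KEY, X-WorldMonitor-Key) come from the commit message; the comma-separated allowlist format is an assumption about their shape.

```javascript
// Allow ?refresh=1 only for callers presenting a key from the seed
// allowlist — Pro bearer tokens fall through to the cache-first path.
function refreshAllowed(headers, env) {
  const presented = headers['x-worldmonitor-key'];
  if (!presented) return false;
  const allow = new Set(
    (env.WORLDMONITOR_VALID_KEYS ?? env.WORLDMONITOR_API_KEY ?? '')
      .split(',')
      .map((k) => k.trim())
      .filter(Boolean),
  );
  return allow.has(presented);
}
```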
@koala73 koala73 merged commit da1fa33 into main Apr 16, 2026
10 checks passed
@koala73 koala73 deleted the fix/resilience-ranking-rebuild-always branch April 16, 2026 08:48
koala73 added a commit that referenced this pull request Apr 16, 2026
… so refresh runs

Live log 2026-04-16 09:25 showed the bundle runner SKIPPING Resilience-Scores
(last seeded 203min ago, interval 360min -> 288min skip threshold). Every
Railway cron fire within the 4.8h skip window bypassed the section entirely,
so refreshRankingAggregate() -- the whole point of the Slice B work merged in
#3124 -- never ran. Ranking could then silently expire in the gap.

Lower intervalMs to 2h. The bundle runner skip threshold becomes 96min;
hourly Railway fires run the section about every 2h. Well within the 12h
ranking TTL, and cheap per warm-path run:

  - computeAndWriteIntervals (~100ms local CPU + one pipeline write)
  - refreshRankingAggregate -> /api/resilience/v1/get-resilience-ranking?refresh=1
    (handler recompute + 2-SET pipeline, ~2-5s)
  - STRLEN + GET-meta verify in parallel (~200ms)

Total ~5-10s per warm-scores run. The expensive 222-country warm still only
runs when scores are actually missing.

Structural test pins intervalMs <= 2 hours so this doesn't silently regress.

Full resilience suite: 378/378.
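The skip-threshold arithmetic in the log above is consistent with a 0.8 factor on intervalMs; that factor is inferred from the two data points quoted (360min → 288min, 120min → 96min), not read from the bundle runner's source.

```javascript
// Inferred bundle-runner skip threshold: skip the section when the last
// seed is more recent than 80% of its configured interval.
const skipThresholdMin = (intervalMin) => intervalMin * 0.8;
```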