
fix(resilience-ranking): return warmed scores from memory, skip lossy re-read#3121

Merged
koala73 merged 4 commits into main from fix/resilience-ranking-race
Apr 16, 2026

Conversation


@koala73 koala73 commented Apr 16, 2026

Summary

Fixes the silent ranking publish failure that leaves resilience:ranking:v9 absent while the seeder keeps writing a "fresh" seed-meta:resilience:ranking, producing resilienceRanking: { status: EMPTY_ON_DEMAND, records: 0, seedAgeMin: 277 } in /api/health even when the cron is firing normally.

Root cause

server/worldmonitor/resilience/v1/get-resilience-ranking.ts warmed ~222 country scores, then immediately re-read them via getCachedResilienceScores (a /pipeline GET) to compute coverage. Upstash REST has documented write→re-read visibility lag within a single Vercel invocation (PR #3057, feedback_upstash_write_reread_race_in_handler.md). The pipeline GET could return 0 results despite every SET succeeding — coverage collapsed to 0% < 75% threshold, handler logged [resilience] ranking not cached … and skipped both the ranking SET and the meta SET. Seeder then wrote its own meta anyway (based on its own pipeline GET, which by then had propagated), leaving the aggregate permanently absent.
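The failure mode reduces to a small sketch (hypothetical names; the real logic lives in get-resilience-ranking.ts): the coverage gate was fed by a re-read that could lag every successful write.

```typescript
// Coverage gate as described: publish only when >= 75% of countries have scores.
function passesCoverageGate(cachedCount: number, total: number, threshold = 0.75): boolean {
  return total > 0 && cachedCount / total >= threshold;
}

// Before the fix: every /set succeeded, but the immediately-following
// /pipeline GET could surface zero keys, so the gate evaluated 0/222
// and the ranking publish was silently skipped.
const laggedReadBack: string[] = []; // what the re-read returned under write->read lag
const publish = passesCoverageGate(laggedReadBack.length, 222); // false -> no ranking SET, no meta SET
```

The gate itself is correct; the bug is that its input came from Redis instead of from what the invocation had just computed.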

Fix

  • warmMissingResilienceScores now returns Map<countryCode, GetResilienceScoreResponse> with every successfully computed score.
  • Handler merges warm results into cachedScores directly and drops the post-warm re-read — coverage reflects what was actually computed in-memory in this invocation.

No API changes, no schema changes, ~20 lines across 2 files.
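A minimal sketch of the in-memory merge, with simplified, assumed shapes (the real types live under server/worldmonitor/resilience/v1/):

```typescript
// Simplified stand-in for GetResilienceScoreResponse.
type ScoreResponse = { countryCode: string; score: number };

// warmMissingResilienceScores now returns every successfully computed
// score keyed by country code; the handler merges that map into
// cachedScores directly and skips the post-warm Redis re-read.
function mergeWarmedScores(
  cachedScores: Map<string, ScoreResponse>,
  warmed: Map<string, ScoreResponse>,
): Map<string, ScoreResponse> {
  for (const [cc, score] of warmed) cachedScores.set(cc, score);
  return cachedScores; // coverage = cachedScores.size / totalCountries
}
```

Coverage is now computed from what this invocation actually produced, so Upstash write→read lag can no longer zero it out.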

Test plan

  • npx tsx --test tests/resilience-ranking.test.mts — 9/9 pass (8 existing + 1 new race regression).
  • Full resilience suite: 369/369 pass.
  • New race-regression test ("publishes ranking via in-memory warm results even when Upstash pipeline-GET lags after /set writes") simulates the exact Upstash lag: pipeline-GET of score keys returns null after /set writes. It fails against the old code and passes with the fix.
  • After deploy, probe Redis: TTL resilience:ranking:v9 should be positive; /api/health should show resilienceRanking: { status: OK, records: ~150+, seedAgeMin: <360 }.
  • The seeder's "[resilience] ranking not cached — coverage X/222 below 75% threshold" warnings should stop appearing.

Follow-up (not in this PR)

Slice B: make seed-resilience-scores.mjs write seed-meta:resilience:ranking only when the ranking key is actually present (currently it lies about freshness based on per-country score count).

… re-read

Upstash REST writes via /set aren't always visible to an immediately-following
/pipeline GET in the same Vercel invocation (documented in PR #3057 /
feedback_upstash_write_reread_race_in_handler.md). The ranking handler was
warming 222 countries then re-reading them from Redis to compute a coverage
ratio; that re-read could return 0 despite every SET succeeding, collapsing
coverage to 0% < 75% and silently dropping the ranking publish. Consequence:
`resilience:ranking:v9` missing, per-country score keys absent, health
reports EMPTY_ON_DEMAND even while the seeder keeps writing a "fresh" meta.

Fix: warmMissingResilienceScores now returns Map<cc, GetResilienceScoreResponse>
with every successfully computed score. The handler merges those into
cachedScores directly and drops the post-warm re-read. Coverage now reflects
what was actually computed in-memory this invocation, not what Redis happens
to surface after write lag.

Adds a regression test that simulates pipeline-GET returning null for
freshly-SET score keys; it fails against the old code (coverage=0, no
ranking written) and passes with the fix (coverage=100%, ranking written).

Slice A of the resilience-ranking recovery plan; Slice B (seeder meta
truthfulness) follows.
vercel bot commented Apr 16, 2026

The latest updates on your projects.

1 Skipped Deployment: worldmonitor — Preview deployment ignored, Apr 16, 2026 5:59am (UTC)


greptile-apps bot commented Apr 16, 2026

Greptile Summary

Fixes a silent ranking-publish failure caused by an Upstash REST write→read visibility lag: after warming missing scores, the handler now merges in-memory results directly into cachedScores instead of issuing a second pipeline GET that could return zero results and collapse coverage below the 75% gate. The change is minimal (~20 lines), well-commented, and accompanied by a regression test that reproduces the exact race condition.

Confidence Score: 5/5

Safe to merge — targeted fix with no API or schema changes and a direct regression test.

All changes are narrowly scoped to the race-condition path. The return-type widening of warmMissingResilienceScores is backward-compatible. const instead of let is correct since the map is mutated, not reassigned. Country-code normalisation is idempotent, so key alignment is guaranteed. The scoreInterval field on warm results is not consumed by buildRankingItem (it reads from the separate fetchIntervals map), so no data inconsistency is introduced. No P0/P1 findings.

No files require special attention.

Important Files Changed

  • server/worldmonitor/resilience/v1/_shared.ts — Return type of warmMissingResilienceScores changed from void to Map<string, GetResilienceScoreResponse>; fulfilled results are now collected and returned to the caller for the in-memory merge. Logic and normalization are correct.
  • server/worldmonitor/resilience/v1/get-resilience-ranking.ts — Drops the post-warm getCachedResilienceScores re-read; merges the in-memory warmed map into cachedScores directly. The coverage gate now reflects in-memory truth rather than an Upstash-lagged pipeline GET.
  • tests/resilience-ranking.test.mts — Adds a targeted race-regression test that intercepts all pipeline-GET calls for score keys and returns null, directly reproducing the Upstash lag; verifies ranking and seed-meta are published despite the simulated lag.

Sequence Diagram

sequenceDiagram
    participant H as Handler
    participant R as Redis (Upstash)
    participant W as warmMissing...

    Note over H,R: Before fix — pipeline-GET lag causes silent ranking drop
    H->>R: getCachedResilienceScores (pipeline GET)
    R-->>H: [] (lag: no scores yet)
    H->>W: warmMissingResilienceScores(missing)
    W->>R: ensureResilienceScoreCached × N (SET writes)
    R-->>W: OK
    W-->>H: void
    H->>R: getCachedResilienceScores (pipeline GET) ← race
    R-->>H: [] (Upstash lag: writes not yet visible)
    Note over H: coverage = 0/222 < 75% → ranking NOT published ❌

    Note over H,R: After fix — in-memory merge bypasses re-read lag
    H->>R: getCachedResilienceScores (pipeline GET)
    R-->>H: [] (lag: no scores yet)
    H->>W: warmMissingResilienceScores(missing)
    W->>R: ensureResilienceScoreCached × N (SET writes)
    R-->>W: OK
    W-->>H: Map<countryCode, score> (in-memory)
    H->>H: merge warmed into cachedScores
    Note over H: coverage = 222/222 = 100% ≥ 75% → ranking published ✅
    H->>R: SET resilience:ranking:v9 + seed-meta


…T response

PR review P1: trusting every fulfilled ensureResilienceScoreCached() result as
"cached" turned the read-lag fix into a write-failure false positive.
cachedFetchJson's underlying setCachedJson only logs and swallows write errors,
so a transient /set failure on resilience:score:v9:* would leave per-country
scores absent while the ranking aggregate and its seed-meta got published on
top of them — worse than the bug this PR was meant to fix.

Fix: use the pipeline SET response as the authoritative persistence signal.

- Extract the score builder into a pure `buildResilienceScore()` with no
  caching side-effects (appendHistory stays — it's part of the score
  semantics).
- `ensureResilienceScoreCached()` still wraps it in cachedFetchJson so
  single-country RPC callers keep their log-and-return-anyway behavior.
- `warmMissingResilienceScores()` now computes in-memory, persists all
  scores in one pipeline SET, and only returns countries whose command
  reported `result: OK`. Pipeline SET's response is synchronous with the
  write, so OK means actually stored — no ambiguity with read-after-write
  lag.
- When runRedisPipeline returns fewer responses than commands (transport
  failure), return an empty map: no proof of persistence → coverage gate
  can't false-positive.

Adds regression test that blocks pipeline SETs to score keys and asserts
the ranking + meta are NOT published. Existing race-regression test still
passes.
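The "OK means actually stored" rule can be sketched as follows (assumed Upstash REST pipeline-response shape; function name hypothetical):

```typescript
// One entry per pipeline command, as returned by the REST pipeline endpoint.
type PipelineEntry = { result?: unknown; error?: string };

// Only countries whose SET command reported result: "OK" count as persisted.
function confirmPersisted(countries: string[], responses: PipelineEntry[]): Set<string> {
  // Transport failure: fewer responses than commands -> no proof of
  // persistence, claim nothing, so the coverage gate cannot false-positive.
  if (responses.length < countries.length) return new Set();
  const ok = new Set<string>();
  countries.forEach((cc, i) => {
    if (responses[i]?.result === "OK") ok.add(cc);
  });
  return ok;
}
```

Unlike a read-after-write probe, the pipeline SET response is synchronous with the write, so the signal carries no visibility-lag ambiguity.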
PR review P1: the pipeline SET added to verify score-key persistence was
called with raw=true, bypassing the preview/dev key prefix (preview:<sha>:).
Two coupled regressions:

  1. Preview/dev deploys write unprefixed `resilience:score:v9:*` keys, but
     all reads (getCachedResilienceScores, ensureResilienceScoreCached via
     setCachedJson/cachedFetchJson) look in the prefixed namespace. Warmed
     scores become invisible to the same preview on the next read.
  2. Because production uses the empty prefix, preview writes land directly
     in the production-visible namespace, defeating the environment
     isolation guard in server/_shared/redis.ts.

Fix: drop the raw=true flag so runRedisPipeline applies prefixKey on each
command, symmetric with the reads. Adds __resetKeyPrefixCacheForTests in
redis.ts so tests can exercise a non-empty prefix without relying on
process-startup memoization order.

Adds regression test that simulates VERCEL_ENV=preview + a commit SHA and
asserts every score-key SET in the pipeline carries the preview:<sha>:
prefix. Fails on old code (raw writes), passes now. installRedis gains an
opt-in `keepVercelEnv` so the test can run under a forced env without
being clobbered by the helper's default reset.
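A sketch of the prefix symmetry being restored, assuming the preview:&lt;sha&gt;: scheme described above (the real logic lives in server/_shared/redis.ts; names are illustrative):

```typescript
// Production uses the empty prefix; preview/dev deploys isolate their
// keys by commit SHA so they never collide with production data.
function keyPrefix(env: string | undefined, sha: string | undefined): string {
  if (env === "preview" && sha) return `preview:${sha}:`;
  return "";
}

// Every read AND write must go through the same prefixing; a raw=true
// write (the bug) lands in the production-visible namespace while the
// subsequent prefixed read looks elsewhere.
function prefixKey(key: string, env?: string, sha?: string): string {
  return `${keyPrefix(env, sha)}${key}`;
}
```

With raw=true dropped, runRedisPipeline applies this prefixing to each command, symmetric with the reads.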
… afterEach

PR review P2: the preview-prefix test mutates process.env.VERCEL_GIT_COMMIT_SHA
but the file's afterEach only restored VERCEL_ENV. A process started with a
real preview SHA (e.g. CI) would have that value unconditionally deleted after
the test ran, leaking changed state into later tests and producing different
prefix behavior locally vs. CI.

Fix: capture originalVercelSha at module load, restore it in afterEach, and
invalidate the memoized key prefix after each test so the next one recomputes
against the restored env. The preview-prefix test's finally block is no
longer needed — the shared teardown handles it.

Verified: suite still passes 11/11 under both `VERCEL_ENV=production` (unset)
and `VERCEL_ENV=preview VERCEL_GIT_COMMIT_SHA=ci-original-sha` process
environments.
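The restore-don't-delete rule can be sketched with a small helper (helper name hypothetical; the suite's shared afterEach would apply it per variable and then invalidate the memoized prefix):

```typescript
// Restore a single env var to the value captured at module load.
// Restoring -- rather than unconditionally deleting -- preserves values
// the process started with, e.g. a real VERCEL_GIT_COMMIT_SHA in CI.
function restoreEnvVar(
  env: Record<string, string | undefined>,
  key: string,
  original: string | undefined,
): void {
  if (original === undefined) delete env[key];
  else env[key] = original;
}

// Captured once at module load, before any test mutates the env:
const originalVercelSha = process.env.VERCEL_GIT_COMMIT_SHA;
// afterEach: restoreEnvVar(process.env, "VERCEL_GIT_COMMIT_SHA", originalVercelSha);
// then __resetKeyPrefixCacheForTests() so the next test recomputes the prefix.
```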
@koala73 koala73 merged commit bdfb415 into main Apr 16, 2026
10 checks passed
@koala73 koala73 deleted the fix/resilience-ranking-race branch April 16, 2026 06:17
koala73 added a commit that referenced this pull request Apr 16, 2026
…l meta

Slice B follow-up to PR #3121. Three coupled production failures observed:

1. Per-country score persistence works (Slice A), but the 222-SET single
   pipeline body (~600KB) exceeds REDIS_PIPELINE_TIMEOUT_MS (5s) on Vercel
   Edge. runRedisPipeline returns []; persistence guard correctly returns
   empty; coverage = 0/222 < 75%; ranking publish silently dropped. Live
   Railway log: "Ranking: 0 ranked, 222 greyed out" → "Rebuilt … with 222
   countries (bulk-call race left ranking:v9 null)" — second call only
   succeeded because Upstash had finally caught up between attempts.

2. The seeder's probe + rebuild block lives inside `if (missing > 0)`. When
   per-country scores survive a cron tick (TTL 6h, cron every 6h), missing=0
   and the rebuild path is skipped. Ranking aggregate then expires alone and
   is never refreshed until scores also expire — multi-hour gaps where
   `resilience:ranking:v9` is gone while seed-meta still claims freshness.

3. `writeRankingSeedMeta` fires whenever finalWarmed > 0, regardless of
   whether the ranking key is actually present. Health endpoint sees fresh
   meta + missing data → EMPTY_ON_DEMAND with a misleading seedAge.

Fixes:
- _shared.ts: split the warm pipeline SET into SET_BATCH=30-command chunks
  so each pipeline body fits well under timeout. Pad missing-batch results
  with empty entries so the per-command alignment stays correct (failed
  batches stay excluded from `warmed`, no proof = no claim).
- seed-resilience-scores.mjs: extract `ensureRankingPresent` helper, call
  it from BOTH the missing>0 and missing===0 branches so the ranking gets
  refreshed every cron. Add a post-rebuild STRLEN verification — rebuild
  HTTP can return 200 with a payload but still skip the SET (coverage gate,
  pipeline failure).
- main(): only writeRankingSeedMeta when result.rankingPresent === true.
  Otherwise log and let the next cron retry.

Tests:
- resilience-ranking.test.mts: assert pipelines stay ≤30 commands.
- resilience-scores-seed.test.mjs: structural checks that the rebuild is
  hoisted (≥2 callsites of ensureRankingPresent), STRLEN verification is
  present, and meta write is gated on rankingPresent.

Full resilience suite: 373/373 pass (was 370 — 3 new tests).
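The chunking and per-command alignment described above can be sketched as follows (SET_BATCH from the commit; response shapes assumed):

```typescript
const SET_BATCH = 30; // commands per pipeline body, well under the 5s timeout

// Split the country SETs into fixed-size batches.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Pad a failed (null) batch with empty entries so the flattened response
// list stays aligned with the command list; padded entries never count
// as warmed -- no proof = no claim.
function alignResponses<R>(batchResults: (R[] | null)[], batchSizes: number[], empty: R): R[] {
  return batchResults.flatMap((res, i) => res ?? Array<R>(batchSizes[i] ?? 0).fill(empty));
}
```

For the 222-country warm this yields 8 batches (7 full, one of 12), matching the parallelization note later in the PR.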
koala73 added a commit that referenced this pull request Apr 16, 2026
…l meta (Slice B) (#3124)

* fix(resilience-ranking): chunked warm SET, always-on rebuild, truthful meta


* fix(resilience-ranking): seeder no longer writes seed-meta (handler is sole writer)

Reviewer P1: ensureRankingPresent() returning true only means the ranking
key exists in Redis — not that THIS cron actually wrote it. The handler
skips both the ranking SET and the meta SET when coverage < 75%, so an
older ranking from a prior cron can linger while this cron's data didn't
land. Under that scenario, the previous commit still wrote a fresh
seed-meta:resilience:ranking, recreating the stale-meta-over-stale-data
failure this PR is meant to eliminate.

Fix: remove seeder-side seed-meta writes entirely. The ranking handler
already writes ranking + meta atomically in the same pipeline when (and
only when) coverage passes the gate. ensureRankingPresent() triggers the
handler every cron, which addresses the original rationale for the seeder
heartbeat (meta going stale during quiet Pro usage) without the seeder
needing to lie.

Consequence on failure:
- Coverage gate trips → handler writes neither ranking nor meta.
- seed-meta stays at its previous timestamp; api/health reports accurate
  staleness (STALE_SEED after maxStaleMin, then CRIT) instead of a fresh
  meta over stale/empty data.

Tests updated: the "meta gated on rankingPresent" assertion is replaced
with "seeder must not SET seed-meta:resilience:ranking" + "no
writeRankingSeedMeta". Comments may still reference the key name for
maintainer clarity — the assertion targets actual SET commands.

Full resilience suite: 373/373 pass.

* fix(resilience-ranking): always refresh + 12h TTL (close timing hole)

Reviewer P1+P2:

- P1: ranking TTL == cron interval (both 6h) left a timing hole. If a cron
  wrote the key near the end of its run and the next cron fired near the
  start of its interval, the key was still alive at probe time →
  ensureRankingPresent() returned early → no rebuild → key expired a short
  while later and stayed absent until a cron eventually ran while the key
  was missing. Multi-hour EMPTY_ON_DEMAND gaps.

- P2: probing only the ranking data key (not seed-meta) meant a partial
  handler pipeline (ranking SET ok, meta SET missed) would self-heal only
  when the ranking itself expired — never during its TTL window.

Fix:

1. Bump RESILIENCE_RANKING_CACHE_TTL_SECONDS from 6h to 12h (2x cron
   interval). A single missed or slow cron no longer causes a gap.
   Server-side and seeder-side constants kept in sync.

2. Replace ensureRankingPresent() with refreshRankingAggregate(): drop the
   'if key present, skip' short-circuit. Rebuild every cron, unconditionally.
   One cheap HTTP call keeps ranking + seed-meta rolling forward together
   and self-heals the partial-pipeline case — handler retries the atomic
   pair every 6h regardless of whether the keys are currently live.

3. Update health.js comment to reflect the new TTL and refresh cadence
   (12h data TTL, 6h refresh, 12h staleness threshold = 2 missed ticks).

Tests:
- RESILIENCE_RANKING_CACHE_TTL_SECONDS asserts 12h (was 6h).
- New assertion: refreshRankingAggregate must NOT early-return on probe-
  hit, and the rebuild HTTP call must be unconditional in its body.
- DEL-guard test relaxed to allow comments between '{' and the DEL line
  (structural property preserved).

Full resilience suite: 375/375.

* fix(resilience-ranking): parallelize warm batches + atomic rebuild via ?refresh=1

Reviewer P2s:

- Warm path serialized the 8 batch pipelines with `await` in a for-loop,
  adding ~7 extra Upstash round-trips (100-500ms each on Edge) to the warm
  wall-clock. Batches are independent; Promise.all collapses them into one
  slowest-batch window.

- DEL+rebuild created a brief absence window: if the rebuild request failed
  transiently, the ranking stayed absent until the next cron. Now seeder
  calls `/api/resilience/v1/get-resilience-ranking?refresh=1` and the
  handler bypasses its cache-hit early-return, recomputing and SETting
  atomically. On rebuild failure, the existing (possibly stale-but-present)
  ranking is preserved instead of being nuked.

Handler: read ctx.request.url for the refresh query param; guard the URL
parse with try/catch so an unparseable url falls back to the cached-first
behavior.

Tests:
- New: ?refresh=1 must bypass the cache-hit early-return (fails on old code,
  passes now).
- DEL-guard test replaced with 'does NOT DEL' + 'uses ?refresh=1'.
- Batch chunking still asserted at SET_BATCH=30.

Full resilience suite: 376/376.
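The serialization fix can be sketched as follows (runPipeline is a generic stand-in for runRedisPipeline):

```typescript
// Before: batches awaited one at a time -- wall-clock ~ sum of round-trips.
async function runBatchesSerial<T, R>(
  batches: T[][],
  runPipeline: (b: T[]) => Promise<R>,
): Promise<R[]> {
  const out: R[] = [];
  for (const b of batches) out.push(await runPipeline(b));
  return out;
}

// After: batches are independent, so Promise.all collapses the wall-clock
// to the slowest single batch while preserving result order.
async function runBatchesParallel<T, R>(
  batches: T[][],
  runPipeline: (b: T[]) => Promise<R>,
): Promise<R[]> {
  return Promise.all(batches.map((b) => runPipeline(b)));
}
```

Promise.all resolves to results in input order regardless of completion order, so the per-batch alignment logic is unaffected by the change.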

* fix(resilience-ranking): bulk-warm call also needs ?refresh=1 (asymmetric TTL hazard)

Reviewer P1: in the 6h-12h window, per-country score keys have expired
(TTL 6h) but the ranking aggregate is still alive (TTL 12h). The seeder's
bulk-warm call was hitting get-resilience-ranking without ?refresh=1, so
the handler's cache-hit early-return fired and the entire warm path was
skipped. Scores stayed missing; coverage degraded; the only recovery was
the per-country laggard loop (5-request batches) — which silently no-ops
when WM_KEY is absent. This defeated the whole point of the chunked bulk
warm introduced in this PR.

Fix: the bulk-warm fetch at scripts/seed-resilience-scores.mjs:167 now
appends ?refresh=1, matching the rebuild call. Every seeder-initiated hit
on the ranking endpoint forces the handler to route through
warmMissingResilienceScores and its chunked pipeline SET, regardless of
whether the aggregate is still cached.

Test extended: structural assertion now scans ALL occurrences of
get-resilience-ranking in the seeder and requires every one of them to
carry ?refresh=1. Fails the moment a future change adds a bare call.

Full resilience suite: 376/376.

* fix(resilience-ranking): gate ?refresh=1 on seed key + detect partial pipeline publish

Reviewer P1: ?refresh=1 was honored for any caller — including valid Pro
bearer tokens. A full warm is ~222 score computations + chunked pipeline
SETs; a Pro user looping on refresh=1 (or an automated client) could DoS
Upstash quota and Edge budget. Gate refresh behind
WORLDMONITOR_VALID_KEYS / WORLDMONITOR_API_KEY (X-WorldMonitor-Key
header) — the same allowlist the cron uses. Pro bearer tokens get the
standard cache-first path; refresh requires the seed service key.

Reviewer P2: the handler's atomic runRedisPipeline SET of ranking + meta
is non-transactional on Upstash REST — either SET can fail independently.
If the ranking landed but meta missed, the seeder's STRLEN verify would
pass (ranking present) while /api/health stays stuck on stale meta.

Two-part fix:
- Handler inspects pipelineResult[0] and [1] and logs a warning when
  either SET didn't return OK. Ops-greppable signal.
- Seeder's verify now checks BOTH keys in parallel: STRLEN on ranking
  data, and GET + fetchedAt freshness (<5min) on seed-meta. Partial
  publish logs a warning; next cron retries (SET is idempotent).

Tests:
- New: ?refresh=1 without/with-wrong X-WorldMonitor-Key must NOT trigger
  recompute (falls back to cached response). Existing bypass test updated
  to carry a valid seed key header.

Full resilience suite: 376/376 + 1 new = 377/377.
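The partial-publish check on the handler's two-command pipeline can be sketched as follows (assumed Upstash REST response shape: one entry per command, in command order; function name hypothetical):

```typescript
type PipelineEntry = { result?: unknown; error?: string };

// Index 0 = ranking SET, index 1 = seed-meta SET, per the pipeline order
// described above. The pipeline is non-transactional, so either command
// can fail independently.
function detectPartialPublish(pipelineResult: PipelineEntry[]): string[] {
  const failed: string[] = [];
  if (pipelineResult[0]?.result !== "OK") failed.push("resilience:ranking:v9");
  if (pipelineResult[1]?.result !== "OK") failed.push("seed-meta:resilience:ranking");
  return failed; // non-empty -> log an ops-greppable warning; next cron retries
}
```

Because SET is idempotent, retrying the whole pair on the next cron is safe even when only one key landed.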