fix(health/seeders): clear DEGRADED — suppressed spread, bot-walled NZ, FRED proxy TLS retry #4007
…EMPTY_DATA
/api/health was DEGRADED with consumerPricesSpread=EMPTY_DATA (crit). Root
cause is a health false-positive, not a pipeline outage: the consumer-prices
aggregate job deliberately writes retailer_spread_pct: 0 when a market's
retailers share < MIN_SPREAD_ITEMS (4) common basket items
(consumer-prices-core/src/jobs/aggregate.ts:247 — "prevent stale noisy value
persisting", logs "spread suppressed (N/4 common items)"). The AE basket's
cross-retailer overlap shrank 2/4 → 1/4 → 0/4 over late-May 2026 while the
publish job kept running fresh (seedAgeMin=111, well inside maxStaleMin=1500)
and every sibling key (overview/categories/movers/basket-series) published
normally. health then classified the present-but-zero key as EMPTY_DATA (crit)
and tipped the whole endpoint to DEGRADED.
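The suppression behaviour described above can be sketched roughly as follows. Only `MIN_SPREAD_ITEMS` (4) and the `retailer_spread_pct: 0` write come from this commit message; the function name, the input shape (retailer → item → price) and the spread formula are hypothetical placeholders, not the code in aggregate.ts.

```javascript
// MIN_SPREAD_ITEMS = 4 is taken from the commit message.
const MIN_SPREAD_ITEMS = 4;

// Hypothetical helper: basketsByRetailer maps retailer name -> { item: price }.
function computeRetailerSpread(basketsByRetailer) {
  const baskets = Object.values(basketsByRetailer);

  // Items that every retailer in the market has a price for.
  const common = baskets.length === 0
    ? []
    : Object.keys(baskets[0]).filter((item) => baskets.every((b) => item in b));

  // Too little overlap: write an explicit 0 so a stale, noisy value
  // does not persist. The key is still published, just with no spread.
  if (baskets.length < 2 || common.length < MIN_SPREAD_ITEMS) {
    return { retailer_spread_pct: 0, suppressed: true, commonItems: common.length };
  }

  // Placeholder formula (an assumption): gap between the cheapest and the
  // most expensive common-basket total, as a percentage of the cheapest.
  const totals = baskets.map((b) => common.reduce((sum, item) => sum + b[item], 0));
  const min = Math.min(...totals);
  const max = Math.max(...totals);
  return {
    retailer_spread_pct: ((max - min) / min) * 100,
    suppressed: false,
    commonItems: common.length,
  };
}
```

The point of the sketch is that a zero here is a deliberate, freshly written value, which is why a health check that treats "present but zero" as an outage misreads it.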
Add consumerPricesSpread to EMPTY_DATA_OK_KEYS — same shape as the existing
newsThreatSummary precedent (quiet periods = 0, valid). 0 records is OK while
the seed is fresh; STALE_SEED still fires if the publish job itself stops, so a
real outage is not masked.
Tests (tests/health-classify.test.mjs): suppressed-spread-while-fresh → OK
(proven to FAIL as EMPTY_DATA without this change), and suppressed-spread-gone-
stale → STALE_SEED (exemption does not mask a publish outage).
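A minimal sketch of the classification this commit describes, assuming a simplified signature. The key names, `EMPTY_DATA_OK_KEYS`, and the `EMPTY_DATA` / `STALE_SEED` / `OK` statuses come from the PR text; the parameter object, the use of a `Set`, and the `EMPTY` status for an absent non-exempt key are assumptions, and the real classifyKey in api/health.js may differ.

```javascript
// Keys for which zero records is a valid state (both named in the PR text).
const EMPTY_DATA_OK_KEYS = new Set(['newsThreatSummary', 'consumerPricesSpread']);

// Assumed signature: the real classifyKey arguments are not shown in the PR.
function classifyKey(name, { hasData, records, seedAgeMin, maxStaleMin }) {
  const emptyOk = EMPTY_DATA_OK_KEYS.has(name);
  const seedStale = seedAgeMin > maxStaleMin;

  // Key absent and not exempt (status name is an assumption).
  if (!hasData && !emptyOk) return 'EMPTY';

  // Payload present but zero records, and not exempt: the crit that
  // tipped /api/health to DEGRADED before this change.
  if (hasData && records === 0 && !emptyOk) return 'EMPTY_DATA';

  // Exempt or populated keys still go through the stale-seed guard,
  // so a stopped publish job is not masked.
  return seedStale ? 'STALE_SEED' : 'OK';
}
```

With `seedAgeMin=111` and `maxStaleMin=1500` the suppressed spread classifies as `OK`; at 2000 minutes it classifies as `STALE_SEED`. In this form the exemption also covers an absent key; the PR summary describes a later narrowing to a zero-record-only exemption.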
NOTE — the other 3 DEGRADED crits are runtime/infra, tracked separately, NOT in
this PR:
- fredBatch + economicStress: seed-economy Railway service stopped at
2026-05-29 22:01Z (last-run + 3264min == health checkedAt). FRED proxy
errors in its logs are a red herring — fredFetchJson retries direct and
succeeded 24/24 every run. Needs service restart.
- militaryCii: seed-military-cii.mjs was never provisioned as a Railway cron
(PR #3864's own commit message flags this: "the seed job must be wired into
the Railway schedule … that config lives in the Railway dashboard"). No
seed-meta has ever been written. Needs the cron provisioned.
Greptile Summary
This PR fixes a health-check false positive where
Confidence Score: 5/5
Safe to merge: the change is a one-line addition to a well-understood allow-list, the classifier logic is untouched, and both happy-path and outage-detection paths are covered by new tests.
The fix is minimal and narrowly scoped: adding one key to EMPTY_DATA_OK_KEYS targets exactly the hasData=true, records=0 branch, leaving all other classification paths unchanged. The stale-seed guard remains active, so a stopped publish job still surfaces as a warn. The seed-meta key used in both tests matches the SEED_META_KEYS definition in health.js, and the staleness boundary (2000 min vs maxStaleMin: 1500) is correct.
No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["classifyKey('consumerPricesSpread', ...)"] --> B{hasData?}
B -- "No (key absent)" --> C{EMPTY_DATA_OK_KEYS?}
B -- "Yes (296-byte payload)" --> D{records == 0?}
D -- No --> E{seedStale?}
D -- "Yes (spread suppressed)" --> F{EMPTY_DATA_OK_KEYS?}
F -- "No (before fix)" --> G["EMPTY_DATA → CRIT ❌"]
F -- "Yes (after fix: consumerPricesSpread added)" --> H{seedStale?}
H -- false --> I["OK ✅"]
H -- true --> J["STALE_SEED → WARN ⚠️"]
C -- Yes --> K{seedStale?}
K -- false --> L["OK ✅"]
K -- true --> M["STALE_SEED → WARN ⚠️"]
E -- false --> N["OK ✅"]
E -- true --> O["STALE_SEED → WARN ⚠️"]
Reviews (1): Last reviewed commit: "fix(health): treat suppressed retailer-s..."
…locked
fuelPrices has been STALE_SEED for ~12 days. Root cause: mbie.govt.nz moved its entire domain (apex, data page, weekly-table.csv asset) behind an Incapsula (Imperva) JS bot-wall ~2026-05-20. The source now returns HTTP 200 text/html (a ~212B `_Incapsula_Resource` stub) instead of CSV. Verified blocked from a residential IP, a datacenter IP, AND the Decodo residential proxy. The older "proxy-preferred + retry" path was written for an IP-reputation 403, which a JS challenge is not; the data.govt.nz / figure.nz fallbacks are also Incapsula-walled.
Failure chain: fetchNewZealand() returns [] → NZ lands in failedSources → validateFuel rejected the WHOLE multi-source snapshot every run (untoleratedFailures.length !== 0) → "validation failed (empty data) — seed-meta NOT refreshed" → 12d STALE_SEED, even though ≥30 countries + US/GB/MY were present.
Add 'New Zealand' to TOLERATED_FAILURES, exactly the existing Brazil precedent (a structurally unreachable source must not gate the validated publish). NZ still runs every cycle and is carried automatically the moment the source returns CSV again. The gate is NOT weakened for real outages: an untolerated failure (e.g. Mexico) still rejects even alongside a tolerated NZ.
Restoring NZ properly needs a headless/challenge-solver fetch (FlareSolverr / browserless / Zyte). That is tracked as a follow-up and is out of scope here.
Tests (tests/seed-fuel-prices.test.mjs): NZ-only-failed → accepts (proven to FAIL without this change), Brazil+NZ both failed → accepts, NZ+Mexico → still rejects (tolerating NZ doesn't mask an untolerated critical-source outage).
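The failure gate described above might look roughly like this. `TOLERATED_FAILURES`, the `untoleratedFailures.length !== 0` rejection and the country names come from the commit message; the `Set` representation and both helper names are hypothetical, and the data checks that validateFuel also performs are left out.

```javascript
// 'Brazil' is the existing precedent; 'New Zealand' is what this commit adds.
// Whether the real constant is a Set or an array is an assumption.
const TOLERATED_FAILURES = new Set(['Brazil', 'New Zealand']);

// Hypothetical helper: failed sources that should still block the publish.
function untoleratedFailures(failedSources) {
  return failedSources.filter((source) => !TOLERATED_FAILURES.has(source));
}

// Hypothetical stand-in for the failure gate inside validateFuel.
// The snapshot is rejected as soon as one untolerated source has failed.
function passesFailureGate(failedSources) {
  return untoleratedFailures(failedSources).length === 0;
}
```

Under these assumptions `passesFailureGate(['New Zealand'])` accepts, while `passesFailureGate(['New Zealand', 'Mexico'])` still rejects, which mirrors the test cases listed in the commit message.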
…ailing
seed-economy has been failing on Railway every ~15min run (red dot in the cron
list; runs take ~2m34s vs ~37s healthy = a step hanging to timeout), so
fredBatch + economicStress + macroSignals go stale → /api/health DEGRADED.
Root cause is the proxy retry classifier in fredFetchJson. The Decodo proxy
flaps mid-TLS-handshake; the failing-run logs show hundreds of:
- `...SSL routines:tls_get_more_records:packet length too long...`
- `Client network socket disconnected before secure TLS connection was
established`
Neither matched the old transient regex (`HTTP 5xx|522|timeout|ECONNRESET|
ETIMEDOUT|EAI_AGAIN`), so `transient` was false → the 3× proxy retry loop broke
on attempt 1 and fell straight to a DIRECT FRED fetch. Direct fetches from
Railway's datacenter IP get rate-limited/blocked, so each of the 24 series
burned its full retry+20s-direct-timeout budget (the ~2min slowdown) and the
batch came back empty. (FRED_API_KEY is valid — verified HTTP 200 — so this is
transport, not auth.)
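To illustrate why a misclassified error skips the proxy retries, here is a sketch of a proxy-first fetch with a direct fallback. It is not the actual fredFetchJson: the function name, the injected `viaProxy` / `direct` / `isTransient` callbacks and the option names are hypothetical; only the three proxy attempts and the break-on-non-transient behaviour are taken from the description above.

```javascript
// Hypothetical wrapper: try the proxy up to maxProxyAttempts times, then go direct.
async function fetchWithProxyFallback({ viaProxy, direct, isTransient, maxProxyAttempts = 3 }) {
  for (let attempt = 1; attempt <= maxProxyAttempts; attempt++) {
    try {
      // Each attempt is assumed to get a fresh proxy exit IP.
      return await viaProxy(attempt);
    } catch (err) {
      // A non-transient error ends the proxy loop immediately. If a TLS
      // tear is classified as non-transient, this fires on attempt 1.
      if (!isTransient(err)) break;
    }
  }
  // Proxy retries exhausted or abandoned: fall back to a direct fetch.
  return direct();
}
```

With a classifier that returns `false` for the TLS-tear messages, the loop exits after one proxy attempt and every series lands on the direct path, which matches the slowdown described above.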
Fix: extract the predicate to an exported `isTransientProxyError()` and add the
TLS-tear + socket signatures so the proxy is retried (it rotates exit IP per
attempt — a fresh IP usually completes the handshake, which is why the captured
window still hit 24/24 via retries). seed-economy's section pipeline is already
fail-soft (Promise.allSettled + per-series try/catch + extra-key writes before
the validate-gated primary), so restoring FRED fetch success is the fix.
Test (tests/fred-proxy-transient-classify.test.mjs): the exact TLS-tear strings
from the logs classify transient (proven to FAIL against the old regex), classic
5xx/timeout still transient, and 4xx/missing-key/empty stay non-transient.
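A sketch of what such a predicate could look like, assuming it matches on the error message. The function name and the two log strings are from this commit message; both regular expressions are assumptions (the commit writes the old pattern in shorthand as `HTTP 5xx|...`), so the real patterns in scripts/_seed-utils.mjs may differ.

```javascript
// Assumed rendering of the old transient pattern described in the commit message.
const OLD_TRANSIENT = /HTTP 5\d\d|522|timeout|ECONNRESET|ETIMEDOUT|EAI_AGAIN/i;

// Assumed patterns for the two TLS-tear / socket signatures seen in the logs.
const TLS_TEAR =
  /tls_get_more_records|packet length too long|socket disconnected before secure TLS connection/i;

// Accepts an Error, a string, or nothing; the real function is exported,
// module syntax is omitted here to keep the sketch standalone.
function isTransientProxyError(err) {
  const message = String((err && err.message) || err || '');
  return OLD_TRANSIENT.test(message) || TLS_TEAR.test(message);
}
```

Against these assumed patterns, both log lines quoted above classify as transient, `HTTP 503` stays transient, and `HTTP 404` or an empty message stay non-transient.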
Summary
Triaged the live DEGRADED `/api/health` against Railway logs + cron history. Three genuine code defects fixed here; the remaining crit (`militaryCii`) is infra-only and noted below.
Fix 1: `consumerPricesSpread` wrongly crit (`api/health.js`)
The consumer-prices aggregate job deliberately writes `retailer_spread_pct: 0` when a market's retailers share < `MIN_SPREAD_ITEMS` (4) common basket items (`aggregate.ts:247`). AE's cross-retailer overlap fell `2/4 -> 1/4 -> 0/4` while the publish job stayed fresh (`seedAgeMin=111` << 1500) and siblings published normally, yet health crit'd the present-but-zero key as `EMPTY_DATA` -> DEGRADED.
Fix: added a zero-record-only exemption for `consumerPricesSpread` (`ZERO_RECORD_DATA_OK_KEYS`) while keeping the missing-payload branch strict. That means `recordCount: 0` is OK while the seed is fresh only when the payload key exists; a missing retailer-spread payload still classifies `EMPTY`, and `STALE_SEED` still fires if the publish job stops.
Fix 2: fuel-prices blocked by JS-bot-walled NZ (`scripts/seed-fuel-prices.mjs`)
`mbie.govt.nz` moved its whole domain behind an Incapsula JS bot-wall (~2026-05-20) and now returns a 212B `_Incapsula_Resource` stub, not CSV. Verified blocked from residential IP, datacenter IP, AND the Decodo proxy (data.govt.nz/figure.nz also walled). NZ failing rejected the whole validated publish every run -> 12d STALE_SEED. Added `'New Zealand'` to `TOLERATED_FAILURES` (Brazil precedent; gate not weakened, since NZ+Mexico still rejects). Proper restore tracked in #4010.
Fix 3: seed-economy FRED proxy TLS-tear not retried (`scripts/_seed-utils.mjs`)
seed-economy has been running every ~15min and failing every run (red dot; ~2m34s vs ~37s healthy = a step hanging to timeout) -> `fredBatch` + `economicStress` + `macroSignals` stale. Root cause: `fredFetchJson`'s transient classifier. The Decodo proxy flaps mid-TLS-handshake; the failing-run logs show hundreds of `tls_get_more_records: packet length too long` and `Client network socket disconnected before secure TLS connection was established`. Neither matched the old `5xx|522|timeout|ECONNRESET|ETIMEDOUT|EAI_AGAIN` regex, so `transient=false` -> the 3x proxy retry broke on attempt 1 and fell to a direct FRED fetch, which Railway's datacenter IP gets rate-limited on -> all 24 series burned their full retry+20s-timeout budget (the ~2min slowdown) and the batch came back empty. (`FRED_API_KEY` is valid, tested HTTP 200, so it's transport, not auth.)
Fix: extracted `isTransientProxyError()` and added the TLS-tear/socket signatures so the proxy is retried (it rotates exit IP per attempt -> a fresh IP completes the handshake, which is why the captured window still hit 24/24 via retries). seed-economy's pipeline is already fail-soft, so restoring FRED success is the fix.
NOT in this PR
- `militaryCii`: `seed-military-cii.mjs` was never provisioned as a Railway cron (the commit message of PR #3864, "feat(cii): unify the Country Instability Index (Phases 0-3b)", flags it). No seed-meta ever written; key still live-consumed by `get-risk-scores.ts`. -> provision the cron.
Test plan
- `tests/health-classify.test.mjs` 16/16 (3 new, including the missing-payload regression)
- `tests/seed-fuel-prices.test.mjs` 14/14 (4 new, fail->pass + NZ+Mexico still-rejects)
- `tests/fred-proxy-transient-classify.test.mjs` 3/3 (TLS-tear strings from the real logs; fail->pass against old regex)
Follow-up for NZ restore: #4010