
fix(health/seeders): clear DEGRADED — suppressed spread, bot-walled NZ, FRED proxy TLS retry #4007

Merged
koala73 merged 4 commits into main from feat/health-degraded-triage on Jun 1, 2026


@koala73
Owner

@koala73 koala73 commented Jun 1, 2026

Summary

Triaged the live DEGRADED /api/health against Railway logs + cron history. Three genuine code defects fixed here; the remaining crit (militaryCii) is infra-only and noted below.

Fix 1 — consumerPricesSpread wrongly crit (api/health.js)

The consumer-prices aggregate job deliberately writes retailer_spread_pct: 0 when a market's retailers share < MIN_SPREAD_ITEMS (4) common basket items (aggregate.ts:247). AE's cross-retailer overlap fell 2/4 -> 1/4 -> 0/4 while the publish job stayed fresh (seedAgeMin=111, well inside maxStaleMin=1500) and sibling keys published normally. Health nevertheless classified the present-but-zero key as EMPTY_DATA (crit), which tipped the endpoint to DEGRADED.

Fix: added a zero-record-only exemption for consumerPricesSpread (ZERO_RECORD_DATA_OK_KEYS) while keeping the missing-payload branch strict. That means recordCount: 0 is OK while the seed is fresh only when the payload key exists; a missing retailer-spread payload still classifies EMPTY, and STALE_SEED still fires if the publish job stops.
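
The rule above can be sketched roughly as follows. This is a minimal illustration, not the actual api/health.js code: classifyKey is named in the review flowchart, but its signature, the argument shape, and the status strings other than those quoted in this PR are assumptions, and the separate EMPTY_DATA_OK_KEYS handling is left out for brevity.

```javascript
// Hypothetical sketch; the real classifier in api/health.js may differ.
// Keys whose payload may legitimately hold zero records while the seed is fresh.
const ZERO_RECORD_DATA_OK_KEYS = new Set(['consumerPricesSpread']);

// Assumed inputs: hasData (the payload key exists), records (record count),
// seedAgeMin / maxStaleMin (minutes since the last seed and the allowed limit).
function classifyKey(name, { hasData, records, seedAgeMin, maxStaleMin }) {
  const seedStale = seedAgeMin > maxStaleMin;
  // Missing payload stays strict: the zero-record exemption never applies here.
  if (!hasData) return 'EMPTY';
  // Present-but-zero is only acceptable for explicitly exempted keys.
  if (records === 0 && !ZERO_RECORD_DATA_OK_KEYS.has(name)) return 'EMPTY_DATA';
  // Exempted or populated keys still go stale if the publish job stops.
  return seedStale ? 'STALE_SEED' : 'OK';
}

// Suppressed spread while the seed is fresh (values taken from the PR text).
const status = classifyKey('consumerPricesSpread', {
  hasData: true, records: 0, seedAgeMin: 111, maxStaleMin: 1500,
});
console.log(status);
```

Keeping the missing-payload branch ahead of the exemption check is what prevents the allow-list from hiding a payload that was never written.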

Fix 2 — fuel-prices blocked by JS-bot-walled NZ (scripts/seed-fuel-prices.mjs)

mbie.govt.nz moved its whole domain behind an Incapsula JS bot-wall (~2026-05-20) — returns a 212B _Incapsula_Resource stub, not CSV. Verified blocked from residential IP, datacenter IP, AND the Decodo proxy (data.govt.nz/figure.nz also walled). NZ failing rejected the whole validated publish every run -> 12d STALE_SEED. Added 'New Zealand' to TOLERATED_FAILURES (Brazil precedent; gate not weakened — NZ+Mexico still rejects). Proper restore tracked in #4010.
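
A rough sketch of the tolerated-failure gate, assuming the validation reduces to "no untolerated source failed". TOLERATED_FAILURES and the untoleratedFailures check come from this PR; the function name canPublish and its single-argument shape are hypothetical, and the real validateFuel in scripts/seed-fuel-prices.mjs likely checks more than this.

```javascript
// Hypothetical sketch; only TOLERATED_FAILURES is a name from the PR.
// Sources that are structurally unreachable and must not gate the publish.
const TOLERATED_FAILURES = new Set(['Brazil', 'New Zealand']);

// Returns true when every failed source is on the tolerated list.
function canPublish(failedSources) {
  const untoleratedFailures = failedSources.filter(
    (source) => !TOLERATED_FAILURES.has(source),
  );
  return untoleratedFailures.length === 0;
}

console.log(canPublish(['New Zealand']));           // NZ alone is tolerated
console.log(canPublish(['New Zealand', 'Mexico'])); // Mexico is not tolerated
```

Because the check filters rather than short-circuits on the first tolerated name, a tolerated failure never masks an untolerated one in the same run.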

Fix 3 — seed-economy FRED proxy TLS-tear not retried (scripts/_seed-utils.mjs)

seed-economy has been running every ~15min and failing every run (red dot; ~2m34s vs ~37s healthy = a step hanging to timeout) -> fredBatch + economicStress + macroSignals stale.

Root cause: fredFetchJson's transient classifier. The Decodo proxy flaps mid-TLS-handshake; the failing-run logs show hundreds of "tls_get_more_records: packet length too long" and "Client network socket disconnected before secure TLS connection was established" errors. Neither matched the old 5xx|522|timeout|ECONNRESET|ETIMEDOUT|EAI_AGAIN regex, so transient=false -> the 3x proxy retry broke on attempt 1 and fell to a direct FRED fetch, which Railway's datacenter IP gets rate-limited on -> all 24 series burned their full retry+20s-timeout budget (the ~2min slowdown) and the batch came back empty. (FRED_API_KEY is valid, tested HTTP 200, so it's transport, not auth.)

Fix: extracted isTransientProxyError() and added the TLS-tear/socket signatures so the proxy is retried (it rotates exit IP per attempt -> a fresh IP usually completes the handshake, which is why the captured window still hit 24/24 via retries). seed-economy's pipeline is already fail-soft, so restoring FRED success is the fix.
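
For illustration, a predicate along these lines would classify the two log signatures as transient. It is a sketch under assumptions, not the code in scripts/_seed-utils.mjs: only the function name and the quoted error strings come from this PR, "5xx" is interpreted here as any HTTP 5-hundred status, and the exact pattern list is hypothetical.

```javascript
// Hypothetical sketch; the real isTransientProxyError() may differ.
const TRANSIENT_PATTERN = new RegExp(
  [
    // Signatures from the old classifier ("5xx" read as HTTP 500-599).
    'HTTP 5\\d\\d', '522', 'timeout', 'ECONNRESET', 'ETIMEDOUT', 'EAI_AGAIN',
    // TLS-tear and socket signatures quoted from the failing-run logs.
    'tls_get_more_records', 'packet length too long',
    'socket disconnected before secure TLS connection',
  ].join('|'),
  'i',
);

// Accepts an Error or any other thrown value and tests its message text.
function isTransientProxyError(err) {
  const message = err instanceof Error ? err.message : String(err ?? '');
  return TRANSIENT_PATTERN.test(message);
}

console.log(isTransientProxyError(
  new Error('Client network socket disconnected before secure TLS connection was established'),
));
console.log(isTransientProxyError(new Error('HTTP 404')));
```

Extracting the predicate into its own function makes it testable against real log strings without standing up a proxy.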

NOT in this PR

  • militaryCii: seed-military-cii.mjs was never provisioned as a Railway cron (flagged in the commit message of PR #3864, "feat(cii): unify the Country Instability Index (Phases 0-3b)"). No seed-meta was ever written, yet the key is still live-consumed by get-risk-scores.ts. -> provision the cron.
  • Decodo proxy TLS stability: this PR retries transient TLS-handshake tears so seed-economy survives the current flapping proxy behavior. It does not cure an upstream proxy/provider degradation; if the TLS failures become persistent, the job will still burn the bounded retry budget and needs provider/proxy follow-up.

Test plan

  • tests/health-classify.test.mjs 16/16 (3 new, including the missing-payload regression)
  • tests/seed-fuel-prices.test.mjs 14/14 (4 new, fail->pass + NZ+Mexico still-rejects)
  • tests/fred-proxy-transient-classify.test.mjs 3/3 (TLS-tear strings from the real logs; fail->pass against old regex)
  • seed-utils consumers (envelope/retry/dockerfile-import guards) 31/31 · full health suite 38/38
  • docs:check · biome · esbuild bundle — clean

Follow-up for NZ restore: #4010

…EMPTY_DATA

/api/health was DEGRADED with consumerPricesSpread=EMPTY_DATA (crit). Root
cause is a health false-positive, not a pipeline outage: the consumer-prices
aggregate job deliberately writes retailer_spread_pct: 0 when a market's
retailers share < MIN_SPREAD_ITEMS (4) common basket items
(consumer-prices-core/src/jobs/aggregate.ts:247 — "prevent stale noisy value
persisting", logs "spread suppressed (N/4 common items)"). The AE basket's
cross-retailer overlap shrank 2/4 → 1/4 → 0/4 over late-May 2026 while the
publish job kept running fresh (seedAgeMin=111, well inside maxStaleMin=1500)
and every sibling key (overview/categories/movers/basket-series) published
normally. health then classified the present-but-zero key as EMPTY_DATA (crit)
and tipped the whole endpoint to DEGRADED.

Add consumerPricesSpread to EMPTY_DATA_OK_KEYS — same shape as the existing
newsThreatSummary precedent (quiet periods = 0, valid). 0 records is OK while
the seed is fresh; STALE_SEED still fires if the publish job itself stops, so a
real outage is not masked.

Tests (tests/health-classify.test.mjs): suppressed-spread-while-fresh → OK
(proven to FAIL as EMPTY_DATA without this change), and suppressed-spread-gone-
stale → STALE_SEED (exemption does not mask a publish outage).

NOTE — the other 3 DEGRADED crits are runtime/infra, tracked separately, NOT in
this PR:
  - fredBatch + economicStress: seed-economy Railway service stopped at
    2026-05-29 22:01Z (last-run + 3264min == health checkedAt). FRED proxy
    errors in its logs are a red herring — fredFetchJson retries direct and
    succeeded 24/24 every run. Needs service restart.
  - militaryCii: seed-military-cii.mjs was never provisioned as a Railway cron
    (PR #3864's own commit message flags this: "the seed job must be wired into
    the Railway schedule … that config lives in the Railway dashboard"). No
    seed-meta has ever been written. Needs the cron provisioned.
@vercel

vercel Bot commented Jun 1, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| worldmonitor | Ready | Preview, Comment | Jun 1, 2026 5:51am |


@greptile-apps
Contributor

greptile-apps Bot commented Jun 1, 2026

Greptile Summary

This PR fixes a health-check false positive where consumerPricesSpread was classified as EMPTY_DATA (crit) when the aggregate job deliberately suppresses the spread to 0 records due to insufficient retailer overlap — a valid data-coverage state, not a pipeline outage.

  • Adds 'consumerPricesSpread' to EMPTY_DATA_OK_KEYS with a detailed inline comment explaining the suppression semantics, consistent with the existing newsThreatSummary precedent.
  • Adds two regression tests: suppressed-spread-while-fresh → OK, and suppressed-spread-gone-stale (2000 min > maxStaleMin 1500) → STALE_SEED, confirming the exemption does not mask a genuine publish-job outage.

Confidence Score: 5/5

Safe to merge — the change is a one-line addition to a well-understood allow-list, the classifier logic is untouched, and both happy-path and outage-detection paths are covered by new tests.

The fix is minimal and narrowly scoped: adding one key to EMPTY_DATA_OK_KEYS targets exactly the hasData=true, records=0 branch, leaving all other classification paths unchanged. The stale-seed guard remains active, so a stopped publish job still surfaces as a warn. The seed-meta key used in both tests matches the SEED_META_KEYS definition in health.js, and the staleness boundary (2000 min vs maxStaleMin: 1500) is correct.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| api/health.js | Adds 'consumerPricesSpread' to EMPTY_DATA_OK_KEYS so that a suppressed-spread payload (hasData=true, records=0) classifies as OK while fresh and STALE_SEED when the publish job stops; no logic change to the classifier itself. |
| tests/health-classify.test.mjs | Adds two targeted regression tests: suppressed-spread-while-fresh → OK, and suppressed-spread-gone-stale → STALE_SEED; both use the correct seed-meta key and maxStaleMin boundary (2000 min > 1500 min). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["classifyKey('consumerPricesSpread', ...)"] --> B{hasData?}
    B -- "No (key absent)" --> C{EMPTY_DATA_OK_KEYS?}
    B -- "Yes (296-byte payload)" --> D{records == 0?}
    D -- No --> E{seedStale?}
    D -- "Yes (spread suppressed)" --> F{EMPTY_DATA_OK_KEYS?}
    F -- "No (before fix)" --> G["EMPTY_DATA → CRIT ❌"]
    F -- "Yes (after fix: consumerPricesSpread added)" --> H{seedStale?}
    H -- false --> I["OK ✅"]
    H -- true --> J["STALE_SEED → WARN ⚠️"]
    C -- Yes --> K{seedStale?}
    K -- false --> L["OK ✅"]
    K -- true --> M["STALE_SEED → WARN ⚠️"]
    E -- false --> N["OK ✅"]
    E -- true --> O["STALE_SEED → WARN ⚠️"]

Reviews (1): Last reviewed commit: "fix(health): treat suppressed retailer-s..."

…locked

fuelPrices has been STALE_SEED for ~12 days. Root cause: mbie.govt.nz moved its
entire domain (apex, data page, weekly-table.csv asset) behind an Incapsula
(Imperva) JS bot-wall ~2026-05-20. The source now returns HTTP 200 text/html
(~212B `_Incapsula_Resource` stub) instead of CSV — verified blocked from a
residential IP, a datacenter IP, AND the Decodo residential proxy (the older
"proxy-preferred + retry" path was written for an IP-reputation 403, which a JS
challenge is not; data.govt.nz / figure.nz fallbacks are also Incapsula-walled).

fetchNewZealand() returns [] → NZ lands in failedSources → validateFuel rejected
the WHOLE multi-source snapshot every run (untoleratedFailures.length !== 0) →
"validation failed (empty data) — seed-meta NOT refreshed" → 12d STALE_SEED,
even though ≥30 countries + US/GB/MY were present.

Add 'New Zealand' to TOLERATED_FAILURES — exactly the existing Brazil precedent
(a structurally unreachable source must not gate the validated publish). NZ
still runs every cycle and is carried automatically the moment the source
returns CSV again. The gate is NOT weakened for real outages: an untolerated
failure (e.g. Mexico) still rejects even alongside a tolerated NZ.

Restoring NZ properly needs a headless/challenge-solver fetch (FlareSolverr /
browserless / Zyte) — tracked as a follow-up, out of scope here.

Tests (tests/seed-fuel-prices.test.mjs): NZ-only-failed → accepts (proven to
FAIL without this change), Brazil+NZ both failed → accepts, NZ+Mexico → still
rejects (tolerating NZ doesn't mask an untolerated critical-source outage).
@koala73 koala73 changed the title from "fix(health): treat suppressed retailer-spread (0 records) as OK, not EMPTY_DATA" to "fix(health): clear two DEGRADED false-positives — suppressed retailer-spread + JS-bot-walled NZ fuel source" on Jun 1, 2026
…ailing

seed-economy has been failing on Railway every ~15min run (red dot in the cron
list; runs take ~2m34s vs ~37s healthy = a step hanging to timeout), so
fredBatch + economicStress + macroSignals go stale → /api/health DEGRADED.

Root cause is the proxy retry classifier in fredFetchJson. The Decodo proxy
flaps mid-TLS-handshake; the failing-run logs show hundreds of:
  - `...SSL routines:tls_get_more_records:packet length too long...`
  - `Client network socket disconnected before secure TLS connection was
    established`
Neither matched the old transient regex (`HTTP 5xx|522|timeout|ECONNRESET|
ETIMEDOUT|EAI_AGAIN`), so `transient` was false → the 3× proxy retry loop broke
on attempt 1 and fell straight to a DIRECT FRED fetch. Direct fetches from
Railway's datacenter IP get rate-limited/blocked, so each of the 24 series
burned its full retry+20s-direct-timeout budget (the ~2min slowdown) and the
batch came back empty. (FRED_API_KEY is valid — verified HTTP 200 — so this is
transport, not auth.)

Fix: extract the predicate to an exported `isTransientProxyError()` and add the
TLS-tear + socket signatures so the proxy is retried (it rotates exit IP per
attempt — a fresh IP usually completes the handshake, which is why the captured
window still hit 24/24 via retries). seed-economy's section pipeline is already
fail-soft (Promise.allSettled + per-series try/catch + extra-key writes before
the validate-gated primary), so restoring FRED fetch success is the fix.
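
The retry behaviour described in this commit message can be sketched as below, assuming a loop that retries the proxy only on transient errors and otherwise falls through to a direct fetch. All names here (fetchWithProxyRetry, fetchViaProxy, fetchDirect, isTransient) are hypothetical stand-ins injected as parameters; the real fredFetchJson is not shown in this PR page and may be structured differently.

```javascript
// Hypothetical sketch of the proxy-retry-then-direct flow; not fredFetchJson itself.
async function fetchWithProxyRetry({ fetchViaProxy, fetchDirect, isTransient, maxAttempts = 3 }) {
  for (let attempt = 1; attempt <= maxAttempts; attempt += 1) {
    try {
      // Each attempt is assumed to go out through a freshly rotated exit IP.
      return await fetchViaProxy(attempt);
    } catch (err) {
      // A non-transient error ends the proxy loop immediately.
      if (!isTransient(err)) break;
    }
  }
  // Proxy attempts exhausted or aborted: fall back to a direct fetch.
  return fetchDirect();
}

// Usage with stubbed fetchers: the proxy tears TLS twice, then succeeds.
fetchWithProxyRetry({
  fetchViaProxy: async (attempt) => {
    if (attempt < 3) throw new Error('tls_get_more_records: packet length too long');
    return 'via-proxy';
  },
  fetchDirect: async () => 'direct',
  isTransient: (err) => /tls_get_more_records/.test(err.message),
}).then((result) => console.log(result));
```

With a predicate that misses the TLS signature, the same stubs would break out after the first attempt and return the direct result, which is the failure mode this commit describes.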

Test (tests/fred-proxy-transient-classify.test.mjs): the exact TLS-tear strings
from the logs classify transient (proven to FAIL against the old regex), classic
5xx/timeout still transient, and 4xx/missing-key/empty stay non-transient.
@koala73 koala73 changed the title from "fix(health): clear two DEGRADED false-positives — suppressed retailer-spread + JS-bot-walled NZ fuel source" to "fix(health/seeders): clear DEGRADED — suppressed spread, bot-walled NZ, FRED proxy TLS retry" on Jun 1, 2026
@koala73 koala73 merged commit 66f1a77 into main Jun 1, 2026
17 checks passed
@koala73 koala73 deleted the feat/health-degraded-triage branch June 1, 2026 06:57