Skip to content

Health and Observability

Bob edited this page Jun 25, 2026 · 4 revisions

Health & Observability

Three real gaps got found and closed during a deliberate operational-maturity review — not in response to a reported failure, but because the question "would I actually know if this silently broke?" turned out to have an honest "no" for several things. This page covers what /health and /logs/stats report today, and the reasoning behind each piece.

/health — is everything actually reachable right now

{
  "status": "ok",
  "kiwix_books_loaded": 27,
  "cache_entries": 12,
  "cache_max_size": 500,
  "routing_cache_entries": 340,
  "routing_cache_max_size": 1000,
  "snapshot_jobs": { "...": "see below" },
  "adversarial_testing": { "...": "see below" },
  "temporal_pattern_detection": { "...": "see below" },
  "sources": { "...": "see below" }
}

sources runs a real, live check against every configured backend — not a check that a config value is merely present. _check_kiwix() actually queries the catalog; _check_llm() actually pings the configured model endpoint; _check_web() actually hits SearXNG. This is genuinely useful precisely because it catches real, current failures: a SearXNG instance whose request timeout is set too aggressively, or a Docker network that two containers have silently fallen off of, both show up here as "status": "error" with the actual underlying error message attached — not after search results degrade and someone has to debug it by hand, but immediately, the next time anyone checks.

cache_entries / cache_max_size / routing_cache_entries / routing_cache_max_size — see Caching for what these caches actually do. The pairing of current count against configured max exists specifically so growth toward either bound is visible at a glance, without needing to dig through logs or read code to even know a bound exists.

snapshot_jobs — see below.

adversarial_testing — same ok / stale / never_ran shape as snapshot_jobs, plus total_combinations_tried and flagged_for_review counts. See Adversarial Self-Testing for the full feature; flagged combinations themselves are reviewed via GET /adversarial/flagged, not through /health directly.

temporal_pattern_detection — same overall shape, plus one status the other two jobs don't have: insufficient_data, reported when the job ran successfully but the real event volume in its most recent window was below the floor needed to test even one pair. This is the expected, honest state for the first weeks of this feature's life on real homelab data volumes, not a failure — see Cross-Source Temporal Pattern Detection for the full set of states and what each one actually means.

Background job health

Every snapshot job already caught its own exceptions internally and just logged a warning on failure. That's good defensive coding for "don't crash the scheduler" — but it also meant a job that started failing on every single run would produce zero externally visible signal beyond a log line nobody was necessarily watching. The scheduler object itself had no external visibility either; it's a local variable inside main.py's startup code, never exposed to any endpoint.

get_snapshot_job_health() closes this using data that already existed — every snapshot is timestamped and stored, so there was no need for new instrumentation, just a query comparing "when did this job last actually succeed" against "how often is it supposed to run":

{
  "uptime":   {"status": "ok", "last_snapshot": "...", "minutes_since_last_snapshot": 1.3, "expected_interval_minutes": 2},
  "forecast": {"status": "ok", "...": "..."},
  "news":     {"status": "stale", "minutes_since_last_snapshot": 240.0, "expected_interval_minutes": 60},
  "ha":       {"status": "never_ran", "expected_interval_minutes": 5}
}

Four possible states: ok (recent enough), stale (more than 3x its expected interval since the last success — a generous grace window meant to absorb normal jitter without false-alarming on a slightly delayed scheduler tick), never_ran (zero snapshots ever stored for this source), and unknown (a corrupted or unparseable timestamp — degrades gracefully rather than raising an exception that would take down the rest of /health over one bad row).

/logs/stats — fallback visibility

Routing falls back from kiwix or news to web when a result looks empty. Whether that happened on any given query used to be completely invisible outside of reading raw logs — source_used correctly reported the actual source after a fallback, but nothing recorded that a fallback occurred at all.

{
  "fallback_count": 4,
  "fallback_rate_pct": 2.1,
  "fallback_by_target": {
    "kiwix_or_news_fallback_to_web": 4
  }
}

This is tracked with a single boolean column (fallback_occurred) on the query log, computed by comparing the source that was originally intended (from detect_intent() for auto requests, or the explicit request source otherwise) against the source that actually ended up answering. Deliberately not done by changing route_with_source()'s own return signature — that function recurses into itself at four internal call sites for conditional detection's condition and remainder handling, so widening its return type would have meant touching every one of those, a much larger and riskier change than a post-hoc comparison needed to be.

Why the breakdown says kiwix_or_news_fallback_to_web, not separate counts for eachkiwix and news both fall back to the same target, web. A single boolean genuinely cannot tell you which of the two originally intended sources triggered any specific fallback; querying per-original-source would run the identical SQL query under two different labels and double-count the same underlying rows. Reporting an honest, combined label is the correct tradeoff here — it's less granular than attributing fallbacks to a specific source, but it's true, which a confidently-wrong per-source breakdown would not have been.

/logs/stats also reports latency_by_source (average latency per source) and top_queries (your 10 most-repeated queries, each with its hit count and average latency).

latency_by_source is an overall average — it includes both cold queries (a real backend call) and warm ones (a cache hit), not warm-only. That's deliberate: a warm-only number would look almost identical across every source regardless of how slow that source actually is when it has real work to do, since cache hits are fast no matter the source.

Each entry in top_queries reports the source that answered it most recently, not whichever source happened to answer it most often historically. This matters if routing behavior for a query has changed over time — you'll see what it does now, not a stale answer from before a routing change.

/logs — the raw query log

GET /logs?limit=N returns recent entries directly: timestamp, query, source requested, source used, cached flag, success, latency in milliseconds, and fallback_occurred. limit is clamped to a sane range (1-1000) — SQLite treats a negative LIMIT as "no limit at all," so ?limit=-1 used to return the entire query log rather than the bounded, recent-entries view this endpoint is meant to provide. Useful for spot-checking a specific recent query's behavior without waiting for it to show up in aggregate stats.

A real example of this paying off immediately

Within minutes of /health first reporting snapshot_jobs and the cache fields, it also caught something unrelated and already real: a SearXNG request_timeout setting that had genuinely been edited correctly in the config file, but never took effect because the container running SearXNG had never actually been restarted since the edit. /health reported web: error, Read timed out (read timeout=3) — the literal old value, live, immediately — rather than that mismatch surfacing only the next time someone happened to notice degraded search results and traced it back by hand.

Clone this wiki locally