Skip to content

[HDX-4383] Skip string inference when body parses as JSON with a level field#2363

Merged
kodiakhq[bot] merged 4 commits into
mainfrom
dhable/HDX-4383-skip-string-inference-on-json-level
May 28, 2026
Merged

[HDX-4383] Skip string inference when body parses as JSON with a level field#2363
kodiakhq[bot] merged 4 commits into
mainfrom
dhable/HDX-4383-skip-string-inference-on-json-level

Conversation

@dhable

@dhable dhable commented May 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes a customer-reported bug where structured JSON logs are tagged with the
wrong severity because the OTel collector's string-keyword inference picks up
incidental severity words inside the body. For example, a Grafana sidecar log
with body {"level":"INFO", "msg":"... mimir-alertmanager-dashboard ..."}
gets SeverityText="fatal", SeverityNumber=21 today because the inference
regex \b(alert|crit|emerg|fatal|error|err|warn|notice|debug|dbug|trace)
matches alert at the start of alertmanager. The JSON-derived level: INFO
field is ignored entirely.

This PR adds a new OTTL log_statements block between the existing JSON-parse
block and the string-inference block in docker/otel-collector/config.yaml.
The new block promotes a JSON-derived level/severity field to
log.severity_text, which causes the string-inference block to be skipped via
its existing severity_number == 0 and severity_text == "" guard.

What it does

  • Reads only from log.attributes, which is where Block 1 (JSON parse +
    flatten(), or OTel map body merge) puts the parsed fields. So the
    promotion fires only when a JSON body was actually parsed.
  • Case-insensitive across keys by enumerating common casings of common
    field names used by mainstream logging frameworks: level / Level /
    LEVEL (pino, winston, zerolog, zap, logrus, slog, Serilog, NLog),
    severity / Severity / SEVERITY (Datadog, GCP Cloud Logging), and
    log.level (Elastic ECS, flattened from nested JSON).
  • Priority-ordered cascade: each set self-guards on
    severity_text == "", so the first match wins (level > severity >
    log.level). The block as a whole is gated on no producer-set severity,
    so producer-supplied values are always preserved.
  • severity_number mapping uses the same case-insensitive (?i) regex
    keyword set as the existing string-inference block, so values like
    "Information", "WARN", "Critical" resolve correctly. Unrecognized
    values (e.g. "verbose") fall through to INFO — same default as Block 2.
  • severity_text is normalized by the unchanged Block 3
    (ConvertCase(severity_text, "lower")), so "INFO" becomes "info".

Behavior matrix

Scenario Result
{"level":"warn", ...} ("warn", 13)
{"Level":"Information", ...} (Serilog) ("information", 9)
{"SEVERITY":"ERROR", ...} (GCP) ("error", 17)
{"log":{"level":"fatal"}, ...} (ECS, flattened) ("fatal", 21)
{"level":"verbose", ...} (unknown text) ("verbose", 9) — INFO default
{"msg":"something error happened"} (no level field) ("error", 17) — string inference still runs
Non-JSON body Unchanged — string inference still runs
Producer pre-set severity_text or severity_number Producer values preserved

How to test on Vercel preview

N/A — non-UI change. This modifies the OTel collector's OTTL config; no
frontend code is touched.

Testing

  • New make ci-lint and make ci-unit both green.
  • New bats smoke tests (8 cases) under
    smoke-tests/otel-collector/data/severity-inference/from-json-*/:
    • from-json-level/ — lowercase level
    • from-json-level-pascalcase/ — Serilog Level: "Information"
    • from-json-severity-uppercase/ — GCP SEVERITY: "ERROR"
    • from-json-ecs-log-level/ — Elastic ECS nested log.level
    • from-json-level-unknown/ — unrecognized value falls back to INFO
    • from-json-no-level/ — JSON body without level → string inference runs
    • from-json-level-producer-wins/ — producer-set severity preserved
    • from-json-level-body-keyword-conflict/ — the exact customer payload
      from the linked support escalation (body has alertmanager-dashboard
      plus level: INFO); confirms the customer scenario no longer
      reproduces.
  • All 21 bats tests in severity-inference.bats, auto-parse-json.bats,
    and normalize-severity.bats pass against a freshly-rebuilt collector
    image.

References

  • Linear Issue: HDX-4383
  • Support escalation: ClickHouse/support-escalation#7812

…s level/severity field (HDX-4383)

When the log body parses as JSON and contains a level-like field, promote
that value to log.severity_text so the downstream string-keyword inference
block is short-circuited via its existing 'severity_number == 0 and
severity_text == ""' guard.

The previous behavior scanned the raw body string for severity keywords with
a \b-prefixed regex, which produced false positives whenever a body
incidentally contained a word starting with a keyword (e.g. 'alertmanager'
matching 'alert' -> FATAL). With this fix, structured logs with a level
field get their severity from the field, not from substrings of the body.

Implementation:
- Insert a new OTTL log_statements block between the existing JSON-parse
  block and the string-inference block.
- The block is gated on no producer-set severity, then runs a priority
  cascade over common level/severity field names from mainstream logging
  frameworks (pino, winston, zerolog, zap, logrus, slog, Serilog,
  GCP Cloud Logging, Elastic ECS), in three case variants each:
  level / Level / LEVEL, severity / Severity / SEVERITY, log.level.
- Severity_number is mapped via case-insensitive (?i) regex matches that
  mirror the existing string-inference keyword set. Unrecognized level
  values get severity_number defaulted to INFO (mirrors block 2's else).
- Final lowercase normalization block (ConvertCase) is unchanged.

Tests:
- Eight new bats smoke tests under smoke-tests/otel-collector covering
  lowercase level, PascalCase Level (Serilog), uppercase SEVERITY (GCP),
  flattened log.level (ECS), unknown level fallthrough, JSON without level
  (string inference still runs), producer-set severity precedence, and
  the exact Grafana/mimir-alertmanager customer payload from the linked
  support escalation.
- All 17 severity-inference + 3 auto-parse + 1 normalize-severity bats
  tests pass against a freshly-built collector image.
@dhable dhable added the ai-generated AI-generated content; review carefully before merging. label May 28, 2026
@changeset-bot

changeset-bot Bot commented May 28, 2026

Copy link
Copy Markdown

🦋 Changeset detected

Latest commit: 73889e0

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 3 packages
Name Type
@hyperdx/otel-collector Patch
@hyperdx/api Patch
@hyperdx/app Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel

vercel Bot commented May 28, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 28, 2026 8:57pm
hyperdx-storybook Error Error May 28, 2026 8:57pm

Request Review

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

E2E Test Results

All tests passed • 193 passed • 3 skipped • 1305s

Status Count
✅ Passed 193
❌ Failed 0
⚠️ Flaky 2
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@dhable dhable marked this pull request as ready for review May 28, 2026 18:32
@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 28, 2026
@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Critical-path files (1):
    • docker/otel-collector/config.yaml

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 26
  • Production lines changed: 452
  • Branch: dhable/HDX-4383-skip-string-inference-on-json-level
  • Author: dhable

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions

github-actions Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Deep Review

🟡 P2 -- recommended

  • docker/otel-collector/config.yaml:58-69 -- the new severity-number IsMatch rules drop both the \b word boundary used by the existing inference block and the per-rule guard, so a JSON level value that contains a keyword as a non-word-boundary substring (e.g. "deferred", "interrupted", "stacktrace") matches the wrong bin, and any compound value (e.g. "warn_info", "trace_error") is reclassified by whichever rule fires last.

    • Fix: anchor each pattern with \b...\b and add and log.severity_number == 0 to every set so the first match wins, mirroring the priority pattern used by the string-inference block below.
    • correctness, testing, maintainability, reliability, adversarial
  • smoke-tests/otel-collector/severity-inference.bats:106 -- no fixture covers the cascade-priority contract (multiple level-like keys present simultaneously) or the non-string level fallback (numeric level from bunyan/pino), so a future reorder or accidental drop of a self-guard would silently change behavior with green tests.

    • Fix: add fixtures for {"level":"warn","severity":"error"} (expect warn) and {"level":30,"msg":"connection error"} (expect fallback to body-string inference, error/17).
    • testing, correctness, reliability, adversarial
🔵 P3 nitpicks (5)
  • docker/otel-collector/config.yaml:30-56 -- promoted severity_text is kept verbatim, so logs end up with non-canonical values like "information", "warning", "verbose", "notice" (confirmed by the new fixtures) instead of the canonical six OTel severity strings the rest of the pipeline emits.
    • Fix: after each successful severity-number classification, normalize severity_text to the canonical short form (fatal/error/warn/info/debug/trace) like the string-inference block does.
    • correctness, reliability
  • docker/otel-collector/config.yaml:30-56 -- all seven promotion rules guard on IsString(...), so numeric levels emitted by bunyan/pino ({"level":30}) and other native OTLP-style numeric severities are skipped without any documentation or fallback handling.
    • Fix: either add a numeric-level mapping (e.g., 10/20/30/40/50/60 → trace/debug/info/warn/error/fatal) or call out the limitation in the changeset and a comment.
    • correctness, adversarial
  • docker/otel-collector/config.yaml:18-20 -- block-level guard only inspects the log record's severity_number, so a JSON body shaped like {"severity_number":17,"level":"info"} silently drops the producer-intended severity_number=17 and resolves to INFO via the promoted level.
    • Fix: if surfacing severity_number from JSON attributes is in scope for this fix, promote it too and gate text promotion on severity_number == 0; otherwise leave a comment naming the limitation.
  • docker/otel-collector/config.yaml:30-56 -- promotion accepts any non-empty string, so whitespace-only or numeric-string levels (e.g., " ", "30") propagate to severity_text unchanged and end up in storage / UI filter dropdowns as malformed values.
    • Fix: at the fallthrough on line 75, additionally set severity_text to "info" (or trim and re-check) so the column never carries garbage.
    • adversarial
  • smoke-tests/otel-collector/severity-inference.bats:106 -- three of the seven enumerated key variants (LEVEL, lowercase severity, PascalCase Severity) and the DEBUG/TRACE numeric mapping from JSON-promoted text have no direct fixtures; a typo in one of the unverified set statements would not surface.
    • Fix: add one combined fixture exercising each untested variant and assert each row's severity_text/SeverityNumber against the expected mapping.

Reviewers (6): correctness, testing, maintainability, project-standards, reliability, adversarial.

Testing gaps:

  • Cascade priority when both level and severity keys are present.
  • Non-string level value (numeric bunyan/pino) and empty-string level value.
  • DEBUG/TRACE classification from JSON-promoted severity_text.
  • JSON body delivered as an OTLP map (not stringified JSON), to exercise the merge_maps(log.attributes, log.body, ...) branch with the new block.
  • Untested key variants LEVEL, severity, Severity.

@wrn14897 wrn14897 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the fix

@kodiakhq kodiakhq Bot merged commit ad3f1c9 into main May 28, 2026
18 of 19 checks passed
@kodiakhq kodiakhq Bot deleted the dhable/HDX-4383-skip-string-inference-on-json-level branch May 28, 2026 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ai-generated AI-generated content; review carefully before merging. automerge review/tier-4 Critical — deep review + domain expert sign-off

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants