Skip to content

feat: add direct_read map column optimization#2248

Merged
kodiakhq[bot] merged 6 commits into
mainfrom
aaron/kv-items-support
May 12, 2026
Merged

feat: add direct_read map column optimization#2248
kodiakhq[bot] merged 6 commits into
mainfrom
aaron/kv-items-support

Conversation

@knudtty
Copy link
Copy Markdown
Contributor

@knudtty knudtty commented May 8, 2026

Summary

We've known for a few weeks about an optimization that would allow for the direct_read optimization for the text index. This PR adds support for direct_read with the exact string matching in Lucene syntax based on a recommended column to add. Internal testing shows relevant queries sped up anywhere between 1.4-10x.

How to test locally

Steps:

  1. Checkout locally
  2. Log into clickhouse image via clickhouse-client
  3. Run ALTER TABLE otel_logs ADD COLUMN ResourceAttributeItems Array(String) MATERIALIZED arrayMap((k,v) -> concat(k, '=', v), mapKeys(ResourceAttributes), mapValues(ResourceAttributes)) CODEC(ZSTD(1));
  4. Run ALTER TABLE otel_logs ADD INDEX idx_res_attr_items ResourceAttributeItems TYPE text(tokenizer = 'array');
  5. Search for some exact ResourceAttributes element value
  6. Observe the generated sql includes has(ResourceAttributes, concat(<key>, '=', <value>)

References

  • Linear Issue: Closes HDX-4190

@changeset-bot
Copy link
Copy Markdown

changeset-bot Bot commented May 8, 2026

🦋 Changeset detected

Latest commit: 925f9f4

The changes in this PR will be included in the next version bump.

This PR includes changesets to release 1 package
Name Type
@hyperdx/common-utils Patch

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@vercel
Copy link
Copy Markdown

vercel Bot commented May 8, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
hyperdx-oss Ready Ready Preview, Comment May 12, 2026 6:39pm

Request Review

@knudtty knudtty requested review from a team and fleon and removed request for a team May 8, 2026 20:36
@github-actions github-actions Bot added the review/tier-2 Low risk — AI review + quick human skim label May 8, 2026
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

🟡 Tier 3 — Standard

Introduces new logic, modifies core functionality, or touches areas with non-trivial risk.

Why this tier:

  • Diff size: 271 production lines changed (Tier 2 max: < 250)

Review process: Full human review — logic, architecture, edge cases.
SLA: First-pass feedback within 1 business day.

Stats
  • Production files changed: 1
  • Production lines changed: 271 (+ 434 in test files, excluded from tier calculation)
  • Branch: aaron/kv-items-support
  • Author: knudtty

To override this classification, remove the review/tier-3 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@knudtty knudtty force-pushed the aaron/kv-items-support branch from 93a701a to ae7cac0 Compare May 8, 2026 20:37
@knudtty knudtty requested review from a team and teeohhem and removed request for a team and fleon May 8, 2026 20:37
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

E2E Test Results

All tests passed • 176 passed • 3 skipped • 1218s

Status Count
✅ Passed 176
❌ Failed 0
⚠️ Flaky 3
⏭️ Skipped 3

Tests ran across 4 shards in parallel.

View full report →

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

PR Review

  • ⚠️ PR description's ALTER TABLE uses mapKeys(X), mapValues(X) form, but the parser explicitly does NOT accept that form (see test returns undefined for mapKeys/mapValues form at queryParser.test.ts:1706). The parser only matches the tuple-cast form arrayMap((arr) -> concat(arr.1,'=',arr.2), X::Array(Tuple(String,String))). The doc comment at queryParser.ts:1041 also describes the mapKeys/mapValues form. → Either (a) verify ClickHouse normalizes default_expression to tuple-cast when stored (and update the comment/parser docs to reflect the stored form, not user input), or (b) extend KV_ITEMS_STRATEGIES with a parser for the mapKeys/mapValues form so the PR's documented usage actually triggers the optimization.

  • ⚠️ Silent false positives when keys or values contain the separator. arrayMap items are flattened to strings, so a Map with key 'a=b' and value 'c' produces item 'a=b=c', which matches a search for Col.a:"b=c" via has() even though Map['a'] would not equal 'b=c'. queryParser.ts:462 → Add a brief code comment near the has(...) emission noting this known limitation; consider whether the materialized expression should use a separator unlikely to appear in keys (\0 etc.) or whether to detect a user-chosen separator at config time.

  • ℹ️ Backtick-quoted identifiers in default_expression will fail tokenizeExpression (no backtick handling at queryParser.ts:848). If ClickHouse ever stores `LogAttributes` in the cast target, the optimization silently won't apply. Minor — easy to fix by skipping/stripping backticks in the tokenizer.

  • ✅ SQL safety looks good: key, separator, term all flow through SqlString.format placeholders.

  • ✅ Empty-term equality correctly preserves missing-key semantics in both positive and negated cases (with matching tests).

  • ✅ Fallback behavior (wrong tokenizer, wrong column, non-string value type, no KV column) is well-covered by tests.

  • ✅ Lookup is cached once per serializer via kvItemsLookupPromise, consistent with skipIndicesPromise / enableTextIndexPromise patterns.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 8, 2026

Deep Review

✅ No critical issues found.

🟡 P2 -- recommended

  • packages/common-utils/src/queryParser.ts:462-478 -- has(arr, concat(key, sep, value)) is not equivalent to Map['key'] = 'value' when the configured separator (= by default) can appear inside attribute keys or values: distinct (k, v) pairs materialize to the same indexed token, producing cross-row false positives or false negatives versus the legacy path.
    • Fix: Restrict the optimization to a separator-safe character (e.g. NUL) or post-filter the has() hit with the original Map[?] = ? equality so the index narrows candidates without changing the predicate's semantics.
    • correctness, adversarial
  • packages/common-utils/src/queryParser.ts:1357 -- The value-type extractor prefixMatch.type.match(/,\s+(\w+)\)$/) fails on Map(String, LowCardinality(String)) and Map(String, Nullable(String)), returning 'Unknown'; propertyType === JSDataType.String then fails and the optimization silently does nothing for the most common OTel attribute schemas.
    • Fix: Recursively unwrap LowCardinality(...) / Nullable(...) from the Map value type before passing to convertCHTypeToLuceneSearchType.
    • adversarial
  • packages/common-utils/src/queryParser.ts:469-477 -- Empty-term equality emits has(...) OR NOT mapContains(Map, 'key'), but NOT mapContains has no skip index, so the disjunction forces every granule to be read and the text(tokenizer=array) index becomes ineffective for the empty-term path -- the claimed 1.4-10x speedup likely does not apply here and may regress versus the prior Map['k']='' AND indexHint(mapContains(...)) shape that could leverage a bloom-filter on map keys.
    • Fix: Verify with EXPLAIN indexes=1 that the disjunction prunes granules; if it does not, fall through to the legacy Map[?] = '' path when term === ''.
    • performance
  • packages/common-utils/src/queryParser.ts:1031-1035 -- The constructor-time .catch on kvItemsLookupPromise swallows errors from metadata.getColumns / metadata.getSkipIndices and resolves to an empty Map, but no test rejects either promise, so a future regression where the error escapes (or the catch returns a non-Map) would silently break every query through this serializer.
    • Fix: Add a test that rejects getColumns and getSkipIndices and asserts the serializer still produces a working fallback Map[?] = ? query.
    • testing
🔵 P3 nitpicks (16)
  • packages/common-utils/src/queryParser.ts:848-906 -- tokenizeExpression returns null on any unrecognized character, so a default_expression that ClickHouse emits with backtick-quoted identifiers (`LogAttributes`::Array(...)) silently disables the optimization despite correct DDL.
    • Fix: Consume `…` as an identifier token (stripping the backticks) inside the tokenizer.
    • correctness, testing, adversarial
  • packages/common-utils/src/queryParser.ts:981, 1063-1071 -- KV_ITEMS_STRATEGIES = [parseKvItemsExpression] as const plus an IIFE-loop is a single-strategy abstraction with no second consumer; the indirection makes the call site harder to follow than a direct invocation.
    • Fix: Inline const parsed = parseKvItemsExpression(candidate.default_expression) and remove the strategies array until a second strategy actually exists.
    • maintainability, kieran-typescript
  • packages/common-utils/src/queryParser.ts:835-841, 917 -- KvItemsInfo, KvItemsLookup, and parseKvItemsExpression are exported but have no consumers outside this file and its test, widening the workspace-internal API surface for no current benefit.
    • Fix: Demote KvItemsInfo and KvItemsLookup to non-exported; keep parseKvItemsExpression exported only because the unit tests need it.
    • maintainability, kieran-typescript, api-contract
  • packages/common-utils/src/queryParser.ts:469-477 -- The negated empty-term branch re-builds SqlString.format('mapContains(??, ?)', ...) inline instead of reusing the notContains variable computed two lines earlier, so the two SQL strings must be diff'd by eye to confirm they target the same column/key.
    • Fix: Extract a single mapContainsExpr and build the two branches as ${mapContainsExpr} and NOT ${mapContainsExpr}.
    • maintainability
  • packages/common-utils/src/queryParser.ts:1086-1091 -- When two ALIAS/MATERIALIZED columns both parse to the same source map, the second iteration silently overwrites the first; whichever appears later in DESCRIBE ordering wins, which can pick the wrong separator if the columns disagree.
    • Fix: Detect collisions in lookup and either log a warning or refuse to install the optimization when more than one viable candidate maps to the same source column.
    • correctness, adversarial
  • packages/common-utils/src/queryParser.ts:991, 1032-1035 -- kvItemsLookupPromise is cached for the serializer lifetime; if the underlying ALIAS column is dropped between cache-build and query execution, the emitted SQL references a non-existent identifier and ClickHouse rejects the query, whereas the legacy path would have succeeded.
    • Fix: Either invalidate the lookup when schema metadata is refreshed, or wrap the optimization in a defensive fallback that degrades to Map[?] = ? on Unknown identifier errors.
    • adversarial
  • packages/common-utils/src/queryParser.ts:402, 735, 1371-1378 -- The intersection KvItemsInfo & { mapColumn: string; mapKey: string } is repeated verbatim in three positions; adding a required field later requires editing all three and is easy to miss.
    • Fix: Hoist the intersection into a named type alias (e.g. ResolvedKvItemsExpression) and reference it from each call site.
    • maintainability, kieran-typescript
  • packages/common-utils/src/queryParser.ts:843-978 -- A ~135-line hand-rolled lexer + recursive-descent parser is used to recognize a single fixed arrayMap((arr) -> concat(arr.1, '<sep>', arr.2), X::Array(Tuple(String, String))) shape; an anchored regex with two captures would deliver identical behavior in ~10 lines.
    • Fix: Either replace with a tightly anchored regex plus a small post-validation, or add a comment enumerating the parser pitfalls the hand-roll is guarding against.
    • maintainability
  • packages/common-utils/src/queryParser.ts:458-484 -- When kvItemsExpression and mapKeyIndexExpression are both populated (which is the normal case), the String branch silently wins for JSDataType.String while Bool/Number/JSON types fall through to the expressionPostfix path; the precedence and the rationale for dropping the index hint live only in the test expectations.
    • Fix: Document the precedence with a brief comment, or AND the two index hints together so both signals reach the planner.
    • maintainability
  • packages/common-utils/src/queryParser.ts:1371-1380 -- The conditional-spread ...(kvItemsInfo ? { kvItemsExpression: {…} } : {}) could be a direct ternary assignment now that the declared field is optional.
    • Fix: Write kvItemsExpression: kvItemsInfo ? { … } : undefined to keep the object shape stable.
    • kieran-typescript
  • packages/common-utils/src/queryParser.ts:917-978 -- parseKvItemsExpression accepts any quoted string as the separator, including the empty string; a misconfigured DDL with concat(arr.1, '', arr.2) would silently install an optimization whose token space collides catastrophically with arbitrary keys/values.
    • Fix: Reject zero-length separators (and optionally any separator that looks like a common attribute character) in parseKvItemsExpression.
    • adversarial
  • packages/common-utils/src/__tests__/queryParser.test.ts:1906-1910 -- buildSerializer types columns and skipIndices as any[], so a future production change that adds a required field to ColumnMeta / SkipIndexMetadata will not fail the tests.
    • Fix: Tighten to Partial<ColumnMeta>[] / Partial<SkipIndexMetadata>[] (or the real types).
    • kieran-typescript
  • packages/common-utils/src/__tests__/queryParser.test.ts:1700-1719 -- parseKvItemsExpression has ~12 structural short-circuits (trailing tokens, non-quoted separator, mismatched lambda variable, missing ::, etc.) but only the mapKeys/mapValues form and a single unrecognized-character case are asserted as rejected.
    • Fix: Add targeted tests for trailing-token rejection, non-constant separator argument, and mismatched lambda-variable usage between arr.1 and arr.2.
    • testing
  • packages/common-utils/src/__tests__/queryParser.test.ts:1777-1832 -- Only single-term queries are exercised against the kv-items index; a compound LogAttributes.a:"x" AND LogAttributes.b:"y" is not pinned, so a regression where the second branch loses lookup context would slip through.
    • Fix: Add a compound-AND test asserting both branches emit has(...) with the correct mapKey on each side.
    • testing
  • packages/common-utils/src/queryParser.ts:1057-1061 -- The default_type filter excludes 'DEFAULT' (and empty) but no test asserts that a default_type: 'DEFAULT' column with the matching expression is correctly skipped, so a future widening of the filter would not be caught.
    • Fix: Add a test fixture with default_type: 'DEFAULT' and assert the optimization does not engage.
    • testing
  • packages/common-utils/src/queryParser.ts:1075-1084 -- The index-match step requires normalizeChExpression(idx.expression) === normalizeChExpression(candidate.name), so an index declared as tokens(LogAttributeItems) or lower(LogAttributeItems) is silently considered non-matching even though it covers the column; users who already have such an index will get no optimization.
    • Fix: Document this as an intentional limitation (transformed indices cannot satisfy has()), or expand the match to unwrap simple top-level wrappers.
    • testing

Reviewers (8): correctness, testing, maintainability, kieran-typescript, performance, adversarial, api-contract, learnings-researcher.

Testing gaps:

  • No EXPLAIN indexes=1 test confirming has(arr, concat(literal, literal, literal)) actually triggers the text(tokenizer=array) skip index when all concat arguments are constants.
  • No end-to-end test for Map(LowCardinality(String), String) or Map(String, Nullable(String)) -- the most common OTel attribute schemas.
  • No test for default_expression returned with backtick-quoted source column name.
  • No test for separator-in-key or separator-in-value collision.
  • No test rejecting metadata.getColumns / metadata.getSkipIndices to exercise the swallowed-error path.

@github-actions github-actions Bot added review/tier-3 Standard — full human review required and removed review/tier-2 Low risk — AI review + quick human skim labels May 11, 2026
@kodiakhq kodiakhq Bot merged commit aa1a852 into main May 12, 2026
19 checks passed
@kodiakhq kodiakhq Bot deleted the aaron/kv-items-support branch May 12, 2026 18:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

automerge review/tier-3 Standard — full human review required

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants