
feat: use text index to power filters and autocomplete#2376

Open
knudtty wants to merge 12 commits into main from aaron/fts-index-as-aggregate-view

Conversation

@knudtty

@knudtty knudtty commented May 29, 2026

Contributor

Summary

The new text indexes can power filters and autocomplete and reduce reliance on the metadataMVs. So let's do it!

References

  • Linear Issue: Closes HDX-4229

@changeset-bot

changeset-bot Bot commented May 29, 2026

⚠️ No Changeset found

Latest commit: f39aa83

Merging this PR will not cause a version bump for any packages. If these changes should not result in a new version, you're good to go. If these changes should result in a version bump, you need to add a changeset.

This PR includes no changesets

When changesets are added to this PR, you'll see the packages that this PR includes changesets for and the associated semver types

@vercel

vercel Bot commented May 29, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

2 Skipped Deployments

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| hyperdx-oss | Ignored | Preview | Jun 8, 2026 4:03pm |
| hyperdx-storybook | Ignored | Preview | Jun 8, 2026 4:03pm |

@github-actions github-actions Bot added the review/tier-4 Critical — deep review + domain expert sign-off label May 29, 2026
@github-actions

github-actions Bot commented May 29, 2026

Contributor

🔴 Tier 4 — Critical

Touches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD.

Why this tier:

  • Large diff: 1174 production lines changed (threshold: 1000)

Review process: Deep review from a domain expert. Synchronous walkthrough may be required.
SLA: Schedule synchronous review within 2 business days.

Stats
  • Production files changed: 4
  • Production lines changed: 1174 (+ 981 in test files, excluded from tier calculation)
  • Branch: aaron/fts-index-as-aggregate-view
  • Author: knudtty

To override this classification, remove the review/tier-4 label and apply a different review/tier-* label. Manual overrides are preserved on subsequent pushes.

@github-actions

github-actions Bot commented May 29, 2026

Contributor

E2E Test Results

All tests passed • 191 passed • 3 skipped • 1310s

| Status | Count |
| --- | --- |
| ✅ Passed | 191 |
| ❌ Failed | 0 |
| ⚠️ Flaky | 4 |
| ⏭️ Skipped | 3 |

Tests ran across 4 shards in parallel.

View full report →

@github-actions

github-actions Bot commented May 29, 2026

Contributor

Deep Review

🔴 P0/P1 — must fix

  • packages/common-utils/src/core/metadata.ts:498 — getMapKeys text-index queries have no try/catch, so a transient error from mergeTreeTextIndex propagates out instead of falling through to the MV/scan tiers as the sibling getMapValues cascade does.
    • Fix: Wrap the keysIndex block (lines ~498-518) and the itemsIndex block (lines ~519-539) in try/catch with console.warn, mirroring the pattern in getMapValues at lines 819-865.
    • correctness, reliability
  • packages/common-utils/src/core/metadata.ts:955 — getMapValues main-table-scan fallback uses Promise.all(keys.map(...)); one rejected per-key query discards every successful sibling result, concentrating blast radius vs. the prior per-key API.
    • Fix: Replace Promise.all with Promise.allSettled (or wrap each per-key body in try/catch returning []) and aggregate the successful entries.
    • adversarial, performance, reliability
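
As a rough illustration of the Promise.allSettled fix suggested above, the sketch below keeps every successful per-key result and reports the keys that failed. `collectMapValues` and `fetchValuesForKey` are hypothetical names standing in for the per-key ClickHouse scan; they are not taken from the PR.

```typescript
// Hypothetical per-key fetcher; stands in for the per-key ClickHouse scan.
type FetchValues = (key: string) => Promise<string[]>;

// Run every per-key query, keep the ones that succeed, and report the rest.
async function collectMapValues(
  keys: string[],
  fetchValuesForKey: FetchValues,
): Promise<{ values: Map<string, string[]>; failedKeys: string[] }> {
  // allSettled never rejects; results come back in the same order as `keys`.
  const settled = await Promise.allSettled(keys.map(k => fetchValuesForKey(k)));
  const values = new Map<string, string[]>();
  const failedKeys: string[] = [];
  settled.forEach((result, i) => {
    const key = keys[i];
    if (result.status === 'fulfilled') {
      values.set(key, result.value);
    } else {
      // One failed key no longer discards its siblings' results.
      failedKeys.push(key);
    }
  });
  return { values, failedKeys };
}
```

Whether failed keys should be logged, retried, or surfaced to the caller is a design choice this sketch leaves open.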

🟡 P2 — recommended

  • packages/common-utils/src/core/metadata.ts:786 — getMapValues no longer wraps the query path in cache.getOrFetch; every call re-runs the full cascade and the fetchPerKeyFallback paths at lines 1670/1815 issue uncached ClickHouse queries on repeat.
    • Fix: Restore a result-level cache.getOrFetch keyed on (connectionId, db, table, column, sorted-keys, dateRange, timestampValueExpression), or cache inside the fallback paths.
    • correctness, kieran-typescript, performance, reliability
  • packages/common-utils/src/core/metadata.ts:847 — Text-index queries (this site and getMapKeys at lines ~505 and ~526) pass this.getClickHouseSettings() with no max_execution_time/timeout_overflow_mode, while the MV branch bounds execution at 15s; a slow mergeTreeTextIndex scan stalls to the server default before cascading.
    • Fix: Apply {...this.getClickHouseSettings(), timeout_overflow_mode: 'break', max_execution_time: 15, max_rows_to_read: '0'} consistent with the MV branch.
    • performance, reliability
  • packages/common-utils/src/core/metadata.ts:422 — partsOverlapFilter returns chSql`1` when timestampValueExpression is absent; the useMetadata.tsx -> getAllFieldsAndValues -> getMapValues chain never threads timestampValueExpression, so text-index queries scan ALL parts even when the caller passed a narrow dateRange.
    • Fix: Thread timestampValueExpression through getAllFieldsAndValues -> getMapValues, or have partsOverlapFilter resolve it from source metadata when not supplied.
    • adversarial, correctness
  • packages/common-utils/src/core/metadata.ts:835 — Text-index split uses position(token, sep), which returns the FIRST occurrence; a map whose KEY contains the separator literal (e.g., key a=b with separator =) parses as key a / value b=c while the OR-chain startsWith still matches the row, returning corrupted pairs only on the text-index path.
    • Fix: Compute the split offset from each prefixed key's known length (the OR-chain already constrains the prefix), or validate at index-discovery time that map keys cannot contain the separator.
    • adversarial, correctness
  • packages/common-utils/src/core/metadata.ts:515 — getMapKeys text-index and rollup branches use cache.get + manual cache.set instead of cache.getOrFetch, losing the pendingQueries dedup that previously joined concurrent identical fetches; two simultaneous autocomplete callers now fire identical mergeTreeTextIndex queries.
    • Fix: Wrap each branch's query in await this.cache.getOrFetch(cacheKey, async () => {...}).
    • performance
  • packages/common-utils/src/core/metadata.ts:863 — New catch blocks at lines 578, 863, 922 blanket-swallow ALL errors as console.warn, including non-recoverable failures (auth, syntax, abort); the user-facing error eventually surfaces from the final tier and the root cause is invisible, doubling time-to-error.
    • Fix: Re-throw on AbortError and on non-transient ClickHouse error codes (ACCESS_DENIED, SYNTAX_ERROR), or attach an aggregated tier-failure diagnostic to the final thrown error.
    • reliability
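
The separator-in-key finding (metadata.ts:835) can be reproduced outside ClickHouse. The sketch below is a TypeScript analogue, not the PR's SQL: `splitAtFirstSeparator` mimics what position(token, sep) does, and `splitByRequestedKey` shows the suggested alternative of splitting at the known length of a requested key. Both function names are hypothetical, and trying the longest key first is an assumption made here to resolve overlapping keys.

```typescript
// Mirrors position(token, sep): split at the FIRST occurrence of the separator.
function splitAtFirstSeparator(token: string, sep: string): [string, string] | null {
  const i = token.indexOf(sep);
  return i < 0 ? null : [token.slice(0, i), token.slice(i + sep.length)];
}

// Split at the known length of a requested key instead of searching for the
// separator. Longest keys are tried first (an assumption of this sketch) so
// that a key such as "a=b" wins over its own prefix "a".
function splitByRequestedKey(
  token: string,
  sep: string,
  requestedKeys: string[],
): [string, string] | null {
  const longestFirst = [...requestedKeys].sort((a, b) => b.length - a.length);
  for (const key of longestFirst) {
    const prefix = key + sep;
    if (token.startsWith(prefix)) {
      return [key, token.slice(prefix.length)];
    }
  }
  return null;
}

// Token for map entry { "a=b": "c" } with separator "=" is "a=b=c".
const naive = splitAtFirstSeparator('a=b=c', '=');        // key "a", value "b=c"
const byKey = splitByRequestedKey('a=b=c', '=', ['a=b']); // key "a=b", value "c"
```

A token such as "a=b=c" stays ambiguous when both "a" and "a=b" are requested, so validating at index-discovery time that keys cannot contain the separator remains the stricter option.
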

🔵 P3 nitpicks (19)

  • packages/common-utils/src/queryParser.ts:19 — Re-export statements interleaved between two import blocks trip the project's simple-import-sort/imports: 'error' lint rule and the pre-commit hook will reformat.
    • Fix: Move export type { KvItemsInfo, KvItemsLookup } and export { parseKvItems... } below all imports.
    • maintainability, project-standards
  • packages/common-utils/src/core/metadata.ts:440 — getMapKeys does not accept or thread AbortSignal, while getMapValues was updated to accept one; aborts from autocomplete UI leave getMapKeys queries running server-side.
    • Fix: Add signal?: AbortSignal to getMapKeys and thread it into all three clickhouseClient.query calls.
    • reliability
  • packages/common-utils/src/core/metadata.ts:1720 — The new fetchPerKeyFallback repeats the spread-after-defaults clickhouse_settings pattern, so user settings silently override the in-function safety caps (same shape as the pre-existing site at line 951).
    • Fix: Put ...this.getClickHouseSettings() first, then the explicit max_rows_to_read and read_overflow_mode overrides last.
    • correctness
  • packages/api/src/tasks/checkAlerts/__tests__/renderAlertTemplate.test.ts:36 — Mock getMapValues: jest.fn().mockResolvedValue([]) returns a bare array, but the new production signature returns Map<string, string[]>; latent landmine for any future test that exercises a path consuming .get().
    • Fix: Change both renderAlertTemplate.test.ts:36 and checkAlerts.test.ts:1109 to mockResolvedValue(new Map()).
    • adversarial, api-contract, correctness, kieran-typescript, testing
  • packages/common-utils/src/core/kvItems.ts:241 — Two parallel dispatch shapes (KV_ITEMS_STRATEGIES array iterated inline in queryParser.ts, plus the parseKvItemsDefaultExpression dispatcher used by metadata.ts) duplicate the same strategy loop.
    • Fix: Drop the inline loop in queryParser.ts and call parseKvItemsDefaultExpression, or stop exporting one of the two shapes.
    • maintainability
  • packages/common-utils/src/core/metadata.ts:375 — Local norm = (s) => s.replace(/\s+/g, '').replace(/`/g, '') is byte-identical to normalizeChExpression in queryParser.ts; both expressions must stay in sync forever.
    • Fix: Hoist a single normalizeClickHouseExpression into core/utils.ts and import from both modules.
    • maintainability
  • packages/common-utils/src/core/metadata.ts:460 — getMapKeys cache key dropped the alignedDateRange MV branch; two callers in the same MV bucket no longer share cache entries.
    • Fix: Restore the aligned-bucket variant for the MV path or document the deliberate change.
    • correctness, maintainability
  • packages/common-utils/src/core/metadata.ts:582 — When getMapKeys text-index returns empty and the MV path succeeds, the MV value is cached under a shared key; a later call after the text index catches up still returns the cached MV result.
    • Fix: Distinguish cache entries by source path, or shorten TTL for fallback-tier results.
  • packages/common-utils/src/core/metadata.ts:1895 — Text-index cache key in getAllFieldsAndValues uses raw dateRange[0]/[1] while the MV cache key uses alignedDateRange; equivalent callers may double-cache identical results.
    • Fix: Use alignedDateRange for the text-index cache key, or document the divergence.
    • correctness, maintainability
  • packages/common-utils/src/core/metadata.ts:951 — settings spread ...this.getClickHouseSettings() after explicit max_rows_to_read/read_overflow_mode overrides the safety caps (pre-existing pattern, restated for parity with the new sibling at 1720).
    • Fix: Put the spread first, then the explicit overrides last.
    • correctness, kieran-typescript
  • packages/common-utils/src/core/metadata.ts:397 — getMapColumnTextIndexes silently drops a skip index when an alias matches by expression but its default_expression isn't a recognized KV-items shape.
    • Fix: Log a debug warning, or record a keys-only entry as a fallback.
    • correctness
  • packages/common-utils/src/core/metadata.ts:1947 — excludedIds.sort() mutates the array in place inside a cache-key template literal with no explanatory comment.
    • Fix: Use [...excludedIds].sort() and add // sort for deterministic cache key.
    • maintainability
  • packages/common-utils/src/core/metadata.ts:355 — getMapColumnTextIndexes is exposed publicly (no private modifier); if only intended for internal use plus tests, narrow the surface.
    • Fix: Mark private; tests can still spy via jest.spyOn(Metadata.prototype as any, 'getMapColumnTextIndexes').
    • api-contract
  • packages/common-utils/src/core/metadata.ts:134 — MapColumnTextIndexes allows the {} empty state because both fields are optional.
    • Fix: Tighten to a discriminated union requiring at least one of keysIndex or itemsIndex.
    • kieran-typescript, maintainability
  • packages/common-utils/src/core/kvItems.ts:71 — Tokenizer rejects backtick-quoted ClickHouse identifiers (e.g. `LogAttributes`), silently disabling text-index detection for those schemas.
    • Fix: Add a backtick-identifier branch to tokenizeExpression that reads `...` as a single identifier token and strips the backticks.
    • adversarial
  • packages/common-utils/src/core/metadata.ts:499 — Text-index keys-only path has no ORDER BY; the MV path ranks by sum(count) DESC, so autocomplete shows arbitrary (not popular) keys after this PR.
    • Fix: Accept the documented divergence or prefer the MV path when both are available for ranking quality.
    • adversarial, performance
  • packages/common-utils/src/__tests__/metadata.test.ts:1580 — Tests named "no cache poisoning" only verify a single invocation; a second-call probe is needed to prove the cache wasn't actually written.
    • Fix: After the first await, issue an identical call and assert that the underlying ClickHouse query fires again.
    • testing
  • packages/common-utils/src/__tests__/metadata.test.ts:2348 — Items-index .each table only covers default_type === 'ALIAS'; the MATERIALIZED predicate branch is uncovered.
    • Fix: Add a MATERIALIZED row to the parameterized table.
    • testing
  • packages/common-utils/src/__tests__/metadata.test.ts:2265 — (prototype.method as jest.Mock).mockRestore() + finally re-spy pattern leaks if finally doesn't run; couples test ordering to a global beforeAll spy shape.
    • Fix: Extract a withRealGetMapColumnTextIndexes(md, fn) helper, or scope the un-stubbing via beforeEach/afterEach inside the describe block.
    • maintainability, testing
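
The spread-order nitpicks (metadata.ts:951 and :1720) come down to plain object-literal semantics: later properties win. The sketch below is a minimal demonstration; `Settings`, `capsThenSpread`, `spreadThenCaps`, and the cap value '1000000' are illustrative assumptions, not the PR's actual code or limits.

```typescript
type Settings = Record<string, string | number>;

// Caps first, user settings spread last: a user-supplied value silently wins.
function capsThenSpread(userSettings: Settings): Settings {
  return {
    max_rows_to_read: '1000000', // illustrative cap, not the PR's value
    read_overflow_mode: 'break',
    ...userSettings,
  };
}

// User settings spread first, caps last: the in-function caps always apply.
function spreadThenCaps(userSettings: Settings): Settings {
  return {
    ...userSettings,
    max_rows_to_read: '1000000', // illustrative cap, not the PR's value
    read_overflow_mode: 'break',
  };
}

// Hypothetical user settings, standing in for this.getClickHouseSettings().
const user: Settings = { max_rows_to_read: '0', max_threads: 8 };
const overridden = capsThenSpread(user); // max_rows_to_read is '0'
const enforced = spreadThenCaps(user);   // max_rows_to_read is '1000000'
```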

Reviewers (9): adversarial, api-contract, correctness, kieran-typescript, maintainability, performance, project-standards, reliability, testing

Testing gaps:

  • No test covers a text-index tier failure in getMapKeys falling through to the MV/scan tiers (the cascade contract is not enforced by tests today).
  • No test exercises getMapValues Promise.all partial-failure semantics — one per-key query failing alongside successes.
  • No test covers getMapValues text-index with a multi-character separator (position(...) + separator.length arithmetic is untested for non-1 lengths).
  • No test for getMapValues({ keys: [] }) early-return or for getMapColumnTextIndexes engine === 'Distributed' early-return.
  • No test asserts AbortSignal propagation through getMapKeys (also blocked on threading the parameter through).

@greptile-apps

greptile-apps Bot commented Jun 8, 2026

Greptile Summary

This PR wires up ClickHouse mergeTreeTextIndex (available from CH 26.3) as the preferred source for map-column key and value lookups, with a three-tier fallback chain: text index → MV rollup → main-table scan. The SQL seed file is trimmed to remove ResourceAttributes, LogAttributes, and ScopeAttributes from the KV rollup MV, since those are now served by the text index.

  • New metadata helpers (getMapColumnTextIndexes, partsOverlapFilter) introspect skip indices to discover which map columns have array-tokenized text indexes, then scope queries to overlapping data parts when a timestamp expression is available.
  • getMapValues signature change: key?: string → keys: string[], batching all requested keys into a single OR-chain text-index query or a single IN clause MV query before falling back to parallel per-key scans.
  • KV items parsing extracted from queryParser.ts into a standalone kvItems.ts module and re-exported for backwards compatibility.
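
The three-tier fallback chain described above (text index → MV rollup → main-table scan) can be sketched generically. This is a simplified model under stated assumptions, not the PR's implementation: `Tier`, `CascadeResult`, and `runCascade` are hypothetical names, and the choice to let the last tier's empty result or error pass straight through is an assumption of the sketch.

```typescript
// Hypothetical tier shape: a name for diagnostics plus a query thunk.
interface Tier<T> {
  name: string;
  run: () => Promise<T[]>;
}

interface CascadeResult<T> {
  source: string;    // tier that produced the rows
  rows: T[];
  skipped: string[]; // earlier tiers that were empty or failed, with the reason
}

// Try tiers in order. An empty result or an error falls through to the next
// tier; the final tier's result (or error) is handed to the caller as-is.
async function runCascade<T>(tiers: Tier<T>[]): Promise<CascadeResult<T>> {
  const skipped: string[] = [];
  for (let i = 0; i < tiers.length; i++) {
    const tier = tiers[i];
    const isLast = i === tiers.length - 1;
    try {
      const rows = await tier.run();
      if (rows.length > 0 || isLast) return { source: tier.name, rows, skipped };
      skipped.push(`${tier.name}: empty`);
    } catch (err) {
      if (isLast) throw err;
      skipped.push(`${tier.name}: ${err instanceof Error ? err.message : String(err)}`);
    }
  }
  throw new Error('runCascade requires at least one tier');
}
```

Keeping the `skipped` reasons alongside the result is one way to avoid hiding the root cause when an earlier tier fails and a later one succeeds.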

Confidence Score: 3/5

Two concrete correctness defects exist in the primary new code path; Distributed-table deployments and setups with multi-character separators will encounter runtime ClickHouse errors when the text-index path is activated.

The Distributed-table guard in getMapColumnTextIndexes is effectively dead because getTableMetadata already resolves the engine to the local engine before the guard is evaluated. On Distributed-table setups, mergeTreeTextIndex ends up being called with the Distributed table coordinates, which ClickHouse rejects at runtime. Separately, getMapKeys uses splitByChar which only accepts a one-character delimiter, while getMapValues uses position()/substring() that handle any length — an inconsistency that becomes a runtime error if the separator is ever more than one character.

packages/common-utils/src/core/metadata.ts — specifically the Distributed-engine guard in getMapColumnTextIndexes and the splitByChar call in getMapKeys; docker/otel-collector/schema/seed/00006_otel_logs_rollups.sql for the MV impact on pre-26.3 deployments.

Important Files Changed

| Filename | Overview |
| --- | --- |
| packages/common-utils/src/core/metadata.ts | Core change: adds getMapColumnTextIndexes, partsOverlapFilter, and rewrites getMapKeys/getMapValues/getAllKeyValues/getAllFieldsAndValues to prefer mergeTreeTextIndex queries over the MV rollup. Contains a dead Distributed-engine guard (always false), a splitByChar single-char constraint mismatch, and dropped per-call caching in getMapValues. |
| packages/common-utils/src/core/kvItems.ts | New file: extracts KV items expression parsers from queryParser.ts into a standalone module. Logic is unchanged; well-tested. |
| packages/common-utils/src/core/clickhouseVersion.ts | Adds supportsMergeTreeTextIndex gating on CH >= 26.3. Simple, well-tested addition. |
| docker/otel-collector/schema/seed/00006_otel_logs_rollups.sql | Drops ResourceAttributes, LogAttributes, and ScopeAttributes UNION ALL branches from the KV rollup MV. Safe for existing deployments but fresh installs on CH < 26.3 lose rollup data for those columns. |
| packages/common-utils/src/queryParser.ts | KV items types and parsers moved to kvItems.ts; re-exported for backwards compatibility. No logic changes. |
| packages/common-utils/src/__tests__/metadata.test.ts | Extensive new test suite covering text-index paths, fallback ordering, cache isolation, and MV exclusion. Global beforeAll spy on getMapColumnTextIndexes ensures old tests remain unaffected. |
| packages/common-utils/src/__tests__/clickhouseVersion.test.ts | Adds parametrised tests for supportsMergeTreeTextIndex covering boundary versions around 26.3. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[getMapKeys / getMapValues] --> B{getMapColumnTextIndexes returns index?}
    B -- No --> E{metadataMVs + dateRange?}
    B -- keysIndex --> C[mergeTreeTextIndex SELECT token AS key]
    B -- itemsIndex --> D[mergeTreeTextIndex OR-chain startsWith]
    C -- empty --> D
    D -- empty --> E
    E -- yes --> F[KV Rollup MV query]
    E -- no --> G[Main-table scan]
    F -- empty --> G
    C -- non-empty --> Z[cache.set + return]
    D -- non-empty --> Z
    F -- non-empty --> Z
    G --> Z

Reviews (1): Last reviewed commit: "fix: deduplicate rollups by not using MV"

Comment on lines +362 to +370
async () => {
  const [tableMeta, version] = await Promise.all([
    this.getTableMetadata({ databaseName, tableName, connectionId }),
    this.getServerVersion({ connectionId }),
  ]);
  if (!supportsMergeTreeTextIndex(version)) return new Map();
  if (!tableMeta) return new Map();
  if (tableMeta.engine === 'Distributed') return new Map();

P1 Dead guard — Distributed engine check never fires

getTableMetadata (the public helper called on line 363) already resolves a Distributed table's engine field to the local engine (e.g. MergeTree) before returning. So by the time the guard on line 369 is reached, tableMeta.engine is always the local engine, and the === 'Distributed' branch is dead code.

The practical consequence is that when the caller is a Distributed table and text indices exist on the underlying local table, the code falls through to the mergeTreeTextIndex(databaseName, tableName, indexName) SQL using the Distributed table coordinates — which ClickHouse rejects because mergeTreeTextIndex only works on local MergeTree-family tables. Callers with Distributed tables that have been given text indices will hit a runtime error on every key/value lookup.

Fix: replace the call to the public getTableMetadata with the private queryTableMetadata call (which returns the raw system.tables row, so engine is still 'Distributed'), or inspect the create_local_table_query field (only populated for Distributed tables) as a proxy.

Comment on lines +521 to +522
SELECT splitByChar(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key
FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }})

P1 splitByChar requires a single-character delimiter

ClickHouse's splitByChar(sep, s) only accepts a one-character separator; passing a multi-character string throws Only a single character separator is allowed. The separator is extracted from the user's ALIAS/MATERIALIZED column definition and is not validated to be exactly one character.

getMapValues uses position() + substring() which work correctly for any separator length, making the two paths inconsistent. Using splitByString here matches getMapValues and handles any separator safely.

Suggested change
- SELECT splitByChar(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key
- FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }})
+ SELECT splitByString(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key
+ FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }})

Comment on lines +808 to 810
}): Promise<Map<string, string[]>> {
  if (keys.length === 0) return new Map();

P2 getMapValues no longer caches results

The previous implementation wrapped each per-key scan in cache.getOrFetch, so repeated calls (across components that each call getMapValues for the same column/key) hit the cache after the first query. The new batched implementation skips caching entirely in all three paths (text-index, MV, and main-table scan). Callers that invoke getMapValues independently for the same arguments — e.g., different parts of a filter panel — will always hit ClickHouse.

getAllFieldsAndValues wraps the combined result in cache.getOrFetch, which helps there, but getAllKeyValues and direct callers have no such protection.
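
For readers unfamiliar with the pattern, the sketch below shows roughly what a getOrFetch-style cache with in-flight deduplication does. `DedupCache` is a hypothetical stand-in written for this example; the project's real cache class and its internals are not shown in this thread, and TTL or eviction is left out entirely.

```typescript
// Minimal result cache that also joins concurrent identical fetches.
class DedupCache {
  private values = new Map<string, unknown>();
  private pending = new Map<string, Promise<unknown>>();

  async getOrFetch<T>(key: string, fetch: () => Promise<T>): Promise<T> {
    // 1. Serve a finished result if one exists.
    if (this.values.has(key)) return this.values.get(key) as T;
    // 2. Join a query that is already in flight for the same key.
    const inFlight = this.pending.get(key);
    if (inFlight) return inFlight as Promise<T>;
    // 3. Otherwise start the query and remember it until it settles.
    const started = fetch()
      .then(value => {
        this.values.set(key, value);
        return value;
      })
      .finally(() => {
        this.pending.delete(key);
      });
    this.pending.set(key, started);
    return started;
  }
}

// Sorting a copy of the keys keeps the cache key stable for any caller order.
function mapValuesCacheKey(table: string, column: string, keys: string[]): string {
  return `${table}.${column}.${[...keys].sort().join(',')}`;
}
```

A real cache key would also need the connection, database, date range, and timestamp expression, as the Deep Review comment on this same issue lists.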

@@ -45,30 +45,6 @@ GROUP BY ColumnIdentifier, Key, Timestamp;
-- Single MV: CTE with UNION ALL across all columns, then aggregate

P2 Removal of ResourceAttributes / LogAttributes / ScopeAttributes from the MV rollup affects fresh deployments on CH < 26.3

The SQL uses CREATE MATERIALIZED VIEW IF NOT EXISTS, so existing deployments keep the old MV unchanged. But newly-seeded deployments will have an MV that only populates NativeColumn, meaning the rollup path in getMapKeys / getAllFieldsAndValues returns empty results for the three attribute map columns on those deployments.

For a ClickHouse instance below 26.3 (where supportsMergeTreeTextIndex returns false), no text-index path is available either, so the fallback is an unbounded main-table scan for every autocomplete or filter request targeting those columns.
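
The interaction described here can be modelled in a few lines. The sketch below is an assumption-laden simplification: `supportsTextIndexSketch` only imitates the 26.3 gate (the real supportsMergeTreeTextIndex signature is not shown in this thread), and `availableTiers` ignores whether a text index actually exists on the column.

```typescript
// Imitates the CH >= 26.3 gate; accepts strings such as "26.3.1.123".
function supportsTextIndexSketch(version: string | undefined): boolean {
  if (!version) return false;
  const match = /^(\d+)\.(\d+)/.exec(version.trim());
  if (!match) return false;
  const major = Number(match[1]);
  const minor = Number(match[2]);
  return major > 26 || (major === 26 && minor >= 3);
}

// Which lookup tiers remain for a column, given the server version and
// whether the seeded rollup MV still covers that column.
function availableTiers(version: string | undefined, mvCoversColumn: boolean): string[] {
  const tiers: string[] = [];
  if (supportsTextIndexSketch(version)) tiers.push('text-index');
  if (mvCoversColumn) tiers.push('mv-rollup');
  tiers.push('main-table-scan'); // always the last resort
  return tiers;
}

// Fresh install on an older server with the trimmed MV: only the scan is left.
const freshOldServer = availableTiers('25.8.2.1', false);
```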
