feat: use text index to power filters and autocomplete#2376
Conversation
|
|
The latest updates on your projects. Learn more about Vercel for GitHub. 2 Skipped Deployments
|
🔴 Tier 4 — CriticalTouches auth, data models, config, tasks, OTel pipeline, ClickHouse, or CI/CD. Why this tier:
Review process: Deep review from a domain expert. Synchronous walkthrough may be required. Stats
|
E2E Test Results✅ All tests passed • 191 passed • 3 skipped • 1310s
Tests ran across 4 shards in parallel. |
Deep Review🔴 P0/P1 — must fix
🟡 P2 — recommended
🔵 P3 nitpicks (19)
Reviewers (7): adversarial, api-contract, correctness, kieran-typescript, maintainability, performance, project-standards, reliability, testing Testing gaps:
|
Greptile SummaryThis PR wires up ClickHouse
Confidence Score: 3/5Two concrete correctness defects exist in the primary new code path; Distributed-table deployments and setups with multi-character separators will encounter runtime ClickHouse errors when the text-index path is activated. The Distributed-table guard in packages/common-utils/src/core/metadata.ts — specifically the Distributed-engine guard in Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[getMapKeys / getMapValues] --> B{getMapColumnTextIndexes returns index?}
B -- No --> E{metadataMVs + dateRange?}
B -- keysIndex --> C[mergeTreeTextIndex SELECT token AS key]
B -- itemsIndex --> D[mergeTreeTextIndex OR-chain startsWith]
C -- empty --> D
D -- empty --> E
E -- yes --> F[KV Rollup MV query]
E -- no --> G[Main-table scan]
F -- empty --> G
C -- non-empty --> Z[cache.set + return]
D -- non-empty --> Z
F -- non-empty --> Z
G --> Z
Reviews (1): Last reviewed commit: "fix: deduplicate rollups by not using MV" | Re-trigger Greptile |
| async () => { | ||
| const [tableMeta, version] = await Promise.all([ | ||
| this.getTableMetadata({ databaseName, tableName, connectionId }), | ||
| this.getServerVersion({ connectionId }), | ||
| ]); | ||
| if (!supportsMergeTreeTextIndex(version)) return new Map(); | ||
| if (!tableMeta) return new Map(); | ||
| if (tableMeta.engine === 'Distributed') return new Map(); | ||
|
|
There was a problem hiding this comment.
Dead guard —
Distributed engine check never fires
getTableMetadata (the public helper called on line 363) already resolves a Distributed table's engine field to the local engine (e.g. MergeTree) before returning. So by the time the guard on line 369 is reached, tableMeta.engine is always the local engine, and the === 'Distributed' branch is dead code.
The practical consequence is that when the caller is a Distributed table and text indices exist on the underlying local table, the code falls through to the mergeTreeTextIndex(databaseName, tableName, indexName) SQL using the Distributed table coordinates — which ClickHouse rejects because mergeTreeTextIndex only works on local MergeTree-family tables. Callers with Distributed tables that have been given text indices will hit a runtime error on every key/value lookup.
Fix: replace the call to the public getTableMetadata with the private queryTableMetadata call (which returns the raw system.tables row, so engine is still 'Distributed'), or inspect the create_local_table_query field (only populated for Distributed tables) as a proxy.
| SELECT splitByChar(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key | ||
| FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }}) |
There was a problem hiding this comment.
splitByChar requires a single-character delimiter
ClickHouse's splitByChar(sep, s) only accepts a one-character separator; passing a multi-character string throws Only a single character separator is allowed. The separator is extracted from the user's ALIAS/MATERIALIZED column definition and is not validated to be exactly one character.
getMapValues uses position() + substring() which work correctly for any separator length, making the two paths inconsistent. Using splitByString here matches getMapValues and handles any separator safely.
| SELECT splitByChar(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key | |
| FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }}) | |
| SELECT splitByString(${{ String: indexes.itemsIndex.separator }}, token)[1] AS key | |
| FROM mergeTreeTextIndex(${{ String: databaseName }}, ${{ String: tableName }}, ${{ String: indexes.itemsIndex.indexName }}) |
| }): Promise<Map<string, string[]>> { | ||
| if (keys.length === 0) return new Map(); | ||
|
|
There was a problem hiding this comment.
getMapValues no longer caches results
The previous implementation wrapped each per-key scan in cache.getOrFetch, so repeated calls (across components that each call getMapValues for the same column/key) hit the cache after the first query. The new batched implementation skips caching entirely in all three paths (text-index, MV, and main-table scan). Callers that invoke getMapValues independently for the same arguments — e.g., different parts of a filter panel — will always hit ClickHouse.
getAllFieldsAndValues wraps the combined result in cache.getOrFetch, which helps there, but getAllKeyValues and direct callers have no such protection.
| @@ -45,30 +45,6 @@ GROUP BY ColumnIdentifier, Key, Timestamp; | |||
| -- Single MV: CTE with UNION ALL across all columns, then aggregate | |||
There was a problem hiding this comment.
Removal of ResourceAttributes / LogAttributes / ScopeAttributes from the MV rollup affects fresh deployments on CH < 26.3
The SQL uses CREATE MATERIALIZED VIEW IF NOT EXISTS, so existing deployments keep the old MV unchanged. But newly-seeded deployments will have an MV that only populates NativeColumn, meaning the rollup path in getMapKeys / getAllFieldsAndValues returns empty results for the three attribute map columns on those deployments.
For a ClickHouse instance below 26.3 (where supportsMergeTreeTextIndex returns false), no text-index path is available either, so the fallback is an unbounded main-table scan for every autocomplete or filter request targeting those columns.
Summary
The new text indexes can power filters and autocomplete and ease the metadataMVs. So let's do it!
References