fix(search): column bulk operations search not returning results at scale #27216

sonika-shah wants to merge 6 commits into main
Conversation
…cale

When searching by column name pattern (e.g., "MAT") in column bulk operations, the composite aggregation returned ALL column names from matching documents, then post-filtered in Java. With 20,000+ columns, the first composite page of 25 names rarely contained matches, so users saw 0 results.

Switch to a terms aggregation with an `include` regex when a search pattern is set. This filters at the ES/OS aggregation level — only matching column names produce buckets.

Two-phase approach: (1) a lightweight names query to get all matching names plus an accurate total, (2) a targeted data query with `top_hits` for the current page only.
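Sketched as raw request bodies, the two phases described in the commit message might look roughly like the following (the aggregation names, page values, and `top_hits` size here are illustrative assumptions, not the PR's actual queries).

Phase 1 — names-only query, with the `include` regex filtering bucket keys server-side:

```json
{
  "size": 0,
  "aggs": {
    "matching_columns": {
      "terms": {
        "field": "columns.name.keyword",
        "include": ".*[mM][aA][tT].*",
        "size": 10000,
        "order": { "_key": "asc" }
      }
    }
  }
}
```

Phase 2 — data query for the current page only, using `include` as an exact-values array:

```json
{
  "size": 0,
  "aggs": {
    "page_columns": {
      "terms": {
        "field": "columns.name.keyword",
        "include": ["MAT_ID", "MAT_NAME", "MAT_TYPE"],
        "size": 25
      },
      "aggs": {
        "hits": { "top_hits": { "size": 10 } }
      }
    }
  }
}
```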
a773a85 to 9f3b664
Pull request overview
Fixes column-name search in Column Bulk Operations for very wide schemas (20k+ columns) by switching the columnNamePattern path from composite aggregation + Java post-filtering to a two-phase terms aggregation that filters bucket keys server-side using an include regexp.
Changes:
- Added `ColumnAggregator.toCaseInsensitiveRegex()` to generate a Lucene-compatible, case-insensitive regexp for `terms.include`.
- Implemented a pattern-search branch in both Elasticsearch and OpenSearch column aggregators using a two-phase `terms` aggregation (names query + page data query).
- Added unit tests for regex generation and edge cases.
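As an illustration of what such a utility involves (this sketch is not the PR's actual code): Lucene's regexp syntax has no `(?i)` flag, so case insensitivity has to be emulated by expanding each letter into a two-character class, while escaping Lucene's regex metacharacters.

```java
// Hypothetical sketch of a case-insensitive Lucene regexp builder.
// "MAT" -> ".*[mM][aA][tT].*" (substring match, any casing).
public final class CaseInsensitiveRegex {
  // Metacharacters reserved by Lucene's regexp syntax (including optional operators).
  private static final String LUCENE_SPECIALS = ".?+*|{}[]()\"\\#@&<>~";

  public static String toCaseInsensitiveRegex(String pattern) {
    StringBuilder sb = new StringBuilder(".*");
    for (char c : pattern.toCharArray()) {
      if (Character.isLetter(c)) {
        // Lucene has no (?i); expand each letter to [lowerUpper].
        sb.append('[')
            .append(Character.toLowerCase(c))
            .append(Character.toUpperCase(c))
            .append(']');
      } else if (LUCENE_SPECIALS.indexOf(c) >= 0) {
        sb.append('\\').append(c); // escape regex metacharacters
      } else {
        sb.append(c);
      }
    }
    return sb.append(".*").toString();
  }

  public static void main(String[] args) {
    System.out.println(toCaseInsensitiveRegex("MAT")); // .*[mM][aA][tT].*
  }
}
```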
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| openmetadata-service/src/main/java/org/openmetadata/service/search/ColumnAggregator.java | Adds shared utility to build Lucene-compatible case-insensitive regex for terms include. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java | Adds pattern-search code path using two-phase terms aggregation and offset-based pagination cursor. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java | Mirrors the two-phase terms aggregation approach for OpenSearch and refactors bucket parsing. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java | Adds unit tests validating regex generation behavior (case handling + escaping). |
Comments suppressed due to low confidence (2)
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:68
`MAX_PATTERN_SEARCH_NAMES` is hard-capped at 10,000 for the phase-1 `terms` aggregation. On large schemas (e.g., 20k+ columns) a broad pattern (like a single character) can easily match >10k distinct column names, which will silently truncate `matchingNames`, undercount `totalUniqueColumns`, and prevent users from paging to the missing matches. Consider paging the name collection (e.g., via composite agg with `after_key`, or partitioning the `terms` agg) or raising the limit to cover worst-case table sizes and explicitly detecting/tracking truncation when the limit is hit.

```java
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

/** Index configuration with field mappings for each entity type. Uses aliases defined in indexMapping.json */
private static final Map<String, IndexConfig> INDEX_CONFIGS =
    Map.of(
        "table",
```
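One way to detect the truncation this comment warns about without immediately raising the limit: when a `terms` aggregation drops buckets beyond its `size`, Elasticsearch/OpenSearch report the overflow in `sum_other_doc_count`. A response excerpt (aggregation name hypothetical, buckets elided):

```json
{
  "aggregations": {
    "matching_columns": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 137,
      "buckets": []
    }
  }
}
```

A non-zero `sum_other_doc_count` after the `include` filter has been applied would mean some matching names were cut off, which the service could surface as an explicit truncation flag instead of silently undercounting.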
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:70
The phase-1 pattern search uses a `terms` agg with `size=MAX_PATTERN_SEARCH_NAMES` (10,000). If the pattern matches more than 10k distinct column names (common on 20k+ column tables for broad patterns), the names list and `totalUniqueColumns` will be truncated and the remaining matches become unreachable via pagination. Consider implementing a paged name scan (e.g., composite agg with cursor) or otherwise guaranteeing retrieval of all matching names (and/or surfacing a truncation indicator).

```java
/** Max column names to retrieve in the names-only query during pattern search. */
private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

/** Uses aliases defined in indexMapping.json */
private static final List<String> DATA_ASSET_INDEXES =
    Arrays.asList("table", "dashboardDataModel", "topic", "searchIndex", "container");
```
🔴 Playwright Results — 1 failure(s), 22 flaky
✅ 3633 passed · ❌ 1 failed · 🟡 22 flaky · ⏭️ 84 skipped

Genuine Failures (failed on all attempts) ❌

…, eliminating Phase 2 ES query
The Python checkstyle failed. Please run … You can install the pre-commit hooks with …
```java
@SuppressWarnings("unchecked")
private int decodeSearchOffset(String cursor) {
  if (cursor == null) {
    return 0;
  }
  try {
    String json = new String(Base64.getDecoder().decode(cursor), StandardCharsets.UTF_8);
    Map<String, Object> map = JsonUtils.readValue(json, Map.class);
    Object offset = map.get("searchOffset");
    if (offset instanceof Number num) {
      return num.intValue();
    }
    return 0;
  } catch (Exception e) {
    return 0;
  }
}
```
decodeSearchOffset silently swallows all decode/JSON errors and returns 0. This makes invalid/mismatched cursors hard to debug and can cause clients to loop over page 1 again. Consider logging decode failures (similar to decodeCursor) and/or encoding a cursor "type" marker so search-offset cursors can be distinguished from composite after_key cursors.
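A sketch of the two suggested improvements — a cursor `type` marker plus logged decode failures. This is illustrative only: it hand-rolls the JSON instead of using the project's `JsonUtils`, and all names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.logging.Logger;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical search-offset cursor: a "type" marker distinguishes it from
// composite after_key cursors, and decode failures are logged, not swallowed.
public final class SearchCursor {
  private static final Logger LOG = Logger.getLogger(SearchCursor.class.getName());
  private static final Pattern OFFSET = Pattern.compile("\"searchOffset\"\\s*:\\s*(\\d+)");

  static String encode(int offset) {
    String json = "{\"type\":\"searchOffset\",\"searchOffset\":" + offset + "}";
    return Base64.getEncoder().encodeToString(json.getBytes(StandardCharsets.UTF_8));
  }

  static int decode(String cursor) {
    if (cursor == null) {
      return 0;
    }
    try {
      String json = new String(Base64.getDecoder().decode(cursor), StandardCharsets.UTF_8);
      if (!json.contains("\"type\":\"searchOffset\"")) {
        // Mismatched cursor kind (e.g., a composite after_key cursor).
        LOG.warning("Cursor is not a search-offset cursor, starting from page 1");
        return 0;
      }
      Matcher m = OFFSET.matcher(json);
      return m.find() ? Integer.parseInt(m.group(1)) : 0;
    } catch (IllegalArgumentException e) {
      LOG.warning("Failed to decode cursor: " + e.getMessage());
      return 0;
    }
  }

  public static void main(String[] args) {
    System.out.println(decode(encode(75))); // 75
  }
}
```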
```java
ColumnGridResponse lowerResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=casemixcol&serviceName=" + service.getName());

assertNotNull(lowerResponse);
boolean foundLower =
    lowerResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundLower, "Lowercase search should find the mixed-case column");

// Search with all uppercase
ColumnGridResponse upperResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=CASEMIXCOL&serviceName=" + service.getName());

assertNotNull(upperResponse);
boolean foundUpper =
    upperResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundUpper, "Uppercase search should find the mixed-case column");
```
These new ITs call waitForSearchIndexRefresh() (a fixed delay that doesn't assert the data is actually indexed) and then immediately query the column grid. This can be flaky under slower CI/ES refresh conditions. Consider wrapping the getColumnGrid(...) assertions in an await().untilAsserted(...) (similar to existing tests in this class) to wait until the expected column appears.
Suggested change:

```java
await()
    .untilAsserted(
        () -> {
          ColumnGridResponse lowerResponse =
              getColumnGrid(
                  client,
                  "entityTypes=table&columnNamePattern=casemixcol&serviceName="
                      + service.getName());
          assertNotNull(lowerResponse);
          boolean foundLower =
              lowerResponse.getColumns().stream()
                  .anyMatch(c -> c.getColumnName().equals(colName));
          assertTrue(foundLower, "Lowercase search should find the mixed-case column");
        });

// Search with all uppercase
await()
    .untilAsserted(
        () -> {
          ColumnGridResponse upperResponse =
              getColumnGrid(
                  client,
                  "entityTypes=table&columnNamePattern=CASEMIXCOL&serviceName="
                      + service.getName());
          assertNotNull(upperResponse);
          boolean foundUpper =
              upperResponse.getColumns().stream()
                  .anyMatch(c -> c.getColumnName().equals(colName));
          assertTrue(foundUpper, "Uppercase search should find the mixed-case column");
        });
```
```java
Aggregation topHitsAgg = Aggregation.of(a -> a.topHits(th -> th.size(10)));
```
In the pattern-search data query, top_hits is limited to size(10). Since ColumnMetadataGrouper sets totalOccurrences based on the number of parsed hits, this will undercount and drop occurrences when a column name appears in more than 10 entities. Consider using a higher/shared constant here (and aligning with the OpenSearch implementation which uses 100), or switching to an aggregation-based count if you only need totals.
```java
SearchRequest searchRequest = SearchRequest.of(s -> s.index(indexes).query(query).size(10000));

SearchResponse<JsonData> response = client.search(searchRequest, JsonData.class);
```
The tag/glossary source-fetch query hard-codes .size(10000) and does not paginate/scroll. If more than 10k entities match the tag/glossary filter, results will be truncated and columns beyond the first 10k hits will never be considered (impacting both totals and pagination consistency). Consider implementing pagination (search_after/scroll) or splitting the query to avoid an implicit 10k ceiling.
```java
SearchRequest searchRequest =
    SearchRequest.of(s -> s.index(resolvedIndexes).query(query).size(10000));

SearchResponse<JsonData> response = client.search(searchRequest, JsonData.class);
```
The tag/glossary source-fetch query uses .size(10000) with no pagination. If more than 10k entities match the tag/glossary filter, the response will be truncated and later matches will never be included (affecting totals and paging). Consider adding pagination (search_after/scroll) or otherwise removing the implicit 10k cap.
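One way to remove the implicit 10k ceiling flagged in both comments is cursor-style paging with `search_after`, sketched here as a raw request body (the sort field, tag filter, and page size are illustrative assumptions, not the project's actual query):

```json
{
  "size": 1000,
  "query": { "term": { "tags.tagFQN": "PII.Sensitive" } },
  "sort": [{ "fullyQualifiedName": "asc" }],
  "search_after": ["serviceA.db.schema.table_0999"]
}
```

Each subsequent page repeats the same query with `search_after` set to the sort values of the last hit from the previous page, so no fixed `size` cap is needed; a unique, indexed field must be used as the sort tiebreaker.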
```java
.withColumns(List.of(col1, col2))
.execute();

// Table 2: same column name as col1 but WITHOUT tag
```
The comment says "Table 2: same column name as col1 but WITHOUT tag", but untaggedMatchCol is a different column name from taggedMatchCol (it just shares the pattern prefix). Update the comment (or the test data) so the scenario description matches what the test is actually validating.
Suggested change:

```java
// Table 2: untagged column whose name also matches the pattern
```
```java
Set<String> allNames = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
allNames.addAll(taggedColumns.keySet());
```
💡 Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup
In aggregateColumnsWithKnownNames (ES line 272-291, OS line 200-219), a TreeSet(String.CASE_INSENSITIVE_ORDER) is used to deduplicate column names, but taggedColumns is a regular HashMap with case-sensitive keys. If two documents contribute the same column name with different casing (e.g., "MyCol" vs "mycol"), the TreeSet will keep only one variant. When taggedColumns.get(name) is called with that variant, it will only find entries under the exact matching case key, silently dropping occurrences stored under the other case variant.
In practice this is unlikely (column names from the same logical column usually have consistent casing), but it could cause missing occurrences in edge cases.
Suggested fix:
Use a case-insensitive map (e.g., TreeMap with CASE_INSENSITIVE_ORDER) for taggedColumns from the start, or merge all case variants when building pageColumns:
```java
for (String name : pageNames) {
  for (Map.Entry<String, List<ColumnWithContext>> e : taggedColumns.entrySet()) {
    if (e.getKey().equalsIgnoreCase(name)) {
      pageColumns.computeIfAbsent(name, k -> new ArrayList<>()).addAll(e.getValue());
    }
  }
}
```
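A minimal sketch of the first option — keying the occurrence map case-insensitively from the start, so that no case variant is dropped at lookup time. Names here are illustrative, with `String` table names standing in for `ColumnWithContext`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public final class CaseInsensitiveColumns {
  // Merge (columnName, occurrence) pairs into a case-insensitive map:
  // "MyCol" and "mycol" land in the same entry instead of two HashMap keys.
  static Map<String, List<String>> merge(List<Map.Entry<String, String>> occurrences) {
    Map<String, List<String>> byName = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    for (Map.Entry<String, String> occ : occurrences) {
      byName.computeIfAbsent(occ.getKey(), k -> new ArrayList<>()).add(occ.getValue());
    }
    return byName;
  }

  public static void main(String[] args) {
    Map<String, List<String>> byName =
        merge(List.of(Map.entry("MyCol", "table_a"), Map.entry("mycol", "table_b")));
    // Lookup by any casing now finds both occurrences.
    System.out.println(byName.get("MYCOL")); // [table_a, table_b]
  }
}
```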
Code Review 👍 Approved with suggestions — 3 resolved / 4 findings

Bulk column operations search fix now returns results at scale, with unit tests validating against the actual Lucene/ES regex engine. Consider aligning the TreeSet case-insensitive dedup with the HashMap case-sensitive lookup in aggregateColumnsWithKnownNames to prevent potential matching inconsistencies.

💡 Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:272-273
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:290-291
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:200-201
📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:218-219

In practice this is unlikely (column names from the same logical column usually have consistent casing), but it could cause missing occurrences in edge cases.

✅ 3 resolved
✅ Bug: Unit tests validate Java regex, not Lucene/ES regex engine
✅ Bug:
Fixes #27227
Summary

- When `columnNamePattern` is set, switch from composite aggregation to terms aggregation with an `include` regex — ES/OS filters at the aggregation level, so only matching column names produce buckets

How it works: Two-phase terms aggregation
1. `terms` agg with `include` regex, `size=10000`, ordered by `_key` asc → returns all matching column names + accurate total count in a single fast query
2. `terms` agg with `include` = exact page names + `top_hits` → fetches full entity data for only the 25 names on the current page

Why terms agg
- `include(regex)` works even with flat objects (columns are not nested): `include(regex)` tests each ordinal independently against the regex — it doesn't matter that multiple values came from the same document

Non-search path (no `columnNamePattern`): unchanged — still uses composite aggregation with cursor-based pagination.

Approaches considered and rejected
1. Composite agg + Java post-filter (previous approach — the bug): filtered with `String.contains()` after fetching.
2. Composite agg with query-level `regexp` filter: `regexp` query on `columns.name.keyword` to pre-filter documents before aggregation.
3. Composite agg + filter sub-agg + `bucket_selector` (elastic/elasticsearch#29079): `bucket_selector` pipeline agg to drop non-matching buckets; `bucket_selector` is officially unsupported with composite (ES docs).
4. Composite agg with runtime field + conditional `emit()`: runtime field with conditional `emit()`, composite paginates with `after_key`.
5. Terms agg `include(regex)` + `exclude(array)` for pagination: first request uses `include(regex)`; the next request adds `exclude([...previously seen names...])` to get the next batch. `include` with array-based `exclude` is not supported on OpenSearch — the feature was added in ES 7.11 (elastic/elasticsearch#63325), but OpenSearch forked from ES 7.10.2, before this was merged.
6. Terms agg `include(partition/num_partitions)` + query-level `regexp`: `partition` and regex share the same `include` parameter — mutually exclusive. And query-level regexp has the same flat-object problem as Approach 2.
7. Composite agg with `include`/`exclude` on terms source.

Why 10,000 cap on matching names
- Terms agg returns everything up to `size` — there is no cursor/pagination mechanism
- `partition`/`num_partitions` can't be combined with `include(regex)` (same field)

Files changed
- `ColumnAggregator.java` — `toCaseInsensitiveRegex()` utility (Lucene regex doesn't support `(?i)`, so "MAT" → `.*[mM][aA][tT].*`)
- `ElasticSearchColumnAggregator.java` — `aggregateColumnsWithPattern()`, `executeNamesQuery()`, `executePageDataQuery()`. Extracted shared `parseBucketHits()` and `applyTagPostFilter()` to avoid duplication. Offset-based cursor for search pagination
- `OpenSearchColumnAggregator.java` — mirrors the Elasticsearch changes
- `ColumnAggregatorTest.java` — `toCaseInsensitiveRegex`: case insensitivity, special char escaping, edge cases

Test plan
- `ColumnAggregatorTest` — 8 unit tests for regex generation (all pass)
- `ColumnMetadataGrouperTest` — 7 existing tests still pass (no regression)
- `mvn compile` — clean build

🤖 Generated with Claude Code