fix(search): column bulk operations search not returning results at scale#27216

Open
sonika-shah wants to merge 6 commits into main from fix/column-bulk-search-at-scale

Conversation


@sonika-shah sonika-shah commented Apr 9, 2026

Fixes #27227

Summary

  • Bug: Searching for a column name (e.g., "MAT", "MATNR") in Column Bulk Operations returned 0 results when tables have 20000+ columns (reported by RATP & Airbus)
  • Root cause: Composite aggregation returns ALL column names from matching documents, then Java post-filters. With page size 25, the first composite page rarely contained matching column names — they were hidden further in the alphabet
  • Fix: When columnNamePattern is set, switch from composite aggregation to terms aggregation with include regex — ES/OS filters at the aggregation level, so only matching column names produce buckets

How it works: Two-phase terms aggregation

  1. Phase 1 — Names query (lightweight, no sub-aggs): terms agg with include regex, size=10000, ordered by _key asc → returns all matching column names + accurate total count in a single fast query
  2. Java pagination: Sort all matching names, slice the requested page (offset-based cursor)
  3. Phase 2 — Data query (targeted): terms agg with include = exact page names + top_hits → fetches full entity data for only the 25 names on the current page
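Step 2 between the two queries amounts to a plain offset slice over the sorted name list. A minimal sketch with illustrative names (`slicePage` is not the PR's actual method):

```java
import java.util.List;

public class PageSlicer {
    // Slice one page out of the sorted list of matching column names.
    // 'offset' comes from the decoded cursor; 'pageSize' is the requested
    // page size (e.g., 25). Past-the-end offsets yield an empty page.
    static List<String> slicePage(List<String> sortedNames, int offset, int pageSize) {
        if (offset >= sortedNames.size()) {
            return List.of();
        }
        int end = Math.min(offset + pageSize, sortedNames.size());
        return sortedNames.subList(offset, end);
    }
}
```

The names on the returned slice are exactly the ones fed into the Phase 2 `include = exact page names` terms aggregation.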

Why terms agg include(regex) works even with flat objects (columns are not nested):

  • Terms agg scans the global ordinals dictionary — a pre-built sorted list of every unique value in the field across the entire index
  • include(regex) tests each ordinal independently against the regex — it doesn't matter that multiple values came from the same document
  • Non-matching ordinals never allocate a bucket, never scan documents — zero cost
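This can be illustrated with a toy model — not ES internals, just the difference between testing each unique value independently (what `include(regex)` does against the ordinals dictionary) and filtering whole documents:

```java
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;
import java.util.regex.Pattern;

public class OrdinalFilterDemo {
    // Each inner list stands for one document's flat column-name array.
    // Build the sorted, deduplicated "dictionary" of values across all
    // documents, then test each value on its own against the regex —
    // it does not matter which document a value came from.
    static SortedSet<String> filterValues(List<List<String>> docs, String regex) {
        Pattern p = Pattern.compile(regex);
        SortedSet<String> dictionary = new TreeSet<>();
        docs.forEach(dictionary::addAll);
        dictionary.removeIf(v -> !p.matcher(v).matches());
        return dictionary;
    }
}
```

A document-level filter would have to keep or drop whole documents, dragging every sibling column along; value-level filtering keeps only the matching names.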

Non-search path (no columnNamePattern): Unchanged — still uses composite aggregation with cursor-based pagination.

Approaches considered and rejected

1. Composite agg + Java post-filter (previous approach — the bug)

  • Composite returns ALL column names from matching documents, Java filters with String.contains() after
  • Why it fails: With 20,000+ columns, page size 25, matching names hidden deep in alphabet → first pages always return 0 matches
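A small simulation of the failure mode (names and sizes invented for illustration):

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class PostFilterBugDemo {
    // The composite agg pages through ALL unique names in key order; Java
    // then filters each page with contains(). If the matches sort late,
    // early pages come back empty.
    static long matchesOnFirstPage(List<String> allNamesSorted, int pageSize, String needle) {
        return allNamesSorted.stream()
                .limit(pageSize)                 // what the first composite page returns
                .filter(n -> n.contains(needle)) // the old Java post-filter
                .count();
    }

    static List<String> manyColumns() {
        // 20,000 columns named COL_00000..COL_19999, plus two matches
        // ("MAT", "MATNR") that sort after all of them
        var names = IntStream.range(0, 20_000)
                .mapToObj(i -> String.format("COL_%05d", i))
                .collect(Collectors.toList());
        names.add("MAT");
        names.add("MATNR");
        return names.stream().sorted().collect(Collectors.toList());
    }
}
```

With page size 25 the first page contains only `COL_000xx` names, so the post-filter yields zero results even though two matches exist.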

2. Composite agg with query-level regexp filter

  • Add regexp query on columns.name.keyword to pre-filter documents before aggregation
  • Rejected: Filters documents (tables), not column values. Since columns are flat objects (not nested), a table with 1 matching + 500 non-matching columns still returns all 501 column names in composite buckets

3. Composite agg + filter sub-agg + bucket_selector (elastic/elasticsearch#29079)

  • Use filter sub-aggregation + bucket_selector pipeline agg to drop non-matching buckets
  • Rejected:
    • bucket_selector is officially unsupported with composite (ES docs)
    • Columns are flat objects, not nested — filter sub-agg operates on documents, can't isolate which array value corresponds to which bucket key
    • Still creates ALL buckets first then prunes — pages can come back empty

4. Composite agg with runtime field + conditional emit()

  • Runtime field script filters values via conditional emit(), composite paginates with after_key
  • Rejected: Requires OpenSearch 2.14+ (we support 2.6+). Also significant performance penalty — disables global ordinals and early termination optimizations

5. Terms agg include(regex) + exclude(array) for pagination

  • First request gets 10,000 names with include(regex). Next request adds exclude([...previously seen names...]) to get the next batch
  • Rejected: Mixing regex-based include with array-based exclude is not supported on OpenSearch. Feature was added in ES 7.11 (elastic/elasticsearch#63325), but OpenSearch forked from ES 7.10.2 — before this was merged

6. Terms agg include(partition/num_partitions) + query-level regexp

  • Move regex to query level, use hash-based partitioning for pagination
  • Rejected: partition and regex share the same include parameter — mutually exclusive. And query-level regexp has the same flat-object problem as Approach 2

7. Composite agg with include/exclude on terms source

  • The ideal solution — composite's native pagination + regex filtering
  • Does not exist: Requested in elastic/elasticsearch#50368, closed Feb 2024 as "not planned". Elastic's focus shifted to ESQL

Why 10,000 cap on matching names

  • Terms agg requires an upfront size — there is no cursor/pagination mechanism
  • partition/num_partitions can't be combined with include(regex) (same field)
  • No cross-platform (ES + OpenSearch) way to paginate a terms agg with regex filtering
  • 10,000 unique column names matching a search pattern is an extreme edge case for a search feature
  • Phase 1 is lightweight (just string keys, no document data) so 10,000 buckets is cheap

Files changed

| File | Change |
| --- | --- |
| ColumnAggregator.java | Added shared `toCaseInsensitiveRegex()` utility (Lucene regex doesn't support `(?i)`, so "MAT" → `.*[mM][aA][tT].*`) |
| ElasticSearchColumnAggregator.java | Added search branch with `aggregateColumnsWithPattern()`, `executeNamesQuery()`, `executePageDataQuery()`. Extracted shared `parseBucketHits()` and `applyTagPostFilter()` to avoid duplication. Offset-based cursor for search pagination |
| OpenSearchColumnAggregator.java | Same two-phase terms agg approach for the OpenSearch client |
| ColumnAggregatorTest.java | Unit tests for `toCaseInsensitiveRegex` — case insensitivity, special-char escaping, edge cases |
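The `toCaseInsensitiveRegex()` utility could look roughly like this — a sketch of the idea rather than the PR's exact code. Lucene's regex syntax has no `(?i)` flag, so each letter expands to a two-character class and regex metacharacters are escaped:

```java
public class RegexUtil {
    // Build a Lucene-compatible, case-insensitive "contains" regex:
    // "MAT" -> ".*[mM][aA][tT].*". Metacharacters are escaped literally.
    static String toCaseInsensitiveRegex(String pattern) {
        StringBuilder sb = new StringBuilder(".*");
        for (char c : pattern.toCharArray()) {
            if (Character.isLetter(c)) {
                sb.append('[')
                  .append(Character.toLowerCase(c))
                  .append(Character.toUpperCase(c))
                  .append(']');
            } else if ("+-&|!(){}[]^\"~*?:\\/.#@<>".indexOf(c) >= 0) {
                sb.append('\\').append(c); // escape regex metacharacters
            } else {
                sb.append(c);              // digits, underscore, etc. pass through
            }
        }
        return sb.append(".*").toString();
    }
}
```

The character classes and `.*` wildcards used here are valid in both Lucene and `java.util.regex`, so the output can be unit-tested against Java's engine even though it runs inside ES/OS at runtime.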

Test plan

  • ColumnAggregatorTest — 8 unit tests for regex generation (all pass)
  • ColumnMetadataGrouperTest — 7 existing tests still pass (no regression)
  • mvn compile — clean build
  • Manual test: search "MAT" on a table with 20000+ columns → should return MAT, MATNR results
  • Manual test: pagination through search results works correctly
  • Manual test: search + tag filter combination works correctly
  • Manual test: non-search browsing (no pattern) still works as before

🤖 Generated with Claude Code

Copilot AI review requested due to automatic review settings April 9, 2026 19:46
…cale

When searching by column name pattern (e.g., "MAT") in column bulk
operations, the composite aggregation returned ALL column names from
matching documents, then post-filtered in Java. With 20000+ columns,
the first composite page of 25 names rarely contained matches, so
users saw 0 results.

Switch to terms aggregation with `include` regex when a search pattern
is set. This filters at the ES/OS aggregation level — only matching
column names produce buckets. Two-phase approach: (1) lightweight
names query to get all matching names + accurate total, (2) targeted
data query with top_hits for the current page only.
@sonika-shah sonika-shah force-pushed the fix/column-bulk-search-at-scale branch from a773a85 to 9f3b664 on April 9, 2026 19:47
@github-actions github-actions bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 9, 2026

Copilot AI left a comment


Pull request overview

Fixes column-name search in Column Bulk Operations for very wide schemas (20k+ columns) by switching the columnNamePattern path from composite aggregation + Java post-filtering to a two-phase terms aggregation that filters bucket keys server-side using an include regexp.

Changes:

  • Added ColumnAggregator.toCaseInsensitiveRegex() to generate a Lucene-compatible, case-insensitive regexp for terms.include.
  • Implemented a pattern-search branch in both Elasticsearch and OpenSearch column aggregators using a two-phase terms aggregation (names query + page data query).
  • Added unit tests for regex generation and edge cases.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| openmetadata-service/src/main/java/org/openmetadata/service/search/ColumnAggregator.java | Adds shared utility to build Lucene-compatible case-insensitive regex for terms include. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java | Adds pattern-search code path using two-phase terms aggregation and offset-based pagination cursor. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java | Mirrors the two-phase terms aggregation approach for OpenSearch and refactors bucket parsing. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java | Adds unit tests validating regex generation behavior (case handling + escaping). |
Comments suppressed due to low confidence (2)

openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:68

  • MAX_PATTERN_SEARCH_NAMES is hard-capped at 10,000 for the phase-1 terms aggregation. On large schemas (e.g., 20k+ columns) a broad pattern (like a single character) can easily match >10k distinct column names, which will silently truncate matchingNames, undercount totalUniqueColumns, and prevent users from paging to the missing matches. Consider paging the name collection (e.g., via composite agg with after_key, or partitioning the terms agg) or raising the limit to cover worst-case table sizes and explicitly detecting/tracking truncation when the limit is hit.
  /** Max column names to retrieve in the names-only query during pattern search. */
  private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

  /** Index configuration with field mappings for each entity type. Uses aliases defined in indexMapping.json */
  private static final Map<String, IndexConfig> INDEX_CONFIGS =
      Map.of(
          "table",

openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:70

  • The phase-1 pattern search uses a terms agg with size=MAX_PATTERN_SEARCH_NAMES (10,000). If the pattern matches more than 10k distinct column names (common on 20k+ column tables for broad patterns), the names list and totalUniqueColumns will be truncated and the remaining matches become unreachable via pagination. Consider implementing a paged name scan (e.g., composite agg with cursor) or otherwise guaranteeing retrieval of all matching names (and/or surfacing a truncation indicator).
  /** Max column names to retrieve in the names-only query during pattern search. */
  private static final int MAX_PATTERN_SEARCH_NAMES = 10000;

  /** Uses aliases defined in indexMapping.json */
  private static final List<String> DATA_ASSET_INDEXES =
      Arrays.asList("table", "dashboardDataModel", "topic", "searchIndex", "container");


github-actions bot commented Apr 9, 2026

🔴 Playwright Results — 1 failure(s), 22 flaky

✅ 3633 passed · ❌ 1 failed · 🟡 22 flaky · ⏭️ 84 skipped

| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| 🟡 Shard 1 | 475 | 0 | 5 | 4 |
| 🟡 Shard 2 | 643 | 0 | 2 | 7 |
| 🔴 Shard 3 | 649 | 1 | 3 | 1 |
| 🟡 Shard 4 | 620 | 0 | 6 | 22 |
| 🟡 Shard 5 | 611 | 0 | 1 | 42 |
| 🟡 Shard 6 | 635 | 0 | 5 | 8 |

Genuine Failures (failed on all attempts)

Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3)
Error: expect(locator).toBeVisible() failed

Locator: getByTestId('KnowledgePanel.DataProducts').getByTestId('data-products-list').getByTestId('data-product-"PW%dataProduct.8a6e84df"')
Expected: visible
Timeout: 15000ms
Error: element(s) not found

Call log:
  - Expect "toBeVisible" with timeout 15000ms
  - waiting for getByTestId('KnowledgePanel.DataProducts').getByTestId('data-products-list').getByTestId('data-product-"PW%dataProduct.8a6e84df"')

🟡 22 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Dashboard Data Model - customization should work (shard 1, 1 retry)
  • Features/CustomizeDetailPage.spec.ts › Domain - customization should work (shard 1, 1 retry)
  • Flow/Tour.spec.ts › Tour should work from welcome screen (shard 1, 1 retry)
  • Pages/AuditLogs.spec.ts › should create audit log entry when glossary is updated (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/BulkEditEntity.spec.ts › Glossary (shard 2, 1 retry)
  • Features/ChangeSummaryBadge.spec.ts › Automated badge should appear on entity description with Automated source (shard 2, 1 retry)
  • Features/Permissions/GlossaryPermissions.spec.ts › Team-based permissions work correctly (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Features/RestoreEntityInheritedFields.spec.ts › Validate restore with Inherited domain and data products assigned (shard 3, 1 retry)
  • Pages/Customproperties-part2.spec.ts › entityReferenceList shows item count, scrollable list, no expand toggle (shard 4, 1 retry)
  • Pages/DataContracts.spec.ts › Create Data Contract and validate for File (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Verify domain tags and glossary terms (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Rename domain with data products attached at domain and subdomain levels (shard 4, 1 retry)
  • Pages/Domains.spec.ts › Comprehensive domain rename with ALL relationships preserved (shard 4, 1 retry)
  • Pages/Entity.spec.ts › Tag Add, Update and Remove (shard 4, 1 retry)
  • Pages/Glossary.spec.ts › Add and Remove Assets (shard 5, 2 retries)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for searchIndex -> mlModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Users.spec.ts › Permissions for table details page for Data Consumer (shard 6, 1 retry)
  • Pages/Users.spec.ts › Check permissions for Data Steward (shard 6, 1 retry)
  • VersionPages/EntityVersionPages.spec.ts › Directory (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings April 13, 2026 06:30

Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Copilot AI review requested due to automatic review settings April 13, 2026 19:15
@github-actions

The Python checkstyle check failed.

Please run `make py_format` and `py_format_check` in the root of your repository and commit the changes to this PR.
You can also use pre-commit to automate the Python code formatting.

You can install the pre-commit hooks with `make install_test precommit_install`.


Copilot AI left a comment


Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

Comment on lines +1131 to +1146
@SuppressWarnings("unchecked")
private int decodeSearchOffset(String cursor) {
  if (cursor == null) {
    return 0;
  }
  try {
    String json = new String(Base64.getDecoder().decode(cursor), StandardCharsets.UTF_8);
    Map<String, Object> map = JsonUtils.readValue(json, Map.class);
    Object offset = map.get("searchOffset");
    if (offset instanceof Number num) {
      return num.intValue();
    }
    return 0;
  } catch (Exception e) {
    return 0;
  }

Copilot AI Apr 13, 2026


decodeSearchOffset silently swallows all decode/JSON errors and returns 0. This makes invalid/mismatched cursors hard to debug and can cause clients to loop over page 1 again. Consider logging decode failures (similar to decodeCursor) and/or encoding a cursor "type" marker so search-offset cursors can be distinguished from composite after_key cursors.

Comment on lines +1041 to +1056
@SuppressWarnings("unchecked")
private int decodeSearchOffset(String cursor) {
  if (cursor == null) {
    return 0;
  }
  try {
    String json = new String(Base64.getDecoder().decode(cursor), StandardCharsets.UTF_8);
    Map<String, Object> map = JsonUtils.readValue(json, Map.class);
    Object offset = map.get("searchOffset");
    if (offset instanceof Number num) {
      return num.intValue();
    }
    return 0;
  } catch (Exception e) {
    return 0;
  }

Copilot AI Apr 13, 2026


decodeSearchOffset swallows Base64/JSON parsing errors and returns 0 without any logging. That makes invalid/mismatched cursors hard to diagnose and can cause repeated page-1 responses. Consider logging decode failures (similar to decodeCursor) and/or adding a cursor "type" marker to distinguish search-offset cursors from composite cursors.

Comment on lines +1491 to +1510
ColumnGridResponse lowerResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=casemixcol&serviceName=" + service.getName());

assertNotNull(lowerResponse);
boolean foundLower =
    lowerResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundLower, "Lowercase search should find the mixed-case column");

// Search with all uppercase
ColumnGridResponse upperResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=CASEMIXCOL&serviceName=" + service.getName());

assertNotNull(upperResponse);
boolean foundUpper =
    upperResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundUpper, "Uppercase search should find the mixed-case column");

Copilot AI Apr 13, 2026


These new ITs call waitForSearchIndexRefresh() (a fixed delay that doesn't assert the data is actually indexed) and then immediately query the column grid. This can be flaky under slower CI/ES refresh conditions. Consider wrapping the getColumnGrid(...) assertions in an await().untilAsserted(...) (similar to existing tests in this class) to wait until the expected column appears.

Suggested change
ColumnGridResponse lowerResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=casemixcol&serviceName=" + service.getName());
assertNotNull(lowerResponse);
boolean foundLower =
    lowerResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundLower, "Lowercase search should find the mixed-case column");
// Search with all uppercase
ColumnGridResponse upperResponse =
    getColumnGrid(
        client,
        "entityTypes=table&columnNamePattern=CASEMIXCOL&serviceName=" + service.getName());
assertNotNull(upperResponse);
boolean foundUpper =
    upperResponse.getColumns().stream().anyMatch(c -> c.getColumnName().equals(colName));
assertTrue(foundUpper, "Uppercase search should find the mixed-case column");
await()
    .untilAsserted(
        () -> {
          ColumnGridResponse lowerResponse =
              getColumnGrid(
                  client,
                  "entityTypes=table&columnNamePattern=casemixcol&serviceName="
                      + service.getName());
          assertNotNull(lowerResponse);
          boolean foundLower =
              lowerResponse.getColumns().stream()
                  .anyMatch(c -> c.getColumnName().equals(colName));
          assertTrue(foundLower, "Lowercase search should find the mixed-case column");
        });
// Search with all uppercase
await()
    .untilAsserted(
        () -> {
          ColumnGridResponse upperResponse =
              getColumnGrid(
                  client,
                  "entityTypes=table&columnNamePattern=CASEMIXCOL&serviceName="
                      + service.getName());
          assertNotNull(upperResponse);
          boolean foundUpper =
              upperResponse.getColumns().stream()
                  .anyMatch(c -> c.getColumnName().equals(colName));
          assertTrue(foundUpper, "Uppercase search should find the mixed-case column");
        });

Comment on lines +736 to +738

Aggregation topHitsAgg = Aggregation.of(a -> a.topHits(th -> th.size(10)));


Copilot AI Apr 13, 2026


In the pattern-search data query, top_hits is limited to size(10). Since ColumnMetadataGrouper sets totalOccurrences based on the number of parsed hits, this will undercount and drop occurrences when a column name appears in more than 10 entities. Consider using a higher/shared constant here (and aligning with the OpenSearch implementation which uses 100), or switching to an aggregation-based count if you only need totals.

Comment on lines 357 to 359
SearchRequest searchRequest = SearchRequest.of(s -> s.index(indexes).query(query).size(10000));

SearchResponse<JsonData> response = client.search(searchRequest, JsonData.class);

Copilot AI Apr 13, 2026


The tag/glossary source-fetch query hard-codes .size(10000) and does not paginate/scroll. If more than 10k entities match the tag/glossary filter, results will be truncated and columns beyond the first 10k hits will never be considered (impacting both totals and pagination consistency). Consider implementing pagination (search_after/scroll) or splitting the query to avoid an implicit 10k ceiling.

Comment on lines 280 to 283
SearchRequest searchRequest =
SearchRequest.of(s -> s.index(resolvedIndexes).query(query).size(10000));

SearchResponse<JsonData> response = client.search(searchRequest, JsonData.class);

Copilot AI Apr 13, 2026


The tag/glossary source-fetch query uses .size(10000) with no pagination. If more than 10k entities match the tag/glossary filter, the response will be truncated and later matches will never be included (affecting totals and paging). Consider adding pagination (search_after/scroll) or otherwise removing the implicit 10k cap.

.withColumns(List.of(col1, col2))
.execute();

// Table 2: same column name as col1 but WITHOUT tag

Copilot AI Apr 13, 2026


The comment says "Table 2: same column name as col1 but WITHOUT tag", but untaggedMatchCol is a different column name from taggedMatchCol (it just shares the pattern prefix). Update the comment (or the test data) so the scenario description matches what the test is actually validating.

Suggested change
// Table 2: same column name as col1 but WITHOUT tag
// Table 2: untagged column whose name also matches the pattern

Comment on lines +272 to +273
Set<String> allNames = new TreeSet<>(String.CASE_INSENSITIVE_ORDER);
allNames.addAll(taggedColumns.keySet());


💡 Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup

In aggregateColumnsWithKnownNames (ES line 272-291, OS line 200-219), a TreeSet(String.CASE_INSENSITIVE_ORDER) is used to deduplicate column names, but taggedColumns is a regular HashMap with case-sensitive keys. If two documents contribute the same column name with different casing (e.g., "MyCol" vs "mycol"), the TreeSet will keep only one variant. When taggedColumns.get(name) is called with that variant, it will only find entries under the exact matching case key, silently dropping occurrences stored under the other case variant.

In practice this is unlikely (column names from the same logical column usually have consistent casing), but it could cause missing occurrences in edge cases.

Suggested fix:

Use a case-insensitive map (e.g., TreeMap with CASE_INSENSITIVE_ORDER) for taggedColumns from the start, or merge all case variants when building pageColumns:
for (String name : pageNames) {
  for (Map.Entry<String, List<ColumnWithContext>> e : taggedColumns.entrySet()) {
    if (e.getKey().equalsIgnoreCase(name)) {
      pageColumns.computeIfAbsent(name, k -> new ArrayList<>()).addAll(e.getValue());
    }
  }
}



gitar-bot bot commented Apr 15, 2026

Code Review 👍 Approved with suggestions 3 resolved / 4 findings

Bulk column operations search fix now returns results at scale with unit tests validating against the actual Lucene/ES regex engine. Consider aligning the TreeSet case-insensitive dedup with the HashMap case-sensitive lookup in aggregateColumnsWithKnownNames to prevent potential matching inconsistencies.

💡 Edge Case: TreeSet case-insensitive dedup vs HashMap case-sensitive lookup

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:272-273 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:290-291 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:200-201 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:218-219

In aggregateColumnsWithKnownNames (ES line 272-291, OS line 200-219), a TreeSet(String.CASE_INSENSITIVE_ORDER) is used to deduplicate column names, but taggedColumns is a regular HashMap with case-sensitive keys. If two documents contribute the same column name with different casing (e.g., "MyCol" vs "mycol"), the TreeSet will keep only one variant. When taggedColumns.get(name) is called with that variant, it will only find entries under the exact matching case key, silently dropping occurrences stored under the other case variant.

In practice this is unlikely (column names from the same logical column usually have consistent casing), but it could cause missing occurrences in edge cases.

Suggested fix
Use a case-insensitive map (e.g., TreeMap with CASE_INSENSITIVE_ORDER) for taggedColumns from the start, or merge all case variants when building pageColumns:
for (String name : pageNames) {
  for (Map.Entry<String, List<ColumnWithContext>> e : taggedColumns.entrySet()) {
    if (e.getKey().equalsIgnoreCase(name)) {
      pageColumns.computeIfAbsent(name, k -> new ArrayList<>()).addAll(e.getValue());
    }
  }
}
✅ 3 resolved
Bug: Unit tests validate Java regex, not Lucene/ES regex engine

📄 openmetadata-service/src/test/java/org/openmetadata/service/search/ColumnAggregatorTest.java:30-36
ColumnAggregatorTest uses java.util.regex.Pattern to validate the output of toCaseInsensitiveRegex, but at runtime the regex is executed by Elasticsearch/OpenSearch's Lucene-based regex engine, which has different syntax rules (e.g., no backreferences, different anchoring behavior). While the subset of features used here (character classes, .*, literal escaping) is compatible with both engines, the tests don't guarantee correctness against the actual runtime engine.

Consider adding an integration test (or noting this caveat in the test class) that validates the regex against a real ES/OS instance, especially for edge cases like Unicode characters or less common special characters.

Bug: totalOccurrences is per-page, not global, in pattern search path

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:267-268 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:187
In the new aggregateColumnsWithPattern method, totalOccurrences is computed from only the current page's grid items (e.g., 25 columns), while totalUniqueColumns is computed from phase 1 across ALL matching names. This means the API response has an accurate totalUniqueColumns but an inaccurate (page-local) totalOccurrences that changes as users paginate.

If the UI displays "X total occurrences" alongside the correct unique-columns count, the numbers will appear inconsistent. On page 1 you might see totalOccurrences=50, on page 2 it becomes totalOccurrences=42, etc.

If getting a global total is too expensive, consider documenting in the response or API contract that totalOccurrences is approximate/page-local when a search pattern is active, or omit it entirely for pattern searches.

Edge Case: Terms agg silently truncates at 10000 matching column names

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:63 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchColumnAggregator.java:664 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:65 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchColumnAggregator.java:575
The phase-1 names query uses size(MAX_PATTERN_SEARCH_NAMES = 10000) on the terms aggregation. If a broad pattern (e.g., single character "a") matches more than 10,000 distinct column names, the results are silently truncated: totalUniqueColumns will report 10,000, pagination will stop there, and the user won't know results are missing.

Consider adding a check: if the returned bucket count equals MAX_PATTERN_SEARCH_NAMES, either log a warning, return an indicator in the response (e.g., totalUniqueColumns set to -1 or an isTruncated flag), or require a minimum pattern length to avoid overly broad matches.






Development

Successfully merging this pull request may close these issues.

Column Bulk Operation Search does not return results from subsequent pages for large column counts
