Skip to content

fix(search): prevent ES/OS document rejections via engine-native mapping hardening#28671

Open
mohityadav766 wants to merge 15 commits into
mainfrom
feat/schema-indexing-safety
Open

fix(search): prevent ES/OS document rejections via engine-native mapping hardening#28671
mohityadav766 wants to merge 15 commits into
mainfrom
feat/schema-indexing-safety

Conversation

@mohityadav766
Copy link
Copy Markdown
Member

@mohityadav766 mohityadav766 commented Jun 3, 2026

Problem

Search documents were being silently rejected by Elasticsearch/OpenSearch and dead-lettered by SearchIndexRetryWorker (4xx marked non-retryable). Root causes found in the mappings:

  • 66% (2,728 / 4,079) of keyword field defs have no ignore_above → a value > 32,766 bytes throws an immense-term IllegalArgumentException and the whole document is rejected.
  • No numeric/date guard → out-of-range integer, NaN/Infinity float, or malformed date rejects the document.
  • Unbounded recursive flattening of columns and schemaFields → deep/wide structures explode the document.

Approach — engine-native hardening (zero per-document cost)

Harden the mapping once at index creation and let the engine enforce the bounds, instead of walking every document at index time (doesn't scale to large docs).

SearchIndexSettings.harden(content, limits):

  • ignore_above (= byte-safe 8,191 = keywordMaxBytes/4) on keyword fields + keyword multi-fields + flattened
  • ignore_malformed on numeric/date/boolean
  • depth_limit on flattened
  • tunable index.mapping.{depth,nested_objects,total_fields}.limit
  • never overwrites existing values

Applied on both the direct createIndex and the index-template paths, for ES and OS. OsUtils strips the ES-only ignore_above/depth_limit when converting flattened → flat_object.

Structural explosion (the one thing engines can't truncate gracefully) is capped at the source:

  • ColumnIndex / ColumnSearchIndex — depth + column-count caps
  • new SchemaFieldFlattener — shared depth + field-count cap for Topic & APIEndpoint schemaFields (also dedupes two identical copies)

Limits are config-tunable via ElasticSearchConfiguration.searchIndexingLimits (enableMappingHardening, keywordMaxBytes, mappingDepthLimit, nestedObjectsLimit, totalFieldsLimit, maxColumns).

Coverage (exhaustive)

  • Recursive flatten-into-one-doc exists in exactly two places (columns, schemaFields) — both capped.
  • All flattened fields hardened (extension, columns.extension, dataModel.columns.extension).
  • Non-recursive nested arrays (mcp_*, testSuites, …) bounded by entity size + nested_objects.limit + leaf ignore_above.

Tests

  • SearchIndexSettingsTest (25) — per-type hardening across all 16 field types, multi-fields, flattened + column-level extension, no-override, settings-limit injection
  • OsUtilsTest (+1) — flat_object transform strips ES-only params
  • ColumnIndexLimitTest (4), SchemaFieldFlattenerTest (2) — depth + count caps
  • IndexingLimitsIT — against the real engine (both ES and OS via the two CI profiles): over-limit keyword and malformed numbers are rejected on a raw mapping and accepted on a hardened one

Out of scope (follow-ups)

  • DB-backed + UI-editable mappings/limits (Settings page)
  • dynamic: strict (one dynamic:true hole: test_case_resolution_status:testCaseResolutionStatusDetails)
  • completion suggester reserved-char stripping (9 suggest fields)

🤖 Generated with Claude Code

…ing hardening

Documents were being silently rejected by Elasticsearch/OpenSearch (immense-term
on keyword > 32766 bytes, malformed numbers/dates, nested/depth explosion) and
dead-lettered by the retry worker. Root cause: 66% of keyword field defs had no
ignore_above, no numeric/date guards, and unbounded recursive flattening.

Harden mappings once at index creation (zero per-document cost; the engine
enforces the bounds):
- SearchIndexSettings.harden injects ignore_above (keyword + multi-fields +
  flattened), ignore_malformed (numeric/date/boolean), depth_limit (flattened),
  and tunable index.mapping.*.limit guardrails; never overwrites existing values.
- Applied on both the direct createIndex and the index-template paths (ES + OS).
- OsUtils strips ES-only ignore_above/depth_limit when converting flattened to
  flat_object for OpenSearch.

Cap structural explosion at the source (the one thing engines cannot truncate):
- ColumnIndex/ColumnSearchIndex: depth + column-count caps.
- New SchemaFieldFlattener: shared depth + field-count cap for Topic and
  APIEndpoint schemaFields (dedupes two identical copies).

Limits are config-tunable via ElasticSearchConfiguration.searchIndexingLimits
(enableMappingHardening, keywordMaxBytes, mappingDepthLimit, nestedObjectsLimit,
totalFieldsLimit, maxColumns).

Tests: SearchIndexSettingsTest (per-type hardening incl. all 16 field types +
flattened/extension), OsUtilsTest (flat_object strip), ColumnIndexLimitTest,
SchemaFieldFlattenerTest; IndexingLimitsIT proves raw mappings reject and
hardened mappings accept against the real engine (per ES/OS profile).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 3, 2026 15:25
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 3, 2026
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens Elasticsearch/OpenSearch index mappings at creation time (to prevent document rejections) and adds configurable caps to recursive search-document flattening (columns and schemaFields) to avoid structural explosions during indexing.

Changes:

  • Introduces searchIndexingLimits configuration (mapping hardening + engine limit settings + max columns cap).
  • Adds SearchIndexSettings + SearchFieldLimits to inject ignore_above, ignore_malformed, depth_limit, and index.mapping.*.limit settings into mappings on index/template creation (ES + OS).
  • Caps recursive flattening for columns and schemaFields (Topic/APIEndpoint) and adds unit + integration coverage.

Reviewed changes

Copilot reviewed 18 out of 19 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json Adds searchIndexingLimits config block and defaults for mapping/indexing limits.
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexSettings.java Implements create-time mapping hardening (ignore_above / ignore_malformed / mapping limits).
openmetadata-service/src/main/java/org/openmetadata/service/search/SearchFieldLimits.java Resolves limits from config (with defaults) and exposes derived thresholds/caps.
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OsUtils.java Ensures ES-only flattened params are removed when converting to OpenSearch flat_object.
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchIndexManager.java Applies mapping hardening prior to OpenSearch mapping transformation on index creation.
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchGenericManager.java Applies mapping hardening prior to OpenSearch mapping transformation on template creation.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java Applies mapping hardening on Elasticsearch index creation.
openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchGenericManager.java Applies mapping hardening on Elasticsearch index-template creation.
openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/SchemaFieldFlattener.java New shared, bounded flattener for schemaFields used by Topic/APIEndpoint indexing.
openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/TopicIndex.java Switches schemaFields flattening to the shared bounded flattener.
openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/APIEndpointIndex.java Switches schemaFields flattening to the shared bounded flattener.
openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnSearchIndex.java Adds depth + max-columns caps to static column flattening.
openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnIndex.java Adds depth + max-columns caps to column flattening during index-doc building.
openmetadata-service/src/test/java/org/openmetadata/service/search/SearchIndexSettingsTest.java Unit tests for mapping hardening behavior across field types/settings.
openmetadata-service/src/test/java/org/openmetadata/service/search/opensearch/OsUtilsTest.java Verifies ES-only flattened params are stripped for flat_object.
openmetadata-service/src/test/java/org/openmetadata/service/search/indexes/SchemaFieldFlattenerTest.java Verifies schemaFields flattening stops at depth + count caps.
openmetadata-service/src/test/java/org/openmetadata/service/search/indexes/ColumnIndexLimitTest.java Verifies column flattening stops at depth + count caps (interface + static).
openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/IndexingLimitsIT.java Integration test proving hardened mappings accept docs that raw mappings reject.

Comment on lines +129 to +132
private static int clampKeywordBytes(Integer value) {
int resolved = orDefault(value, LUCENE_KEYWORD_MAX_BYTES);
return Math.min(resolved, LUCENE_KEYWORD_MAX_BYTES);
}
Comment on lines +28 to +32
@Execution(ExecutionMode.CONCURRENT)
public class IndexingLimitsIT {

private static final List<String> CREATED_INDICES = new ArrayList<>();

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions github-actions Bot requested a review from a team as a code owner June 3, 2026 15:31
- SearchFieldLimits.loadActive: don't permanently cache defaults when
  IndexMappingLoader isn't initialized yet (would silently ignore configured
  limits for the JVM lifetime); only cache once the config is resolvable.
- SearchFieldLimits.clampKeywordBytes: floor keywordMaxBytes at 4 so
  ignore_above (= value/4) can never be 0 (which would disable keyword indexing);
  schema minimum bumped to 4.
- ColumnIndex / SchemaFieldFlattener: pass the fully-qualified name (not the
  local name) into the recursion so deeply nested (>2 levels) columns/fields get
  correct dotted paths (a.b.c, not b.c).
- IndexingLimitsIT: CopyOnWriteArrayList for CREATED_INDICES (was a plain
  ArrayList mutated under @execution(CONCURRENT)).
- Schema docs: clarify nested_objects.limit rejects (not truncates) and that
  maxColumns also caps schema-field flattening.
- Tests: FQN-path assertions for columns and schema fields; tiny keywordMaxBytes
  ignore_above >= 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 3, 2026 16:11
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 2 comments.

Comment on lines +58 to +62
Column col = columns.get(index);
if (col.getTags() != null) {
tags = col.getTags();
}
String columnName = addFlattenColumn(col, optParentColumn, tags, flattenColumns);
Comment on lines +79 to +83
Field field = fields.get(index);
if (field.getTags() != null) {
tags = field.getTags();
}
String fieldName = addFlattenField(field, optParentField, tags, flattenSchemaFields);
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

✅ TypeScript Types Auto-Updated

The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

Jest test Coverage

UI tests summary

Lines Statements Branches Functions
Coverage: 63%
63.4% (67232/106039) 44.31% (36975/83440) 46.6% (11007/23618)

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 3, 2026

🔴 Playwright Results — 5 failure(s), 12 flaky

✅ 4252 passed · ❌ 5 failed · 🟡 12 flaky · ⏭️ 99 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 299 0 2 4
🟡 Shard 2 803 0 1 9
🔴 Shard 3 800 2 3 8
🔴 Shard 4 842 2 1 22
🟡 Shard 5 720 0 1 47
🔴 Shard 6 788 1 4 9

Genuine Failures (failed on all attempts)

Features/SearchIndexNestedColumns.spec.ts › 25-level oversized nested column indexes and is searchable by its deep column name (shard 3)
Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoBeVisible�[2m(�[22m�[2m)�[22m failed

Locator: locator('[data-testid="global-search-suggestion-box"]')
Expected: visible
Timeout: 15000ms
Error: element(s) not found

Call log:
�[2m  - Expect "toBeVisible" with timeout 15000ms�[22m
�[2m  - waiting for locator('[data-testid="global-search-suggestion-box"]')�[22m

Flow/ExploreAggregationCountsMatching.spec.ts › should verify left panel counts and tab search results for normal search (shard 3)
Error: Tab "topic" search total hits should match the aggregation count

�[2mexpect(�[22m�[31mreceived�[39m�[2m).�[22mtoBe�[2m(�[22m�[32mexpected�[39m�[2m) // Object.is equality�[22m

Expected: �[32m4�[39m
Received: �[31m24�[39m
Pages/CustomProperties.spec.ts › Create custom property and configure search for Dashboard (shard 4)
�[31mTest timeout of 180000ms exceeded.�[39m
Pages/CustomProperties.spec.ts › Create custom property and configure search for Pipeline (shard 4)
�[31mTest timeout of 180000ms exceeded.�[39m
Features/AutoPilot.spec.ts › Create Service and check the AutoPilot status (shard 6)
Error: �[2mexpect(�[22m�[31mlocator�[39m�[2m).�[22mtoBeVisible�[2m(�[22m�[2m)�[22m failed

Locator: locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))
Expected: visible
Timeout: 210000ms
Error: element(s) not found

Call log:
�[2m  - Expect "toBeVisible" with timeout 210000ms�[22m
�[2m  - waiting for locator('[data-testid="success-badge"]').or(locator('[data-testid="warning-badge"]'))�[22m

🟡 12 flaky test(s) (passed on retry)
  • Features/CustomizeDetailPage.spec.ts › Glossary - customization should work (shard 1, 1 retry)
  • Pages/UserCreationWithPersona.spec.ts › Create user with persona and verify on profile (shard 1, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 2 retries)
  • Features/SettingsNavigationPage.spec.ts › should support drag and drop reordering of navigation items (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Features/Tasks/TaskNavigation.spec.ts › navigating to /table/TASK-XXXXX should show 404 (invalid URL pattern) (shard 3, 1 retry)
  • Flow/PersonaDeletionUserProfile.spec.ts › User profile loads correctly before and after persona deletion (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Glossary.spec.ts › Column dropdown drag-and-drop functionality for Glossary Terms table (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

Copilot AI review requested due to automatic review settings June 4, 2026 05:28
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 18 out of 21 changed files in this pull request and generated 2 comments.

Comment on lines +66 to +83
Optional<String> optParentField =
Optional.ofNullable(parentField).filter(Predicate.not(String::isEmpty));
List<TagLabel> tags = new ArrayList<>();
int index = 0;
boolean capReached = false;
while (index < fields.size() && !capReached) {
if (flattenSchemaFields.size() >= limits.getMaxColumns()) {
LOG.warn(
"Reached max indexed schema fields {}; dropping remaining under '{}'",
limits.getMaxColumns(),
parentField);
capReached = true;
} else {
Field field = fields.get(index);
if (field.getTags() != null) {
tags = field.getTags();
}
String fieldName = addFlattenField(field, optParentField, tags, flattenSchemaFields);
Comment on lines 45 to +63
Optional<String> optParentColumn =
Optional.ofNullable(parentColumn).filter(Predicate.not(String::isEmpty));
List<TagLabel> tags = new ArrayList<>();
for (Column col : columns) {
String columnName = col.getName();
if (optParentColumn.isPresent()) {
columnName = FullyQualifiedName.add(optParentColumn.get(), columnName);
}
if (col.getTags() != null) {
tags = col.getTags();
int index = 0;
boolean capReached = false;
while (index < columns.size() && !capReached) {
if (flattenColumns.size() >= limits.getMaxColumns()) {
LOG.warn(
"Reached max indexed columns {}; dropping remaining columns under '{}'",
limits.getMaxColumns(),
parentColumn);
capReached = true;
} else {
Column col = columns.get(index);
if (col.getTags() != null) {
tags = col.getTags();
}
String columnName = addFlattenColumn(col, optParentColumn, tags, flattenColumns);
if (col.getChildren() != null) {
…path

Address PR review: createIndexInternal and the index-template path run
SearchIndexSettings.harden(...), but updateIndex(IndexMapping, content) (PutMapping)
did not, so a mapping update bypassed ignore_above / ignore_malformed / limits.

- ElasticSearchIndexManager.updateIndex / OpenSearchIndexManager.updateIndex now
  harden the mapping content (OS hardens before enrichIndexMappingForOpenSearch).
- IndexingLimitsIT.harden() now also runs OsUtils.enrichIndexMappingForOpenSearch
  on the OpenSearch profile, so the test validates the real OS mapping (e.g. the
  boolean ignore_malformed strip) instead of a mapping OpenSearch would reject.
- SearchIndexSettings javadoc clarifies the OpenSearch boolean ignore_malformed
  caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n-pipeline doc

The per-run application config under pipelineStatuses.config is not searched (only
the derived applicationType is, extracted separately), can be large, and is what
triggered the dynamic-mapping type conflict (string then object) at reindex time.

IngestionPipelineIndex now strips config from the run status before indexing (on a
copy, without mutating the entity), keeping the searchable status fields
(pipelineState, runId, timestamps). This is the root-cause cleanup complementing
the pipelineStatuses dynamic:false guard: smaller docs and nothing free-form to
fail on, while dynamic:false remains the general safety net.

Test: IngestionPipelineIndexTest asserts config is stripped and runId preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 4, 2026 18:05
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

… fix

Removing "dynamic": false from pipelineStatuses (it was breaking things). The
doc-build config strip in IngestionPipelineIndex already removes the free-form
pipelineStatuses.config blob before indexing, so the dynamic-mapping type conflict
cannot occur and dynamic:false is unnecessary here. Also removes the now-moot
repro test and the SCHEMA_INDEXING_SAFETY.md planning doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

❌ PR checklist incomplete

This PR cannot be merged until the following are addressed on its linked issue:

  • No GitHub issue is linked. Add a closing reference such as Fixes #12345 to the PR description (accepted keywords: Fixes, Closes, Resolves).

The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit) — issue/project changes do not re-trigger it automatically.

Maintainers can bypass this check by adding the skip-pr-checks label.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 5, 2026

Code Review ✅ Approved 2 resolved / 2 findings

Hardens search index mappings with engine-native limits and depth caps to prevent document rejections, resolving issues with the PutMapping path and static configuration caching. No issues found.

✅ 2 resolved
Bug: SearchFieldLimits caches defaults permanently if loaded pre-init

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/SearchFieldLimits.java:89-103
loadActive() unconditionally assigns active = result even when IndexMappingLoader.getInstance() throws IllegalStateException (not yet initialized). Because active() returns the cached value forever after first invocation, any call to SearchFieldLimits.active() that happens before IndexMappingLoader.init() permanently caches the hardcoded defaults(), after which the operator-configured searchIndexingLimits (keywordMaxBytes, depth/nested/total limits, maxColumns) are silently ignored for the rest of the JVM's lifetime. There is no production code that calls setActive(...) with the resolved configuration, so correctness relies entirely on active() never being invoked before the loader is initialized. Consider not caching when the loader is uninitialized (so a later call can pick up the real configuration), or explicitly calling setActive(SearchFieldLimits.from(config)) during search startup.

Bug: updateIndex (PutMapping) path bypasses mapping hardening

📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java:83-97 📄 openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java:150-164
createIndexInternal now runs SearchIndexSettings.harden(...) and the index-template path is hardened too, but updateIndex(IndexMapping, indexMappingContent) sends the raw indexMappingContent straight to PutMappingRequest without hardening. updateIndex is reached at runtime via SearchRepository.updateIndexElasticSearchClient.updateIndex (and the OpenSearch equivalent), which is exactly the path that adds new fields to existing indexes during schema/version changes. New keyword fields added through this path receive no ignore_above, and new numeric/date fields receive no ignore_malformed, so the very immense-term / malformed-value document rejections this PR sets out to prevent can be reintroduced on any index that is updated rather than freshly created. This is an exhaustiveness gap relative to the PR's stated coverage. Recommend hardening the field definitions in the update path as well (a field-only variant, since PutMapping accepts mapping properties rather than index settings).

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

@mohityadav766 mohityadav766 mentioned this pull request Jun 5, 2026
12 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants