fix(search): prevent ES/OS document rejections via engine-native mapping hardening #28671
mohityadav766 wants to merge 15 commits into
…ing hardening

Documents were being silently rejected by Elasticsearch/OpenSearch (immense-term on keyword > 32766 bytes, malformed numbers/dates, nested/depth explosion) and dead-lettered by the retry worker. Root causes: 66% of keyword field definitions had no ignore_above, numeric/date fields had no guards, and recursive flattening was unbounded.

Harden mappings once at index creation (zero per-document cost; the engine enforces the bounds):
- SearchIndexSettings.harden injects ignore_above (keyword + multi-fields + flattened), ignore_malformed (numeric/date/boolean), depth_limit (flattened), and tunable index.mapping.*.limit guardrails; never overwrites existing values.
- Applied on both the direct createIndex and the index-template paths (ES + OS).
- OsUtils strips ES-only ignore_above/depth_limit when converting flattened to flat_object for OpenSearch.

Cap structural explosion at the source (the one thing engines cannot truncate):
- ColumnIndex/ColumnSearchIndex: depth + column-count caps.
- New SchemaFieldFlattener: shared depth + field-count cap for Topic and APIEndpoint schemaFields (dedupes two identical copies).

Limits are config-tunable via ElasticSearchConfiguration.searchIndexingLimits (enableMappingHardening, keywordMaxBytes, mappingDepthLimit, nestedObjectsLimit, totalFieldsLimit, maxColumns).

Tests: SearchIndexSettingsTest (per-type hardening incl. all 16 field types + flattened/extension), OsUtilsTest (flat_object strip), ColumnIndexLimitTest, SchemaFieldFlattenerTest; IndexingLimitsIT proves raw mappings reject and hardened mappings accept against the real engine (per ES/OS profile).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
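
The create-time hardening described in this commit message can be pictured with a minimal sketch. It is not the PR's `SearchIndexSettings.harden` implementation: the class name, the method signature, and the list of types that receive `ignore_malformed` are assumptions made for illustration, and the mapping is modelled as plain nested maps.

```java
import java.util.Map;
import java.util.Set;

/** Illustrative sketch only; not the PR's SearchIndexSettings implementation. */
final class MappingHardeningSketch {
  // Assumed set of types that get ignore_malformed; the PR's actual list may differ.
  private static final Set<String> MALFORMED_TOLERANT =
      Set.of("long", "integer", "short", "byte", "double", "float", "date", "boolean");

  /** Hardens a mapping "properties" map in place; existing values are never overwritten. */
  @SuppressWarnings("unchecked")
  static void harden(Map<String, Object> properties, int ignoreAbove, int depthLimit) {
    for (Object value : properties.values()) {
      if (!(value instanceof Map)) {
        continue;
      }
      Map<String, Object> field = (Map<String, Object>) value;
      Object type = field.get("type");
      if ("keyword".equals(type)) {
        field.putIfAbsent("ignore_above", ignoreAbove);
      } else if ("flattened".equals(type)) {
        field.putIfAbsent("ignore_above", ignoreAbove);
        field.putIfAbsent("depth_limit", depthLimit);
      } else if (type instanceof String && MALFORMED_TOLERANT.contains(type)) {
        field.putIfAbsent("ignore_malformed", true);
      }
      // Multi-fields ("fields") and sub-objects ("properties") have the same shape.
      for (String nestedKey : new String[] {"fields", "properties"}) {
        if (field.get(nestedKey) instanceof Map) {
          harden((Map<String, Object>) field.get(nestedKey), ignoreAbove, depthLimit);
        }
      }
    }
  }
}
```

Using `putIfAbsent` is what gives the "never overwrites existing values" behaviour: a field that already declares `ignore_above: 256` keeps it.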
Pull request overview
This PR hardens Elasticsearch/OpenSearch index mappings at creation time (to prevent document rejections) and adds configurable caps to recursive search-document flattening (columns and schemaFields) to avoid structural explosions during indexing.
Changes:
- Introduces `searchIndexingLimits` configuration (mapping hardening + engine limit settings + max columns cap).
- Adds `SearchIndexSettings` + `SearchFieldLimits` to inject `ignore_above`, `ignore_malformed`, `depth_limit`, and `index.mapping.*.limit` settings into mappings on index/template creation (ES + OS).
- Caps recursive flattening for `columns` and `schemaFields` (Topic/APIEndpoint) and adds unit + integration coverage.
Reviewed changes
Copilot reviewed 18 out of 19 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| openmetadata-spec/src/main/resources/json/schema/configuration/elasticSearchConfiguration.json | Adds searchIndexingLimits config block and defaults for mapping/indexing limits. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchIndexSettings.java | Implements create-time mapping hardening (ignore_above / ignore_malformed / mapping limits). |
| openmetadata-service/src/main/java/org/openmetadata/service/search/SearchFieldLimits.java | Resolves limits from config (with defaults) and exposes derived thresholds/caps. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OsUtils.java | Ensures ES-only flattened params are removed when converting to OpenSearch flat_object. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchIndexManager.java | Applies mapping hardening prior to OpenSearch mapping transformation on index creation. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchGenericManager.java | Applies mapping hardening prior to OpenSearch mapping transformation on template creation. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchIndexManager.java | Applies mapping hardening on Elasticsearch index creation. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/elasticsearch/ElasticSearchGenericManager.java | Applies mapping hardening on Elasticsearch index-template creation. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/SchemaFieldFlattener.java | New shared, bounded flattener for schemaFields used by Topic/APIEndpoint indexing. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/TopicIndex.java | Switches schemaFields flattening to the shared bounded flattener. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/APIEndpointIndex.java | Switches schemaFields flattening to the shared bounded flattener. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnSearchIndex.java | Adds depth + max-columns caps to static column flattening. |
| openmetadata-service/src/main/java/org/openmetadata/service/search/indexes/ColumnIndex.java | Adds depth + max-columns caps to column flattening during index-doc building. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/SearchIndexSettingsTest.java | Unit tests for mapping hardening behavior across field types/settings. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/opensearch/OsUtilsTest.java | Verifies ES-only flattened params are stripped for flat_object. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/indexes/SchemaFieldFlattenerTest.java | Verifies schemaFields flattening stops at depth + count caps. |
| openmetadata-service/src/test/java/org/openmetadata/service/search/indexes/ColumnIndexLimitTest.java | Verifies column flattening stops at depth + count caps (interface + static). |
| openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/IndexingLimitsIT.java | Integration test proving hardened mappings accept docs that raw mappings reject. |

    private static int clampKeywordBytes(Integer value) {
      int resolved = orDefault(value, LUCENE_KEYWORD_MAX_BYTES);
      return Math.min(resolved, LUCENE_KEYWORD_MAX_BYTES);
    }

    @Execution(ExecutionMode.CONCURRENT)
    public class IndexingLimitsIT {

      private static final List<String> CREATED_INDICES = new ArrayList<>();

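
The `CREATED_INDICES` declaration in the review context is a plain `ArrayList` in a test class that runs with `ExecutionMode.CONCURRENT`, so concurrent test methods may append to it at the same time. A minimal sketch of a thread-safe registry follows; the class and method names are hypothetical and only illustrate the `CopyOnWriteArrayList` approach.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

/** Illustrative sketch only; names are hypothetical. */
final class CreatedIndicesSketch {
  // CopyOnWriteArrayList makes each add() atomic, so concurrent tests cannot corrupt the list.
  private static final List<String> CREATED_INDICES = new CopyOnWriteArrayList<>();

  /** Records an index name so a cleanup step can delete it later. */
  static void register(String indexName) {
    CREATED_INDICES.add(indexName);
  }

  /** Returns an immutable snapshot that is safe to iterate during cleanup. */
  static List<String> snapshot() {
    return List.copyOf(CREATED_INDICES);
  }
}
```

Copy-on-write suits this case because the list is written rarely (once per created index) and read once at teardown.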
✅ TypeScript Types Auto-Updated
The generated TypeScript types have been automatically updated based on JSON schema changes in this PR.
- SearchFieldLimits.loadActive: don't permanently cache defaults when IndexMappingLoader isn't initialized yet (would silently ignore configured limits for the JVM lifetime); only cache once the config is resolvable.
- SearchFieldLimits.clampKeywordBytes: floor keywordMaxBytes at 4 so ignore_above (= value/4) can never be 0 (which would disable keyword indexing); schema minimum bumped to 4.
- ColumnIndex / SchemaFieldFlattener: pass the fully-qualified name (not the local name) into the recursion so deeply nested (>2 levels) columns/fields get correct dotted paths (a.b.c, not b.c).
- IndexingLimitsIT: CopyOnWriteArrayList for CREATED_INDICES (was a plain ArrayList mutated under @Execution(CONCURRENT)).
- Schema docs: clarify nested_objects.limit rejects (not truncates) and that maxColumns also caps schema-field flattening.
- Tests: FQN-path assertions for columns and schema fields; a tiny keywordMaxBytes still yields ignore_above >= 1.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
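
The `clampKeywordBytes` fix in this commit can be sketched as below. This is a reconstruction from the commit message, not the PR's code: `MIN_KEYWORD_BYTES` and `ignoreAbove` are hypothetical names. The division by 4 reflects that `ignore_above` counts characters while the Lucene limit is in bytes, and a UTF-8 character can take up to 4 bytes.

```java
/** Illustrative sketch only; constant and method names beyond clampKeywordBytes are assumed. */
final class KeywordLimitSketch {
  static final int LUCENE_KEYWORD_MAX_BYTES = 32766;
  static final int MIN_KEYWORD_BYTES = 4; // hypothetical name for the floor

  /** Clamps the configured byte budget into [4, 32766]; null falls back to the maximum. */
  static int clampKeywordBytes(Integer value) {
    int resolved = value != null ? value : LUCENE_KEYWORD_MAX_BYTES;
    return Math.max(MIN_KEYWORD_BYTES, Math.min(resolved, LUCENE_KEYWORD_MAX_BYTES));
  }

  /** Character-based ignore_above derived from the byte budget; always at least 1. */
  static int ignoreAbove(Integer keywordMaxBytes) {
    return clampKeywordBytes(keywordMaxBytes) / 4;
  }
}
```

With no configured value this yields 32766 / 4 = 8191, and a configured value of 1 is raised to 4 so the result is 1 rather than 0.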

    Column col = columns.get(index);
    if (col.getTags() != null) {
      tags = col.getTags();
    }
    String columnName = addFlattenColumn(col, optParentColumn, tags, flattenColumns);

    Field field = fields.get(index);
    if (field.getTags() != null) {
      tags = field.getTags();
    }
    String fieldName = addFlattenField(field, optParentField, tags, flattenSchemaFields);

🔴 Playwright Results: 5 failure(s), 12 flaky
✅ 4252 passed · ❌ 5 failed · 🟡 12 flaky · ⏭️ 99 skipped
Genuine Failures (failed on all attempts) ❌

    Optional<String> optParentField =
        Optional.ofNullable(parentField).filter(Predicate.not(String::isEmpty));
    List<TagLabel> tags = new ArrayList<>();
    int index = 0;
    boolean capReached = false;
    while (index < fields.size() && !capReached) {
      if (flattenSchemaFields.size() >= limits.getMaxColumns()) {
        LOG.warn(
            "Reached max indexed schema fields {}; dropping remaining under '{}'",
            limits.getMaxColumns(),
            parentField);
        capReached = true;
      } else {
        Field field = fields.get(index);
        if (field.getTags() != null) {
          tags = field.getTags();
        }
        String fieldName = addFlattenField(field, optParentField, tags, flattenSchemaFields);


      Optional<String> optParentColumn =
          Optional.ofNullable(parentColumn).filter(Predicate.not(String::isEmpty));
      List<TagLabel> tags = new ArrayList<>();
    - for (Column col : columns) {
    -   String columnName = col.getName();
    -   if (optParentColumn.isPresent()) {
    -     columnName = FullyQualifiedName.add(optParentColumn.get(), columnName);
    -   }
    -   if (col.getTags() != null) {
    -     tags = col.getTags();
    + int index = 0;
    + boolean capReached = false;
    + while (index < columns.size() && !capReached) {
    +   if (flattenColumns.size() >= limits.getMaxColumns()) {
    +     LOG.warn(
    +         "Reached max indexed columns {}; dropping remaining columns under '{}'",
    +         limits.getMaxColumns(),
    +         parentColumn);
    +     capReached = true;
    +   } else {
    +     Column col = columns.get(index);
    +     if (col.getTags() != null) {
    +       tags = col.getTags();
    +     }
    +     String columnName = addFlattenColumn(col, optParentColumn, tags, flattenColumns);
          if (col.getChildren() != null) {

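
The bounded flattening in the fragment above can be reduced to a small self-contained sketch. It is not the PR's `ColumnIndex` or `SchemaFieldFlattener` code: `Node` is a hypothetical stand-in for `Column`/`Field`, names are joined with a plain dot (the real code uses `FullyQualifiedName.add`), and the caps are passed as parameters instead of being read from `limits`. It shows the two properties the review asked for: recursion carries the fully-qualified parent name, and both depth and total count are capped.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; Node is a hypothetical stand-in for Column/Field. */
final class BoundedFlattenSketch {
  static final class Node {
    final String name;
    final List<Node> children;

    Node(String name, List<Node> children) {
      this.name = name;
      this.children = children;
    }
  }

  /** Flattens the tree into dotted paths, honouring a depth cap and a total-count cap. */
  static List<String> flatten(List<Node> roots, int maxDepth, int maxCount) {
    List<String> paths = new ArrayList<>();
    flatten(roots, "", 1, maxDepth, maxCount, paths);
    return paths;
  }

  private static void flatten(
      List<Node> nodes, String parentFqn, int depth, int maxDepth, int maxCount,
      List<String> paths) {
    if (nodes == null || depth > maxDepth) {
      return; // depth cap: deeper levels are not indexed
    }
    for (Node node : nodes) {
      if (paths.size() >= maxCount) {
        return; // count cap: drop the remaining siblings
      }
      // Pass the fully-qualified name down so a.b.c is not emitted as b.c.
      String fqn = parentFqn.isEmpty() ? node.name : parentFqn + "." + node.name;
      paths.add(fqn);
      flatten(node.children, fqn, depth + 1, maxDepth, maxCount, paths);
    }
  }
}
```

Because the count check runs before every add, the output can never exceed `maxCount` regardless of how wide or deep the input is.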
…path

Address PR review: createIndexInternal and the index-template path run SearchIndexSettings.harden(...), but updateIndex(IndexMapping, content) (PutMapping) did not, so a mapping update bypassed ignore_above / ignore_malformed / limits.
- ElasticSearchIndexManager.updateIndex / OpenSearchIndexManager.updateIndex now harden the mapping content (OS hardens before enrichIndexMappingForOpenSearch).
- IndexingLimitsIT.harden() now also runs OsUtils.enrichIndexMappingForOpenSearch on the OpenSearch profile, so the test validates the real OS mapping (e.g. the boolean ignore_malformed strip) instead of a mapping OpenSearch would reject.
- SearchIndexSettings javadoc clarifies the OpenSearch boolean ignore_malformed caveat.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
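
The OpenSearch step that runs after hardening can be sketched as follows. This is an assumption-laden illustration, not `OsUtils.enrichIndexMappingForOpenSearch` itself: the class and method names are hypothetical, and the only behaviour modelled is what the commit messages state, namely that `flattened` becomes `flat_object` without the ES-only `ignore_above`/`depth_limit`, and that `ignore_malformed` is stripped from boolean fields.

```java
import java.util.Map;

/** Illustrative sketch only; not the PR's OsUtils implementation. */
final class FlatObjectSketch {
  /** Rewrites hardened ES-style "properties" in place into a form OpenSearch accepts. */
  @SuppressWarnings("unchecked")
  static void toOpenSearch(Map<String, Object> properties) {
    for (Object value : properties.values()) {
      if (!(value instanceof Map)) {
        continue;
      }
      Map<String, Object> field = (Map<String, Object>) value;
      Object type = field.get("type");
      if ("flattened".equals(type)) {
        field.put("type", "flat_object");
        field.remove("ignore_above"); // ES-only parameter on flattened
        field.remove("depth_limit"); // ES-only parameter on flattened
      } else if ("boolean".equals(type)) {
        field.remove("ignore_malformed"); // stripped for OpenSearch booleans, per the commit
      }
      if (field.get("properties") instanceof Map) {
        toOpenSearch((Map<String, Object>) field.get("properties"));
      }
    }
  }
}
```

Running hardening first and this conversion second keeps a single hardening routine for both engines, with the engine-specific differences confined to one pass.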
…n-pipeline doc

The per-run application config under pipelineStatuses.config is not searched (only the derived applicationType is, extracted separately), can be large, and is what triggered the dynamic-mapping type conflict (string then object) at reindex time. IngestionPipelineIndex now strips config from the run status before indexing (on a copy, without mutating the entity), keeping the searchable status fields (pipelineState, runId, timestamps).

This is the root-cause cleanup complementing the pipelineStatuses dynamic:false guard: smaller docs and nothing free-form to fail on, while dynamic:false remains the general safety net.

Test: IngestionPipelineIndexTest asserts config is stripped and runId preserved.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
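
Stripping `config` on a copy, as this commit describes, might look like the sketch below. It models the run status as a plain map, which is an assumption; the real `IngestionPipelineIndex` presumably works on a typed status object, and the class and method names here are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; the run status is modelled as a plain map. */
final class PipelineStatusSketch {
  /** Returns a shallow copy of the status without the free-form "config" entry. */
  static Map<String, Object> withoutConfig(Map<String, Object> status) {
    if (status == null) {
      return null;
    }
    Map<String, Object> copy = new LinkedHashMap<>(status); // the entity's map is left untouched
    copy.remove("config");
    return copy;
  }
}
```

Copying before removing matters because the same entity object may still be served through other paths after the search document is built.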
… fix

Removing "dynamic": false from pipelineStatuses (it was breaking things). The doc-build config strip in IngestionPipelineIndex already removes the free-form pipelineStatuses.config blob before indexing, so the dynamic-mapping type conflict cannot occur and dynamic:false is unnecessary here. Also removes the now-moot repro test and the SCHEMA_INDEXING_SAFETY.md planning doc.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
❌ PR checklist incomplete
This PR cannot be merged until the following are addressed on its linked issue:
The fields live on the linked issue in the Shipping project (open the issue → right sidebar → Projects). After you set them, re-run this check (or push a commit); issue/project changes do not re-trigger it automatically. Maintainers can bypass this check by adding the
Code Review ✅ Approved (2 resolved / 2 findings)
Hardens search index mappings with engine-native limits and depth caps to prevent document rejections, resolving issues with the PutMapping path and static configuration caching. No issues found.
✅ 2 resolved
- ✅ Bug: SearchFieldLimits caches defaults permanently if loaded pre-init
- ✅ Bug: updateIndex (PutMapping) path bypasses mapping hardening
Problem
Search documents were being silently rejected by Elasticsearch/OpenSearch and dead-lettered by `SearchIndexRetryWorker` (4xx marked non-retryable). Root causes found in the mappings:
- Missing `ignore_above` on keyword fields → a value > 32,766 bytes throws an immense-term `IllegalArgumentException` and the whole document is rejected.
- Unbounded recursive flattening of `columns` and `schemaFields` → deep/wide structures explode the document.

Approach: engine-native hardening (zero per-document cost)
Harden the mapping once at index creation and let the engine enforce the bounds, instead of walking every document at index time (doesn't scale to large docs).
`SearchIndexSettings.harden(content, limits)`:
- `ignore_above` (= byte-safe 8,191 = keywordMaxBytes/4) on keyword fields + keyword multi-fields + `flattened`
- `ignore_malformed` on numeric/date/boolean
- `depth_limit` on `flattened`
- `index.mapping.{depth,nested_objects,total_fields}.limit`

Applied on both the direct `createIndex` and the index-template paths, for ES and OS. `OsUtils` strips the ES-only `ignore_above`/`depth_limit` when converting `flattened` → `flat_object`.

Structural explosion (the one thing engines can't truncate gracefully) is capped at the source:
- `ColumnIndex`/`ColumnSearchIndex`: depth + column-count caps
- `SchemaFieldFlattener`: shared depth + field-count cap for Topic & APIEndpoint `schemaFields` (also dedupes two identical copies)

Limits are config-tunable via `ElasticSearchConfiguration.searchIndexingLimits` (`enableMappingHardening`, `keywordMaxBytes`, `mappingDepthLimit`, `nestedObjectsLimit`, `totalFieldsLimit`, `maxColumns`).

Coverage (exhaustive)
- `columns`, `schemaFields`: both capped.
- `flattened` fields hardened (`extension`, `columns.extension`, `dataModel.columns.extension`).
- `mcp_*`, `testSuites`, …: bounded by entity size + `nested_objects.limit` + leaf `ignore_above`.

Tests
- `SearchIndexSettingsTest` (25): per-type hardening across all 16 field types, multi-fields, flattened + column-level extension, no-override, settings-limit injection
- `OsUtilsTest` (+1): `flat_object` transform strips ES-only params
- `ColumnIndexLimitTest` (4), `SchemaFieldFlattenerTest` (2): depth + count caps
- `IndexingLimitsIT`: against the real engine (both ES and OS via the two CI profiles): over-limit keyword and malformed numbers are rejected on a raw mapping and accepted on a hardened one

Out of scope (follow-ups)
- `dynamic: strict` (one `dynamic:true` hole: `test_case_resolution_status:testCaseResolutionStatusDetails`)
- `suggest` fields

🤖 Generated with Claude Code