Fixes #28138: Pass selective fields instead of "*" in batch entity fetches to prevent OOM #28151

Merged
harshach merged 2 commits into main from fix-did-field on May 18, 2026
@mohityadav766 mohityadav766 commented May 15, 2026

Fixes #28138

I worked on replacing blind "*" (all-fields) entity fetches with selective field lists across background apps, workflows, and governance tasks because hydrating every field for entities processed in bulk, including heavy embedded arrays like columns, tags, followers, owners, lineage, sampleData, and changeDescription, causes excessive memory consumption and OOMs.

This continues the earlier selective-fields work: DataInsightsApp and SearchIndexApp already pass selective fields, but several adjacent batch processes still passed "*", so the OOM symptoms persisted. This change closes those remaining gaps.

Tier 1 — reuse ReindexingUtil.getSearchIndexFields(entityType) (the same per-entity allow-list SearchIndexApp uses):

  • DataAssetsWorkflow — Data Insights pipeline over all 14 asset types
  • SearchIndexRetryWorker — cascade-reindex retry path
  • SinkTaskDelegate — governance sink, batch and single-entity modes
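
The Tier 1 call sites share one pattern: look up a per-entity-type allow-list and pass it where "*" used to go. A minimal sketch of that pattern is below. The class FieldAllowList, its map contents, and the fieldsFor method are hypothetical stand-ins for illustration; the real ReindexingUtil.getSearchIndexFields(entityType) may build its lists differently, and the empty-list fallback for unknown types is an assumption of this sketch.

```java
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a per-entity-type field allow-list.
// The entries below are illustrative, not OpenMetadata's actual lists.
final class FieldAllowList {
  private static final Map<String, List<String>> ALLOWED =
      Map.of(
          "table", List.of("owners", "tags", "columns"),
          "dashboard", List.of("owners", "tags", "charts"));

  private FieldAllowList() {}

  // Returns a comma-separated field list for the entity type.
  // Assumption: unknown types get an empty list (base-row fields only)
  // instead of falling back to "*".
  static String fieldsFor(String entityType) {
    return String.join(",", ALLOWED.getOrDefault(entityType, List.of()));
  }

  public static void main(String[] args) {
    for (String type : List.of("table", "dashboard", "unknownType")) {
      System.out.println(type + " -> [" + fieldsFor(type) + "]");
    }
  }
}
```

Keeping the lookup in one place means every batch consumer narrows its fetch the same way, so a field added to the allow-list reaches all of them at once.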

Tier 2 — minimal hand-picked field lists (tied 1:1 to the getters each consumer actually calls):

  • CostAnalysisWorkflow — DatabaseService → none, Table → lifeCycle
  • ApplicationContext — app startup list → pipelines only
  • SetEntityCertificationImpl → certification
  • CheckChangeDescriptionTaskImpl → none (base-row field)
  • SetGlossaryTermStatusImpl → none (base-row field)
  • RollbackEntityImpl → none (uses repository.getVersion() for the payload)
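
To illustrate why "none" is enough for a consumer that only reads a base-row field, here is a toy model of selective hydration. Everything in it (SelectiveFetchSketch, fetch, the field names chosen as relations, the load counter) is hypothetical and only mimics the idea; it is not the EntityRepository implementation.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Toy model of selective hydration. All names here are hypothetical.
final class SelectiveFetchSketch {
  // Fields that cost an extra lookup each; loaded only when requested.
  static final List<String> RELATION_FIELDS =
      List.of("certification", "columns", "tags", "followers");

  // Counts the extra lookups performed, as a rough proxy for cost.
  static int relationLoads = 0;

  static Map<String, Object> fetch(String id, List<String> fields) {
    Map<String, Object> entity = new LinkedHashMap<>();
    // Base-row fields are always present, whatever the field list says.
    entity.put("id", id);
    entity.put("changeDescription", "placeholder change description");
    boolean all = fields.contains("*");
    for (String relation : RELATION_FIELDS) {
      if (all || fields.contains(relation)) {
        relationLoads++;
        entity.put(relation, "loaded:" + relation);
      }
    }
    return entity;
  }

  public static void main(String[] args) {
    fetch("1", List.of("*"));
    System.out.println("extra loads with \"*\": " + relationLoads);

    relationLoads = 0;
    Map<String, Object> bare = fetch("1", List.of());
    System.out.println("extra loads with no fields: " + relationLoads);
    System.out.println("changeDescription still present: "
        + bare.containsKey("changeDescription"));

    relationLoads = 0;
    fetch("1", List.of("certification"));
    System.out.println("extra loads with certification only: " + relationLoads);
  }
}
```

In this model the per-entity cost scales with the number of requested relation fields, which is the effect the hand-picked lists aim for when the same fetch runs over many entities.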

CheckEntityAttributesImpl, DataCompletenessImpl, SetEntityAttributeImpl, CreateTask, and RdfIndexApp were intentionally left as-is — they genuinely need full entity state
(rule-engine / arbitrary field-path evaluation), or warrant their own analysis (RDF).

Type of change:

  • Improvement — performance / memory reduction
  • Bug fix
  • New feature
  • Breaking change
  • Documentation

High-level design:

N/A — small change.

Tests:

  • Manual testing performed — mvn -pl openmetadata-service compile succeeds; mvn -pl openmetadata-service spotless:check is clean.
  • Use cases covered
  • Unit tests
  • Backend integration tests
  • Ingestion integration tests
  • Playwright (UI) tests

▎ Note: no automated tests were added in this PR. The Tier 1 changes reuse an allow-list already exercised by SearchIndexApp's test coverage. The Tier 2 governance-task
▎ changes (which feed JSON-patch flows) would benefit from an integration test confirming no unintended field clobber — happy to add openmetadata-integration-tests coverage if
▎ reviewers want it before merge.
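
A rough sketch of the property such an integration test could assert, assuming the patch is computed as a diff between the loaded entity and a modified copy of it. The PatchClobberSketch class and its naive top-level diff are hypothetical and much simpler than a real JSON-patch library; they only show when an unloaded field would or would not turn into a remove operation.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;
import java.util.TreeSet;

// Hypothetical illustration of the "no unintended field clobber" check.
final class PatchClobberSketch {
  // Naive top-level diff producing JSON-Patch-like operation strings,
  // ordered by field name.
  static List<String> diff(Map<String, Object> original, Map<String, Object> updated) {
    List<String> ops = new ArrayList<>();
    TreeSet<String> keys = new TreeSet<>(original.keySet());
    keys.addAll(updated.keySet());
    for (String key : keys) {
      boolean inOriginal = original.containsKey(key);
      boolean inUpdated = updated.containsKey(key);
      if (inOriginal && !inUpdated) {
        ops.add("remove /" + key);
      } else if (!inOriginal) {
        ops.add("add /" + key);
      } else if (!Objects.equals(original.get(key), updated.get(key))) {
        ops.add("replace /" + key);
      }
    }
    return ops;
  }

  public static void main(String[] args) {
    // Entity loaded with only "certification"; "tags" was never hydrated.
    Map<String, Object> original = new HashMap<>();
    original.put("name", "orders");
    original.put("certification", "Bronze");

    // The update is derived from the same partially loaded object.
    Map<String, Object> updated = new HashMap<>(original);
    updated.put("certification", "Gold");

    // Both sides lack "tags", so only the intended change appears.
    System.out.println(diff(original, updated)); // [replace /certification]

    // Mixing a fully loaded original with a partial update is the risky case.
    Map<String, Object> fullyLoaded = new HashMap<>(original);
    fullyLoaded.put("tags", List.of("PII"));
    System.out.println(diff(fullyLoaded, updated)); // [replace /certification, remove /tags]
  }
}
```

Under that assumption, the selective fetch is safe as long as both sides of the diff come from the same load; a test would pin this by checking that the generated patch touches only the field the task sets.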

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes :
  • My PR is linked to a GitHub issue via Fixes [Bug]: OutOfMemoryError (Java heap space) in scheduled Data Insights application #28138 above.
  • I have commented on my code, particularly in hard-to-understand areas. — N/A, no comments needed; changes are self-explanatory.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed. — N/A, no schema changes.
  • For UI changes: I attached a screen recording and/or screenshots above. — N/A, no UI changes.
  • I have added tests (unit / integration / Playwright as applicable) and listed them above. — see Tests note above.

@mohityadav766 mohityadav766 self-assigned this May 15, 2026
Copilot AI review requested due to automatic review settings May 15, 2026 14:06
@github-actions github-actions Bot added the backend and safe to test labels May 15, 2026

Copilot AI left a comment

Pull request overview

This PR replaces ad-hoc "*" fields arguments at several Entity.getEntity / EntityRepository.listAfter / PaginatedEntitiesSource call sites with explicit, narrower field selections. The goal is to stop fetching every relation/extension field when only a small subset is actually consumed, reducing DB joins and serialization cost in workflow, ingestion, and app-context paths.

Changes:

  • Governance workflow delegates (SinkTaskDelegate, SetGlossaryTermStatusImpl, SetEntityCertificationImpl, RollbackEntityImpl, CheckChangeDescriptionTaskImpl) now request either an empty/specific field set or ReindexingUtil.getSearchIndexFields(entityType) instead of "*".
  • Insights workflows (DataAssetsWorkflow, CostAnalysisWorkflow) now use the entity-specific search-index fields helper (and lifeCycle only for tables, no extra fields for the database service listing).
  • SearchIndexRetryWorker.reindexEntityCascade and ApplicationContext.initialize switch from "*" to targeted fields (getSearchIndexFields(...) and "pipelines" respectively).

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 1 comment.

Summary per file:

| File | Description |
| --- | --- |
| openmetadata-service/.../search/SearchIndexRetryWorker.java | Use per-type search-index fields when re-fetching entities for cascade reindex. |
| openmetadata-service/.../automatedTask/sink/SinkTaskDelegate.java | Replace "*" with ReindexingUtil.getSearchIndexFields(...) in batch and single-entity sink paths. |
| openmetadata-service/.../automatedTask/impl/SetGlossaryTermStatusImpl.java | Load glossary term with no extra fields prior to status patch. |
| openmetadata-service/.../automatedTask/impl/SetEntityCertificationImpl.java | Load entity with only certification field for the patch. |
| openmetadata-service/.../automatedTask/impl/RollbackEntityImpl.java | Load current entity without extra fields (full version is reloaded later via getVersion). |
| openmetadata-service/.../automatedTask/impl/CheckChangeDescriptionTaskImpl.java | Load entity without extra fields; only changeDescription is consumed. |
| openmetadata-service/.../insights/workflows/dataAssets/DataAssetsWorkflow.java | Use getSearchIndexFields(entityType) for paginated source fields. |
| openmetadata-service/.../insights/workflows/costAnalysis/CostAnalysisWorkflow.java | Database services pull no extra fields; tables pull only lifeCycle. |
| openmetadata-service/.../apps/ApplicationContext.java | List installed apps with only the pipelines field instead of all fields. |

github-actions Bot commented May 15, 2026

🟡 Playwright Results — all passed (11 flaky)

✅ 4110 passed · ❌ 0 failed · 🟡 11 flaky · ⏭️ 86 skipped

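| Shard | Passed | Failed | Flaky | Skipped |
| --- | --- | --- | --- | --- |
| ✅ Shard 1 | 299 | 0 | 0 | 4 |
| 🟡 Shard 2 | 770 | 0 | 4 | 8 |
| 🟡 Shard 3 | 781 | 0 | 3 | 7 |
| ✅ Shard 4 | 816 | 0 | 0 | 18 |
| 🟡 Shard 5 | 708 | 0 | 1 | 41 |
| 🟡 Shard 6 | 736 | 0 | 3 | 8 |
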
🟡 11 flaky test(s) (passed on retry)
  • Features/ActivityAPI.spec.ts › creates an activity event when tags are added (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should start term as Draft when glossary has reviewers (shard 2, 1 retry)
  • Features/IncidentManager.spec.ts › Next, Previous and page indicator (shard 2, 1 retry)
  • Features/KnowledgeCenter.spec.ts › Article mentions in description should working for Knowledge Center (shard 2, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Flow/PersonaFlow.spec.ts › Set default persona for team should work properly (shard 3, 1 retry)
  • Pages/Entity.spec.ts › Tier Add, Update and Remove (shard 5, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage schema filter selection (shard 6, 1 retry)
  • Pages/ServiceEntity.spec.ts › Tier Add, Update and Remove (shard 6, 1 retry)
  • Pages/Users.spec.ts › Update token expiration for Data Consumer (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

@mohityadav766 mohityadav766 added the To release label May 17, 2026
@mohityadav766 mohityadav766 changed the title Remove * callsites → Fixes #28138: Pass selective fields instead of "*" in batch entity fetches to prevent OOM May 17, 2026
gitar-bot Bot commented May 17, 2026

Code Review ✅ Approved

Replaces wildcard field selectors with specific fields across multiple workflows and search indexers to minimize unnecessary data fetching. No issues found.

@harshach harshach merged commit 71893d5 into main May 18, 2026
54 checks passed
@harshach harshach deleted the fix-did-field branch May 18, 2026 05:50
@github-actions

Failed to cherry-pick changes to the 1.12.9 branch.
Please cherry-pick the changes manually.
You can find more details here.

@github-actions

Changes have been cherry-picked to the 1.13 branch.

github-actions Bot pushed a commit that referenced this pull request May 18, 2026
(cherry picked from commit 71893d5)
@mohityadav766 mohityadav766 moved this to Done ✅ in Shipping May 18, 2026
Labels

backend, safe to test, To release

Projects

Status: Done ✅

Development

Successfully merging this pull request may close these issues.

[Bug]: OutOfMemoryError (Java heap space) in scheduled Data Insights application

3 participants