fix: memory hardening to prevent OOMKill under concurrent load by mohityadav766 · Pull Request #27397 · open-metadata/OpenMetadata

mohityadav766 · 2026-04-15T13:32:13Z

Summary

Convert 6 Guava caches from count-based (maximumSize) to weight-based (maximumWeight) eviction, capping total heap at defined MB limits instead of allowing unbounded growth with large entity JSON
Bound unbounded queues (EntityLifecycleEventDispatcher, EventPubSub) and add CallerRunsPolicy backpressure
Cap per-request entity cache at 50 LRU entries, strip full entities from ChangeEvents to lightweight refs, add LIMIT to unbounded SQL queries, set 50MB JSON input size limit
Fix WebSocket connection map leak, reduce BULK_JOBS retention, bump default heap to 2G with G1GC

Context

v1.12.4 deployment has been crashing repeatedly due to OOMKill. Root cause is memory amplification when concurrent ingestion clients + usage pipeline fire simultaneously — unbounded caches fill with large Table JSON, deep-copy operations multiply memory per request, and unbounded queues accumulate under GC pressure.

Test plan

Verify CACHE_WITH_ID/CACHE_WITH_NAME evict large entries before reaching 200MB
Verify ChangeEvent storage contains lightweight entity ref, not full entity
Verify listUnprocessedEvents() returns at most 1000 records
Run existing unit tests: mvn test -pl openmetadata-service
Verify build compiles: mvn compile -DskipTests -pl openmetadata-service,openmetadata-spec

🤖 Generated with Claude Code

Convert Guava caches from count-based to weight-based eviction to cap total heap consumed. Bound unbounded queues and thread pools that could grow without limit under load. Cap per-request entity cache, strip full entity data from ChangeEvents, add LIMIT to unbounded SQL queries, and set a 50MB JSON input size constraint. Key changes: - EntityRepository CACHE_WITH_ID/NAME: maximumSize(20K) -> maximumWeight(200MB) - GuavaLineageGraphCache: maximumSize(100) -> maximumWeight(100MB) - SubjectCache, SettingsCache, RBAC cache: weight-based eviction - EntityLifecycleEventDispatcher: bounded queue (5000) + CallerRunsPolicy - EventPubSub: bounded ThreadPoolExecutor(4-32) replacing unbounded CachedThreadPool - RequestEntityCache: LRU cap at 50 entries per thread - ChangeEvent: lightweight entity ref instead of full entity embedding - CollectionDAO.listUnprocessedEvents: added LIMIT 1000 - JsonUtils: maxStringLength capped at 50MB (was Integer.MAX_VALUE) - WebSocketManager: cleanup empty user maps on disconnect - BULK_JOBS: reduced retention from 1h to 5min, capped at 100 concurrent - Default heap bumped from 1G to 2G with G1GC and HeapDumpOnOOM Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…ty in ChangeEvents The Map-based lightweight ref broke type safety and downstream code expecting typed entities. Reverted all .withEntity() calls back to passing the original entity. The ChangeEvent already carries entityId, entityType, and entityFullyQualifiedName as separate fields, so the full entity embedding can be addressed separately with a proper withEntityRef() approach. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

This PR aims to harden OpenMetadata against OOMKills under concurrent load by bounding major in-memory amplification sources (caches, queues, retained job results) and reducing the payload footprint of persisted change events.

Changes:

Add hard limits to cache growth (weight-based Guava caches + per-request LRU) and introduce reusable cache weighers.
Introduce bounded queues + backpressure policies for async dispatch / pub-sub to prevent unbounded task accumulation.
Reduce retained payload sizes (lightweight entity refs in ChangeEvents, bounded DB fetch for unprocessed events) and adjust default Docker heap/GC settings.

Reviewed changes

Copilot reviewed 15 out of 15 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
openmetadata-spec/src/main/java/org/openmetadata/schema/utils/JsonUtils.java	Adds Jackson stream read constraint intended to limit large JSON string values.
openmetadata-service/src/main/java/org/openmetadata/service/util/RequestEntityCache.java	Converts request-scoped cache to bounded LRU via ThreadLocal LinkedHashMap.
openmetadata-service/src/main/java/org/openmetadata/service/util/CacheWeighers.java	Introduces reusable Guava cache `Weigher` helpers for weight-based eviction.
openmetadata-service/src/main/java/org/openmetadata/service/socket/WebSocketManager.java	Fixes potential connection-map retention by removing empty per-user socket maps.
openmetadata-service/src/main/java/org/openmetadata/service/security/policyevaluator/SubjectCache.java	Switches user context cache to weight-based eviction using a JSON-serialization weigher.
openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSearchManager.java	Converts RBAC cache to weight-based eviction and adds a weigher for queries.
openmetadata-service/src/main/java/org/openmetadata/service/search/lineage/GuavaLineageGraphCache.java	Switches lineage cache to weight-based eviction using an estimated node-size weigher.
openmetadata-service/src/main/java/org/openmetadata/service/resources/settings/SettingsCache.java	Switches settings cache to weight-based eviction using a JSON-serialization weigher.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/UsageRepository.java	Stores lightweight entity refs in usage-related ChangeEvents to reduce payload size.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java	Weight-bounds entity JSON caches, emits lightweight ChangeEvent entity refs, masks incremental ChangeEvent JSON, caps bulk job retention and concurrent jobs.
openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java	Bounds unprocessed change-event fetch to 1000 rows with ordering.
openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java	Bounds lifecycle async queue and adds CallerRunsPolicy backpressure.
openmetadata-service/src/main/java/org/openmetadata/service/events/EventPubSub.java	Replaces cached thread pool with bounded ThreadPoolExecutor + CallerRunsPolicy backpressure.
docker/docker-compose-openmetadata/env-postgres	Raises default heap and adds GC/heap-dump flags for Docker Postgres env.
docker/docker-compose-openmetadata/env-mysql	Raises default heap and adds GC/heap-dump flags for Docker MySQL env.

Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSearchManager.java:126

RBAC_CACHE_V2 is declared as a LoadingCache but the provided CacheLoader.load() returns null, which violates Guava’s contract (null loads throw InvalidCacheLoadException) and is misleading since the cache is populated via get(key, Callable). Consider changing the type to Cache<String, Query> (no loader) or make load() throw an UnsupportedOperationException to prevent accidental usage.

  @SuppressWarnings("NullableProblems")
  private static final LoadingCache<String, Query> RBAC_CACHE_V2 =
      CacheBuilder.newBuilder()
          .maximumWeight(50_000_000L) // ~50 MB cap based on Query.toString() size
          .weigher(
              (com.google.common.cache.Weigher<String, Query>)
                  (key, value) -> value != null ? value.toString().length() : 0)
          .expireAfterWrite(5, TimeUnit.MINUTES)
          .build(
              new CacheLoader<>() {
                @Override
                public Query load(String key) {
                  // Will be loaded via computeIfAbsent pattern
                  return null;
                }
              });

…on cost, event pagination - BULK_JOBS: synchronized check-then-put to eliminate TOCTOU race - CacheWeighers.stringWeigher: account for UTF-16 (2 bytes/char + 40B overhead) - Replace jsonSerializationWeigher with toStringWeigher to avoid full JSON serialization on every cache put (was hitting SubjectCache and SettingsCache) - Revert LIMIT 1000 on listUnprocessedEvents(offset) — the sole caller uses it for counting unprocessed events and doesn't paginate, so the LIMIT would silently undercount. The paginated overload already exists for bounded fetching. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

openmetadata-service/src/main/java/org/openmetadata/service/search/opensearch/OpenSearchSearchManager.java:125

CacheLoader.load must not return null for a Guava LoadingCache (it will trigger InvalidCacheLoadException if get() is ever called). If this cache is intentionally populated only via asMap().computeIfAbsent(...), consider switching to Cache<String, Query> instead of LoadingCache, or have load() throw UnsupportedOperationException (similar to other caches in this codebase) to fail fast with a clearer message.

              new CacheLoader<>() {
                @Override
                public Query load(String key) {
                  // Will be loaded via computeIfAbsent pattern
                  return null;
                }

Pere's review comments: 1. EntityRepository:312 "shouldnt this be part of the config too?" → Default values now reference CacheConfiguration.DEFAULT_* constants instead of inline magic numbers. initCaches() overrides at startup. 2. CacheConfiguration:37 "how did we come up with this default?" → Added Javadoc on each constant explaining the rationale (100MB safe for 2-8GB heap, 30s TTL matches original, 5000 entries for small objects). 3. OpenSearchSearchManager:113 "why is this not managed via config?" → RBAC cache now configurable via cacheMemory.rbacCacheMaxEntries env var RBAC_CACHE_MAX_ENTRIES (default 5000). Added initRbacCache() called from app startup. 4. RequestEntityCache:28 "what are the magic numbers?" → Extracted INITIAL_CAPACITY, LOAD_FACTOR, ACCESS_ORDER as named constants. Added Javadoc on MAX_ENTRIES_PER_REQUEST explaining the 50-entry cap rationale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 3 comments.

@Valid

…r RBAC, @Valid config 1. BULK_JOBS: Replace synchronized+ConcurrentHashMap with Semaphore for thread-safe concurrency limiting. tryAcquire() is atomic, release() in whenComplete ensures permits are always returned. 2. RBAC cache: Switch from LoadingCache with null-returning CacheLoader to plain Cache<String, Query>. The CacheLoader was dead code — all callers use get(key, Callable). Null returns from CacheLoader would throw InvalidCacheLoadException. 3. CacheConfiguration: Add @Valid to the cacheMemory field in OpenMetadataApplicationConfig and initialize inline so @min constraints are enforced by Bean Validation at startup. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…reakdown The previous 500MB hard assertion was too tight — total heap growth includes non-cache overhead (change events, search indexing, request buffers, thread stacks, GC pressure). 744MB growth for 30 large tables with concurrent fetching is expected server-wide, not just cache. New test structure: - Takes heap snapshots at each phase (baseline, schema setup, table creation, sequential fetches, concurrent storm, 5s settle) - Logs a full diagnostic report with per-phase growth breakdown - Dumps JVM memory pool details from Prometheus (per-pool used/max, buffer memory, GC live data, thread count) - Asserts only on what matters: >95% fetch success rate (server alive) - Heap growth is logged for analysis, not hard-asserted This lets us see WHERE the 744MB goes — is it table creation (change events), sequential fetches (cache fill), or the concurrent storm (request amplification)? Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

…nstead RequestEntityCache previously called JsonUtils.deepCopy() on both put() and get(), creating ~990KB of allocation per 247KB entity interaction (deepCopy on put + deepCopy on get). This was the largest contributor to the 12.7x memory amplification per entity in the createOrUpdate path. Fix: store JSON strings (immutable, safe to share) instead of entity objects. put() serializes once to JSON, get() deserializes back. No defensive copying needed since strings are immutable. Measured improvement (30 tables × 300 columns, 5 concurrent fetchers): Before (deepCopy): 702MB retained after settle, +407MB total growth After (JSON cache): 434MB retained after settle, +325MB total growth GC live data: 232MB (vs 200MB cache budget — only 32MB overhead) Improvement: 268MB less retained heap (38% reduction) The table creation phase went from +340MB to -88MB (GC could reclaim during creation since RequestEntityCache no longer holds deepCopy'd objects). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

The diagnostic test now reports exactly where memory goes for each entity creation and fetch, based on code path tracing: Per-table create (247KB entity, 300 columns): DB storage (serializeForStorage): ~247KB Search indexing (buildSearchIndexDoc): ~1394KB ├─ getMap(entity) full entity→Map: ~494KB ├─ pojoToJson(searchDoc) Map→JSON: ~247KB └─ indexTableColumns (300 cols × 3KB): ~900KB ChangeEvent (entity embedded + serialized): ~494KB Redis write-through (dao.findById): ~247KB RequestEntityCache (pojoToJson): ~247KB Other (relations, inheritance): ~150KB TOTAL PER TABLE: ~2.7MB (~11x amplification) Per-fetch (GET /api/v1/tables): Guava cache hit → readValue(JSON): ~495KB setFieldsInternal (10+ DB queries): ~50KB RequestEntityCache put (pojoToJson): ~247KB HTTP response serialization: ~247KB TOTAL PER FETCH: ~1MB 30 creates + 900 fetches = ~81MB creates + ~913MB transient fetch allocs. GC live data after settle: 247MB (only 47MB above 200MB cache budget). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

… failure 1. RBAC cache: Guava Cache forbids null values — Cache.get(key, Callable) throws InvalidCacheLoadException if Callable returns null. The RBAC evaluator returns null when no RBAC query is needed. Fixed by using getIfPresent() + manual put() instead of get(key, Callable), and skipping the filter when the query is null. 2. Bulk job semaphore: permit was acquired before supplyAsync() but if the executor rejects the task (AbortPolicy + full queue), the permit was never released because whenComplete was never registered. Wrapped task submission in try/catch to release on failure. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 4 comments.

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

gitar-bot · 2026-04-17T10:49:37Z

Code Review 👍 Approved with suggestions 5 resolved / 6 findings

Memory hardening measures successfully address heap estimation accuracy, serialization overhead, and race conditions in bulk job processing. Add pagination logic to listUnprocessedEvents to ensure the new 1000-event limit does not result in data loss for downstream consumers.

💡 Edge Case: LIMIT 1000 on listUnprocessedEvents may silently drop events

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java:6798-6800

Adding LIMIT 1000 to listUnprocessedEvents prevents unbounded result sets, which is good, but callers must now handle pagination — otherwise events beyond the first 1000 are silently never processed. Verify that the caller loops until no more results are returned, or this becomes a silent data loss issue under high write volume.

✅ 5 resolved

✅ Bug: Replacing full entity with lightweight Map breaks downstream consumers

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:5823-5830 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3326 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3379 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3816 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:5811 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:5854 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/UsageRepository.java:230
The createLightweightEntityRef() change replaces ChangeEvent.withEntity(entity) with a Map<String, Object> containing only id/name/fqn/version. However, many downstream consumers of ChangeEvent.getEntity() perform explicit casts to domain types (e.g., (DataContract) getEntity(event) in MessageDecorator), instanceof checks (e.g., event.getEntity() instanceof TestCase), and JsonUtils.convertValue(event.getEntity(), ...) deserialization that expects the full entity structure.

This will cause ClassCastException at runtime in:

MessageDecorator.java — casts to TestCase, DataContract

AlertsRuleEvaluator.java — JsonUtils.convertValue/readValue expecting full entity

ChangeEventHandler.java — processes entity as full POJO

The memory savings goal is valid, but the entity field is a public contract consumed widely. A safer approach would be to strip large nested fields (columns, sampleData, etc.) while preserving the entity type structure, or to store the lightweight ref in a separate field while keeping entity nullable.

✅ Bug: pojoToMaskedJson silently redacts fields from stored ChangeEvents

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:5856
At line 5856, JsonUtils.pojoToJson(changeEvent) was changed to JsonUtils.pojoToMaskedJson(changeEvent). The masked serializer uses MASKER_OBJECT_MAPPER with IgnoreMaskedFieldAnnotationIntrospector, which omits fields annotated for masking. This means stored change events will permanently lose masked fields, making it impossible to reconstruct the full change history. This appears to be an unintentional side effect — the original code used pojoToJson for a reason.

✅ Edge Case: BULK_JOBS size check has TOCTOU race condition

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:9833-9836
The check if (BULK_JOBS.size() >= 100) followed by BULK_JOBS.put(jobId, mergedJob) is a classic time-of-check-time-of-use race on a ConcurrentHashMap. Multiple threads can concurrently observe size < 100 and all proceed to insert, exceeding the intended limit. While this is a soft limit for memory protection and a slight overshoot is tolerable, it's worth fixing properly.

✅ Performance: String weigher underestimates heap by ~2x due to UTF-16

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/CacheWeighers.java:34 📄 openmetadata-service/src/main/java/org/openmetadata/service/util/CacheWeighers.java:61
Java String objects use UTF-16 internally (2 bytes per char), plus ~40 bytes of object overhead. CacheWeighers.stringWeigher() returns value.length() (char count), meaning the 200MB weight cap on CACHE_WITH_NAME/CACHE_WITH_ID actually allows ~400MB+ of heap. Similarly, jsonSerializationWeigher uses .length() of the serialized string. Consider using value.length() * 2 or documenting this is intentionally approximate.

✅ Performance: jsonSerializationWeigher serializes on every cache put — expensive

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/CacheWeighers.java:55-65 📄 openmetadata-service/src/main/java/org/openmetadata/service/resources/settings/SettingsCache.java:88 📄 openmetadata-service/src/main/java/org/openmetadata/service/security/policyevaluator/SubjectCache.java:69
The jsonSerializationWeigher calls JsonUtils.pojoToJson(value) on every cache insertion to estimate size. For SettingsCache and SubjectCache (which cache Settings and User objects), this means a full JSON serialization just to compute weight. The Javadoc says "Use sparingly" but it's used for two caches including the auth-critical SubjectCache. Consider caching the weight or using a cheaper heuristic.

🤖 Prompt for agents

Code Review: Memory hardening measures successfully address heap estimation accuracy, serialization overhead, and race conditions in bulk job processing. Add pagination logic to `listUnprocessedEvents` to ensure the new 1000-event limit does not result in data loss for downstream consumers.

1. 💡 Edge Case: LIMIT 1000 on listUnprocessedEvents may silently drop events
   Files: openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/CollectionDAO.java:6798-6800

   Adding `LIMIT 1000` to `listUnprocessedEvents` prevents unbounded result sets, which is good, but callers must now handle pagination — otherwise events beyond the first 1000 are silently never processed. Verify that the caller loops until no more results are returned, or this becomes a silent data loss issue under high write volume.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot

Pull request overview

Copilot reviewed 16 out of 16 changed files in this pull request and generated 2 comments.

Copilot · 2026-04-17T10:53:59Z

-                }
-              });
+  // RBAC cache for new Java API — size configurable via cacheMemory.rbacCacheMaxEntries
+  // Uses plain Cache (not LoadingCache) — values are loaded via get(key, Callable) at call sites.


The comment says RBAC cache values are “loaded via get(key, Callable) at call sites”, but the implementation below uses getIfPresent() + compute + put(). Please update the comment to match the actual loading/caching behavior to avoid misleading future maintainers.

Suggested change

// Uses plain Cache (not LoadingCache) — values are loaded via get(key, Callable) at call sites.

// Uses plain Cache (not LoadingCache) — call sites read with getIfPresent(), compute on miss,

// and store the computed value with put().

Copilot · 2026-04-17T10:54:00Z

 OM_EXTENSIONS="[]"
 # Heap OPTS Configurations
-OPENMETADATA_HEAP_OPTS="-Xmx1G -Xms1G"
+OPENMETADATA_HEAP_OPTS="-Xmx2G -Xms256M -XX:+UseG1GC -XX:MaxGCPauseMillis=200 -XX:+HeapDumpOnOutOfMemoryError"


env-postgres sets -Xms2G while env-mysql sets -Xms256M with -Xmx2G. If the intent is to bump the default heap to 2G (per PR description) and reduce GC churn under load, consider making -Xms consistent (or document why MySQL is intentionally different).

sonarqubecloud · 2026-04-17T11:51:41Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

mohityadav766 requested a review from akash-jain-10 as a code owner April 15, 2026 13:32

Copilot AI review requested due to automatic review settings April 15, 2026 13:32

mohityadav766 requested review from harshach and tutte as code owners April 15, 2026 13:32

github-actions bot added backend safe to test Add this label to run secure Github workflows on PRs labels Apr 15, 2026

Copilot started reviewing on behalf of mohityadav766 April 15, 2026 13:32 View session