Skip to content

Fixes #26729: Add DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency#26730

Open
hkhan925 wants to merge 4 commits intoopen-metadata:mainfrom
hkhan925:fix/type-registry-multi-pod-cache-sync
Open

Fixes #26729: Add DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency#26730
hkhan925 wants to merge 4 commits intoopen-metadata:mainfrom
hkhan925:fix/type-registry-multi-pod-cache-sync

Conversation

@hkhan925
Copy link
Copy Markdown

@hkhan925 hkhan925 commented Mar 24, 2026

Describe your changes:

Fixes #26729

In multi-replica Kubernetes deployments, TypeRegistry is a JVM-local singleton that loads custom property definitions once at startup and never refreshes. When a custom property is created on one pod, other pods remain stale — causing Unknown custom field validation errors on ~(N-1)/N of requests. This has been reported since version 1.6.0 (#25532, #21865) and remains unfixed.

Changes to TypeRegistry.java:

  • DB fallback on cache miss: getSchema(), getCustomPropertyType(), and getCustomPropertyConfig() now fall back to the database when a property is not found in the local cache. The full entity type is loaded via TypeRepository.getByName() and all three ConcurrentHashMap caches are warmed atomically through the existing addType() method.

  • TTL-based staleness detection: A per-entity-type timestamp tracks when data was last refreshed. After 15 minutes, the next access triggers a DB reload. This handles property deletions and config modifications propagating across pods.

  • Cache-miss debounce: To prevent unbounded DB queries when a non-existent property is queried repeatedly, cache-miss-triggered refreshes are debounced to at most once per 30 seconds per entity type. The TTL-based staleness check (15 min) operates independently.

  • Stale entry cleanup: On refresh, existing custom property entries for the entity type are cleared via removeIf() before re-adding fresh data from DB, ensuring deleted properties don't persist in the cache.

  • NPE fix: addType() now uses listOrEmpty() for the custom properties iteration, fixing a latent NPE for Field-category types with null customProperties. This matches the existing guard in validateCustomProperties().

This approach mirrors the established DB-fallback pattern used by other caches in the codebase (SettingsCache, SubjectCache, BotTokenCache, etc.) and introduces no new infrastructure or dependencies.

Type of change:

  • Bug fix

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.
  • I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…istry for multi-pod cache consistency

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@hkhan925 hkhan925 force-pushed the fix/type-registry-multi-pod-cache-sync branch from b69ecfc to 6bd506f Compare March 24, 2026 15:10
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java Outdated
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@hkhan925
Copy link
Copy Markdown
Author

Code Review ⚠️ Changes requested 3 resolved / 4 findings

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.
⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval

✅ 3 resolved

🤖 Prompt for agents

Code Review: Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

1. ⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval
   Files: openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

   The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets `REFRESH_INTERVAL_MS = 900_000` which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.

Options

Was this helpful? React with 👍 / 👎 | Gitar

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

Updated PR description to reflect the 15 minute interval. Custom Property creation via API probably isn't so frequent that it warrants refreshing the cache every 2 minute in a multi-pod setup. With the DB fallback and the debounce logic, could probably be bumped up even higher than 15 minutes

@hkhan925 hkhan925 marked this pull request as ready for review March 24, 2026 15:59
@harshach harshach added the safe to test Add this label to run secure Github workflows on PRs label Mar 24, 2026
@harshach
Copy link
Copy Markdown
Collaborator

@hkhan925 thanks for the PR. will review cc @sonika-shah

@gitar-bot
Copy link
Copy Markdown

gitar-bot bot commented Apr 14, 2026

Code Review ⚠️ Changes requested 3 resolved / 4 findings

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets REFRESH_INTERVAL_MS = 900_000 which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.

✅ 3 resolved
Quality: Test tearDown doesn't clear TYPE_LAST_REFRESHED — test pollution

📄 openmetadata-service/src/test/java/org/openmetadata/service/TypeRegistryTest.java:45-49
clearRegistryState() (lines 45-49) clears TYPES, CUSTOM_PROPERTIES, and CUSTOM_PROPERTY_SCHEMAS but does not clear TYPE_LAST_REFRESHED. Since this is a private static final map on a singleton, stale timestamps from one test leak into subsequent tests. For example, a test that adds a type will record a fresh timestamp; the next test may then skip a DB refresh because isTypeStale() returns false based on the leaked timestamp.

Access via reflection (similar to line 154-158 in the existing test) to clear this map in clearRegistryState().

Performance: Non-existent property lookups cause unbounded DB queries

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:145-148 📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:211-214 📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:225-228
In getSchema(), getCustomPropertyType(), and getCustomPropertyConfig(), the condition if (value == null || isTypeStale(entityType)) means that every call with a property name that doesn't exist will trigger a DB refresh, even if the type was just refreshed milliseconds ago. The value == null branch always fires regardless of TTL freshness.

For example, getSchema("table", "nonExistentProp") called in a loop will issue a DB query on every single invocation. In production, this could be triggered by validation of user-supplied payloads referencing wrong property names, causing a query storm against the database.

The fix is to separate the cache-miss logic from the staleness logic: only fall back to DB if the type is stale OR was never loaded. After a successful refresh, a null result should be treated as authoritative until the TTL expires.

Bug: Debounce bypassed when DB refresh throws non-404 exception

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:187-189
When refreshTypeFromDB() catches a generic Exception (line 188), it does not update TYPE_LAST_REFRESHED. This means the debounce timer is never reset on failure, so every subsequent request where isTypeStale() is true or shouldRefreshOnMiss() passes will immediately retry the DB query. If the database is experiencing transient errors (connection timeout, query failure), this creates exactly the query storm the debounce was designed to prevent.

The EntityNotFoundException branch correctly updates the timestamp (line 186), but the generic exception branch does not.

🤖 Prompt for agents
Code Review: Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

1. ⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval
   Files: openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

   The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets `REFRESH_INTERVAL_MS = 900_000` which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TypeRegistry in-memory cache not synchronized across pods in multi-replica deployments

3 participants