Fixes #26729: Add DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency by hkhan925 · Pull Request #26730 · open-metadata/OpenMetadata

hkhan925 · 2026-03-24T15:03:13Z

Describe your changes:

In multi-replica Kubernetes deployments, TypeRegistry is a JVM-local singleton that loads custom property definitions once at startup and never refreshes. When a custom property is created on one pod, other pods remain stale — causing Unknown custom field validation errors on ~(N-1)/N of requests. This has been reported since version 1.6.0 (#25532, #21865) and remains unfixed.

Changes to TypeRegistry.java:

DB fallback on cache miss: getSchema(), getCustomPropertyType(), and getCustomPropertyConfig() now fall back to the database when a property is not found in the local cache. The full entity type is loaded via TypeRepository.getByName() and all three ConcurrentHashMap caches are warmed atomically through the existing addType() method.
TTL-based staleness detection: A per-entity-type timestamp tracks when data was last refreshed. After 15 minutes, the next access triggers a DB reload. This handles property deletions and config modifications propagating across pods.
Cache-miss debounce: To prevent unbounded DB queries when a non-existent property is queried repeatedly, cache-miss-triggered refreshes are debounced to at most once per 30 seconds per entity type. The TTL-based staleness check (15 min) operates independently.
Stale entry cleanup: On refresh, existing custom property entries for the entity type are cleared via removeIf() before re-adding fresh data from DB, ensuring deleted properties don't persist in the cache.
NPE fix: addType() now uses listOrEmpty() for the custom properties iteration, fixing a latent NPE for Field-category types with null customProperties. This matches the existing guard in validateCustomProperties().

This approach mirrors the established DB-fallback pattern used by other caches in the codebase (SettingsCache, SubjectCache, BotTokenCache, etc.) and introduces no new infrastructure or dependencies.

Type of change:

Bug fix

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

I have added a test that covers the exact scenario we are fixing. For complex issues, comment the issue number in the test for future reference.

github-actions · 2026-03-24T15:03:43Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

…istry for multi-pod cache consistency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-24T15:10:54Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-24T15:25:05Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

github-actions · 2026-03-24T15:30:40Z

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

hkhan925 · 2026-03-24T15:56:19Z

Code Review ⚠️ Changes requested 3 resolved / 4 findings

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.
⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval

✅ 3 resolved

🤖 Prompt for agents
Code Review: Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

1. ⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval
   Files: openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

   The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets `REFRESH_INTERVAL_MS = 900_000` which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.
Options

Was this helpful? React with 👍 / 👎 | Gitar

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

Updated PR description to reflect the 15 minute interval. Custom Property creation via API probably isn't so frequent that it warrants refreshing the cache every 2 minute in a multi-pod setup. With the DB fallback and the debounce logic, could probably be bumped up even higher than 15 minutes

harshach · 2026-03-24T16:00:28Z

@hkhan925 thanks for the PR. will review cc @sonika-shah

gitar-bot · 2026-04-14T16:58:02Z

Code Review ⚠️ Changes requested 3 resolved / 4 findings

Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

⚠️

Bug: PR description says 2-min TTL but code uses 15-min interval

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets REFRESH_INTERVAL_MS = 900_000 which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.

✅ 3 resolved

✅ Quality: Test tearDown doesn't clear TYPE_LAST_REFRESHED — test pollution

📄 openmetadata-service/src/test/java/org/openmetadata/service/TypeRegistryTest.java:45-49
clearRegistryState() (lines 45-49) clears TYPES, CUSTOM_PROPERTIES, and CUSTOM_PROPERTY_SCHEMAS but does not clear TYPE_LAST_REFRESHED. Since this is a private static final map on a singleton, stale timestamps from one test leak into subsequent tests. For example, a test that adds a type will record a fresh timestamp; the next test may then skip a DB refresh because isTypeStale() returns false based on the leaked timestamp.

Access via reflection (similar to line 154-158 in the existing test) to clear this map in clearRegistryState().

✅ Performance: Non-existent property lookups cause unbounded DB queries

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:145-148 📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:211-214 📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:225-228
In getSchema(), getCustomPropertyType(), and getCustomPropertyConfig(), the condition if (value == null || isTypeStale(entityType)) means that every call with a property name that doesn't exist will trigger a DB refresh, even if the type was just refreshed milliseconds ago. The value == null branch always fires regardless of TTL freshness.

For example, getSchema("table", "nonExistentProp") called in a loop will issue a DB query on every single invocation. In production, this could be triggered by validation of user-supplied payloads referencing wrong property names, causing a query storm against the database.

The fix is to separate the cache-miss logic from the staleness logic: only fall back to DB if the type is stale OR was never loaded. After a successful refresh, a null result should be treated as authoritative until the TTL expires.

✅ Bug: Debounce bypassed when DB refresh throws non-404 exception

📄 openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:187-189
When refreshTypeFromDB() catches a generic Exception (line 188), it does not update TYPE_LAST_REFRESHED. This means the debounce timer is never reset on failure, so every subsequent request where isTypeStale() is true or shouldRefreshOnMiss() passes will immediately retry the DB query. If the database is experiencing transient errors (connection timeout, query failure), this creates exactly the query storm the debounce was designed to prevent.

The EntityNotFoundException branch correctly updates the timestamp (line 186), but the generic exception branch does not.

🤖 Prompt for agents

Code Review: Adds DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency with debounce logic to prevent query storms. However, the PR description documents a 2-minute TTL while the code implements a 15-minute interval—clarify which is correct.

1. ⚠️ Bug: PR description says 2-min TTL but code uses 15-min interval
   Files: openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java:56

   The PR description states: "After 2 minutes (aligned with SubjectCache and BotTokenCache TTLs), the next access triggers a DB reload." However, the code at line 56 sets `REFRESH_INTERVAL_MS = 900_000` which is 15 minutes, not 2 minutes (120,000 ms). Either the description or the code is wrong. Given the original issue is about stale caches across pods, a 15-minute TTL means pods can serve stale data for up to 15 minutes after a custom property change, which may not meet user expectations.

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

gitar-bot bot reviewed Mar 24, 2026

View reviewed changes

Comment thread openmetadata-service/src/test/java/org/openmetadata/service/TypeRegistryTest.java

Fixes open-metadata#26729: Add DB fallback and TTL refresh to TypeReg…

6bd506f

…istry for multi-pod cache consistency Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hkhan925 force-pushed the fix/type-registry-multi-pod-cache-sync branch from b69ecfc to 6bd506f Compare March 24, 2026 15:10

gitar-bot bot reviewed Mar 24, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java Outdated

gitar-bot bot reviewed Mar 24, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java

Add debounce for cache-miss DB lookups to prevent query storms

4eacfdb

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

gitar-bot bot reviewed Mar 24, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/TypeRegistry.java

Update timestamp on DB failure to prevent retry storms during outages

1a2db18

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

hkhan925 marked this pull request as ready for review March 24, 2026 15:59

harshach added the safe to test Add this label to run secure Github workflows on PRs label Mar 24, 2026

Merge branch 'main' into fix/type-registry-multi-pod-cache-sync

9a519cb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixes #26729: Add DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency#26730

Fixes #26729: Add DB fallback and TTL refresh to TypeRegistry for multi-pod cache consistency#26730
hkhan925 wants to merge 4 commits intoopen-metadata:mainfrom
hkhan925:fix/type-registry-multi-pod-cache-sync

hkhan925 commented Mar 24, 2026 •

edited

Loading

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

hkhan925 commented Mar 24, 2026

Uh oh!

harshach commented Mar 24, 2026

Uh oh!

gitar-bot bot commented Apr 14, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

hkhan925 commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

Checklist:

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

Uh oh!

github-actions bot commented Mar 24, 2026

Uh oh!

hkhan925 commented Mar 24, 2026

Uh oh!

harshach commented Mar 24, 2026

Uh oh!

gitar-bot bot commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

hkhan925 commented Mar 24, 2026 •

edited

Loading

gitar-bot bot commented Apr 14, 2026 •

edited

Loading