perf(ebean-dao): share schema metadata cache per database to fix cold-start P99 spikes by ybz1013 · Pull Request #615 · linkedin/datahub-gma

ybz1013 · 2026-04-27T18:03:00Z

Summary

Problem: Each entity type on a shared database maintained its own SchemaValidatorUtil Caffeine cache for information_schema column/index metadata (10-min TTL). On cache expiry, the next request queries information_schema inline. For low-QPS entities (MlModel, DataPolicy, etc.) this single slow query dominates P95 and causes SLO breaches. Compounding factors: ~20 entity types on metagalaxy each had independent caches (20 redundant queries for the same data), all deployed together so all caches expired simultaneously (confirmed: 7+ hosts refreshing in the same second), and no pre-warming after deploy/restart.
Fix: Introduce SharedSchemaCache — one cache instance per database URL (via static registry) instead of one per entity type. ~50 independent caches collapse to 4 (one per database). Each entity table is pre-warmed during ensureSchemaUpToDate() so the first post-deploy request is never slow. A background daemon thread refreshes all registered tables every 9 minutes with per-host random jitter (0–60s) to prevent thundering-herd expiry across the fleet.
Impact: Reduces information_schema queries from ~150 to ~2 per database per refresh cycle per host. Zero API changes — metadata-graph-assets, metadata-store, and metadata-graph-query all benefit automatically.

Files changed

NEW SharedSchemaCache.java — per-database shared cache with background refresh + jitter
MODIFIED SchemaValidatorUtil.java — new constructor accepting SharedSchemaCache; all public methods delegate to it when present; backward-compatible EbeanServer constructor unchanged for tests
MODIFIED EbeanLocalAccess.java — resolves SharedSchemaCache by JDBC URL at construction; calls registerAndPreWarm in ensureSchemaUpToDate()

Testing Done

Compiles cleanly with JDK 11 (./gradlew :dao-impl:ebean-dao:compileJava)
Existing SchemaValidatorUtil(EbeanServer) constructor unchanged — all existing tests remain compatible
EbeanLocalAccessTest.resetValidatorInstance() still works (injects old-style validator via reflection, bypassing shared cache — intentional for test isolation)
Integration tests blocked by Apple Silicon MariaDB setup (brew install mariadb required per CLAUDE.md) — pre-existing environment issue unrelated to this change

Checklist

The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
Links to related issues (if applicable)
Docs related to the changes have been added/updated (if applicable)

…-start P99 spikes Introduce SharedSchemaCache — one cache per database URL instead of one per entity type. EbeanLocalAccess now resolves a shared instance via its JDBC URL, so ~20 entity types on metagalaxy share 4 caches instead of maintaining ~50 independent copies. Pre-warm each entity table during ensureSchemaUpToDate() so the first request after deploy is never slow. Background refresh runs every 9 min with per-host random jitter (0-60s) to prevent the thundering-herd cache expiry seen in prod (7+ hosts querying information_schema simultaneously). Reduces information_schema queries from ~150 to ~2 per database per refresh cycle per host. Zero API changes — all upstream services benefit automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

EbeanLocalDAOTest drops and recreates tables before each test via @BeforeMethod, removing any virtual columns that addIndex() had added dynamically. The SharedSchemaCache (static singleton per JDBC URL) retained the stale column set across that DROP+CREATE, so subsequent columnExists() calls returned false for freshly-re-added columns, causing getIndexedExpressionOrColumn() to skip the WHERE filter and return wrong row counts. Fix: call SharedSchemaCache.clearRegistry() after the schema reset in setupTest() so the next columnExists() query hits information_schema fresh. Also make clearRegistry() public so the test can reach it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…rviceIdentifier EbeanLocalAccessTest resets its EbeanLocalAccess validator to a fresh per-instance SchemaValidatorUtil before each test via @BeforeMethod resetValidatorInstance(), preventing the SharedSchemaCache from serving stale column data after setupTest() recreates tables. The sibling class EbeanLocalAccessTestWithoutServiceIdentifier was missing this method, causing testListUrnsWithLastUrn, testListUrnsWithOffset, and testCountAggregate to fail because columnExists() returned stale results and WHERE filters were silently skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rakhiagr · 2026-04-29T17:11:35Z

+    try {
+      columnCache.put(tableName, loadColumns(tableName));
+      indexCache.put(tableName, loadIndexes(tableName));
+      log.debug("Schema cache refreshed for table '{}'", tableName);


We should refresh indexExpressionCache as well?

…reshTable Background refresh and registerAndPreWarm were keeping columnCache and indexCache warm but silently leaving indexExpressionCache stale. Add the missing loadIndexesAndExpressions() call so all three caches stay in sync. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rakhiagr · 2026-04-29T17:25:40Z

+          indexes != null ? indexes.getOrDefault(lowerIndex, null) : null);
+    } catch (Exception e) {
+      // MariaDB for local testing doesn't support EXPRESSION column
+      log.debug("Unable to load index expressions for table '{}': {}", lowerTable, e.getMessage());


We don't have DEBUG level logs enabled I think. We should make it INFO level.

And at all other places too

DEBUG logs are not enabled in most environments; the log message for MariaDB not supporting the EXPRESSION column was silently dropped. Change to INFO so the fallback is always visible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Schema cache refresh success log was at DEBUG level which is not enabled. Change to INFO for consistency with the other logs in this class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rakhiagr · 2026-04-29T17:38:55Z

  public void ensureSchemaUpToDate() {
    _schemaEvolutionManager.ensureSchemaUpToDate();
+    // Pre-warm the shared cache for this entity's table so the first request is never slow.
+    validator.registerAndPreWarm(getTableName(_entityType));


Nit: this only pre-warms the prod tables. batchGetUnion calls columnExists with getTestTableName(entityUrn) when isTestMode=true. Should we do it for test tables also?

rakhiagr · 2026-04-29T17:40:52Z

+
+  public boolean columnExists(@Nonnull String tableName, @Nonnull String columnName) {
+    Set<String> columns = columnCache.get(tableName.toLowerCase(), this::loadColumns);
+    return columns != null && columns.contains(columnName.toLowerCase());


Nit: columns != null is dead code? Same for following methods?

rakhiagr · 2026-04-29T17:44:29Z

+
+  private void scheduleBackgroundRefresh() {
+    long jitterSeconds = ThreadLocalRandom.current().nextLong(0, JITTER_MAX_SECONDS + 1);
+    long initialDelay = REFRESH_INTERVAL_SECONDS + jitterSeconds;


Can you explain why we did we add jitterSeconds here?

Good question! Without jitter, all ~7 metagalaxy hosts would start their background refresh at the same moment (since they all deploy together and REFRESH_INTERVAL_SECONDS is fixed). That would cause a thundering herd where all hosts simultaneously query information_schema every 9 minutes, amplifying DB load. Adding random jitter of 0–60s to the initial delay staggers the first refresh across hosts so subsequent cycles are naturally spread out.

nit: flip + jitterSeconds to - jitterSeconds — keeps the same fleet spread but always leaves a 60s buffer before TTL?

@ybz1013 Can you address this?

Good catch — done. With + the worst-case initial delay was 540+60 = 600s, which equals the TTL exactly and could let entries expire before the refresh fires. With - the initial delay is 480–540s, so the refresh always fires at least 2 minutes before the 10-minute TTL. Updated the constant comment to explain the new semantics.

rakhiagr · 2026-04-29T17:50:33Z

Few comments:

How are you planning to test whether these changes are helping with P99 spikes?
EbeanLocalRelationshipQueryDAO still uses new SchemaValidatorUtil(server). Should we migrate them too?
Lets add some unit tests for SharedSchemaCache?

- remove dead null checks (Caffeine get() never returns null) - add explanatory comments to each constant - pre-warm test table in ensureSchemaUpToDate() so warm-up coverage is complete for both prod and shadow tables Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…eryDAO + add tests - add new EbeanLocalRelationshipQueryDAO(server, serverConfig, config) constructor that uses SharedSchemaCache via the same buildValidator() pattern as EbeanLocalAccess - pre-warm test table in EbeanLocalAccess.ensureSchemaUpToDate() alongside prod table - add SharedSchemaCacheTest covering columnExists, getColumns, indexExists, registerAndPreWarm (verifies no DB call after warm), clearCaches, singleton identity per URL, and getIndexExpression (mocked for MariaDB compatibility) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ybz1013 · 2026-04-29T21:18:15Z

Thanks for the review @rakhiagr! Addressing all three:

P99 testing plan: The P99 improvement isn't easily unit-testable, but the mechanism is observable in two ways: (a) the Schema cache refreshed for table '...' INFO log emitted by registerAndPreWarm at startup confirms pre-warming happened before the first request; (b) in production, we can compare information_schema query counts before/after via MySQL slow-query logs or metrics — the cold-start queries that previously fired on the first entity request per pod restart should no longer appear. The background refresh log (Background schema cache refresh: N tables) also confirms the thundering-herd jitter is working across the fleet.
EbeanLocalRelationshipQueryDAO migration: Done — added a new (EbeanServer, ServerConfig, EBeanDAOConfig) constructor that delegates to SharedSchemaCache via the same buildValidator() pattern as EbeanLocalAccess. The existing two-arg constructors are kept for backward compatibility.
Unit tests for SharedSchemaCache: Added SharedSchemaCacheTest covering columnExists, getColumns, indexExists, registerAndPreWarm (verifies no DB call after warm via Mockito), clearCaches, singleton identity per URL, and getIndexExpression (mocked since MariaDB doesn't support the EXPRESSION column).

ybz1013 · 2026-04-30T16:49:35Z

test-plan-mgcy26-shared-schema-cache.md

mridul111998 · 2026-04-30T19:44:08Z

  void clearCaches() {
+    if (sharedCache != null) {
+      sharedCache.clearCaches();
+      return;
+    }
    indexCache.invalidateAll();
    columnCache.invalidateAll();
+    indexExpressionCache.invalidateAll();
  }


why is this change required, this function already seems redundant to me, Can we clean it up if not required

Good catch — you're right that this change isn't strictly needed. SchemaValidatorUtil.clearCaches() is @VisibleForTesting and our tests actually call SharedSchemaCache.clearRegistry() / SharedSchemaCache.clearCaches() directly, so the delegation I added here is dead code. I'll remove it and leave the method body clearing only the per-instance Caffeine caches (which is the pre-existing behavior). If you'd prefer we remove the whole method since it's also not called anywhere, happy to do that too — let me know.

mridul111998 · 2026-04-30T20:00:08Z

+  private static final String SQL_GET_ALL_INDEXES =
+      "SELECT DISTINCT INDEX_NAME FROM information_schema.STATISTICS WHERE TABLE_SCHEMA = database() AND TABLE_NAME = '%s'";


why do we need this query? we are getting index name already in SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS query,

The two queries need to stay separate for MariaDB compatibility. MariaDB (our embedded test DB) does not have the EXPRESSION column in information_schema.STATISTICS, so any query that selects it throws an exception. SQL_GET_ALL_INDEXES (no EXPRESSION column) is used by indexExists(), which must work reliably in all environments including local dev/test. SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS is only used by getIndexExpression(), which wraps the call in a try/catch and gracefully returns null on failure (the MariaDB case). If we merged them into one query, indexExists() would throw in MariaDB and break all local tests.

…orUtil.clearCaches Tests call SharedSchemaCache.clearRegistry() / clearCaches() directly; the delegation added in this PR was never exercised. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mridul111998 · 2026-04-30T20:30:14Z

+    if (dbUrl != null && !dbUrl.isEmpty()) {
+      return new SchemaValidatorUtil(SharedSchemaCache.getInstance(server, dbUrl));
+    }
+    return new SchemaValidatorUtil(server);


when can dburl be null or empty? I would recommend throwing an exception here and cleanup this path completely, and

Good point — in practice dbUrl should never be null since all production metagalaxy configs use URL-based datasource setup. The fallback was overly defensive. Done — buildValidator() now throws IllegalStateException if dbUrl is null or empty, and the silent fallback path is removed.

Also incorporated @rakagrawal's suggestion from Slack: added log.info("Inline schema cache miss: loading ... for table '{}'", tbl) inside all three cache.get() loaders in SharedSchemaCache, so operators can see unexpected cold loads in production (after this PR, these should only appear at startup before pre-warming completes, never mid-traffic).

…iss logs - buildValidator() now throws IllegalStateException if ServerConfig has no URL instead of silently falling back to a per-instance SchemaValidatorUtil (which would bypass SharedSchemaCache entirely) - add INFO log on inline cache miss in all three SharedSchemaCache.cache.get() loaders so operators can detect unexpected cold loads in production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ee TTL buffer With + jitterSeconds the worst-case initial delay equalled the TTL (600s), risking a cache gap. Subtracting instead means the refresh fires 8–9 min after startup, always leaving at least 2 min before the 10-min TTL expires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rakhiagr

Thank you for addressing all the comments!

@VisibleForTesting

PR linkedin#615 introduced SharedSchemaCache with pre-warm calls inside EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema migrations run as a separate job (LinkedIn's prod pattern, and likely others), ensureSchemaUpToDate() is gated off, so pre-warm never runs and the first request after JVM startup pays the inline information_schema query cost (the "Inline schema cache miss" log line). Moves the pre-warm calls to the EbeanLocalAccess constructor so they always run, regardless of whether schema evolution is enabled. refreshTable() already has its own try/catch, so a transient DB hiccup during construction logs a warning rather than failing boot. Also drops the unused tableColumns field (dead code, replaced by the shared cache in linkedin#615). Two supporting changes: 1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on each held instance before wiping the REGISTRY map. Previously clearing only the static map left stale entries readable through any reference (e.g. EbeanLocalAccess.validator) that callers were already holding -- a footgun that the @VisibleForTesting name did not suggest. 2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry() after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now running in the constructor, the cache snapshots the schema before addIndex modifies it; the explicit invalidation keeps subsequent columnExists()/indexExists() lookups in sync with the actual table state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@VisibleForTesting

PR #615 introduced SharedSchemaCache with pre-warm calls inside EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema migrations run as a separate job (LinkedIn's prod pattern, and likely others), ensureSchemaUpToDate() is gated off, so pre-warm never runs and the first request after JVM startup pays the inline information_schema query cost (the "Inline schema cache miss" log line). Moves the pre-warm calls to the EbeanLocalAccess constructor so they always run, regardless of whether schema evolution is enabled. refreshTable() already has its own try/catch, so a transient DB hiccup during construction logs a warning rather than failing boot. Also drops the unused tableColumns field (dead code, replaced by the shared cache in #615). Two supporting changes: 1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on each held instance before wiping the REGISTRY map. Previously clearing only the static map left stale entries readable through any reference (e.g. EbeanLocalAccess.validator) that callers were already holding -- a footgun that the @VisibleForTesting name did not suggest. 2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry() after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now running in the constructor, the cache snapshots the schema before addIndex modifies it; the explicit invalidation keeps subsequent columnExists()/indexExists() lookups in sync with the actual table state. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ybz1013 and others added 3 commits April 27, 2026 10:58

rakhiagr reviewed Apr 29, 2026

View reviewed changes

ybz1013 and others added 2 commits April 29, 2026 10:26

fix(ebean-dao): upgrade remaining SharedSchemaCache debug log to INFO

a5566e4

Schema cache refresh success log was at DEBUG level which is not enabled. Change to INFO for consistency with the other logs in this class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

rakhiagr reviewed Apr 29, 2026

View reviewed changes

Comment thread dao-impl/ebean-dao/src/main/java/com/linkedin/metadata/dao/utils/SharedSchemaCache.java

rakhiagr reviewed Apr 29, 2026

View reviewed changes

ybz1013 and others added 2 commits April 29, 2026 10:55

mridul111998 reviewed Apr 30, 2026

View reviewed changes

fix(ebean-dao): remove dead sharedCache delegation from SchemaValidat…

b7a0a96

…orUtil.clearCaches Tests call SharedSchemaCache.clearRegistry() / clearCaches() directly; the delegation added in this PR was never exercised. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

mridul111998 reviewed Apr 30, 2026

View reviewed changes

ybz1013 and others added 2 commits April 30, 2026 14:20

rakhiagr approved these changes Apr 30, 2026

View reviewed changes

ybz1013 merged commit de343af into linkedin:master Apr 30, 2026
2 checks passed

mridul111998 mentioned this pull request May 6, 2026

perf(ebean-dao): move SharedSchemaCache pre-warm to constructor #618

Merged

3 tasks

		private static final String SQL_GET_ALL_INDEXES =
		"SELECT DISTINCT INDEX_NAME FROM information_schema.STATISTICS WHERE TABLE_SCHEMA = database() AND TABLE_NAME = '%s'";

Conversation

ybz1013 commented Apr 27, 2026

Summary

Files changed

Testing Done

Checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rakhiagr commented Apr 29, 2026

Uh oh!

ybz1013 commented Apr 29, 2026

Uh oh!

ybz1013 commented Apr 30, 2026

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rakhiagr left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants