Skip to content

perf(ebean-dao): share schema metadata cache per database to fix cold-start P99 spikes#615

Merged
ybz1013 merged 11 commits into
linkedin:masterfrom
ybz1013:yanbzhao/fix-cold-start-shared-schema-cache
Apr 30, 2026
Merged

perf(ebean-dao): share schema metadata cache per database to fix cold-start P99 spikes#615
ybz1013 merged 11 commits into
linkedin:masterfrom
ybz1013:yanbzhao/fix-cold-start-shared-schema-cache

Conversation

@ybz1013
Copy link
Copy Markdown
Contributor

@ybz1013 ybz1013 commented Apr 27, 2026

Summary

  • Problem: Each entity type on a shared database maintained its own SchemaValidatorUtil Caffeine cache for information_schema column/index metadata (10-min TTL). On cache expiry, the next request queries information_schema inline. For low-QPS entities (MlModel, DataPolicy, etc.) this single slow query dominates P95 and causes SLO breaches. Compounding factors: ~20 entity types on metagalaxy each had independent caches (20 redundant queries for the same data), all deployed together so all caches expired simultaneously (confirmed: 7+ hosts refreshing in the same second), and no pre-warming after deploy/restart.

  • Fix: Introduce SharedSchemaCache — one cache instance per database URL (via static registry) instead of one per entity type. ~50 independent caches collapse to 4 (one per database). Each entity table is pre-warmed during ensureSchemaUpToDate() so the first post-deploy request is never slow. A background daemon thread refreshes all registered tables every 9 minutes with per-host random jitter (0–60s) to prevent thundering-herd expiry across the fleet.

  • Impact: Reduces information_schema queries from ~150 to ~2 per database per refresh cycle per host. Zero API changes — metadata-graph-assets, metadata-store, and metadata-graph-query all benefit automatically.

Files changed

  • NEW SharedSchemaCache.java — per-database shared cache with background refresh + jitter
  • MODIFIED SchemaValidatorUtil.java — new constructor accepting SharedSchemaCache; all public methods delegate to it when present; backward-compatible EbeanServer constructor unchanged for tests
  • MODIFIED EbeanLocalAccess.java — resolves SharedSchemaCache by JDBC URL at construction; calls registerAndPreWarm in ensureSchemaUpToDate()

Testing Done

  • Compiles cleanly with JDK 11 (./gradlew :dao-impl:ebean-dao:compileJava)
  • Existing SchemaValidatorUtil(EbeanServer) constructor unchanged — all existing tests remain compatible
  • EbeanLocalAccessTest.resetValidatorInstance() still works (injects old-style validator via reflection, bypassing shared cache — intentional for test isolation)
  • Integration tests blocked by Apple Silicon MariaDB setup (brew install mariadb required per CLAUDE.md) — pre-existing environment issue unrelated to this change

Checklist

ybz1013 and others added 3 commits April 27, 2026 10:58
…-start P99 spikes

Introduce SharedSchemaCache — one cache per database URL instead of one
per entity type. EbeanLocalAccess now resolves a shared instance via its
JDBC URL, so ~20 entity types on metagalaxy share 4 caches instead of
maintaining ~50 independent copies.

Pre-warm each entity table during ensureSchemaUpToDate() so the first
request after deploy is never slow. Background refresh runs every 9 min
with per-host random jitter (0-60s) to prevent the thundering-herd cache
expiry seen in prod (7+ hosts querying information_schema simultaneously).

Reduces information_schema queries from ~150 to ~2 per database per
refresh cycle per host. Zero API changes — all upstream services benefit
automatically.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EbeanLocalDAOTest drops and recreates tables before each test via
@BeforeMethod, removing any virtual columns that addIndex() had added
dynamically. The SharedSchemaCache (static singleton per JDBC URL)
retained the stale column set across that DROP+CREATE, so subsequent
columnExists() calls returned false for freshly-re-added columns,
causing getIndexedExpressionOrColumn() to skip the WHERE filter and
return wrong row counts.

Fix: call SharedSchemaCache.clearRegistry() after the schema reset in
setupTest() so the next columnExists() query hits information_schema
fresh. Also make clearRegistry() public so the test can reach it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rviceIdentifier

EbeanLocalAccessTest resets its EbeanLocalAccess validator to a fresh
per-instance SchemaValidatorUtil before each test via @BeforeMethod
resetValidatorInstance(), preventing the SharedSchemaCache from serving
stale column data after setupTest() recreates tables. The sibling class
EbeanLocalAccessTestWithoutServiceIdentifier was missing this method,
causing testListUrnsWithLastUrn, testListUrnsWithOffset, and
testCountAggregate to fail because columnExists() returned stale results
and WHERE filters were silently skipped.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
try {
columnCache.put(tableName, loadColumns(tableName));
indexCache.put(tableName, loadIndexes(tableName));
log.debug("Schema cache refreshed for table '{}'", tableName);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should refresh indexExpressionCache as well?

…reshTable

Background refresh and registerAndPreWarm were keeping columnCache and
indexCache warm but silently leaving indexExpressionCache stale. Add the
missing loadIndexesAndExpressions() call so all three caches stay in sync.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
indexes != null ? indexes.getOrDefault(lowerIndex, null) : null);
} catch (Exception e) {
// MariaDB for local testing doesn't support EXPRESSION column
log.debug("Unable to load index expressions for table '{}': {}", lowerTable, e.getMessage());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't have DEBUG level logs enabled I think. We should make it INFO level.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And at all other places too

ybz1013 and others added 2 commits April 29, 2026 10:26
DEBUG logs are not enabled in most environments; the log message for
MariaDB not supporting the EXPRESSION column was silently dropped.
Change to INFO so the fallback is always visible.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Schema cache refresh success log was at DEBUG level which is not
enabled. Change to INFO for consistency with the other logs in this class.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
public void ensureSchemaUpToDate() {
_schemaEvolutionManager.ensureSchemaUpToDate();
// Pre-warm the shared cache for this entity's table so the first request is never slow.
validator.registerAndPreWarm(getTableName(_entityType));
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: this only pre-warms the prod tables. batchGetUnion calls columnExists with getTestTableName(entityUrn) when isTestMode=true. Should we do it for test tables also?


public boolean columnExists(@Nonnull String tableName, @Nonnull String columnName) {
Set<String> columns = columnCache.get(tableName.toLowerCase(), this::loadColumns);
return columns != null && columns.contains(columnName.toLowerCase());
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: columns != null is dead code? Same for following methods?


private void scheduleBackgroundRefresh() {
long jitterSeconds = ThreadLocalRandom.current().nextLong(0, JITTER_MAX_SECONDS + 1);
long initialDelay = REFRESH_INTERVAL_SECONDS + jitterSeconds;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you explain why we did we add jitterSeconds here?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question! Without jitter, all ~7 metagalaxy hosts would start their background refresh at the same moment (since they all deploy together and REFRESH_INTERVAL_SECONDS is fixed). That would cause a thundering herd where all hosts simultaneously query information_schema every 9 minutes, amplifying DB load. Adding random jitter of 0–60s to the initial delay staggers the first refresh across hosts so subsequent cycles are naturally spread out.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: flip + jitterSeconds to - jitterSeconds — keeps the same fleet spread but always leaves a 60s buffer before TTL?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ybz1013 Can you address this?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — done. With + the worst-case initial delay was 540+60 = 600s, which equals the TTL exactly and could let entries expire before the refresh fires. With - the initial delay is 480–540s, so the refresh always fires at least 2 minutes before the 10-minute TTL. Updated the constant comment to explain the new semantics.

@rakhiagr
Copy link
Copy Markdown
Contributor

Few comments:

  1. How are you planning to test whether these changes are helping with P99 spikes?
  2. EbeanLocalRelationshipQueryDAO still uses new SchemaValidatorUtil(server). Should we migrate them too?
  3. Lets add some unit tests for SharedSchemaCache?

ybz1013 and others added 2 commits April 29, 2026 10:55
- remove dead null checks (Caffeine get() never returns null)
- add explanatory comments to each constant
- pre-warm test table in ensureSchemaUpToDate() so warm-up coverage
  is complete for both prod and shadow tables

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eryDAO + add tests

- add new EbeanLocalRelationshipQueryDAO(server, serverConfig, config) constructor
  that uses SharedSchemaCache via the same buildValidator() pattern as EbeanLocalAccess
- pre-warm test table in EbeanLocalAccess.ensureSchemaUpToDate() alongside prod table
- add SharedSchemaCacheTest covering columnExists, getColumns, indexExists,
  registerAndPreWarm (verifies no DB call after warm), clearCaches, singleton
  identity per URL, and getIndexExpression (mocked for MariaDB compatibility)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@ybz1013
Copy link
Copy Markdown
Contributor Author

ybz1013 commented Apr 29, 2026

Thanks for the review @rakhiagr! Addressing all three:

  1. P99 testing plan: The P99 improvement isn't easily unit-testable, but the mechanism is observable in two ways: (a) the Schema cache refreshed for table '...' INFO log emitted by registerAndPreWarm at startup confirms pre-warming happened before the first request; (b) in production, we can compare information_schema query counts before/after via MySQL slow-query logs or metrics — the cold-start queries that previously fired on the first entity request per pod restart should no longer appear. The background refresh log (Background schema cache refresh: N tables) also confirms the thundering-herd jitter is working across the fleet.

  2. EbeanLocalRelationshipQueryDAO migration: Done — added a new (EbeanServer, ServerConfig, EBeanDAOConfig) constructor that delegates to SharedSchemaCache via the same buildValidator() pattern as EbeanLocalAccess. The existing two-arg constructors are kept for backward compatibility.

  3. Unit tests for SharedSchemaCache: Added SharedSchemaCacheTest covering columnExists, getColumns, indexExists, registerAndPreWarm (verifies no DB call after warm via Mockito), clearCaches, singleton identity per URL, and getIndexExpression (mocked since MariaDB doesn't support the EXPRESSION column).

@ybz1013
Copy link
Copy Markdown
Contributor Author

ybz1013 commented Apr 30, 2026

Comment on lines 104 to 112
void clearCaches() {
if (sharedCache != null) {
sharedCache.clearCaches();
return;
}
indexCache.invalidateAll();
columnCache.invalidateAll();
indexExpressionCache.invalidateAll();
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why is this change required, this function already seems redundant to me, Can we clean it up if not required

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — you're right that this change isn't strictly needed. SchemaValidatorUtil.clearCaches() is @VisibleForTesting and our tests actually call SharedSchemaCache.clearRegistry() / SharedSchemaCache.clearCaches() directly, so the delegation I added here is dead code. I'll remove it and leave the method body clearing only the per-instance Caffeine caches (which is the pre-existing behavior). If you'd prefer we remove the whole method since it's also not called anywhere, happy to do that too — let me know.

Comment on lines +46 to +47
private static final String SQL_GET_ALL_INDEXES =
"SELECT DISTINCT INDEX_NAME FROM information_schema.STATISTICS WHERE TABLE_SCHEMA = database() AND TABLE_NAME = '%s'";
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need this query? we are getting index name already in SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS query,

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The two queries need to stay separate for MariaDB compatibility. MariaDB (our embedded test DB) does not have the EXPRESSION column in information_schema.STATISTICS, so any query that selects it throws an exception. SQL_GET_ALL_INDEXES (no EXPRESSION column) is used by indexExists(), which must work reliably in all environments including local dev/test. SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS is only used by getIndexExpression(), which wraps the call in a try/catch and gracefully returns null on failure (the MariaDB case). If we merged them into one query, indexExists() would throw in MariaDB and break all local tests.

…orUtil.clearCaches

Tests call SharedSchemaCache.clearRegistry() / clearCaches() directly;
the delegation added in this PR was never exercised.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
if (dbUrl != null && !dbUrl.isEmpty()) {
return new SchemaValidatorUtil(SharedSchemaCache.getInstance(server, dbUrl));
}
return new SchemaValidatorUtil(server);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when can dburl be null or empty? I would recommend throwing an exception here and cleanup this path completely, and

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point — in practice dbUrl should never be null since all production metagalaxy configs use URL-based datasource setup. The fallback was overly defensive. Done — buildValidator() now throws IllegalStateException if dbUrl is null or empty, and the silent fallback path is removed.

Also incorporated @rakagrawal's suggestion from Slack: added log.info("Inline schema cache miss: loading ... for table '{}'", tbl) inside all three cache.get() loaders in SharedSchemaCache, so operators can see unexpected cold loads in production (after this PR, these should only appear at startup before pre-warming completes, never mid-traffic).

ybz1013 and others added 2 commits April 30, 2026 14:20
…iss logs

- buildValidator() now throws IllegalStateException if ServerConfig has no
  URL instead of silently falling back to a per-instance SchemaValidatorUtil
  (which would bypass SharedSchemaCache entirely)
- add INFO log on inline cache miss in all three SharedSchemaCache.cache.get()
  loaders so operators can detect unexpected cold loads in production

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ee TTL buffer

With + jitterSeconds the worst-case initial delay equalled the TTL (600s),
risking a cache gap. Subtracting instead means the refresh fires 8–9 min
after startup, always leaving at least 2 min before the 10-min TTL expires.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor

@rakhiagr rakhiagr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for addressing all the comments!

@ybz1013 ybz1013 merged commit de343af into linkedin:master Apr 30, 2026
2 checks passed
mridul111998 added a commit to mridul111998/datahub-gma that referenced this pull request May 6, 2026
PR linkedin#615 introduced SharedSchemaCache with pre-warm calls inside
EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema
migrations run as a separate job (LinkedIn's prod pattern, and likely
others), ensureSchemaUpToDate() is gated off, so pre-warm never runs
and the first request after JVM startup pays the inline
information_schema query cost (the "Inline schema cache miss" log
line).

Moves the pre-warm calls to the EbeanLocalAccess constructor so they
always run, regardless of whether schema evolution is enabled.
refreshTable() already has its own try/catch, so a transient DB
hiccup during construction logs a warning rather than failing boot.

Also drops the unused tableColumns field (dead code, replaced by the
shared cache in linkedin#615).

Two supporting changes:

1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on
   each held instance before wiping the REGISTRY map. Previously
   clearing only the static map left stale entries readable through
   any reference (e.g. EbeanLocalAccess.validator) that callers were
   already holding -- a footgun that the @VisibleForTesting name did
   not suggest.

2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry()
   after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now
   running in the constructor, the cache snapshots the schema before
   addIndex modifies it; the explicit invalidation keeps subsequent
   columnExists()/indexExists() lookups in sync with the actual
   table state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
ybz1013 pushed a commit that referenced this pull request May 6, 2026
PR #615 introduced SharedSchemaCache with pre-warm calls inside
EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema
migrations run as a separate job (LinkedIn's prod pattern, and likely
others), ensureSchemaUpToDate() is gated off, so pre-warm never runs
and the first request after JVM startup pays the inline
information_schema query cost (the "Inline schema cache miss" log
line).

Moves the pre-warm calls to the EbeanLocalAccess constructor so they
always run, regardless of whether schema evolution is enabled.
refreshTable() already has its own try/catch, so a transient DB
hiccup during construction logs a warning rather than failing boot.

Also drops the unused tableColumns field (dead code, replaced by the
shared cache in #615).

Two supporting changes:

1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on
   each held instance before wiping the REGISTRY map. Previously
   clearing only the static map left stale entries readable through
   any reference (e.g. EbeanLocalAccess.validator) that callers were
   already holding -- a footgun that the @VisibleForTesting name did
   not suggest.

2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry()
   after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now
   running in the constructor, the cache snapshots the schema before
   addIndex modifies it; the explicit invalidation keeps subsequent
   columnExists()/indexExists() lookups in sync with the actual
   table state.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants