perf(ebean-dao): share schema metadata cache per database to fix cold-start P99 spikes#615
Conversation
…-start P99 spikes Introduce SharedSchemaCache — one cache per database URL instead of one per entity type. EbeanLocalAccess now resolves a shared instance via its JDBC URL, so ~20 entity types on metagalaxy share 4 caches instead of maintaining ~50 independent copies. Pre-warm each entity table during ensureSchemaUpToDate() so the first request after deploy is never slow. Background refresh runs every 9 min with per-host random jitter (0-60s) to prevent the thundering-herd cache expiry seen in prod (7+ hosts querying information_schema simultaneously). Reduces information_schema queries from ~150 to ~2 per database per refresh cycle per host. Zero API changes — all upstream services benefit automatically. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
EbeanLocalDAOTest drops and recreates tables before each test via @BeforeMethod, removing any virtual columns that addIndex() had added dynamically. The SharedSchemaCache (static singleton per JDBC URL) retained the stale column set across that DROP+CREATE, so subsequent columnExists() calls returned false for freshly-re-added columns, causing getIndexedExpressionOrColumn() to skip the WHERE filter and return wrong row counts. Fix: call SharedSchemaCache.clearRegistry() after the schema reset in setupTest() so the next columnExists() query hits information_schema fresh. Also make clearRegistry() public so the test can reach it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…rviceIdentifier EbeanLocalAccessTest resets its EbeanLocalAccess validator to a fresh per-instance SchemaValidatorUtil before each test via @BeforeMethod resetValidatorInstance(), preventing the SharedSchemaCache from serving stale column data after setupTest() recreates tables. The sibling class EbeanLocalAccessTestWithoutServiceIdentifier was missing this method, causing testListUrnsWithLastUrn, testListUrnsWithOffset, and testCountAggregate to fail because columnExists() returned stale results and WHERE filters were silently skipped. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| try { | ||
| columnCache.put(tableName, loadColumns(tableName)); | ||
| indexCache.put(tableName, loadIndexes(tableName)); | ||
| log.debug("Schema cache refreshed for table '{}'", tableName); |
There was a problem hiding this comment.
We should refresh indexExpressionCache as well?
…reshTable Background refresh and registerAndPreWarm were keeping columnCache and indexCache warm but silently leaving indexExpressionCache stale. Add the missing loadIndexesAndExpressions() call so all three caches stay in sync. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| indexes != null ? indexes.getOrDefault(lowerIndex, null) : null); | ||
| } catch (Exception e) { | ||
| // MariaDB for local testing doesn't support EXPRESSION column | ||
| log.debug("Unable to load index expressions for table '{}': {}", lowerTable, e.getMessage()); |
There was a problem hiding this comment.
We don't have DEBUG level logs enabled I think. We should make it INFO level.
There was a problem hiding this comment.
And at all other places too
DEBUG logs are not enabled in most environments; the log message for MariaDB not supporting the EXPRESSION column was silently dropped. Change to INFO so the fallback is always visible. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Schema cache refresh success log was at DEBUG level which is not enabled. Change to INFO for consistency with the other logs in this class. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| public void ensureSchemaUpToDate() { | ||
| _schemaEvolutionManager.ensureSchemaUpToDate(); | ||
| // Pre-warm the shared cache for this entity's table so the first request is never slow. | ||
| validator.registerAndPreWarm(getTableName(_entityType)); |
There was a problem hiding this comment.
Nit: this only pre-warms the prod tables. batchGetUnion calls columnExists with getTestTableName(entityUrn) when isTestMode=true. Should we do it for test tables also?
|
|
||
| public boolean columnExists(@Nonnull String tableName, @Nonnull String columnName) { | ||
| Set<String> columns = columnCache.get(tableName.toLowerCase(), this::loadColumns); | ||
| return columns != null && columns.contains(columnName.toLowerCase()); |
There was a problem hiding this comment.
Nit: columns != null is dead code? Same for following methods?
|
|
||
| private void scheduleBackgroundRefresh() { | ||
| long jitterSeconds = ThreadLocalRandom.current().nextLong(0, JITTER_MAX_SECONDS + 1); | ||
| long initialDelay = REFRESH_INTERVAL_SECONDS + jitterSeconds; |
There was a problem hiding this comment.
Can you explain why we did we add jitterSeconds here?
There was a problem hiding this comment.
Good question! Without jitter, all ~7 metagalaxy hosts would start their background refresh at the same moment (since they all deploy together and REFRESH_INTERVAL_SECONDS is fixed). That would cause a thundering herd where all hosts simultaneously query information_schema every 9 minutes, amplifying DB load. Adding random jitter of 0–60s to the initial delay staggers the first refresh across hosts so subsequent cycles are naturally spread out.
There was a problem hiding this comment.
nit: flip + jitterSeconds to - jitterSeconds — keeps the same fleet spread but always leaves a 60s buffer before TTL?
There was a problem hiding this comment.
Good catch — done. With + the worst-case initial delay was 540+60 = 600s, which equals the TTL exactly and could let entries expire before the refresh fires. With - the initial delay is 480–540s, so the refresh always fires at least 2 minutes before the 10-minute TTL. Updated the constant comment to explain the new semantics.
|
Few comments:
|
- remove dead null checks (Caffeine get() never returns null) - add explanatory comments to each constant - pre-warm test table in ensureSchemaUpToDate() so warm-up coverage is complete for both prod and shadow tables Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…eryDAO + add tests - add new EbeanLocalRelationshipQueryDAO(server, serverConfig, config) constructor that uses SharedSchemaCache via the same buildValidator() pattern as EbeanLocalAccess - pre-warm test table in EbeanLocalAccess.ensureSchemaUpToDate() alongside prod table - add SharedSchemaCacheTest covering columnExists, getColumns, indexExists, registerAndPreWarm (verifies no DB call after warm), clearCaches, singleton identity per URL, and getIndexExpression (mocked for MariaDB compatibility) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
|
Thanks for the review @rakhiagr! Addressing all three:
|
| void clearCaches() { | ||
| if (sharedCache != null) { | ||
| sharedCache.clearCaches(); | ||
| return; | ||
| } | ||
| indexCache.invalidateAll(); | ||
| columnCache.invalidateAll(); | ||
| indexExpressionCache.invalidateAll(); | ||
| } |
There was a problem hiding this comment.
why is this change required, this function already seems redundant to me, Can we clean it up if not required
There was a problem hiding this comment.
Good catch — you're right that this change isn't strictly needed. SchemaValidatorUtil.clearCaches() is @VisibleForTesting and our tests actually call SharedSchemaCache.clearRegistry() / SharedSchemaCache.clearCaches() directly, so the delegation I added here is dead code. I'll remove it and leave the method body clearing only the per-instance Caffeine caches (which is the pre-existing behavior). If you'd prefer we remove the whole method since it's also not called anywhere, happy to do that too — let me know.
| private static final String SQL_GET_ALL_INDEXES = | ||
| "SELECT DISTINCT INDEX_NAME FROM information_schema.STATISTICS WHERE TABLE_SCHEMA = database() AND TABLE_NAME = '%s'"; |
There was a problem hiding this comment.
why do we need this query? we are getting index name already in SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS query,
There was a problem hiding this comment.
The two queries need to stay separate for MariaDB compatibility. MariaDB (our embedded test DB) does not have the EXPRESSION column in information_schema.STATISTICS, so any query that selects it throws an exception. SQL_GET_ALL_INDEXES (no EXPRESSION column) is used by indexExists(), which must work reliably in all environments including local dev/test. SQL_GET_ALL_INDEXES_WITH_EXPRESSIONS is only used by getIndexExpression(), which wraps the call in a try/catch and gracefully returns null on failure (the MariaDB case). If we merged them into one query, indexExists() would throw in MariaDB and break all local tests.
…orUtil.clearCaches Tests call SharedSchemaCache.clearRegistry() / clearCaches() directly; the delegation added in this PR was never exercised. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
| if (dbUrl != null && !dbUrl.isEmpty()) { | ||
| return new SchemaValidatorUtil(SharedSchemaCache.getInstance(server, dbUrl)); | ||
| } | ||
| return new SchemaValidatorUtil(server); |
There was a problem hiding this comment.
when can dburl be null or empty? I would recommend throwing an exception here and cleanup this path completely, and
There was a problem hiding this comment.
Good point — in practice dbUrl should never be null since all production metagalaxy configs use URL-based datasource setup. The fallback was overly defensive. Done — buildValidator() now throws IllegalStateException if dbUrl is null or empty, and the silent fallback path is removed.
Also incorporated @rakagrawal's suggestion from Slack: added log.info("Inline schema cache miss: loading ... for table '{}'", tbl) inside all three cache.get() loaders in SharedSchemaCache, so operators can see unexpected cold loads in production (after this PR, these should only appear at startup before pre-warming completes, never mid-traffic).
…iss logs - buildValidator() now throws IllegalStateException if ServerConfig has no URL instead of silently falling back to a per-instance SchemaValidatorUtil (which would bypass SharedSchemaCache entirely) - add INFO log on inline cache miss in all three SharedSchemaCache.cache.get() loaders so operators can detect unexpected cold loads in production Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ee TTL buffer With + jitterSeconds the worst-case initial delay equalled the TTL (600s), risking a cache gap. Subtracting instead means the refresh fires 8–9 min after startup, always leaving at least 2 min before the 10-min TTL expires. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
rakhiagr
left a comment
There was a problem hiding this comment.
Thank you for addressing all the comments!
PR linkedin#615 introduced SharedSchemaCache with pre-warm calls inside EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema migrations run as a separate job (LinkedIn's prod pattern, and likely others), ensureSchemaUpToDate() is gated off, so pre-warm never runs and the first request after JVM startup pays the inline information_schema query cost (the "Inline schema cache miss" log line). Moves the pre-warm calls to the EbeanLocalAccess constructor so they always run, regardless of whether schema evolution is enabled. refreshTable() already has its own try/catch, so a transient DB hiccup during construction logs a warning rather than failing boot. Also drops the unused tableColumns field (dead code, replaced by the shared cache in linkedin#615). Two supporting changes: 1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on each held instance before wiping the REGISTRY map. Previously clearing only the static map left stale entries readable through any reference (e.g. EbeanLocalAccess.validator) that callers were already holding -- a footgun that the @VisibleForTesting name did not suggest. 2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry() after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now running in the constructor, the cache snapshots the schema before addIndex modifies it; the explicit invalidation keeps subsequent columnExists()/indexExists() lookups in sync with the actual table state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
PR #615 introduced SharedSchemaCache with pre-warm calls inside EbeanLocalAccess.ensureSchemaUpToDate(). In deploys where schema migrations run as a separate job (LinkedIn's prod pattern, and likely others), ensureSchemaUpToDate() is gated off, so pre-warm never runs and the first request after JVM startup pays the inline information_schema query cost (the "Inline schema cache miss" log line). Moves the pre-warm calls to the EbeanLocalAccess constructor so they always run, regardless of whether schema evolution is enabled. refreshTable() already has its own try/catch, so a transient DB hiccup during construction logs a warning rather than failing boot. Also drops the unused tableColumns field (dead code, replaced by the shared cache in #615). Two supporting changes: 1. SharedSchemaCache.clearRegistry() now also calls clearCaches() on each held instance before wiping the REGISTRY map. Previously clearing only the static map left stale entries readable through any reference (e.g. EbeanLocalAccess.validator) that callers were already holding -- a footgun that the @VisibleForTesting name did not suggest. 2. EbeanLocalDAOTest.addIndex() helper now calls clearRegistry() after its dynamic ALTER TABLE ADD COLUMN. With pre-warm now running in the constructor, the cache snapshots the schema before addIndex modifies it; the explicit invalidation keeps subsequent columnExists()/indexExists() lookups in sync with the actual table state. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Problem: Each entity type on a shared database maintained its own
SchemaValidatorUtilCaffeine cache forinformation_schemacolumn/index metadata (10-min TTL). On cache expiry, the next request queriesinformation_schemainline. For low-QPS entities (MlModel, DataPolicy, etc.) this single slow query dominates P95 and causes SLO breaches. Compounding factors: ~20 entity types onmetagalaxyeach had independent caches (20 redundant queries for the same data), all deployed together so all caches expired simultaneously (confirmed: 7+ hosts refreshing in the same second), and no pre-warming after deploy/restart.Fix: Introduce
SharedSchemaCache— one cache instance per database URL (via static registry) instead of one per entity type. ~50 independent caches collapse to 4 (one per database). Each entity table is pre-warmed duringensureSchemaUpToDate()so the first post-deploy request is never slow. A background daemon thread refreshes all registered tables every 9 minutes with per-host random jitter (0–60s) to prevent thundering-herd expiry across the fleet.Impact: Reduces
information_schemaqueries from ~150 to ~2 per database per refresh cycle per host. Zero API changes —metadata-graph-assets,metadata-store, andmetadata-graph-queryall benefit automatically.Files changed
SharedSchemaCache.java— per-database shared cache with background refresh + jitterSchemaValidatorUtil.java— new constructor acceptingSharedSchemaCache; all public methods delegate to it when present; backward-compatibleEbeanServerconstructor unchanged for testsEbeanLocalAccess.java— resolvesSharedSchemaCacheby JDBC URL at construction; callsregisterAndPreWarminensureSchemaUpToDate()Testing Done
./gradlew :dao-impl:ebean-dao:compileJava)SchemaValidatorUtil(EbeanServer)constructor unchanged — all existing tests remain compatibleEbeanLocalAccessTest.resetValidatorInstance()still works (injects old-style validator via reflection, bypassing shared cache — intentional for test isolation)brew install mariadbrequired per CLAUDE.md) — pre-existing environment issue unrelated to this changeChecklist