
fix: move postEvict callbacks outside queueLock in FIFO cache#24117

Merged
mergify[bot] merged 4 commits into matrixorigin:3.0-dev from
aptend:fix-fifo-queuelock-eviction
Apr 15, 2026

Conversation

@aptend
Contributor

@aptend aptend commented Apr 10, 2026

What type of PR is this?

  • API-change
  • BUG
  • Improvement
  • Documentation
  • Feature
  • Test and CI
  • Code Refactoring

Which issue(s) this PR fixes:

issue #24106

What this PR does / why we need it:

Move postEvict callbacks (value.Release + metrics updates) outside the global queueLock in the FIFO cache to eliminate lock convoy under memory pressure.

Root Cause

When MemCache (12GB FIFO cache) is 100% full, every Set() triggers Evict() which holds the global queueLock while executing postEvict callbacks. Under GC pressure (STW pauses up to ~1s), this creates a lock convoy where all concurrent cache operations serialize through the single lock:

  • Each Set() takes 300-444ms (normal: <0.1ms, 3000-4400x slower)
  • Queries with ~90 cache ops inflate from 1-5s to 60-107s
  • Exceeds client-side timeouts → connection disconnections

Evidence from fileservice slow event trace:

Single S3FS.Read (total: 798ms) breakdown:
  disk cache read:            35ms
  set memory cache entry #1: 318ms  ← queueLock wait + eviction
  set memory cache entry #2: 444ms  ← queueLock wait + eviction
  762ms (95%) spent in "set memory cache entry"
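The convoy-prone shape described above can be sketched as follows. This is an illustrative reduction, not the actual MatrixOne code: the `fifo`, `item`, and `postEvict` names are hypothetical stand-ins for the real FIFO cache structures.

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical cache entry; postEvict stands in for
// value.Release plus metrics updates.
type item struct {
	size      int
	postEvict func()
}

// fifo is a hypothetical reduction of the FIFO cache: one global
// queueLock guarding the eviction queue.
type fifo struct {
	queueLock sync.Mutex
	queue     []*item
	used      int
	capacity  int
}

// evictHoldingLock models the pre-fix shape: the callbacks run inside
// the critical section. Under GC pressure each callback can stall,
// and every concurrent Set()/Get() blocks on queueLock meanwhile,
// producing the lock convoy.
func (c *fifo) evictHoldingLock() {
	c.queueLock.Lock()
	defer c.queueLock.Unlock()
	for c.used > c.capacity && len(c.queue) > 0 {
		it := c.queue[0]
		c.queue = c.queue[1:]
		c.used -= it.size
		it.postEvict() // expensive work under the global lock
	}
}

func main() {
	released := 0
	c := &fifo{capacity: 1}
	for i := 0; i < 3; i++ {
		c.queue = append(c.queue, &item{size: 1, postEvict: func() { released++ }})
		c.used++
	}
	c.evictHoldingLock()
	fmt.Println(released, c.used) // 2 items released, 1 remains
}
```

The eviction result is correct either way; the problem is purely how long the global lock is held while producing it.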

Fix

  1. Collect evicted items under queueLock into a pending list
  2. Release queueLock
  3. Execute postEvict callbacks (value.Release + metrics) outside the lock

The item.valueOK flag is still set to false under the shard lock (not queueLock) to prevent data races with concurrent Get() calls.
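The three steps above can be sketched as a minimal Go example. Identifiers here (`fifo`, `item`, `postEvict`) are illustrative assumptions, not the exact MatrixOne types:

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical cache entry; postEvict stands in for
// value.Release plus metrics updates.
type item struct {
	size      int
	postEvict func()
}

// fifo is a hypothetical reduction of the FIFO cache with its
// global queueLock.
type fifo struct {
	queueLock sync.Mutex
	queue     []*item
	used      int
	capacity  int
}

// evict models the fixed shape: collect under the lock, release the
// lock, then run the callbacks lock-free.
func (c *fifo) evict() {
	// Step 1: collect evicted items under queueLock (cheap pointer work).
	var pending []*item
	c.queueLock.Lock()
	for c.used > c.capacity && len(c.queue) > 0 {
		it := c.queue[0]
		c.queue = c.queue[1:]
		c.used -= it.size
		pending = append(pending, it)
	}
	// Step 2: release queueLock before any expensive work.
	c.queueLock.Unlock()
	// Step 3: run postEvict callbacks outside the lock; concurrent
	// cache operations are free to acquire queueLock in the meantime.
	for _, it := range pending {
		it.postEvict()
	}
}

func main() {
	released := 0
	c := &fifo{capacity: 1}
	for i := 0; i < 3; i++ {
		c.queue = append(c.queue, &item{size: 1, postEvict: func() { released++ }})
		c.used++
	}
	c.evict()
	fmt.Println(released, c.used) // same result as before, far less time under the lock
}
```

The observable cache state is unchanged; only the lock hold time shrinks to the pointer shuffling in step 1.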

Impact

In stability testing (commit 725b723):

  • TPCC: 40 "Communications link failure" events
  • Fulltext: 57 "Lost connection" events
  • Sysbench: 0 errors (point queries <5s skip tombstone transfer)


Copilot AI left a comment


Pull request overview

This PR addresses lock convoy and severe latency spikes in the FIFO-based in-memory cache eviction path by moving postEvict work (e.g., value.Release() and metrics updates) out of the global queueLock, reducing contention under memory/GC pressure.

Changes:

  • Collect evicted entries under queueLock into a pending list during eviction.
  • Release queueLock before executing postEvict callbacks.
  • Refactor eviction helpers (evict1/evict2/enqueueGhost) to return pending post-evict data instead of invoking callbacks inline.


Comment thread pkg/fileservice/fifocache/fifo.go
Comment thread pkg/fileservice/fifocache/fifo.go Outdated
aptend and others added 4 commits April 15, 2026 13:59
When MemCache is full (100% utilization), every Set() triggers Evict()
which holds the global queueLock while executing postEvict callbacks
(value.Release + metrics updates). Under GC pressure with STW pauses
up to ~1s, this creates a lock convoy where all concurrent cache
operations serialize through the single queueLock, inflating per-Set()
latency from <0.1ms to 300-444ms.

This change collects evicted items under the queueLock, then executes
postEvict callbacks after releasing the lock. The item's valueOK flag
is still set to false under the shard lock (to avoid data races with
Get()), but the expensive Release() and metrics callbacks run
lock-free.

In stability testing, this bottleneck caused query execution times to
inflate from 1-5s to 60-107s, exceeding client-side timeouts and
producing connection disconnections in TPCC (40 events) and fulltext
(57 events) workloads.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Wrap queueLock.Unlock + postEvict callbacks + done signal in a defer
  to ensure panic safety (lock is always released even on unexpected panic).
- Move done <- target after postEvict callbacks complete, preserving the
  IOVectorCache.Evict contract that done signals eviction is finished.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
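The panic-safety refinement described in the commit message above can be sketched like this. The types and the `done` payload are hypothetical simplifications, not the real `IOVectorCache` signatures:

```go
package main

import (
	"fmt"
	"sync"
)

// item is a hypothetical cache entry; postEvict stands in for
// value.Release plus metrics updates.
type item struct {
	size      int
	postEvict func()
}

// fifo is a hypothetical reduction of the FIFO cache.
type fifo struct {
	queueLock sync.Mutex
	queue     []*item
	used      int
	capacity  int
}

// evictWithDone wraps unlock + callbacks + done signal in one defer:
// queueLock is released even if collection panics, and done fires
// only after the postEvict callbacks have completed, preserving the
// "done means eviction is finished" contract.
func (c *fifo) evictWithDone(done chan int) {
	var pending []*item
	c.queueLock.Lock()
	defer func() {
		c.queueLock.Unlock() // always released, even on panic
		for _, it := range pending {
			it.postEvict() // callbacks run after the unlock
		}
		if done != nil {
			done <- len(pending) // signal only after callbacks complete
		}
	}()
	for c.used > c.capacity && len(c.queue) > 0 {
		it := c.queue[0]
		c.queue = c.queue[1:]
		c.used -= it.size
		pending = append(pending, it)
	}
}

func main() {
	done := make(chan int, 1)
	c := &fifo{capacity: 1}
	for i := 0; i < 3; i++ {
		c.queue = append(c.queue, &item{size: 1, postEvict: func() {}})
		c.used++
	}
	c.evictWithDone(done)
	fmt.Println(<-done) // 2: two items were evicted before done fired
}
```

Ordering the defer body as unlock → callbacks → done keeps the lock hold time minimal while still guaranteeing a waiter on `done` never observes a half-finished eviction.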
@mergify
Contributor

mergify Bot commented Apr 15, 2026

Merge Queue Status

  • Entered queue 2026-04-15 08:08 UTC · Rule: release-3.0
  • Checks skipped · PR is already up-to-date
  • Merged 2026-04-15 08:09 UTC · at bad39a1d5d1146fd5e6175987b75133a4310a0fb

This pull request spent 11 seconds in the queue, including 2 seconds running CI.

Required conditions to merge
  • #approved-reviews-by >= 1 [🛡 GitHub branch protection]
  • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
  • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Utils CI (3.0) / Coverage
    • check-neutral = Matrixone Utils CI (3.0) / Coverage
    • check-skipped = Matrixone Utils CI (3.0) / Coverage
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
    • check-neutral = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
    • check-skipped = Matrixone CI (3.0) / SCA Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone CI (3.0) / UT Test on Ubuntu/x86
    • check-neutral = Matrixone CI (3.0) / UT Test on Ubuntu/x86
    • check-skipped = Matrixone CI (3.0) / UT Test on Ubuntu/x86
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-neutral = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
    • check-skipped = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(Optimistic/PUSH)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-neutral = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
    • check-skipped = Matrixone Compose CI (3.0) / multi cn e2e bvt test docker compose(PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-neutral = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
    • check-skipped = Matrixone Standlone CI (3.0) / Multi-CN e2e BVT Test on Linux/x64(LAUNCH, PROXY)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-neutral = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
    • check-skipped = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH, PESSIMISTIC)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-neutral = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
    • check-skipped = Matrixone Standlone CI (3.0) / e2e BVT Test on Linux/x64(LAUNCH,Optimistic)
  • any of [🛡 GitHub branch protection]:
    • check-success = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-neutral = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)
    • check-skipped = Matrixone Upgrade CI (3.0) / Compatibility Test With Target on Linux/x64(LAUNCH)

@mergify mergify Bot merged commit 9b4ab50 into matrixorigin:3.0-dev Apr 15, 2026
43 of 45 checks passed

Labels

kind/bug Something isn't working
size/M Denotes a PR that changes [100,499] lines

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants