-
Notifications
You must be signed in to change notification settings - Fork 1.1k
Open
Description
、
Environment
- Workload: LTP
fs/rwtest04 - Reproducer:
- Command:
/opt/ltp/runltp -Q -b /root/lin/mnt002 -f fs -s rwtest04 -d /root/lin/mnt002 - File:
/root/lin/mnt002/ltp-*/mm-sync-<pid>
- Command:
- Client: FUSE-based file I/O (verify uses O_RDONLY)
- Observation time: 2025-11-12 17:12:45.xxx
Expected behavior
After memcpy to a MAP_SHARED mapping and msync(..., MS_SYNC) returns 0, a verification read of the same range should observe fully updated data.
Actual behavior
Immediate verify-read returns a mix of old and new data for the same range (data comparison error). Logs show:
- Old 12KB content was persisted just before the mmap
memcpy - The subsequent
msyncappears to have flushed only the first 4KB page - The verify-read hits cached data prepared earlier by a window-aligned backend read, so its syscall time is very short and “doesn’t align” with the backend request timing
User-space timeline (doio)
- mmap-write memcpy (offset 4442856, len 9927):
doio_debug_log("mmap-%s memcpy done: fd=%d offset=%d copied=%d memaddr=0x%lx duration_ns=%lld\n",
(sysc->sy_flags & SY_WRITE) ? "write" : "read",
fd, req->r_data.io.r_offset, req->r_data.io.r_nbytes,
(unsigned long)memaddr, ...);[2025-11-12 17:12:45.977696142] mmap-write memcpy done: fd=3 offset=4442856 copied=9927 memaddr=0x7f6663c3cae8 duration_ns=9430449
- msync start and completion:
doio_debug_log("msync: fd=%d offset=%d base=0x%lx len=%d flags=MS_SYNC\n",
fd, req->r_data.io.r_offset, (unsigned long)fdc->c_memaddr,
(int)sbuf.st_size);
msret = msync(fdc->c_memaddr, (int)sbuf.st_size, MS_SYNC);
doio_debug_log("msync: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);[2025-11-12 17:12:45.977715758] msync: fd=3 offset=4442856 base=0x7f6663800000 len=12800000 flags=MS_SYNC
[2025-11-12 17:12:45.982667206] msync: fd=3 offset=4442856 ret=0 errno=0 duration_ns=4937949
- verify-read (independent O_RDONLY fd) right after msync:
doio_debug_log("verify: fd=%d file=%s offset=%d len=%d\n", fd, file, offset, length);
doio_debug_log("verify-lseek: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);
doio_debug_log("verify-read: fd=%d offset=%d buf=0x%lx len=%d ret=%d errno=%d duration_ns=%lld\n", ...);[2025-11-12 17:12:45.982712405] verify: fd=4 file=/root/lin/mnt002/ltp-np5plAMCXo/mm-sync-107502 offset=4442856 len=9927
[2025-11-12 17:12:45.982726786] verify-lseek: fd=4 offset=4442856 ret=4442856 errno=0 duration_ns=1226
[2025-11-12 17:12:45.982805728] verify-read: fd=4 offset=4442856 buf=0x1a85c70 len=9927 ret=9927 errno=0 duration_ns=65796
- Data comparison error (mixed old/new):
*** DATA COMPARISON ERROR ***
check_file(..., 4442856, 9927, I:107540:ip-10-52-148-199:doio*, 31, 0) failed
Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 4442856
1st 32 expected bytes: I:107540:ip-10-52-148-199:doio*I
1st 32 actual bytes: ........................9:doio*I
Server/FUSE-side timeline (backend I/O)
- Prior to
memcpy, a 12KB write of old data persisted:
2025/11/12 17:12:45.968617 [DEBUG] file.go:425: TRACE Write enter: ino(8508829) ... offset(4440064) len(12288) flags(WriteCache) ...
2025/11/12 17:12:45.977674 [DEBUG] file.go:532: TRACE Write: ino(8508829) offset(4440064) len(12288) ... (8999822)ns
- Read path uses window-aligned requests (128KB), may start earlier than user offset; hole handling logs:
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ... ffl=OpenReadOnly+OpenSync)
2025/11/12 17:12:45.982158 [DEBUG] stream_reader.go:189: ... req FileOffset(4395008) Size(45056) ExtentKey(<nil>)
2025/11/12 17:12:45.982179 [DEBUG] stream_reader.go:206: Stream read hole: ino(8508829) req(FileOffset(4395008) Size(45056) ExtentKey(<nil>)) total(45056)
- The 12KB region [4440064, 4452352) is delivered before verify-read:
2025/11/12 17:12:45.982426 [DEBUG] extent_reader.go:114: ExtentReader Read exit: req(FileOffset(4440064) Size(12288) ...) ... readBytes(12288) err(<nil>)
- Only a single 4KB overwrite (new data) is observed right after:
2025/11/12 17:12:45.982545 [DEBUG] stream_conn.go:228: sendToConn exit: ... Op(OpRandomWrite) ... FileOffset(4440064) Size(4096) ...
2025/11/12 17:12:45.982569 [DEBUG] stream_writer.go:436: Streamer write exit: ... offset(4440064) size(4096) done total(4096) err(<nil>)
- A separate example of verify-read at higher offset still triggering an earlier aligned window:
[2025-11-12 17:12:45.968535747] verify-read: fd=4 offset=5072258 ... len=124935 ret=124935 ... duration_ns=1880012
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ...
Why this looks like a “page tear”
- 4KB page size. User-space mmap write: S=4442856, L=9927, end inclusive E=S+L-1=4452782.
- Spans 4 pages:
- P0 [4440064, 4444160) → tail 1304B
- P1 [4444160, 4448256) → full 4096B
- P2 [4448256, 4452352) → full 4096B
- P3 [4452352, 4456448) → head 431B
- Before
memcpy, old data for [4440064, 4452352) (P0–P2) was persisted. - The
msyncwindow that followed only resulted in a single 4KB overwrite (P0), while P1/P2/P3 were not updated in that window. - The immediate verify-read then observed “new P0 tail + old P1/P2 + old P3 head”, matching the compare output where a long run of '.' (old) is followed by the new tail “9:doio*I”.
Why verify-read time “doesn’t align” with backend time
check_file()uses a separate O_RDONLY fd and does a single lseek+read (not reading the writer’s mmap). Its short syscall time reflects copying data from cache.- The backend read had already populated the needed 12KB into cache before verify-read was issued, due to window-aligned prefetch/extent reads.
Impact
- Immediate data verification after
msync(MS_SYNC)can read partially updated content (mixed old/new) for the same range, causing false data corruption failures in mmap workloads.
What might be happening
- Race between:
- Background/streamed writeback of old content (already completed for P0–P2 before
memcpy) - The subsequent
msynconly resulting in one 4KB flush (observed)
- Background/streamed writeback of old content (already completed for P0–P2 before
- Page-granular flushing and request coalescing lead to only a subset of dirty pages being persisted in that specific window, making an immediate read see a torn range.
Requests for maintainers
- Confirm semantics: After
msync(..., MS_SYNC)on MAP_SHARED, should CubeFS guarantee the full dirty range is persisted atomically to the backend before returning? - Investigate:
- Ensuring
msync-triggered flush covers the entire dirty range intersected, not just a single page - Coordination between FUSE writeback and mmap
msyncto avoid persisting stale data that races with a newer mmap write - Read path’s window alignment is expected, but verify that it cannot cache stale data over freshly dirtied pages post-msync
- Ensuring
- If full atomicity is not guaranteed, document recommended barriers/flags for mmap writers to ensure strong persistence before verification.
Possible workarounds (for the test)
- Delay verification slightly after
msyncto allow remaining pages to flush - Verify on page-aligned, page-sized boundaries or use
fsyncon the fd aftermsync - Avoid immediate verification of sub-page tails/heads that straddle pages
Metadata
Metadata
Assignees
Labels
No labels