
LTP rwtest04: mmap-write + msync shows partial data visibility (page tear) on immediate verify-read #1268

@clinx

Description


Environment

  • Workload: LTP fs/rwtest04
  • Reproducer:
    • Command: /opt/ltp/runltp -Q -b /root/lin/mnt002 -f fs -s rwtest04 -d /root/lin/mnt002
    • File: /root/lin/mnt002/ltp-*/mm-sync-<pid>
  • Client: FUSE-based file I/O (verify uses O_RDONLY)
  • Observation time: 2025-11-12 17:12:45.xxx

Expected behavior

After memcpy to a MAP_SHARED mapping and msync(..., MS_SYNC) returns 0, a verification read of the same range should observe fully updated data.

Actual behavior

Immediate verify-read returns a mix of old and new data for the same range (data comparison error). Logs show:

  • Old 12KB content was persisted just before the mmap memcpy
  • The subsequent msync appears to have flushed only the first 4KB page
  • The verify-read is served from cache populated earlier by a window-aligned backend read, so its syscall completes very quickly and its timing “doesn’t align” with any backend request

User-space timeline (doio)

  • mmap-write memcpy (offset 4442856, len 9927):
doio_debug_log("mmap-%s memcpy done: fd=%d offset=%d copied=%d memaddr=0x%lx duration_ns=%lld\n",
               (sysc->sy_flags & SY_WRITE) ? "write" : "read",
               fd, req->r_data.io.r_offset, req->r_data.io.r_nbytes,
               (unsigned long)memaddr, ...);
[2025-11-12 17:12:45.977696142] mmap-write memcpy done: fd=3 offset=4442856 copied=9927 memaddr=0x7f6663c3cae8 duration_ns=9430449
  • msync start and completion:
doio_debug_log("msync: fd=%d offset=%d base=0x%lx len=%d flags=MS_SYNC\n",
               fd, req->r_data.io.r_offset, (unsigned long)fdc->c_memaddr,
               (int)sbuf.st_size);
msret = msync(fdc->c_memaddr, (int)sbuf.st_size, MS_SYNC);
doio_debug_log("msync: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);
[2025-11-12 17:12:45.977715758] msync: fd=3 offset=4442856 base=0x7f6663800000 len=12800000 flags=MS_SYNC
[2025-11-12 17:12:45.982667206] msync: fd=3 offset=4442856 ret=0 errno=0 duration_ns=4937949
  • verify-read (independent O_RDONLY fd) right after msync:
doio_debug_log("verify: fd=%d file=%s offset=%d len=%d\n", fd, file, offset, length);
doio_debug_log("verify-lseek: fd=%d offset=%d ret=%d errno=%d duration_ns=%lld\n", ...);
doio_debug_log("verify-read: fd=%d offset=%d buf=0x%lx len=%d ret=%d errno=%d duration_ns=%lld\n", ...);
[2025-11-12 17:12:45.982712405] verify: fd=4 file=/root/lin/mnt002/ltp-np5plAMCXo/mm-sync-107502 offset=4442856 len=9927
[2025-11-12 17:12:45.982726786] verify-lseek: fd=4 offset=4442856 ret=4442856 errno=0 duration_ns=1226
[2025-11-12 17:12:45.982805728] verify-read: fd=4 offset=4442856 buf=0x1a85c70 len=9927 ret=9927 errno=0 duration_ns=65796
  • Data comparison error (mixed old/new):
*** DATA COMPARISON ERROR ***
check_file(..., 4442856, 9927, I:107540:ip-10-52-148-199:doio*, 31, 0) failed

Corrupt regions follow - unprintable chars are represented as '.'
-----------------------------------------------------------------
corrupt bytes starting at file offset 4442856
    1st 32 expected bytes:  I:107540:ip-10-52-148-199:doio*I
    1st 32 actual bytes:    ........................9:doio*I

Server/FUSE-side timeline (backend I/O)

  • Prior to memcpy, a 12KB write of old data persisted:
2025/11/12 17:12:45.968617 [DEBUG] file.go:425: TRACE Write enter: ino(8508829) ... offset(4440064) len(12288) flags(WriteCache) ...
2025/11/12 17:12:45.977674 [DEBUG] file.go:532: TRACE Write: ino(8508829) offset(4440064) len(12288) ... (8999822)ns
  • The read path issues window-aligned requests (128KB) that may start before the user’s offset; hole handling is logged:
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ... ffl=OpenReadOnly+OpenSync)
2025/11/12 17:12:45.982158 [DEBUG] stream_reader.go:189: ... req FileOffset(4395008) Size(45056) ExtentKey(<nil>)
2025/11/12 17:12:45.982179 [DEBUG] stream_reader.go:206: Stream read hole: ino(8508829) req(FileOffset(4395008) Size(45056) ExtentKey(<nil>)) total(45056)
  • The 12KB region [4440064, 4452352) is delivered before verify-read:
2025/11/12 17:12:45.982426 [DEBUG] extent_reader.go:114: ExtentReader Read exit: req(FileOffset(4440064) Size(12288) ...) ... readBytes(12288) err(<nil>)
  • Only a single 4KB overwrite (new data) is observed right after:
2025/11/12 17:12:45.982545 [DEBUG] stream_conn.go:228: sendToConn exit: ... Op(OpRandomWrite) ... FileOffset(4440064) Size(4096) ...
2025/11/12 17:12:45.982569 [DEBUG] stream_writer.go:436: Streamer write exit: ... offset(4440064) size(4096) done total(4096) err(<nil>)
  • A separate example: a verify-read at a higher offset still triggers an earlier aligned window:
[2025-11-12 17:12:45.968535747] verify-read: fd=4 offset=5072258 ... len=124935 ret=124935 ... duration_ns=1880012
2025/11/12 17:12:45.968765 [DEBUG] file.go:357: TRACE Read enter: ... offset(4395008) reqsize(131072) ...

Why this looks like a “page tear”

  • 4KB page size. User-space mmap write: S=4442856, L=9927, end inclusive E=S+L-1=4452782.
  • Spans 4 pages:
    • P0 [4440064, 4444160) → tail 1304B
    • P1 [4444160, 4448256) → full 4096B
    • P2 [4448256, 4452352) → full 4096B
    • P3 [4452352, 4456448) → head 431B
  • Before memcpy, old data for [4440064, 4452352) (P0–P2) was persisted.
  • The msync window that followed only resulted in a single 4KB overwrite (P0), while P1/P2/P3 were not updated in that window.
  • The immediate verify-read then observed “new P0 tail + old P1/P2 + old P3 head”, matching the compare output where a long run of '.' (old) is followed by the new tail “9:doio*I”.

Why verify-read time “doesn’t align” with backend time

  • check_file() uses a separate O_RDONLY fd and does a single lseek+read (not reading the writer’s mmap). Its short syscall time reflects copying data from cache.
  • The backend read had already populated the needed 12KB into cache before verify-read was issued, due to window-aligned prefetch/extent reads.

Impact

  • Immediate data verification after msync(MS_SYNC) can read partially updated content (mixed old/new) for the same range, causing false data corruption failures in mmap workloads.

What might be happening

  • Race between:
    • Background/streamed writeback of old content (already completed for P0–P2 before memcpy)
    • The subsequent msync only resulting in one 4KB flush (observed)
  • Page-granular flushing and request coalescing lead to only a subset of dirty pages being persisted in that specific window, making an immediate read see a torn range.

Requests for maintainers

  • Confirm semantics: After msync(..., MS_SYNC) on MAP_SHARED, should CubeFS guarantee the full dirty range is persisted atomically to the backend before returning?
  • Investigate:
    • Ensuring an msync-triggered flush covers the entire intersecting dirty range, not just a single page
    • Coordination between FUSE writeback and mmap msync to avoid persisting stale data that races with a newer mmap write
    • Window alignment on the read path is expected, but verify that it cannot cache stale data over freshly dirtied pages after msync
  • If full atomicity is not guaranteed, document recommended barriers/flags for mmap writers to ensure strong persistence before verification.

Possible workarounds (for the test)

  • Delay verification slightly after msync to allow remaining pages to flush
  • Verify on page-aligned, page-sized boundaries or use fsync on the fd after msync
  • Avoid immediate verification of sub-page tails/heads that straddle pages
