Improve watermarks in POSIX-like objects tracker #225

@zxqfd555

Description

Is your feature request related to a problem? Please describe.

The POSIX-like object tracker currently detects changes by comparing a watermark tuple of (mtime_seconds, size, owner). This is insufficient in at least one well-known scenario: if a file is written twice within the same wall-clock second and the resulting size and owner are identical, the tracker cannot see the second write and the change is silently lost. This is not a rare edge case: it occurs regularly in log rotation, test pipelines, and any system that rewrites files at high frequency.

Describe the solution you'd like

Improve the watermark tuple on each supported platform to make silently missed changes as unlikely as possible. The improvements fall into three independent areas:

1. Sub-second timestamps (POSIX / local filesystem)

Replace the integer-second mtime with a nanosecond-precision timestamp where the OS and filesystem support it. On Linux, os.stat() already exposes st_mtime_ns (an integer number of nanoseconds since the epoch) in the same call, with no additional syscall, and the same field is available on macOS. This alone eliminates the vast majority of same-second collisions, although the effective resolution still depends on the filesystem's timestamp granularity. The new watermark tuple for local-filesystem objects becomes:

(mtime_ns, size, owner)

Where mtime_ns is stat.st_mtime_ns (integer, nanoseconds). For filesystems or OS configurations that cannot provide sub-second precision, mtime_ns falls back gracefully to mtime_seconds * 1_000_000_000, so the tuple keeps the same shape and detection is never worse than it is today.
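
A minimal Python sketch of the tuple above, for illustration only. The helper name local_watermark is hypothetical, and using st_uid as the owner component is an assumption; the tracker's actual implementation may differ.

```python
import os
from typing import Tuple


def local_watermark(path: str) -> Tuple[int, int, int]:
    """Return (mtime_ns, size, owner) for a local file.

    Hypothetical helper for illustration; not the tracker's real code.
    """
    st = os.stat(path)
    # st_mtime_ns is an int, so no float rounding is involved. On filesystems
    # with coarse timestamp granularity the low-order digits are simply zero.
    mtime_ns = getattr(st, "st_mtime_ns", None)
    if mtime_ns is None:
        # Fallback described above: widen whole seconds to nanoseconds.
        mtime_ns = int(st.st_mtime) * 1_000_000_000
    # Assumption: the numeric user id stands in for "owner".
    return (mtime_ns, st.st_size, st.st_uid)
```

Two writes that land in the same second but at different nanosecond timestamps then produce different first components, even when size and owner match.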

2. ETag / checksum (S3 and S3-compatible stores)

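S3's HeadObject response already returns an ETag field for every object at no extra cost (the call is made anyway for size). For objects uploaded as a single part, either unencrypted or encrypted with SSE-S3, the ETag is the MD5 digest of the content; for multipart uploads, and for objects encrypted with SSE-KMS or SSE-C, it is an opaque value that is not an MD5 digest. In all of these cases the ETag reflects changes to the object's content (but not to its metadata), which is what the tracker needs to detect. The new watermark tuple for S3 objects becomes: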

(last_modified, size, etag)

etag replaces owner (S3 objects have no meaningful owner field). If etag is absent or empty in the response (some S3-compatible stores omit it), the tracker falls back to the existing (last_modified, size) tuple. Additionally, where S3 Object Versioning is enabled on the bucket, the VersionId from HeadObject is an even stronger identity signal and should be preferred over the ETag when present.
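
The selection order described above could be sketched as follows. The function name s3_watermark is hypothetical; the dictionary keys follow the shape of a boto3 head_object response. Putting VersionId in the same tuple slot as the ETag, and treating the literal string "null" as "no version", are assumptions of this sketch.

```python
from typing import Any, Mapping, Tuple


def s3_watermark(head: Mapping[str, Any]) -> Tuple[Any, ...]:
    """Build a watermark from a HeadObject-style response mapping.

    Hypothetical helper; prefers VersionId, then ETag, then the bare tuple.
    """
    last_modified = head["LastModified"]
    size = head["ContentLength"]

    version_id = head.get("VersionId")
    # Assumption: a missing VersionId or the literal "null" means the object
    # has no usable version identity.
    if version_id and version_id != "null":
        return (last_modified, size, version_id)

    # ETag values are returned wrapped in double quotes; strip them.
    etag = (head.get("ETag") or "").strip('"')
    if etag:
        return (last_modified, size, etag)

    # Some S3-compatible stores omit the ETag: fall back to the old tuple.
    return (last_modified, size)
```

In a real tracker it may be safer to record which kind of identifier occupies the third slot, so that an ETag is never compared against a VersionId.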

3. Additional OS-level factors (POSIX)

On Linux, os.stat() exposes two additional fields that can act as tiebreakers when mtime_ns and size are equal:

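  • st_ino: inode number. Changes when a file is replaced atomically (e.g. rename() or mv), which is the most common pattern for log rotation and atomic writes. Including the inode number means that a file swapped under the same path is detected in almost all cases. The exception is inode reuse: if the old file is deleted before the new one is created, the filesystem may hand out the same inode number again.
  • st_ctime_ns: inode change time in nanoseconds. Changes on any content write or metadata update, including chmod and chown, and cannot be set to an arbitrary value by user code the way mtime can. Note that on Windows st_ctime reports the creation time instead.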

The new watermark tuple for local-filesystem objects on Linux / macOS therefore becomes:

(mtime_ns, size, owner, inode, ctime_ns)

Fields should be added in order of reliability. On platforms where a field is unavailable (e.g. Windows, some network filesystems that lie about inodes), the tuple silently drops that field to maintain portability.
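
The following sketch shows why the inode helps, assuming a POSIX filesystem. It replaces a file atomically with a same-sized file whose mtime is forced to the old value, so that the first three fields collide. The helper name extended_watermark and the file names are hypothetical.

```python
import os
import tempfile


def extended_watermark(path):
    """Return (mtime_ns, size, owner, inode, ctime_ns); illustrative only."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size, st.st_uid, st.st_ino, st.st_ctime_ns)


with tempfile.TemporaryDirectory() as d:
    target = os.path.join(d, "app.log")
    with open(target, "w") as f:
        f.write("aaaa")
    before = extended_watermark(target)

    # Atomic replacement: write a same-sized file next to the target,
    # then rename it over the target.
    tmp = os.path.join(d, "app.log.tmp")
    with open(tmp, "w") as f:
        f.write("bbbb")
    # Force the old mtime onto the new file so (mtime_ns, size, owner) collide.
    os.utime(tmp, ns=(before[0], before[0]))
    os.replace(tmp, target)
    after = extended_watermark(target)

    # Whether the legacy-shaped prefix of the tuple fails to see the change.
    old_style_equal = before[:3] == after[:3]
    # Both files existed at the same time, so their inode numbers differ.
    inode_changed = before[3] != after[3]
    print("old tuple equal:", old_style_equal, "| inode changed:", inode_changed)
```

The sketch deliberately does not rely on ctime_ns differing, since two operations that happen very close together can receive the same kernel timestamp.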

Backward compatibility

The cached object storage layer (persistence) must remain backward-compatible. Existing persisted watermarks in the old (mtime_seconds, size, owner) format must continue to be recognized and handled safely when the tracker produces watermarks in the new format. Specifically:

  • On first startup after the upgrade, if a persisted watermark is in the old format, the tracker must treat it as a cache miss (i.e. re-read the object) rather than crash or silently skip the file. This is the safe default: a false positive re-read is always preferable to a silently missed change.
  • A migration path or version tag in the persisted watermark format would be a clean solution, but is not strictly required as long as the fallback behavior above is guaranteed.
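
One possible shape for the version-tag idea, as a sketch only. The JSON encoding, the "v" and "fields" keys, and all function names are assumptions; the actual persisted format is not described in this issue. Anything that is not a current-version record is treated as a cache miss.

```python
import json
from typing import Optional, Tuple

# Hypothetical tag; untagged legacy records are treated as an older version.
WATERMARK_FORMAT_VERSION = 2


def encode_watermark(fields: Tuple[int, ...]) -> str:
    """Serialize a watermark together with its format version."""
    return json.dumps({"v": WATERMARK_FORMAT_VERSION, "fields": list(fields)})


def decode_watermark(raw: str) -> Optional[Tuple[int, ...]]:
    """Return the stored fields, or None if the record is legacy or unreadable."""
    try:
        record = json.loads(raw)
    except ValueError:
        return None
    if not isinstance(record, dict) or record.get("v") != WATERMARK_FORMAT_VERSION:
        return None
    fields = record.get("fields")
    if not isinstance(fields, list):
        return None
    return tuple(fields)


def is_unchanged(persisted_raw: str, current: Tuple[int, ...]) -> bool:
    """True only when a current-format record matches exactly."""
    stored = decode_watermark(persisted_raw)
    # None means the cache cannot be trusted: report a change so the
    # object is re-read rather than skipped.
    return stored is not None and stored == current
```

With this shape, an old record such as "[1700000000, 4, 1000]" never compares equal to a new watermark, so the object is re-read once and then persisted in the new format.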

Describe alternatives you've considered

  • Content hashing every file on every poll — would guarantee detection of all changes, but is prohibitively expensive at scale (requires reading all file contents, not just metadata).
  • Reducing the poll interval to sub-second — does not fix the fundamental ambiguity; two writes within the same tick are still indistinguishable.

Additional context

The issue affects pw.io.fs.read, pw.io.s3.read, pw.io.minio.read, and any other connector that uses the same POSIX-like object tracker internally. The S3 ETag improvement also benefits pw.io.s3.read in streaming mode, where the tracker is already responsible for detecting object modifications.


Labels: enhancement
