Skip to content

fix(session-log): fix Windows test flake from dropped log lines#1246

Merged
Aaronontheweb merged 1 commit into
netclaw-dev:devfrom
Aaronontheweb:fix/session-log-windows-flake
May 31, 2026
Merged

fix(session-log): fix Windows test flake from dropped log lines#1246
Aaronontheweb merged 1 commit into
netclaw-dev:devfrom
Aaronontheweb:fix/session-log-windows-flake

Conversation

@Aaronontheweb
Copy link
Copy Markdown
Collaborator

@Aaronontheweb Aaronontheweb commented May 31, 2026

What's wrong

One test — SessionLogActorTests.Successive_dispatchers_append_to_same_canonical_file — randomly fails on the Windows CI runner. It never fails on Linux or macOS.

Why

The session log is written one line at a time: open the file, write a line, close it. On Windows, antivirus (Defender) scans a file right after it's closed, and while scanning it briefly locks the file. If the next log line tries to write during that scan, the write is blocked.

The writer retries when that happens, but two problems combined:

  1. The retry budget was too short — the comment said it waited ~150ms, but a bug meant it actually only waited ~70ms.
  2. When the retries ran out, the log line was quietly dropped (with a warning).

So under a slow/busy Windows runner the line got dropped, and the test then waited ~17 seconds for a line that was never coming and failed with a confusing "substring not found".

(Linux and macOS don't lock files this way, which is why it only happens on Windows.)

The fix

  • Retry long enough. Fixed the off-by-one so every backoff step actually runs, extended the total wait to ~585ms, and added a little randomness so retries don't keep landing on the same scan. That comfortably outlasts a Defender scan even on a busy CI machine.
  • Fail clearly if a line is still dropped. The test now watches for the "dropped line" warning and fails immediately with the real reason, instead of timing out 17 seconds later with a vague message.

Not in this PR

The cleaner long-term fix is to keep the log file open for the life of the session instead of opening and closing it for every line — that avoids the antivirus collision entirely. Left out here to keep this change small and low-risk; happy to do it separately.

Testing

  • Ran the test repeatedly — stable.
  • slopwatch and copyright-header checks clean; no baseline changes.

SessionLogActorTests.Successive_dispatchers_append_to_same_canonical_file
flaked on Windows CI: the per-line open→write→close writer triggers Defender
scan-on-close, whose transient write-excluding handle blocks the next append.
When AppendLine exhausted its retry budget the actor dropped the line (logged a
warning), so the polling assertion spun to the AwaitAssert deadline and failed
with an opaque "substring not found".

- SessionLogFile.AppendLine: the retry budget claimed 10/20/40/80ms (150ms) but
  the catch filter `attempt < MaxAttempts - 1` only ran 3 sleeps (70ms) and never
  used the 4th entry. Fix the off-by-one (one retry per backoff entry), extend the
  schedule to ~585ms, and add jitter so retries don't phase-lock with a periodic
  scanner. Bounded blocking on an external OS resource (the AV handle) is the
  legitimate use of Thread.Sleep here.
- SessionLogActorTests: wrap the test in EventFilter.Warning(contains: "Dropped")
  .ExpectAsync(0, ...) so a dropped audit line fails fast with the real cause
  instead of a ~17s opaque timeout.

A deeper follow-up (a persistent per-actor append handle, opened in PreStart and
closed in PostStop) would eliminate the per-line scan-on-close churn entirely;
this change keeps the writer's shape and just makes the retry cover the spike.
@Aaronontheweb Aaronontheweb changed the title fix(session-log): harden append against Windows AV scan-on-close flake fix(session-log): fix Windows test flake from dropped log lines May 31, 2026
@Aaronontheweb Aaronontheweb added tests All issues related to testing, quality assurance, and smoke testing. bug Something isn't working labels May 31, 2026
@Aaronontheweb Aaronontheweb enabled auto-merge (squash) May 31, 2026 17:17
@Aaronontheweb Aaronontheweb merged commit 3b91662 into netclaw-dev:dev May 31, 2026
14 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working tests All issues related to testing, quality assurance, and smoke testing.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant