client/doublezerod/latency: fix flaky TestSingleSocketPing — race + stray-reply misattribution#3594

Closed
ben-dz wants to merge 2 commits into main from bdz/fix-latency-race

@ben-dz ben-dz commented Apr 27, 2026

Summary of Changes

Two distinct bugs in SingleSocketPing were combining to make TestSingleSocketPing_Localhost flaky in CI:

  • Data race on rttEntry.sent. The sender goroutine wrote states[i].rtts[seq].sent = time.Now() while the reader goroutine read it inside receiveTime.Sub(...) to compute RTT. The kernel socket roundtrip provides logical happens-before but Go's race detector can't see across socket I/O, so this surfaced as an actual race report. Fixed with a per-state sync.Mutex covering the cross-goroutine access; the reader's update block is now one critical section with totalReceived.Add(1) outside the lock.
  • Stray echo replies were misattributed to pending slots. The reader filtered incoming ICMP replies by peer IP and seq only, not by echo.ID. Any ambient echo reply on the host (e.g. responses to other processes' pings to localhost) with a sequence in [0, pingCount) was matched to one of our slots — whose sent was zero, so receiveTime.Sub(zero) saturated to MaxDuration and poisoned min/avg/max in buildResults. This is what produced the got 61562 <= 20520 <= 9223372036854775807 assertion seen in CI. Fixed by matching on echo.ID (we set it to the target index when sending; raw ICMP sockets under NET_RAW preserve it), with a defensive peer-IP sanity check and IsZero guard on sent for belt-and-suspenders correctness. The ipToIndices map is no longer needed.

buildResults switches to index-based iteration so it doesn't copy probeState (which now contains a sync.Mutex) and trip govet. No locking is needed there — it runs only after readerDone closes, which establishes happens-before with the reader.

Originally surfaced as the flaky go-test failure in PR #3592 CI; isolated and fixed here so it can land independently.

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 1 | +47 / -19 | +28 |

Single-file fix, no test changes — existing tests now pass cleanly under `-race`.

Key files
  • `client/doublezerod/internal/latency/singlesocket.go` — adds `sync.Mutex` to `probeState`, locks around `.sent` accesses, switches the reader to match incoming replies by `echo.ID` (with peer-IP sanity check), drops the now-redundant `ipToIndices` map, and refactors `buildResults` to iterate by index.

Testing Verification

  • `go test -race -count=20 -run TestSingleSocketPing ./client/doublezerod/internal/latency/...` inside `dz-go-test` (which mirrors CI's `NET_RAW`/`NET_ADMIN` capabilities) — clean across 20 iterations, no race output, no assertion failures.
  • `go test -race ./client/doublezerod/internal/latency/...` — full latency package, single pass — `ok ... 56.161s`.
  • `golangci-lint run ./client/doublezerod/internal/latency/...` — `0 issues`.
  • `go vet ./client/doublezerod/internal/latency/...` — clean.

…ocketPing

The sender goroutine wrote `states[i].rtts[seq].sent = time.Now()` while
the reader goroutine read it inside `receiveTime.Sub(...)` to compute
RTT. The kernel roundtrip provides a logical happens-before but the Go
race detector can't see across socket I/O, so the access showed up as a
data race and `TestSingleSocketPing_Localhost` failed intermittently.

Add a per-state sync.Mutex covering the cross-goroutine field. The
sender locks briefly around the `.sent` write; the reader locks around
the read + the `.got/.rtt` updates so the slot stays internally
consistent. buildResults runs only after readerDone closes, so it
doesn't need to lock — but iterating by index avoids copying probeState
(which now holds the mutex) and silences the corresponding govet
warning.

Verified: `go test -race -count=20` over the affected tests passes
cleanly inside the dz-go-test container. Originally surfaced as the
flaky failure in PR #3592 CI.
@ben-dz ben-dz marked this pull request as ready for review April 27, 2026 14:17
@ben-dz ben-dz requested a review from bgm-malbeclabs April 27, 2026 14:17
The reader matched replies by peer IP and seq only, so any ICMP echo
reply on the host with a sequence in [0, pingCount) — including stray
replies to other processes' pings on the loopback interface — was
attributed to one of our pending slots. The associated `sent` slot was
zero-valued, so receiveTime.Sub(zero) saturated to MaxDuration and
poisoned min/avg/max in buildResults; this manifested as the assertion
"got 61562 <= 20520 <= 9223372036854775807" in
TestSingleSocketPing_Localhost on the previous CI run.

Switch to matching by echo.ID (which we set to the target index when
sending). Raw ICMP sockets preserve the ID byte-for-byte under NET_RAW
(the cap CI runs with), so the ID gives an exact, single-target match.
Removed the now-redundant ipToIndices map and added a defensive peer-IP
sanity check + IsZero guard on `sent`.

Verified: `go test -race -count=20` passes locally; the full latency
package reports ok; golangci-lint reports 0 issues.
@ben-dz ben-dz changed the title from "client/doublezerod/latency: fix data race on rttEntry.sent in SingleSocketPing" to "client/doublezerod/latency: fix flaky TestSingleSocketPing — race + stray-reply misattribution" Apr 27, 2026
ben-dz commented Apr 28, 2026

Fixed in #3577

@ben-dz ben-dz closed this Apr 28, 2026