client/doublezerod/latency: fix flaky TestSingleSocketPing — race + stray-reply misattribution#3594
Closed
client/doublezerod/latency: fix flaky TestSingleSocketPing — race + stray-reply misattribution#3594
Conversation
…ocketPing The sender goroutine wrote `states[i].rtts[seq].sent = time.Now()` while the reader goroutine read it inside `receiveTime.Sub(...)` to compute RTT. The kernel roundtrip provides a logical happens-before but the Go race detector can't see across socket I/O, so the access showed up as a data race and `TestSingleSocketPing_Localhost` failed intermittently. Add a per-state sync.Mutex covering the cross-goroutine field. The sender locks briefly around the `.sent` write; the reader locks around the read + the `.got/.rtt` updates so the slot stays internally consistent. buildResults runs only after readerDone closes, so it doesn't need to lock — but iterating by index avoids copying probeState (which now holds the mutex) and silences the corresponding govet warning. Verified: `go test -race -count=20` over the affected tests passes cleanly inside the dz-go-test container. Originally surfaced as the flaky failure in PR #3592 CI.
The reader matched replies by peer IP and seq only, so any ICMP echo reply on the host with a sequence in [0, pingCount) — including stray replies to other processes' pings on the loopback interface — was attributed to one of our pending slots. The associated `sent` slot was zero-valued, so receiveTime.Sub(zero) saturated to MaxDuration and poisoned min/avg/max in buildResults; this manifested as the assertion "got 61562 <= 20520 <= 9223372036854775807" in TestSingleSocketPing_Localhost on the previous CI run. Switch to matching by echo.ID (which we set to the target index when sending). Raw ICMP sockets preserve the ID byte-for-byte under NET_RAW (the cap CI runs with), so the ID gives an exact, single-target match. Removed the now-redundant ipToIndices map and added a defensive peer-IP sanity check + IsZero guard on `sent`. Verified: 20× -count race-test passes locally; full latency package ok; golangci-lint reports 0 issues.
Contributor
Author
|
Fixed in #3577 |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary of Changes
Two distinct bugs in
SingleSocketPingwere combining to makeTestSingleSocketPing_Localhostflaky in CI:rttEntry.sent. The sender goroutine wrotestates[i].rtts[seq].sent = time.Now()while the reader goroutine read it insidereceiveTime.Sub(...)to compute RTT. The kernel socket roundtrip provides logical happens-before but Go's race detector can't see across socket I/O, so this surfaced as an actual race report. Fixed with a per-statesync.Mutexcovering the cross-goroutine access; the reader's update block is now one critical section withtotalReceived.Add(1)outside the lock.seqonly, not byecho.ID. Any ambient echo reply on the host (e.g. responses to other processes' pings to localhost) with a sequence in[0, pingCount)was matched to one of our slots — whosesentwas zero, soreceiveTime.Sub(zero)saturated toMaxDurationand poisonedmin/avg/maxinbuildResults. This is what produced thegot 61562 <= 20520 <= 9223372036854775807assertion seen in CI. Fixed by matching onecho.ID(we set it to the target index when sending; raw ICMP sockets underNET_RAWpreserve it), with a defensive peer-IP sanity check andIsZeroguard onsentfor belt-and-suspenders correctness. TheipToIndicesmap is no longer needed.buildResultsswitches to index-based iteration so it doesn't copyprobeState(which now contains async.Mutex) and trip govet. No locking is needed there — it runs only afterreaderDonecloses, which establishes happens-before with the reader.Originally surfaced as the flaky
go-testfailure in PR #3592 CI; isolated and fixed here so it can land independently.Diff Breakdown
Single-file fix, no test changes — existing tests now pass cleanly under `-race`.
Key files (click to expand)
Testing Verification