telemetry/geoprobe: retry transient bind errors in publisher AddProbe#3818

Merged
nikw9944 merged 1 commit into main from nikw9944/doublezero-3765 on Jun 2, 2026

Conversation

@nikw9944
Contributor

@nikw9944 nikw9944 commented Jun 1, 2026

Summary of Changes

  • Wrap the per-probe UDP socket allocation in Publisher.AddProbe with a retry-on-bind-error helper, mirroring the existing mitigation in Pinger. The kernel can return EINVAL from bind(2) when many goroutines concurrently allocate ephemeral sockets, which intermittently fails TestPublisher_RemoveProbe/TestPublisher_AddProbe in CI with bind: invalid argument.
  • Lift isBindError and the retry loop out of pinger.go into a small package-shared retry.go, exposing a generic retryOnBindError[T] helper. The pinger path now calls the same helper, removing duplicated logic.
  • The retry runs inside the netns.RunInNamespace closure so all attempts and backoffs stay on the same OS-thread-locked netns when ManagementNamespace is set.
  • Fixes geolocation test flakes #3765

Testing Verification

  • New retry_test.go covers the four important branches of the helper: success after a transient bind failure, fail-fast on a non-bind error, give-up after senderRetries attempts, and prompt return on a cancelled context.
  • go test -race -count=5 -run 'TestRetryOnBindError|TestIsBindError|TestPublisher_AddProbe|TestPublisher_RemoveProbe' ./controlplane/telemetry/internal/geoprobe/... passes locally.
  • Full go test ./controlplane/telemetry/internal/geoprobe/... passes.
  • golangci-lint run ./controlplane/telemetry/internal/geoprobe/... reports 0 issues.

@nikw9944 nikw9944 mentioned this pull request Jun 1, 2026
@nikw9944 nikw9944 requested a review from ben-dz June 1, 2026 19:59
@nikw9944 nikw9944 marked this pull request as ready for review June 1, 2026 19:59
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3765 branch 2 times, most recently from fb0ef8e to 9203fb1 Compare June 1, 2026 20:07
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3765 branch from 9203fb1 to eb2419e Compare June 1, 2026 20:36
@nikw9944 nikw9944 enabled auto-merge (squash) June 1, 2026 20:37
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3765 branch from eb2419e to 632798e Compare June 1, 2026 23:27
Lift the existing retry-on-bind-error helper from pinger.go into a
package-shared retry.go and use it to wrap the per-probe UDP socket
allocation in Publisher.AddProbe. The kernel can return EINVAL from
bind(2) when many goroutines concurrently allocate ephemeral sockets,
which intermittently fails geoprobe publisher tests in CI.

Refs #3765
@nikw9944 nikw9944 force-pushed the nikw9944/doublezero-3765 branch from 632798e to cba3ae8 Compare June 2, 2026 01:36
@nikw9944 nikw9944 merged commit 22b1a82 into main Jun 2, 2026
38 of 39 checks passed
@nikw9944 nikw9944 deleted the nikw9944/doublezero-3765 branch June 2, 2026 02:20

Development

Successfully merging this pull request may close these issues.

geolocation test flakes

2 participants