
e2e: fix flaky DeviceTelemetry metrics timeout #2998

Merged
snormore merged 1 commit into main from snor/fix-telemetry-metrics-waitforready-timeout
Feb 14, 2026


@snormore
Contributor

Summary

  • Add a 5s per-request context timeout to each HTTP fetch attempt inside MetricsClient.WaitForReady, fixing a flaky TestE2E_DeviceTelemetry failure where ny5MetricsClient.WaitForReady times out after 60s
  • The ny5 device runs its metrics listener inside ns-management (via netns.RunInNamespace). Before cEOS finishes moving eth0 into the management namespace, packets to the metrics port are silently dropped. Without a per-request timeout, http.DefaultClient blocks for ~20-30s per attempt on TCP SYN timeout, allowing only 2-3 retries within the 60s window. With the 5s cap, the poller gets ~12 attempts instead.

Testing Verification

  • Verified the flake scenario from a CI run: the ny5 metrics endpoint was unreachable for the full 60s polling window because management namespace setup was delayed under heavy CI load (10 parallel e2e tests)

WaitForReady passes the outer context (with no deadline) to Fetch,
which uses http.DefaultClient (no timeout). When the metrics port
silently drops packets (e.g. listener in ns-management before cEOS
finishes namespace setup), each HTTP attempt blocks for ~20-30s on
TCP connect timeout, starving the poller of retries within the 60s
window. Add a 5s per-request timeout so attempts fail fast and the
poller can retry properly.
@snormore snormore marked this pull request as ready for review February 14, 2026 17:08
@snormore snormore merged commit ef06763 into main Feb 14, 2026
29 of 30 checks passed
@snormore snormore deleted the snor/fix-telemetry-metrics-waitforready-timeout branch February 14, 2026 17:08
