internet-latency-collector: detect and replace unresponsive RIPE Atlas source probes by snormore · Pull Request #3198 · malbeclabs/doublezero

snormore · 2026-03-07T23:32:37Z

Summary of Changes

Add per-source-probe staleness detection so the collector automatically replaces RIPE Atlas probes that stop sending pings, even when RIPE reports them as "Connected"
Add 24-hour TTL to the unresponsive probe blacklist so probes get retried instead of being permanently banned — previously, locations with only one nearby probe (e.g. Pittsburgh) would permanently lose RIPE Atlas coverage
Fix source probe comparison in measurement reconciliation to check probe IDs, not just location codes — a replaced probe at the same location was invisible to the reconciliation logic
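The reconciliation fix in the last item can be illustrated with a minimal sketch. `SourceProbe` and `sameSources` are hypothetical names for this example, not the collector's actual types; the idea is only that equality is decided on (location, probe ID) pairs rather than on location codes alone.

```go
package main

import "sort"

// SourceProbe is a hypothetical, simplified view of a source probe:
// the location it serves and its RIPE Atlas probe ID.
type SourceProbe struct {
	LocationCode string
	ProbeID      int
}

// sameSources reports whether two measurements use exactly the same
// (location, probe ID) pairs, regardless of order. Comparing location
// codes alone would treat a replaced probe at the same location as unchanged.
func sameSources(existing, wanted []SourceProbe) bool {
	if len(existing) != len(wanted) {
		return false
	}
	// Sort copies so the caller's slices are left untouched.
	a := append([]SourceProbe(nil), existing...)
	b := append([]SourceProbe(nil), wanted...)
	less := func(s []SourceProbe) func(i, j int) bool {
		return func(i, j int) bool {
			if s[i].LocationCode != s[j].LocationCode {
				return s[i].LocationCode < s[j].LocationCode
			}
			return s[i].ProbeID < s[j].ProbeID
		}
	}
	sort.Slice(a, less(a))
	sort.Slice(b, less(b))
	for i := range a {
		if a[i] != b[i] {
			return false
		}
	}
	return true
}
```

With this comparison, a measurement whose probe at one location was swapped for another probe no longer matches the wanted set, so it would be stopped and recreated.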

Diff Breakdown

Category	Files	Lines (+/-)	Net
Core logic	2	+165 / -31	+134
Tests	2	+216 / -0	+216

~60% tests, ~40% core logic.

Key files

controlplane/internet-latency-collector/internal/ripeatlas/collector_test.go — new test for unresponsive source probe detection end-to-end (stale probe → marked → measurement recreated with replacement)
controlplane/internet-latency-collector/internal/ripeatlas/state_test.go — tests for source probe response tracking, probe expiry after 24h, and backwards-compatible loading of legacy []int state format
controlplane/internet-latency-collector/internal/ripeatlas/state.go — LastResponseAt field on source probes, UnresponsiveProbeEntry struct with MarkedAt timestamp replacing bare []int, PruneExpiredUnresponsiveProbes(), backwards-compatible state loading
controlplane/internet-latency-collector/internal/ripeatlas/collector.go — source probe staleness detection in configureMeasurements, response time tracking during export, regenerate wanted measurements after marking new probes unresponsive, probe ID comparison in reconciliation

Testing Verification

New test TestInternetLatency_RIPEAtlas_ConfigureMeasurements_UnresponsiveSourceProbe validates the full flow: stale source probe detected → marked unresponsive → old measurement stopped → new measurement created with replacement probe from the same location
New test TestInternetLatency_RIPEAtlas_State_UpdateSourceProbeResponse covers response timestamp tracking including monotonicity and edge cases
New test TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeExpiry verifies probes expire from blacklist after 24 hours
New test TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeBackwardsCompat verifies legacy [7466, 1234] state files are migrated to the new timestamped format on load

…s source probes

Add per-source-probe staleness detection so the collector can automatically replace probes that stop sending pings. Previously, only target-probe staleness was detected, meaning a source probe could go silent for hours without being replaced.

- Track LastResponseAt per source probe during export
- Detect source probes with no results after 1 hour and mark unresponsive
- Regenerate wanted measurements after detecting new unresponsive probes
- Compare source probe IDs (not just location codes) during reconciliation
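The response tracking described in this commit might look roughly like the sketch below. `LastResponseAt` and `UpdateSourceProbeResponse` are named in the PR; the `State` and `SourceProbeState` shapes, the Unix-seconds representation, the boolean return value, and the `recordResult` helper are assumptions for illustration only.

```go
package main

// SourceProbeState is a hypothetical per-probe record. LastResponseAt is the
// field named in the PR, assumed here to be a Unix timestamp in seconds.
type SourceProbeState struct {
	ProbeID        int
	LastResponseAt int64
}

// State is a hypothetical container keyed by probe ID.
type State struct {
	SourceProbes map[int]*SourceProbeState
}

// UpdateSourceProbeResponse records the time of a successful result and
// reports whether the state changed. It is monotonic: unknown probes and
// timestamps that are not newer than the stored one are ignored.
func (s *State) UpdateSourceProbeResponse(probeID int, respondedAt int64) bool {
	p, ok := s.SourceProbes[probeID]
	if !ok || respondedAt <= p.LastResponseAt {
		return false
	}
	p.LastResponseAt = respondedAt
	return true
}

// recordResult (hypothetical) applies one exported ping result. As noted in a
// later commit on this PR, results with latency 0 do not count as a response.
func (s *State) recordResult(probeID int, latencyMs float64, at int64) {
	if latencyMs > 0 {
		s.UpdateSourceProbeResponse(probeID, at)
	}
}
```

Keeping the update monotonic means results exported out of order cannot move a probe's last response time backwards.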

Previously, probes marked unresponsive were blacklisted permanently with no recovery path. If a location only had one nearby probe, marking it unresponsive left that location with zero probes forever.

- Add MarkedAt timestamp to unresponsive probe entries
- Probes automatically expire from the blacklist after 24 hours
- Prune expired entries at the start of each measurement management cycle
- Handle backwards compatibility with legacy []int state format

…ry yet

On first deploy, all existing source probes have LastResponseAt=0 since the field is new. Don't treat this as "never responded"; wait for at least one export cycle to populate the field before making staleness decisions.

nikw9944 · 2026-03-10T14:02:37Z

I think there might be a case we're still not handling. Each probe is involved in producing results in two ways: 1) by pinging other probes and reporting results in a RIPE Atlas measurement, and 2) by responding to pings from other probes, which report results in different measurements. In #3173 the probe was broken in a way that prevented it from responding to pings from the "to sjc" measurement (way 2), even though it was still able to produce data when pinging other probes (way 1).

Would it make sense to update this PR to treat a probe as unresponsive when it fails in only one of the two ways? (Although that gets a little iffy when a probe is only using the way in question to ping or respond to pings from one probe.)

…s unresponsive

When a target probe stops responding, all source probes in that measurement show stale LastResponseAt (because latency=0 results don't trigger UpdateSourceProbeResponse). Without this check, Phase 2 would incorrectly blacklist healthy source probes. Skip source staleness checks for measurements whose target is already marked unresponsive by Phase 1.
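The interaction between the two phases might be sketched as follows. `Measurement`, `staleSourcesToMark`, and the `isStale` callback are hypothetical names for this example; the sketch only shows the skip rule, not the collector's actual data model.

```go
package main

// Measurement is a hypothetical, simplified measurement: one target probe
// pinged by several source probes.
type Measurement struct {
	TargetProbeID  int
	SourceProbeIDs []int
}

// staleSourcesToMark (hypothetical) returns the source probes that Phase 2
// should mark unresponsive. Measurements whose target was already marked
// unresponsive in Phase 1 are skipped, because their sources look stale only
// as a side effect of the silent target.
func staleSourcesToMark(ms []Measurement, unresponsiveTargets map[int]bool, isStale func(probeID int) bool) []int {
	var out []int
	seen := map[int]bool{}
	for _, m := range ms {
		if unresponsiveTargets[m.TargetProbeID] {
			continue
		}
		for _, id := range m.SourceProbeIDs {
			if isStale(id) && !seen[id] {
				seen[id] = true
				out = append(out, id)
			}
		}
	}
	return out
}
```

In this sketch a healthy source probe that only appears in a measurement with a dead target is left alone until the target has been replaced and fresh results arrive.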

snormore marked this pull request as ready for review March 9, 2026 13:01

snormore requested a review from nikw9944 March 9, 2026 13:01

snormore force-pushed the snor/internet-latency-collector-source-probe-staleness branch from 36908af to 59e21e4 Compare March 10, 2026 13:28

snormore added 5 commits March 10, 2026 09:53

changelog: add internet-latency-collector source probe staleness entries

b32631e

internet-latency-collector: fix gofmt indentation in source probe staleness check

9c38e8f

snormore force-pushed the snor/internet-latency-collector-source-probe-staleness branch from 59e21e4 to 9c38e8f Compare March 10, 2026 13:54

nikw9944 approved these changes Mar 10, 2026

snormore merged commit 503f3a6 into main Mar 10, 2026
36 of 37 checks passed

snormore deleted the snor/internet-latency-collector-source-probe-staleness branch March 10, 2026 16:57

snormore merged 6 commits into main from snor/internet-latency-collector-source-probe-staleness
