internet-latency-collector: detect and replace unresponsive RIPE Atlas source probes#3198

Merged
snormore merged 6 commits into main from
snor/internet-latency-collector-source-probe-staleness
Mar 10, 2026

@snormore
Contributor

@snormore snormore commented Mar 7, 2026

Summary of Changes

  • Add per-source-probe staleness detection so the collector automatically replaces RIPE Atlas probes that stop sending pings, even when RIPE reports them as "Connected"
  • Add 24-hour TTL to the unresponsive probe blacklist so probes get retried instead of being permanently banned — previously, locations with only one nearby probe (e.g. Pittsburgh) would permanently lose RIPE Atlas coverage
  • Fix source probe comparison in measurement reconciliation to check probe IDs, not just location codes — a replaced probe at the same location was invisible to the reconciliation logic
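
The reconciliation fix in the last bullet can be illustrated with a minimal sketch. The `Measurement` type, both helper functions, and the `pit` location code are hypothetical names chosen for illustration, not the collector's actual code; probe IDs 7466 and 1234 are borrowed from the test description only as sample values.

```go
package main

// Measurement is a hypothetical, simplified view of one RIPE Atlas
// measurement: a target location plus the source probe used at each
// source location.
type Measurement struct {
	TargetLocation string
	// SourceProbes maps a source location code to the probe ID used there.
	SourceProbes map[string]int
}

// sameSourceLocations is the weaker check: it only compares location codes,
// so a replaced probe at the same location looks unchanged.
func sameSourceLocations(existing, wanted Measurement) bool {
	if len(existing.SourceProbes) != len(wanted.SourceProbes) {
		return false
	}
	for loc := range wanted.SourceProbes {
		if _, ok := existing.SourceProbes[loc]; !ok {
			return false
		}
	}
	return true
}

// sameSourceProbes additionally requires the probe ID at every location to
// match, so a replacement probe forces the measurement to be recreated.
func sameSourceProbes(existing, wanted Measurement) bool {
	if len(existing.SourceProbes) != len(wanted.SourceProbes) {
		return false
	}
	for loc, id := range wanted.SourceProbes {
		if got, ok := existing.SourceProbes[loc]; !ok || got != id {
			return false
		}
	}
	return true
}
```

With an existing measurement using probe 7466 at `pit` and a wanted measurement using probe 1234 at `pit`, the location-only check reports a match while the ID-aware check does not.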

Diff Breakdown

| Category | Files | Lines (+/-) | Net |
|---|---|---|---|
| Core logic | 2 | +165 / -31 | +134 |
| Tests | 2 | +216 / -0 | +216 |

~60% tests, ~40% core logic.

Testing Verification

  • New test TestInternetLatency_RIPEAtlas_ConfigureMeasurements_UnresponsiveSourceProbe validates the full flow: stale source probe detected → marked unresponsive → old measurement stopped → new measurement created with replacement probe from the same location
  • New test TestInternetLatency_RIPEAtlas_State_UpdateSourceProbeResponse covers response timestamp tracking including monotonicity and edge cases
  • New test TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeExpiry verifies probes expire from blacklist after 24 hours
  • New test TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeBackwardsCompat verifies legacy [7466, 1234] state files are migrated to the new timestamped format on load
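
As a rough sketch of what the backwards-compatibility test exercises, loading could accept both state shapes as below. The `UnresponsiveProbe` type, its JSON field names, and `parseUnresponsive` are assumptions for illustration; stamping legacy entries with the load time is also an assumption, since the PR does not say which timestamp migrated entries receive.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// UnresponsiveProbe is a hypothetical timestamped blacklist entry.
type UnresponsiveProbe struct {
	ProbeID  int   `json:"probe_id"`
	MarkedAt int64 `json:"marked_at"` // Unix seconds
}

// parseUnresponsive accepts either the timestamped format (an array of
// objects) or the legacy format (a bare array of probe IDs such as
// [7466, 1234]). Legacy entries are stamped with `now` (an assumption),
// which gives them a full TTL starting at load time.
func parseUnresponsive(raw []byte, now int64) ([]UnresponsiveProbe, error) {
	var entries []UnresponsiveProbe
	if err := json.Unmarshal(raw, &entries); err == nil {
		return entries, nil
	}

	// Fall back to the legacy []int shape.
	var ids []int
	if err := json.Unmarshal(raw, &ids); err != nil {
		return nil, fmt.Errorf("unrecognized unresponsive-probe format: %w", err)
	}
	migrated := make([]UnresponsiveProbe, 0, len(ids))
	for _, id := range ids {
		migrated = append(migrated, UnresponsiveProbe{ProbeID: id, MarkedAt: now})
	}
	return migrated, nil
}
```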

@snormore snormore marked this pull request as ready for review March 9, 2026 13:01
@snormore snormore requested a review from nikw9944 March 9, 2026 13:01
@snormore snormore force-pushed the snor/internet-latency-collector-source-probe-staleness branch from 36908af to 59e21e4 Compare March 10, 2026 13:28
…s source probes

Add per-source-probe staleness detection so the collector can automatically
replace probes that stop sending pings. Previously, only target-probe
staleness was detected, meaning a source probe could go silent for hours
without being replaced.

- Track LastResponseAt per source probe during export
- Detect source probes with no results after 1 hour and mark unresponsive
- Regenerate wanted measurements after detecting new unresponsive probes
- Compare source probe IDs (not just location codes) during reconciliation
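
A minimal sketch of the tracking and detection steps listed above, assuming timestamps are stored as Unix seconds. The `sourceState` type and its methods are hypothetical names, not the collector's API; only the one-hour threshold comes from this commit message. Probes with no recorded timestamp need separate handling that is not shown here.

```go
package main

import "sort"

// staleAfterSec is the one-hour threshold named in the commit message.
const staleAfterSec int64 = 60 * 60

// sourceState is a hypothetical record of when each source probe last
// produced a usable result, keyed by probe ID (Unix seconds).
type sourceState struct {
	LastResponseAt map[int]int64
}

// update records a response time for a probe. It never moves a timestamp
// backwards, so out-of-order results cannot make a probe look older.
func (s *sourceState) update(probeID int, ts int64) {
	if s.LastResponseAt == nil {
		s.LastResponseAt = make(map[int]int64)
	}
	if ts > s.LastResponseAt[probeID] {
		s.LastResponseAt[probeID] = ts
	}
}

// stale returns the sorted IDs of tracked probes whose last response is
// more than staleAfterSec before now.
func (s *sourceState) stale(now int64) []int {
	var ids []int
	for id, ts := range s.LastResponseAt {
		if now-ts > staleAfterSec {
			ids = append(ids, id)
		}
	}
	sort.Ints(ids)
	return ids
}
```
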
Previously, probes marked unresponsive were blacklisted permanently with
no recovery path. If a location only had one nearby probe, marking it
unresponsive left that location with zero probes forever.

- Add MarkedAt timestamp to unresponsive probe entries
- Probes automatically expire from the blacklist after 24 hours
- Prune expired entries at the start of each measurement management cycle
- Handle backwards compatibility with legacy []int state format
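
The expiry behavior described above could look roughly like the following. The `blacklist` type and its methods are illustrative names; only the 24-hour TTL and the `MarkedAt` concept come from this commit message. Having a re-mark refresh the timestamp is an assumption.

```go
package main

// unresponsiveTTLSec is the 24-hour TTL described in the commit message.
const unresponsiveTTLSec int64 = 24 * 60 * 60

// blacklist is a hypothetical set of unresponsive probes, mapping each
// probe ID to the Unix time at which it was marked (its MarkedAt).
type blacklist struct {
	markedAt map[int]int64
}

// mark records a probe as unresponsive at time now. Re-marking refreshes
// the timestamp (an assumption, not confirmed by the PR).
func (b *blacklist) mark(probeID int, now int64) {
	if b.markedAt == nil {
		b.markedAt = make(map[int]int64)
	}
	b.markedAt[probeID] = now
}

// prune drops entries whose age has reached the TTL and returns how many
// were removed. It is meant to run at the start of each management cycle.
func (b *blacklist) prune(now int64) int {
	removed := 0
	for id, ts := range b.markedAt {
		if now-ts >= unresponsiveTTLSec {
			delete(b.markedAt, id) // deleting during range is safe in Go
			removed++
		}
	}
	return removed
}

// isUnresponsive reports whether a probe is still blacklisted at time now.
func (b *blacklist) isUnresponsive(probeID int, now int64) bool {
	ts, ok := b.markedAt[probeID]
	return ok && now-ts < unresponsiveTTLSec
}
```
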
…ry yet

On first deploy, all existing source probes have LastResponseAt=0 since
the field is new. Don't treat this as "never responded" — wait for at
least one export cycle to populate the field before making staleness
decisions.
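
One way to express this guard is a three-way classification, sketched below under the assumption that `LastResponseAt` is a Unix timestamp with 0 meaning "not yet populated". The `freshness` type and `classify` function are hypothetical.

```go
package main

// freshness is a hypothetical three-way classification of a source probe.
type freshness int

const (
	noHistory freshness = iota // LastResponseAt was never populated
	fresh                      // responded within the threshold
	stale                      // last response is older than the threshold
)

// classify treats a zero timestamp as "no history yet" rather than as
// "never responded", so a first deploy does not mark every probe stale.
func classify(lastResponseAt, now, staleAfterSec int64) freshness {
	if lastResponseAt == 0 {
		return noHistory
	}
	if now-lastResponseAt > staleAfterSec {
		return stale
	}
	return fresh
}
```

Only probes classified as `stale` would be candidates for replacement; `noHistory` probes are left alone until an export cycle fills in the field.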
@snormore snormore force-pushed the snor/internet-latency-collector-source-probe-staleness branch from 59e21e4 to 9c38e8f Compare March 10, 2026 13:54
@nikw9944
Contributor

I think there might be a case we're still not handling. Each probe is involved in producing results in two ways: 1) by pinging other probes and reporting the results in a RIPE Atlas measurement, and 2) by responding to pings from other probes, which report their results in different measurements. In #3173 the probe was broken in a way that prevented it from responding to pings from the "to sjc" measurement (way 2), even though it was still able to produce data when pinging other probes (way 1).

Would it make sense to update this PR to treat a probe as unresponsive when only one of the two ways fails? (Although that gets a little iffy when a probe uses the way in question to ping, or respond to pings from, only one other probe.)

…s unresponsive

When a target probe stops responding, all source probes in that
measurement show stale LastResponseAt (because latency=0 results
don't trigger UpdateSourceProbeResponse). Without this check,
Phase 2 would incorrectly blacklist healthy source probes. Skip
source staleness checks for measurements whose target is already
marked unresponsive by Phase 1.
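
A hedged sketch of the Phase 2 skip described in this commit message. The `measurement` type and `staleSources` function are illustrative names; the set of unresponsive targets is assumed to come from Phase 1, and timestamps are assumed to be Unix seconds.

```go
package main

import "sort"

// measurement is a hypothetical view of one measurement: a target probe
// plus the last response time (Unix seconds) of each source probe.
type measurement struct {
	TargetProbeID        int
	SourceLastResponseAt map[int]int64
}

// staleSources returns the sorted IDs of source probes that look stale,
// ignoring measurements whose target is already marked unresponsive.
// In those measurements every source appears stale only because the
// target stopped answering, so the sources are not to blame.
func staleSources(ms []measurement, unresponsiveTargets map[int]bool, now, staleAfterSec int64) []int {
	seen := make(map[int]bool)
	for _, m := range ms {
		if unresponsiveTargets[m.TargetProbeID] {
			continue // target is the problem; skip source checks
		}
		for id, ts := range m.SourceLastResponseAt {
			if now-ts > staleAfterSec {
				seen[id] = true
			}
		}
	}
	ids := make([]int, 0, len(seen))
	for id := range seen {
		ids = append(ids, id)
	}
	sort.Ints(ids)
	return ids
}
```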
@snormore snormore merged commit 503f3a6 into main Mar 10, 2026
36 of 37 checks passed
@snormore snormore deleted the snor/internet-latency-collector-source-probe-staleness branch March 10, 2026 16:57