36908af to 59e21e4
…s source probes

Add per-source-probe staleness detection so the collector can automatically replace probes that stop sending pings. Previously, only target-probe staleness was detected, meaning a source probe could go silent for hours without being replaced.

- Track LastResponseAt per source probe during export
- Detect source probes with no results after 1 hour and mark unresponsive
- Regenerate wanted measurements after detecting new unresponsive probes
- Compare source probe IDs (not just location codes) during reconciliation
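The staleness rule in this commit message can be sketched roughly as follows. This is a minimal illustration, not the collector's code: the `SourceProbe` type, the `staleProbeIDs` helper, and the use of Unix seconds are assumptions made for the example. `UpdateSourceProbeResponse` is named elsewhere in the PR, but its signature here is a guess. A zero `LastResponseAt` is not special-cased in this sketch.

```go
package main

import "fmt"

// staleAfterSeconds is the 1-hour threshold from the commit message.
const staleAfterSeconds int64 = 60 * 60

// SourceProbe is a hypothetical stand-in for the collector's per-probe state.
type SourceProbe struct {
	ID             int
	LastResponseAt int64 // Unix seconds of the most recent result
}

// UpdateSourceProbeResponse records a response time and never moves it backwards.
// The signature is assumed for illustration.
func (p *SourceProbe) UpdateSourceProbeResponse(ts int64) {
	if ts > p.LastResponseAt {
		p.LastResponseAt = ts
	}
}

// staleProbeIDs returns the IDs of probes whose last result is older than the threshold.
func staleProbeIDs(probes []SourceProbe, now int64) []int {
	var stale []int
	for _, p := range probes {
		if now-p.LastResponseAt > staleAfterSeconds {
			stale = append(stale, p.ID)
		}
	}
	return stale
}

func main() {
	now := int64(1_700_000_000)
	probes := []SourceProbe{
		{ID: 1, LastResponseAt: now - 600},  // responded 10 minutes ago
		{ID: 2, LastResponseAt: now - 7200}, // silent for 2 hours
	}
	fmt.Println(staleProbeIDs(probes, now)) // prints [2]
}
```

The IDs returned would then be marked unresponsive, and the wanted measurements regenerated so a replacement probe can be picked.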
Previously, probes marked unresponsive were blacklisted permanently with no recovery path. If a location only had one nearby probe, marking it unresponsive left that location with zero probes forever.

- Add MarkedAt timestamp to unresponsive probe entries
- Probes automatically expire from the blacklist after 24 hours
- Prune expired entries at the start of each measurement management cycle
- Handle backwards compatibility with legacy []int state format
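One way the expiry and legacy-format handling described above might look is sketched below. `UnresponsiveProbeEntry` and `MarkedAt` are named in the PR; the JSON tags, the `loadUnresponsive` and `pruneExpired` helpers, and the choice to stamp legacy entries with the load time are assumptions for illustration only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// unresponsiveTTLSeconds is the 24-hour expiry from the commit message.
const unresponsiveTTLSeconds int64 = 24 * 60 * 60

// UnresponsiveProbeEntry mirrors the struct named in the PR; field tags are assumed.
type UnresponsiveProbeEntry struct {
	ProbeID  int   `json:"probe_id"`
	MarkedAt int64 `json:"marked_at"` // Unix seconds
}

// loadUnresponsive accepts the new timestamped format or the legacy []int format.
// Legacy IDs are stamped with `now` (an assumption), so they expire 24h after migration.
func loadUnresponsive(raw []byte, now int64) ([]UnresponsiveProbeEntry, error) {
	var entries []UnresponsiveProbeEntry
	if err := json.Unmarshal(raw, &entries); err == nil {
		return entries, nil
	}
	var legacy []int
	if err := json.Unmarshal(raw, &legacy); err != nil {
		return nil, fmt.Errorf("unrecognized unresponsive-probe state: %w", err)
	}
	migrated := make([]UnresponsiveProbeEntry, 0, len(legacy))
	for _, id := range legacy {
		migrated = append(migrated, UnresponsiveProbeEntry{ProbeID: id, MarkedAt: now})
	}
	return migrated, nil
}

// pruneExpired keeps only entries marked less than 24 hours before `now`.
func pruneExpired(entries []UnresponsiveProbeEntry, now int64) []UnresponsiveProbeEntry {
	kept := make([]UnresponsiveProbeEntry, 0, len(entries))
	for _, e := range entries {
		if now-e.MarkedAt < unresponsiveTTLSeconds {
			kept = append(kept, e)
		}
	}
	return kept
}

func main() {
	now := int64(1_700_000_000)
	entries, err := loadUnresponsive([]byte(`[7466, 1234]`), now)
	if err != nil {
		panic(err)
	}
	fmt.Println(len(entries))                             // prints 2
	fmt.Println(len(pruneExpired(entries, now+25*60*60))) // prints 0
}
```

Calling `pruneExpired` at the start of each management cycle would give a probe a fresh chance once a day, which matters for locations with only one nearby probe.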
…ry yet

On first deploy, all existing source probes have LastResponseAt=0 since the field is new. Don't treat this as "never responded"; wait for at least one export cycle to populate the field before making staleness decisions.
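The zero-value guard could be expressed per probe, as in the sketch below. This is only one reading of the commit message: the `isStale` helper is hypothetical, and the real change may instead gate the whole staleness phase on an export cycle having run.

```go
package main

import "fmt"

// staleAfterSeconds is the 1-hour staleness threshold used elsewhere in this PR.
const staleAfterSeconds int64 = 60 * 60

// isStale treats LastResponseAt == 0 as "no history yet" rather than "never responded".
func isStale(lastResponseAt, now int64) bool {
	if lastResponseAt == 0 {
		return false // field not populated yet; defer the decision
	}
	return now-lastResponseAt > staleAfterSeconds
}

func main() {
	now := int64(1_700_000_000)
	fmt.Println(isStale(0, now))        // prints false
	fmt.Println(isStale(now-7200, now)) // prints true
	fmt.Println(isStale(now-60, now))   // prints false
}
```

A per-probe guard like this has a trade-off: a probe that never produces a single result keeps a zero timestamp and would never be flagged, so a real implementation may need a separate signal for that case.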
59e21e4 to 9c38e8f
Contributor
I think there might be a case we're still not handling. Each probe is involved in producing results in two ways: 1) by pinging other probes and reporting the results in a RIPE Atlas measurement, and 2) by responding to pings from other probes, which report their results in different measurements. In #3173 the probe was broken in a way that prevented it from responding to pings from the "to sjc" measurement (way 2), even though it was still able to produce data when pinging other probes (way 1). Would it make sense to update this PR to treat a probe as unresponsive when only one of the two ways fails? (Although that gets a little iffy when a probe uses the way in question to ping, or respond to pings from, only one probe.)
…s unresponsive

When a target probe stops responding, all source probes in that measurement show stale LastResponseAt (because latency=0 results don't trigger UpdateSourceProbeResponse). Without this check, Phase 2 would incorrectly blacklist healthy source probes. Skip source staleness checks for measurements whose target is already marked unresponsive by Phase 1.
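The Phase 2 guard described here amounts to filtering measurements before the source staleness check. The sketch below assumes a simplified `Measurement` type and a `measurementsToCheck` helper, neither of which is taken from the PR.

```go
package main

import "fmt"

// Measurement is a hypothetical, simplified view of one RIPE Atlas measurement.
type Measurement struct {
	ID             int
	TargetProbeID  int
	SourceProbeIDs []int
}

// measurementsToCheck drops measurements whose target was already marked
// unresponsive in Phase 1, so their sources are not blamed for the target's silence.
func measurementsToCheck(all []Measurement, unresponsiveTargets map[int]bool) []Measurement {
	var out []Measurement
	for _, m := range all {
		if unresponsiveTargets[m.TargetProbeID] {
			continue // stale source timestamps here are caused by the dead target
		}
		out = append(out, m)
	}
	return out
}

func main() {
	all := []Measurement{
		{ID: 100, TargetProbeID: 7, SourceProbeIDs: []int{1, 2}},
		{ID: 101, TargetProbeID: 8, SourceProbeIDs: []int{1, 3}},
	}
	phase1 := map[int]bool{7: true} // target 7 marked unresponsive by Phase 1
	for _, m := range measurementsToCheck(all, phase1) {
		fmt.Println(m.ID) // prints 101
	}
}
```

Only the measurements that survive this filter would have their source probes evaluated against the 1-hour threshold.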
nikw9944 approved these changes Mar 10, 2026
Summary of Changes
Diff Breakdown
~60% tests, ~40% core logic.
Key files
- controlplane/internet-latency-collector/internal/ripeatlas/collector_test.go: new test for unresponsive source probe detection end-to-end (stale probe → marked → measurement recreated with replacement)
- controlplane/internet-latency-collector/internal/ripeatlas/state_test.go: tests for source probe response tracking, probe expiry after 24h, and backwards-compatible loading of legacy []int state format
- controlplane/internet-latency-collector/internal/ripeatlas/state.go: LastResponseAt field on source probes, UnresponsiveProbeEntry struct with MarkedAt timestamp replacing bare []int, PruneExpiredUnresponsiveProbes(), backwards-compatible state loading
- controlplane/internet-latency-collector/internal/ripeatlas/collector.go: source probe staleness detection in configureMeasurements, response time tracking during export, regenerate wanted measurements after marking new probes unresponsive, probe ID comparison in reconciliation
Testing Verification
- TestInternetLatency_RIPEAtlas_ConfigureMeasurements_UnresponsiveSourceProbe validates the full flow: stale source probe detected → marked unresponsive → old measurement stopped → new measurement created with replacement probe from the same location
- TestInternetLatency_RIPEAtlas_State_UpdateSourceProbeResponse covers response timestamp tracking including monotonicity and edge cases
- TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeExpiry verifies probes expire from blacklist after 24 hours
- TestInternetLatency_RIPEAtlas_State_UnresponsiveProbeBackwardsCompat verifies legacy [7466, 1234] state files are migrated to the new timestamped format on load