coordinate: trust negative netcheck per family, filter CGNAT#808
Conversation
Reproduces the server's public-IP advertisement logic (ipdiscovery + netcheck + apiAddresses filter) in a standalone CLI so we can see on the problem machine which addresses would be advertised and why. Flags the two suspected MIR-1018 failure modes in its output: netcheck source IPs proven unreachable but still advertised via the fallback, and suspicious private ranges (tailscale CGNAT, docker/libvirt bridges, IPv6 ULA) that sneak through the private filter.
Factor the advertise-address logic out of coordinate.apiAddresses into a pure ComputeAdvertise function so the production path and the 'debug advertise' command share the same decision tree. Two behavior changes: 1. Per-family netcheck state. When netcheck ran for a family with a valid public source IP but every probed port failed, we now trust that negative result and drop global-unicast discovered IPs in that family. Previously the code treated zero reachable ports as "netcheck didn't run" and kept the unreachable IPs as a fallback, which is the original MIR-1018 complaint. The "not run" and "source rejected" cases still fall back to discovered IPs as before. State is computed independently per address family, so an IPv4 success no longer silently drops discovered IPv6 public addresses (and vice versa). 2. CGNAT filter. Addresses in 100.64.0.0/10 (RFC 6598) are filtered out of the discovered list by default. This range is used by tailscale tailnets and ISP CGNAT, neither of which is reachable from a generic external client. Users who want a CGNAT address advertised can still set it explicitly via AdditionalIPs.
The apiAddresses list feeds StatusReport.APIAddresses, which miren.cloud hands out to remote clients. Neither 0.0.0.0 (wildcard listen) nor 127.0.0.1 / ::1 (loopback) is reachable from a remote client, so they were producing needless failed connection attempts. ComputeAdvertise now skips unspecified and loopback addresses for both the listen-addr slot and AdditionalIPs (user-curated). Discovered IPs already exclude loopback at the ipdiscovery layer.
Replace the two separate slices (AdditionalIPs, DiscoveredIPs) and the interim []SourcedIP with an IPSet type that provides insertion-ordered, de-duplicated storage of SourcedIP entries. When a duplicate IP is added, the Explicit flag is sticky: adding an IP as explicit promotes a previously-discovered entry, but adding as discovered never demotes an explicit one. This untangles the explicit-vs-discovered distinction cleanly: ComputeAdvertise iterates a single list and branches on the Explicit flag — explicit IPs bypass all filtering (netcheck, CGNAT), discovered IPs are subject to pruning. The IPSet handles dedup at insertion time so downstream code doesn't need to worry about it. CoordinatorConfig now holds *IPSet instead of two []net.IP slices. server.go builds the set by adding discovered IPs first, then explicit ones (which promotes any overlap).
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (4)
📝 WalkthroughWalkthroughThe PR introduces an insertion-ordered IPSet that tracks explicit vs discovered IPs, adds coordinate.ComputeAdvertise to centralize advertisement logic (including netcheck handling, CGNAT/link-local/private rules, and port normalization), updates server startup and CoordinatorConfig to use IPSet for TLS SANs and discovery, adds a Comment |
There was a problem hiding this comment.
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (2)
cli/commands/server.go (1)
295-300:⚠️ Potential issue | 🟠 Major | ⚡ Quick winParse
cfg.TLS.AdditionalIPsbefore callingSetupEtcdTLS().
ipSet.RawIPs()is passed intoSetupEtcdTLS()before the explicit additional IPs are added. That means the etcd server cert SANs omit user-configured addresses, so distributed runners connecting over anAdditionalIPwill fail TLS verification.Also applies to: 519-526
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@cli/commands/server.go` around lines 295 - 300, When labs.DistributedRunners() is true, parse and add cfg.TLS.AdditionalIPs into the ipSet before calling coordinate.SetupEtcdTLS so the etcd server certificate SANs include user-configured IPs; specifically, update the flow around ipSet.RawIPs() and SetupEtcdTLS(...) so cfg.TLS.AdditionalIPs are processed/merged into the ipSet (or converted to the same RawIPs slice) prior to invoking SetupEtcdTLS, and apply the same fix to the other SetupEtcdTLS call (the second occurrence) referenced in the review.components/coordinate/coordinate.go (1)
1281-1309:⚠️ Potential issue | 🟠 Major | ⚡ Quick winKeep
PublicIPs()aligned with the new advertise filters.This path now bypasses the behavior you just centralized in
ComputeAdvertise: it returnsresp.SourceAddresseven when that family had zero reachable ports, and its fallback still admits discovered100.64.0.0/10addresses because it only checksIsGlobalUnicast/IsPrivate. SinceAutocertControllerstill usesc.PublicIPs, unreachable netcheck/CGNAT addresses can leak back in through that path.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@cli/commands/debug_advertise.go`:
- Around line 45-49: The call to ipdiscovery.DiscoverWithTimeout in
debug_advertise.go should not abort on error; replace the current pattern that
does ctx.Warn(...) then return err so the command continues like the server
does. After ipdiscovery.DiscoverWithTimeout (function name) catch the error, log
it with ctx.Warn (include err), but do not return; ensure the rest of the
function that uses the discovery variable (named discovery) can handle a
nil/zero-value discovery (or initialize an empty/zero discovery result) so the
command continues to exercise explicit-IP/netcheck behavior.
- Around line 18-23: The DebugAdvertise command currently prints human-readable
tables; embed the shared FormatOptions into the opts struct of DebugAdvertise,
then after collecting the scan/list items build a command-specific JSON struct
(with snake_case json tags) representing each advertised entry, check
opts.IsJSON() and if true call PrintJSON(items) and return; otherwise keep the
existing table/print flow. Reference the DebugAdvertise function and its opts
struct, use FormatOptions' IsJSON() and the global PrintJSON(items) helper when
returning the JSON output.
---
Outside diff comments:
In `@cli/commands/server.go`:
- Around line 295-300: When labs.DistributedRunners() is true, parse and add
cfg.TLS.AdditionalIPs into the ipSet before calling coordinate.SetupEtcdTLS so
the etcd server certificate SANs include user-configured IPs; specifically,
update the flow around ipSet.RawIPs() and SetupEtcdTLS(...) so
cfg.TLS.AdditionalIPs are processed/merged into the ipSet (or converted to the
same RawIPs slice) prior to invoking SetupEtcdTLS, and apply the same fix to the
other SetupEtcdTLS call (the second occurrence) referenced in the review.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: a657560b-3f2e-416b-a364-7f4eb68f2926
📒 Files selected for processing (6)
cli/commands/commands.gocli/commands/debug_advertise.gocli/commands/server.gocomponents/coordinate/advertise.gocomponents/coordinate/api_addresses_test.gocomponents/coordinate/coordinate.go
| func DebugAdvertise(ctx *Context, opts struct { | ||
| CloudURL string `long:"cloud-url" description:"Cloud URL to use for netcheck (default: https://api.miren.cloud)"` | ||
| SkipNetcheck bool `long:"skip-netcheck" description:"Skip the netcheck call and only report interface scan"` | ||
| AdditionalIPs []string `long:"additional-ip" description:"Simulate a server-configured AdditionalIP (repeatable)"` | ||
| ListenAddr string `long:"listen" description:"Simulate the server's listen address (default: 0.0.0.0:8443)"` | ||
| }) error { |
There was a problem hiding this comment.
🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win
Add --format json to this command.
This prints a table plus a final list, so it should follow the same machine-readable output contract as the other list-style CLI commands.
As per coding guidelines, cli/commands/**/*.go: Commands that output lists or tables should support --format json using the FormatOptions pattern: embed FormatOptions in opts struct, check IsJSON(), define a command-specific JSON struct with snake_case tags, and return PrintJSON(items).
Also applies to: 104-121
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/commands/debug_advertise.go` around lines 18 - 23, The DebugAdvertise
command currently prints human-readable tables; embed the shared FormatOptions
into the opts struct of DebugAdvertise, then after collecting the scan/list
items build a command-specific JSON struct (with snake_case json tags)
representing each advertised entry, check opts.IsJSON() and if true call
PrintJSON(items) and return; otherwise keep the existing table/print flow.
Reference the DebugAdvertise function and its opts struct, use FormatOptions'
IsJSON() and the global PrintJSON(items) helper when returning the JSON output.
| discovery, err := ipdiscovery.DiscoverWithTimeout(15*time.Second, ctx.Log, discoveryOpts) | ||
| if err != nil { | ||
| ctx.Warn("ipdiscovery.Discover failed: %v", err) | ||
| return err | ||
| } |
There was a problem hiding this comment.
Don’t abort on interface-discovery failure here.
server.go just warns and keeps going, but this command returns immediately. That makes debug advertise unable to reproduce the server’s explicit-IP/netcheck behavior in the same failure mode it is supposed to diagnose.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@cli/commands/debug_advertise.go` around lines 45 - 49, The call to
ipdiscovery.DiscoverWithTimeout in debug_advertise.go should not abort on error;
replace the current pattern that does ctx.Warn(...) then return err so the
command continues like the server does. After ipdiscovery.DiscoverWithTimeout
(function name) catch the error, log it with ctx.Warn (include err), but do not
return; ensure the rest of the function that uses the discovery variable (named
discovery) can handle a nil/zero-value discovery (or initialize an empty/zero
discovery result) so the command continues to exercise explicit-IP/netcheck
behavior.
…owups coordinate: address CodeRabbit feedback from #808
Opening this on Evan's behalf since the branch has been sitting since April 15. Original commits are his (
dd03230d"Fix MIR-1018" and8eb63ab"Add 'debug advertise' command"); I'm just turning the crank after rediscovering the same bug as MIR-1139.Two problems addressed.
The advertise-address logic in
coordinate.apiAddresseswas treating "netcheck ran but found zero reachable" as identical to "netcheck didn't run at all" and falling back to discovered IPs in both cases. That fallback would push the WAN IP of a firewalled cluster (or a tailnet address fromtailscale0) into the status report, and cloud would happily write A records to those unreachable IPs instead of routing through POP. Per-family state fixes this: an IPv4 success no longer silently drops discovered IPv6 public addresses, and a per-family "ran and found nothing" verdict is now trusted rather than papered over.Addresses in 100.64.0.0/10 (RFC 6598) are also filtered out of the discovered list by default. This range is used by Tailscale and by ISP CGNAT, neither of which is reachable from a generic external client. Users who genuinely want a CGNAT address advertised can set it explicitly via
AdditionalIPs, which passes through unfiltered.The logic itself was factored into a pure
ComputeAdvertisefunction so the production path and thedebug advertisecommand share the same decision tree. That makes the debug command actually useful for diagnosing "why is my cluster advertising X" without needing to start a full coordinator.Closes MIR-1018
Closes MIR-1139 (duplicate)