Skip to content

coordinate: trust negative netcheck per family, filter CGNAT#808

Merged
phinze merged 6 commits into
mainfrom
mir-1018-unusable-addresses-still-present-in-discovery
May 14, 2026
Merged

coordinate: trust negative netcheck per family, filter CGNAT#808
phinze merged 6 commits into
mainfrom
mir-1018-unusable-addresses-still-present-in-discovery

Conversation

@phinze
Copy link
Copy Markdown
Contributor

@phinze phinze commented May 14, 2026

Opening this on Evan's behalf since the branch has been sitting since April 15. Original commits are his (dd03230d "Fix MIR-1018" and 8eb63ab "Add 'debug advertise' command"); I'm just turning the crank after rediscovering the same bug as MIR-1139.

Two problems addressed.

The advertise-address logic in coordinate.apiAddresses was treating "netcheck ran but found zero reachable" as identical to "netcheck didn't run at all" and falling back to discovered IPs in both cases. That fallback would push the WAN IP of a firewalled cluster (or a tailnet address from tailscale0) into the status report, and cloud would happily write A records to those unreachable IPs instead of routing through POP. Per-family state fixes this: an IPv4 success no longer silently drops discovered IPv6 public addresses, and a per-family "ran and found nothing" verdict is now trusted rather than papered over.

Addresses in 100.64.0.0/10 (RFC 6598) are also filtered out of the discovered list by default. This range is used by Tailscale and by ISP CGNAT, neither of which is reachable from a generic external client. Users who genuinely want a CGNAT address advertised can set it explicitly via AdditionalIPs, which passes through unfiltered.

The logic itself was factored into a pure ComputeAdvertise function so the production path and the debug advertise command share the same decision tree. That makes the debug command actually useful for diagnosing "why is my cluster advertising X" without needing to start a full coordinator.

Closes MIR-1018
Closes MIR-1139 (duplicate)

evanphx added 4 commits April 15, 2026 16:03
Reproduces the server's public-IP advertisement logic (ipdiscovery +
netcheck + apiAddresses filter) in a standalone CLI so we can see on
the problem machine which addresses would be advertised and why.

Flags the two suspected MIR-1018 failure modes in its output: netcheck
source IPs proven unreachable but still advertised via the fallback,
and suspicious private ranges (tailscale CGNAT, docker/libvirt bridges,
IPv6 ULA) that sneak through the private filter.
Factor the advertise-address logic out of coordinate.apiAddresses into a
pure ComputeAdvertise function so the production path and the 'debug
advertise' command share the same decision tree.

Two behavior changes:

1. Per-family netcheck state. When netcheck ran for a family with a
   valid public source IP but every probed port failed, we now trust
   that negative result and drop global-unicast discovered IPs in that
   family. Previously the code treated zero reachable ports as "netcheck
   didn't run" and kept the unreachable IPs as a fallback, which is the
   original MIR-1018 complaint. The "not run" and "source rejected"
   cases still fall back to discovered IPs as before. State is computed
   independently per address family, so an IPv4 success no longer
   silently drops discovered IPv6 public addresses (and vice versa).

2. CGNAT filter. Addresses in 100.64.0.0/10 (RFC 6598) are filtered
   out of the discovered list by default. This range is used by
   tailscale tailnets and ISP CGNAT, neither of which is reachable from
   a generic external client. Users who want a CGNAT address advertised
   can still set it explicitly via AdditionalIPs.
The apiAddresses list feeds StatusReport.APIAddresses, which miren.cloud
hands out to remote clients. Neither 0.0.0.0 (wildcard listen) nor
127.0.0.1 / ::1 (loopback) is reachable from a remote client, so they
were producing needless failed connection attempts.

ComputeAdvertise now skips unspecified and loopback addresses for both
the listen-addr slot and AdditionalIPs (user-curated). Discovered IPs
already exclude loopback at the ipdiscovery layer.
Replace the two separate slices (AdditionalIPs, DiscoveredIPs) and the
interim []SourcedIP with an IPSet type that provides insertion-ordered,
de-duplicated storage of SourcedIP entries. When a duplicate IP is
added, the Explicit flag is sticky: adding an IP as explicit promotes a
previously-discovered entry, but adding as discovered never demotes an
explicit one.

This untangles the explicit-vs-discovered distinction cleanly:
ComputeAdvertise iterates a single list and branches on the Explicit
flag — explicit IPs bypass all filtering (netcheck, CGNAT), discovered
IPs are subject to pruning. The IPSet handles dedup at insertion time
so downstream code doesn't need to worry about it.

CoordinatorConfig now holds *IPSet instead of two []net.IP slices.
server.go builds the set by adding discovered IPs first, then explicit
ones (which promotes any overlap).
@phinze phinze requested a review from a team as a code owner May 14, 2026 19:25
@coderabbitai
Copy link
Copy Markdown

coderabbitai Bot commented May 14, 2026

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: f719fb3b-4235-4276-b3f5-2a92aa4e81f8

📥 Commits

Reviewing files that changed from the base of the PR and between 82b5965 and c804992.

📒 Files selected for processing (4)
  • docs/command-sidebar.json
  • docs/docs/command/debug-advertise.md
  • docs/docs/command/debug.md
  • docs/docs/commands.md
✅ Files skipped from review due to trivial changes (4)
  • docs/docs/command/debug-advertise.md
  • docs/docs/commands.md
  • docs/command-sidebar.json
  • docs/docs/command/debug.md

📝 Walkthrough

Walkthrough

The PR introduces an insertion-ordered IPSet that tracks explicit vs discovered IPs, adds coordinate.ComputeAdvertise to centralize advertisement logic (including netcheck handling, CGNAT/link-local/private rules, and port normalization), updates server startup and CoordinatorConfig to use IPSet for TLS SANs and discovery, adds a miren debug advertise CLI command, and updates tests and docs to exercise and document the new behavior.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@evanphx evanphx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OMG thank you!

Copy link
Copy Markdown

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
cli/commands/server.go (1)

295-300: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Parse cfg.TLS.AdditionalIPs before calling SetupEtcdTLS().

ipSet.RawIPs() is passed into SetupEtcdTLS() before the explicit additional IPs are added. That means the etcd server cert SANs omit user-configured addresses, so distributed runners connecting over an AdditionalIP will fail TLS verification.

Also applies to: 519-526

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/commands/server.go` around lines 295 - 300, When
labs.DistributedRunners() is true, parse and add cfg.TLS.AdditionalIPs into the
ipSet before calling coordinate.SetupEtcdTLS so the etcd server certificate SANs
include user-configured IPs; specifically, update the flow around ipSet.RawIPs()
and SetupEtcdTLS(...) so cfg.TLS.AdditionalIPs are processed/merged into the
ipSet (or converted to the same RawIPs slice) prior to invoking SetupEtcdTLS,
and apply the same fix to the other SetupEtcdTLS call (the second occurrence)
referenced in the review.
components/coordinate/coordinate.go (1)

1281-1309: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Keep PublicIPs() aligned with the new advertise filters.

This path now bypasses the behavior you just centralized in ComputeAdvertise: it returns resp.SourceAddress even when that family had zero reachable ports, and its fallback still admits discovered 100.64.0.0/10 addresses because it only checks IsGlobalUnicast/IsPrivate. Since AutocertController still uses c.PublicIPs, unreachable netcheck/CGNAT addresses can leak back in through that path.

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@cli/commands/debug_advertise.go`:
- Around line 45-49: The call to ipdiscovery.DiscoverWithTimeout in
debug_advertise.go should not abort on error; replace the current pattern that
does ctx.Warn(...) then return err so the command continues like the server
does. After ipdiscovery.DiscoverWithTimeout (function name) catch the error, log
it with ctx.Warn (include err), but do not return; ensure the rest of the
function that uses the discovery variable (named discovery) can handle a
nil/zero-value discovery (or initialize an empty/zero discovery result) so the
command continues to exercise explicit-IP/netcheck behavior.
- Around line 18-23: The DebugAdvertise command currently prints human-readable
tables; embed the shared FormatOptions into the opts struct of DebugAdvertise,
then after collecting the scan/list items build a command-specific JSON struct
(with snake_case json tags) representing each advertised entry, check
opts.IsJSON() and if true call PrintJSON(items) and return; otherwise keep the
existing table/print flow. Reference the DebugAdvertise function and its opts
struct, use FormatOptions' IsJSON() and the global PrintJSON(items) helper when
returning the JSON output.

---

Outside diff comments:
In `@cli/commands/server.go`:
- Around line 295-300: When labs.DistributedRunners() is true, parse and add
cfg.TLS.AdditionalIPs into the ipSet before calling coordinate.SetupEtcdTLS so
the etcd server certificate SANs include user-configured IPs; specifically,
update the flow around ipSet.RawIPs() and SetupEtcdTLS(...) so
cfg.TLS.AdditionalIPs are processed/merged into the ipSet (or converted to the
same RawIPs slice) prior to invoking SetupEtcdTLS, and apply the same fix to the
other SetupEtcdTLS call (the second occurrence) referenced in the review.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a657560b-3f2e-416b-a364-7f4eb68f2926

📥 Commits

Reviewing files that changed from the base of the PR and between e99e71e and cb87157.

📒 Files selected for processing (6)
  • cli/commands/commands.go
  • cli/commands/debug_advertise.go
  • cli/commands/server.go
  • components/coordinate/advertise.go
  • components/coordinate/api_addresses_test.go
  • components/coordinate/coordinate.go

Comment on lines +18 to +23
func DebugAdvertise(ctx *Context, opts struct {
CloudURL string `long:"cloud-url" description:"Cloud URL to use for netcheck (default: https://api.miren.cloud)"`
SkipNetcheck bool `long:"skip-netcheck" description:"Skip the netcheck call and only report interface scan"`
AdditionalIPs []string `long:"additional-ip" description:"Simulate a server-configured AdditionalIP (repeatable)"`
ListenAddr string `long:"listen" description:"Simulate the server's listen address (default: 0.0.0.0:8443)"`
}) error {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major | ⚡ Quick win

Add --format json to this command.

This prints a table plus a final list, so it should follow the same machine-readable output contract as the other list-style CLI commands.

As per coding guidelines, cli/commands/**/*.go: Commands that output lists or tables should support --format json using the FormatOptions pattern: embed FormatOptions in opts struct, check IsJSON(), define a command-specific JSON struct with snake_case tags, and return PrintJSON(items).

Also applies to: 104-121

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/commands/debug_advertise.go` around lines 18 - 23, The DebugAdvertise
command currently prints human-readable tables; embed the shared FormatOptions
into the opts struct of DebugAdvertise, then after collecting the scan/list
items build a command-specific JSON struct (with snake_case json tags)
representing each advertised entry, check opts.IsJSON() and if true call
PrintJSON(items) and return; otherwise keep the existing table/print flow.
Reference the DebugAdvertise function and its opts struct, use FormatOptions'
IsJSON() and the global PrintJSON(items) helper when returning the JSON output.

Comment on lines +45 to +49
discovery, err := ipdiscovery.DiscoverWithTimeout(15*time.Second, ctx.Log, discoveryOpts)
if err != nil {
ctx.Warn("ipdiscovery.Discover failed: %v", err)
return err
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Don’t abort on interface-discovery failure here.

server.go just warns and keeps going, but this command returns immediately. That makes debug advertise unable to reproduce the server’s explicit-IP/netcheck behavior in the same failure mode it is supposed to diagnose.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@cli/commands/debug_advertise.go` around lines 45 - 49, The call to
ipdiscovery.DiscoverWithTimeout in debug_advertise.go should not abort on error;
replace the current pattern that does ctx.Warn(...) then return err so the
command continues like the server does. After ipdiscovery.DiscoverWithTimeout
(function name) catch the error, log it with ctx.Warn (include err), but do not
return; ensure the rest of the function that uses the discovery variable (named
discovery) can handle a nil/zero-value discovery (or initialize an empty/zero
discovery result) so the command continues to exercise explicit-IP/netcheck
behavior.

@phinze phinze merged commit 6d4aea6 into main May 14, 2026
19 checks passed
@phinze phinze deleted the mir-1018-unusable-addresses-still-present-in-discovery branch May 14, 2026 20:13
phinze added a commit that referenced this pull request May 14, 2026
…owups

coordinate: address CodeRabbit feedback from #808
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants