Conversation
Add pkg/globalrouter package that connects clusters to the cloud coordination service and POP servers for NAT traversal:

- WebSocket client with JWT auth and exponential backoff reconnection
- Two-connection POP protocol (control plane + data plane over QUIC)
- HTTP/3 request forwarding from POPs to local apps via httpingress
- WebSocket tunnel handler (/_pop/ws) with tunnel frame encoding that bridges POP WebSocket tunnels to local app WebSocket connections
- Message handlers for connection.request, time sync, and org info
- AcquireTunnel method on httpingress for direct sandbox URL resolution

Gated behind the globalrouter labs flag and cloud auth being enabled.
MIR-825) Introduce netcheck to determine public reachability via the cloud endpoint, cache results, and use them to build accurate API address lists for status reports. Hardcode port 8443 instead of parsing c.Address, and skip c.Address from the address list when it lacks a valid IP host. Use net.JoinHostPort for proper IPv6 formatting.
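The `net.JoinHostPort` and invalid-host points above can be sketched in isolation. This is a standalone illustration, not the coordinator's actual code; `buildAddr` and `hasIPHost` are hypothetical helper names:

```go
package main

import (
	"fmt"
	"net"
)

// buildAddr joins host and port for an API address list entry.
// Naive "host:port" concatenation breaks for IPv6 literals;
// net.JoinHostPort adds the required brackets.
func buildAddr(host string, port int) string {
	return net.JoinHostPort(host, fmt.Sprintf("%d", port))
}

// hasIPHost reports whether an address has a valid IP host,
// mirroring the skip logic described above (":8443" fails).
func hasIPHost(addr string) bool {
	host, _, err := net.SplitHostPort(addr)
	return err == nil && net.ParseIP(host) != nil
}

func main() {
	fmt.Println(buildAddr("203.0.113.7", 8443)) // 203.0.113.7:8443
	fmt.Println(buildAddr("2001:db8::1", 8443)) // [2001:db8::1]:8443
	fmt.Println(hasIPHost(":8443"))             // false
	fmt.Println(hasIPHost("192.0.2.10:8443"))   // true
}
```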
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID:
📒 Files selected for processing (3)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Adds netcheck support: cloudauth.Netcheck client, tests, and Coordinator state/logic to query, cache, and expose public reachability and to derive API listen addresses from netcheck results. Introduces a globalrouter package with a WebSocket client, message router, GlobalRouter, POPManager, protocol framing, and tests; Coordinator optionally starts a background GlobalRouter. Extends httpingress with TunnelConn and Server.AcquireTunnel. Updates go.mod and CI manifests (hack/test-groups.json, hack/test-times.json). No exported public method signatures were removed; several new exported types and methods were added in globalrouter and httpingress.
Actionable comments posted: 5
🧹 Nitpick comments (5)
pkg/globalrouter/pop_test.go (1)
52-53: Assert `HandleConnectionRequest` errors in test setup paths. These calls are part of test setup; ignoring errors can let tests pass with broken preconditions.
💡 Suggested fix
```diff
-	pm.HandleConnectionRequest(t.Context(), req)
+	if err := pm.HandleConnectionRequest(t.Context(), req); err != nil {
+		t.Fatalf("HandleConnectionRequest: %v", err)
+	}
 	pm.RemovePOP("pop-1")
-	pm.HandleConnectionRequest(t.Context(), req)
+	if err := pm.HandleConnectionRequest(t.Context(), req); err != nil {
+		t.Fatalf("HandleConnectionRequest: %v", err)
+	}
```

Also applies to: 74-75
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/pop_test.go` around lines 52 - 53, The test setup currently ignores errors returned by pm.HandleConnectionRequest(...) and pm.RemovePOP(...); update the test to capture and assert these errors (e.g., err := pm.HandleConnectionRequest(...); require.NoError(t, err) or t.Fatalf on error) so setup failures fail the test immediately; do the same for pm.RemovePOP(...) and for the other occurrences at the later setup lines (the calls on lines 74-75).

pkg/globalrouter/client.go (2)
79-101: Backoff should reset after a successful connection. The backoff increases on each disconnection but never resets when a connection is successfully established and maintained. After a transient failure followed by recovery, subsequent disconnections will start with an unnecessarily high backoff.
Proposed fix to reset backoff on successful connection
```diff
 func (c *Client) Run(ctx context.Context) error {
 	backoff := initialBackoff
 	for {
 		err := c.runOnce(ctx)
 		if ctx.Err() != nil {
 			return ctx.Err()
 		}
+		// Reset backoff on successful connection (err != connect error)
+		// A connection that lasted indicates recovery
+		backoff = initialBackoff
+
 		c.log.Warn("websocket disconnected, reconnecting", "error", err, "backoff", backoff)
 		select {
 		case <-ctx.Done():
 			return ctx.Err()
 		case <-time.After(backoff):
 		}
 		backoff = min(backoff*2, maxBackoff)
 	}
 }
```

Alternatively, reset backoff only if the connection was healthy for a minimum duration (e.g., 30 seconds).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/client.go` around lines 79 - 101, The Run loop currently grows backoff on every disconnect but never resets it; modify Client.Run so that after a successful sustained connection (i.e., runOnce returns nil and the ctx hasn't been cancelled) you reset backoff = initialBackoff (or reset only if the connection lasted at least a threshold, e.g., 30s) before continuing; locate the backoff, initialBackoff, maxBackoff and runOnce references in Client.Run and update the logic to track the last successful connect time (or check runOnce's success) and reset backoff accordingly.
172-187: Consider reducing log verbosity for production. Logging at `Info` level for every received and dispatched message (lines 179, 184) will be very verbose under load. Consider using `Debug` level for per-message logging.

Proposed change
```diff
-	c.log.Info("received message from cloud", "type", env.Type, "data_len", len(env.Data))
+	c.log.Debug("received message from cloud", "type", env.Type, "data_len", len(env.Data))
 	if err := c.router.Dispatch(ctx, env); err != nil {
 		c.log.Warn("dispatch error", "type", env.Type, "error", err)
 	} else {
-		c.log.Info("message dispatched successfully", "type", env.Type)
+		c.log.Debug("message dispatched successfully", "type", env.Type)
 	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/client.go` around lines 172 - 187, The per-message Info logs in Client.readLoop (the c.log.Info calls logging received messages and successful dispatches for Envelope handling) are too verbose for production; change those per-message logs to use c.log.Debug instead of c.log.Info (leave the dispatch error warning via c.log.Warn as-is) so normal Info-level production logging is not flooded—update the two c.log.Info calls in readLoop to c.log.Debug and ensure the change applies to the Envelope receive and successful dispatch log paths.

pkg/globalrouter/globalrouter.go (1)
59-65: Errors from `SendMessage` are silently ignored. The `OnConnect` callback calls `SendMessage` twice but doesn't handle potential marshaling errors. While unlikely to fail for these simple structs, logging the error would aid debugging.

Proposed fix
```diff
 client.OnConnect(func(ctx context.Context) {
 	gr.log.Info("sending initial requests")
-	client.SendMessage(TypeTimeRequest, TimeRequest{
+	if err := client.SendMessage(TypeTimeRequest, TimeRequest{
 		ClientTransmitTime: time.Now().UTC(),
-	})
-	client.SendMessage(TypeOrgInfoRequest, struct{}{})
+	}); err != nil {
+		gr.log.Warn("failed to send time request", "error", err)
+	}
+	if err := client.SendMessage(TypeOrgInfoRequest, struct{}{}); err != nil {
+		gr.log.Warn("failed to send org info request", "error", err)
+	}
 })
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/globalrouter.go` around lines 59 - 65, The OnConnect callback on client currently calls SendMessage(TypeTimeRequest, TimeRequest{...}) and SendMessage(TypeOrgInfoRequest, struct{}{}) but ignores returned errors; update the client.OnConnect handler to capture the error returned by SendMessage for both calls and log any non-nil error via gr.log (or return/handle it appropriately). Specifically modify the anonymous func passed to client.OnConnect to check the error from client.SendMessage when sending TimeRequest and when sending the org info request, and call gr.log.Error or gr.log.Warn with contextual messages referencing TypeTimeRequest/TimeRequest and TypeOrgInfoRequest so marshaling/transport errors are visible.

pkg/globalrouter/pop.go (1)
432-472: Ping error is silently ignored. At line 466, the result of `localConn.Ping(ctx)` is discarded. While ping failures may not be critical (the connection will fail on the next read/write anyway), logging at debug level could help diagnose connectivity issues.

Proposed change
```diff
 if msgType == tunnelFrameKeepalive {
-	localConn.Ping(ctx)
+	if err := localConn.Ping(ctx); err != nil {
+		// Connection may be closing; continue to let next read fail naturally
+		return 0, buf, fmt.Errorf("keepalive ping failed: %w", err)
+	}
 	continue
 }
```

Alternatively, just continue silently if ping failures should not interrupt the tunnel.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/pop.go` around lines 432 - 472, In readTunnelFrame, the error returned by localConn.Ping(ctx) is currently discarded; capture the returned error, and log it at debug/trace level (using the repository's existing logger, e.g., processLogger or similar) with context like "keepalive ping failed" but do not return the error — continue the loop so ping failures don't interrupt processing; update the localConn.Ping(ctx) call site to check err and log the message including the err value.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@components/coordinate/coordinate.go`:
- Around line 1135-1145: The loop over result.Results can append duplicate
"host:port" strings to the addrs slice when multiple reachable protocols share
the same port; change the logic in that block (the loop referencing
result.Results, r.Reachable, result.SourceAddress, r.Port and the addrs slice
where net.JoinHostPort is used) to deduplicate before appending — e.g., maintain
a local map[string]struct{} seen and only append the joined host:port if it is
not already in seen, then mark it seen.
- Around line 1087-1090: When netcheck fails or the source is invalid, clear the
stale netcheck cache and ensure gating logic uses the validated result: inside
the netcheck handling blocks that set c.netcheckCheckedAt (references:
c.netcheckMu, c.netcheckCheckedAt, netcheckResult, hasNetcheck and
AdditionalIPs), set c.netcheckMu.Lock(), zero or nil out c.netcheckResult (and
any related validity flag) before updating the timestamp and unlocking; ensure
any downstream gating that calls hasNetcheck or filters AdditionalIPs only
treats the result as present if netcheckResult is valid (apply same change at
the other occurrences mentioned around the blocks at the other occurrences
referenced in the comment).
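The deduplication the first comment asks for can be sketched in isolation. This is a simplified stand-in: the real loop iterates `result.Results` with per-protocol reachability, but the seen-map pattern is the same:

```go
package main

import (
	"fmt"
	"net"
	"strconv"
)

// publicAddresses joins each reachable (host, port) pair and
// deduplicates, since multiple protocols can report the same port.
func publicAddresses(host string, ports []int) []string {
	seen := make(map[string]struct{})
	var addrs []string
	for _, p := range ports {
		addr := net.JoinHostPort(host, strconv.Itoa(p))
		if _, ok := seen[addr]; ok {
			continue // already recorded via another protocol
		}
		seen[addr] = struct{}{}
		addrs = append(addrs, addr)
	}
	return addrs
}

func main() {
	fmt.Println(publicAddresses("198.51.100.2", []int{8443, 8443, 9443}))
	// [198.51.100.2:8443 198.51.100.2:9443]
}
```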
In `@pkg/globalrouter/pop.go`:
- Around line 186-191: The data plane TLS config (variable dataTLS with
NextProtos {popDataALPN} and ServerName host) is using InsecureSkipVerify and
missing a MinVersion; replace InsecureSkipVerify: true with proper certificate
verification (load/assign a RootCAs pool or use
tls.Config.VerifyPeerCertificate) and set MinVersion: tls.VersionTLS13 to
enforce TLS1.3; update the dataTLS construction so it mirrors the secure
control-plane setup (use the same CA loading/verification logic rather than
skipping verification) to eliminate InsecureSkipVerify while preserving
NextProtos and ServerName.
- Around line 135-140: Add a short explanatory comment above the controlTLS
tls.Config that documents the bootstrap security model (token-based ConnectToken
+ Trust-on-First-Use during initial bootstrap, analogous to the runner join
flow) so readers understand why InsecureSkipVerify is used there; then set
MinVersion: tls.VersionTLS13 in the controlTLS configuration (alongside
NextProtos/http3.NextProtoH3, ServerName, InsecureSkipVerify) to explicitly
enforce TLS 1.3 for defense-in-depth given QUIC/RFC9001.
In `@servers/httpingress/httpingress.go`:
- Around line 1169-1171: The lease acquisition uses context.Background(),
detaching it from the caller; change the timeout context to derive from the
request/caller context so cancellation propagates: replace
context.WithTimeout(context.Background(), leaseAcquisitionTimeout) with
context.WithTimeout(ctx, leaseAcquisitionTimeout) (keeping actContext and
actCancel) and ensure this actContext is passed into AcquireTunnel (and any
downstream calls) so that if the caller cancels the request the lease
acquisition will abort and not retain a lease.
---
Nitpick comments:
In `@pkg/globalrouter/client.go`:
- Around line 79-101: The Run loop currently grows backoff on every disconnect
but never resets it; modify Client.Run so that after a successful sustained
connection (i.e., runOnce returns nil and the ctx hasn't been cancelled) you
reset backoff = initialBackoff (or reset only if the connection lasted at least
a threshold, e.g., 30s) before continuing; locate the backoff, initialBackoff,
maxBackoff and runOnce references in Client.Run and update the logic to track
the last successful connect time (or check runOnce's success) and reset backoff
accordingly.
- Around line 172-187: The per-message Info logs in Client.readLoop (the
c.log.Info calls logging received messages and successful dispatches for
Envelope handling) are too verbose for production; change those per-message logs
to use c.log.Debug instead of c.log.Info (leave the dispatch error warning via
c.log.Warn as-is) so normal Info-level production logging is not flooded—update
the two c.log.Info calls in readLoop to c.log.Debug and ensure the change
applies to the Envelope receive and successful dispatch log paths.
In `@pkg/globalrouter/globalrouter.go`:
- Around line 59-65: The OnConnect callback on client currently calls
SendMessage(TypeTimeRequest, TimeRequest{...}) and
SendMessage(TypeOrgInfoRequest, struct{}{}) but ignores returned errors; update
the client.OnConnect handler to capture the error returned by SendMessage for
both calls and log any non-nil error via gr.log (or return/handle it
appropriately). Specifically modify the anonymous func passed to
client.OnConnect to check the error from client.SendMessage when sending
TimeRequest and when sending the org info request, and call gr.log.Error or
gr.log.Warn with contextual messages referencing TypeTimeRequest/TimeRequest and
TypeOrgInfoRequest so marshaling/transport errors are visible.
In `@pkg/globalrouter/pop_test.go`:
- Around line 52-53: The test setup currently ignores errors returned by
pm.HandleConnectionRequest(...) and pm.RemovePOP(...); update the test to
capture and assert these errors (e.g., err := pm.HandleConnectionRequest(...);
require.NoError(t, err) or t.Fatalf on error) so setup failures fail the test
immediately; do the same for pm.RemovePOP(...) and for the other occurrences at
the later setup lines (the calls on lines 74-75).
In `@pkg/globalrouter/pop.go`:
- Around line 432-472: In readTunnelFrame, the error returned by
localConn.Ping(ctx) is currently discarded; capture the returned error, and log
it at debug/trace level (using the repository's existing logger, e.g.,
processLogger or similar) with context like "keepalive ping failed" but do not
return the error — continue the loop so ping failures don't interrupt
processing; update the localConn.Ping(ctx) call site to check err and log the
message including the err value.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 85b0bf6b-8efe-4b86-b4ff-5d05edfbe3f6
⛔ Files ignored due to path filters (1)
`go.sum` is excluded by `!**/*.sum`
📒 Files selected for processing (13)
components/coordinate/coordinate.go
go.mod
hack/test-groups.json
hack/test-times.json
pkg/cloudauth/netcheck.go
pkg/cloudauth/netcheck_test.go
pkg/globalrouter/client.go
pkg/globalrouter/client_test.go
pkg/globalrouter/envelope.go
pkg/globalrouter/globalrouter.go
pkg/globalrouter/pop.go
pkg/globalrouter/pop_test.go
servers/httpingress/httpingress.go
…tion

- Clear netcheckResult on failure/invalid source so stale data doesn't persist
- Gate AdditionalIPs filtering on netcheckResult != nil, not just checkedAt
- Deduplicate publicAddresses when multiple protocols share the same port
- Only skip TLS verification for POP connections to non-public IPs
The tunnel path should abort lease acquisition when the caller cancels, unlike the HTTP request paths which intentionally outlive the client.
🧹 Nitpick comments (2)
pkg/globalrouter/pop.go (2)
315-316: Consider handling `https://` → `wss://` for robustness. The URL scheme replacement only handles `http://` → `ws://`. If sandbox URLs ever use HTTPS, this would produce an invalid URL.

Proposed improvement
```diff
-	sandboxWSURL := strings.Replace(tunnel.URL, "http://", "ws://", 1) + wsPath
+	sandboxWSURL := tunnel.URL + wsPath
+	if strings.HasPrefix(sandboxWSURL, "http://") {
+		sandboxWSURL = "ws://" + strings.TrimPrefix(sandboxWSURL, "http://")
+	} else if strings.HasPrefix(sandboxWSURL, "https://") {
+		sandboxWSURL = "wss://" + strings.TrimPrefix(sandboxWSURL, "https://")
+	}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/pop.go` around lines 315 - 316, Replace the naive string replacement that builds sandboxWSURL from tunnel.URL with scheme-aware logic: parse tunnel.URL using net/url, inspect u.Scheme and set it to "ws" if "http" or "wss" if "https" (or return an error for unknown schemes), then reconstruct the URL and append wsPath; update the code that references sandboxWSURL (the variable created from tunnel.URL and wsPath) so it uses this parsed-and-rebuilt URL rather than strings.Replace.
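The scheme-aware rewrite the prompt describes might look like this. `toWebSocketURL` is a hypothetical helper; `tunnel.URL` and `wsPath` are the names used in the review:

```go
package main

import (
	"fmt"
	"net/url"
)

// toWebSocketURL rebuilds a sandbox URL with a ws/wss scheme instead
// of string-replacing "http://", so https URLs convert correctly and
// unknown schemes are rejected.
func toWebSocketURL(raw, wsPath string) (string, error) {
	u, err := url.Parse(raw)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "http":
		u.Scheme = "ws"
	case "https":
		u.Scheme = "wss"
	default:
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
	return u.String() + wsPath, nil
}

func main() {
	for _, raw := range []string{"http://10.0.0.5:8080", "https://sandbox.internal"} {
		ws, err := toWebSocketURL(raw, "/_pop/ws")
		fmt.Println(ws, err)
	}
	// ws://10.0.0.5:8080/_pop/ws <nil>
	// wss://sandbox.internal/_pop/ws <nil>
}
```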
470-472: Consider logging `Ping` errors for observability. The error returned by `localConn.Ping(ctx)` is silently discarded. While keepalives are best-effort, logging at debug level could help diagnose connectivity issues.

Proposed improvement
```diff
 if msgType == tunnelFrameKeepalive {
-	localConn.Ping(ctx)
+	if err := localConn.Ping(ctx); err != nil {
+		// Debug-level: keepalive pings are best-effort
+		return 0, buf, fmt.Errorf("keepalive ping failed: %w", err)
+	}
 	continue
 }
```

Alternatively, just log at debug level and continue rather than returning an error.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/globalrouter/pop.go` around lines 470 - 472, The keepalive branch discards the error from localConn.Ping(ctx); capture the returned error and log it at debug level (do not change behavior—still continue on error) so connectivity issues are observable; update the block handling msgType == tunnelFrameKeepalive to call err := localConn.Ping(ctx) and, if err != nil, emit a debug log using the existing logger (e.g., r.logger, logger, or ctx-aware logger used elsewhere in this file) including context that this came from the keepalive ping before continuing.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@pkg/globalrouter/pop.go`:
- Around line 315-316: Replace the naive string replacement that builds
sandboxWSURL from tunnel.URL with scheme-aware logic: parse tunnel.URL using
net/url, inspect u.Scheme and set it to "ws" if "http" or "wss" if "https" (or
return an error for unknown schemes), then reconstruct the URL and append
wsPath; update the code that references sandboxWSURL (the variable created from
tunnel.URL and wsPath) so it uses this parsed-and-rebuilt URL rather than
strings.Replace.
- Around line 470-472: The keepalive branch discards the error from
localConn.Ping(ctx); capture the returned error and log it at debug level (do
not change behavior—still continue on error) so connectivity issues are
observable; update the block handling msgType == tunnelFrameKeepalive to call
err := localConn.Ping(ctx) and, if err != nil, emit a debug log using the
existing logger (e.g., r.logger, logger, or ctx-aware logger used elsewhere in
this file) including context that this came from the keepalive ping before
continuing.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 7c1c3f62-567a-4871-b0ea-cd377da50bc9
📒 Files selected for processing (3)
components/coordinate/coordinate.go
go.mod
pkg/globalrouter/pop.go
🧹 Nitpick comments (1)
servers/httpingress/httpingress.go (1)
1153-1166: Consider adding a method for callers to invalidate stale leases. When `AcquireTunnel` returns a cached lease that points to a dead sandbox, the caller will get a connection error. However, calling `Release()` only decrements the use count—it doesn't remove the lease from cache. Subsequent `AcquireTunnel` calls may return the same stale lease until `watchInvalidations` or `expireLeases` cleans it up.

Compare with `DoRequest` (lines 1002-1007) and `serveAuthenticatedRequest` (lines 605-613), which call `invalidateAppLeases` on connection errors.

An optional improvement would be to add an `Invalidate()` method on `TunnelConn` that callers can use to signal a bad lease:

💡 Optional enhancement
```go
// Invalidate marks the lease as stale and removes it from cache.
// Call this when a connection error indicates the sandbox is dead.
func (tc *TunnelConn) Invalidate() {
	if tc.lease != nil {
		tc.server.invalidateLease(context.Background(), tc.AppID, tc.lease)
		tc.lease = nil
	}
}
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@servers/httpingress/httpingress.go` around lines 1153 - 1166, Add an Invalidate method on TunnelConn so callers can remove a stale cached lease when a connection error occurs: implement func (tc *TunnelConn) Invalidate() that, if tc.lease != nil, calls tc.server.invalidateLease(context.Background(), tc.AppID, tc.lease) and then sets tc.lease = nil; reference existing symbols AcquireTunnel, TunnelConn, Release, invalidateAppLeases (used by DoRequest and serveAuthenticatedRequest) to mirror their behavior and ensure invalidation removes the lease from the cache rather than just decrementing use count.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@servers/httpingress/httpingress.go`:
- Around line 1153-1166: Add an Invalidate method on TunnelConn so callers can
remove a stale cached lease when a connection error occurs: implement func (tc
*TunnelConn) Invalidate() that, if tc.lease != nil, calls
tc.server.invalidateLease(context.Background(), tc.AppID, tc.lease) and then
sets tc.lease = nil; reference existing symbols AcquireTunnel, TunnelConn,
Release, invalidateAppLeases (used by DoRequest and serveAuthenticatedRequest)
to mirror their behavior and ensure invalidation removes the lease from the
cache rather than just decrementing use count.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: ca6b7c8d-9d3e-4dcf-896d-f8d53f40378c
📒 Files selected for processing (1)
servers/httpingress/httpingress.go
phinze
left a comment
Claude and Paul here. Boy are we impressed. Well, Paul was impressed. Claude is a robot. But if Claude could be impressed, this would do it.
We went through the full architecture chunk by chunk, cross-referenced against cloud PR mirendev/cloud#110 and RFD-52 to make sure both sides of the wire protocol agree, and even went down a rabbit hole on QUIC NAT traversal standards to sanity check our approach (tl;dr: we're building exactly the right thing for serving web traffic through edge proxies).
Paul was kind of gushing during the review and Claude didn't stop him. The layering is really clean and the details are 🤌 — sharing the UDP socket across both QUIC connections so source IP matches for phase 2 auth, the http3.Transport.Dial override to reuse an already-dialed connection, keepalive-to-ping translation in the tunnel bridge. And sneaking simplified NTP into the initial WebSocket handshake! That's the kind of "oh of course that should be there" foresight that you don't appreciate until you're debugging clock drift across distributed systems at 2am. Mad props.
We filed a few follow-up tickets for things that came up during review: coordinator decomposition, e2e test suite for the full runtime+cloud+POP path, and a reminder to converge other runtime↔cloud communication onto the WebSocket channel once this graduates from labs (we think it's destined to become the connection between runtime and cloud).
A few small things inline:
- Reset backoff after connection was up for 30s+ before dropping
- Block full /.well-known/miren/ prefix for tunnel requests
- Use net.SplitHostPort in normalizeAddr for correct IPv6 handling
- Remove per-message and per-request logging from message and forwarding loops
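The `net.SplitHostPort` item above can be sketched as follows. The signature is hypothetical and the real `normalizeAddr` defaulting rules may differ; the point is that splitting on the last ":" mis-parses IPv6 literals, while `net.SplitHostPort` plus `net.JoinHostPort` handles them:

```go
package main

import (
	"fmt"
	"net"
)

// normalizeAddr splits an address with net.SplitHostPort rather than
// slicing on the last ":". When no port is present (SplitHostPort
// errors), the whole string is treated as the host and the default
// port is appended; JoinHostPort re-adds IPv6 brackets as needed.
func normalizeAddr(addr, defaultPort string) string {
	host, port, err := net.SplitHostPort(addr)
	if err != nil {
		host, port = addr, defaultPort
	}
	return net.JoinHostPort(host, port)
}

func main() {
	for _, a := range []string{"[2001:db8::1]:8443", "2001:db8::1", "example.com"} {
		fmt.Println(normalizeAddr(a, "8443"))
	}
	// [2001:db8::1]:8443
	// [2001:db8::1]:8443
	// example.com:8443
}
```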
Summary
`c.Address` contains a parseable IP before including it

Test plan
`apiAddresses()` excludes `c.Address` when it lacks a valid IP host (e.g. `:8443`)
`pkg/cloudauth` and `pkg/globalrouter` unit tests