Conversation
configurePort() appends three iptables nat rules per NodePort (PREROUTING, OUTPUT, POSTROUTING) but nothing ever deleted them when the sandbox was destroyed. Since iptables evaluates first-match-wins, stale rules pointing at dead sandbox IPs shadow the current sandbox's rules, making node ports unreachable. We found 25 stale DNAT entries for port 6667 on miren-club. Add unconfigureFirewall/unconfigurePort that mirror the creation path but use -D (delete). Errors are logged as warnings — teardown must not fail because a rule is already gone. As defense-in-depth, switch from -A (append) to -I (insert) for the PREROUTING and OUTPUT chains. -I prepends to the top of the chain, so the newest sandbox's rules always match first, even if a prior cleanup was missed.
📝 Walkthrough

Adds sandbox firewall cleanup: new unconfigureFirewall and unconfigurePort methods remove iptables NAT rules and parse sandbox network addresses for teardown; Darwin has a no-op stub. stopSandbox and Delete flows now attempt to fetch sandbox entities to derive IPs and invoke firewall cleanup before stopping containers. Port rule insertion changed to insert-at-top. Numerous controller Delete method signatures were expanded to accept an additional object parameter, and call sites/tests updated to pass nil where applicable.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@controllers/sandbox/sandbox.go`:
- Around line 2301-2305: The current branch that skips firewall cleanup when
errors.Is(entityErr, cond.ErrNotFound{}) needs a fallback: instead of bypassing
DNAT/port cleanup, fetch the needed port/IP cleanup metadata from an alternate
source and proceed with cleanup. Update the code around the entityErr check in
controllers/sandbox/sandbox.go (the block logging "failed to get sandbox entity
for cleanup" / "sandbox entity already deleted") so that when cond.ErrNotFound{}
is detected you attempt to load port/IP bindings from a fallback store (e.g.,
the firewall metadata table, network binding records, or any
SandboxRecord/cleanup cache your service has) and call the same DNAT/port
cleanup routine with that data; ensure the cleanup routine and logs remain the
same path used when the entity is found so DNAT rules are removed even if the
sandbox entity was directly deleted.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 0a0d0aed-11a5-40ac-abac-ed85a0619bb6
📒 Files selected for processing (3)
- controllers/sandbox/firewall.go
- controllers/sandbox/firewall_darwin.go
- controllers/sandbox/sandbox.go
    } else if !errors.Is(entityErr, cond.ErrNotFound{}) {
    	c.Log.Warn("failed to get sandbox entity for cleanup", "id", id, "err", entityErr)
    } else {
    	c.Log.Warn("sandbox entity already deleted, skipping firewall cleanup", "id", id)
    }
Firewall cleanup is skipped on NotFound, which can miss direct-delete teardowns.
When sandbox entity lookup fails with not found, cleanup is fully bypassed. In delete-driven teardown flows, that can leave DNAT rules behind. Please add a fallback source for port/IP cleanup data when the entity is absent (or ensure cleanup happens before entity deletion in that flow).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@controllers/sandbox/sandbox.go` around lines 2301 - 2305, The current branch
that skips firewall cleanup when errors.Is(entityErr, cond.ErrNotFound{}) needs
a fallback: instead of bypassing DNAT/port cleanup, fetch the needed port/IP
cleanup metadata from an alternate source and proceed with cleanup. Update the
code around the entityErr check in controllers/sandbox/sandbox.go (the block
logging "failed to get sandbox entity for cleanup" / "sandbox entity already
deleted") so that when cond.ErrNotFound{} is detected you attempt to load
port/IP bindings from a fallback store (e.g., the firewall metadata table,
network binding records, or any SandboxRecord/cleanup cache your service has)
and call the same DNAT/port cleanup routine with that data; ensure the cleanup
routine and logs remain the same path used when the entity is found so DNAT
rules are removed even if the sandbox entity was directly deleted.
Update GenericController.Delete to receive the decoded entity from delete watch events, completing the plumbing started in PR #600 which made entity data available in delete events but stopped short of threading it through the controller interface. This allows the sandbox controller to clean up iptables DNAT rules even when the entity has already been deleted from the store, since the entity data arrives directly from the watch event rather than requiring a store lookup.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
controllers/sandbox/sandbox.go (1)
2274-2306: ⚠️ Potential issue | 🟠 Major: Fallback IP discovery can still leak subnet allocations.
Line 2284 populates `sandboxIPs` from entity data, but IP release is still under `if container != nil` later in `stopSandbox`. If the pause container is already gone, those fallback IPs never reach `c.Subnet.ReleaseAddr(...)`.
💡 Proposed fix
    @@
    -	// Release IPs
    -	for ipStr := range sandboxIPs {
    -		addr, err := netip.ParseAddr(ipStr)
    -		if err == nil {
    -			err = c.Subnet.ReleaseAddr(addr)
    -			if err != nil {
    -				c.Log.Error("failed to release IP", "addr", addr, "err", err)
    -			} else {
    -				c.Log.Debug("released IP", "addr", addr)
    -			}
    -		} else {
    -			c.Log.Error("failed to parse IP", "addr", ipStr, "err", err)
    -		}
    -	}
    -
     	c.Log.Info("container stopped", "id", id)
     }
    +
    +	// Release IPs even if pause container is already missing.
    +	for ipStr := range sandboxIPs {
    +		addr, err := netip.ParseAddr(ipStr)
    +		if err != nil {
    +			c.Log.Error("failed to parse IP", "addr", ipStr, "err", err)
    +			continue
    +		}
    +		if err := c.Subnet.ReleaseAddr(addr); err != nil {
    +			c.Log.Error("failed to release IP", "addr", addr, "err", err)
    +		} else {
    +			c.Log.Debug("released IP", "addr", addr)
    +		}
    +	}

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/sandbox/sandbox.go` around lines 2274 - 2306, The fallback IPs extracted into sandboxIPs from the entity (in the block using c.EAC.Get and compute.Sandbox) can be skipped later because ReleaseAddr is only called when container != nil in stopSandbox; ensure fallback addresses are always released by moving or duplicating the call to c.Subnet.ReleaseAddr (or adding a release path) so it runs even if the pause/container is missing; update stopSandbox (and/or the cleanup path after c.unconfigureFirewall) to iterate sandboxIPs and call c.Subnet.ReleaseAddr(ctx, addr) for each IP (handling errors and logging) so entity-derived IPs are not leaked.
🧹 Nitpick comments (2)
controllers/service/service.go (1)
534-535: Consider using the delete payload to remove service nft state.
Now that `Delete` receives the pre-delete service object, you can tear down chain/map entries for that service instead of leaving this as a no-op.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@controllers/service/service.go` around lines 534 - 535, The Delete method on ServiceController is currently a no-op; use the provided pre-delete object (obj *network_v1alpha.Service) to tear down any on-chain and in-memory/map state for that service instead of returning nil. Locate ServiceController.Delete and call the appropriate teardown/remove functions (e.g., invoke your chain client like s.chainClient.DeleteServiceNFT(ctx, obj) and remove map/store entries via s.store.DeleteService(ctx, id) or similar helper methods), handle and return errors from those calls, and ensure any related indexes or caches are evicted so the service NFT and related state are fully removed.

pkg/controller/controller_test.go (1)
304-306: Add a delete-path assertion for propagated object data.
`Delete` now accepts `obj`, but the current adapter tests only validate IDs/call counts. Consider extending the stub and tests to assert decoded object propagation (and nil behavior when the event entity is absent).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@pkg/controller/controller_test.go` around lines 304 - 306, The Delete method now receives the object pointer (obj *TestEntity) but tests only assert IDs; update the test stubs and assertions to verify object propagation: extend BasicController (or the test double used in adapter tests) to record received object data (e.g., add a DeleteObjs slice or similar alongside DeleteCalls) inside Delete(id, obj), append a copy or serialized form of obj (handling nil). Then update the adapter tests to assert that when an event includes an entity payload the recorded DeleteObjs entry matches the expected decoded TestEntity fields, and assert nil (or absence) in DeleteObjs when the event has no entity payload. Ensure you reference BasicController.Delete, DeleteCalls and TestEntity in the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@controllers/sandbox/sandbox.go`:
- Around line 2274-2306: The fallback IPs extracted into sandboxIPs from the
entity (in the block using c.EAC.Get and compute.Sandbox) can be skipped later
because ReleaseAddr is only called when container != nil in stopSandbox; ensure
fallback addresses are always released by moving or duplicating the call to
c.Subnet.ReleaseAddr (or adding a release path) so it runs even if the
pause/container is missing; update stopSandbox (and/or the cleanup path after
c.unconfigureFirewall) to iterate sandboxIPs and call c.Subnet.ReleaseAddr(ctx,
addr) for each IP (handling errors and logging) so entity-derived IPs are not
leaked.
---
Nitpick comments:
In `@controllers/service/service.go`:
- Around line 534-535: The Delete method on ServiceController is currently a
no-op; use the provided pre-delete object (obj *network_v1alpha.Service) to tear
down any on-chain and in-memory/map state for that service instead of returning
nil. Locate ServiceController.Delete and call the appropriate teardown/remove
functions (e.g., invoke your chain client like
s.chainClient.DeleteServiceNFT(ctx, obj) and remove map/store entries via
s.store.DeleteService(ctx, id) or similar helper methods), handle and return
errors from those calls, and ensure any related indexes or caches are evicted so
the service NFT and related state are fully removed.
In `@pkg/controller/controller_test.go`:
- Around line 304-306: The Delete method now receives the object pointer (obj
*TestEntity) but tests only assert IDs; update the test stubs and assertions to
verify object propagation: extend BasicController (or the test double used in
adapter tests) to record received object data (e.g., add a DeleteObjs slice or
similar alongside DeleteCalls) inside Delete(id, obj), append a copy or
serialized form of obj (handling nil). Then update the adapter tests to assert
that when an event includes an entity payload the recorded DeleteObjs entry
matches the expected decoded TestEntity fields, and assert nil (or absence) in
DeleteObjs when the event has no entity payload. Ensure you reference
BasicController.Delete, DeleteCalls and TestEntity in the changes.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 467f2f6c-3fa9-40e2-9f29-9eeab837ad78
📒 Files selected for processing (17)
- components/runner/runner.go
- controllers/certificate/controller.go
- controllers/disk/disk_controller.go
- controllers/disk/disk_lease_controller.go
- controllers/disk/disk_lease_controller_directory_test.go
- controllers/disk/disk_lease_controller_test.go
- controllers/disk/disk_watch_controller.go
- controllers/disk/mount_watch_controller.go
- controllers/disk/volume_watch_controller.go
- controllers/ingress/defaultroute.go
- controllers/ingress/defaultrouteapp.go
- controllers/sandbox/sandbox.go
- controllers/sandbox/sandbox_test.go
- controllers/service/service.go
- controllers/service/service_test.go
- pkg/controller/controller.go
- pkg/controller/controller_test.go
`configurePort()` adds three iptables nat rules per NodePort (PREROUTING, OUTPUT, POSTROUTING) when a sandbox starts up, but nothing was cleaning them up on teardown. Since iptables evaluates first-match-wins, stale rules pointing at dead sandbox IPs shadow the new sandbox's rules, making the node port unreachable after the first redeploy. We found 25 stale DNAT entries for port 6667 on miren-club: every sandbox replacement was leaving behind three orphaned rules.

The fix adds `unconfigureFirewall`/`unconfigurePort` functions that mirror the creation path but use `-D` (delete). They're called from `stopSandbox` using the sandbox entity fetched from the entity store, which already has the network address and port specs we need. Cleanup errors are logged as warnings rather than returned: teardown must not fail because a rule is already gone.

As defense-in-depth, the creation path now uses `-I` (insert) instead of `-A` (append) for the PREROUTING and OUTPUT chains. `-I` prepends to the top of the chain, so the newest sandbox's rules always match first even if a prior cleanup was somehow missed.

Closes MIR-752