Skip to content

Clean up iptables DNAT rules on sandbox teardown#644

Merged
phinze merged 2 commits intomainfrom
phinze/mir-752-node_port-stale-iptables-dnat-rules-shadow-active-sandbox
Mar 4, 2026
Merged

Clean up iptables DNAT rules on sandbox teardown#644
phinze merged 2 commits intomainfrom
phinze/mir-752-node_port-stale-iptables-dnat-rules-shadow-active-sandbox

Conversation

@phinze
Copy link
Copy Markdown
Contributor

@phinze phinze commented Mar 4, 2026

configurePort() adds three iptables nat rules per NodePort (PREROUTING, OUTPUT, POSTROUTING) when a sandbox starts up, but nothing was cleaning them up on teardown. Since iptables evaluates first-match-wins, stale rules pointing at dead sandbox IPs shadow the new sandbox's rules, making the node port unreachable after the first redeploy. We found 25 stale DNAT entries for port 6667 on miren-club — every sandbox replacement was leaving behind three orphaned rules.

The fix adds unconfigureFirewall / unconfigurePort functions that mirror the creation path but use -D (delete). They're called from stopSandbox using the sandbox entity fetched from the entity store, which already has the network address and port specs we need. Cleanup errors are logged as warnings rather than returned — teardown must not fail because a rule is already gone.

As defense-in-depth, the creation path now uses -I (insert) instead of -A (append) for the PREROUTING and OUTPUT chains. -I prepends to the top of the chain, so the newest sandbox's rules always match first even if a prior cleanup was somehow missed.

Closes MIR-752

configurePort() appends three iptables nat rules per NodePort
(PREROUTING, OUTPUT, POSTROUTING) but nothing ever deleted them
when the sandbox was destroyed. Since iptables evaluates
first-match-wins, stale rules pointing at dead sandbox IPs
shadow the current sandbox's rules, making node ports
unreachable. We found 25 stale DNAT entries for port 6667 on
miren-club.

Add unconfigureFirewall/unconfigurePort that mirror the creation
path but use -D (delete). Errors are logged as warnings —
teardown must not fail because a rule is already gone.

As defense-in-depth, switch from -A (append) to -I (insert) for
the PREROUTING and OUTPUT chains. -I prepends to the top of the
chain, so the newest sandbox's rules always match first, even if
a prior cleanup was missed.
@phinze phinze requested a review from a team as a code owner March 4, 2026 21:51
@coderabbitai
Copy link
Copy Markdown

coderabbitai bot commented Mar 4, 2026

📝 Walkthrough

Walkthrough

Adds sandbox firewall cleanup: new unconfigureFirewall and unconfigurePort methods remove iptables NAT rules and parse sandbox network addresses for teardown; Darwin has a noop stub. stopSandbox and Delete flows now attempt to fetch sandbox entities to derive IPs and invoke firewall cleanup before stopping containers. Port rule insertion changed to insert-at-top. Numerous controller Delete method signatures were expanded to accept an additional object parameter, and call sites/tests updated to pass nil where applicable.


Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@controllers/sandbox/sandbox.go`:
- Around line 2301-2305: The current branch that skips firewall cleanup when
errors.Is(entityErr, cond.ErrNotFound{}) needs a fallback: instead of bypassing
DNAT/port cleanup, fetch the needed port/IP cleanup metadata from an alternate
source and proceed with cleanup. Update the code around the entityErr check in
controllers/sandbox/sandbox.go (the block logging "failed to get sandbox entity
for cleanup" / "sandbox entity already deleted") so that when cond.ErrNotFound{}
is detected you attempt to load port/IP bindings from a fallback store (e.g.,
the firewall metadata table, network binding records, or any
SandboxRecord/cleanup cache your service has) and call the same DNAT/port
cleanup routine with that data; ensure the cleanup routine and logs remain the
same path used when the entity is found so DNAT rules are removed even if the
sandbox entity was directly deleted.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 0a0d0aed-11a5-40ac-abac-ed85a0619bb6

📥 Commits

Reviewing files that changed from the base of the PR and between 67528a3 and e053526.

📒 Files selected for processing (3)
  • controllers/sandbox/firewall.go
  • controllers/sandbox/firewall_darwin.go
  • controllers/sandbox/sandbox.go

Comment on lines +2301 to 2305
} else if !errors.Is(entityErr, cond.ErrNotFound{}) {
c.Log.Warn("failed to get sandbox entity for cleanup", "id", id, "err", entityErr)
} else {
c.Log.Warn("sandbox entity already deleted, skipping firewall cleanup", "id", id)
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Firewall cleanup is skipped on NotFound, which can miss direct-delete teardowns.

When sandbox entity lookup fails with not found, cleanup is fully bypassed. In delete-driven teardown flows, that can leave DNAT rules behind. Please add a fallback source for port/IP cleanup data when the entity is absent (or ensure cleanup happens before entity deletion in that flow).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/sandbox/sandbox.go` around lines 2301 - 2305, The current branch
that skips firewall cleanup when errors.Is(entityErr, cond.ErrNotFound{}) needs
a fallback: instead of bypassing DNAT/port cleanup, fetch the needed port/IP
cleanup metadata from an alternate source and proceed with cleanup. Update the
code around the entityErr check in controllers/sandbox/sandbox.go (the block
logging "failed to get sandbox entity for cleanup" / "sandbox entity already
deleted") so that when cond.ErrNotFound{} is detected you attempt to load
port/IP bindings from a fallback store (e.g., the firewall metadata table,
network binding records, or any SandboxRecord/cleanup cache your service has)
and call the same DNAT/port cleanup routine with that data; ensure the cleanup
routine and logs remain the same path used when the entity is found so DNAT
rules are removed even if the sandbox entity was directly deleted.

Update GenericController.Delete to receive the decoded entity from
delete watch events, completing the plumbing started in PR #600 which
made entity data available in delete events but stopped short of
threading it through the controller interface.

This allows the sandbox controller to clean up iptables DNAT rules even
when the entity has already been deleted from the store, since the
entity data arrives directly from the watch event rather than requiring
a store lookup.
Copy link
Copy Markdown

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
controllers/sandbox/sandbox.go (1)

2274-2306: ⚠️ Potential issue | 🟠 Major

Fallback IP discovery can still leak subnet allocations.

Line 2284 populates sandboxIPs from entity data, but IP release is still under if container != nil later in stopSandbox. If the pause container is already gone, those fallback IPs never reach c.Subnet.ReleaseAddr(...).

💡 Proposed fix
@@
-		// Release IPs
-		for ipStr := range sandboxIPs {
-			addr, err := netip.ParseAddr(ipStr)
-			if err == nil {
-				err = c.Subnet.ReleaseAddr(addr)
-				if err != nil {
-					c.Log.Error("failed to release IP", "addr", addr, "err", err)
-				} else {
-					c.Log.Debug("released IP", "addr", addr)
-				}
-			} else {
-				c.Log.Error("failed to parse IP", "addr", ipStr, "err", err)
-			}
-		}
-
 		c.Log.Info("container stopped", "id", id)
 	}
+
+	// Release IPs even if pause container is already missing.
+	for ipStr := range sandboxIPs {
+		addr, err := netip.ParseAddr(ipStr)
+		if err != nil {
+			c.Log.Error("failed to parse IP", "addr", ipStr, "err", err)
+			continue
+		}
+		if err := c.Subnet.ReleaseAddr(addr); err != nil {
+			c.Log.Error("failed to release IP", "addr", addr, "err", err)
+		} else {
+			c.Log.Debug("released IP", "addr", addr)
+		}
+	}
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/sandbox/sandbox.go` around lines 2274 - 2306, The fallback IPs
extracted into sandboxIPs from the entity (in the block using c.EAC.Get and
compute.Sandbox) can be skipped later because ReleaseAddr is only called when
container != nil in stopSandbox; ensure fallback addresses are always released
by moving or duplicating the call to c.Subnet.ReleaseAddr (or adding a release
path) so it runs even if the pause/container is missing; update stopSandbox
(and/or the cleanup path after c.unconfigureFirewall) to iterate sandboxIPs and
call c.Subnet.ReleaseAddr(ctx, addr) for each IP (handling errors and logging)
so entity-derived IPs are not leaked.
🧹 Nitpick comments (2)
controllers/service/service.go (1)

534-535: Consider using the delete payload to remove service nft state.

Now that Delete receives the pre-delete service object, you can tear down chain/map entries for that service instead of leaving this as a no-op.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@controllers/service/service.go` around lines 534 - 535, The Delete method on
ServiceController is currently a no-op; use the provided pre-delete object (obj
*network_v1alpha.Service) to tear down any on-chain and in-memory/map state for
that service instead of returning nil. Locate ServiceController.Delete and call
the appropriate teardown/remove functions (e.g., invoke your chain client like
s.chainClient.DeleteServiceNFT(ctx, obj) and remove map/store entries via
s.store.DeleteService(ctx, id) or similar helper methods), handle and return
errors from those calls, and ensure any related indexes or caches are evicted so
the service NFT and related state are fully removed.
pkg/controller/controller_test.go (1)

304-306: Add a delete-path assertion for propagated object data.

Delete now accepts obj, but the current adapter tests only validate IDs/call counts. Consider extending the stub and tests to assert decoded object propagation (and nil behavior when event entity is absent).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@pkg/controller/controller_test.go` around lines 304 - 306, The Delete method
now receives the object pointer (obj *TestEntity) but tests only assert IDs;
update the test stubs and assertions to verify object propagation: extend
BasicController (or the test double used in adapter tests) to record received
object data (e.g., add a DeleteObjs slice or similar alongside DeleteCalls)
inside Delete(id, obj), append a copy or serialized form of obj (handling nil).
Then update the adapter tests to assert that when an event includes an entity
payload the recorded DeleteObjs entry matches the expected decoded TestEntity
fields, and assert nil (or absence) in DeleteObjs when the event has no entity
payload. Ensure you reference BasicController.Delete, DeleteCalls and TestEntity
in the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@controllers/sandbox/sandbox.go`:
- Around line 2274-2306: The fallback IPs extracted into sandboxIPs from the
entity (in the block using c.EAC.Get and compute.Sandbox) can be skipped later
because ReleaseAddr is only called when container != nil in stopSandbox; ensure
fallback addresses are always released by moving or duplicating the call to
c.Subnet.ReleaseAddr (or adding a release path) so it runs even if the
pause/container is missing; update stopSandbox (and/or the cleanup path after
c.unconfigureFirewall) to iterate sandboxIPs and call c.Subnet.ReleaseAddr(ctx,
addr) for each IP (handling errors and logging) so entity-derived IPs are not
leaked.

---

Nitpick comments:
In `@controllers/service/service.go`:
- Around line 534-535: The Delete method on ServiceController is currently a
no-op; use the provided pre-delete object (obj *network_v1alpha.Service) to tear
down any on-chain and in-memory/map state for that service instead of returning
nil. Locate ServiceController.Delete and call the appropriate teardown/remove
functions (e.g., invoke your chain client like
s.chainClient.DeleteServiceNFT(ctx, obj) and remove map/store entries via
s.store.DeleteService(ctx, id) or similar helper methods), handle and return
errors from those calls, and ensure any related indexes or caches are evicted so
the service NFT and related state are fully removed.

In `@pkg/controller/controller_test.go`:
- Around line 304-306: The Delete method now receives the object pointer (obj
*TestEntity) but tests only assert IDs; update the test stubs and assertions to
verify object propagation: extend BasicController (or the test double used in
adapter tests) to record received object data (e.g., add a DeleteObjs slice or
similar alongside DeleteCalls) inside Delete(id, obj), append a copy or
serialized form of obj (handling nil). Then update the adapter tests to assert
that when an event includes an entity payload the recorded DeleteObjs entry
matches the expected decoded TestEntity fields, and assert nil (or absence) in
DeleteObjs when the event has no entity payload. Ensure you reference
BasicController.Delete, DeleteCalls and TestEntity in the changes.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 467f2f6c-3fa9-40e2-9f29-9eeab837ad78

📥 Commits

Reviewing files that changed from the base of the PR and between e053526 and 4c78daa.

📒 Files selected for processing (17)
  • components/runner/runner.go
  • controllers/certificate/controller.go
  • controllers/disk/disk_controller.go
  • controllers/disk/disk_lease_controller.go
  • controllers/disk/disk_lease_controller_directory_test.go
  • controllers/disk/disk_lease_controller_test.go
  • controllers/disk/disk_watch_controller.go
  • controllers/disk/mount_watch_controller.go
  • controllers/disk/volume_watch_controller.go
  • controllers/ingress/defaultroute.go
  • controllers/ingress/defaultrouteapp.go
  • controllers/sandbox/sandbox.go
  • controllers/sandbox/sandbox_test.go
  • controllers/service/service.go
  • controllers/service/service_test.go
  • pkg/controller/controller.go
  • pkg/controller/controller_test.go

@phinze phinze merged commit 11190b9 into main Mar 4, 2026
11 checks passed
@phinze phinze deleted the phinze/mir-752-node_port-stale-iptables-dnat-rules-shadow-active-sandbox branch March 4, 2026 23:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants