…e counters
- Check setns() return value and raise OSError on failure, preventing silent fallback to host namespace when entering container netns fails
- Cache libc handle with use_errno=True for reliable errno retrieval
- Replace thread pool fallback for daemon processes with fork-based subprocess to ensure proper namespace isolation
- Add rate_ceiling to MovingStatistics to clamp physically impossible network rates (>100 Gbps) as a defensive outlier detection layer
- Wrap netstat_ns call in sysfs_impl with OSError handler to gracefully skip network stats when namespace entry fails
- Add regression tests for setns error propagation, rate ceiling clamping, and netstat_ns_work failure handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Pull request overview
Fixes incorrect container net_rx/net_tx spikes caused by silently failing network-namespace entry (falling back to host counters), and adds a defensive rate outlier clamp in stats to suppress physically impossible network-rate jumps.
Changes:
- Make `setns()` raise `OSError` when the underlying libc call fails; log (non-fatal) failures when restoring the original namespace.
- Rework `netstat_ns()` execution strategy and add handling for `setns()` failures while collecting container net stats.
- Add `rate_ceiling` support to `MovingStatistics` / measurements and introduce unit tests for netns failure and rate clamping behavior.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| tests/unit/common/test_netns.py | Adds coverage for setns() failure/success behavior. |
| tests/unit/agent/test_stats.py | Adds tests for MovingStatistics(rate_ceiling=...) clamp semantics. |
| tests/unit/agent/test_docker_intrinsic.py | Adds coverage ensuring netstat_ns_work() propagates setns/nsenter errors. |
| src/ai/backend/common/netns.py | Implements libc caching, errno-aware setns() error handling, and safer __exit__() cleanup/logging. |
| src/ai/backend/agent/stats.py | Threads rate_ceiling through measurements/metrics and clamps extreme rates in MovingStatistics.rate. |
| src/ai/backend/agent/docker/intrinsic.py | Adds a fork-based netns stats path and applies a network rate ceiling to net_rx/net_tx metrics. |
Extract libc handle caching (_get_libc) to a separate issue (BA-4890). Keep the bug fix minimal: only add setns() return value check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…path
Remove _netstat_ns_subprocess and _netstat_ns_child which used
multiprocessing.get_context("fork").Process — this raises
AssertionError in daemon processes just like ProcessPoolExecutor.
The agent worker always runs as a daemon process (aiotools
start_server spawns with daemon=True), so the thread pool path
is the only viable option. With setns() now checking return values
(raising OSError on failure), the thread pool path correctly
detects namespace switch failures.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
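The thread pool path described in this commit message can be sketched roughly as follows. This is an illustration only, not the agent's code: `read_counters` and `gather_net_stats` are hypothetical names, and the worker simply simulates a namespace entry failure for one container ID. It shows how an `OSError` raised inside the executor thread propagates through `await` so the caller can skip that container.

```python
import asyncio


def read_counters(container_id):
    """Hypothetical blocking worker standing in for the real netns reader."""
    if container_id == "broken":
        # Simulates setns() failing inside the worker thread.
        raise OSError(9, "setns() failed")
    return {"bytes_recv": 0, "bytes_sent": 0}


async def gather_net_stats(container_ids, worker=read_counters):
    """Collect stats one container at a time on the default thread pool."""
    loop = asyncio.get_running_loop()
    results = {}
    for cid in container_ids:
        try:
            # Passing None selects the loop's default ThreadPoolExecutor.
            results[cid] = await loop.run_in_executor(None, worker, cid)
        except OSError:
            # Namespace entry failed: skip this container instead of
            # reporting counters read from the wrong namespace.
            continue
    return results
```

Because each container is awaited in turn, at most one worker thread is switching namespaces at any moment in this sketch.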
Rate ceiling masks symptoms and hinders diagnosis. The core fix (setns return value check + OSError handling) is sufficient. If spikes still occur, the raw values serve as debugging evidence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The test mocks ai.backend.common.netns._get_libc but the function didn't exist — libc loading was inline in setns(). Extract it so the mock works correctly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
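A minimal sketch of what a checked `setns()` call with a mockable libc loader might look like. The actual contents of `netns.py` are not shown on this page, so everything here is an assumption: `checked_setns`, its `libc` parameter, and the body of `_get_libc` are illustrative only. `CLONE_NEWNET` is the Linux constant for network namespaces.

```python
import ctypes
import ctypes.util
import os

CLONE_NEWNET = 0x40000000  # Linux flag for network namespaces


def _get_libc():
    # use_errno=True makes ctypes keep a private copy of errno for each
    # foreign call, readable afterwards through ctypes.get_errno().
    return ctypes.CDLL(ctypes.util.find_library("c"), use_errno=True)


def checked_setns(fd, nstype=CLONE_NEWNET, libc=None):
    """Call setns() and raise OSError instead of ignoring a failure."""
    libc = libc if libc is not None else _get_libc()
    if libc.setns(fd, nstype) != 0:
        err = ctypes.get_errno()
        raise OSError(err, os.strerror(err))
```

Keeping the loader behind a function (or an injectable argument, as in this sketch) lets a unit test substitute a fake libc and exercise the failure path without Linux privileges.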
Demonstrates the root cause of stat spikes: when setns() fails with an invalid fd and the return value is unchecked, psutil reads host namespace counters instead of the container's. Uses ProcessPoolExecutor for namespace isolation, skipped on non-Linux platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use `unshare --net` to create an isolated network namespace instead of pulling a Docker image. Removes aiodocker dependency from the test and avoids CI failures due to missing images. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep only the mock-based TestSetns class. The namespace isolation test requires Linux privileges and is better suited for manual verification. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use unshare --net subprocess to create an isolated network namespace and verify netstat_ns_work reads the correct namespace counters. - Success: valid namespace fd returns isolated counters (loopback only, zero bytes) - Failure: non-namespace fd (/dev/null) raises OSError from setns() Tests run in ProcessPoolExecutor for namespace isolation, skipped on non-Linux platforms. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace external unshare command with libc unshare() via ctypes in preexec_fn. Removes shutil dependency and unshare command availability check. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop os.unshare/getattr fallback logic. Just use `unshare --net` command which is available on Linux (busybox and util-linux). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Check if the unshare process is still alive before yielding. If it exited immediately (e.g. EPERM in CI), skip the test gracefully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
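The liveness check described in this commit could look something like the sketch below. It is not the test's actual fixture: `start_if_alive` and its `grace` parameter are hypothetical, and the command to run is left to the caller.

```python
import subprocess


def start_if_alive(cmd, grace=0.5):
    """Start cmd; return the Popen handle, or None if it exits within grace seconds."""
    proc = subprocess.Popen(cmd)
    try:
        # wait() returns as soon as the process exits; a timeout means
        # it is still running after the grace period.
        proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        return proc
    # The process already exited (for example with EPERM when creating
    # a namespace is not permitted), so there is nothing to test against.
    return None
```

A fixture built on this helper would skip the test when it returns `None` and otherwise yield the running process, terminating it during teardown.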
Summary
- Check `setns()` return value and raise `OSError` on failure, preventing silent fallback to the host network namespace when entering a container's netns fails
- Catch `OSError` from `netstat_ns()` in `gather_container_measures()` to gracefully skip containers with inaccessible namespaces
- Extract `_get_libc()` helper from `setns()` so tests can mock the libc layer
- Add `TestNetstatNsWork` using `unshare --net` subprocess

Reproduction
Running `psutil.net_io_counters()` after valid vs invalid `setns()` in a privileged container with `unshare --net`:
- `setns()` (container ns): the thread reads the target namespace's own counters
- `setns(fd=-1)`, unchecked: `setns()` returns -1 but the old code didn't check it, so the thread stays in the host namespace and reads host-level counters

This is the root cause of the stat spikes: when `setns()` silently fails, `psutil` sums cumulative bytes across all host interfaces.

Thread pool and namespace safety
The agent runs as a daemon process (`aiotools.start_server` with `daemon=True`), so `ProcessPoolExecutor` cannot spawn children. `netstat_ns()` always takes the thread pool path (`run_in_executor(None, ...)`). Container stat collection is sequential (`await` per container), so only one namespace switch happens at a time.

`setns()` is per-thread state, so threads don't interfere with each other's namespace. If `setns()` succeeds in `__enter__` (returns 0), the thread is guaranteed to be in the correct target namespace. The return value check is sufficient to prevent wrong readings:
- `__enter__` setns fails: `OSError` is raised, so no counters are read
- `__enter__` setns succeeds: the thread is in the target namespace
- `__exit__` restore fails: the next `__enter__` will enter the correct target regardless

Test plan
- `pants fmt`: passed
- `pants fix`: passed
- `pants lint`: passed
- `pants check`: passed
- `pants test tests/unit/common/test_netns.py`: passed
- `pants test tests/unit/agent/test_docker_intrinsic.py`: passed

Resolves BA-4889
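The three failure modes listed under "Thread pool and namespace safety" can be illustrated with a small context manager. This is a sketch under stated assumptions, not the PR's implementation: `NetnsSwitch` is a hypothetical class, and the `setns` argument is an injected callable assumed to raise `OSError` on failure.

```python
import logging

log = logging.getLogger(__name__)


class NetnsSwitch:
    """Hypothetical stand-in for a namespace-entering context manager."""

    def __init__(self, target_fd, original_fd, setns):
        self.target_fd = target_fd
        self.original_fd = original_fd
        self._setns = setns  # callable(fd) that raises OSError on failure

    def __enter__(self):
        # A failure here propagates, so the with-body never runs and no
        # counters are read from the wrong namespace.
        self._setns(self.target_fd)
        return self

    def __exit__(self, exc_type, exc, tb):
        try:
            self._setns(self.original_fd)
        except OSError:
            # Non-fatal: the next __enter__ names its target explicitly,
            # so it does not depend on this restore having succeeded.
            log.warning("failed to restore original namespace", exc_info=True)
        return False  # never swallow exceptions from the with-body
```

Injecting the `setns` callable keeps the sketch runnable without privileges; in a real setting it would wrap the libc call.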