Skip to content

fix(BA-4889): fix container net_rx/net_tx stats reading host namespace counters#9681

Open
fregataa wants to merge 17 commits intomainfrom
BA-4889
Open

fix(BA-4889): fix container net_rx/net_tx stats reading host namespace counters#9681
fregataa wants to merge 17 commits intomainfrom
BA-4889

Conversation

@fregataa
Copy link
Member

@fregataa fregataa commented Mar 5, 2026

Summary

  • Check setns() return value and raise OSError on failure, preventing silent fallback to the host network namespace when entering a container's netns fails
  • Handle OSError from netstat_ns() in gather_container_measures() to gracefully skip containers with inaccessible namespaces
  • Extract _get_libc() helper from setns() so tests can mock the libc layer
  • Add real namespace switching tests in TestNetstatNsWork using unshare --net subprocess

Reproduction

Running psutil.net_io_counters() after valid vs invalid setns() in a privileged container with unshare --net:

Case bytes_recv bytes_sent total
1) Host namespace (baseline) 404,861 11,301 416,162
2) Valid setns() (container ns) 0 0 0
3) Invalid setns(fd=-1), unchecked 405,081 11,301 416,382
  • Valid fd → enters new namespace → counters are 0 (no traffic)
  • Invalid fd → setns() returns -1 but old code didn't check → stays in host namespace → reads host-level counters

This is the root cause of the stat spikes: when setns() silently fails, psutil sums cumulative bytes across all host interfaces.

Thread pool and namespace safety

The agent runs as a daemon process (aiotools.start_server with daemon=True), so ProcessPoolExecutor cannot spawn children. netstat_ns() always takes the thread pool path (run_in_executor(None, ...)). Container stat collection is sequential (await per container), so only one namespace switch happens at a time.

setns() is per-thread state — threads don't interfere with each other's namespace. If setns() succeeds in __enter__ (returns 0), the thread is guaranteed to be in the correct target namespace. The return value check is sufficient to prevent wrong readings:

Scenario Effect Mitigation
__enter__ setns fails Stays in host ns, would read host counters return value check → OSError → reading blocked
__enter__ setns succeeds In correct target ns No issue
__exit__ restore fails Thread can't return to host ns Warning logged; next __enter__ will enter correct target regardless

Test plan

  • pants fmt — passed
  • pants fix — passed
  • pants lint — passed
  • pants check — passed
  • pants test tests/unit/common/test_netns.py — passed
  • pants test tests/unit/agent/test_docker_intrinsic.py — passed
  • Verify on a live agent that container net_rx/net_tx values are stable and no longer spike

Resolves BA-4889

…e counters

- Check setns() return value and raise OSError on failure, preventing
  silent fallback to host namespace when entering container netns fails
- Cache libc handle with use_errno=True for reliable errno retrieval
- Replace thread pool fallback for daemon processes with fork-based
  subprocess to ensure proper namespace isolation
- Add rate_ceiling to MovingStatistics to clamp physically impossible
  network rates (>100 Gbps) as a defensive outlier detection layer
- Wrap netstat_ns call in sysfs_impl with OSError handler to gracefully
  skip network stats when namespace entry fails
- Add regression tests for setns error propagation, rate ceiling
  clamping, and netstat_ns_work failure handling

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings March 5, 2026 10:15
@github-actions github-actions bot added size:L 100~500 LoC comp:agent Related to Agent component comp:common Related to Common component labels Mar 5, 2026
@fregataa fregataa marked this pull request as draft March 5, 2026 10:17
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes incorrect container net_rx/net_tx spikes caused by silently failing network-namespace entry (falling back to host counters), and adds a defensive rate outlier clamp in stats to suppress physically impossible network-rate jumps.

Changes:

  • Make setns() raise OSError when the underlying libc call fails; log (non-fatal) failures when restoring the original namespace.
  • Rework netstat_ns() execution strategy and add handling for setns() failures while collecting container net stats.
  • Add rate_ceiling support to MovingStatistics / measurements and introduce unit tests for netns failure and rate clamping behavior.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
tests/unit/common/test_netns.py Adds coverage for setns() failure/success behavior.
tests/unit/agent/test_stats.py Adds tests for MovingStatistics(rate_ceiling=...) clamp semantics.
tests/unit/agent/test_docker_intrinsic.py Adds coverage ensuring netstat_ns_work() propagates setns/nsenter errors.
src/ai/backend/common/netns.py Implements libc caching, errno-aware setns() error handling, and safer __exit__() cleanup/logging.
src/ai/backend/agent/stats.py Threads rate_ceiling through measurements/metrics and clamps extreme rates in MovingStatistics.rate.
src/ai/backend/agent/docker/intrinsic.py Adds a fork-based netns stats path and applies a network rate ceiling to net_rx/net_tx metrics.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

fregataa and others added 4 commits March 5, 2026 19:45
Extract libc handle caching (_get_libc) to a separate issue (BA-4890).
Keep the bug fix minimal: only add setns() return value check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…path

Remove _netstat_ns_subprocess and _netstat_ns_child which used
multiprocessing.get_context("fork").Process — this raises
AssertionError in daemon processes just like ProcessPoolExecutor.

The agent worker always runs as a daemon process (aiotools
start_server spawns with daemon=True), so the thread pool path
is the only viable option. With setns() now checking return values
(raising OSError on failure), the thread pool path correctly
detects namespace switch failures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa added this to the 25.15 milestone Mar 5, 2026
fregataa and others added 3 commits March 5, 2026 20:14
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Rate ceiling masks symptoms and hinders diagnosis. The core fix
(setns return value check + OSError handling) is sufficient. If
spikes still occur, the raw values serve as debugging evidence.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:M 30~100 LoC and removed size:L 100~500 LoC labels Mar 5, 2026
The test mocks ai.backend.common.netns._get_libc but the function
didn't exist — libc loading was inline in setns(). Extract it so
the mock works correctly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa marked this pull request as ready for review March 5, 2026 12:36
@fregataa fregataa requested a review from a team March 5, 2026 12:36
@fregataa fregataa marked this pull request as draft March 5, 2026 12:45
Demonstrates the root cause of stat spikes: when setns() fails with
an invalid fd and the return value is unchecked, psutil reads host
namespace counters instead of the container's. Uses ProcessPoolExecutor
for namespace isolation, skipped on non-Linux platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels Mar 5, 2026
fregataa and others added 2 commits March 5, 2026 22:19
Use `unshare --net` to create an isolated network namespace instead of
pulling a Docker image. Removes aiodocker dependency from the test and
avoids CI failures due to missing images.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Keep only the mock-based TestSetns class. The namespace isolation test
requires Linux privileges and is better suited for manual verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added the size:M 30~100 LoC label Mar 5, 2026
@github-actions github-actions bot removed the size:L 100~500 LoC label Mar 5, 2026
Use unshare --net subprocess to create an isolated network namespace
and verify netstat_ns_work reads the correct namespace counters.

- Success: valid namespace fd returns isolated counters (loopback only, zero bytes)
- Failure: non-namespace fd (/dev/null) raises OSError from setns()

Tests run in ProcessPoolExecutor for namespace isolation, skipped on
non-Linux platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@github-actions github-actions bot added size:L 100~500 LoC and removed size:M 30~100 LoC labels Mar 5, 2026
fregataa and others added 2 commits March 5, 2026 22:42
Replace external unshare command with libc unshare() via ctypes in
preexec_fn. Removes shutil dependency and unshare command availability
check.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Drop os.unshare/getattr fallback logic. Just use `unshare --net`
command which is available on Linux (busybox and util-linux).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@fregataa fregataa marked this pull request as ready for review March 5, 2026 14:04
Check if the unshare process is still alive before yielding. If it
exited immediately (e.g. EPERM in CI), skip the test gracefully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp:agent Related to Agent component comp:common Related to Common component size:L 100~500 LoC

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants