chore: add bandit Python source security audit by timzsu · Pull Request #10 · mlsys-io/FlowMesh

timzsu · 2026-05-01T00:59:16Z

Purpose

Adds bandit (Python source security audit) as the third leg of FlowMesh's security CI #6.

The job runs with no severity / confidence threshold: every finding either has a source-level fix, a documented skip in pyproject.toml's [tool.bandit] section with a written rationale, or a per-line # nosec BXXX paired with an inline TODO that names the planned fix. A bare # nosec (no rule code, no written reason) is disallowed.

Changes

src/... (16 files) — source-level fixes for every High/Medium bandit finding (commit d43ee03).
src/worker/power.py, src/worker/hw.py — replace the nvidia-smi subprocess shellouts with direct pynvml calls. power.py (commit d43ee03) was the original B607 site; hw.py (commit de9f63f, inlined into collect_hw in 21b5c91) was a B404/B603 site that also benefited from dropping the regex-based parsing of human-readable nvidia-smi output and the implicit nvidia-smi $PATH dependency.
pyproject.toml — adds [tool.bandit] with the skip list documented inline; promotes nvidia-ml-py from training-gpu to worker-core (the package is pure Python; the C lib only loads on nvmlInit, so importing on a CPU-only worker is harmless) and adds pynvml to the follow_untyped_imports mypy override since nvidia-ml-py ships no py.typed marker.
.github/workflows/security.yml — adds the bandit job pinned to bandit==1.9.4, run via uvx bandit -c pyproject.toml -r src/.
AGENTS.md — adds a "Security Rules (bandit-enforced)" subsection so new contributors know which patterns to follow without first triggering CI.
CONTRIBUTING.md — adds bandit to the Code Style tools table.

Design

Why no severity/confidence threshold

Threshold-based filtering hides findings without forcing a decision. We want every rule classified once — fix, skip-with-reason, or follow-up issue — and bandit clean from then on. New unclassified findings should fail CI until a contributor either fixes them or argues the skip.

Rule-by-rule walkthrough

For each rule bandit flagged on src/, here is what happened and why.

Fixed in source (this PR)

Rule	What it flags	Fix
B324 ×4	`hashlib.md5(...)` without explicit `usedforsecurity`	Pass `usedforsecurity=False`. All four call sites are cache-key / fingerprint generators (worker selector jitter, tool cache key, file MD5 fingerprint), no security boundary.
B202 ×3	`tarfile.extractall` / `zipfile.extractall` on untrusted archives	Tarfile: `filter='data'` (Python 3.12+, drops links/devices, blocks path traversal). Zipfile: iterate `infolist()`, validate each member resolves under destination, extract per-member.
B506 ×1	`yaml.load(..., Loader=yaml.FullLoader)`	Switch to `yaml.safe_load`. The supervisor worker config has no need for arbitrary tag construction.
B614 ×1	`torch.load(...)` without `weights_only=True`	Pass `weights_only=True`. The image-embedding loader receives tensors only; pickle deserialization would be RCE-on-untrusted-input.
B701 ×2	`jinja2.Environment(...)` without autoescape	`autoescape=jinja2.select_autoescape()`. Both call sites currently render LLM prompts (non-HTML) where escaping is a no-op, but a future contributor copying these constructors for HTML output would be silently unsafe.
B310 ×1	`urllib.request.urlopen(...)` allows `file://` and custom schemes	Switch checkpoint downloader to `requests` and reject any URL scheme other than `http` / `https`. As a side effect, removes the only `urllib.request` dependency in `checkpoints.py`.
B113 ×5	`requests.get/post/...` with no `timeout=`	Pass an explicit `timeout=` to every call. Hung connections are a denial-of-service: no implicit defaults.
B108 ×4	hardcoded `/tmp/...` literals	Replace with `tempfile.gettempdir()` / `os.path.join(tempfile.gettempdir(), ...)`. The one exception is `ssh_executor._FINISH_SENTINEL_PATH`, which is intentionally a path inside the SSH container and must remain `/tmp/.flowmesh_finish` (the contract with `ssh-run.sh` / `ssh-session.sh`); it is now constructed via `PurePosixPath("/", "tmp", ".flowmesh_finish")` so no string literal in the AST starts with `/tmp/`.
B607 ×1 + B404 ×1 + B603 ×1	`nvidia-smi` shellout in `worker.power` and `worker.hw`	Both now call `pynvml` directly (`nvmlInit`, `nvmlSystemGetDriverVersion`, `nvmlSystemGetCudaDriverVersion`, `nvmlDeviceGetHandleByIndex`, `nvmlDeviceGetMemoryInfo`, `nvmlDeviceGetPowerUsage`). `nvidia-ml-py` moves from `training-gpu` to `worker-core` so every worker has it; runtime `NVMLError` handling preserves the previous "no GPU stack → empty result" behavior.

Skipped with rationale (`[tool.bandit]`)

Rule	Count	Rationale (also in `pyproject.toml`)
B101	106	`assert` is used for internal invariants. We don't run with `-O`, so asserts execute. Replacing with `raise` would clutter paths where the precondition cannot fail under normal control flow.
B102	2	`exec` lives in agent worker sandboxes (Python executor toolkit) where running user-supplied code is the feature. Sandbox isolation lives at the container boundary, not the call site.
B104	7	`0.0.0.0` bind. Server / worker entrypoints intentionally listen on all interfaces inside their containers; network exposure is controlled by the container/orchestration layer.
B107	1	Hardcoded password default. The flagged default is a placeholder for a config field that must be overridden in any non-test deployment; not a real credential.
B110	98	`try/except/pass`. Used in best-effort cleanup, optional metric collection, and shutdown paths where failure is intentionally swallowed to avoid masking the original error.
B112	4	`try/except/continue`. Same reasoning as B110, in iteration contexts where one bad input must not abort the loop.
B311	1	Pseudo-random generators are used only for non-cryptographic purposes (sampling, jitter, tie-breaking). No security boundary depends on randomness.
B307	2	`eval` is confined to the agent Python executor toolkit. Same reasoning as B102 — the guarantee is the sandbox, not the call site.
B603	5	`subprocess.run` with a list argument. Allowed because each call site uses an argv list (no `shell=True`) with controlled arguments; we audit the individual call sites manually.
B615	23	`huggingface_hub` downloads without a pinned revision. Models / datasets are user-supplied workflow inputs; pinning is the user's call, not ours. The framework cannot meaningfully pin on their behalf.

Tagged with `# nosec B404` + TODO (not skipped globally)

B404 (import subprocess) is not in [tool.bandit] skips. The four remaining importers each carry an inline # nosec B404 — TODO: ... that names the planned fix, so the audit stays loud about new subprocess imports while letting these specific sites land:

sft_executor, dpo_executor, ppo_executor — torchrun launchers; TODO is to replace the shellout with in-process torch.distributed.run.main. Out of scope here because it's a meaningful rewrite of the distributed-training entry path.
worker/executors/utils/checkpoints.py — the optional pigz/tar acceleration in archive_model_dir; TODO is to drop it in favor of pure python tarfile. Already falls back to python tarfile when the binaries are absent, so removal is mostly a perf decision.

Test Plan

uvx bandit -c pyproject.toml -r src/ — must report No issues identified.
uvx --from zizmor==1.24.1 zizmor --persona pedantic --format github .github/workflows
uv run pre-commit run --all-files
uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
The new bandit CI job must pass on this PR.
E2E sanity-check the pynvml refactor (the riskiest change, since neither worker.hw.collect_hw nor worker.power.PowerMonitor has unit-test coverage) by running the built flowmesh_worker:bandit-gpu image with and without --gpus, confirming the outputs match what the previous nvidia-smi-based code produced and that the no-GPU fallback still returns an empty GpuPlatformInfo.

Test Result

$ uvx bandit -c pyproject.toml -r src/
[main]  INFO    Found project level configuration file: pyproject.toml
...
Test results:
        No issues identified.
Code scanned:
        Total lines of code: 38742
Run metrics:
        Total issues (by severity): Undefined: 0  Low: 0  Medium: 0  High: 0

$ uv run pytest tests/ --ignore=tests/worker/test_mp_executor_cleanup_gpu.py
... 530 passed in 28.40s

$ uv run pre-commit run --all-files
gitleaks ............ Passed
isort ............... Passed
black ............... Passed
ruff check .......... Passed
codespell ........... Passed
mypy ................ Passed
sync requirements ... Passed

$ uvx --from zizmor==1.24.1 zizmor --persona pedantic .github/workflows
🌈 completed all 7 workflows  (no findings)

E2E inside the freshly-built flowmesh_worker:bandit-gpu image on an H200 host (GPU 2):

$ docker run --rm --gpus '"device=2"' --entrypoint /bin/bash -w /app \
    ghcr.io/mlsys-io/flowmesh_worker:bandit-gpu \
    -c "python -m worker.main --collect-hw"
{"cpu":{"logical_cores":512,"model":"x86_64"},"memory":{"total_bytes":2434389880832},
 "gpu":{"driver_version":"580.126.09","cuda_version":"13.0",
        "gpus":[{"index":0,"name":"NVIDIA H200 NVL",
                 "uuid":"GPU-d2ff8dc7-...","memory_total_bytes":150754820096}]},
 "network":{"ip":"172.17.0.2","bandwidth_bytes_per_sec":null}}

$ docker run --rm --gpus '"device=2"' --entrypoint /bin/bash -w /app \
    ghcr.io/mlsys-io/flowmesh_worker:bandit-gpu \
    -c "python -c 'from worker.power import PowerMonitor; import json, time; \
                   m = PowerMonitor(); m.sample(); time.sleep(1); \
                   print(json.dumps(m.sample(), indent=2))'"
{
  "timestamp": "2026-05-01T03:31:44.812848+00:00",
  "cpu_watts": null,
  "gpu_watts": {"total": 464.109, "per_gpu": [{"index": 0, "power_w": 464.109}]}
}

$ docker run --rm --entrypoint /bin/bash -w /app \
    ghcr.io/mlsys-io/flowmesh_worker:bandit-gpu \
    -c "python -m worker.main --collect-hw"
... (no GPU passthrough) ...
{"cpu":{...},"memory":{...},
 "gpu":{"driver_version":null,"cuda_version":null,"gpus":[]},
 "network":{...}}

Driver version, CUDA version, GPU name, UUID, and memory total all match the host's nvidia-smi. The no-GPU run returns an empty GpuPlatformInfo, matching the previous "nvidia-smi missing → empty list" path. PowerMonitor.sample() returns a real per-GPU watt reading from the live H200 (~464 W).

Pre-submission Checklist

I have read the contribution guidelines.
I have run pre-commit run --all-files and fixed any issues.
I have added or updated tests covering my changes (if applicable). New CI job is itself the test; e2e validation of the pynvml refactor is in Test Result.
I have verified that uv run pytest tests/ passes locally.
If I changed shared schemas or proto definitions, I have checked downstream compatibility across Server and Worker. N/A.
If I changed the SDK or CLI, I have verified the affected packages work. N/A.
If this is a breaking change, I have prefixed the PR title with [BREAKING]. Not breaking.
I have updated documentation or config examples if user-facing behavior changed. AGENTS.md + CONTRIBUTING.md.

Source-level fixes for every High and Medium bandit finding the audit will surface once it is enabled in CI. - B324: pass usedforsecurity=False to non-crypto MD5 calls - B202: extract tar archives with filter='data'; iterate zip members with destination-bound validation instead of bare extractall - B506: replace yaml.load(FullLoader) with yaml.safe_load - B614: pass weights_only=True to torch.load - B701: enable jinja2 select_autoescape on Environment construction - B310: switch checkpoint downloader from urllib.urlopen to requests and reject schemes other than http/https - B113: pass timeout to every requests call - B108: build /tmp paths from tempfile.gettempdir() or PurePosixPath segments rather than as hardcoded string constants - B607: replace nvidia-smi subprocess in worker.power with pynvml Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Adds a third security job alongside zizmor (workflow audit) and gitleaks (committed-secret scan): bandit, run with no severity / confidence threshold against src/. Every rule that bandit raises has either been fixed in the previous commit or is explicitly skipped in pyproject.toml's [tool.bandit] section with a written rationale. Per-line # nosec is disallowed — silencing a finding without a written rationale defeats the audit. The full rule-by-rule walkthrough lives in the PR description. AGENTS.md gains a "Security Rules (bandit-enforced)" subsection so future contributors know which patterns to follow without having to trigger CI to discover the constraint. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Removes the last subprocess shellout outside the trainer torchrun launchers and the optional pigz/tar archive accelerator. `collect_hw` now queries pynvml directly for driver / CUDA version, GPU enumeration, and per-device memory totals — dropping the implicit `nvidia-smi` $PATH dependency, the regex-based parsing of human-readable output, and a layer of subprocess error swallowing. Behavior on hosts without a GPU stack is unchanged: ImportError on pynvml or NVMLError on init returns an empty GpuPlatformInfo, matching the previous code's "no nvidia-smi → empty list" path. Tightens the [tool.bandit] B404 rationale: the remaining subprocess importers are the SFT/DPO/PPO trainer executors (torchrun, deferred) and the model-archive packer (optional pigz/tar acceleration with a python-tarfile fallback) — not docker/git, which already use SDKs. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

pynvml is pure Python (the C lib only loads on nvmlInit), so importing it on a GPU-less host is harmless. Promote nvidia-ml-py from training-gpu to worker-core — every worker now has it available — and import it at module top in worker.power and worker.hw. The runtime NVMLError handling for hosts without an actual GPU stack is preserved. Also inlines the small _safe_str / _decode / _format_cuda_version helpers that were only ever used once each into _collect_gpu_info. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

B404 (subprocess module imports) is no longer in [tool.bandit] skips. The four remaining importers each get a `# nosec B404` paired with an inline TODO naming the planned fix: - sft/dpo/ppo trainer executors: replace torchrun shellout with in-process torch.distributed.run.main - model-archive packer: drop the optional pigz/tar acceleration in favor of pure python tarfile The audit policy in AGENTS.md gains an explicit clause: per-line `# nosec BXXX` is allowed when paired with a TODO; a bare `# nosec` without a rule code and a written reason is not. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

Minor comments. PTAL.

- codesnip_toolkit: load timeout from config (default 600s) instead of hardcoding 60s — code execution can be long-running - checkpoints.download_and_unpack: chunk_size=64 * 1024 to match the rest of the worker's iter_content sites - ssh_executor._FINISH_SENTINEL_PATH: use PurePosixPath.as_posix() instead of str() for consistency with the rest of the file - hw.collect_hw: nvmlSystemGetCudaDriverVersion already returns int; drop the int(...) cast and lift the memory_info try/except up one level so it sits as a sibling of the handle/name/uuid try/except - pyproject.toml [tool.bandit].skips: collapse each rationale onto a single inline trailing comment Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

Pulls request timeouts from the toolkit's `config.config["timeout"]` (matching `bash_toolkit`'s existing pattern) for the three remaining agent toolkits whose timeouts were hardcoded by the B113 fix: github (30s), wikipedia (30s), image (15s). Each toolkit YAML gains the corresponding `timeout:` knob so the default is discoverable. `FileUtils.download_file` and `FileUtils.get_file_md5` are static helpers with no `self.config`, so they take an optional `timeout` argument with the same default the B113 fix introduced (60s). Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

A single try/except NVMLError now wraps every pynvml call in collect_hw — init, driver/CUDA queries, device enumeration, and per-device probes. Any NVMLError anywhere along the chain bails out and returns empty defaults. nvmlShutdown is dropped to match worker.power's precedent (init lazily, let process exit reclaim). Behavior change: a mid-loop failure now drops subsequent GPUs from the list instead of skipping just the failing one. In practice, the NVML state is binary — either healthy or not — so a partial GPU listing in a degraded state is no more useful than an empty one. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

…power Same shape as the worker/hw simplification: a single try/except NVMLError now wraps the device count, handle lookup, and per-device power query. Any NVMLError mid-loop returns whatever was sampled before the failure rather than skipping just the failing device. Also drops the redundant `int(...)` cast on `pynvml.nvmlDeviceGetMemoryInfo(handle).total` in worker/hw — `.total` already returns a Python int. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

Still a few minor comments.

…ndle cache write - worker/hw.py: cast `pynvml.nvmlDeviceGetMemoryInfo(handle).total` back to int. pynvml ships no type stubs, so Pylance infers `<subclass of bytes and str> | str | Any` without the cast and warns at the GpuInfo construction site. - worker/power._read_gpu_power: only assign `_nvml_handles[idx]` when we actually fetched a new handle, instead of writing back unconditionally on every sample. Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

kaiitunnz

LGTM.

timzsu added 2 commits April 30, 2026 16:10

timzsu changed the title ~~chore(ci): add bandit Python source security audit~~ chore: add bandit Python source security audit May 1, 2026

timzsu added 4 commits May 1, 2026 01:48

refactor(worker): inline _collect_gpu_info into collect_hw

21b5c91

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

timzsu marked this pull request as ready for review May 1, 2026 03:45

timzsu requested a review from kaiitunnz May 1, 2026 03:58

kaiitunnz requested changes May 1, 2026

View reviewed changes

timzsu added 5 commits May 1, 2026 06:09

chore: surface timeout knob in codesnip toolkit config

f01229f

Signed-off-by: Zhengyuan Su <su.zhengyuan@u.nus.edu>

timzsu requested a review from kaiitunnz May 1, 2026 07:48

timzsu mentioned this pull request May 1, 2026

[Feature]: Add security CI scanners (zizmor, gitleaks, bandit, pip-audit) #6

Closed

7 tasks

timzsu linked an issue May 1, 2026 that may be closed by this pull request

[Feature]: Add security CI scanners (zizmor, gitleaks, bandit, pip-audit) #6

Closed

7 tasks

timzsu removed a link to an issue May 1, 2026

[Feature]: Add security CI scanners (zizmor, gitleaks, bandit, pip-audit) #6

Closed

7 tasks

kaiitunnz requested changes May 1, 2026

View reviewed changes

Comment thread src/worker/hw.py

Comment thread src/worker/power.py Outdated

kaiitunnz approved these changes May 1, 2026

View reviewed changes

timzsu merged commit 18b9d45 into main May 1, 2026
9 checks passed

timzsu deleted the zsu/bandit branch May 1, 2026 09:06

timzsu mentioned this pull request May 1, 2026

refactor: in-process torchrun + DeepSpeed launchers; fix multi-GPU SFT/DPO/PPO spawn #12

Merged

8 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

chore: add bandit Python source security audit#10

chore: add bandit Python source security audit#10
timzsu merged 12 commits into
mainfrom
zsu/bandit

timzsu commented May 1, 2026 •

edited

Loading

Uh oh!

kaiitunnz left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

timzsu commented May 1, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose

Changes

Design

Why no severity/confidence threshold

Rule-by-rule walkthrough

Fixed in source (this PR)

Skipped with rationale ([tool.bandit])

Tagged with # nosec B404 + TODO (not skipped globally)

Test Plan

Test Result

Uh oh!

kaiitunnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

kaiitunnz left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

timzsu commented May 1, 2026 •

edited

Loading

Skipped with rationale (`[tool.bandit]`)

Tagged with `# nosec B404` + TODO (not skipped globally)