Skip to content

fix: report actual file size in scan summary when scanner exits early#587

Open
yash2998chhabria wants to merge 8 commits intomainfrom
fix/size-bug
Open

fix: report actual file size in scan summary when scanner exits early#587
yash2998chhabria wants to merge 8 commits intomainfrom
fix/size-bug

Conversation

@yash2998chhabria
Copy link
Contributor

@yash2998chhabria yash2998chhabria commented Feb 25, 2026

Summary

  • Fix scan summary displaying "Size: 0 bytes" for files whose scanner returns early (missing optional dependency, parse error, etc.)
  • Add a fallback in _scan_file_internal (core.py) that sets bytes_scanned to the actual file size whenever a scanner leaves it at zero
  • Affects ONNX, TFLite, Flax, XGBoost, and 20+ other scanners that have early-return paths before setting bytes_scanned

Test plan

  • Verified ONNX file scan now shows correct size (4.72 MB) instead of "0 bytes"
  • Verified safetensors file scan still shows correct size (no regression)
  • Verified GGUF file scan still shows correct size (no regression)
  • All 351 existing tests pass (python -m pytest tests/ -x -q)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • Bug Fixes

    • Scan results now fallback to the actual file size when a scanner leaves bytes-scanned unset.
  • New Features

    • Pickle parsing now handles multiple streams, resynchronizes across separators, and preserves first-stream boundaries for downstream scans.
    • Detection coverage expanded to include many additional suspicious modules/functions and extended pickle variants.
  • Tests

    • Added regression tests for multi-stream parsing, blocklist hardening, broadened format/error detection, and related dangerous-pattern cases.

yash2998chhabria and others added 4 commits February 25, 2026 10:46
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by
PR #518: network (smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver,
ssl, requests, aiohttp), compilation (codeop, marshal, types, compileall,
py_compile), FFI (_ctypes), debugging (bdb, trace), operator bypasses
(_operator, functools), pickle recursion (pickle, _pickle, dill,
cloudpickle, joblib), filesystem (tempfile, filecmp, fileinput, glob,
distutils, pydoc, pexpect), venv/pip (venv, ensurepip, pip), threading
(signal, _signal, threading), and more (webbrowser, asyncio, mmap,
select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3,
_sqlite3, doctest, idlelib, lib2to3, uuid).

Add uuid as novel detection (VULN-6: _get_command_stdout calls
subprocess.Popen) and pkgutil.resolve_name as dynamic resolution
trampoline.

Mirror new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix
multi-stream opcode analysis so detailed checks run across all pickle
streams. Add NEWOBJ_EX to all opcode check lists.

Includes 10 regression tests covering pkgutil trampoline, uuid RCE,
multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3,
tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e scanning

- Fix test regression: track first pickle STOP position and use it for
  binary content scanning instead of f.tell() (which advances past
  multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between
  streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data
  after valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all
  test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial
  streams (binary tensor data misinterpreted as opcodes) to prevent
  MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes
  between streams, preventing single-byte bypass of multi-stream
  detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE
  calls referencing safe callables via memo are not counted as
  dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream
  is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of
  f.tell() which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main
branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scanners that return early (missing optional dependency, parse error,
etc.) left bytes_scanned at its default value of 0, causing the scan
summary to always display "Size: 0 bytes". This affected ONNX, TFLite,
Flax, XGBoost, and several other format scanners.

Add a fallback in _scan_file_internal that sets bytes_scanned to the
actual file size (already computed via os.path.getsize) whenever a
scanner leaves it at zero.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai
Copy link
Contributor

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

Adds a bytes_scanned fallback in core scanning, massively extends the suspicious-globals mapping, implements multi-stream-aware pickle parsing with NEWOBJ_EX handling and first_pickle_end_pos metadata, and adds regression tests covering multi-stream resync, NEWOBJ_EX, and expanded blocklists.

Changes

Cohort / File(s) Summary
Core scanning
modelaudit/core.py
Adds fallback: if a scanner returns bytes_scanned == 0 but actual file size > 0, set bytes_scanned to the file size.
Detectors — suspicious globals
modelaudit/detectors/suspicious_symbols.py
Extensively expands SUSPICIOUS_GLOBALS with many new modules and callable patterns (pkgutil, zipimport, uuid, networking, FFI, serialization, profiling/debugging, packaging, threading, DBs, archives, etc.).
Pickle scanner
modelaudit/scanners/pickle_scanner.py
Adds multi-stream-aware parsing/resynchronization and buffering, exposes first_pickle_end_pos in result metadata, adds multi_stream parameter to _genops_with_fallback, integrates NEWOBJ_EX into detection paths, and expands ALWAYS_DANGEROUS_MODULES/FUNCTIONS (e.g., pkgutil, uuid, networking/import modules).
Tests — pickle scanner
tests/scanners/test_pickle_scanner.py
Adds TestPickleScannerBlocklistHardening with helpers and ~11 regression tests for pkgutil/uuid, NEWOBJ_EX, multi-stream benign→malicious flows, separator-byte resync, and blocking of many modules (smtplib, sqlite3, tarfile, marshal, cloudpickle, webbrowser).
Tests — real-world joblib/dill
tests/test_real_world_dill_joblib.py
Loosens parse/format-failure matching to accept additional keywords (opcode, format, unable to parse, invalid, memoryerror, corrupted) when bytes_scanned == 0 and issues exist.

Sequence Diagram

sequenceDiagram
    participant Client as Client
    participant PickleScanner as PickleScanner
    participant GenOps as _genops_with_fallback
    participant OpcodeProcessor as OpcodeProcessor
    participant Matcher as Matcher

    Client->>PickleScanner: scan(file_obj)
    PickleScanner->>GenOps: _genops_with_fallback(file_obj, multi_stream=True)
    GenOps->>GenOps: parse stream(s), resync, buffer subsequent streams, set first_pickle_end_pos
    GenOps->>OpcodeProcessor: yield opcodes (including NEWOBJ_EX)
    OpcodeProcessor->>Matcher: evaluate sequences / blocklists / CVE patterns
    Matcher->>PickleScanner: report dangerous findings
    PickleScanner->>Client: return ScanResult (metadata:first_pickle_end_pos, bytes_scanned)
Loading

Estimated Code Review Effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hop through pickles, bytes in tow,

I stitch the streams where sneaky bytes flow.
NEWOBJ_EX, I sniff and flag,
Blocklists thick, I guard the bag.
Nibble safety, onward I go.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: fixing scan summaries to report actual file size when scanners exit early, which is the primary objective of this PR.
Docstring Coverage ✅ Passed Docstring coverage is 96.00% which is sufficient. The required threshold is 80.00%.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
  • 📝 Generate docstrings (stacked PR)
  • 📝 Generate docstrings (commit on current branch)
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch fix/size-bug

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: The fixed 256-byte resync cap (_MAX_RESYNC_BYTES) combined
with the resync_skipped counter can hide malicious second pickle streams; update
the resynchronization logic in pickle_scanner.py to remove or substantially
raise the fixed cap and instead scan for additional pickle stream opcodes until
EOF (or until a safe global size-based limit tied to the input length), i.e.,
revise the loop that increments resync_skipped and checks _MAX_RESYNC_BYTES
(around the code using resync_skipped and _MAX_RESYNC_BYTES between lines
~106-115) so it continues scanning for stream markers across larger/padded gaps
(or uses a sliding-window opcode search) rather than stopping after 256 bytes.
- Around line 1549-1553: When handling the pickle stream boundary (the branch
where opcode.name == "STOP") you currently reset counters but not the
stream-local memo; add logic to clear or reinitialize the scanner's _safe_memo
at the same place you reset dangerous_opcode_count, consecutive_dangerous and
max_consecutive so memo indexes from a previous stream cannot be reused to mark
BINGET->REDUCE pairs in a new stream as safe. Update the STOP handling block in
pickle_scanner.py to reset _safe_memo (e.g., assign a fresh empty memo
structure) while keeping the existing counter resets to preserve detection
behavior.
- Line 1477: The current conditional incorrectly requires isinstance(arg, str)
for OBJ/NEWOBJ/NEWOBJ_EX, making those branches unreachable; split the logic so
INST is handled when opcode.name == "INST" and isinstance(arg, str) as before,
but handle opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] in a separate branch that
does not require arg to be a string and instead inspects the stack or preceding
GLOBAL/STACK items to resolve the class being instantiated (so NEWOBJ_EX
following a dangerous GLOBAL is flagged); update the logic in pickle_scanner.py
where opcode and arg are checked and ensure the test
test_newobj_ex_dangerous_class can detect the dangerous pattern.

In `@tests/scanners/test_pickle_scanner.py`:
- Around line 647-695: The module-blocklist spot tests (e.g.,
test_smtplib_blocked, test_sqlite3_blocked, test_tarfile_blocked,
test_marshal_blocked, test_cloudpickle_blocked, test_webbrowser_blocked) only
assert result.has_errors and may pass on unrelated failure paths; add an
explicit assert result.success (immediately after obtaining result from
self._scan_bytes(...)) in each of these test_* methods to ensure the
CRITICAL/module checks validate successful detection flow before the existing
severity/message assertions.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e736ebb and 5efc0df.

📒 Files selected for processing (4)
  • modelaudit/core.py
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/pickle_scanner.py
  • tests/scanners/test_pickle_scanner.py

Resolve conflicts in pickle_scanner.py:
- DANGEROUS_FUNCTIONS: keep both branch additions (pkgutil, uuid) and
  main additions (cProfile, profile, pdb, timeit, ctypes functions)
- ALWAYS_DANGEROUS_MODULES: merge main's core blocklist (ctypes, cProfile,
  profile, pdb, timeit, _thread, multiprocessing, zipimport, importlib,
  runpy) with branch's extended coverage (pkgutil, smtplib, marshal,
  functools, pickle, cloudpickle, tempfile, uuid, etc.) and preserve
  main's NOTE about linecache/logging.config
- REDUCE opcode handling: adopt main's resolved_callables lookup with
  branch's _safe_memo fallback for BINGET/memo patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

♻️ Duplicate comments (4)
modelaudit/scanners/pickle_scanner.py (3)

2209-2213: ⚠️ Potential issue | 🟠 Major

Reset _safe_memo at each stream boundary (STOP).

Line 2209 resets counters only; memo-index safety state can leak into the next stream.

🔧 Suggested fix
         if opcode.name == "STOP":
             dangerous_opcode_count = 0
             consecutive_dangerous = 0
             max_consecutive = 0
+            _safe_memo.clear()
             continue

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2209 - 2213, At each
pickle stream boundary where the code checks opcode.name == "STOP" (inside the
PickleScanner/scan loop), also reset the memo safety state by clearing or
reinitializing _safe_memo so memo-index safety does not leak between streams;
update the STOP handling block that currently resets dangerous_opcode_count,
consecutive_dangerous, and max_consecutive to additionally reset _safe_memo (and
any related memo tracking structures) and run unit tests for both benign and
malicious samples to ensure detections are preserved.

2132-2138: ⚠️ Potential issue | 🟠 Major

OBJ/NEWOBJ/NEWOBJ_EX branch is unreachable behind isinstance(arg, str).

Line 2132 gates stack-based opcodes behind a string-arg check; only INST can satisfy this in practice.

🔧 Suggested fix
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
+        if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": str(arg),
+                "position": pos,
+                "opcode": opcode.name,
+            }
+        if opcode.name == "INST" and isinstance(arg, str):
             return {
                 "pattern": f"{opcode.name}_EXECUTION",
                 "argument": arg,
                 "position": pos,
                 "opcode": opcode.name,
             }
#!/bin/bash
# Verify pickletools opcode argument descriptors for the affected opcodes.
python - <<'PY'
import pickletools
for name in ("INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"):
    op = next(o for o in pickletools.opcodes if o.name == name)
    print(f"{name}: has_stream_argument={op.arg is not None}")
PY
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2132 - 2138, The branch
currently requires isinstance(arg, str) for all opcodes, making the
OBJ/NEWOBJ/NEWOBJ_EX cases unreachable; update the logic in pickle_scanner.py
(around the block handling opcode.name in ["INST","OBJ","NEWOBJ","NEWOBJ_EX"])
to separate handling: keep the INST path gated by isinstance(arg, str) to return
the INST_EXECUTION pattern, but remove the str requirement for OBJ, NEWOBJ, and
NEWOBJ_EX so they return their respective _EXECUTION pattern even when arg is
not a string (handle None/int safely), i.e., branch on opcode.name first and
only apply isinstance(arg, str) for INST to avoid crashes and correctly detect
stack-based opcodes.

52-54: ⚠️ Potential issue | 🟠 Major

Fixed 256-byte resync limit is bypassable with larger separators.

Line 113 hard-stops resync after 256 bytes, so padded separators can hide later malicious streams.

🔧 Suggested hardening
-    # Maximum number of consecutive non-pickle bytes to skip when resyncing
-    _MAX_RESYNC_BYTES = 256
@@
-            if resync_skipped >= _MAX_RESYNC_BYTES:
-                return

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

Also applies to: 113-114

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 54, The current
fixed resync limit (_MAX_RESYNC_BYTES = 256) and resync_skipped counter can be
bypassed by large padded separators; update the resync logic in the pickle
scanner to stop relying on a small hard cap and instead scan the incoming stream
for valid pickle start/marker patterns until a legitimate pickle header or EOF
is found. Make _MAX_RESYNC_BYTES configurable (or remove the fixed cap) and
change the loop that increments resync_skipped to perform a sliding search for
known pickle opcodes/headers and reset resync_skipped when a candidate is found;
ensure the scanner still fails safely on EOF and include tests for both short
and very large padded separators to validate detection.
tests/scanners/test_pickle_scanner.py (1)

744-792: ⚠️ Potential issue | 🟡 Minor

Add explicit result.success assertions in module spot checks.

These tests currently assert has_errors only; they can pass on non-target failure paths.

🔧 Suggested consolidation
+    def _assert_module_blocked(self, module: str, func: str) -> None:
+        result = self._scan_bytes(self._craft_global_reduce_pickle(module, func))
+        assert result.success
+        assert result.has_errors
+        assert any(
+            i.severity == IssueSeverity.CRITICAL and module in i.message
+            for i in result.issues
+        ), f"Expected CRITICAL {module} issue, got: {[i.message for i in result.issues]}"
@@
     def test_smtplib_blocked(self) -> None:
-        result = self._scan_bytes(self._craft_global_reduce_pickle("smtplib", "SMTP"))
-        assert result.has_errors
-        assert any(i.severity == IssueSeverity.CRITICAL and "smtplib" in i.message for i in result.issues), (
-            f"Expected CRITICAL smtplib issue, got: {[i.message for i in result.issues]}"
-        )
+        self._assert_module_blocked("smtplib", "SMTP")

Based on learnings: "Add comprehensive tests including edge cases and regression coverage for scanner changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scanners/test_pickle_scanner.py` around lines 744 - 792, The tests
(e.g., test_smtplib_blocked, test_sqlite3_blocked, test_tarfile_blocked,
test_marshal_blocked, test_cloudpickle_blocked, test_webbrowser_blocked) only
assert result.has_errors and can pass on unrelated failure paths; update each to
also assert the explicit success flag (e.g., assert result.success is False or
assert not result.success) after calling
_scan_bytes(_craft_global_reduce_pickle(...)) so the test guarantees the scanner
reported a failure state for these dangerous-module pickles.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 435-451: The list literal that adds "logging" to the
always-dangerous modules conflicts with the nearby intent note that
"logging.config" is intentionally excluded; either remove "logging" from the
always-dangerous/modules blocklist or update the comment to explicitly state
that only specific submodules (e.g., "logging.handlers" or others) are
considered dangerous while "logging.config" is excluded. Locate the module list
in pickle_scanner.py (the literal containing "webbrowser", "asyncio", ...,
"logging", ...) and make the change so the code and comment consistently reflect
the intended policy.
- Around line 91-95: The early return on a partial-decode inside the pickle
scanner (the block that checks stream_error and had_opcodes) aborts scanning the
rest of the file and can miss later malicious streams; instead, discard the
current buffer and continue scanning subsequent streams. Edit the code in
pickle_scanner.py (the block using stream_error and had_opcodes inside the
stream-processing loop) to remove the return and ensure control continues to the
next stream (e.g., clear/reset the per-stream buffer/state and use continue or
otherwise advance the outer loop), preserving any detection bookkeeping so later
streams are still scanned.

---

Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2209-2213: At each pickle stream boundary where the code checks
opcode.name == "STOP" (inside the PickleScanner/scan loop), also reset the memo
safety state by clearing or reinitializing _safe_memo so memo-index safety does
not leak between streams; update the STOP handling block that currently resets
dangerous_opcode_count, consecutive_dangerous, and max_consecutive to
additionally reset _safe_memo (and any related memo tracking structures) and run
unit tests for both benign and malicious samples to ensure detections are
preserved.
- Around line 2132-2138: The branch currently requires isinstance(arg, str) for
all opcodes, making the OBJ/NEWOBJ/NEWOBJ_EX cases unreachable; update the logic
in pickle_scanner.py (around the block handling opcode.name in
["INST","OBJ","NEWOBJ","NEWOBJ_EX"]) to separate handling: keep the INST path
gated by isinstance(arg, str) to return the INST_EXECUTION pattern, but remove
the str requirement for OBJ, NEWOBJ, and NEWOBJ_EX so they return their
respective _EXECUTION pattern even when arg is not a string (handle None/int
safely), i.e., branch on opcode.name first and only apply isinstance(arg, str)
for INST to avoid crashes and correctly detect stack-based opcodes.
- Around line 52-54: The current fixed resync limit (_MAX_RESYNC_BYTES = 256)
and resync_skipped counter can be bypassed by large padded separators; update
the resync logic in the pickle scanner to stop relying on a small hard cap and
instead scan the incoming stream for valid pickle start/marker patterns until a
legitimate pickle header or EOF is found. Make _MAX_RESYNC_BYTES configurable
(or remove the fixed cap) and change the loop that increments resync_skipped to
perform a sliding search for known pickle opcodes/headers and reset
resync_skipped when a candidate is found; ensure the scanner still fails safely
on EOF and include tests for both short and very large padded separators to
validate detection.

In `@tests/scanners/test_pickle_scanner.py`:
- Around line 744-792: The tests (e.g., test_smtplib_blocked,
test_sqlite3_blocked, test_tarfile_blocked, test_marshal_blocked,
test_cloudpickle_blocked, test_webbrowser_blocked) only assert result.has_errors
and can pass on unrelated failure paths; update each to also assert the explicit
success flag (e.g., assert result.success is False or assert not result.success)
after calling _scan_bytes(_craft_global_reduce_pickle(...)) so the test
guarantees the scanner reported a failure state for these dangerous-module
pickles.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between 5efc0df and a13f819.

📒 Files selected for processing (3)
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/pickle_scanner.py
  • tests/scanners/test_pickle_scanner.py

Comment on lines +435 to +451
"webbrowser",
"asyncio",
"mmap",
"select",
"selectors",
"logging",
"syslog",
"tarfile",
"zipfile",
"shelve",
"sqlite3",
"_sqlite3",
"doctest",
"idlelib",
"lib2to3",
# uuid — _get_command_stdout internally calls subprocess.Popen (VULN-6)
"uuid",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

logging blocklist addition conflicts with nearby intent note.

Line 440 adds logging to always-dangerous modules, but the nearby note states logging.config is intentionally not in this set. Please align code or comment to avoid policy ambiguity.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 435 - 451, The list
literal that adds "logging" to the always-dangerous modules conflicts with the
nearby intent note that "logging.config" is intentionally excluded; either
remove "logging" from the always-dangerous/modules blocklist or update the
comment to explicitly state that only specific submodules (e.g.,
"logging.handlers" or others) are considered dangerous while "logging.config" is
excluded. Locate the module list in pickle_scanner.py (the literal containing
"webbrowser", "asyncio", ..., "logging", ...) and make the change so the code
and comment consistently reflect the intended policy.

- Increase _MAX_RESYNC_BYTES from 256 to 8192 to prevent attackers from
  hiding malicious streams behind padding gaps larger than the resync window
- Continue scanning on partial stream decode errors instead of returning
  early, so malicious payloads later in the file are not missed
- Clear _safe_memo at STOP boundaries so stale memo entries from a prior
  stream cannot mark BINGET->REDUCE pairs in a new stream as safe
- Split OBJ/NEWOBJ/NEWOBJ_EX detection from INST: these stack-based opcodes
  have no string arg, so resolve the class via callable_refs instead
- Strengthen module blocklist test assertions with explicit result.success
  and result.issues checks before severity/message checks

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)

2149-2156: ⚠️ Potential issue | 🟠 Major

INST path currently flags all string-arg INST opcodes as dangerous.

Unlike OBJ/NEWOBJ/NEWOBJ_EX, the INST branch returns immediately without checking whether the resolved target is actually dangerous. This can create false critical findings.

Proposed fix
         # INST encodes the class directly in the string argument.
         if opcode.name == "INST" and isinstance(arg, str):
-            return {
-                "pattern": f"{opcode.name}_EXECUTION",
-                "argument": arg,
-                "position": pos,
-                "opcode": opcode.name,
-            }
+            parsed = _parse_module_function(arg)
+            if parsed:
+                mod, func = parsed
+                if _is_dangerous_ref(mod, func):
+                    return {
+                        "pattern": f"{opcode.name}_EXECUTION",
+                        "argument": f"{mod}.{func}",
+                        "position": pos,
+                        "opcode": opcode.name,
+                    }
♻️ Duplicate comments (1)
modelaudit/scanners/pickle_scanner.py (1)

52-55: ⚠️ Potential issue | 🟠 Major

Fixed resync cap still allows padded multi-stream bypass.

A malicious stream placed after >8192 non-pickle bytes will still be skipped because scanning terminates at Line 117–Line 118.

Hardening direction
-    _MAX_RESYNC_BYTES = 8192
+    # Avoid fixed-size resync cutoffs that can be bypassed with padding.
+    # Runtime is already bounded by scanner timeout / opcode limits.
@@
-            if resync_skipped >= _MAX_RESYNC_BYTES:
-                return
             continue
Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

Also applies to: 117-118

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 55, The scanner's
fixed resync cap (_MAX_RESYNC_BYTES = 8192) plus the early termination in the
resync logic can be bypassed by padding a malicious pickle after a larger gap;
update the resync handling to not silently stop scanning once the skipped byte
count exceeds _MAX_RESYNC_BYTES—either (a) remove the premature termination and
continue scanning the remainder of the stream for additional pickle markers, or
(b) treat exceeding the cap as a detection (fail/flag) instead of skipping
further checks; locate the resync loop/branch that references _MAX_RESYNC_BYTES
and change its control flow accordingly so no multi-stream payload can hide
beyond the cap.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 93-103: The code currently resets stream_error and then still
yields the prior buffered opcodes, allowing false positives; in the block
handling a partial stream (the branch that checks if stream_error and
had_opcodes) explicitly discard the buffered opcode store (e.g., clear or
reassign buffered = [] or buffered.clear()) before resetting
had_opcodes/stream_error so no buffered data is yielded later; update the logic
around the variables buffered, had_opcodes, and stream_error in
pickle_scanner.py to only yield from buffered when stream_error is false and
buffered was not discarded.

In `@tests/scanners/test_pickle_scanner.py`:
- Around line 723-739: The test test_newobj_ex_dangerous_class currently can
pass due to the GLOBAL 'os._wrap_close' match rather than verifying NEWOBJ_EX
handling; update the assertion to specifically verify that the reported issue is
tied to NEWOBJ_EX by checking the issue details from result.issues (e.g.,
message contains "NEWOBJ_EX" or an opcode/source field indicating NEWOBJ_EX) and
still ensure it's CRITICAL; use the existing test helper _scan_bytes and
IssueSeverity.CRITICAL to locate the relevant issue and assert that at least one
issue is NEWOBJ_EX-specific so the test fails if NEWOBJ_EX detection regresses.

---

Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-55: The scanner's fixed resync cap (_MAX_RESYNC_BYTES = 8192)
plus the early termination in the resync logic can be bypassed by padding a
malicious pickle after a larger gap; update the resync handling to not silently
stop scanning once the skipped byte count exceeds _MAX_RESYNC_BYTES—either (a)
remove the premature termination and continue scanning the remainder of the
stream for additional pickle markers, or (b) treat exceeding the cap as a
detection (fail/flag) instead of skipping further checks; locate the resync
loop/branch that references _MAX_RESYNC_BYTES and change its control flow
accordingly so no multi-stream payload can hide beyond the cap.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a13f819 and 7a3ce67.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/scanners/test_pickle_scanner.py

Comment on lines +723 to +739
def test_newobj_ex_dangerous_class(self) -> None:
"""NEWOBJ_EX opcode with a dangerous class should be flagged."""
# Craft pickle: PROTO 4 | GLOBAL 'os _wrap_close' | EMPTY_TUPLE | EMPTY_DICT | NEWOBJ_EX | STOP
# Protocol 4 is needed for NEWOBJ_EX (opcode 0x92)
proto = b"\x80\x04"
global_op = b"c" + b"os\n_wrap_close\n"
empty_tuple = b")"
empty_dict = b"}"
newobj_ex = b"\x92" # NEWOBJ_EX opcode
stop = b"."
data = proto + global_op + empty_tuple + empty_dict + newobj_ex + stop

result = self._scan_bytes(data)
assert result.success
assert result.has_errors
os_issues = [i for i in result.issues if i.severity == IssueSeverity.CRITICAL and "os" in i.message.lower()]
assert os_issues, f"Expected CRITICAL os issue for NEWOBJ_EX, got: {[i.message for i in result.issues]}"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

test_newobj_ex_dangerous_class is too broad to validate NEWOBJ_EX-specific detection.

This test can pass from the GLOBAL os._wrap_close finding alone, even if NEWOBJ_EX handling regresses. Please assert on a check/detail tied to opcode == "NEWOBJ_EX" (or NEWOBJ_EX-specific pattern) to make it a true regression guard.

Based on learnings: "Add comprehensive tests including edge cases and regression coverage for scanner changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/scanners/test_pickle_scanner.py` around lines 723 - 739, The test
test_newobj_ex_dangerous_class currently can pass due to the GLOBAL
'os._wrap_close' match rather than verifying NEWOBJ_EX handling; update the
assertion to specifically verify that the reported issue is tied to NEWOBJ_EX by
checking the issue details from result.issues (e.g., message contains
"NEWOBJ_EX" or an opcode/source field indicating NEWOBJ_EX) and still ensure
it's CRITICAL; use the existing test helper _scan_bytes and
IssueSeverity.CRITICAL to locate the relevant issue and assert that at least one
issue is NEWOBJ_EX-specific so the test fails if NEWOBJ_EX detection regresses.

yash2998chhabria and others added 2 commits February 26, 2026 06:40
… on Windows

The test_real_joblib_compressed test only accepted "opcode" in the error
message, but compressed joblib files can produce different parse errors
depending on the platform (e.g. MemoryError on Windows).  Widen the
assertion to accept any format/parse-related error keyword so the test
passes across all platforms.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…des and reset symbolic state at STOP

- Fix bug where partial stream opcodes were yielded instead of discarded
  after a ValueError in _genops_with_fallback (clear buffer before resetting
  stream_error flag)
- Reset stack/memo in _build_symbolic_reference_maps at STOP boundaries so
  stale references from stream 1 don't leak into stream 2 resolution
- Apply ruff format fix to test_real_world_dill_joblib.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)

2159-2166: ⚠️ Potential issue | 🟠 Major

Gate INST findings on dangerous target resolution.

Line 2160 currently returns INST_EXECUTION for any INST string arg, even benign classes. This can generate false CRITICAL findings and dilute signal quality.

🔧 Suggested fix
-        if opcode.name == "INST" and isinstance(arg, str):
-            return {
-                "pattern": f"{opcode.name}_EXECUTION",
-                "argument": arg,
-                "position": pos,
-                "opcode": opcode.name,
-            }
+        if opcode.name == "INST" and isinstance(arg, str):
+            parsed = _parse_module_function(arg)
+            if parsed:
+                mod, func = parsed
+                if _is_dangerous_ref(mod, func):
+                    return {
+                        "pattern": f"{opcode.name}_EXECUTION",
+                        "argument": arg,
+                        "position": pos,
+                        "opcode": opcode.name,
+                    }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2159 - 2166, The current
INST handling in pickle_scanner.py returns an INST_EXECUTION finding for any
string arg, causing false positives; update the INST branch (the code checking
if opcode.name == "INST" and isinstance(arg, str)) to suppress benign types by
checking the arg against a safe builtin whitelist (e.g.,
"object","list","dict","set","tuple","str","int","float","bool","bytes") and
only emit the INST_EXECUTION finding when the arg is not in that whitelist
and/or appears to be a fully-qualified or suspicious class name (e.g., contains
a module qualifier like "." or matches known dangerous targets); implement the
whitelist as a module-level SAFE_BUILTINS and reference it inside the INST
handling so harmless classes are ignored while truly suspicious INST args still
produce a finding.
♻️ Duplicate comments (1)
modelaudit/scanners/pickle_scanner.py (1)

52-56: ⚠️ Potential issue | 🟠 Major

Remove fixed resync ceiling; it still permits padded-stream evasion.

At Line 118, multi-stream parsing stops after 8192 non-pickle bytes. A payload can place a malicious second stream beyond that gap and evade detection.

Based on learnings: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

🔧 Suggested hardening
-    _MAX_RESYNC_BYTES = 8192
     resync_skipped = 0
@@
-            if resync_skipped >= _MAX_RESYNC_BYTES:
-                return
             continue

Also applies to: 118-119

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 52 - 56, The fixed
ceiling _MAX_RESYNC_BYTES and related resync_skipped logic allow evasion by
padding a malicious second pickle stream beyond 8192 bytes; replace the hard
limit with a safer approach: remove or enlarge the fixed cap and instead
implement streaming-aware resync that either (a) reads until EOF when scanning
for additional pickle streams, or (b) enforces a configurable/strict
total-scan-size limit and per-file sanity checks, and update the code paths that
reference _MAX_RESYNC_BYTES and resync_skipped (the multi-stream parsing loop)
to use the new behavior/config parameter; add unit tests covering both benign
long-padding and malicious delayed-stream samples to validate detection.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2251-2267: The backward scan ranges skip index 0 due to Python's
exclusive stop; update the range stop arguments used in the backward scans in
pickle_scanner.py (the for loops that start with for j in range(i - 1, max(0, i
- 4), -1) and for k in range(j - 1, max(0, j - 10), -1)) so the stop allows
hitting index 0 (e.g., use max(-1, i - 4) and max(-1, j - 10) or otherwise set
the stop to -1) to include the stream-start GLOBAL/STACK_GLOBAL entries when
computing _safe_memo in the surrounding logic that references opcodes,
prev_opcode, prev_arg, and _is_safe_ml_global.

---

Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2159-2166: The current INST handling in pickle_scanner.py returns
an INST_EXECUTION finding for any string arg, causing false positives; update
the INST branch (the code checking if opcode.name == "INST" and isinstance(arg,
str)) to suppress benign types by checking the arg against a safe builtin
whitelist (e.g.,
"object","list","dict","set","tuple","str","int","float","bool","bytes") and
only emit the INST_EXECUTION finding when the arg is not in that whitelist
and/or appears to be a fully-qualified or suspicious class name (e.g., contains
a module qualifier like "." or matches known dangerous targets); implement the
whitelist as a module-level SAFE_BUILTINS and reference it inside the INST
handling so harmless classes are ignored while truly suspicious INST args still
produce a finding.

---

Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-56: The fixed ceiling _MAX_RESYNC_BYTES and related
resync_skipped logic allow evasion by padding a malicious second pickle stream
beyond 8192 bytes; replace the hard limit with a safer approach: remove or
enlarge the fixed cap and instead implement streaming-aware resync that either
(a) reads until EOF when scanning for additional pickle streams, or (b) enforces
a configurable/strict total-scan-size limit and per-file sanity checks, and
update the code paths that reference _MAX_RESYNC_BYTES and resync_skipped (the
multi-stream parsing loop) to use the new behavior/config parameter; add unit
tests covering both benign long-padding and malicious delayed-stream samples to
validate detection.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between a7ad66c and 5521b8c.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_real_world_dill_joblib.py

Comment on lines +2251 to +2267
for j in range(i - 1, max(0, i - 4), -1):
prev_opcode, prev_arg, _prev_pos = opcodes[j]
if prev_opcode.name == "GLOBAL" and isinstance(prev_arg, str):
parts = (
prev_arg.split(" ", 1)
if " " in prev_arg
else prev_arg.rsplit(".", 1)
if "." in prev_arg
else [prev_arg, ""]
)
if len(parts) == 2:
_safe_memo[arg] = _is_safe_ml_global(parts[0], parts[1])
break
if prev_opcode.name == "STACK_GLOBAL":
strs: list[str] = []
for k in range(j - 1, max(0, j - 10), -1):
pk_op, pk_arg, _ = opcodes[k]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Fix off-by-one in backward scans (index 0 is currently skipped).

The ranges at Line 2251 and Line 2266 exclude index 0 due Python’s stop-exclusive semantics. This misses stream-start GLOBAL/STACK_GLOBAL references and can misclassify memo safety.

🔧 Suggested fix
-            for j in range(i - 1, max(0, i - 4), -1):
+            for j in range(i - 1, max(-1, i - 5), -1):
@@
-                    for k in range(j - 1, max(0, j - 10), -1):
+                    for k in range(j - 1, max(-1, j - 11), -1):
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2251 - 2267, The backward
scan ranges skip index 0 due to Python's exclusive stop; update the range stop
arguments used in the backward scans in pickle_scanner.py (the for loops that
start with for j in range(i - 1, max(0, i - 4), -1) and for k in range(j - 1,
max(0, j - 10), -1)) so the stop allows hitting index 0 (e.g., use max(-1, i -
4) and max(-1, j - 10) or otherwise set the stop to -1) to include the
stream-start GLOBAL/STACK_GLOBAL entries when computing _safe_memo in the
surrounding logic that references opcodes, prev_opcode, prev_arg, and
_is_safe_ml_global.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant