fix: suppress false positives in PaddlePaddle scanner #586

yash2998chhabria wants to merge 9 commits into main
Conversation
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by PR #518:
- network: smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver, ssl, requests, aiohttp
- compilation: codeop, marshal, types, compileall, py_compile
- FFI: _ctypes
- debugging: bdb, trace
- operator bypasses: _operator, functools
- pickle recursion: pickle, _pickle, dill, cloudpickle, joblib
- filesystem: tempfile, filecmp, fileinput, glob, distutils, pydoc, pexpect
- venv/pip: venv, ensurepip, pip
- threading: signal, _signal, threading
- more: webbrowser, asyncio, mmap, select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3, _sqlite3, doctest, idlelib, lib2to3, uuid

Add uuid as a novel detection (VULN-6: _get_command_stdout calls subprocess.Popen) and pkgutil.resolve_name as a dynamic-resolution trampoline. Mirror the new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix multi-stream opcode analysis so detailed checks run across all pickle streams. Add NEWOBJ_EX to all opcode check lists. Includes 10 regression tests covering the pkgutil trampoline, uuid RCE, multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3, tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…e scanning

- Fix test regression: track first pickle STOP position and use it for binary content scanning instead of f.tell() (which advances past multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data after a valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial streams (binary tensor data misinterpreted as opcodes) to prevent MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes between streams, preventing single-byte bypass of multi-stream detection
- Track the pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE calls referencing safe callables via the memo are not counted as dangerous opcodes
- Reset dangerous-opcode counters at STOP boundaries so each stream is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of f.tell(), which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
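The resync behaviour described in this commit can be illustrated with a small sketch over pickletools. `_MAX_RESYNC_BYTES` and `scan_streams` are hypothetical names; the real scanner's buffering and partial-stream handling are more involved:

```python
import io
import pickle
import pickletools

# Assumed cap on junk separator bytes between streams (illustrative).
_MAX_RESYNC_BYTES = 256

def scan_streams(data: bytes) -> list[list[str]]:
    """Collect opcode names per pickle stream, skipping junk separators."""
    streams = []
    pos = 0
    skipped = 0
    while pos < len(data) and skipped < _MAX_RESYNC_BYTES:
        buf = io.BytesIO(data[pos:])
        try:
            ops = [op.name for op, _arg, _p in pickletools.genops(buf)]
        except Exception:
            pos += 1       # resync: skip one separator byte and retry,
            skipped += 1   # discarding any partially parsed stream
            continue
        streams.append(ops)
        pos += buf.tell()  # advance past this stream's STOP opcode
        skipped = 0
    return streams

# Two pickle streams separated by junk bytes that are not valid opcodes.
payload = pickle.dumps([1, 2]) + b"\xff\xff" + pickle.dumps({"a": 3})
print(len(scan_streams(payload)))  # 2
```

Without the resync loop, the two junk bytes would end the scan after the first stream, which is exactly the single-byte bypass the commit describes.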
…and .pdmodel files
Two false-positive sources are fixed:
1. Hex escape FP (.pdiparams): The regex `\x[0-9a-fA-F]{2}` fired on
every .pdiparams file because raw float32 tensor data, when decoded
as lossy UTF-8, produces abundant \xNN sequences. The `__[\w]+__`
(magic-method) pattern had the same issue. Both patterns are now
skipped for .pdiparams binary weight files while still being checked
for .pdmodel files.
2. Magic byte mismatch (.pdmodel/.pdiparams): PaddlePaddle protobuf
model descriptors (.pdmodel) and raw weight tensors (.pdiparams)
have no distinctive magic bytes, so `detect_file_format_from_magic`
legitimately returns "unknown". The file-type validator now accepts
this for the paddle format, matching the approach used for other
formats like flax_msgpack and tflite.
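The hex-escape false positive in point 1 can be reproduced with a short sketch. The decode mode (`errors="backslashreplace"`) is an assumption made here so the `\xNN` sequences become literal text; the scanner's actual chunk handling may differ:

```python
import random
import re
import struct

# Pack 1024 random-looking float32 values, as a stand-in for a slice
# of a .pdiparams weight tensor.
random.seed(42)
values = [random.uniform(-1.0, 1.0) for _ in range(1024)]
raw = struct.pack(f"<{len(values)}f", *values)

# Lenient decoding turns invalid UTF-8 bytes into literal "\xNN" text,
# which the hex-escape pattern then matches on benign weight data.
text = raw.decode("utf-8", errors="backslashreplace")
hex_escape = re.compile(r"\\x[0-9a-fA-F]{2}")
matches = hex_escape.findall(text)
print(len(matches) > 0)  # True: benign weight bytes trip the pattern
```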
Validated against real HuggingFace models:
- hash71/paddleocr-bengali-english-rec (.pdmodel + .pdiparams)
- PaddlePaddle/PP-LCNet_x0_25_textline_ori (.pdiparams)
- PaddlePaddle/PaddleOCR-VL (.pdmodel + .pdiparams)
- PaddlePaddle/PP-OCRv5_server_det (.pdiparams)
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
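The validator behaviour in point 2 can be sketched as follows; `MAGIC_LESS_FORMATS`, `PADDLE_EXTENSIONS`, and both function names are illustrative, not the project's actual API:

```python
# Formats that legitimately have no distinctive magic bytes (assumed set).
MAGIC_LESS_FORMATS = {"paddle", "flax_msgpack", "tflite"}
PADDLE_EXTENSIONS = (".pdmodel", ".pdiparams")

def detect_format_from_extension(path: str) -> str:
    """Map a file extension to a format identifier (sketch)."""
    if path.endswith(PADDLE_EXTENSIONS):
        return "paddle"
    return "unknown"

def file_type_matches(path: str, magic_format: str) -> bool:
    """Accept 'unknown' magic detection for formats without magic bytes."""
    ext_format = detect_format_from_extension(path)
    # Paddle protobuf descriptors and raw weight tensors carry no
    # distinctive magic bytes, so "unknown" is acceptable here.
    if magic_format == "unknown" and ext_format in MAGIC_LESS_FORMATS:
        return True
    return ext_format == magic_format

print(file_type_matches("weights.pdiparams", "unknown"))  # True
```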
Walkthrough

Adds multi-stream, resynchronizing pickle parsing with NEWOBJ_EX-aware detections and expanded dangerous-module/function blocklists; adds Paddle weight (.pdiparams) FP suppression during string/regex checks and accepts Paddle extensions for file-type validation; includes extensive tests for both scanners.
Sequence Diagrams

sequenceDiagram
participant FileIO as File Reader
participant Parser as Pickle Parser<br/>(multi_stream=True)
participant Buffer as Stream Buffer
participant Checker as Dangerous Module<br/>Checker
participant Result as Scan Result
FileIO->>Parser: Read pickle data
Parser->>Parser: Parse first stream until STOP opcode
Note over Parser: first_stream = True
alt First stream parsed
Parser->>Checker: Yield opcodes (GLOBAL, REDUCE, NEWOBJ_EX ...)
Checker->>Result: Check against ALWAYS_DANGEROUS_MODULES/Funcs
end
Parser->>Parser: STOP reached
Note over Parser: If more data and multi_stream
Parser->>Buffer: Buffer subsequent bytes
Note over Parser: Resync: skip up to 4096 bytes to align
Buffer->>Parser: Provide buffered stream
Parser->>Parser: Parse subsequent stream
alt Parse succeeds
Parser->>Checker: Yield opcodes from subsequent stream
Checker->>Result: Record issues if found
else Parse fails
Note over Parser: Skip this stream (parse error) and continue
end
Parser->>Result: Set first_pickle_end_pos
Result-->>FileIO: Return scan results
sequenceDiagram
participant FileChecker as File Validator
participant Scanner as Paddle Scanner
participant BinaryDetect as Binary Weight<br/>Detector
participant ChunkCheck as Chunk Pattern<br/>Checker
participant Result as Scan Result
FileChecker->>FileChecker: Check extension (.pdmodel/.pdiparams)
alt Paddle extension
FileChecker->>Scanner: Accept file type (bypass magic bytes)
end
Scanner->>Scanner: Read file in chunks
alt ext == .pdiparams
Scanner->>BinaryDetect: is_binary_weights = True
else
Scanner->>BinaryDetect: is_binary_weights = False
end
loop For each chunk
Scanner->>ChunkCheck: _check_chunk(chunk,...,is_binary_weights)
alt is_binary_weights == True
ChunkCheck->>ChunkCheck: Skip _BINARY_WEIGHT_SKIP_PATTERNS
else
ChunkCheck->>ChunkCheck: Apply full pattern checks
end
ChunkCheck->>Result: Append findings
end
Result-->>FileChecker: Return scan results
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 3 checks passed
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tests/scanners/test_paddle_scanner.py (1)
10-156: 🛠️ Refactor suggestion | 🟠 Major

Add type hints to newly added test functions
The new test functions are untyped. This violates the repository’s Python typing rule.
🔧 Example pattern to apply across this file
```diff
+from pathlib import Path
@@
-def test_paddle_scanner_can_handle(tmp_path):
+def test_paddle_scanner_can_handle(tmp_path: Path) -> None:
@@
-def test_pdmodel_magic_bytes_validation_passes(tmp_path):
+def test_pdmodel_magic_bytes_validation_passes(tmp_path: Path) -> None:
```

As per coding guidelines: "Always include type hints in Python code".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scanners/test_paddle_scanner.py` around lines 10 - 156, Add Python type hints to each test function that accepts the tmp_path fixture and returns nothing: annotate the tmp_path parameter as pathlib.Path (or Path) and add a -> None return type for the following functions: test_paddle_scanner_can_handle, test_paddle_scanner_cannot_handle_without_paddle, test_paddle_scanner_detects_suspicious_pattern, test_paddle_scanner_missing_dependency, test_pdiparams_hex_escape_not_flagged, test_pdiparams_dunder_pattern_not_flagged, test_pdiparams_real_threats_still_detected, test_pdmodel_hex_escape_still_flagged, test_pdmodel_magic_bytes_validation_passes, and test_pdiparams_magic_bytes_validation_passes; also add the appropriate import (import pathlib or from pathlib import Path) at the top of the file if not already present so the annotations resolve.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 1549-1553: The STOP branch currently resets
dangerous_opcode_count, consecutive_dangerous, and max_consecutive but fails to
clear the pickle scanner's memo safety state; update the STOP handling so that
_safe_memo (the dict/structure tracking safe memoized indices) is cleared or
reinitialized when opcode.name == "STOP" to avoid carrying memoized-safety
across streams (this ensures BINGET/REDUCE pairs in later streams are
re-evaluated for safety); locate the STOP branch in pickle_scanner.py and add
logic to reset _safe_memo alongside the other counters, then run tests for both
benign and malicious multi-stream pickles to verify behavior.
- Around line 91-95: The current early return on "if stream_error and
had_opcodes:" aborts the entire multi-stream scan, letting a crafted partial
secondary stream hide later malicious streams; instead of returning,
discard/reset the current buffer/state for that stream and continue scanning
subsequent streams (replace the return with logic to skip the rest of the
current stream and proceed, e.g., continue the outer loop), preserving any
detection bookkeeping; modify the block around stream_error and had_opcodes to
reset buffer/flags and continue rather than exiting so later streams are still
inspected.
In `@modelaudit/utils/file/detection.py`:
- Around line 492-497: The paddle params file (.pdiparams) isn't included in
detect_format_from_extension_pattern_matching, so ext_format becomes "unknown"
and the ext_format == "paddle" branch is never reached; update
detect_format_from_extension_pattern_matching to recognise ".pdiparams" (in
addition to ".pdmodel") and return the same "paddle" format identifier, or alter
the extension-checking logic used to set ext_format to treat both ".pdmodel" and
".pdiparams" as "paddle" so the branch guarded by ext_format == "paddle"
executes for .pdiparams files (refer to
detect_format_from_extension_pattern_matching and the variable ext_format).
---
Outside diff comments:
In `@tests/scanners/test_paddle_scanner.py`:
- Around line 10-156: Add Python type hints to each test function that accepts
the tmp_path fixture and returns nothing: annotate the tmp_path parameter as
pathlib.Path (or Path) and add a -> None return type for the following
functions: test_paddle_scanner_can_handle,
test_paddle_scanner_cannot_handle_without_paddle,
test_paddle_scanner_detects_suspicious_pattern,
test_paddle_scanner_missing_dependency, test_pdiparams_hex_escape_not_flagged,
test_pdiparams_dunder_pattern_not_flagged,
test_pdiparams_real_threats_still_detected,
test_pdmodel_hex_escape_still_flagged,
test_pdmodel_magic_bytes_validation_passes, and
test_pdiparams_magic_bytes_validation_passes; also add the appropriate import
(import pathlib or from pathlib import Path) at the top of the file if not
already present so the annotations resolve.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Disabled knowledge base sources:
- Linear integration is disabled
You can enable these sources in your CodeRabbit configuration.
📒 Files selected for processing (7)
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/paddle_scanner.py
- modelaudit/scanners/pickle_scanner.py
- modelaudit/utils/file/detection.py
- tests/conftest.py
- tests/scanners/test_paddle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
if opcode.name == "STOP":
    dangerous_opcode_count = 0
    consecutive_dangerous = 0
    max_consecutive = 0
    continue
```
Reset memo safety state at stream boundaries
At Line 1549-Line 1553, counters are reset on STOP, but _safe_memo is not. In multi-stream mode, memo indices are reused per stream; carrying _safe_memo forward can incorrectly mark later BINGET→REDUCE callables as safe.
🔧 Proposed fix
```diff
 if opcode.name == "STOP":
     dangerous_opcode_count = 0
     consecutive_dangerous = 0
     max_consecutive = 0
+    _safe_memo.clear()
     continue
```

Based on learnings: Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 1549 - 1553, The STOP
branch currently resets dangerous_opcode_count, consecutive_dangerous, and
max_consecutive but fails to clear the pickle scanner's memo safety state;
update the STOP handling so that _safe_memo (the dict/structure tracking safe
memoized indices) is cleared or reinitialized when opcode.name == "STOP" to
avoid carrying memoized-safety across streams (this ensures BINGET/REDUCE pairs
in later streams are re-evaluated for safety); locate the STOP branch in
pickle_scanner.py and add logic to reset _safe_memo alongside the other
counters, then run tests for both benign and malicious multi-stream pickles to
verify behavior.
Add type hints (tmp_path: Path, -> None) to all test functions in test_paddle_scanner.py and add .pdiparams to the detect_format_from_extension_pattern_matching() match statement so it returns "paddle" for both .pdmodel and .pdiparams extensions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/scanners/test_paddle_scanner.py`:
- Around line 99-114: The test test_pdiparams_real_threats_still_detected should
first assert that scanner produced issues before checking patterns: add an
assertion that result.issues is non-empty (or len(result.issues) > 0) and/or
assert every issue has a "pattern" detail present, then build patterns_found
from those guaranteed keys and keep the existing substring checks for "import
os", "eval(", and "os.system"; reference result.issues and the patterns_found
set so the test fails loudly if no issues or missing "pattern" keys are present.
- Line 59: Move the local import "import random" out of the function and into
the module-level imports at the top of the test file; specifically, remove the
in-function import inside test_pdiparams_hex_escape_not_flagged and add "import
random" alongside the other top-of-file imports so the test function uses the
module-level random symbol instead.
📒 Files selected for processing (2)
- modelaudit/utils/file/detection.py
- tests/scanners/test_paddle_scanner.py
```python
    # Build a payload of 1024 random-looking float32 values whose byte
    # representation, when decoded with errors="ignore", will contain
    # sequences that match \\x[0-9a-fA-F]{2}.
    import random
```
Move import random to the module-level imports.
Placing an import inside a function body is non-idiomatic in Python and violates PEP 8. Move it alongside the other imports at the top of the file.
🔧 Proposed fix

```diff
 import struct
 from pathlib import Path
+import random
 from unittest.mock import patch
```

Then remove the local import from within test_pdiparams_hex_escape_not_flagged:

```diff
-    import random
-
-    random.seed(42)
+    random.seed(42)
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
import struct
from pathlib import Path
import random
from unittest.mock import patch

# ... rest of module code ...

def test_pdiparams_hex_escape_not_flagged():
    random.seed(42)
    # ... rest of test function ...
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scanners/test_paddle_scanner.py` at line 59, Move the local import
"import random" out of the function and into the module-level imports at the top
of the test file; specifically, remove the in-function import inside
test_pdiparams_hex_escape_not_flagged and add "import random" alongside the
other top-of-file imports so the test function uses the module-level random
symbol instead.
```python
def test_pdiparams_real_threats_still_detected(tmp_path: Path) -> None:
    """Even with FP suppression, genuinely suspicious content in a .pdiparams
    file (e.g. 'import os', 'eval(') must still be reported."""
    content = b"padding " + b"import os" + b" eval(payload) " + b"os.system('rm -rf /')"
    path = tmp_path / "bad_weights.pdiparams"
    path.write_bytes(content)

    with patch("modelaudit.scanners.paddle_scanner.HAS_PADDLE", True):
        scanner = PaddleScanner()
        result = scanner.scan(str(path))

    patterns_found = {i.details.get("pattern", "") for i in result.issues}
    # These should NOT be suppressed
    assert any("import os" in p for p in patterns_found), "import os should be detected"
    assert any("eval(" in p for p in patterns_found), "eval( should be detected"
    assert any("os.system" in p for p in patterns_found), "os.system should be detected"
```
🧹 Nitpick | 🔵 Trivial
Consider strengthening the threat-detection assertions.
patterns_found is a set of strings built from i.details.get("pattern", ""). If result.issues is empty or an issue lacks a "pattern" key, the any(...) checks fail without indicating which of those cases occurred. A more robust approach first asserts that result.issues is non-empty:
♻️ Proposed hardening
```diff
+    assert result.issues, "Expected at least one issue for content with import os / eval / os.system"
     patterns_found = {i.details.get("pattern", "") for i in result.issues}
     # These should NOT be suppressed
     assert any("import os" in p for p in patterns_found), "import os should be detected"
     assert any("eval(" in p for p in patterns_found), "eval( should be detected"
     assert any("os.system" in p for p in patterns_found), "os.system should be detected"
```

🤖 Prompt for AI Agents
assert any("os.system" in p for p in patterns_found), "os.system should be detected"🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@tests/scanners/test_paddle_scanner.py` around lines 99 - 114, The test
test_pdiparams_real_threats_still_detected should first assert that scanner
produced issues before checking patterns: add an assertion that result.issues is
non-empty (or len(result.issues) > 0) and/or assert every issue has a "pattern"
detail present, then build patterns_found from those guaranteed keys and keep
the existing substring checks for "import os", "eval(", and "os.system";
reference result.issues and the patterns_found set so the test fails loudly if
no issues or missing "pattern" keys are present.
Resolve merge conflicts in pickle_scanner.py: - BLOCKLISTED_CALLABLES: keep both HEAD's pkgutil/uuid entries and main's profiling/debugging/ctypes entries - ALWAYS_DANGEROUS_MODULES: combine HEAD's comprehensive module blocklist with main's ctypes/profiling/threading/import entries and linecache/logging exclusion note - REDUCE opcode handling: use main's resolved_callables approach with HEAD's _safe_memo fallback for defense-in-depth memo pattern handling
Actionable comments posted: 1
♻️ Duplicate comments (2)
modelaudit/scanners/pickle_scanner.py (2)
91-95: ⚠️ Potential issue | 🔴 Critical

Do not terminate multi-stream scanning on partial secondary streams.

At Line 94, `return` stops all further stream inspection. A crafted partial stream can prevent scanning of subsequent malicious streams.

Based on learnings: Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners.

🔧 Proposed fix

```diff
-    if stream_error and had_opcodes:
-        # Partial stream: binary data was misinterpreted as opcodes.
-        # Discard the buffer and stop — no more valid streams.
-        return
+    if stream_error and had_opcodes:
+        # Partial stream: discard buffered opcodes and resync,
+        # then continue scanning subsequent streams.
+        file_obj.seek(stream_start, 0)
+        if not file_obj.read(1):
+            return
+        resync_skipped += 1
+        if resync_skipped >= _MAX_RESYNC_BYTES:
+            return
+        continue
```
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 91 - 95, The current logic in pickle_scanner.py inside the loop that checks stream_error and had_opcodes returns early (return) which terminates multi-stream scanning and allows a crafted partial secondary stream to block inspection of later streams; change this to discard the current buffer and continue scanning subsequent streams (replace the return with logic that resets/clears the buffer state and uses continue so the outer loop proceeds), ensuring variables referenced (stream_error, had_opcodes) are handled the same way for the primary stream case and add/update unit tests to cover a partial secondary stream followed by a malicious stream to verify detections still occur.
2209-2213: ⚠️ Potential issue | 🟠 Major

Reset `_safe_memo` at stream boundaries.

At Lines 2209-2213, counters reset on `STOP`, but `_safe_memo` persists. This can incorrectly treat later stream callables as safe.

Based on learnings: Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners.

🔧 Proposed fix

```diff
 if opcode.name == "STOP":
     dangerous_opcode_count = 0
     consecutive_dangerous = 0
     max_consecutive = 0
+    _safe_memo.clear()
     continue
```
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 2209 - 2213, The STOP handler currently resets dangerous_opcode_count, consecutive_dangerous, and max_consecutive but does not clear _safe_memo, which lets safe markers persist across stream boundaries; update the STOP branch (the block checking if opcode.name == "STOP") to also reset or clear the _safe_memo data structure so memoized safety does not leak into subsequent streams, and run unit tests covering both benign and malicious pickle streams to verify security detections still hold.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/detectors/suspicious_symbols.py`:
- Line 229: The rule treating module keys as single tokens misses dotted modules
like logging.config; update the detector in
modelaudit/detectors/suspicious_symbols.py to account for dotted imports by
matching full dotted names and attributes (e.g., accept "logging.config" as a
module key or allow values like "config.listen"), and/or normalize checks by
joining module + attribute into "module.attribute" before lookup (update the
lookup logic in the function that resolves imports/attributes, and adjust the
suspicious_symbols mapping variable used there to include "logging.config":
["listen"] or to perform a combined-key check).
---
Duplicate comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 91-95: The current logic in pickle_scanner.py inside the loop that
checks stream_error and had_opcodes returns early (return) which terminates
multi-stream scanning and allows a crafted partial secondary stream to block
inspection of later streams; change this to discard the current buffer and
continue scanning subsequent streams (replace the return with logic that
resets/clears the buffer state and uses continue so the outer loop proceeds),
ensuring variables referenced (stream_error, had_opcodes) are handled the same
way for the primary stream case and add/update unit tests to cover a partial
secondary stream followed by a malicious stream to verify detections still
occur.
- Around line 2209-2213: The STOP handler currently resets
dangerous_opcode_count, consecutive_dangerous, and max_consecutive but does not
clear _safe_memo, which lets safe markers persist across stream boundaries;
update the STOP branch (the block checking if opcode.name == "STOP") to also
reset or clear the _safe_memo data structure so memoized safety does not leak
into subsequent streams, and run unit tests covering both benign and malicious
pickle streams to verify security detections still hold.
📒 Files selected for processing (3)
- modelaudit/detectors/suspicious_symbols.py
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_pickle_scanner.py
```python
"_sqlite3": "*",
"select": "*",
"selectors": "*",
"logging": ["config"],
```
logging.config.listen is not covered by this rule as written.
At Line 229, `"logging": ["config"]` does not match `GLOBAL logging.config listen` (the module is `logging.config`). This creates a detection gap for the listener path.
🔧 Proposed fix
```diff
-"logging": ["config"],
+"logging": ["config"],
+"logging.config": ["listen"],
```

📝 Committable suggestion
```python
"logging": ["config"],
"logging.config": ["listen"],
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/detectors/suspicious_symbols.py` at line 229, The rule treating
module keys as single tokens misses dotted modules like logging.config; update
the detector in modelaudit/detectors/suspicious_symbols.py to account for dotted
imports by matching full dotted names and attributes (e.g., accept
"logging.config" as a module key or allow values like "config.listen"), and/or
normalize checks by joining module + attribute into "module.attribute" before
lookup (update the lookup logic in the function that resolves
imports/attributes, and adjust the suspicious_symbols mapping variable used
there to include "logging.config": ["listen"] or to perform a combined-key
check).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Increase _MAX_RESYNC_BYTES from 256 to 4096 with security trade-off documentation to handle larger non-pickle gaps in Paddle files
- Clear the _safe_memo dict at STOP opcode boundaries so memo entries from one pickle stream cannot leak into the next stream's analysis
- Change partial-stream handling from an early return to continue so that malicious pickle streams hidden after binary data are not missed

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)
2139-2145: ⚠️ Potential issue | 🔴 Critical

All four opcodes are unreachable due to type mismatch, not just `NEWOBJ_EX`.

The condition at line 2139 applies `isinstance(arg, str)` to `INST`, `OBJ`, `NEWOBJ`, and `NEWOBJ_EX`. However:

- `INST` returns `arg` as a `(module, name)` tuple
- `OBJ`, `NEWOBJ`, `NEWOBJ_EX` all return `arg` as `None`

None of these ever match `isinstance(arg, str)`, so the scanner completely fails to detect these dangerous pickle deserialization techniques. This is a security detection gap.

🔧 Proposed fix

```diff
-    if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
+    if opcode.name == "INST" and isinstance(arg, tuple) and len(arg) == 2:
+        module, name = arg
+        return {
+            "pattern": "INST_EXECUTION",
+            "module": module,
+            "name": name,
+            "position": pos,
+            "opcode": opcode.name,
+        }
+
+    if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
         return {
             "pattern": f"{opcode.name}_EXECUTION",
-            "argument": arg,
             "position": pos,
             "opcode": opcode.name,
         }
```
Verify each finding against the current code and only fix it if needed. In `@modelaudit/scanners/pickle_scanner.py` around lines 2139 - 2145, The check is wrong: INST yields arg as a (module,name) tuple and OBJ/NEWOBJ/NEWOBJ_EX yield arg as None, so the current isinstance(arg, str) makes all four branches unreachable; update the logic in pickle_scanner.py to handle each opcode type explicitly — for opcode.name == "INST" accept and validate a tuple (module, name) (e.g., isinstance(arg, tuple) and length==2 and str elements) and use that tuple (or a joined "module:name") as the "argument" in the returned dict, and for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] accept arg is None (no isinstance check) and still return the pattern "{opcode.name}_EXECUTION" with position and opcode fields so these dangerous opcodes are detected.
♻️ Duplicate comments (2)
tests/scanners/test_paddle_scanner.py (2)
59-59: 🛠️ Refactor suggestion | 🟠 Major

Move `import random` to the module-level imports.

The `import random` statement should be at the top of the file alongside other imports, not inside the function body. This is non-idiomatic and violates PEP 8.

🔧 Proposed fix

At top of file:

```diff
 import struct
 from pathlib import Path
+import random
 from unittest.mock import patch
```

Then in the function:

```diff
-    import random
-
-    random.seed(42)
+    random.seed(42)
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scanners/test_paddle_scanner.py` at line 59, Move the local "import random" out of the test function and add it with the other module-level imports at the top of tests/scanners/test_paddle_scanner.py; remove the in-function import so the test uses the top-level random import instead. This ensures PEP8-compliant imports and avoids re-importing inside the test function where "import random" currently appears.
91-106: 🧹 Nitpick | 🔵 Trivial

Consider strengthening the threat-detection assertions.

The assertions on lines 104-106 fail without distinguishing whether `result.issues` was empty or the issues simply lacked a `"pattern"` key. Adding an explicit check that issues were found makes the test more robust.

♻️ Proposed hardening

```diff
     with patch("modelaudit.scanners.paddle_scanner.HAS_PADDLE", True):
         scanner = PaddleScanner()
         result = scanner.scan(str(path))

+    assert result.issues, "Expected at least one issue for content with import os / eval / os.system"
     patterns_found = {i.details.get("pattern", "") for i in result.issues}
     # These should NOT be suppressed
     assert any("import os" in p for p in patterns_found), "import os should be detected"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/scanners/test_paddle_scanner.py` around lines 91 - 106, The test test_pdiparams_real_threats_still_detected should explicitly assert that scanner results contain issues before checking patterns: add an assertion that result.issues is non-empty (e.g., assert result.issues) and/or assert that patterns_found is non-empty to avoid silent passes when issues lack a "pattern" key, then keep the existing asserts that "import os", "eval(", and "os.system" appear in patterns_found; reference result.issues, patterns_found, and the test function name when making the changes.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 120-121: The early return on resync_skipped >= _MAX_RESYNC_BYTES
(symbols: resync_skipped, _MAX_RESYNC_BYTES in pickle_scanner.py) creates a
blind spot where a second malicious pickle stream after ~4KB of padding is never
scanned; remove or replace the hard cutoff with logic that continues scanning
for additional pickle stream headers (e.g., keep searching for valid
MAGIC/opcodes until EOF) while enforcing an overall resource limit (total
scanned bytes or time) to prevent runaway processing. Update the loop that uses
resync_skipped to either reset/adjust the counter when non-pickle bytes are
encountered or eliminate the immediate return and instead break only on EOF or a
safe global cap, and add unit tests covering benign large-padding files and
malicious files with a second stream after >_MAX_RESYNC_BYTES to verify
detection.
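The alternative the comment describes (keep searching for stream headers until EOF, bounded by one global budget) can be sketched as follows. This is an illustrative stand-in, not the scanner's `_genops_with_fallback`: the function name, the `0x80` PROTO-marker resync heuristic, and the byte budget are all assumptions.

```python
import io
import pickle
import pickletools

def scan_streams(data: bytes, max_total_bytes: int = 10_000_000) -> list:
    """Yield the opcode list of every pickle stream found in `data`,
    resyncing past junk bytes under a single global cap instead of a
    fixed per-gap cutoff (hypothetical sketch of the suggested fix)."""
    streams = []
    pos = 0
    limit = min(len(data), max_total_bytes)  # one overall resource cap
    while pos < limit:
        buf = io.BytesIO(data[pos:])
        try:
            ops = list(pickletools.genops(buf))
            streams.append(ops)
            pos += buf.tell()  # continue right after this stream's STOP
        except ValueError:
            # Not a pickle stream here: jump to the next PROTO marker
            # rather than giving up after a fixed number of junk bytes.
            nxt = data.find(b"\x80", pos + 1)
            if nxt == -1:
                break
            pos = nxt
    return streams
```

A second stream hidden behind an arbitrarily large padding gap is still reached, because only the total bytes examined are capped, not the gap size.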
---
Outside diff comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 2139-2145: The check is wrong: INST yields arg as a (module,name)
tuple and OBJ/NEWOBJ/NEWOBJ_EX yield arg as None, so the current isinstance(arg,
str) makes all four branches unreachable; update the logic in pickle_scanner.py
to handle each opcode type explicitly — for opcode.name == "INST" accept and
validate a tuple (module, name) (e.g., isinstance(arg, tuple) and length==2 and
str elements) and use that tuple (or a joined "module:name") as the "argument"
in the returned dict, and for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] accept
arg is None (no isinstance check) and still return the pattern
"{opcode.name}_EXECUTION" with position and opcode fields so these dangerous
opcodes are detected.
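The per-opcode handling described above can be sketched like this (a hypothetical helper, not the scanner's code). To stay robust across the argument shapes mentioned in the comment, it normalises the INST argument whether pickletools reports it as a `"module name"` string or a `(module, name)` tuple, and flags OBJ/NEWOBJ/NEWOBJ_EX without any `isinstance` guard, since their `arg` is `None`.

```python
import io
import pickletools

def flag_execution_opcodes(data: bytes) -> list:
    """Classify each opcode's argument by its actual shape instead of
    assuming isinstance(arg, str), which made all four branches
    unreachable (sketch of the suggested fix)."""
    findings = []
    for opcode, arg, pos in pickletools.genops(io.BytesIO(data)):
        if opcode.name == "INST":
            # Normalise tuple or space-joined string forms of (module, name).
            argument = " ".join(arg) if isinstance(arg, tuple) else str(arg)
            findings.append(
                {"pattern": "INST_EXECUTION", "argument": argument, "position": pos}
            )
        elif opcode.name in ("OBJ", "NEWOBJ", "NEWOBJ_EX"):
            # These opcodes carry no inline argument; flag unconditionally.
            findings.append(
                {"pattern": f"{opcode.name}_EXECUTION", "position": pos}
            )
    return findings
```

For example, the classic protocol-0 payload `(S'...'\nios\nsystem\n.` now yields an `INST_EXECUTION` finding whose argument names `os` and `system`.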
---
Duplicate comments:
In `@tests/scanners/test_paddle_scanner.py`:
- Line 59: Move the local "import random" out of the test function and add it
with the other module-level imports at the top of
tests/scanners/test_paddle_scanner.py; remove the in-function import so the test
uses the top-level random import instead. This ensures PEP8-compliant imports
and avoids re-importing inside the test function where "import random" currently
appears.
- Around line 91-106: The test test_pdiparams_real_threats_still_detected should
explicitly assert that scanner results contain issues before checking patterns:
add an assertion that result.issues is non-empty (e.g., assert result.issues)
and/or assert that patterns_found is non-empty to avoid silent passes when
issues lack a "pattern" key, then keep the existing asserts that "import os",
"eval(", and "os.system" appear in patterns_found; reference result.issues,
patterns_found, and the test function name when making the changes.
ℹ️ Review info
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
📒 Files selected for processing (2)
- modelaudit/scanners/pickle_scanner.py
- tests/scanners/test_paddle_scanner.py
```python
        if resync_skipped >= _MAX_RESYNC_BYTES:
            return
```
Hard resync cutoff creates a multi-stream evasion path.
At Line 120-Line 121, parsing stops after 4096 non-pickle bytes. A malicious second pickle stream placed after >4KB padding will not be scanned.
🔧 Proposed hardening
```diff
-def _genops_with_fallback(file_obj: BinaryIO, *, multi_stream: bool = False) -> Any:
+def _genops_with_fallback(
+    file_obj: BinaryIO,
+    *,
+    multi_stream: bool = False,
+    max_resync_bytes: int | None = None,
+) -> Any:
@@
-    _MAX_RESYNC_BYTES = 4096
+    _MAX_RESYNC_BYTES = max_resync_bytes
@@
-    if resync_skipped >= _MAX_RESYNC_BYTES:
+    if _MAX_RESYNC_BYTES is not None and resync_skipped >= _MAX_RESYNC_BYTES:
         return
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@modelaudit/scanners/pickle_scanner.py` around lines 120 - 121, The early
return on resync_skipped >= _MAX_RESYNC_BYTES (symbols: resync_skipped,
_MAX_RESYNC_BYTES in pickle_scanner.py) creates a blind spot where a second
malicious pickle stream after ~4KB of padding is never scanned; remove or
replace the hard cutoff with logic that continues scanning for additional pickle
stream headers (e.g., keep searching for valid MAGIC/opcodes until EOF) while
enforcing an overall resource limit (total scanned bytes or time) to prevent
runaway processing. Update the loop that uses resync_skipped to either
reset/adjust the counter when non-pickle bytes are encountered or eliminate the
immediate return and instead break only on EOF or a safe global cap, and add
unit tests covering benign large-padding files and malicious files with a second
stream after >_MAX_RESYNC_BYTES to verify detection.
…r Windows CI

The test asserted that compressed joblib format issues must contain "opcode" in the message, but on Windows the scanner reports "Unable to parse pickle file: MemoryError" instead. Broadened the keyword check to accept any of: opcode, format, parse, unable, invalid, truncated, corrupted. Also improved the assertion error message to show actual issues for easier debugging.

Validated with 5 real PaddlePaddle .pdiparams models (PP-DocLayoutV3, PP-OCRv5_mobile_det, PP-OCRv5_mobile_rec, PP-DocLayoutV2, PP-LCNet_x1_0_doc_ori) - all scan clean with zero false positives.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
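The broadened keyword check from this commit can be sketched as a small predicate (the helper name is hypothetical; the keyword list is taken verbatim from the commit message):

```python
# Keywords accepted as evidence of a parse/format failure, per the commit
# message above. The helper name is illustrative, not the test's actual code.
PARSE_ERROR_KEYWORDS = (
    "opcode", "format", "parse", "unable", "invalid", "truncated", "corrupted",
)

def is_parse_failure(message: str) -> bool:
    msg = message.lower()
    return any(keyword in msg for keyword in PARSE_ERROR_KEYWORDS)
```

This accepts both the Linux-style "bad opcode" wording and the Windows "Unable to parse pickle file: MemoryError" wording without hard-coding either platform's message.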
Summary
- Suppress the `\x[0-9a-fA-F]{2}` and `__[\w]+__` regex patterns for `.pdiparams` files, since raw float32 tensor data decoded as lossy UTF-8 produces abundant matches. Genuine threats (`import os`, `eval(`, `os.system`, etc.) are still detected.
- Relax `validate_file_type()` so that `.pdmodel` (protobuf) and `.pdiparams` (raw tensors) pass validation despite having no distinctive magic bytes, matching the approach used for flax_msgpack and tflite.

Test plan

- `.pdiparams` files no longer trigger hex-escape or dunder false positives
- `.pdmodel` files no longer trigger magic byte mismatch warnings
- Malicious content in `.pdiparams` is still detected (import os, eval, os.system)

🤖 Generated with Claude Code
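The false-positive mechanism is easy to reproduce. The sketch below uses `repr()` as a stand-in for whatever lossy text rendering the scanner applies to tensor bytes (an assumption; the scanner's exact decoding path is not shown in this PR): once raw float32 data is rendered into text, non-printable bytes surface as literal `\xNN` sequences that the hex-escape pattern matches abundantly.

```python
import random
import re
import struct

# Benign pseudo-tensor: 256 random float32 values packed little-endian.
random.seed(0)
tensor_bytes = struct.pack("<256f", *[random.uniform(-1.0, 1.0) for _ in range(256)])

# Render the bytes as text; non-printable bytes appear as literal \xNN.
rendered = repr(tensor_bytes)

# The hex-escape pattern named in the summary above.
hex_escape = re.compile(r"\\x[0-9a-fA-F]{2}")
matches = hex_escape.findall(rendered)
# Pure tensor data yields many matches, i.e. guaranteed false positives.
```

This is why suppressing the pattern for `.pdiparams` (while keeping string-level checks like `import os` and `eval(`) removes the noise without masking real payloads.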