
fix(deps): promote msgpack to core dependency for Flax scanner#583

Open
yash2998chhabria wants to merge 10 commits into main from fix/flax-dep

Conversation


@yash2998chhabria yash2998chhabria commented Feb 25, 2026

Summary

  • Promotes msgpack>=1.0.0,<2.0 from optional dependency (extras: flax) to a core dependency in pyproject.toml
  • The flax_msgpack scanner was registered but silently fell back to generic scanning when msgpack was not installed, causing .msgpack Flax/JAX model files to miss specialized security analysis
  • msgpack is a lightweight, zero-dependency C extension (~100KB) that is appropriate as a core dependency
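
The silent-fallback failure mode described above can be sketched as follows. This is a minimal illustration with hypothetical names (`pick_scanner` is not the modelaudit API): when the optional dependency is missing, nothing fails loudly, and `.msgpack` files are quietly downgraded to generic scanning.

```python
def pick_scanner(path: str, msgpack_available: bool) -> str:
    """Choose a scanner name for a file path (illustrative sketch only)."""
    if path.endswith(".msgpack"):
        if msgpack_available:
            return "flax_msgpack"
        # Before this PR: no error was raised, just a quiet coverage downgrade.
        return "generic"
    return "generic"

print(pick_scanner("model.msgpack", True))   # flax_msgpack
print(pick_scanner("model.msgpack", False))  # generic
```

Promoting msgpack to a core dependency makes the `msgpack_available=False` branch unreachable in normal installs.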

Test plan

  • Verified FlaxMsgpackScanner loads successfully with python -c "from modelaudit.scanners.flax_msgpack_scanner import FlaxMsgpackScanner; print('OK')"
  • Scanned 3 HuggingFace Flax models: sshleifer/tiny-gpt2, hf-internal-testing/tiny-bert-flax-only, patrickvonplaten/t5-tiny-random -- all recognized by flax_msgpack scanner (not generic fallback)
  • Regression tested 2 additional Flax models: ArthurZ/tiny-random-bert-flax-only, sanchit-gandhi/tiny-random-flax-bert -- both scanned correctly
  • Full test suite passes: 377 passed, 0 failed (python -m pytest tests/ -x -q --timeout=60)

🤖 Generated with Claude Code

Summary by CodeRabbit

  • New Features

    • Pickle scanner: multi-stream parsing, resync after non-pickle bytes, and propagation of pickle-boundary metadata for downstream binary scanning.
    • Flax/msgpack scanner: improved transformer-model detection including nested-model and name-key heuristics.
  • Behavior Changes

    • Scanner now flags a much larger set of stdlib/third-party modules and call patterns as suspicious in pickle contexts.
  • Tests

    • New regression suite for crafted pickle payloads, multi-stream scenarios, and expanded blocklist coverage.
  • Chores

    • Added msgpack dependency.

yash2998chhabria and others added 4 commits February 25, 2026 10:46
Expand ALWAYS_DANGEROUS_MODULES with ~45 new modules not covered by
PR #518: network (smtplib, imaplib, poplib, nntplib, xmlrpc, socketserver,
ssl, requests, aiohttp), compilation (codeop, marshal, types, compileall,
py_compile), FFI (_ctypes), debugging (bdb, trace), operator bypasses
(_operator, functools), pickle recursion (pickle, _pickle, dill,
cloudpickle, joblib), filesystem (tempfile, filecmp, fileinput, glob,
distutils, pydoc, pexpect), venv/pip (venv, ensurepip, pip), threading
(signal, _signal, threading), and more (webbrowser, asyncio, mmap,
select, selectors, logging, syslog, tarfile, zipfile, shelve, sqlite3,
_sqlite3, doctest, idlelib, lib2to3, uuid).

Add uuid as novel detection (VULN-6: _get_command_stdout calls
subprocess.Popen) and pkgutil.resolve_name as dynamic resolution
trampoline.

Mirror new modules in SUSPICIOUS_GLOBALS for defense-in-depth. Fix
multi-stream opcode analysis so detailed checks run across all pickle
streams. Add NEWOBJ_EX to all opcode check lists.

Includes 10 regression tests covering pkgutil trampoline, uuid RCE,
multi-stream exploit, NEWOBJ_EX, and spot-checks for smtplib, sqlite3,
tarfile, marshal, cloudpickle, and webbrowser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
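
The module blocklist described in the commit message above amounts to a lookup keyed on the module portion of a pickle GLOBAL argument. A hedged sketch (the module names come from the commit message; the helper and its signature are illustrative, not the scanner's real code):

```python
# Small subset of the expanded blocklist, for illustration only.
ALWAYS_DANGEROUS_MODULES = {
    "smtplib", "marshal", "uuid", "webbrowser", "tarfile", "sqlite3",
}

def is_dangerous_global(global_arg: str) -> bool:
    """Check a pickle GLOBAL argument ("module attr") against the blocklist.

    pickletools renders the GLOBAL opcode's argument as a single
    space-separated "module attr" string, so split off the module first.
    """
    module = global_arg.split(" ", 1)[0]
    root = module.split(".", 1)[0]  # flag submodules too, e.g. "uuid._x"
    return root in ALWAYS_DANGEROUS_MODULES

print(is_dangerous_global("uuid _get_command_stdout"))            # True
print(is_dangerous_global("numpy.core.multiarray _reconstruct"))  # False
```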
…e scanning

- Fix test regression: track first pickle STOP position and use it for
  binary content scanning instead of f.tell() (which advances past
  multi-stream data)
- Fix multi-stream resync bypass: skip junk separator bytes between
  streams (up to 256 bytes) instead of returning immediately
- Gracefully handle ValueErrors on subsequent streams (non-pickle data
  after valid pickle) instead of propagating them as file-level errors
- Add type hints: ScanResult return type on _scan_bytes, -> None on all
  test methods
- Add separator-byte resync regression test

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Buffer opcodes for subsequent pickle streams and discard partial
  streams (binary tensor data misinterpreted as opcodes) to prevent
  MANY_DANGEROUS_OPCODES false positives
- Add resync logic to skip up to 256 non-pickle separator bytes
  between streams, preventing single-byte bypass of multi-stream
  detection
- Track pickle memo (BINPUT/BINGET) for safe ML globals so REDUCE
  calls referencing safe callables via memo are not counted as
  dangerous opcodes
- Reset dangerous opcode counters at STOP boundaries so each stream
  is evaluated independently
- Use first_pickle_end_pos for binary content scanning instead of
  f.tell() which advances past tensor data during multi-stream parsing

Validated against 5 HuggingFace models with A/B comparison to main
branch — no new false positives introduced.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
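
The `first_pickle_end_pos` idea above can be illustrated with the stdlib `pickletools` module. This is a minimal sketch of the concept, not the scanner's actual implementation: `pickletools.genops` stops at the first STOP opcode, so `f.tell()` at that point marks the end of the first stream without advancing into trailing tensor bytes.

```python
import io
import pickle
import pickletools

def first_pickle_end_pos(data: bytes) -> int:
    """Return the offset just past the first pickle STOP opcode."""
    f = io.BytesIO(data)
    for opcode, _arg, _pos in pickletools.genops(f):
        if opcode.name == "STOP":
            return f.tell()
    raise ValueError("no complete pickle stream found")

payload = pickle.dumps({"weights": [1, 2, 3]})
blob = payload + b"\x93NUMPY-tensor-bytes"  # trailing binary data
end = first_pickle_end_pos(blob)
assert blob[:end] == payload  # position has not advanced past the tensor data
print(end, len(blob))
```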
The flax_msgpack scanner was registered but failed to load when msgpack
was not installed, silently falling back to generic scanning. Since
msgpack is a lightweight zero-dependency C extension, promote it from
optional (extras: flax) to a core dependency so Flax/JAX .msgpack files
are scanned correctly out of the box.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

coderabbitai bot commented Feb 25, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

Expands suspicious globals and always-dangerous lists, adds multi-stream pickle parsing with resync and first_pickle_end_pos propagation, treats NEWOBJ_EX in danger checks, adds tests for blocklist and multi-stream cases, tightens msgpack dependency, and refines Flax/MsgPack transformer detection.

Changes

  • Suspicious symbols mapping (modelaudit/detectors/suspicious_symbols.py): Large expansion of SUSPICIOUS_GLOBALS with many module/function entries (networking, execution/compilation, FFI, threading/process, stdlib trampolines, packaging, tooling).
  • Pickle scanner logic (modelaudit/scanners/pickle_scanner.py): Adds multi-stream parsing with resync and _MAX_RESYNC_BYTES, buffers subsequent streams, tracks and propagates first_pickle_end_pos, treats NEWOBJ_EX like NEWOBJ in many checks, and expands ALWAYS_DANGEROUS_FUNCTIONS/ALWAYS_DANGEROUS_MODULES.
  • Pickle scanner tests (tests/scanners/test_pickle_scanner.py): New test class covering GLOBAL+REDUCE payloads, NEWOBJ_EX detection, multi-stream benign-to-malicious flows, separator-byte resync, and blocklist assertions for many modules.
  • Flax/msgpack scanner (modelaudit/scanners/flax_msgpack_scanner.py): Improves transformer-model detection by inspecting nested dicts and top-level keys matching common model-name tokens when transformer keys are absent.
  • Dependencies (pyproject.toml): Adds the msgpack>=1.0.0,<2.0 constraint.
  • Real-world tests tweak (tests/test_real_world_dill_joblib.py): Relaxes the assertion for compressed joblib scans to require any reported issue when bytes_scanned == 0.

Sequence Diagram(s)

sequenceDiagram
    participant FileObj as FileObj
    participant Parser as PickleParser
    participant Buffer as StreamBuffer
    participant Detector as OpcodeDetector
    participant Binary as BinaryScanner

    FileObj->>Parser: read bytes
    Parser->>Parser: attempt parse stream
    alt first stream completes (STOP)
        Parser->>Detector: yield ops for analysis
        Parser->>Parser: set first_pickle_end_pos
    else subsequent stream
        Parser->>Buffer: buffer opcodes
        Buffer->>Parser: validate buffered stream
        alt buffered parse succeeds
            Buffer->>Detector: yield ops for analysis
        else within _MAX_RESYNC_BYTES
            Parser->>FileObj: skip separator bytes (resync)
            Parser->>Parser: continue parsing
        end
    end
    Detector->>Binary: attach first_pickle_end_pos to metadata
    Binary->>Detector: return binary-scan results

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

"I nibbled bytes and hopped through streams,
Resynced gaps and chased new schemes.
NEWOBJ_EX lit the carrot sky,
Globals grew—oh my, oh my!
🐇✨"

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check: ✅ Passed. Check skipped because CodeRabbit's high-level summary is enabled.
  • Title check: ✅ Passed. The title accurately describes the primary change: promoting msgpack from an optional dependency to a core dependency for the Flax scanner.
  • Docstring Coverage: ✅ Passed. Docstring coverage is 95.83%, above the required 80.00% threshold.




@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: The scanner currently terminates multi-stream scanning when
a later stream yields at least one opcode before an error; instead, in the
scanning loop that uses resync_skipped and _MAX_RESYNC_BYTES you must remove the
early return on partial parse and implement resync logic: when a stream errors
after yielding opcodes, do not return — increment resync_skipped, advance the
input by one byte and attempt to resync/continue scanning this stream (and
subsequent streams) until a match or until resync_skipped exceeds a sensible
limit; do not treat a single opcode+junk separator as final. Replace the
immediate return with a continue/resync flow, reset resync_skipped on successful
parse, and revise the hard-coded 256-byte ceiling (_MAX_RESYNC_BYTES) to either
be higher or configurable so simple padding cannot evade detection; add tests
for benign and malicious samples to verify detection still fires after padding.

ℹ️ Review info

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Disabled knowledge base sources:

  • Linear integration is disabled

You can enable these sources in your CodeRabbit configuration.

📥 Commits

Reviewing files that changed from the base of the PR and between e736ebb and b1ff559.

📒 Files selected for processing (4)
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/pickle_scanner.py
  • pyproject.toml
  • tests/scanners/test_pickle_scanner.py

yash2998chhabria and others added 3 commits February 25, 2026 12:34
Some HuggingFace Flax models (e.g. bert-for-token-classification) use a
nested structure where model weights are wrapped under a model-name key
(e.g. "bert", "classifier") rather than placing transformer keys like
"encoder" or "embeddings" at the top level. This caused an INFO-level
false positive: "Suspicious data structure - does not match known ML
model patterns".

Add detection for nested model structures by checking if top-level keys
contain transformer sub-keys, and add a set of common HuggingFace model
name keys to recognize known architectures.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Resolve conflicts in modelaudit/scanners/pickle_scanner.py:
- ALWAYS_DANGEROUS_GLOBALS: keep both branch additions (pkgutil.resolve_name,
  uuid internals) and main additions (cProfile/profile/pdb/timeit, ctypes)
- ALWAYS_DANGEROUS_MODULES: merge branch's comprehensive module blocklist with
  main's PR #518 additions (ctypes, profiling, thread/process, module loading)
  and preserve main's linecache/logging.config exclusion note
- REDUCE opcode handler: adopt main's resolved_callables approach with branch's
  _safe_memo fallback for BINGET memo patterns

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 52-54: Document the security trade-off for _MAX_RESYNC_BYTES in
pickle_scanner.py: update the comment near the _MAX_RESYNC_BYTES and
resync_skipped definitions to explain that the 256-byte limit prevents skipping
arbitrarily large gaps between pickle streams but can be evaded by inserting
>=256 bytes of non-pickle data before a malicious stream, note the rationale for
stopping on partial opcode sequences to avoid misinterpreting binary tensor
data, and recommend making _MAX_RESYNC_BYTES configurable (and covered by tests)
if callers need a different performance/security trade-off.
- Around line 409-411: The scan currently lists "functools" in
ALWAYS_DANGEROUS_MODULES which will produce false positives for legitimate ML
usages; update ML_SAFE_GLOBALS to whitelist safe functools members (e.g.,
partial, reduce, lru_cache, wraps) and adjust any checks that treat the entire
module as dangerous so they allow these specific attributes, or add a clarifying
comment (similar to the existing logging/linecache rationale) explaining why the
module is treated specially; reference ALWAYS_DANGEROUS_MODULES and
ML_SAFE_GLOBALS in your change so the scanner accepts common safe usages like
functools.partial or functools.lru_cache while still flagging dangerous patterns
that use functools to wrap unsafe callables.
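
The whitelist-before-blocklist ordering the reviewer suggests can be sketched like this. ML_SAFE_GLOBALS and ALWAYS_DANGEROUS_MODULES mirror names mentioned in the comment, but the classification helper and its return values are illustrative, not the scanner's real code:

```python
ALWAYS_DANGEROUS_MODULES = {"functools", "marshal"}  # subset for illustration
ML_SAFE_GLOBALS = {
    ("functools", "partial"),
    ("functools", "reduce"),
    ("functools", "lru_cache"),
    ("functools", "wraps"),
}

def classify_global(module: str, attr: str) -> str:
    """The specific-member whitelist wins before the module blocklist applies."""
    if (module, attr) in ML_SAFE_GLOBALS:
        return "safe"
    if module.split(".", 1)[0] in ALWAYS_DANGEROUS_MODULES:
        return "dangerous"
    return "unknown"

print(classify_global("functools", "partial"))     # safe
print(classify_global("functools", "cmp_to_key"))  # dangerous (not whitelisted)
print(classify_global("marshal", "loads"))         # dangerous
```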


📥 Commits

Reviewing files that changed from the base of the PR and between b1ff559 and c3b8150.

📒 Files selected for processing (4)
  • modelaudit/detectors/suspicious_symbols.py
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py
  • tests/scanners/test_pickle_scanner.py

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelaudit/scanners/pickle_scanner.py (1)

2132-2138: ⚠️ Potential issue | 🟠 Major

Split INST from OBJ/NEWOBJ/NEWOBJ_EX handling—the current condition disables the latter three.

In pickletools.genops(), INST yields a string argument (module and class name), but OBJ, NEWOBJ, and NEWOBJ_EX have no inline arguments and yield arg=None. The isinstance(arg, str) guard prevents these three opcodes from ever triggering the detection pattern, making their addition ineffective.

Suggested fix
-        if opcode.name in ["INST", "OBJ", "NEWOBJ", "NEWOBJ_EX"] and isinstance(arg, str):
-            return {
-                "pattern": f"{opcode.name}_EXECUTION",
-                "argument": arg,
-                "position": pos,
-                "opcode": opcode.name,
-            }
+        if opcode.name == "INST" and isinstance(arg, str):
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }
+
+        if opcode.name in {"OBJ", "NEWOBJ", "NEWOBJ_EX"}:
+            return {
+                "pattern": f"{opcode.name}_EXECUTION",
+                "argument": arg,
+                "position": pos,
+                "opcode": opcode.name,
+            }

This violates the guideline: "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 2132 - 2138, The current
check conflates INST with OBJ/NEWOBJ/NEWOBJ_EX by requiring isinstance(arg,
str), which blocks OBJ/NEWOBJ/NEWOBJ_EX because they yield arg=None; update the
logic in pickle_scanner.py (around the opcode handling using opcode.name and
arg, e.g., in the code that references pickletools.genops and opcode.name) to
split cases: handle opcode.name == "INST" with the isinstance(arg, str) guard
and emit the INST_EXECUTION pattern including the string argument, and add a
separate branch for opcode.name in ["OBJ","NEWOBJ","NEWOBJ_EX"] that emits a
distinct pattern (e.g., OBJ_EXECUTION or NEWOBJ_EXECUTION) using position and
opcode.name while not requiring arg to be a string so those opcodes are detected
correctly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 726-737: The loop that checks top-level keys (using found_keys,
obj, transformer_keys, model_name_keys, has_transformer_keys) can call
key.lower() on non-string keys and raise AttributeError; guard against that by
first verifying the key is a string (e.g., if isinstance(key, str)) before
calling key.lower(), and only perform the model-name membership checks
(key.lower() in model_name_keys and any(mk in key.lower() for mk in
model_name_keys)) when the key is a string so the scanner won't crash while
preserving the same transformer-key detection logic.
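
The guard the comment asks for is a one-line isinstance check before any string method is called. A hypothetical helper showing the shape of the fix (not the scanner's actual code):

```python
def string_keys_lowered(keys):
    """Lowercase only str keys; msgpack maps may carry int/bytes keys (sketch)."""
    return [k.lower() for k in keys if isinstance(k, str)]

# Non-string keys are skipped instead of raising AttributeError on .lower().
print(string_keys_lowered(["Encoder", 42, b"raw", "BERT"]))  # ['encoder', 'bert']
```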



📥 Commits

Reviewing files that changed from the base of the PR and between c3b8150 and 1869d08.

⛔ Files ignored due to path filters (1)
  • uv.lock is excluded by !**/*.lock
📒 Files selected for processing (2)
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py

- Guard key.lower() for non-string msgpack keys to prevent
  AttributeError that could abort the scan (evasion vector)
- Add functools safe functions to ML_SAFE_GLOBALS to whitelist
  common ML utilities like partial, reduce, lru_cache
- Increase _MAX_RESYNC_BYTES from 256 to 8192 so binary tensor
  data between pickle streams doesn't cause early scan termination
- Continue scanning on partial stream errors instead of returning
  early, preventing evasion via hidden payloads after binary blobs
- Clear _safe_memo at STOP boundaries so safe callables from one
  pickle stream cannot whitelist dangerous REDUCEs in the next

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 695-740: The substring check using model_name_keys (especially
short tokens like "opt" or "score") is too permissive; update the logic in the
block that computes has_transformer_keys (references: model_name_keys,
has_transformer_keys, transformer_keys, key_lower, found_keys, obj.get) to use
stricter token/boundary matching instead of any(mk in key_lower for mk in
model_name_keys): normalize key_lower and match whole tokens (e.g., split on
non-alphanumerics or use regex word boundaries) and explicitly check both exact
matches and token-level membership for entries in model_name_keys, and
add/update unit tests to include adversarial keys (e.g., "optional_score",
"desktop") and true positives to ensure the heuristic no longer false-positives
benign payloads.

In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 95-100: When the branch with "if stream_error and had_opcodes" is
taken inside the PickleScanner scanning loop, increment the resync budget by
adding the number of bytes consumed by the discarded buffer to resync_skipped
and abort (raise/stop) when resync_skipped exceeds _MAX_RESYNC_BYTES;
specifically, compute the consumed byte count for the current partial stream
(e.g., bytes between the previous sync point and current file position or
len(buffer) being discarded), add that value to resync_skipped, and enforce the
existing _MAX_RESYNC_BYTES check before continuing so a crafted binary region
cannot indefinitely bypass the resync limit (update references in the scanning
loop where stream_error, had_opcodes, and resync_skipped are used).
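
The budget accounting the comment describes reduces to adding the discarded bytes to the running counter and comparing against the ceiling. A sketch with hypothetical names (the 8192-byte ceiling matches the value mentioned elsewhere in this PR):

```python
_MAX_RESYNC_BYTES = 8192  # ceiling referenced in later commits of this PR

def charge_resync(resync_skipped: int, consumed: int) -> tuple[int, bool]:
    """Charge bytes consumed by a discarded partial stream to the budget.

    Returns the updated counter and whether scanning may continue.
    """
    resync_skipped += consumed
    return resync_skipped, resync_skipped <= _MAX_RESYNC_BYTES

skipped, ok = charge_resync(0, 4096)
print(skipped, ok)   # 4096 True
skipped, ok = charge_resync(skipped, 5000)
print(skipped, ok)   # 9096 False -> stop resyncing this region
```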


📥 Commits

Reviewing files that changed from the base of the PR and between 1869d08 and 6fa1a0e.

📒 Files selected for processing (2)
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py

- Guard non-string keys: skip int/bytes msgpack keys instead of
  converting them to string, preventing potential crashes (CR #1)
- Remove substring matching for model_name_keys: use exact match only
  to avoid false positives like "opt" matching "optional" (CR #2)
- Charge partial-stream bytes against resync budget: prevents attackers
  from using repeated partial streams to bypass _MAX_RESYNC_BYTES (CR #3)
- Fix Windows test: relax compressed joblib assertion to not require
  specific keywords in issue messages, fixing CI on Windows

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/flax_msgpack_scanner.py`:
- Around line 695-745: The current heuristic sets has_transformer_keys true if a
top-level key_lower is in model_name_keys, which lets generic names like
"score", "classifier", "qa_outputs", "lm_head" falsely trigger; change the check
so that ambiguous generic names only count when they have nested transformer
sub-keys (i.e., value is a dict and sub_keys & transformer_keys is non-empty) or
when another more-specific model name is also present. Concretely, introduce an
AMBIGUOUS_KEYS set (e.g., {"score","classifier","qa_outputs","lm_head"}) and in
the loop over found_keys: if key_lower in AMBIGUOUS_KEYS, require the value to
be a dict and contain transformer_keys before setting has_transformer_keys; if
key_lower in model_name_keys minus AMBIGUOUS_KEYS, you may set
has_transformer_keys immediately; keep the existing exact-match and type checks
for keys and use obj.get(key)/sub_keys & transformer_keys as before.

In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 82-90: The buffered list used inside the subsequent-streams loop
(variable buffered in the loop iterating pickletools.genops(file_obj)) can grow
unbounded; introduce a fixed cap (e.g. MAX_BUFFERED_OPCODES) and enforce it
(either by using collections.deque(maxlen=MAX_BUFFERED_OPCODES) or by popping
oldest entries when len(buffered) >= MAX_BUFFERED_OPCODES) so buffered cannot
consume unlimited memory for large valid streams; update the code that appends
to buffered in the for item in pickletools.genops(file_obj) loop (and keep
had_opcodes logic unchanged) to respect this cap.

In `@tests/test_real_world_dill_joblib.py`:
- Around line 151-159: Replace the weak type check on result.success with a
concrete assertion covering expected outcomes: assert that either the scan
succeeded (result.success is True) or, if no bytes were scanned, the scanner
reported issues (result.bytes_scanned == 0 implies len(result.issues) > 0); you
can implement this as a single boolean assertion combining these conditions
(reference symbols: result.success, result.bytes_scanned, result.issues).
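
The AMBIGUOUS_KEYS hardening proposed for the Flax scanner above can be sketched as follows. All names here are from the review comment but the function itself is a hypothetical reduction, not the scanner's code: generic names like "score" only count when corroborated by nested transformer sub-keys, while specific architecture names count on their own.

```python
AMBIGUOUS_KEYS = {"score", "classifier", "qa_outputs", "lm_head"}
MODEL_NAME_KEYS = {"bert", "gpt2", "llama", "t5"} | AMBIGUOUS_KEYS
TRANSFORMER_KEYS = {"encoder", "embeddings", "attention"}

def looks_like_transformer(obj: dict) -> bool:
    """Ambiguous names need nested transformer sub-keys; specific names don't."""
    for key, value in obj.items():
        if not isinstance(key, str):
            continue  # msgpack allows int/bytes keys; never model names
        key_lower = key.lower()
        sub_keys = set(value.keys()) if isinstance(value, dict) else set()
        if sub_keys & TRANSFORMER_KEYS:
            return True  # nested transformer structure under any key
        if key_lower in AMBIGUOUS_KEYS:
            continue  # a generic name alone is not sufficient evidence
        if key_lower in MODEL_NAME_KEYS:
            return True
    return False

print(looks_like_transformer({"bert": {"dense": {}}}))       # True
print(looks_like_transformer({"score": {"kernel": [1.0]}}))  # False
print(looks_like_transformer({"score": {"encoder": {}}}))    # True
```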


📥 Commits

Reviewing files that changed from the base of the PR and between 6fa1a0e and 345ca63.

📒 Files selected for processing (3)
  • modelaudit/scanners/flax_msgpack_scanner.py
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_real_world_dill_joblib.py

Comment on lines +695 to +745
# Common HuggingFace model name keys that wrap transformer substructure
model_name_keys = {
    "bert",
    "roberta",
    "distilbert",
    "albert",
    "electra",
    "xlm",
    "gpt2",
    "gpt_neo",
    "gpt_neox",
    "gptj",
    "opt",
    "llama",
    "t5",
    "bart",
    "pegasus",
    "mbart",
    "blenderbot",
    "vit",
    "clip",
    "whisper",
    "wav2vec2",
    "flax_model",
    "classifier",
    "qa_outputs",
    "lm_head",
    "score",
}
has_transformer_keys = any(key in found_keys for key in transformer_keys)

# Check if any top-level key contains transformer sub-keys (nested model structure)
if not has_transformer_keys:
    for key in found_keys:
        value = obj.get(key)
        if isinstance(value, dict):
            sub_keys = set(value.keys())
            if sub_keys & transformer_keys:
                has_transformer_keys = True
                break
        # Also check if the top-level key is a known model name.
        # Only check string keys — msgpack allows int/bytes keys which
        # are never valid model names and would cause crashes on .lower().
        if not isinstance(key, str):
            continue
        key_lower = key.lower()
        # Use exact match only to avoid false positives from substring
        # matching (e.g. "opt" matching "optional", "t5" matching "t500").
        if key_lower in model_name_keys:
            has_transformer_keys = True
            break

⚠️ Potential issue | 🟠 Major

Avoid promoting transformer classification from generic key names alone.

Using standalone top-level names like "score"/"classifier" as sufficient evidence is still too permissive and can misclassify benign msgpack payloads.

Suggested hardening
-                    if key_lower in model_name_keys:
+                    # Require corroborating structure before classifying as transformer.
+                    if key_lower in model_name_keys and (
+                        sub_keys & transformer_keys
+                        or any(isinstance(child, dict) for child in value.values())
+                    ):
                         has_transformer_keys = True
                         break

As per coding guidelines "Preserve or strengthen security detections; test both benign and malicious samples when modifying scanners".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/flax_msgpack_scanner.py` around lines 695 - 745, The
current heuristic sets has_transformer_keys true if a top-level key_lower is in
model_name_keys, which lets generic names like "score", "classifier",
"qa_outputs", "lm_head" falsely trigger; change the check so that ambiguous
generic names only count when they have nested transformer sub-keys (i.e., value
is a dict and sub_keys & transformer_keys is non-empty) or when another
more-specific model name is also present. Concretely, introduce an
AMBIGUOUS_KEYS set (e.g., {"score","classifier","qa_outputs","lm_head"}) and in
the loop over found_keys: if key_lower in AMBIGUOUS_KEYS, require the value to
be a dict and contain transformer_keys before setting has_transformer_keys; if
key_lower in model_name_keys minus AMBIGUOUS_KEYS, you may set
has_transformer_keys immediately; keep the existing exact-match and type checks
for keys and use obj.get(key)/sub_keys & transformer_keys as before.

Comment on lines +82 to +90
# Subsequent streams: buffer opcodes so that partial streams
# (e.g. binary tensor data misinterpreted as opcodes) don't
# produce false positives.
buffered: list[Any] = []
try:
    for item in pickletools.genops(file_obj):
        had_opcodes = True
        buffered.append(item)
except ValueError:

⚠️ Potential issue | 🟠 Major

Bound buffered opcode growth for subsequent streams.

buffered grows without limits for later streams. A large valid second stream can consume substantial memory before any caller-side opcode cap is enforced.

Suggested guardrail
+    _MAX_BUFFERED_OPCODES = 200_000
@@
             try:
                 for item in pickletools.genops(file_obj):
                     had_opcodes = True
                     buffered.append(item)
+                    if len(buffered) > _MAX_BUFFERED_OPCODES:
+                        stream_error = True
+                        logger.warning(
+                            "Aborting oversized secondary pickle stream during resync "
+                            "(>%d buffered opcodes)",
+                            _MAX_BUFFERED_OPCODES,
+                        )
+                        break
             except ValueError:
                 # Any ValueError on a subsequent stream means we hit
                 # non-pickle data or a junk separator byte.
                 stream_error = True
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelaudit/scanners/pickle_scanner.py` around lines 82 - 90, The buffered
list used inside the subsequent-streams loop (variable buffered in the loop
iterating pickletools.genops(file_obj)) can grow unbounded; introduce a fixed
cap (e.g. MAX_BUFFERED_OPCODES) and enforce it (either by using
collections.deque(maxlen=MAX_BUFFERED_OPCODES) or by popping oldest entries when
len(buffered) >= MAX_BUFFERED_OPCODES) so buffered cannot consume unlimited
memory for large valid streams; update the code that appends to buffered in the
for item in pickletools.genops(file_obj) loop (and keep had_opcodes logic
unchanged) to respect this cap.

Comment on lines 151 to +159
         assert isinstance(result.success, bool)
-        # May not scan bytes if compression format isn't recognized as pickle
+        # The scanner may:
+        # 1. Successfully scan some bytes (multi-stream parsing may consume compressed data)
+        # 2. Report issues when the compressed data isn't valid pickle
+        # Either outcome is acceptable — what matters is no unhandled crash.
         if result.bytes_scanned == 0:
-            # Should have reported format issues
-            assert len(result.issues) > 0
-            format_issues = [i for i in result.issues if "opcode" in str(i.message).lower()]
-            assert len(format_issues) > 0, "Should report format/opcode issues for compressed files"
+            # When no bytes were scanned, the scanner should have reported
+            # some kind of issue explaining why.
+            assert len(result.issues) > 0, "Should report issues for compressed files"

⚠️ Potential issue | 🟡 Minor

Strengthen this assertion to preserve regression value.

assert isinstance(result.success, bool) is effectively non-informative here and can let scanner regressions slip through undetected. Assert a concrete behavior outcome instead of type shape.

Proposed test tightening
-        assert isinstance(result.success, bool)
+        # Scanner must either parse bytes or explain why it could not.
+        assert result.bytes_scanned > 0 or len(result.issues) > 0
         # The scanner may:
         # 1. Successfully scan some bytes (multi-stream parsing may consume compressed data)
         # 2. Report issues when the compressed data isn't valid pickle
         # Either outcome is acceptable — what matters is no unhandled crash.
         if result.bytes_scanned == 0:
             # When no bytes were scanned, the scanner should have reported
             # some kind of issue explaining why.
             assert len(result.issues) > 0, "Should report issues for compressed files"

As per coding guidelines "Add comprehensive tests including edge cases and regression coverage for scanner changes".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/test_real_world_dill_joblib.py` around lines 151 - 159, Replace the
weak type check on result.success with a concrete assertion covering expected
outcomes: assert that either the scan succeeded (result.success is True) or, if
no bytes were scanned, the scanner reported issues (result.bytes_scanned == 0
implies len(result.issues) > 0); you can implement this as a single boolean
assertion combining these conditions (reference symbols: result.success,
result.bytes_scanned, result.issues).
