fix: detect hidden pytorch zip pickles#1043
Conversation
Performance BenchmarksCompared
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f4e6c684e3
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
f4e6c68 to
a274fc8
Compare
Apply the same refinements this reviewer just landed on PR #1048 to the hidden-pickle discovery path: 1. Aggregate probe-failure INFO checks. An adversarial archive with many members that raise on decompression (for example, unsupported methods or intermittent I/O) previously produced one `Pickle Discovery` INFO finding per member, flooding the checks list. Collect failures into a single summary check carrying the per-member exceptions under `details["entries"]`, with `details["zip_entries"]` and `details["failed_count"]` for quick consumers. `mark_inconclusive_ scan_result` is hoisted out of the loop so the metadata reason is recorded exactly once even if many members fail. 2. Document the identity-based dedup invariant in `_discover_pickle_files`. Both passes iterate the same `safe_entries` list, so `id(ZipInfo)` is stable for the duration of discovery; a future refactor that rebuilds the list from separate `infolist()` walks or fresh `ZipInfo` constructions would silently defeat the dedup. Call that out inline so we don't re-learn it the hard way. 3. Explain the magic thresholds in `_looks_like_binary_pickle_prefix`. The `>= 4` clean-parse and `>= 2` truncation thresholds are tuned to balance tensor-storage false positives against real pickle prefixes; undocumented they read like arbitrary numbers. 4. Add "keep in sync" comments on the standalone picklescan copies of `_looks_like_binary_pickle_prefix` and `_looks_like_proto0_or_1_ pickle` so the duplication (intentional for the standalone package) is at least signposted. Expand the CHANGELOG bullet to describe the always-on second-pass sniff, the fail-closed aggregation behavior, and the standalone-package mirror. Adds a new `test_pytorch_zip_discovery_aggregates_probe_failures_ into_single_check` regression test that proves 5 failing members collapse to one `Pickle Discovery` check with a deduplicated reason and all five per-member records under `details["entries"]`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 234dfe66d3
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
…torch-hidden-pickles # Conflicts: # CHANGELOG.md
…torch-hidden-pickles # Conflicts: # CHANGELOG.md # modelaudit/scanners/pytorch_zip_scanner.py
Summary
Finding
Fixes finding 1: hidden PyTorch ZIP pickles are skipped after data.pkl.
Validation