fix: route PyTorch ZIP archives without metadata #1016

Merged
mldangelo-oai merged 2 commits into main from mdangelo/codex/route-pytorch-zip-data-pkl
Apr 16, 2026

@mldangelo-oai
Contributor

Summary

Route ZIP-backed PyTorch archives to the PyTorch ZIP scanner when they contain data.pkl plus PyTorch tensor storage members, even if version and byteorder metadata entries are missing.

Security impact

Previously, a torch-loadable ZIP with data.pkl and storage entries but no metadata could fall through to the generic ZIP scanner when the suffix was misleading. That skipped PyTorch-specific CVE checks and pickle handling paths. The new fallback still requires storage members next to data.pkl, so a generic ZIP containing only data.pkl remains routed to the generic ZIP scanner.
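The fallback described above can be sketched roughly as follows. This is a minimal illustration, not the code in `modelaudit/utils/file/detection.py`: the helper names `find_data_pkl_prefixes`, `has_storage_members`, `route_zip`, and `member_names_of` are hypothetical, the `"zip"` label for the generic scanner is an assumption, and the metadata-based route (`version`, `byteorder`) that already existed is left out.

```python
import io
import zipfile


def find_data_pkl_prefixes(member_names: set[str]) -> set[str]:
    """Return the directory prefix of every data.pkl member ('' for the root)."""
    prefixes = set()
    for name in member_names:
        if name == "data.pkl":
            prefixes.add("")
        elif name.endswith("/data.pkl"):
            prefixes.add(name[: -len("/data.pkl")])
    return prefixes


def has_storage_members(member_names: set[str], prefix: str) -> bool:
    """True if at least one member lives under <prefix>/data/ next to data.pkl."""
    storage_prefix = f"{prefix}/data/" if prefix else "data/"
    return any(
        name.startswith(storage_prefix) and name != storage_prefix
        for name in member_names
    )


def route_zip(member_names: set[str]) -> str:
    """Pick a scanner label from member names alone (labels are illustrative)."""
    for prefix in find_data_pkl_prefixes(member_names):
        if has_storage_members(member_names, prefix):
            return "pytorch_zip"
    # data.pkl alone, or no data.pkl at all, stays with the generic ZIP scanner.
    return "zip"


def member_names_of(entries: dict[str, bytes]) -> set[str]:
    """Build an in-memory ZIP from name -> payload and list its members."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, payload in entries.items():
            archive.writestr(name, payload)
    with zipfile.ZipFile(buffer) as archive:
        return set(archive.namelist())


# No version/byteorder entries, but storage sits next to data.pkl.
no_metadata = member_names_of({"archive/data.pkl": b"", "archive/data/0": b"\x00"})
# A generic ZIP that merely contains a data.pkl.
only_pickle = member_names_of({"data.pkl": b""})

print(route_zip(no_metadata))
print(route_zip(only_pickle))
```

Routing on member names rather than on the file suffix is what keeps a misleadingly named archive from bypassing the PyTorch-specific checks.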

Validation

  • uv run ruff format modelaudit/utils/file/detection.py tests/test_core.py
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest tests/test_core.py tests/test_pytorch_zip_detection.py tests/scanners/test_pytorch_zip_scanner.py -q
  • uv run mypy modelaudit/utils/file/detection.py tests/test_core.py

@github-actions
Contributor

github-actions bot commented Apr 15, 2026

Workflow run and artifacts

Performance Benchmarks

Compared 18 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 1 improved, 17 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 707.87ms -> 687.92ms (-2.8%).
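The percentages in this report are consistent with a plain relative change against the baseline. The sketch below reproduces that arithmetic; the function names are hypothetical, and treating the 15% threshold as symmetric and inclusive is an assumption, since the report does not say how the boundary is handled.

```python
def percent_change(baseline: float, current: float) -> float:
    """Relative change of current versus baseline, in percent."""
    return (current - baseline) / baseline * 100.0


def classify(change_pct: float, threshold_pct: float = 15.0) -> str:
    """Label a change against a symmetric threshold (assumed behaviour)."""
    if change_pct >= threshold_pct:
        return "regression"
    if change_pct <= -threshold_pct:
        return "improved"
    return "stable"


aggregate = percent_change(707.87, 687.92)      # aggregate median, in ms
pytorch_zip = percent_change(42.72, 34.04)      # test_scan_pytorch_zip, in ms
chunked_stream = percent_change(7.92, 7.24)     # test_picklescan_chunked_stream, in ms

print(round(aggregate, 1), classify(aggregate))
print(round(pytorch_zip, 1), classify(pytorch_zip))
print(round(chunked_stream, 1), classify(chunked_stream))
```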

Top improvements:

  • tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip -20.3% (42.72ms -> 34.04ms, state_dict.pt, size=1.5 MiB, files=1)

| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 42.72ms | 34.04ms | -20.3% | improved |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_stream | chunked_stream | 278.2 KiB | 1 | 7.92ms | 7.24ms | -8.6% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_raw] | nested_raw | 78 B | 1 | 108.0us | 99.6us | -7.8% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[malicious_reduce] | malicious_reduce | 52 B | 1 | 83.9us | 77.7us | -7.4% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_hidden_suspicious_string_budget | hidden_suspicious_string | 8.0 KiB | 1 | 544.6us | 578.4us | +6.2% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_opcode_budget_tail_payload | opcode_budget_tail | 14 B | 1 | 75.6us | 70.9us | -6.1% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 172.6us | 164.8us | -4.5% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 28.62ms | 27.48ms | -4.0% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_multi_stream_padded_payload | multi_stream_padded | 4.1 KiB | 1 | 133.4us | 128.2us | -3.9% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_base64] | nested_base64 | 98 B | 1 | 121.7us | 118.0us | -3.0% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 49.4us | 48.0us | -2.8% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[stack_global] | stack_global | 21 B | 1 | 70.2us | 68.7us | -2.1% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory | duplicate-corpus | 840.0 KiB | 81 | 485.93ms | 477.46ms | -1.7% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_large] | safe_large | 278.2 KiB | 1 | 4.26ms | 4.21ms | -1.2% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[long_benign_string] | long_benign_string | 1.0 MiB | 1 | 1.05ms | 1.03ms | -1.0% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_hex] | nested_hex | 130 B | 1 | 136.5us | 137.8us | +1.0% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory | mixed-corpus | 1.7 MiB | 54 | 135.83ms | 134.91ms | -0.7% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_small] | safe_small | 68 B | 1 | 58.1us | 57.9us | -0.3% | stable |

@mldangelo-oai changed the title from "[codex] fix: route PyTorch ZIP archives without metadata" to "fix: route PyTorch ZIP archives without metadata" Apr 15, 2026
@mldangelo-oai marked this pull request as ready for review April 15, 2026 23:31

@chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2bbc973641

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
def _looks_like_pytorch_zip_storage_members(member_names: set[str], prefix: str) -> bool:
    """Detect PyTorch tensor storage members next to data.pkl."""
    storage_prefix = f"{prefix}/data/" if prefix else "data/"
    return any(name.startswith(storage_prefix) and name != storage_prefix for name in member_names)

P1: Restrict storage-member match before PyTorch routing

_looks_like_pytorch_zip_storage_members treats any file under <prefix>/data/ as tensor storage. A generic ZIP containing data.pkl plus data/readme.txt now routes to pytorch_zip, even when it is not a PyTorch archive. That scanner does not do generic ZIP recursive dispatch, so this broad match can reduce coverage (e.g., nested archive threats) compared with ZipScanner.

Useful? React with 👍 / 👎.
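One way to narrow the match along the lines of this review comment is sketched below. It is not necessarily what was merged: the function name `looks_like_storage_members_strict` is hypothetical, and the check assumes that PyTorch storage members are named with decimal keys such as `data/0` and `data/1`, which may not hold for every archive variant.

```python
import re

# Assumption: tensor storage members are named by decimal key, e.g. "data/0".
_STORAGE_KEY = re.compile(r"[0-9]+")


def looks_like_storage_members_strict(member_names: set[str], prefix: str) -> bool:
    """Accept only <prefix>/data/<digits> entries as tensor storage."""
    storage_prefix = f"{prefix}/data/" if prefix else "data/"
    for name in member_names:
        if not name.startswith(storage_prefix):
            continue
        key = name[len(storage_prefix):]
        if _STORAGE_KEY.fullmatch(key):
            return True
    return False


# The case from the review: data.pkl plus an unrelated file under data/.
generic = {"data.pkl", "data/readme.txt"}
# A PyTorch-style layout without version/byteorder metadata.
pytorch_like = {"archive/data.pkl", "archive/data/0", "archive/data/1"}

print(looks_like_storage_members_strict(generic, ""))
print(looks_like_storage_members_strict(pytorch_like, "archive"))
```

The trade-off is that a stricter pattern sends more ambiguous archives to the generic ZIP scanner, which keeps recursive dispatch for nested archives but skips the PyTorch-specific checks.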

@mldangelo-oai merged commit 1f56bb8 into main Apr 16, 2026
28 checks passed
@mldangelo-oai deleted the mdangelo/codex/route-pytorch-zip-data-pkl branch April 16, 2026 06:34
@github-actions bot mentioned this pull request Apr 16, 2026