fix: reject compressed wrapper raw trailers#905

Merged
mldangelo-oai merged 4 commits into mdangelo/codex/archive-container-routing-audit from mdangelo/codex/compressed-trailer-audit
Apr 10, 2026

@mldangelo-oai mldangelo-oai commented Apr 10, 2026

Summary

This PR continues the picklescan/container audit stack with a fail-closed fix for compressed wrappers. Python's bzip2 and xz readers can return the decompressed contents of a valid first member while silently consuming any raw bytes that follow it. For .pkl.bz2 and .pkl.xz wrappers, this meant a benign first pickle could hide a malicious pickle trailer that was never scanned.

Root cause: CompressedScanner used the high-level BZ2File and LZMAFile readers for standalone bzip2/xz wrappers. Those readers make streaming convenient, but they do not expose enough member-boundary state to distinguish valid concatenated members from raw trailing data. The new low-level loops also need chunk-sized per-call output probes; otherwise a single decompress call could still allocate up to the full remaining scan budget before the configured decompression limits are checked.
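
A minimal reproduction of the behavior described above, assuming only the standard-library bz2 module. The payloads and variable names are illustrative and are not taken from the repository's tests:

```python
import bz2
import io
import pickle

benign = pickle.dumps({"weights": [1, 2, 3]})
# Stand-in for a malicious pickle appended as raw, uncompressed bytes.
trailer = pickle.dumps("never scanned")

wrapper = bz2.compress(benign) + trailer

# The high-level reader returns the first member and stops quietly
# when the bytes after it do not form another bzip2 stream.
seen_by_reader = bz2.BZ2File(io.BytesIO(wrapper)).read()

print(seen_by_reader == benign)   # the benign pickle is all a scanner would see
print(trailer in seen_by_reader)  # the trailer never reaches the scanner
```

On CPython this should print True and then False: no exception is raised, so a scanner that only inspects the reader's output has no signal that extra bytes were present.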

Fix:

  • Add bounded low-level decompression loops for bzip2 and xz.
  • Preserve max decompressed byte and decompression ratio enforcement.
  • Preserve support for concatenated valid bzip2/xz members.
  • Reject raw or corrupt trailing bytes after a valid bzip2/xz member.
  • Cap bzip2, xz, and zlib per-call decompression output probes to the configured chunk size.
  • Add an Unreleased changelog note.
  • Add fail-closed trailer regressions, benign concatenated-member regressions, and per-call decompression probe coverage.
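
The loop shape described in the Fix list can be sketched roughly as follows. This is not the code from CompressedScanner; the function name, exception classes, and parameters are hypothetical, and only the standard-library `BZ2Decompressor` / `LZMADecompressor` attributes (`eof`, `unused_data`, `max_length`) are relied on. The sketch feeds the whole payload at once for brevity and does not handle xz stream padding:

```python
import bz2
import lzma
from typing import Any, Callable


class CompressedTrailerError(ValueError):
    """Bytes after a valid member are not another complete, valid member."""


class DecompressionLimitError(ValueError):
    """Output exceeded the configured byte or ratio limit."""


def decompress_members_strict(
    payload: bytes,
    new_decompressor: Callable[[], Any],  # e.g. bz2.BZ2Decompressor
    max_bytes: int,
    max_ratio: float,
    chunk_size: int = 64 * 1024,  # hypothetical default probe size
) -> bytes:
    out = bytearray()
    pending = payload
    while pending:
        decomp = new_decompressor()  # one decompressor per member
        data = pending
        try:
            while not decomp.eof:
                # Chunk-sized probe: never request more than chunk_size per call.
                chunk = decomp.decompress(data, max_length=chunk_size)
                data = b""  # unconsumed input stays buffered in the decompressor
                out += chunk
                if len(out) > max_bytes or len(out) > max_ratio * len(payload):
                    raise DecompressionLimitError("decompression limit exceeded")
                if not chunk and not decomp.eof:
                    # All input was supplied and nothing more came out.
                    raise CompressedTrailerError("truncated member")
        except (OSError, lzma.LZMAError) as exc:
            raise CompressedTrailerError("invalid data in or after a member") from exc
        pending = decomp.unused_data  # bytes after this member, if any
    return bytes(out)


def new_xz() -> lzma.LZMADecompressor:
    # FORMAT_XZ avoids auto-detecting the trailer as a legacy .lzma stream.
    return lzma.LZMADecompressor(format=lzma.FORMAT_XZ)
```

With this shape, `bz2.compress(a) + bz2.compress(b)` decodes to `a + b`, while `bz2.compress(a) + raw_bytes` raises instead of returning only `a`. Because each call is capped at `chunk_size`, the limit checks run after at most one chunk of new output.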

Validation

  • uv run pytest tests/scanners/test_compressed_scanner.py -q --tb=short
  • uv run pytest tests/scanners/test_compressed_scanner.py tests/test_core.py tests/scanners/test_scanner_registry.py tests/utils/file/test_filetype.py -q --tb=short
  • uv run ruff format modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff check --fix modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • uv run pytest -n auto -m "not slow and not integration" --maxfail=1

Full local result: 3376 passed, 75 skipped, 16 warnings.

Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.

coderabbitai bot commented Apr 10, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

github-actions bot commented Apr 10, 2026

Workflow run and artifacts

Performance Benchmarks

Compared 6 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 6 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 684.41ms -> 690.64ms (+0.9%).

| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 48.8us | 47.8us | -2.1% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 27.81ms | 28.19ms | +1.4% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 166.8us | 164.5us | -1.3% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 33.61ms | 33.99ms | +1.1% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory | duplicate-corpus | 840.0 KiB | 81 | 486.75ms | 491.54ms | +1.0% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory | mixed-corpus | 1.7 MiB | 54 | 136.01ms | 136.71ms | +0.5% | stable |

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b4a44225d0

Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.
Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.
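
For the zlib case, a chunk-sized probe could look something like the sketch below. It uses the standard-library `zlib.decompressobj()` with its `max_length` argument and `unconsumed_tail` attribute; the function name and limits are hypothetical, and rejecting trailing bytes here is an assumption carried over from the bzip2/xz behavior rather than something the commit message states:

```python
import zlib


def zlib_decompress_bounded(
    payload: bytes,
    max_bytes: int,
    chunk_size: int = 64 * 1024,  # hypothetical probe size
) -> bytes:
    decomp = zlib.decompressobj()
    out = bytearray()
    data = payload
    while True:
        # Each call may return at most chunk_size bytes of output.
        chunk = decomp.decompress(data, chunk_size)
        out += chunk
        # The size check runs after every bounded probe, not after one big read.
        if len(out) > max_bytes:
            raise ValueError("decompressed size limit exceeded")
        if decomp.eof:
            break
        if not chunk:
            raise ValueError("truncated zlib stream")
        # Input that was held back because of max_length must be fed again.
        data = decomp.unconsumed_tail
    if decomp.unused_data:
        # Assumption: fail closed on bytes after the stream, as for bzip2/xz.
        raise ValueError("unexpected trailing bytes after zlib stream")
    return bytes(out)
```

Without the `max_length` argument, a single `decompress()` call on a highly compressible input could return the entire output at once, so the size check would only run after the allocation had already happened.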
* fix: harden lz4 wrapper decoding

Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized.

Add fake-frame regressions for optional lz4 coverage.

* fix: support legacy lz4 chunk decoding

Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection.

Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.
@mldangelo-oai mldangelo-oai merged commit ec6c194 into mdangelo/codex/archive-container-routing-audit Apr 10, 2026
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/compressed-trailer-audit branch April 10, 2026 08:04
mldangelo-oai added a commit that referenced this pull request Apr 10, 2026
* fix: route misnamed compressed wrappers by header

Detect gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes even when the filename uses a misleading extension. Preserve existing joblib and R serialized scanner precedence, and add positive/negative routing regressions.

* fix: reject compressed wrapper raw trailers (#905)

* fix: reject compressed wrapper raw trailers

Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.

* fix: bound compressed decompression probes

Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.

* fix: bound zlib decompression probes

Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.

* fix: harden lz4 wrapper decoding (#906)

* fix: harden lz4 wrapper decoding

Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized.

Add fake-frame regressions for optional lz4 coverage.

* fix: support legacy lz4 chunk decoding

Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection.

Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.
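
The header-based routing mentioned in the first commit message above (detecting gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes) might be sketched as follows. The function name and return values are hypothetical and this is not the repository's routing code; the magic values themselves are the published format signatures, and zlib is matched by its CMF/FLG header check because it has no fixed magic:

```python
import bz2
import gzip
import lzma
import zlib
from typing import Optional

_MAGIC = (
    (b"\x1f\x8b", "gzip"),
    (b"BZh", "bzip2"),
    (b"\xfd7zXZ\x00", "xz"),
    (b"\x04\x22\x4d\x18", "lz4"),  # LZ4 frame format
)


def sniff_compression(header: bytes) -> Optional[str]:
    """Guess the wrapper format from leading bytes, ignoring the filename."""
    for magic, name in _MAGIC:
        if header.startswith(magic):
            return name
    if len(header) >= 2:
        cmf, flg = header[0], header[1]
        # zlib: deflate method (8), window size <= 32 KiB, header checksum % 31 == 0.
        if cmf & 0x0F == 8 and cmf >> 4 <= 7 and ((cmf << 8) | flg) % 31 == 0:
            return "zlib"
    return None


# A .bin or .txt name would not matter here; only the bytes are inspected.
print(sniff_compression(bz2.compress(b"payload")))
print(sniff_compression(b"\x80\x04not compressed"))
```

The two-byte zlib check is weak enough to match unrelated data by chance, which is presumably one reason the commit keeps existing joblib and R serialized scanner precedence ahead of this kind of sniffing.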
mldangelo-oai added a commit that referenced this pull request Apr 10, 2026
* fix: narrow legacy pickle stdlib policy

Remove broad CRITICAL treatment for noisy stdlib modules in the root pickle scanner while preserving exact dangerous helper coverage and warning-level suspicious refs.

* fix: preserve exact stdlib helper detections

Keep logging configuration loaders and uuid subprocess getnode helpers as exact dangerous callables in both root and standalone pickle policies.

* fix: flag uuid getnode pickle calls

Keep uuid.getnode in the exact dangerous-call policy after narrowing broad uuid module severity, since it can dispatch to platform helper probes that invoke subprocess-backed paths.

Cover both root PickleScanner and standalone picklescan policy regressions.

* fix: route misnamed compressed wrappers by header (#904)

* fix: route misnamed compressed wrappers by header

Detect gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes even when the filename uses a misleading extension. Preserve existing joblib and R serialized scanner precedence, and add positive/negative routing regressions.

* fix: reject compressed wrapper raw trailers (#905)

* fix: reject compressed wrapper raw trailers

Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.

* fix: bound compressed decompression probes

Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.

* fix: bound zlib decompression probes

Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.

* fix: harden lz4 wrapper decoding (#906)

* fix: harden lz4 wrapper decoding

Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized.

Add fake-frame regressions for optional lz4 coverage.

* fix: support legacy lz4 chunk decoding

Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection.

Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.
mldangelo-oai added a commit that referenced this pull request Apr 10, 2026
* fix: add standalone-primary pickle migration mode

* fix: isolate standalone pickle primary fallback state

* fix: narrow legacy pickle stdlib policy (#903)

* fix: narrow legacy pickle stdlib policy

Remove broad CRITICAL treatment for noisy stdlib modules in the root pickle scanner while preserving exact dangerous helper coverage and warning-level suspicious refs.

* fix: preserve exact stdlib helper detections

Keep logging configuration loaders and uuid subprocess getnode helpers as exact dangerous callables in both root and standalone pickle policies.

* fix: flag uuid getnode pickle calls

Keep uuid.getnode in the exact dangerous-call policy after narrowing broad uuid module severity, since it can dispatch to platform helper probes that invoke subprocess-backed paths.

Cover both root PickleScanner and standalone picklescan policy regressions.

* fix: route misnamed compressed wrappers by header (#904)

* fix: route misnamed compressed wrappers by header

Detect gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes even when the filename uses a misleading extension. Preserve existing joblib and R serialized scanner precedence, and add positive/negative routing regressions.

* fix: reject compressed wrapper raw trailers (#905)

* fix: reject compressed wrapper raw trailers

Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.

* fix: bound compressed decompression probes

Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.

* fix: bound zlib decompression probes

Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.

* fix: harden lz4 wrapper decoding (#906)

* fix: harden lz4 wrapper decoding

Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized.

Add fake-frame regressions for optional lz4 coverage.

* fix: support legacy lz4 chunk decoding

Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection.

Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.
mldangelo-oai added a commit that referenced this pull request Apr 11, 2026
* fix: harden standalone pickle scanner

* fix: fail closed on truncated pickle literal scans

* fix: fail closed on oversized encoded pickle literals

* fix: deep-freeze standalone pickle reports

* fix: mark incomplete nested pickle scans inconclusive

* fix: narrow standalone pickle wildcard globals

* fix: preserve deep nested pickle findings

* fix: bound nested pickle recursion resources

* fix: add standalone-primary pickle migration mode (#902)

* fix: add standalone-primary pickle migration mode

* fix: isolate standalone pickle primary fallback state

* fix: narrow legacy pickle stdlib policy (#903)

* fix: narrow legacy pickle stdlib policy

Remove broad CRITICAL treatment for noisy stdlib modules in the root pickle scanner while preserving exact dangerous helper coverage and warning-level suspicious refs.

* fix: preserve exact stdlib helper detections

Keep logging configuration loaders and uuid subprocess getnode helpers as exact dangerous callables in both root and standalone pickle policies.

* fix: flag uuid getnode pickle calls

Keep uuid.getnode in the exact dangerous-call policy after narrowing broad uuid module severity, since it can dispatch to platform helper probes that invoke subprocess-backed paths.

Cover both root PickleScanner and standalone picklescan policy regressions.

* fix: route misnamed compressed wrappers by header (#904)

* fix: route misnamed compressed wrappers by header

Detect gzip, bzip2, xz, lz4, and zlib wrappers from magic bytes even when the filename uses a misleading extension. Preserve existing joblib and R serialized scanner precedence, and add positive/negative routing regressions.

* fix: reject compressed wrapper raw trailers (#905)

* fix: reject compressed wrapper raw trailers

Use bounded low-level bzip2 and xz decompression so valid first members cannot hide raw unscanned trailing bytes. Preserve concatenated valid-member support and add fail-closed regressions.

* fix: bound compressed decompression probes

Cap bzip2/xz per-call decompression max_length to a chunk-sized probe so a single compressed chunk cannot allocate up to the full remaining scan budget before limit checks run.

* fix: bound zlib decompression probes

Use the same chunk-sized decompression probe for zlib streams so per-call output is bounded before size and ratio limit checks run.

* fix: harden lz4 wrapper decoding (#906)

* fix: harden lz4 wrapper decoding

Use the lz4 frame decompressor API through the bounded concatenated-stream loop so raw trailers fail closed and per-call output probes stay chunk-sized.

Add fake-frame regressions for optional lz4 coverage.

* fix: support legacy lz4 chunk decoding

Fall back to lz4.frame decompression contexts when LZ4FrameDecompressor is unavailable while keeping bounded output probes and raw-trailer rejection.

Add fallback-specific lz4 regressions for bounded reads, raw trailers, and concatenated frames.

* docs: normalize unreleased changelog

* fix: address pickle scanner review feedback

* fix: surface nested pickle incomplete notices