
fix(security): bound pickle metadata reads in metadata extraction #712

Merged

mldangelo-oai merged 8 commits into main from codex/fix-dos-vulnerability-in-pickle-extraction on Mar 20, 2026

Conversation

mldangelo (Member) commented Mar 16, 2026

Motivation

  • Prevent unbounded memory reads during metadata extraction of pickle files, which previously allowed a denial of service by calling f.read() on arbitrarily large .pkl files.
  • Ensure metadata path respects existing file-size safeguards without changing scanner behavior for normal files.

Description

  • Add a bounded-read guard in PickleScanner.extract_metadata() using a new config key, max_metadata_pickle_read_size (default 10 MiB), and raise a ValueError when the file or the read exceeds the limit.
  • Read at most max_metadata_pickle_read_size + 1 bytes and surface a clear extraction_error when the limit is exceeded, avoiding large allocations while preserving opcode analysis for small files.
  • Add a regression test test_pickle_metadata_enforces_read_limit to tests/test_metadata_extractor.py that verifies oversized pickle metadata extraction is rejected.
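
The bounded-read guard described above can be sketched as follows. The config key max_metadata_pickle_read_size and the 10 MiB default come from this PR; the function name, error messages, and protocol inference here are illustrative assumptions, not the merged code.

```python
# Illustrative sketch of the bounded-read guard; the real implementation lives
# in PickleScanner.extract_metadata() and may differ in structure.
import os

DEFAULT_MAX_METADATA_READ = 10 * 1024 * 1024  # 10 MiB default and hard ceiling


def extract_pickle_metadata(file_path, config):
    limit = int(config.get("max_metadata_pickle_read_size", DEFAULT_MAX_METADATA_READ))
    limit = min(limit, DEFAULT_MAX_METADATA_READ)  # clamp caller-supplied limits
    if limit <= 0:
        raise ValueError("max_metadata_pickle_read_size must be positive")

    # Reject oversized files up front instead of allocating their contents.
    if os.path.getsize(file_path) > limit:
        raise ValueError(f"pickle file exceeds metadata read limit of {limit} bytes")

    with open(file_path, "rb") as f:
        data = f.read(limit + 1)  # one extra byte catches files that grew after stat
    if len(data) > limit:
        raise ValueError(f"pickle read exceeded limit of {limit} bytes")

    # Protocol >= 2 pickles start with the PROTO opcode (0x80) plus a version byte.
    protocol = data[1] if data[:1] == b"\x80" else 0
    return {"pickle_size": len(data), "pickle_protocol": protocol}
```

The key property is that no code path ever passes an unbounded (or negative) size to f.read(), so memory use is capped regardless of the on-disk file size.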

Testing

  • Ran formatting and linting: uv run ruff format modelaudit/ tests/ (reformatted 2 files) and uv run ruff check --fix modelaudit/ tests/ (passed).
  • Type checking: uv run mypy modelaudit/ (passed with no issues).
  • Added a regression unit test and validated the behavior with a focused runtime check (a small Python snippet asserting that PickleScanner({'max_metadata_pickle_read_size': 64}).extract_metadata() returns an extraction_error for a 128-byte file), which passed.
  • The full test suite (uv run pytest -n auto -m "not slow and not integration" --maxfail=1) was attempted but hit unrelated, pre-existing failures in other tests; all change-specific checks and linters passed.

Codex Task

Summary by CodeRabbit

  • New Features

    • Added a configurable read limit for pickle metadata extraction (default 10 MB). Extraction now enforces a positive limit, fails closed on out‑of‑range values, and clamps any caller-supplied limit to the 10 MB ceiling.
  • Tests

    • Added parameterized tests validating enforcement for valid, zero, negative, and oversized limits, and updated an existing pickle-related test to reflect the new read-limit behavior.

coderabbitai (Contributor) bot commented Mar 16, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 492fa2cd-2208-432a-a010-b6329f8a83c6

📥 Commits

Reviewing files that changed from the base of the PR and between 93c38ad and 7cb61a3.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_metadata_extractor.py

Walkthrough

Adds a configurable, validated read-size cap (max_metadata_pickle_read_size, default capped at 10 MiB) to pickle metadata extraction: stat the file, enforce the cap (>0), perform a bounded read (limit + 1) to detect over-limit content, compute metadata from the truncated bytes, and raise ValueError on violations.

Changes

  • Pickle Scanner — modelaudit/scanners/pickle_scanner.py: Add max_metadata_pickle_read_size config (clamped to 10 MiB). Validate > 0, compare the on-disk size to the cap, read up to cap + 1 bytes to detect overflow, set pickle_size/pickle_protocol from the (possibly truncated) bytes, and raise ValueError when limits are exceeded.
  • Tests — tests/test_metadata_extractor.py: Add parameterized test_pickle_metadata_enforces_read_limit (cases: 256, 64, 0, -1) asserting success for valid limits and specific extraction errors for invalid/non-positive limits. Add test_pickle_metadata_caps_configured_read_limit_at_10_mib to verify clamping to 10 MiB. Existing dangerous-opcode tests unchanged.

Sequence Diagram

sequenceDiagram
    participant Scanner as PickleScanner
    participant Config as Configuration
    participant FileIO as File I/O
    participant Validator as Validator
    participant Metadata as MetadataExtractor

    Scanner->>Config: read max_metadata_pickle_read_size
    Config-->>Scanner: return limit (clamped to 10 MiB)

    Scanner->>FileIO: stat(file) -> size
    FileIO-->>Scanner: return size

    Scanner->>Validator: ensure limit > 0
    Validator-->>Scanner: ok / raise ValueError

    Scanner->>Validator: compare file size <= limit
    alt file size > limit
        Validator-->>Scanner: raise ValueError
    else file size <= limit
        Scanner->>FileIO: read up to (limit + 1) bytes
        FileIO-->>Scanner: return pickle_data

        Scanner->>Validator: ensure len(pickle_data) <= limit
        alt read exceeded limit
            Validator-->>Scanner: raise ValueError
        else within limit
            Scanner->>Scanner: infer pickle_protocol from pickle_data
            Scanner->>Metadata: extract metadata from limited data
            Metadata-->>Scanner: return metadata
        end
    end

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes with careful paws,
A cap in place to mind the laws.
Read one extra to catch the leak,
If limits break, I sound the beak.
Hop, scan, and stash the tidy freak.

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
  • Description Check — Passed (check skipped: CodeRabbit's high-level summary is enabled)
  • Title check — Passed (the title clearly and concisely summarizes the main security fix: bounding pickle metadata reads to prevent DoS during metadata extraction)
  • Docstring Coverage — Passed (coverage is 100.00%, which is sufficient; the required threshold is 80.00%)


coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6129-6148: The code allows non-positive max_metadata_read_size
which causes f.read(-1) to read the entire file and bypass the guard; change the
logic in the pickle metadata read block (where max_metadata_read_size,
get_file_size, open(file_path) and pickle_data are used) to reject non-positive
values or coerce them to a safe default (e.g., 10*1024*1024) before reading;
specifically, validate max_metadata_read_size > 0 at the start of the try block
and if it is <= 0 raise a ValueError (or set it to the documented default) so
that f.read is always called with a bounded positive length and the subsequent
length checks remain effective.
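
The pitfall this review flags follows directly from Python's file semantics: read(size) with a negative size means "read to EOF", so a non-positive limit silently bypasses any byte cap. A minimal demonstration (io.BytesIO shares the same read() contract as an on-disk binary file):

```python
# Demonstrates why a non-positive read size defeats the guard:
# read(-1) reads the ENTIRE stream, read(0) reads nothing.
import io

f = io.BytesIO(b"x" * 1024)
print(len(f.read(-1)))  # 1024 — negative size means "read everything"

f.seek(0)
print(len(f.read(0)))   # 0 — a zero read makes later length checks pass vacuously
```

This is why the fix validates the limit before the read: only then is f.read() guaranteed to be called with a bounded positive length.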

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 2bc940b8-4f9f-4bb4-9f56-2b5dbb99ec89

📥 Commits

Reviewing files that changed from the base of the PR and between d9fe283 and b639d2d.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_metadata_extractor.py

coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/test_metadata_extractor.py`:
- Around line 469-489: The parametrized test
test_pickle_metadata_enforces_read_limit currently only asserts failure cases;
add a valid success case (e.g., limit 256) to the ("limit", "expected_error")
matrix and update assertions so that when limit is the success value the test
asserts that PickleScanner({"max_metadata_pickle_read_size":
limit}).extract_metadata(...) does not contain "extraction_error" (i.e., assert
"extraction_error" not in metadata), using the same tmp_path pkl_file setup and
referencing PickleScanner and the max_metadata_pickle_read_size config key.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 4e0768a7-e299-4842-8114-1b30f84ba28b

📥 Commits

Reviewing files that changed from the base of the PR and between b639d2d and d3ff557.

📒 Files selected for processing (2)
  • modelaudit/scanners/pickle_scanner.py
  • tests/test_metadata_extractor.py

coderabbitai (Contributor) bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around line 6885-6895: The code reads max_metadata_read_size from self.config
but only rejects non-positive values and allows arbitrarily large sizes; change
the logic around max_metadata_read_size in pickle_scanner.py (the variable
max_metadata_read_size, the config.get call) so that after parsing you enforce a
hard upper bound of 10 * 1024 * 1024 (10 MiB) — e.g., validate > 0 then apply
max_metadata_read_size = min(max_metadata_read_size, 10 * 1024 * 1024) (or
replace oversized values with the cap) to ensure metadata reads cannot exceed 10
MiB.
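
The clamp this review asks for is a one-line min(). A sketch using the naming from the comment (the merged code may structure this differently):

```python
# Sketch of the suggested validation: reject non-positive limits, then clamp
# any caller-supplied value to the documented 10 MiB ceiling.
HARD_CAP = 10 * 1024 * 1024  # 10 MiB


def bounded_read_limit(configured: int) -> int:
    if configured <= 0:
        raise ValueError("max_metadata_read_size must be positive")
    return min(configured, HARD_CAP)


print(bounded_read_limit(4096))    # 4096 — small limits pass through unchanged
print(bounded_read_limit(10**12))  # 10485760 — oversized limits clamp to 10 MiB
```

Clamping (rather than rejecting) oversized values keeps existing configurations working while still guaranteeing the metadata path can never allocate more than 10 MiB.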

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 38d7bd2d-53f6-42de-9347-8bad4c7dc0e8

📥 Commits

Reviewing files that changed from the base of the PR and between d3ff557 and 93c38ad.

📒 Files selected for processing (1)
  • modelaudit/scanners/pickle_scanner.py

mldangelo and others added 4 commits March 18, 2026 07:02
Clamp caller-supplied metadata read limits to 10 MiB, keep malformed limit parsing inside extract_metadata error handling, and cover the success and hard-cap paths in regression tests.

Co-authored-by: Codex <noreply@openai.com>
mldangelo-oai merged commit f1d0698 into main on Mar 20, 2026
5 of 6 checks passed
mldangelo-oai deleted the codex/fix-dos-vulnerability-in-pickle-extraction branch on March 20, 2026