fix(security): bound pickle metadata reads in metadata extraction (#712)
Conversation

No actionable comments were generated in the recent review. 🎉
Walkthrough

Adds a configurable, validated read-size cap (`max_metadata_pickle_read_size`, default 10 MiB) to pickle metadata extraction in `PickleScanner.extract_metadata()`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Scanner as PickleScanner
    participant Config as Configuration
    participant FileIO as File I/O
    participant Validator as Validator
    participant Metadata as MetadataExtractor
    Scanner->>Config: read max_metadata_pickle_read_size
    Config-->>Scanner: return limit (clamped to 10 MiB)
    Scanner->>FileIO: stat(file) -> size
    FileIO-->>Scanner: return size
    Scanner->>Validator: ensure limit > 0
    Validator-->>Scanner: ok / raise ValueError
    Scanner->>Validator: compare file size <= limit
    alt file size > limit
        Validator-->>Scanner: raise ValueError
    else file size <= limit
        Scanner->>FileIO: read up to (limit + 1) bytes
        FileIO-->>Scanner: return pickle_data
        Scanner->>Validator: ensure len(pickle_data) <= limit
        alt read exceeded limit
            Validator-->>Scanner: raise ValueError
        else within limit
            Scanner->>Scanner: infer pickle_protocol from pickle_data
            Scanner->>Metadata: extract metadata from limited data
            Metadata-->>Scanner: return metadata
        end
    end
```
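The flow in the diagram can be sketched in Python. This is a minimal, self-contained sketch, not the actual `PickleScanner` code: the helper name `read_pickle_metadata_bytes` is illustrative, while the `max_metadata_pickle_read_size` config key and the 10 MiB default/cap come from the PR itself.

```python
import os
import tempfile

HARD_CAP = 10 * 1024 * 1024  # documented 10 MiB default and hard upper bound


def read_pickle_metadata_bytes(file_path, config=None):
    """Read at most a bounded number of bytes of pickle data (sketch)."""
    config = config or {}
    limit = int(config.get("max_metadata_pickle_read_size", HARD_CAP))
    if limit <= 0:  # reject non-positive limits up front
        raise ValueError("max_metadata_pickle_read_size must be positive")
    limit = min(limit, HARD_CAP)  # clamp caller-supplied limits to 10 MiB
    if os.path.getsize(file_path) > limit:
        raise ValueError(f"file exceeds metadata read limit of {limit} bytes")
    with open(file_path, "rb") as f:
        pickle_data = f.read(limit + 1)  # one extra byte detects an overrun
    if len(pickle_data) > limit:
        raise ValueError("read exceeded metadata limit")
    return pickle_data


# Demo on a 128-byte file: a 256-byte limit succeeds, a 64-byte limit rejects.
with tempfile.NamedTemporaryFile(suffix=".pkl", delete=False) as tmp:
    tmp.write(b"\x80\x04" + b"x" * 126)
    path = tmp.name
ok = read_pickle_metadata_bytes(path, {"max_metadata_pickle_read_size": 256})
try:
    read_pickle_metadata_bytes(path, {"max_metadata_pickle_read_size": 64})
    rejected = False
except ValueError:
    rejected = True
os.unlink(path)
```

In the real scanner the rejection surfaces as an `extraction_error` entry rather than an uncaught exception, per the PR description.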
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Poem
🚥 Pre-merge checks: ✅ 3 passed
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around lines 6129-6148: The code allows a non-positive `max_metadata_read_size`, which causes `f.read(-1)` to read the entire file and bypass the guard. In the pickle metadata read block (where `max_metadata_read_size`, `get_file_size`, `open(file_path)`, and `pickle_data` are used), reject non-positive values or coerce them to a safe default (e.g., `10 * 1024 * 1024`) before reading. Specifically, validate `max_metadata_read_size > 0` at the start of the try block and, if it is <= 0, raise a `ValueError` (or set it to the documented default) so that `f.read` is always called with a bounded positive length and the subsequent length checks remain effective.
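The hazard this comment describes is a quirk of Python file objects: `read(n)` with any negative `n` means "read to EOF". A minimal sketch of the suggested guard (function name is illustrative, not the scanner's actual code):

```python
import io


def read_metadata_bytes(f, max_metadata_read_size: int) -> bytes:
    # Without this guard, a caller-supplied -1 (or 0, after the +1 below
    # still bounded, but -1 is not) would make f.read return the entire
    # stream, silently bypassing the size limit.
    if max_metadata_read_size <= 0:
        raise ValueError("max_metadata_read_size must be positive")
    return f.read(max_metadata_read_size + 1)


# f.read(-1) reads everything, which is the behaviour the guard prevents:
whole = io.BytesIO(b"x" * 100).read(-1)
bounded = read_metadata_bytes(io.BytesIO(b"x" * 100), 10)
```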
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: ASSERTIVE
Plan: Pro
Run ID: 2bc940b8-4f9f-4bb4-9f56-2b5dbb99ec89
📒 Files selected for processing (2)
modelaudit/scanners/pickle_scanner.py
tests/test_metadata_extractor.py
Actionable comments posted: 1
Inline comments:
In `@tests/test_metadata_extractor.py`:
- Around lines 469-489: The parametrized test `test_pickle_metadata_enforces_read_limit` currently only asserts failure cases. Add a valid success case (e.g., limit 256) to the `("limit", "expected_error")` matrix and update the assertions so that, when limit is the success value, the test asserts that `PickleScanner({"max_metadata_pickle_read_size": limit}).extract_metadata(...)` does not produce an `"extraction_error"` (i.e., `assert "extraction_error" not in metadata`), using the same tmp_path pkl_file setup and referencing `PickleScanner` and the `max_metadata_pickle_read_size` config key.
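The shape of the suggested matrix, with both failure and success rows, can be sketched without pytest or the real `PickleScanner` (the `extract_metadata` stand-in below only mimics the limit behaviour described in this PR on in-memory bytes):

```python
import pickle


def extract_metadata(data: bytes, limit: int) -> dict:
    """Stand-in for the scanner's limit handling: errors are caught and
    surfaced as an "extraction_error" entry, as the PR description says."""
    try:
        if limit <= 0 or len(data) > limit:
            raise ValueError(f"pickle exceeds metadata read limit of {limit}")
        # Trivial protocol inference from the PROTO opcode, for illustration.
        protocol = data[1] if data[:1] == b"\x80" else 0
        return {"pickle_protocol": protocol}
    except ValueError as exc:
        return {"extraction_error": str(exc)}


payload = pickle.dumps({"k": "v"})  # well under 64 bytes

# ("limit", "expected_error") matrix with a success case, per the review.
cases = [
    (256, None),       # success: comfortably above the payload size
    (64, None),        # success: just above the payload size
    (8, "limit"),      # failure: payload larger than limit
    (0, "limit"),      # failure: non-positive limit is rejected
]
for limit, expected_error in cases:
    metadata = extract_metadata(payload, limit)
    if expected_error is None:
        assert "extraction_error" not in metadata
    else:
        assert "extraction_error" in metadata
```

In the actual test this loop would be a `@pytest.mark.parametrize` over the same matrix, with the payload written to a `tmp_path` pkl_file.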
Actionable comments posted: 1
Inline comments:
In `@modelaudit/scanners/pickle_scanner.py`:
- Around lines 6885-6895: The code reads `max_metadata_read_size` from `self.config` but only rejects non-positive values, allowing arbitrarily large sizes. Change the logic around `max_metadata_read_size` in pickle_scanner.py (the variable `max_metadata_read_size` and the `config.get` call) so that, after parsing, a hard upper bound of `10 * 1024 * 1024` (10 MiB) is enforced: validate `> 0`, then apply `max_metadata_read_size = min(max_metadata_read_size, 10 * 1024 * 1024)` (or replace oversized values with the cap) to ensure metadata reads cannot exceed 10 MiB.
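The validate-then-clamp step the comment asks for is small; a sketch (the helper name `resolve_metadata_limit` is illustrative, the config key and 10 MiB cap are the PR's):

```python
HARD_CAP = 10 * 1024 * 1024  # 10 MiB


def resolve_metadata_limit(config: dict) -> int:
    """Parse the limit, reject non-positive values, clamp oversized ones."""
    limit = int(config.get("max_metadata_pickle_read_size", HARD_CAP))
    if limit <= 0:
        raise ValueError("max_metadata_pickle_read_size must be positive")
    return min(limit, HARD_CAP)  # oversized values fall back to the cap
```

With this shape, a caller passing 1 GiB still gets a 10 MiB read, while the earlier non-positive guard remains intact.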
…bility-in-pickle-extraction
Clamp caller-supplied metadata read limits to 10 MiB, keep malformed limit parsing inside extract_metadata error handling, and cover the success and hard-cap paths in regression tests. Co-authored-by: Codex <noreply@openai.com>
…bility-in-pickle-extraction
Motivation

Avoid an unbounded `f.read()` on arbitrarily large `.pkl` files.

Description

- Bound metadata reads in `PickleScanner.extract_metadata()` using a new config key `max_metadata_pickle_read_size` with a default of 10 MiB, and raise a `ValueError` when the file or read exceeds the limit.
- Read at most `max_metadata_pickle_read_size + 1` bytes and surface a clear `extraction_error` when the limit is exceeded, to avoid large allocations while preserving opcode analysis for small files.
- Add `test_pickle_metadata_enforces_read_limit` to `tests/test_metadata_extractor.py`, verifying that oversized pickle metadata extraction is rejected.

Testing

- `uv run ruff format modelaudit/ tests/` (reformatted 2 files) and `uv run ruff check --fix modelaudit/ tests/` (passed).
- `uv run mypy modelaudit/` (passed with no issues).
- A targeted check that `PickleScanner({'max_metadata_pickle_read_size': 64}).extract_metadata()` returns an `extraction_error` for a 128-byte file, which passed.
- `uv run pytest -n auto -m "not slow and not integration" --maxfail=1` was attempted but encountered unrelated, pre-existing failures in other tests; the change-specific checks and linters passed.

Codex Task
Summary by CodeRabbit
New Features
Tests