perf: optimize model scan hot paths#1012
Conversation
Reduce repeated license scans, cheaper skipped-file detection, reuse scan metadata for validation/hash paths, and add profiling/benchmark coverage.
Performance BenchmarksCompared Top improvements:
New benchmarks:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24948dc0b4
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if size >= _TAR_USTAR_MIN_BYTES and len(prefix) >= _TAR_USTAR_MIN_BYTES: | ||
| tar_magic_start = _TAR_USTAR_OFFSET | ||
| tar_magic_end = tar_magic_start + _TAR_USTAR_MAGIC_SIZE | ||
| if prefix[tar_magic_start:tar_magic_end].startswith(b"ustar"): | ||
| return "tar" |
There was a problem hiding this comment.
Fall back to tar probe for non-ustar archives
detect_file_format_for_skip_filter only recognizes uncompressed TAR via the ustar marker and only calls _is_tar_archive() for compressed codecs. Valid TAR variants without ustar (e.g., v7-style) will be returned as unknown, so skipped-extension files (like *.jpg) can be incorrectly skipped instead of preserved for archive scanning.
Useful? React with 👍 / 👎.
| if magic8.startswith(_SEVENZIP_MAGIC): | ||
| return "sevenzip" | ||
| if _is_tar_archive(path): | ||
| if _has_tar_ustar_signature(file_path, size): |
There was a problem hiding this comment.
Restore TAR routing for extensionless non-ustar files
In detect_file_format, replacing _is_tar_archive(path) with _has_tar_ustar_signature(...) removes content-based TAR detection for valid non-ustar archives when the filename lacks a tar extension. Those files now resolve to unknown and may bypass TarScanner, missing nested-member security checks.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5dae8d24f2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
|
||
| def _tar_octal_value(field: bytes) -> int | None: | ||
| stripped = field.split(b"\0", 1)[0].strip() | ||
| if not stripped or any(byte < ord("0") or byte > ord("7") for byte in stripped): |
There was a problem hiding this comment.
Accept blank TAR numeric fields in v7 checksum probe
Fresh evidence: valid v7 TAR headers with space-padded uid/gid fields (e.g. b" \0") pass tarfile.is_tarfile, but _tar_octal_value treats those fields as invalid. That makes _has_valid_tar_checksum_header fail and detect_file_format / detect_file_format_for_skip_filter return unknown for extensionless or disguised TARs, so archive scanning can be skipped.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 17f705f57c
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| lightgbm_probe_size = min(size, _LIGHTGBM_SIGNATURE_READ_BYTES) | ||
| if len(prefix) < lightgbm_probe_size: | ||
| prefix += f.read(lightgbm_probe_size - len(prefix)) | ||
| if _is_lightgbm_signature(prefix): | ||
| return "lightgbm" |
There was a problem hiding this comment.
Add CNTK/Torch7 probes to skip-filter detector
should_skip_file() now depends on detect_file_format_for_skip_filter(), but this fast path falls through to LightGBM/pickle/XML checks without running _is_cntk_signature or _is_torch7_signature. Disguised CNTK/Torch7 payloads under skipped suffixes (e.g. .jpg) can therefore resolve to unknown, get skipped, and bypass their scanners. detect_file_format() previously retained these marker-based detections.
Useful? React with 👍 / 👎.
Summary
scripts/profile_scan.py) and cold-cache scan benchmarks, including a skipped-file prefilter microbenchmark.Benchmarks
Compared against the pre-change benchmark JSON captured during the audit:
397.99ms -> 108.35ms(-72.8%).250.02ms -> 33.36ms(-86.7%).98.24ms -> 44.11ms(-55.1%).19.23ms -> 7.44ms(-61.3%).23.34ms -> 16.41ms(-29.7%).9.17msfor 256 files.Command:
Validation
uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ scripts/profile_scan.pyuv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ scripts/profile_scan.pyuv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=13984 passed, 78 skipped, 21 warnings