
perf: optimize model scan hot paths #1012

Merged
mldangelo-oai merged 3 commits into main from mdangelo/codex/perf-hot-paths on Apr 16, 2026

Conversation

@mldangelo-oai
Contributor

Summary

  • Add a repeatable profiling harness (scripts/profile_scan.py) and cold-cache scan benchmarks, including a skipped-file prefilter microbenchmark.
  • Optimize content sniffing for skipped-extension files with a bounded header/pickle prefilter: disguised pickle, archive, and model payloads are still routed to scanners, while ordinary text/media files no longer pay for full format detection.
  • Reduce license metadata rescans by sharing header reads, prefiltering license regexes, and reusing collected license metadata in warning checks.
  • Reuse scanner-computed SHA-256 hashes for single-file aggregate hashes, and reuse core format validation inside scanner path checks.
  • Tighten pickle raw detector seed checks and use larger ZIP/TAR extraction copy chunks.
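The bounded prefilter idea above can be sketched roughly as follows. This is a minimal illustration: the constant, the magic lists, and the function name `looks_like_model_payload` are hypothetical, not the PR's actual implementation.

```python
# Sketch of a bounded prefilter for files with "skippable" extensions.
# All names and byte values here are illustrative, not the PR's real ones.
PREFILTER_READ_BYTES = 512  # bounded header read; no full-format detection

# Magics that mean "do not skip": archive and model containers.
SUSPICIOUS_MAGICS = (
    b"PK\x03\x04",            # ZIP (e.g. PyTorch .pt containers)
    b"\x1f\x8b",              # gzip
    b"7z\xbc\xaf\x27\x1c",    # 7-Zip
)
# Pickle protocols 2-5 start with 0x80 followed by the protocol byte.
PICKLE_PROTO_PREFIXES = tuple(bytes([0x80, v]) for v in range(2, 6))

def looks_like_model_payload(path: str) -> bool:
    """Return True when a 'skippable' file may hide a pickle/archive payload."""
    with open(path, "rb") as f:
        header = f.read(PREFILTER_READ_BYTES)
    if header.startswith(PICKLE_PROTO_PREFIXES):
        return True
    return any(header.startswith(magic) for magic in SUSPICIOUS_MAGICS)
```

Plain text and media files fail both checks after a single small read, which is where the skipped-file speedup would come from.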

Benchmarks

Compared against the pre-change benchmark JSON captured during the audit:

  • Shared benchmark aggregate median: 397.99ms -> 108.35ms (-72.8%).
  • Duplicate directory: 250.02ms -> 33.36ms (-86.7%).
  • Mixed directory: 98.24ms -> 44.11ms (-55.1%).
  • Safe pickle scan: 19.23ms -> 7.44ms (-61.3%).
  • PyTorch ZIP scan: 23.34ms -> 16.41ms (-29.7%).
  • New skipped plain-text prefilter benchmark: median 9.17ms for 256 files.

Command:

uv run --locked --with pytest-benchmark pytest tests/benchmarks/test_scan_benchmarks.py tests/benchmarks/test_picklescan_benchmarks.py --benchmark-json=/tmp/modelaudit-bench-pr-final.json -q
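For comparing two of these benchmark JSON captures by hand, a small helper along these lines works. It is a sketch against pytest-benchmark's JSON layout (`benchmarks[*].fullname` and `benchmarks[*].stats.median`); the function name is hypothetical.

```python
import json

def median_deltas(baseline_path: str, current_path: str) -> dict[str, float]:
    """Percent change in median per shared benchmark between two
    pytest-benchmark JSON files (negative = faster)."""
    def medians(path: str) -> dict[str, float]:
        with open(path) as f:
            data = json.load(f)
        return {b["fullname"]: b["stats"]["median"] for b in data["benchmarks"]}

    base, cur = medians(baseline_path), medians(current_path)
    # Only benchmarks present in both files are comparable.
    return {
        name: (cur[name] - base[name]) / base[name] * 100.0
        for name in base.keys() & cur.keys()
    }
```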

Validation

  • uv run ruff format --check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ scripts/profile_scan.py
  • uv run ruff check modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/ scripts/profile_scan.py
  • uv run mypy modelaudit/ packages/modelaudit-picklescan/src packages/modelaudit-picklescan/tests tests/
  • PROMPTFOO_DISABLE_TELEMETRY=1 uv run pytest -n auto -m "not slow and not integration" --maxfail=1
    • Result: 3984 passed, 78 skipped, 21 warnings
  • Focused routing/archive/nested pickle tests passed.

Reduce repeated license scans, cheaper skipped-file detection, reuse scan metadata for validation/hash paths, and add profiling/benchmark coverage.
@github-actions
Contributor

github-actions bot commented Apr 15, 2026

Workflow run and artifacts

Performance Benchmarks

Compared 18 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 4 improved, 14 stable, 1 new, 0 missing.
Aggregate shared-benchmark median: 694.40ms -> 176.68ms (-74.6%).

Top improvements:

  • tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory -90.2% (481.32ms -> 47.05ms, duplicate-corpus, size=840.0 KiB, files=81)
  • tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle -81.2% (160.8us -> 30.2us, safe_model.pkl, size=49.4 KiB, files=1)
  • tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle -60.5% (27.88ms -> 11.00ms, safe_model.pkl, size=49.4 KiB, files=1)
| Benchmark | Target | Size | Files | Baseline | Current | Change | Status |
| --- | --- | --- | --- | --- | --- | --- | --- |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory | duplicate-corpus | 840.0 KiB | 81 | 481.32ms | 47.05ms | -90.2% | improved |
| tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 160.8us | 30.2us | -81.2% | improved |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle | safe_model.pkl | 49.4 KiB | 1 | 27.88ms | 11.00ms | -60.5% | improved |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory | mixed-corpus | 1.7 MiB | 54 | 136.76ms | 73.92ms | -45.9% | improved |
| tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 34.18ms | 30.61ms | -10.4% | stable |
| tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip | state_dict.pt | 1.5 MiB | 1 | 47.9us | 52.7us | +9.8% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_opcode_budget_tail_payload | opcode_budget_tail | 14 B | 1 | 71.1us | 74.6us | +5.0% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_multi_stream_padded_payload | multi_stream_padded | 4.1 KiB | 1 | 123.5us | 128.1us | +3.7% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_hex] | nested_hex | 130 B | 1 | 130.4us | 133.3us | +2.2% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_hidden_suspicious_string_budget | hidden_suspicious_string | 8.0 KiB | 1 | 548.4us | 538.5us | -1.8% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_large] | safe_large | 278.2 KiB | 1 | 4.24ms | 4.31ms | +1.6% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_small] | safe_small | 68 B | 1 | 58.5us | 57.5us | -1.6% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_base64] | nested_base64 | 98 B | 1 | 117.7us | 116.0us | -1.5% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_stream | chunked_stream | 278.2 KiB | 1 | 7.47ms | 7.36ms | -1.5% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[stack_global] | stack_global | 21 B | 1 | 67.2us | 68.1us | +1.4% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[malicious_reduce] | malicious_reduce | 52 B | 1 | 81.7us | 80.8us | -1.2% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_raw] | nested_raw | 78 B | 1 | 99.8us | 100.9us | +1.1% | stable |
| tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[long_benign_string] | long_benign_string | 1.0 MiB | 1 | 1.04ms | 1.04ms | -0.1% | stable |

New benchmarks:

  • tests/benchmarks/test_scan_benchmarks.py::test_skip_filter_plain_text_files


@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 24948dc0b4

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread modelaudit/utils/file/detection.py Outdated
Comment on lines +988 to +992
if size >= _TAR_USTAR_MIN_BYTES and len(prefix) >= _TAR_USTAR_MIN_BYTES:
    tar_magic_start = _TAR_USTAR_OFFSET
    tar_magic_end = tar_magic_start + _TAR_USTAR_MAGIC_SIZE
    if prefix[tar_magic_start:tar_magic_end].startswith(b"ustar"):
        return "tar"

P1: Fall back to tar probe for non-ustar archives

detect_file_format_for_skip_filter only recognizes uncompressed TAR via the ustar marker and only calls _is_tar_archive() for compressed codecs. Valid TAR variants without ustar (e.g., v7-style) will be returned as unknown, so skipped-extension files (like *.jpg) can be incorrectly skipped instead of preserved for archive scanning.

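The fallback the reviewer asks for can be sketched like this; it is illustrative only, using the standard library's tarfile in place of the repo's _is_tar_archive, and the constant and function names are hypothetical.

```python
import tarfile

TAR_USTAR_OFFSET = 257  # offset of the POSIX ustar magic within the header

def is_probably_tar(path: str) -> bool:
    """Fast ustar check, falling back to tarfile for pre-POSIX (v7) archives."""
    with open(path, "rb") as f:
        header = f.read(TAR_USTAR_OFFSET + 5)
    if header[TAR_USTAR_OFFSET:TAR_USTAR_OFFSET + 5] == b"ustar":
        return True
    # v7-style archives carry no ustar magic; let tarfile validate the header.
    return tarfile.is_tarfile(path)
```

The fast path still answers most files from the bounded header read; only files that fail the ustar check pay for the full tarfile probe.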

Comment thread modelaudit/utils/file/detection.py Outdated
  if magic8.startswith(_SEVENZIP_MAGIC):
      return "sevenzip"
- if _is_tar_archive(path):
+ if _has_tar_ustar_signature(file_path, size):

P1: Restore TAR routing for extensionless non-ustar files

In detect_file_format, replacing _is_tar_archive(path) with _has_tar_ustar_signature(...) removes content-based TAR detection for valid non-ustar archives when the filename lacks a tar extension. Those files now resolve to unknown and may bypass TarScanner, missing nested-member security checks.



@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 5dae8d24f2



def _tar_octal_value(field: bytes) -> int | None:
    stripped = field.split(b"\0", 1)[0].strip()
    if not stripped or any(byte < ord("0") or byte > ord("7") for byte in stripped):

P1: Accept blank TAR numeric fields in v7 checksum probe

Fresh evidence: valid v7 TAR headers with space-padded uid/gid fields (e.g. b" \0") pass tarfile.is_tarfile, but _tar_octal_value treats those fields as invalid. That makes _has_valid_tar_checksum_header fail and detect_file_format / detect_file_format_for_skip_filter return unknown for extensionless or disguised TARs, so archive scanning can be skipped.


@mldangelo-oai mldangelo-oai enabled auto-merge (squash) April 15, 2026 13:53

@chatgpt-codex-connector (bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17f705f57c


Comment on lines +1046 to +1050
lightgbm_probe_size = min(size, _LIGHTGBM_SIGNATURE_READ_BYTES)
if len(prefix) < lightgbm_probe_size:
    prefix += f.read(lightgbm_probe_size - len(prefix))
if _is_lightgbm_signature(prefix):
    return "lightgbm"

P1: Add CNTK/Torch7 probes to skip-filter detector

should_skip_file() now depends on detect_file_format_for_skip_filter(), but this fast path falls through to LightGBM/pickle/XML checks without running _is_cntk_signature or _is_torch7_signature. Disguised CNTK/Torch7 payloads under skipped suffixes (e.g. .jpg) can therefore resolve to unknown, get skipped, and bypass their scanners. detect_file_format() previously retained these marker-based detections.


@mldangelo-oai mldangelo-oai merged commit 6a0c53a into main Apr 16, 2026
27 checks passed
@mldangelo-oai mldangelo-oai deleted the mdangelo/codex/perf-hot-paths branch April 16, 2026 06:29
@github-actions github-actions bot mentioned this pull request Apr 16, 2026
