Skip to content

fix: route UBJSON-format .bst files to UBJ scanner#1037

Merged
mldangelo-oai merged 1 commit intomainfrom
fix/xgboost-ubjson-bst-routing
Apr 16, 2026
Merged

fix: route UBJSON-format .bst files to UBJ scanner#1037
mldangelo-oai merged 1 commit intomainfrom
fix/xgboost-ubjson-bst-routing

Conversation

@mldangelo
Copy link
Copy Markdown
Member

Summary

  • Modern XGBoost (2.0+) saves .bst files in UBJSON format by default, not the legacy binary format
  • The binary structure validator didn't recognize UBJSON headers, so it marked these scans as binary_structure_unrecognized → inconclusive → success=False
  • Added _is_ubjson_file() detection that checks for UBJSON object header signature ({ + type marker)
  • When a .bst file is detected as UBJSON, it's routed to _scan_ubj_model() for proper analysis

Fixes the Python 3.12 CI failure: TestXGBoostScannerIntegration::test_real_xgboost_model_creation_and_scan was asserting bst_result.success which was False because the fail-closed logic from #1019 correctly rejected the unrecognized binary structure — but the file was actually valid UBJSON.

Test plan

  • test_real_xgboost_model_creation_and_scan passes (was failing)
  • All 45 xgboost scanner tests pass
  • Full fast test suite passes (4483 tests)
  • Slow/integration xgboost tests pass
  • Linting, formatting, and type checking clean

🤖 Generated with Claude Code

Modern XGBoost (2.0+) saves .bst files in UBJSON format by default.
The binary structure validator didn't recognize UBJSON, marking these
scans as inconclusive and failing closed. Detect UBJSON by its header
signature and route to the existing UBJ scanner for proper analysis.

Fixes Python 3.12 CI failure in
TestXGBoostScannerIntegration::test_real_xgboost_model_creation_and_scan.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

Workflow run and artifacts

Performance Benchmarks

Compared 19 shared benchmarks with a regression threshold of 15%.
Status: 0 regressions, 0 improved, 19 stable, 0 new, 0 missing.
Aggregate shared-benchmark median: 188.77ms -> 190.46ms (+0.9%).

Benchmark Target Size Files Baseline Current Change Status
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[stack_global] stack_global 21 B 1 71.9us 66.9us -6.9% stable
tests/benchmarks/test_scan_benchmarks.py::test_skip_filter_plain_text_files - 4.6 KiB 256 13.56ms 14.11ms +4.1% stable
tests/benchmarks/test_scan_benchmarks.py::test_validate_file_type_pytorch_zip state_dict.pt 1.5 MiB 1 52.4us 54.1us +3.2% stable
tests/benchmarks/test_scan_benchmarks.py::test_detect_file_format_safe_pickle safe_model.pkl 49.4 KiB 1 30.3us 31.3us +3.1% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_opcode_budget_tail_payload opcode_budget_tail 14 B 1 73.8us 71.5us -3.1% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_dangerous_global_payloads[malicious_reduce] malicious_reduce 52 B 1 80.5us 78.1us -2.9% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_base64] nested_base64 98 B 1 102.5us 105.2us +2.6% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_small] safe_small 68 B 1 56.8us 58.2us +2.3% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_raw] nested_raw 78 B 1 100.0us 102.1us +2.0% stable
tests/benchmarks/test_scan_benchmarks.py::test_scan_duplicate_directory duplicate-corpus 840.0 KiB 81 47.05ms 47.92ms +1.9% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[safe_large] safe_large 278.2 KiB 1 3.56ms 3.50ms -1.8% stable
tests/benchmarks/test_scan_benchmarks.py::test_scan_safe_pickle safe_model.pkl 49.4 KiB 1 10.86ms 11.01ms +1.3% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_hidden_suspicious_string_budget hidden_suspicious_string 8.0 KiB 1 586.1us 580.1us -1.0% stable
tests/benchmarks/test_scan_benchmarks.py::test_scan_pytorch_zip state_dict.pt 1.5 MiB 1 30.51ms 30.81ms +1.0% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_safe_payloads[long_benign_string] long_benign_string 1.0 MiB 1 1.09ms 1.08ms -0.5% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_multi_stream_padded_payload multi_stream_padded 4.1 KiB 1 134.5us 135.1us +0.4% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_chunked_stream chunked_stream 278.2 KiB 1 6.80ms 6.78ms -0.3% stable
tests/benchmarks/test_scan_benchmarks.py::test_scan_mixed_directory mixed-corpus 1.7 MiB 54 73.96ms 73.86ms -0.1% stable
tests/benchmarks/test_picklescan_benchmarks.py::test_picklescan_nested_payloads[nested_hex] nested_hex 130 B 1 106.8us 106.7us -0.1% stable

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 466858bcc8

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +514 to +516
if self._is_ubjson_file(path):
self._scan_ubj_model(path, result)
return
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve xgboost-load fallback when UBJSON decoder is missing

Routing every UBJSON-looking .bst file directly to _scan_ubj_model() introduces a regression for environments where enable_xgb_loading=True and ubjson is not installed: _scan_ubj_model() marks the scan inconclusive and returns, so _safe_xgboost_load() is never attempted even though the same XGBoost runtime can load modern UBJSON-backed .bst files. Before this change, those files could still be validated via the XGBoost load path; now they fail closed with xgboost_ubj_dependency_missing solely due to the optional Python UBJSON package.

Useful? React with 👍 / 👎.

@mldangelo-oai mldangelo-oai merged commit 52f869a into main Apr 16, 2026
27 checks passed
@mldangelo-oai mldangelo-oai deleted the fix/xgboost-ubjson-bst-routing branch April 16, 2026 14:13
@mldangelo
Copy link
Copy Markdown
Member Author

Addressed review feedback in 4113278:

  • Only routes to _scan_ubj_model() when ubjson is available
  • Skips binary-marker validation for detected UBJSON to avoid false binary_structure_unrecognized
  • Falls through to _safe_xgboost_load() when enable_xgb_loading=True but ubjson is missing
  • Emits explicit inconclusive with ubj_dependency_missing when both are unavailable

@github-actions github-actions bot mentioned this pull request Apr 16, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants