fix: add long-line protection for matched_text to prevent output bloat by mstykow · Pull Request #799 · mstykow/provenant

mstykow · 2026-04-27T10:35:23Z

Summary

When scanning minified JS/CSS files, provenant could emit massively bloated matched_text output because whole-line extraction captured the full minified line for every match. On a single-line minified file, even a small match could therefore duplicate the entire file text across both per-file and top-level license detections.

Difference to ScanCode

ScanCode already protects against this by switching away from whole-line matched-text extraction for long-line queries. Provenant already had long-line handling in the matching pipeline, but it was still using whole-line extraction when rendering matched_text for output. That left the engine protected, but not the emitted license-text snippets.

This PR adopts the important part of that behavior: do not emit whole-line evidence for pathological long-line matches. It does not try to port all of ScanCode's broader preprocessing around compact text, because the goal here is better scan results and bounded output, not strict parity.
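As a rough illustration of the idea (not the code in this PR), the choice between whole-line and token-span extraction could look like the sketch below. The names `ExtractMode`, `choose_mode`, and `matched_text`, the `max_line_chars` parameter, and the use of byte offsets for the match span are all assumptions made for the example; the real implementation works on query tokens.

```rust
/// Hypothetical extraction modes for rendering matched text.
#[derive(Debug, PartialEq, Eq)]
enum ExtractMode {
    /// Expand the match to the full lines it touches.
    WholeLines,
    /// Emit only the matched span itself.
    TokenSpan,
}

/// Pick token-span mode as soon as any line exceeds `max_line_chars`
/// (an assumed threshold, not a value taken from the PR).
fn choose_mode(text: &str, max_line_chars: usize) -> ExtractMode {
    let has_long_line = text
        .lines()
        .any(|line| line.chars().count() > max_line_chars);
    if has_long_line {
        ExtractMode::TokenSpan
    } else {
        ExtractMode::WholeLines
    }
}

/// Return the text for a match covering bytes `start..end` of `text`.
/// Assumes both offsets fall on UTF-8 character boundaries.
fn matched_text(text: &str, start: usize, end: usize, max_line_chars: usize) -> &str {
    match choose_mode(text, max_line_chars) {
        ExtractMode::TokenSpan => &text[start..end],
        ExtractMode::WholeLines => {
            // Walk back to the start of the first matched line and
            // forward to the end of the last matched line.
            let line_start = text[..start].rfind('\n').map_or(0, |i| i + 1);
            let line_end = text[end..].find('\n').map_or(text.len(), |i| end + i);
            &text[line_start..line_end]
        }
    }
}
```

With this shape, a match inside a single-line minified file yields only the matched span, while a match in an ordinary source file still returns its full surrounding line.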

End state

regular files keep normal whole-line matched_text behavior
minified / oversized-line files use compact token-span matched_text output instead of whole-file line capture
when query/span data is unavailable, output falls back to UTF-8-safe bounded whole-line text with an explicit truncation marker
matched_text_diagnostics is capped as well so diagnostics do not reintroduce the same size problem
internal query semantics stay unchanged, so clue promotion and low-quality filtering keep their previous behavior
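The UTF-8-safe bounded fallback from the list above can be sketched as follows. This is a minimal sketch under stated assumptions, not the PR's `matched_text_from_text()`: the helper names and the marker string `" [truncated]"` are hypothetical, and only the 10,000-character limit comes from the commit message.

```rust
/// Character limit mentioned in the commit message.
const MAX_MATCHED_TEXT_CHARS: usize = 10_000;

/// Hypothetical marker text; the actual marker used by the PR is not shown here.
const TRUNCATION_MARKER: &str = " [truncated]";

/// Keep at most `max_chars` characters of `text`. The cut is located via
/// `char_indices`, so it always lands on a character boundary and never
/// splits a multi-byte UTF-8 sequence.
fn truncate_with_marker(text: &str, max_chars: usize) -> String {
    match text.char_indices().nth(max_chars) {
        // Fewer than or exactly `max_chars` characters: return unchanged.
        None => text.to_string(),
        // `byte_idx` is the byte offset of the first character to drop.
        Some((byte_idx, _)) => {
            let mut out = String::with_capacity(byte_idx + TRUNCATION_MARKER.len());
            out.push_str(&text[..byte_idx]);
            out.push_str(TRUNCATION_MARKER);
            out
        }
    }
}

/// Convenience wrapper applying the 10,000-character limit.
fn bounded_matched_text(text: &str) -> String {
    truncate_with_marker(text, MAX_MATCHED_TEXT_CHARS)
}
```

Counting characters rather than bytes keeps the limit independent of encoding width. The same helper could in principle be reused to cap diagnostics output.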

Tests

Added focused regression coverage for:

normal whole-line matched_text preservation
oversized long-line compaction
output-only fallback truncation when query is unavailable
full-fidelity Query::matched_text() behavior for long input
unchanged exact reference-URL clue promotion behavior

Expected-output fixture changes

None. Existing fixtures do not exercise license-text output on pathological long-line inputs.

For files with very long lines (e.g., minified JS/CSS), matched_text extraction now uses token-span mode instead of whole-line mode. This prevents capturing megabytes of text for a small license match on a single-line minified file. Also adds a 10,000-character safety truncation fallback in matched_text_from_text() for when token-span extraction is unavailable.
Signed-off-by: Maxim Stykow <stykowmaxim@meta.com>

Keep Query::matched_text and other internal consumers on full-line semantics while compacting oversized output-only matched_text snippets for long-line files. This avoids output bloat without changing clue promotion or low-quality filtering, and adds focused regression coverage for normal, long-line, and query-missing paths.
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>

mstykow and others added 2 commits April 27, 2026 12:34

mstykow merged commit 6751732 into main Apr 27, 2026
15 checks passed

mstykow deleted the fix/long-line-matched-text-protection branch April 27, 2026 12:00

mstykow merged 2 commits into main from fix/long-line-matched-text-protection
