fix: add long-line protection for matched_text to prevent output bloat #799

Merged
mstykow merged 2 commits into main from fix/long-line-matched-text-protection
Apr 27, 2026

@mstykow mstykow commented Apr 27, 2026

Summary

When scanning minified JS/CSS files, provenant could emit massively bloated matched_text output because whole-line extraction captured the full minified line for every match. On a single-line minified file, even a small match could therefore duplicate the entire file text across both per-file and top-level license detections.

Difference to ScanCode

ScanCode already protects against this by switching away from whole-line matched-text extraction for long-line queries. Provenant already had long-line handling in the matching pipeline, but it was still using whole-line extraction when rendering matched_text for output. That left the engine protected, but not the emitted license-text snippets.

This PR adopts the important part of that behavior: do not emit whole-line evidence for pathological long-line matches. It does not try to port all of ScanCode's broader preprocessing around compact text, because the goal here is better scan results and bounded output, not strict parity.
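As a rough illustration of what "long-line" detection could look like, here is a minimal sketch. The function name `has_long_line` and the threshold `LONG_LINE_CHARS` are hypothetical and chosen only for this example; the PR does not state the value or the exact check that provenant or ScanCode use.

```rust
/// Hypothetical threshold for this sketch only; the real value is not stated in this PR.
const LONG_LINE_CHARS: usize = 1_000;

/// Returns true if any line of `text` has more than `max_chars` characters.
///
/// `nth(max_chars)` stops as soon as the limit is exceeded, so a
/// multi-megabyte minified line is not scanned to the end.
fn has_long_line(text: &str, max_chars: usize) -> bool {
    text.lines()
        .any(|line| line.chars().nth(max_chars).is_some())
}
```

Counting characters rather than bytes keeps the check independent of how many bytes each character occupies in UTF-8.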

End state

  • regular files keep normal whole-line matched_text behavior
  • minified / oversized-line files use compact token-span matched_text output instead of whole-file line capture
  • when query/span data is unavailable, output falls back to UTF-8-safe bounded whole-line text with an explicit truncation marker
  • matched_text_diagnostics is capped as well so diagnostics do not reintroduce the same size problem
  • internal query semantics stay unchanged, so clue promotion and low-quality filtering keep their previous behavior
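The first three bullets can be sketched roughly as follows. This is not the code from the PR: `TokenSpan`, `output_matched_text`, `truncate_utf8_safe`, and the marker text are hypothetical names invented for illustration, and the span is simplified to byte offsets into a single line. Only the 10,000-character cap comes from the commit message.

```rust
/// Cap from the commit message (10,000 characters).
const MAX_MATCHED_TEXT_CHARS: usize = 10_000;
/// Hypothetical marker text; the PR only says the marker is "explicit".
const TRUNCATION_MARKER: &str = " [... truncated]";

/// Hypothetical stand-in for a match's token span: byte offsets into the line.
struct TokenSpan {
    start: usize,
    end: usize,
}

/// Cut `text` after `max_chars` characters, always on a character boundary,
/// and append the marker only when something was actually removed.
fn truncate_utf8_safe(text: &str, max_chars: usize) -> String {
    match text.char_indices().nth(max_chars) {
        // `byte_idx` is where the first character beyond the cap starts.
        Some((byte_idx, _)) => format!("{}{}", &text[..byte_idx], TRUNCATION_MARKER),
        None => text.to_string(),
    }
}

/// Choose the text to emit for one match on one line.
fn output_matched_text(line: &str, span: Option<&TokenSpan>, is_long_line: bool) -> String {
    if !is_long_line {
        // Regular files: keep whole-line behavior.
        return line.to_string();
    }
    // Long lines: prefer the compact span. `str::get` returns None for
    // out-of-range or non-boundary offsets, so bad spans fall through.
    if let Some(text) = span.and_then(|s| line.get(s.start..s.end)) {
        return text.to_string();
    }
    // No usable span: bounded whole-line text with a truncation marker.
    truncate_utf8_safe(line, MAX_MATCHED_TEXT_CHARS)
}
```

In this sketch, a match inside a 40,000-character minified line yields only the spanned text when a span is available, and at most 10,000 characters plus the marker when it is not.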

Tests

Added focused regression coverage for:

  • normal whole-line matched_text preservation
  • oversized long-line compaction
  • output-only fallback truncation when query is unavailable
  • full-fidelity Query::matched_text() behavior for long input
  • unchanged exact reference-URL clue promotion behavior

Expected-output fixture changes

None. Existing fixtures do not exercise license-text output on pathological long-line inputs.

mstykow and others added 2 commits April 27, 2026 12:34
For files with very long lines (e.g., minified JS/CSS), matched_text
extraction now uses token-span mode instead of whole-line mode. This
prevents capturing megabytes of text for a small license match on a
single-line minified file.

Also adds a 10,000-character safety truncation fallback in
matched_text_from_text() for when token-span extraction is unavailable.

Signed-off-by: Maxim Stykow <stykowmaxim@meta.com>
Keep Query::matched_text and other internal consumers on full-line semantics while compacting oversized output-only matched_text snippets for long-line files. This avoids output bloat without changing clue promotion or low-quality filtering, and adds focused regression coverage for normal, long-line, and query-missing paths.

Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
@mstykow mstykow merged commit 6751732 into main Apr 27, 2026
15 checks passed
@mstykow mstykow deleted the fix/long-line-matched-text-protection branch April 27, 2026 12:00