fix: add long-line protection for matched_text to prevent output bloat #799
Merged
For files with very long lines (e.g., minified JS/CSS), matched_text extraction now uses token-span mode instead of whole-line mode. This prevents capturing megabytes of text for a small license match on a single-line minified file. Also adds a 10,000-character safety truncation fallback in matched_text_from_text() for when token-span extraction is unavailable.
Signed-off-by: Maxim Stykow <stykowmaxim@meta.com>
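To make the safety truncation concrete, here is a minimal sketch of a character-bounded truncation that never splits a multi-byte character. The constant name `MAX_MATCHED_TEXT_CHARS` and the helper `truncate_chars` are hypothetical and are not taken from the PR; only the 10,000-character figure comes from the commit message, and the real `matched_text_from_text()` may be structured differently.

```rust
/// Hypothetical cap mirroring the 10,000-character figure in the commit message.
const MAX_MATCHED_TEXT_CHARS: usize = 10_000;

/// Return at most `max_chars` characters (Unicode scalar values) of `text`.
/// The cut always lands on a character boundary, so the slice cannot panic.
fn truncate_chars(text: &str, max_chars: usize) -> &str {
    // `nth(max_chars)` yields the byte offset of the first character to drop,
    // or `None` when the text already fits within the limit.
    match text.char_indices().nth(max_chars) {
        Some((byte_idx, _)) => &text[..byte_idx],
        None => text,
    }
}
```

A caller would apply something like `truncate_chars(snippet, MAX_MATCHED_TEXT_CHARS)` only on the fallback path, so normal short matches are returned unchanged.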
Keep Query::matched_text and other internal consumers on full-line semantics while compacting oversized output-only matched_text snippets for long-line files. This avoids output bloat without changing clue promotion or low-quality filtering, and adds focused regression coverage for normal, long-line, and query-missing paths.
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Summary
When scanning minified JS/CSS files, provenant could emit massively bloated matched_text output because whole-line extraction captured the full minified line for every match. On a single-line minified file, even a small match could therefore duplicate the entire file text across both per-file and top-level license detections.
Difference to ScanCode
ScanCode already protects against this by switching away from whole-line matched-text extraction for long-line queries. Provenant already had long-line handling in the matching pipeline, but it was still using whole-line extraction when rendering matched_text for output. That left the engine protected, but not the emitted license-text snippets.
This PR adopts the important part of that behavior: do not emit whole-line evidence for pathological long-line matches. It does not try to port all of ScanCode's broader preprocessing around compact text, because the goal here is better scan results and bounded output, not strict parity.
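The output-side decision described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: `LONG_LINE_THRESHOLD` and `output_snippet` are hypothetical names, the cutoff value is invented (the PR does not state one), and a plain byte span stands in for the token span. It also assumes `start <= end <= text.len()` and that both offsets fall on character boundaries.

```rust
/// Hypothetical cutoff in bytes; the PR does not state the real threshold.
const LONG_LINE_THRESHOLD: usize = 1_000;

/// Choose the snippet to emit for a match covering bytes `start..end` of `text`.
/// Normal files get whole-line context; pathological long lines get only the span.
fn output_snippet(text: &str, start: usize, end: usize) -> &str {
    // Expand the match to the enclosing line boundaries (whole-line mode).
    let line_start = text[..start].rfind('\n').map_or(0, |i| i + 1);
    let line_end = text[end..].find('\n').map_or(text.len(), |i| end + i);
    let whole_lines = &text[line_start..line_end];

    // If any covered line is oversized, fall back to the matched span alone.
    let has_long_line = whole_lines
        .lines()
        .any(|line| line.len() > LONG_LINE_THRESHOLD);

    if has_long_line {
        &text[start..end]
    } else {
        whole_lines
    }
}
```

With this shape, a match inside a short comment line returns the full line, while the same match inside a single-line minified file returns only the matched span, which keeps the emitted snippet bounded by the match size rather than the file size.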
End state
Tests
Added focused regression coverage for the normal, long-line, and query-missing paths.
Expected-output fixture changes
None. Existing fixtures do not exercise license-text output on pathological long-line inputs.