fix(license): keep weak GPL shorthand as clues by mstykow · Pull Request #753 · mstykow/provenant

mstykow · 2026-04-21T16:07:18Z

Summary

- keep weak GPL shorthand such as bare "GPL" and "the GPL" visible as license_clues instead of letting them disappear or merge back into hard GPL detections
- restore the upstream required-phrase metadata on gpl-1.0-plus_351.RULE, fix clue grouping/post-processing so downgraded weak GPL evidence stays clue-only, and update focused regressions accordingly
- keep clue-only matches out of golden license_expressions so the golden suite continues to track substantive license detections rather than raw clue noise
- record the validated KhronosGroup/Vulkan-ValidationLayers @ d72c5f52886913598d4064fe8d03bf8ac471e215 common-profile compare run in docs/BENCHMARKS.md and regenerate the benchmark chart/stats

Issues

Covers: weak-GPL false-positive alignment with the upstream ScanCode maintainer direction discussed in aboutcode-org/scancode-toolkit#4005 ("Regression: GPL false positive license detections with v32.3.0"), fixed in aboutcode-org/scancode-toolkit#4009 ("Fix false positive detection heuristics"), and reinforced by the false-positive triage in aboutcode-org/scancode-toolkit#2403 ("Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish") and aboutcode-org/scancode-toolkit#2793 ("GPL-1.0 false alarm improvement"), the latter fixed in aboutcode-org/scancode-toolkit#2799 ("Fix GPL license detection false positive #2793")
Closes:

Scope and exclusions

Included:
- overlaid gpl_bare_word_only.RULE and gpl-1.0-plus_351.RULE as clue-only weak GPL evidence, while preserving is_required_phrase: yes on gpl-1.0-plus_351.RULE
- changed detection analysis/post-processing so true clue-only GPL matches are not swallowed by the false-positive branch
- fixed grouping so clue matches stay isolated instead of merging back into neighboring detections on the same line
- added focused regressions for bare GPL clue precedence, Graphics Pipeline Library acronym handling, grouping isolation, and real GPL notice detection
- adjusted the golden helper so clue-only matches do not fail raw license_expressions goldens like fossology-tests/GPL/gpl-3.0_1.xml
- recorded the common-profile Vulkan benchmark run and refreshed docs/benchmarks/scan-duration-vs-files.svg
Explicit exclusions:
- no attempt to force byte-for-byte parity on serialization-only differences such as input/... path prefixes or ScanCode's legacy rule URL host formatting
- no broad rewrite of all false-positive heuristics beyond the weak-GPL clue path needed here
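
The ordering issue behind the second Included item (clue-only matches being swallowed by the false-positive branch) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `RuleMatch`, `Outcome`, `looks_like_false_positive`, and `classify` are hypothetical names, and the length-based heuristic is a stand-in rather than Provenant's actual false-positive logic. The only point is that the clue-only check runs before the false-positive check.

```rust
/// Hypothetical, simplified rule match; the fields are illustrative and
/// are not taken from Provenant's actual types.
#[derive(Debug, Clone, PartialEq)]
pub struct RuleMatch {
    pub rule_id: String,
    pub is_license_clue: bool,
    pub matched_length: usize,
}

/// Hypothetical result of analysing one group of matches.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    /// Reported as an asserted license detection.
    Detection,
    /// Kept visible, but only as clue-level evidence.
    ClueOnly,
    /// Discarded by the false-positive heuristic.
    Dropped,
}

/// Stand-in heuristic (an assumption): treat a group made up entirely of
/// very short matches as a likely false positive.
fn looks_like_false_positive(matches: &[RuleMatch]) -> bool {
    !matches.is_empty() && matches.iter().all(|m| m.matched_length <= 1)
}

/// Classify a group of matches. The clue-only check comes first, so a bare
/// "GPL" clue is preserved instead of being dropped as a false positive.
pub fn classify(matches: &[RuleMatch]) -> Outcome {
    if matches.is_empty() {
        return Outcome::Dropped;
    }
    if matches.iter().all(|m| m.is_license_clue) {
        return Outcome::ClueOnly;
    }
    if looks_like_false_positive(matches) {
        return Outcome::Dropped;
    }
    Outcome::Detection
}
```

With the two checks in the opposite order, a one-token clue match would satisfy the stand-in heuristic and be dropped, which is the kind of silent disappearance this PR is meant to avoid.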

Intentional differences from Python

Provenant now follows the ScanCode maintainer sentiment from aboutcode-org/scancode-toolkit#4005 ("Regression: GPL false positive license detections with v32.3.0") and its fix aboutcode-org/scancode-toolkit#4009 ("Fix false positive detection heuristics"), aboutcode-org/scancode-toolkit#2403 ("Discard matches to single GPL word and other very short rules with mixed, non-matching case and/or in a binary an/or not on a single line and/or in giberish"), and aboutcode-org/scancode-toolkit#2793 ("GPL-1.0 false alarm improvement") with its fix aboutcode-org/scancode-toolkit#2799 ("Fix GPL license detection false positive #2793") more closely than the output ScanCode currently emits on this target: weak versionless GPL shorthand remains inspectable as clue-only evidence, but it no longer becomes a hard GPL-1.0-or-later detection.
The local Rust-owned golden expectations intentionally stay focused on substantive detections rather than clue-only raw-match noise, even where the Python reference corpus still records an extra weak GPL expression.

Follow-up work

Created or intentionally deferred:
- left serialization-only compare noise (for example from_file path normalization and canonical rule URL host differences) untouched because the current Provenant output is cleaner and the mismatches are not semantic regressions
- benchmark artifact for this validation run: .provenant/compare-runs/20260421T153750Z-Vulkan-ValidationLayers-34866/
- CI follow-up validated with cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --list-mismatches --show-diff --filter gpl-3.0_1.xml --sync-actual

Expected-output fixture changes

Files changed: docs/BENCHMARKS.md, docs/benchmarks/scan-duration-vs-files.svg, resources/license_detection/license_index.zst
Why the new expected output is correct:
- the regenerated embedded license index captures the clue-only weak-GPL policy and restored required-phrase metadata
- the benchmark row and chart now reflect the validated common-profile compare run where Provenant keeps the Vulkan Graphics Pipeline Library acronym hits as license_clues instead of hard GPL detections while preserving the real GPL notice control path
- the golden helper fix keeps clue-only matches from polluting raw license_expressions goldens, which is why the previous gpl-3.0_1.xml CI failure now resolves without changing the public weak-GPL behavior

Keep bare versionless GPL shorthand visible as clue-only evidence instead of letting the false-positive path silently drop it. This follows the upstream ScanCode maintainer direction that weak GPL markers such as bare GPL or similar shorthand should not become hard GPL detections, while still remaining inspectable evidence when present in surrounding text. The rationale matches the upstream false-positive discussions in aboutcode-org/scancode-toolkit#4005 and fix PR #4009, plus the broader weak-GPL triage in #2403 and #2793 (fixed in #2799): maintainers consistently move toward tighter evidence thresholds and false-positive suppression or downgrade rather than asserting GPL from fragile shorthand. This commit aligns Provenant with that sentiment by surfacing gpl_bare_word_only.RULE and gpl-1.0-plus_351.RULE as clues instead of detections.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>

Keep clue matches as standalone groups even when they sort before adjacent non-clue matches on the same line. Without this, weak GPL clues could merge back into neighboring reference detections and reappear as hard GPL results despite the intended downgrade. This follows the same upstream direction captured in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL evidence should not be promoted into asserted GPL detections. The grouping fix is the mechanical part that preserves that maintainer-aligned policy once the rules are downgraded to clues.
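
The grouping behavior described in this commit can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's grouping code: `Match`, `LINE_GAP`, and `group_matches` are hypothetical, and the proximity rule (start line within a fixed gap of the previous match's end line) is a simplification chosen for the example.

```rust
/// Hypothetical match record; field names are assumptions for illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Match {
    pub rule_id: &'static str,
    pub start_line: usize,
    pub end_line: usize,
    pub is_clue: bool,
}

/// Hypothetical proximity threshold, in lines, for joining a group.
const LINE_GAP: usize = 4;

/// Group matches by line proximity, but keep every clue match in a group of
/// its own so it can never be absorbed into a neighboring detection.
pub fn group_matches(mut matches: Vec<Match>) -> Vec<Vec<Match>> {
    // Stable sort: a clue that starts on the same line as a non-clue match
    // keeps its input position relative to that match.
    matches.sort_by_key(|m| (m.start_line, m.end_line));

    let mut groups: Vec<Vec<Match>> = Vec::new();
    for m in matches {
        let joins_previous = match groups.last().and_then(|g| g.last()) {
            // Join only when neither side is a clue and the lines are close.
            Some(last) => {
                !m.is_clue && !last.is_clue && m.start_line <= last.end_line + LINE_GAP
            }
            None => false,
        };
        if joins_previous {
            if let Some(group) = groups.last_mut() {
                group.push(m);
            }
        } else {
            groups.push(vec![m]);
        }
    }
    groups
}
```

One consequence of this simplified version is that a clue sitting between two nearby non-clue matches also splits them into separate groups. Whether a real implementation should bridge across the clue is a design decision the sketch does not try to settle.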

Add the common-profile Vulkan-ValidationLayers compare run to the benchmark table and regenerate the benchmark chart stats. The row captures both the speedup and the substantive outcome: weak Graphics Pipeline Library acronym hits stay visible as clues instead of becoming hard GPL detections, while AndroidManifest package visibility and Khronos documentation cleanup remain better than ScanCode. The rationale references the same upstream ScanCode sentiment discussed in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should be downgraded or filtered rather than promoted into asserted GPL detections. The benchmark wording now reflects that clue-only behavior instead of describing it as simple rejection.

Keep clue-only weak GPL matches out of golden license_expressions so the golden suite continues to track substantive license detections rather than raw clue noise. This fixes the GPL-3 fixture regression in CI without changing the public scanner behavior: weak GPL shorthand still surfaces as license_clues, but it no longer pollutes raw golden expression lists. This follow-up stays aligned with the same upstream ScanCode direction discussed in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should be downgraded, not promoted into asserted license results. The golden helper now reflects that distinction by excluding clue-only matches from the expression list it compares.
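
As a rough illustration of the helper change described in this commit, the filter could look like the sketch below. `GoldenMatch` and `golden_license_expressions` are hypothetical names introduced for this example, and the real helper may carry more fields or work on different types.

```rust
/// Hypothetical view of a match as seen by the golden helper.
#[derive(Debug, Clone)]
pub struct GoldenMatch {
    pub license_expression: String,
    pub is_clue: bool,
}

/// Build the expression list that a golden comparison would use, skipping
/// clue-only matches so they cannot fail a license_expressions golden.
pub fn golden_license_expressions(matches: &[GoldenMatch]) -> Vec<String> {
    matches
        .iter()
        .filter(|m| !m.is_clue)
        .map(|m| m.license_expression.clone())
        .collect()
}
```

The filter touches only what the golden comparison sees. In this sketch the input slice is left unchanged, so clue matches remain available to whatever code reports them as license_clues.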

Update the GPL external golden fixtures whose raw license_expressions previously encoded weak GPL or free-unknown clue noise that no longer counts as a substantive expression after the clue-only weak-GPL policy. These are Rust-owned golden expectations, so syncing them to current actuals is the correct way to preserve the new public behavior while keeping the golden suite honest. This remains aligned with aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand is intentionally downgraded instead of asserted as a hard license result.

Update the Freeware app_exec golden to current Rust actuals now that clue-only weak GPL matches are no longer counted as substantive license expressions in the golden helper. This is a Rust-owned golden sync, not a scanner regression fix. The result stays consistent with the same upstream ScanCode direction in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should not survive as asserted license output.

Update the IJG license golden to current Rust actuals after clue-only matches stopped counting as substantive golden expressions. This keeps the Rust-owned golden expectation aligned with the scanner's intended public output. The change remains consistent with the same upstream ScanCode direction in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak shorthand or clue-only evidence should not be treated as asserted license output.

Sync the Creative Commons fossology goldens to current Rust actuals after clue-only and free-unknown noise stopped counting as substantive license expressions. These fixtures are Rust-owned expectations, so this preserves the intended public behavior rather than broadening detection again.

Sync the fossology fixtures whose old expectations still counted free-unknown or unknown-license-reference clue noise as substantive expressions. The current Rust behavior intentionally keeps those weak signals out of the golden license expression list.

Sync the public-domain-related fossology expectations to current Rust actuals after clue-only and duplicate public-domain fragments stopped surfacing as substantive golden expressions.

Sync the mixed-license fossology fixtures whose old expectations still depended on weak GPL, warranty, or proprietary-reference fragments that no longer count as substantive golden expressions under current Rust behavior.

Sync the remaining SLIC external goldens after clue-only unknown-reference and public-domain fragments stopped contributing to raw golden license expressions.

Sync the remaining fossology license-reference goldens to current Rust actuals now that weak proprietary, free-unknown, warranty, and public-domain clue fragments no longer count as substantive raw license expressions.

Sync the remaining lic1 goldens whose previous expectations still counted weak free-unknown, extra GPL, or public-domain fragments as substantive expressions.

Sync the mixed-expression lic2 goldens whose old expectations still counted weak GPL, proprietary, and public-domain fragments as substantive raw expressions.

Sync the apache-heavy lic2 golden variants after their old expectations kept a weak free-unknown raw expression that no longer survives as a substantive result.

Sync the remaining lic2 golden expectations where public-domain or unknown-reference fragments no longer count as substantive raw license expressions.

Sync the public-domain lic4 family after the old expectations kept public-domain fragments that no longer survive as substantive raw expressions in these fixtures.

Sync the remaining lic4 fixtures where one weak GPL or proprietary fragment no longer survives as a substantive raw license expression.

Sync the unknown-suite goldens after clue-like warranty, free-unknown, and unknown-reference fragments stopped contributing to substantive raw license expressions.

mstykow and others added 20 commits April 21, 2026 18:04

mstykow merged commit d740213 into main Apr 21, 2026
15 checks passed

mstykow deleted the fix/weak-gpl-clues branch April 21, 2026 20:55

mstykow merged 20 commits into main from fix/weak-gpl-clues
