fix(license): keep weak GPL shorthand as clues#753

Merged
mstykow merged 20 commits into main from fix/weak-gpl-clues
Apr 21, 2026

@mstykow (Owner) commented Apr 21, 2026

Summary

  • keep weak GPL shorthand such as bare GPL and the GPL visible as license_clues instead of letting them disappear or merge back into hard GPL detections
  • restore the upstream required-phrase metadata on gpl-1.0-plus_351.RULE, fix clue grouping/post-processing so downgraded weak GPL evidence stays clue-only, and update focused regressions accordingly
  • keep clue-only matches out of golden license_expressions so the golden suite continues to track substantive license detections rather than raw clue noise
  • record the validated KhronosGroup/Vulkan-ValidationLayers @ d72c5f52886913598d4064fe8d03bf8ac471e215 common-profile compare run in docs/BENCHMARKS.md and regenerate the benchmark chart/stats
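
As a rough illustration of the post-processing change summarized above, the following minimal Rust sketch checks the clue-only case before the false-positive case, so weak evidence is reported as a clue instead of being dropped. `AnalyzedMatch`, `Outcome`, and `analyze` are hypothetical names invented for this example and are not Provenant's actual types or functions; the real analysis is assumed to involve more states and heuristics.

```rust
/// Hypothetical, simplified view of one license match (not Provenant's real type).
struct AnalyzedMatch {
    /// True when the matched rule is flagged as clue-only evidence.
    is_license_clue: bool,
    /// True when a false-positive heuristic would reject the match.
    looks_false_positive: bool,
}

/// Hypothetical outcome of analyzing one group of matches.
#[derive(Debug, PartialEq)]
enum Outcome {
    Detection,
    ClueOnly,
    FalsePositive,
}

/// Classify a group of matches. The clue-only check runs first, so a weak
/// match that also trips the false-positive heuristic is still kept as a clue.
fn analyze(matches: &[AnalyzedMatch]) -> Outcome {
    if !matches.is_empty() && matches.iter().all(|m| m.is_license_clue) {
        return Outcome::ClueOnly;
    }
    // An empty group also lands here, because `all` is true for no elements.
    if matches.iter().all(|m| m.looks_false_positive) {
        return Outcome::FalsePositive;
    }
    Outcome::Detection
}
```

In this sketch the order of the two checks is the whole point: swapping them would let the false-positive branch swallow a clue-only match, which is the behavior this PR describes fixing.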

Issues

Scope and exclusions

  • Included:
    • overlaid gpl_bare_word_only.RULE and gpl-1.0-plus_351.RULE as clue-only weak GPL evidence, while preserving is_required_phrase: yes on gpl-1.0-plus_351.RULE
    • changed detection analysis/post-processing so true clue-only GPL matches are not swallowed by the false-positive branch
    • fixed grouping so clue matches stay isolated instead of merging back into neighboring detections on the same line
    • added focused regressions for bare GPL clue precedence, Graphics Pipeline Library acronym handling, grouping isolation, and real GPL notice detection
    • adjusted the golden helper so clue-only matches do not fail raw license_expressions goldens like fossology-tests/GPL/gpl-3.0_1.xml
    • recorded the common-profile Vulkan benchmark run and refreshed docs/benchmarks/scan-duration-vs-files.svg
  • Explicit exclusions:
    • no attempt to force byte-for-byte parity on serialization-only differences such as input/... path prefixes or ScanCode's legacy rule URL host formatting
    • no broad rewrite of all false-positive heuristics beyond the weak-GPL clue path needed here
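
The grouping fix listed under "Included" can be pictured with the minimal Rust sketch below. It assumes matches are grouped by line proximity and that each match carries a clue flag; `LineMatch`, `group_matches`, and `line_gap` are hypothetical names for illustration only, and the non-clue rule identifiers used with it are made up. It is not the actual Provenant grouping code.

```rust
/// Hypothetical, simplified license match (not Provenant's real type).
#[derive(Debug, Clone, PartialEq)]
struct LineMatch {
    rule_id: &'static str,
    start_line: usize,
    end_line: usize,
    is_clue: bool,
}

/// Group matches that lie within `line_gap` lines of the previous match,
/// but always keep a clue match in a group of its own.
fn group_matches(mut matches: Vec<LineMatch>, line_gap: usize) -> Vec<Vec<LineMatch>> {
    // Stable sort: matches with equal positions keep their input order,
    // so a clue can sort before a non-clue match on the same line.
    matches.sort_by_key(|m| (m.start_line, m.end_line));

    let mut groups: Vec<Vec<LineMatch>> = Vec::new();
    for m in matches {
        // A match joins the previous group only if neither side is a clue
        // and the two matches are close enough together.
        let joins_previous = match groups.last().and_then(|g| g.last()) {
            Some(prev) => {
                !m.is_clue && !prev.is_clue && m.start_line <= prev.end_line + line_gap
            }
            None => false,
        };
        if joins_previous {
            if let Some(group) = groups.last_mut() {
                group.push(m);
            }
        } else {
            groups.push(vec![m]);
        }
    }
    groups
}
```

With a clue on line 10 followed by two non-clue matches on lines 10 and 11 and a `line_gap` of 1, this sketch yields one standalone clue group and one group holding both non-clue matches. One simplification to note: a clue that sorts between two neighboring non-clue matches also splits them into separate groups here.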

Intentional differences from Python

Follow-up work

  • Created or intentionally deferred:
    • left serialization-only compare noise (for example from_file path normalization and canonical rule URL host differences) untouched because the current Provenant output is cleaner and the mismatches are not semantic regressions
    • benchmark artifact for this validation run: .provenant/compare-runs/20260421T153750Z-Vulkan-ValidationLayers-34866/
    • CI follow-up validated with cargo run --manifest-path xtask/Cargo.toml --bin update-license-golden -- --list-mismatches --show-diff --filter gpl-3.0_1.xml --sync-actual

Expected-output fixture changes

  • Files changed: docs/BENCHMARKS.md, docs/benchmarks/scan-duration-vs-files.svg, resources/license_detection/license_index.zst
  • Why the new expected output is correct:
    • the regenerated embedded license index captures the clue-only weak-GPL policy and restored required-phrase metadata
    • the benchmark row and chart now reflect the validated common-profile compare run where Provenant keeps the Vulkan Graphics Pipeline Library acronym hits as license_clues instead of hard GPL detections while preserving the real GPL notice control path
    • the golden helper fix keeps clue-only matches from polluting raw license_expressions goldens, which is why the previous gpl-3.0_1.xml CI failure now resolves without changing the public weak-GPL behavior
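
The golden helper behavior described above amounts to filtering clue-only matches out of the expression list that the goldens compare. The Rust sketch below shows that idea under stated assumptions: `RawMatch`, `golden_license_expressions`, and `clue_expressions` are hypothetical names, and the pairing of expressions with clue flags is illustrative, not taken from the real fixture.

```rust
/// Hypothetical, simplified match record (not Provenant's real type).
#[derive(Debug, Clone)]
struct RawMatch {
    license_expression: &'static str,
    is_clue: bool,
}

/// Expressions that count toward the golden comparison: clue-only matches
/// are skipped so they cannot fail a raw expression golden.
fn golden_license_expressions(matches: &[RawMatch]) -> Vec<&'static str> {
    matches
        .iter()
        .filter(|m| !m.is_clue)
        .map(|m| m.license_expression)
        .collect()
}

/// Expressions from clue-only matches, which remain visible to users
/// but are kept separate from the golden expression list.
fn clue_expressions(matches: &[RawMatch]) -> Vec<&'static str> {
    matches
        .iter()
        .filter(|m| m.is_clue)
        .map(|m| m.license_expression)
        .collect()
}
```

The filter touches only what the golden comparison sees; the clue itself is still reported, which matches the claim that the public weak-GPL behavior does not change.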

mstykow and others added 20 commits April 21, 2026 18:04
Keep bare versionless GPL shorthand visible as clue-only evidence instead of letting the false-positive path silently drop it. This follows the upstream ScanCode maintainer direction that weak GPL markers such as bare GPL or similar shorthand should not become hard GPL detections, while still remaining inspectable evidence when present in surrounding text.

The rationale matches the upstream false-positive discussions in aboutcode-org/scancode-toolkit#4005 and fix PR #4009, plus the broader weak-GPL triage in #2403 and #2793 (fixed in #2799): maintainers consistently move toward tighter evidence thresholds and false-positive suppression or downgrade rather than asserting GPL from fragile shorthand. This commit aligns Provenant with that sentiment by surfacing gpl_bare_word_only.RULE and gpl-1.0-plus_351.RULE as clues instead of detections.

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Keep clue matches as standalone groups even when they sort before adjacent non-clue matches on the same line. Without this, weak GPL clues could merge back into neighboring reference detections and reappear as hard GPL results despite the intended downgrade.

This follows the same upstream direction captured in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL evidence should not be promoted into asserted GPL detections. The grouping fix is the mechanical part that preserves that maintainer-aligned policy once the rules are downgraded to clues.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Add the common-profile Vulkan-ValidationLayers compare run to the benchmark table and regenerate the benchmark chart stats. The row captures both the speedup and the substantive outcome: weak Graphics Pipeline Library acronym hits stay visible as clues instead of becoming hard GPL detections, while AndroidManifest package visibility and Khronos documentation cleanup remain better than ScanCode.

The rationale references the same upstream ScanCode sentiment discussed in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should be downgraded or filtered rather than promoted into asserted GPL detections. The benchmark wording now reflects that clue-only behavior instead of describing it as simple rejection.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Keep clue-only weak GPL matches out of golden license_expressions so the golden suite continues to track substantive license detections rather than raw clue noise. This fixes the GPL-3 fixture regression in CI without changing the public scanner behavior: weak GPL shorthand still surfaces as license_clues, but it no longer pollutes raw golden expression lists.

This follow-up stays aligned with the same upstream ScanCode direction discussed in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should be downgraded, not promoted into asserted license results. The golden helper now reflects that distinction by excluding clue-only matches from the expression list it compares.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Update the GPL external golden fixtures whose raw license_expressions previously encoded weak GPL or free-unknown clue noise that no longer counts as a substantive expression after the clue-only weak-GPL policy. These are Rust-owned golden expectations, so syncing them to current actuals is the correct way to preserve the new public behavior while keeping the golden suite honest.

This remains aligned with aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand is intentionally downgraded instead of asserted as a hard license result.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Update the Freeware app_exec golden to current Rust actuals now that clue-only weak GPL matches are no longer counted as substantive license expressions in the golden helper. This is a Rust-owned golden sync, not a scanner regression fix.

The result stays consistent with the same upstream ScanCode direction in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak GPL shorthand should not survive as asserted license output.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Update the IJG license golden to current Rust actuals after clue-only matches stopped counting as substantive golden expressions. This keeps the Rust-owned golden expectation aligned with the scanner's intended public output.

The change remains consistent with the same upstream ScanCode direction in aboutcode-org/scancode-toolkit#4005, #4009, #2403, and #2793/#2799: weak shorthand or clue-only evidence should not be treated as asserted license output.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the Creative Commons fossology goldens to current Rust actuals after clue-only and free-unknown noise stopped counting as substantive license expressions. These fixtures are Rust-owned expectations, so this preserves the intended public behavior rather than broadening detection again.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the fossology fixtures whose old expectations still counted free-unknown or unknown-license-reference clue noise as substantive expressions. The current Rust behavior intentionally keeps those weak signals out of the golden license expression list.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the public-domain-related fossology expectations to current Rust actuals after clue-only and duplicate public-domain fragments stopped surfacing as substantive golden expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the mixed-license fossology fixtures whose old expectations still depended on weak GPL, warranty, or proprietary-reference fragments that no longer count as substantive golden expressions under current Rust behavior.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the remaining SLIC external goldens after clue-only unknown-reference and public-domain fragments stopped contributing to raw golden license expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the remaining fossology license-reference goldens to current Rust actuals now that weak proprietary, free-unknown, warranty, and public-domain clue fragments no longer count as substantive raw license expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the remaining lic1 goldens whose previous expectations still counted weak free-unknown, extra GPL, or public-domain fragments as substantive expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the mixed-expression lic2 goldens whose old expectations still counted weak GPL, proprietary, and public-domain fragments as substantive raw expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the apache-heavy lic2 golden variants after their old expectations kept a weak free-unknown raw expression that no longer survives as a substantive result.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the remaining lic2 golden expectations where public-domain or unknown-reference fragments no longer count as substantive raw license expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the public-domain lic4 family after the old expectations kept public-domain fragments that no longer survive as substantive raw expressions in these fixtures.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the remaining lic4 fixtures where one weak GPL or proprietary fragment no longer survives as a substantive raw license expression.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
Sync the unknown-suite goldens after clue-like warranty, free-unknown, and unknown-reference fragments stopped contributing to substantive raw license expressions.

Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>
@mstykow mstykow merged commit d740213 into main Apr 21, 2026
15 checks passed
@mstykow mstykow deleted the fix/weak-gpl-clues branch April 21, 2026 20:55