fix(detection): improve apache camel compare parity by mstykow · Pull Request #795 · mstykow/provenant

mstykow · 2026-04-26T20:50:32Z

Summary

- improve shared UTF-16 decoding so BOM-less or corrupted-BOM inputs stay on the decoded-text path, including the Camel template files that previously missed Apache-2.0
- recover Apache, Spring, and OpenShift notice attributions and trim written-by prose spillover in shared author heuristics, including the Camel SBOM XML false-positive sentence
- filter version-shaped SBOM pseudo-emails before truncation and record the explicit Apache Camel benchmark checkpoint in docs/BENCHMARKS.md
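
To make the first point concrete, here is a minimal sketch of one way to recognize BOM-less (or corrupted-BOM) UTF-16 before falling back to other decoders. It is not the code in src/utils/file.rs: the names `Utf16Order`, `guess_utf16_order`, and `decode_utf16_guess`, and the 40% zero-byte threshold, are assumptions made up for illustration. The idea is that mostly-ASCII UTF-16 text has a zero byte in every other position, so the position of the zeros hints at the byte order even when the BOM is missing or damaged.

```rust
/// Byte order guessed for a UTF-16 candidate buffer (hypothetical type).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Utf16Order {
    Little,
    Big,
}

/// Guess whether bytes look like UTF-16 by counting zero bytes at even and
/// odd offsets. Mostly-ASCII UTF-16LE has zeros at odd offsets, UTF-16BE at
/// even offsets. The 40% threshold is an illustrative assumption.
fn guess_utf16_order(bytes: &[u8]) -> Option<Utf16Order> {
    if bytes.len() < 4 {
        return None;
    }
    let pairs = bytes.len() / 2;
    let mut even_zeros = 0usize;
    let mut odd_zeros = 0usize;
    for pair in bytes.chunks_exact(2) {
        if pair[0] == 0 {
            even_zeros += 1;
        }
        if pair[1] == 0 {
            odd_zeros += 1;
        }
    }
    let threshold = pairs * 2 / 5;
    if odd_zeros > threshold && even_zeros * 4 < odd_zeros {
        Some(Utf16Order::Little)
    } else if even_zeros > threshold && odd_zeros * 4 < even_zeros {
        Some(Utf16Order::Big)
    } else {
        None
    }
}

/// Decode the buffer as UTF-16 if the heuristic fires; `None` means the
/// caller should fall back to its other decoding paths.
fn decode_utf16_guess(bytes: &[u8]) -> Option<String> {
    let order = guess_utf16_order(bytes)?;
    let units: Vec<u16> = bytes
        .chunks_exact(2)
        .map(|p| match order {
            Utf16Order::Little => u16::from_le_bytes([p[0], p[1]]),
            Utf16Order::Big => u16::from_be_bytes([p[0], p[1]]),
        })
        .collect();
    // Lossy decoding keeps going past unpaired surrogates or a damaged BOM.
    Some(String::from_utf16_lossy(&units))
}
```

Because the check is statistical, a damaged leading BOM only costs one code unit of garbage, and the license text behind it still reaches the decoded-text path. Plain ASCII or Latin-1 input contains no zero bytes, so it returns `None` and falls through to the existing fallbacks.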

Issues

Covers: Apache Camel compare target from docs/implementation-plans/package-detection/PARSER_VERIFICATION_SCORECARD.md
Closes: none

Scope and exclusions

Included: shared text decoding in src/utils/file.rs, shared copyright/author heuristics, shared email host filtering, Camel benchmark/chart refresh
Explicit exclusions: unrelated remaining Camel compare deltas outside the three targeted gaps, such as the longstanding top-level package-count mismatch

Intentional differences from Python

- keep the UTF-16 improvement generic by recognizing BOM-less and corrupted-BOM UTF-16 shapes before falling back to Latin-1 or binary-string extraction
- prefer unique SBOM emails before the cap only after filtering version-shaped pseudo-hosts, instead of keeping obviously fake package-version addresses in the truncated result set
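
The ordering in the second point (filter, then deduplicate, then cap) can be sketched as follows. This is an illustrative sketch, not the project's implementation: `is_version_shaped_host` and `select_emails` are hypothetical names, and the rule "every dot-separated host label starts with a digit" is an assumed approximation of what "version-shaped" means here.

```rust
use std::collections::HashSet;

/// Returns true when the host part looks like a package version
/// (for example `4.10.2`) rather than a domain name.
/// Assumed rule: at least two labels, each starting with an ASCII digit.
fn is_version_shaped_host(host: &str) -> bool {
    let labels: Vec<&str> = host.split('.').collect();
    labels.len() >= 2
        && labels
            .iter()
            .all(|label| label.chars().next().map_or(false, |c| c.is_ascii_digit()))
}

/// Drop version-shaped pseudo-emails first, then keep unique addresses
/// (case-insensitively), and only then apply the cap.
fn select_emails(candidates: &[&str], cap: usize) -> Vec<String> {
    let mut seen = HashSet::new();
    candidates
        .iter()
        .filter(|email| match email.rsplit_once('@') {
            Some((_, host)) => !is_version_shaped_host(host),
            None => false,
        })
        .filter(|email| seen.insert(email.to_ascii_lowercase()))
        .map(|email| email.to_string())
        .take(cap)
        .collect()
}
```

With sample inputs such as `camel-core@4.10.2` (a made-up package coordinate that merely looks like an email) followed by several real-looking addresses, the pseudo-email never occupies one of the capped slots. A side effect of the assumed rule is that IPv4-literal hosts would also be dropped, which a real implementation might want to handle separately.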

Follow-up work

Created or intentionally deferred:
- recorded compare artifacts:
  - baseline before these fixes: .provenant/compare-runs/20260426T074449Z-camel-39047
  - first fully-fixed benchmark checkpoint: .provenant/compare-runs/20260426T202057Z-camel-80585
  - immediate repeat for timing variance: .provenant/compare-runs/20260426T203456Z-camel-82481
- the benchmark row uses camel-80585; the repeat run camel-82481 was retained only to check fluctuation and came out 7.84s slower (1.67%). This suggests the earlier 473s vs 514s spread came mostly from different code state rather than pure runtime noise
- remaining non-targeted Camel compare deltas are left for later triage

Expected-output fixture changes

Files changed: none
Why the new expected output is correct: this branch changes shared scanner behavior and validates it with targeted Rust tests plus repeated compare-output artifacts; no checked-in golden or expected-output fixtures required updates

Ultraworked with [Sisyphus](https://github.com/code-yeongyu/oh-my-openagent)
Co-authored-by: Sisyphus <clio-agent@sisyphuslabs.ai>
Signed-off-by: Maxim Stykow <maxim.stykow@gmail.com>

mstykow and others added 5 commits April 26, 2026 22:46

fix(text): improve utf-16 text decoding

522595c

fix(copyright): improve notice and author recovery

1a50670

fix(contacts): filter version-like sbom emails

b9fdc3d

docs(benchmarks): add apache camel checkpoint

1a6530d

fix(ci): resolve camel follow-up regressions

4802534

mstykow merged commit 0c9d9fc into main Apr 26, 2026
15 checks passed

mstykow deleted the fix/camel-compare-followups branch April 26, 2026 21:20

mstykow merged 5 commits into main from fix/camel-compare-followups
