perf: replace difflib fuzzy aligner with O(n*m^2) LCS DP #442

Merged
aksg87 merged 1 commit into main from perf/lcs-fuzzy-alignment on Apr 14, 2026
aksg87 commented on Apr 14, 2026

Fixes #386, fixes #197. Related: #188.

Summary

Replace the legacy fuzzy aligner with an O(n·m²) time, O(m²) memory
LCS DP gated by coverage and density rules, where n is source length
and m is extraction length (typically 3-8 tokens). The old path
enumerated candidate windows and scored each with difflib, which
scales badly on long documents.
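
For context, the window-enumeration approach described above might look roughly like the sketch below. This is not the removed code: the function name `legacy_align`, the `threshold` parameter, and the choice of window sizes (from m up to the full source length) are assumptions made for illustration. Only the use of `difflib.SequenceMatcher` to score each candidate window follows the description.

```python
import difflib


def legacy_align(source, extraction, threshold=0.75):
    """Hypothetical sketch: score every candidate window with difflib.

    Returns the (start, end) token span with the best similarity ratio,
    or None if no window reaches `threshold` (an assumed parameter).
    """
    m, n = len(extraction), len(source)
    if m == 0:
        return None
    matcher = difflib.SequenceMatcher(autojunk=False)
    matcher.set_seq2(extraction)  # seq2 analysis is cached across windows
    best_ratio, best_span = 0.0, None
    # Assumed enumeration: every window size from m to n, every start offset.
    for size in range(m, n + 1):
        for start in range(n - size + 1):
            matcher.set_seq1(source[start:start + size])
            ratio = matcher.ratio()
            if ratio > best_ratio:
                best_ratio, best_span = ratio, (start, start + size)
    return best_span if best_ratio >= threshold else None


source = "the patient was given 5 mg of aspirin daily".split()
extraction = "given 5 mg aspirin".split()
span = legacy_align(source, extraction)
```

Under these assumptions the number of windows grows quadratically with source length, and each window needs its own scoring pass, which is one plausible reading of why the old path slows down so sharply on long documents.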

Speedup: 1,000-token sources drop from ~33s to under 5ms; 5,000-token
sources from ~69min (est.) to under 20ms.

Changes

  • LCS DP with rolling rows: O(m²) memory regardless of source length,
    no allocation guard needed
  • Tightest span tracked per achievable match count, so a denser
    sub-match can be considered when the max-match span fails density
  • Coverage and density acceptance gates to reject weak partial matches
  • Pre-normalized source tokens computed once per chunk
  • New params are keyword-only; positional callers are unaffected
  • Fuzzy-param validation and deprecation warning gated behind
    enable_fuzzy_alignment so exact-match-only callers stay quiet
  • Algorithm choice is consistent between prompt validation and runtime
  • Legacy algorithm kept as a deprecated opt-out with DeprecationWarning
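
The first three bullets can be sketched as follows. This is a minimal illustration, not the merged implementation: the function name `lcs_align`, the thresholds `min_coverage` and `min_density`, and the definitions of coverage (matched tokens over extraction length) and density (matched tokens over span length) are assumptions. The sketch keeps two (m+1) x (m+1) tables, so memory is O(m²) and time is O(n·m²).

```python
def lcs_align(source, extraction, min_coverage=0.75, min_density=0.5):
    """Hypothetical sketch of an LCS DP with per-match-count tightest spans.

    Returns (start, end, matched) with `end` exclusive, or None.
    """
    m = len(extraction)
    if m == 0:
        return None
    # prev[j][k]: latest source start of a k-token common subsequence of
    # the source seen so far and extraction[:j]; -1 means "not achievable".
    prev = [[-1] * (m + 1) for _ in range(m + 1)]
    tightest = {}  # match count k -> tightest (start, end) seen so far
    for i, tok in enumerate(source):
        cur = [[-1] * (m + 1) for _ in range(m + 1)]  # rolling table
        for j in range(1, m + 1):
            hit = tok == extraction[j - 1]
            for k in range(1, j + 1):
                # Carry over: skip this source token or this extraction token.
                best = max(prev[j][k], cur[j - 1][k])
                if hit:
                    # Extend a (k-1)-token match, or start a new one at i.
                    start = i if k == 1 else prev[j - 1][k - 1]
                    if start >= 0:
                        best = max(best, start)
                        span = tightest.get(k)
                        if span is None or i + 1 - start < span[1] - span[0]:
                            tightest[k] = (start, i + 1)
                cur[j][k] = best
        prev = cur
    # Acceptance gates: prefer the largest match count whose tightest
    # span passes both coverage and density.
    for k in sorted(tightest, reverse=True):
        start, end = tightest[k]
        if k / m >= min_coverage and k / (end - start) >= min_density:
            return start, end, k
    return None


dense = lcs_align("the patient was given 5 mg of aspirin daily".split(),
                  "given 5 mg aspirin".split())
sparse = lcs_align("a x x x x x x x x b c d".split(), "a b c d".split())
```

In the second call the full four-token match spans twelve source tokens and fails the assumed density gate, so the sketch falls back to the denser three-token match at the end of the source.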

Test plan

  • 443 tests pass (39 new)
  • Covers LCS correctness, acceptance gates, sparse-vs-dense submatch
    regression, positional API compat, deprecation warning, param
    validation, fuzzy-disabled quiet path, and end-to-end extract()
    integration
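
The positional-compat, deprecation, and quiet-path checks could be exercised roughly as below. Everything except `enable_fuzzy_alignment` and the use of `DeprecationWarning` is hypothetical: the function `resolve_aligner`, the parameters `fuzzy_threshold`, `fuzzy_algorithm`, and `min_density`, and the returned strings are stand-ins chosen for this sketch, not names taken from the PR.

```python
import warnings


def resolve_aligner(enable_fuzzy_alignment=True, fuzzy_threshold=0.75, *,
                    fuzzy_algorithm="lcs", min_density=0.5):
    """Hypothetical sketch: new params are keyword-only, checks are gated."""
    if not enable_fuzzy_alignment:
        # Exact-match-only callers skip validation and warnings entirely.
        return "exact"
    if not 0.0 <= min_density <= 1.0:
        raise ValueError("min_density must be in [0, 1]")
    if fuzzy_algorithm == "legacy":
        warnings.warn("the legacy fuzzy aligner is deprecated",
                      DeprecationWarning, stacklevel=2)
    elif fuzzy_algorithm != "lcs":
        raise ValueError(f"unknown fuzzy_algorithm: {fuzzy_algorithm!r}")
    return fuzzy_algorithm


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Quiet path: fuzzy disabled, so bad params and legacy opt-out are ignored.
    quiet = resolve_aligner(False, fuzzy_algorithm="legacy", min_density=9.0)
    quiet_warnings = len(caught)
    # Opting back into the legacy algorithm emits one DeprecationWarning.
    legacy = resolve_aligner(True, fuzzy_algorithm="legacy")
    legacy_categories = [w.category for w in caught]

# Existing positional callers keep working with the default algorithm.
positional = resolve_aligner(True, 0.75)
```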

github-actions bot added the size/L label (pull request with 600-1000 lines changed) on Apr 14, 2026
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 4 times, most recently from 9636003 to 3691bf3, on April 14, 2026 13:53
aksg87 changed the title from "perf: replace difflib fuzzy aligner with O(n*m) LCS DP" to "perf: replace difflib fuzzy aligner with O(n*m^2) LCS DP" on Apr 14, 2026
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 3 times, most recently from c9dc107 to ec99d10, on April 14, 2026 15:09
github-actions bot added the size/XL label (pull request with over 1000 lines changed, too large) and removed the size/L label on Apr 14, 2026
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from ec99d10 to 9d8cfb6 on April 14, 2026 15:15
github-actions bot added the size/L label and removed the size/XL label on Apr 14, 2026
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from 9d8cfb6 to d8c8e95 on April 14, 2026 16:55
github-actions bot added the size/XL label and removed the size/L label on Apr 14, 2026
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 7 times, most recently from 25d0cc4 to ee3d65d, on April 14, 2026 21:50
Replace the legacy fuzzy aligner with an O(n*m^2) time, O(m^2) memory
LCS DP gated by coverage and density rules, where n is source length
and m is extraction length (typically 3-8 tokens). The old path
enumerated candidate windows and scored each with difflib, which
scales badly on long documents.

Speedup: 1,000-token sources drop from ~33s to under 5ms; 5,000-token
sources from ~69min (est.) to under 20ms.

Addresses #386. Related: #188.
aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from ee3d65d to fc06ba8 on April 14, 2026 21:52
aksg87 self-assigned this on Apr 14, 2026
aksg87 merged commit 8c86bf6 into main on Apr 14, 2026
20 checks passed
Labels

size/XL Pull request with over 1000 lines changed - too large