perf: replace difflib fuzzy aligner with O(n*m^2) LCS DP by aksg87 · Pull Request #442 · google/langextract

aksg87 · 2026-04-14T03:23:16Z

Fixes #386, fixes #197. Related: #188.

Summary

Replace the legacy fuzzy aligner with an O(n·m²) time, O(m²) memory
LCS DP gated by coverage and density rules, where n is source length
and m is extraction length (typically 3-8 tokens). The old path
enumerated candidate windows and scored each with difflib, which
scales badly on long documents.
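
One way to read the complexity claim is sketched below. This is not the code from the PR: `tightest_spans` and its table layout are assumptions made for illustration. It keeps two (m+1)×(m+1) tables of "latest possible start index" values, rolls them forward one source token at a time (O(n·m²) time, O(m²) memory), and records the tightest source span for every achievable match count.

```python
from typing import Dict, List, Sequence, Tuple


def tightest_spans(
    source: Sequence[str], extraction: Sequence[str]
) -> Dict[int, Tuple[int, int]]:
    """Map each achievable match count k to its tightest (start, end) span.

    Hypothetical helper, not the PR's API. Indices are inclusive token
    positions in `source`.
    """
    m = len(extraction)
    # prev[j][k]: largest source index at which a common subsequence of
    # length k with extraction[:j] can start, using only the source tokens
    # seen so far. -1 means "not achievable".
    prev: List[List[int]] = [[-1] * (m + 1) for _ in range(m + 1)]
    best: Dict[int, Tuple[int, int]] = {}

    for i, token in enumerate(source):
        cur = [[-1] * (m + 1) for _ in range(m + 1)]
        for j in range(1, m + 1):
            hit = token == extraction[j - 1]
            for k in range(1, j + 1):
                # Carry forward: skip this source token, or skip extraction[j-1].
                start = max(prev[j][k], cur[j - 1][k])
                if hit:
                    # Extend a (k-1)-match that ended before position i.
                    cand = i if k == 1 else prev[j - 1][k - 1]
                    if cand >= 0:
                        if k not in best or i - cand < best[k][1] - best[k][0]:
                            best[k] = (cand, i)
                        start = max(start, cand)
                cur[j][k] = start
        prev = cur  # rolling tables: memory never depends on len(source)
    return best
```

For the source `the patient was given 20 mg of aspirin daily` and the extraction `given aspirin daily`, the sketch reports a full three-token match over tokens 3 to 8 and a tighter two-token match over tokens 7 to 8.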

Speedup: 1,000-token sources drop from ~33s to under 5ms; 5,000-token
sources from ~69min (est.) to under 20ms.

Changes

- LCS DP with rolling rows: O(m²) memory regardless of source length,
  no allocation guard needed
- Tightest span tracked per achievable match count, so a denser
  sub-match can be considered when the max-match span fails density
- Coverage and density acceptance gates to reject weak partial matches
- Pre-normalized source tokens computed once per chunk
- New params are keyword-only; positional callers are unaffected
- Fuzzy-param validation and deprecation warning gated behind
  enable_fuzzy_alignment so exact-match-only callers stay quiet
- Algorithm choice is consistent between prompt validation and runtime
- Legacy algorithm kept as a deprecated opt-out with DeprecationWarning
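
The acceptance gates and the sparse-vs-dense fallback could look roughly like the sketch below. The PR text does not define coverage or density, so the definitions here are assumptions: coverage is matched tokens divided by extraction length, and density is matched tokens divided by span length. The function name `accept_span` and both threshold defaults are hypothetical.

```python
from typing import Dict, Optional, Tuple


def accept_span(
    spans: Dict[int, Tuple[int, int]],
    extraction_len: int,
    min_coverage: float = 0.75,  # hypothetical default
    min_density: float = 0.5,  # hypothetical default
) -> Optional[Tuple[int, int]]:
    """Pick the span with the most matches that passes both gates.

    `spans` maps a match count k to the tightest inclusive (start, end)
    source span achieving k matches.
    """
    for k in sorted(spans, reverse=True):
        start, end = spans[k]
        coverage = k / extraction_len
        if coverage < min_coverage:
            # Smaller k only lowers coverage further, so stop here.
            break
        density = k / (end - start + 1)
        if density >= min_density:
            return (start, end)
        # The max-match span was too sparse; try the next denser sub-match.
    return None
```

With spans `{1: (3, 3), 2: (7, 8), 3: (3, 8)}` and a three-token extraction, the defaults accept the full match `(3, 8)`. Raising `min_density` to 0.6 while lowering `min_coverage` to 0.6 rejects the sparse full match and falls back to the dense two-token span `(7, 8)`.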

Test plan

- 443 tests pass (39 new)
- Covers LCS correctness, acceptance gates, sparse-vs-dense submatch
  regression, positional API compat, deprecation warning, param
  validation, fuzzy-disabled quiet path, and end-to-end extract()
  integration
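
The API-compat, deprecation, and quiet-path cases can be pictured with a toy function. Only `enable_fuzzy_alignment` appears in the PR text; `choose_aligner`, `fuzzy_algorithm`, and `min_density` are hypothetical names used to show the pattern, not the library's real signature.

```python
import warnings
from typing import Sequence


def choose_aligner(
    extraction_tokens: Sequence[str],
    source_tokens: Sequence[str],
    enable_fuzzy_alignment: bool = True,
    *,  # parameters after this marker must be passed by keyword
    fuzzy_algorithm: str = "lcs",  # hypothetical parameter
    min_density: float = 0.5,  # hypothetical parameter
) -> str:
    """Return which path would run: 'exact', 'lcs', or 'legacy'."""
    if not enable_fuzzy_alignment:
        # Quiet path: fuzzy params are neither validated nor warned about.
        return "exact"
    if not 0.0 < min_density <= 1.0:
        raise ValueError(f"min_density must be in (0, 1], got {min_density}")
    if fuzzy_algorithm == "legacy":
        warnings.warn(
            "The legacy fuzzy aligner is deprecated; use the LCS aligner.",
            DeprecationWarning,
            stacklevel=2,
        )
    elif fuzzy_algorithm != "lcs":
        raise ValueError(f"unknown fuzzy_algorithm: {fuzzy_algorithm!r}")
    return fuzzy_algorithm
```

Because the new parameters sit after the bare `*`, an existing call such as `choose_aligner(ext, src, True)` keeps working, while `choose_aligner(ext, src, True, "legacy")` raises `TypeError` instead of silently binding the new parameter.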

github-actions Bot added the size/L Pull request with 600-1000 lines changed label Apr 14, 2026

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 4 times, most recently from 9636003 to 3691bf3 Compare April 14, 2026 13:53

aksg87 changed the title ~~perf: replace difflib fuzzy aligner with O(n*m) LCS DP~~ perf: replace difflib fuzzy aligner with O(n*m^2) LCS DP Apr 14, 2026

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 3 times, most recently from c9dc107 to ec99d10 Compare April 14, 2026 15:09

github-actions Bot added size/XL Pull request with over 1000 lines changed - too large and removed size/L Pull request with 600-1000 lines changed labels Apr 14, 2026

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from ec99d10 to 9d8cfb6 Compare April 14, 2026 15:15

github-actions Bot added size/L Pull request with 600-1000 lines changed and removed size/XL Pull request with over 1000 lines changed - too large labels Apr 14, 2026

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from 9d8cfb6 to d8c8e95 Compare April 14, 2026 16:55

github-actions Bot added size/XL Pull request with over 1000 lines changed - too large and removed size/L Pull request with 600-1000 lines changed labels Apr 14, 2026

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch 7 times, most recently from 25d0cc4 to ee3d65d Compare April 14, 2026 21:50

aksg87 force-pushed the perf/lcs-fuzzy-alignment branch from ee3d65d to fc06ba8 Compare April 14, 2026 21:52

aksg87 self-assigned this Apr 14, 2026

aksg87 merged commit 8c86bf6 into main Apr 14, 2026
20 checks passed

This was referenced Apr 14, 2026

Performance: Replace difflib with RapidFuzz for fuzzy alignment and add interval tree for merge #386

Closed

Debug messages about "No clean start index found and Fuzzy Aligning" when processing Vietnamese text #197

Closed

This was referenced Apr 14, 2026

Performance for markdown extracted from large pdfs #188

Closed

perf: replace difflib with RapidFuzz for fuzzy alignment + interval tree merge #387

Closed

perf(WordAligner): optimize _fuzzy_align_extraction function for better performance #410

Closed

dan504512 mentioned this pull request Apr 17, 2026

Performance: prompt building duplicates full few-shot preamble per prompt, causing O(batch_length × preamble_size) peak memory #446

Open
