
Add string fuzzy matching #1023

Merged: 9 commits into leanprover:master on Mar 12, 2022

Conversation

**larsk21** (Contributor) commented Feb 18, 2022

Fuzzy matching based on the dynamic programming algorithm used in clang/LLVM. Works well, but is painfully slow when used for the workspace symbols request.

closes #960

**gebner** (Member) commented Feb 18, 2022

> but is painfully slow when used for the workspace symbols request.

The search algorithm that we use for the documentation just adds a few weights to the current matching algorithm and seems to work well in practice (and is fast enough for interactive usage): https://github.com/leanprover-community/doc-gen/blob/2b8cb2a3e7471c109f031eda59e5090b1c9b7fe3/search.js#L1-L23
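For illustration, a greedy, weight-based matcher along those lines can be sketched in a few lines of Lean (a hand-written sketch, not a port of search.js; the function name and the concrete weights are made up):

```lean
/-- Greedy fuzzy matching, sketched: scan the word once, consume pattern
    characters on case-insensitive hits, and add a heuristic bonus when a hit
    sits at a "word start" (after a separator or on an upper-case character).
    Returns `none` if not all pattern characters were consumed. -/
def greedyFuzzyScore (pattern word : String) : Option Int := Id.run do
  let pat := pattern.toLower.toList.toArray
  let mut pIdx := 0
  let mut score : Int := 0
  let mut prev : Char := '.'
  for c in word.toList do
    if pIdx < pat.size && c.toLower == pat.get! pIdx then
      score := score + (if prev == '.' || prev == '_' || c.isUpper then 3 else 1)
      pIdx := pIdx + 1
    prev := c
  return if pIdx == pat.size then some score else none
```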

There are certainly some rough edge cases since it's greedy and doesn't find an optimal score, but if the optimal version is already slow on the core Lean code base then it is certainly unusable on any larger project like mathlib.

**larsk21** (Contributor, Author) commented Mar 1, 2022

A benchmark I did:

Routine

  • load mathlib declarations from text file (118'021 declarations)
  • filter declarations by fuzzy matching on pattern
  • print filtered declarations (to /dev/null)

Setup

| Component | Value |
| --- | --- |
| Processor | Intel Core i7-6600U @ 4x 2.6 GHz |
| Memory | 8 GB @ 1867 MHz |
| Storage | Toshiba THNSN5256GPU7 239 GB |
| OS | Debian GNU/Linux 11 |
| Kernel | 5.10.60.1-microsoft-standard-WSL2 |

Results

user time in seconds, as measured by zsh's time, including the printlns

| Benchmark | Time [s] | Pattern | Matching Declarations |
| --- | --- | --- | --- |
| Single | 0.613 | `a` | 7659 |
| Short | 0.291 | `add` | 9688 |
| Medium | 0.353 | `categ_th` | 10998 |
| Long | 0.311 | `category_theory.limits.` | 3329 |
| Full | 0.006 | `algebraic_geometry.PresheafedSpace.is_open_immersion.forget_preserves_limits_of_right` | 1 |

**Kha** (Member) commented Mar 2, 2022

Nice data! And I'm surprised at the speed of your machine: my Ryzen 5600X @ 4.2GHz takes 1.2s on the following quick stand-alone variant of "Single", spending ~70% of the time inside fuzzyMatch:

```lean
import Lean
open Lean

unsafe def main : IO Unit := do
  initSearchPath (← Lean.findSysroot?)
  withImportModules [{module := `Lean}] {} 0 fun env => do
    let mut n := 0
    IO.println s!"{env.constants.size} decls"
    for (c, _) in env.constants.toList do
      if Lean.FuzzyMatching.fuzzyMatch "a" c.toString then n := n + 1
    IO.println s!"{n} matches"
```

I haven't taken a very close look at the code yet, but here is a quick ~20% speedup fix for the "Single" benchmark:

```lean
@[specialize] private def iterateLookaround ...
```

After that, perf report looks like this for me:

```
-   37.73%     8.39%  a.out    a.out               [.] l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
   - 29.34% l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
      + 23.59% l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
      + 5.68% lean_dec_ref_cold
   + 7.67% _start
   + 0.72% 0xffffffffffffffff
-   29.04%     7.25%  a.out    a.out               [.] l_Std_Range_forIn_loop___at___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_reverseStringInfo___spec__2
   - 21.79% l_Std_Range_forIn_loop___at___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_reverseStringInfo___spec__2
        10.39% lean_alloc_small
      + 9.77% lean_dec_ref_cold
        0.85% l_Char_isAlphanum
   + 7.25% _start
```

We are now spending more time in fuzzyMatchRec than in reverseStringInfo, but most of the overall time still goes to allocating & deallocating, so we should try to reduce allocations first. Here are some rough ideas:

  • The biggest offenders are probably the lists. With arrays, we could reduce them to one allocation each since we know the required capacity at Array.mkEmpty. The recursion would of course get a little less nice, passing around indices instead of lists presumably.
  • With Array CharInfo, the CharInfo.char field starts looking redundant because it is just a less dense representation of the original string (if UTF8 decoding ever becomes a bottleneck, we should probably switch to ASCII/bytes like LLVM. I assume the (very) rare Unicode query would still work, just maybe with less accurate scoring). If we reduce it to Array CharRole, we are down to just 1 allocation per string (enumerations are stored inline)! Further possible optimization: compress into ByteArray for 1/8 the length.
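To make the Array CharRole idea concrete, here is a rough sketch (hypothetical names and classification rules; the PR's actual CharRole may differ):

```lean
/-- Character roles as a plain enumeration: stored inline in an `Array`,
    so the whole role array costs a single allocation. -/
inductive CharRole where
  | head | tail | sep
  deriving Inhabited

/-- Classify every character of `s` in one pass, pre-sizing the array since
    the required capacity is known up front. -/
def charRoles (s : String) : Array CharRole := Id.run do
  let mut roles : Array CharRole := Array.mkEmpty s.length
  let mut prev : Option Char := none
  for c in s.toList do
    let afterSep := match prev with
      | none   => true
      | some p => !p.isAlphanum
    let role :=
      if !c.isAlphanum then CharRole.sep
      else if c.isUpper || afterSep then CharRole.head
      else CharRole.tail
    roles := roles.push role
    prev := some c
  return roles
```

The further ByteArray compression would then amount to storing each role as a byte (or even two bits) instead of a pointer-sized array slot.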

In summary, I'm optimistic that we can at least halve the execution time here, perhaps even more. The big question is, of course: will it be enough? If we can get it up to reasonable speeds, I believe the improvement in search quality would absolutely be worth it. I haven't done a thorough evaluation yet, but here is just one sample query on mathlib-docs where I'd hope this implementation would do a better job: open_imm. I also suspect, without proof, that Lean 4 code, at least Lean itself, might use longer namespaced names than mathlib, which would further penalize any greedy implementation (as you usually want to match something close to the end of the name).

**Kha** (Member) commented Mar 2, 2022

I'm a bit confused about the algorithm implementation though, is this still dynamic programming? I would expect some kind of array like in the LLVM impl that avoids identical recursive calls - your current implementation might be exponential on e.g. matching a against a sequence of as. If you do use a dynamic programming array filled during a forward scan, storing the CharRoles in arrays instead of lists like suggested above might also become an even better fit.
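For comparison, the forward-scan dynamic programming formulation could look roughly like this (a simplified sketch with made-up names; scoring is reduced to counting matched characters):

```lean
/-- Fill a (pattern.length + 1) × (word.length + 1) table of best scores in a
    single forward pass, so each (patternIdx, wordIdx) state is computed
    exactly once instead of being re-derived by overlapping recursive calls. -/
def dpFuzzyScore (pattern word : String) : Option Nat := Id.run do
  let p := pattern.toLower.toList.toArray
  let w := word.toLower.toList.toArray
  let width := w.size + 1
  -- best[i * width + j] = best score matching p[0..i) against w[0..j)
  let mut best : Array (Option Nat) := mkArray ((p.size + 1) * width) none
  for j in [0:width] do
    best := best.set! j (some 0)  -- the empty pattern matches any word prefix
  for i in [1:p.size + 1] do
    for j in [1:width] do
      let skip := best.get! (i * width + (j - 1))  -- don't use w[j-1]
      let mtch :=                                  -- use w[j-1] for p[i-1]
        if p.get! (i - 1) == w.get! (j - 1) then
          (best.get! ((i - 1) * width + (j - 1))).map (· + 1)
        else none
      let cell := match skip, mtch with
        | some a, some b => some (max a b)
        | some a, none   => some a
        | none,   some b => some b
        | none,   none   => none
      best := best.set! (i * width + j) cell
  return best.get! (p.size * width + w.size)
```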

**larsk21** (Contributor, Author) commented Mar 2, 2022

> I'm a bit confused about the algorithm implementation though, is this still dynamic programming? I would expect some kind of array like in the LLVM impl that avoids identical recursive calls - your current implementation might be exponential on e.g. matching a against a sequence of as. If you do use a dynamic programming array filled during a forward scan, storing the CharRoles in arrays instead of lists like suggested above might also become an even better fit.

I guess that's true: the idea is still the same as with dynamic programming, but the implementation is less efficient.

**gebner** (Member) commented Mar 2, 2022

> your current implementation might be exponential on e.g. matching a against a sequence of as.

You probably mean matching aⁿ on a²ⁿ; matching a single a is merely quadratic. 😄

> The biggest offenders are probably the lists. With arrays, we could reduce them to one allocation each since we know the required capacity at Array.mkEmpty. The recursion would of course get a little less nice, passing around indices instead of lists presumably.

The lists were also the first thing that caught my eye. As Wojciech said before, it's also worth trying to skip the precomputation entirely and compute character roles etc. on demand.

Another low-hanging optimization is that fuzzyMatchRec can return immediately if the pattern is longer than the word.
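Sketched as a standalone predicate (the real guard would sit at the top of fuzzyMatchRec; lengths here are counted in characters):

```lean
/-- Early rejection: a pattern with more characters than the word can never be
    embedded in it, so the matcher can bail out before computing any
    character information. -/
def quickReject (pattern word : String) : Bool :=
  pattern.length > word.length
```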

Could you also please commit (or post) the benchmarking code you've been using?

**larsk21** (Contributor, Author) commented Mar 2, 2022

I just re-implemented the fuzzy matching non-recursively using arrays and to my surprise it was slower than before. I pushed it on fuzzy-matching-arrays, if you want to take a look.

The case when wordIdx < patternIdx could be removed, but the index calculation would be more complicated.

**larsk21** (Contributor, Author) commented Mar 2, 2022

My benchmark code is just (with varying patterns and iterations):

```lean
import Lean
open Lean.FuzzyMatching

def main : IO Unit := do
  let mathlibDecls ← IO.FS.lines (System.FilePath.mk "data/mathlib-decls")
  for _ in [:100] do
    IO.println <| mathlibDecls.filter (fun s => fuzzyMatch "categ_th" s)
```

(Output is redirected to /dev/null.)

**larsk21** (Contributor, Author) commented Mar 3, 2022

> I just re-implemented the fuzzy matching non-recursively using arrays and to my surprise it was slower than before. I pushed it on fuzzy-matching-arrays, if you want to take a look.

I also tried pre-initializing the array with ⟨none, none⟩ and using set!, as well as removing the structure MatchResult (using a twice-as-long array of Option Int), both without success (the second one was even slower).

see fuzzy-matching-arrays-plain

**Kha** (Member) commented Mar 3, 2022

That is surprising (though I suppose it's definitely possible for the recursive version to require fewer than pattern.length * word.length steps), but I think it is still a better foundation for further optimizations! Even with the @[specialize] from above, hotspot's flamegraph tells us that most time is still spent on allocations:
[flamegraph screenshot]
The likely allocation source that is left is the Option values, which we can easily avoid by moving the first iteration out of the loop and @[inline]ing charRole: Kha@e7c2e0f.
After that, stringInfo is already more than twice as fast as fuzzyMatchCore, so we should move our focus there next, but suspiciously half of stringInfos time is still spent with allocations. Looking at trace.compiler.ir.result, it turns out that we allocate a 3-tuple (i.e. two nested pairs) per iteration to store the 3 mutable variables, which the ForM abstraction forces us to do. Ideally Lean should be able to eliminate this tuple on the IR level after inlining everything, but right now it doesn't. So since actually accessing the string seems to be pretty cheap, let's just do so redundantly to eliminate prev/curr: 699d5f3.
Now the allocations are actually gone from the loop and stringInfo is pretty fast:
[flamegraph screenshot]

Let's move on to fuzzyMatchCore then. The first thing I noticed in the profiler was a loop (*_forIn_loop_*) taking 7% of the total time, spending 5% of the total time inside Array.push. That must be the initial loop filling the array with nones, which we can reduce to just 0.9% of the total time by using mkArray.
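The fix in isolation (hypothetical element type):

```lean
-- before: one Array.push per element, with amortized reallocation
def initByPush (n : Nat) : Array (Option Nat) := Id.run do
  let mut a : Array (Option Nat) := #[]
  for _ in [0:n] do
    a := a.push none
  return a

-- after: a single, exactly-sized allocation, already filled with `none`
def initByMkArray (n : Nat) : Array (Option Nat) :=
  mkArray n none
```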

After that, I didn't see a clear bottleneck left:
[flamegraph screenshot]

A good amount of time is still spent on allocations, but not overwhelmingly much. Spending 8% total/ 20% relative just on set probably isn't great. I'm not quite sure what to make of the hottest instruction there:
[perf annotate screenshot]
This must be part of Array.set! given the (fortunately skipped, apparently) branch for copying the array. In my experience, the assembly-level annotations by perf can sometimes be off by a few instructions, so this is more likely the array access or perhaps a branch miss (but that branch should be perfectly predictable). It doesn't look like we could improve anything there in either the compiler, runtime, or implementation other than to perhaps decrease the size of array elements, though LLVM's don't look any smaller to me.

I could make some educated guesses at further potential optimizations, like avoiding more Option allocations, but these are not quite as straightforward and the impact is less clear, so I will leave it at this for now. I've pushed my state to https://github.com/Kha/lean4/tree/fuzzy-matching-arrays-plain. @larsk21 Could you re-run your benchmarks on this so we have a comparison on the same hardware as the previous numbers?

**Kha** (Member) commented Mar 3, 2022

There is, of course, always the optimization of decreasing the input size. For example, while @gebner has convinced me in the previous thread that we should not ignore namespaces in e.g. the workspace symbols request, it still isn't clear to me whether that is the correct default, not just for speed but even for relevance of results. A possible compromise could be to include them only if a . is part of the query.

**gebner** (Member) commented Mar 3, 2022

> There is, of course, always the optimization of decreasing the input size. For example, while @gebner has convinced me in the previous thread that we should not ignore namespaces in e.g. the workspace symbols request, it still isn't clear to me whether that is the correct default, not just for speed but even for relevance of results.

There are very few definitions that you can find without looking at the namespace (think of MetaM.run or Int.ofNat or List.filter etc.; all of which require narrowing down on the namespace because there are a lot of functions with that base name). Therefore ignoring the namespace can't be the solution (as tried in the last PR).

> A possible compromise could be to include them only if a . is part of the query.

To elaborate on this proposal, how about matching name components separately? That is, say the query is topsp.comp and the declaration name is Mathlib.TopologicalSpace.Compacts.equiv. Then we would try to match comp against TopologicalSpace, Compacts, equiv separately; and then topsp against Mathlib and TopologicalSpace (looking for an embedding of the components). The nice advantage of this approach is that we can cache the component queries.
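A rough Lean sketch of this component-wise scheme (hypothetical names; a case-insensitive subsequence test stands in for the PR's real per-component matcher):

```lean
/-- Stand-in per-component matcher: case-insensitive subsequence test. -/
def subseqMatch (q w : String) : Bool :=
  go q.toLower.toList w.toLower.toList
where
  go : List Char → List Char → Bool
    | [], _ => true
    | _, [] => false
    | c :: cs, d :: ds => if c == d then go cs ds else go (c :: cs) ds

/-- Look for an order-preserving embedding of the query components into the
    declaration's name components. -/
def matchComponents : List String → List String → Bool
  | [], _ => true
  | _, [] => false
  | q :: qs, n :: ns =>
    (subseqMatch q n && matchComponents qs ns) || matchComponents (q :: qs) ns

#eval matchComponents ("topsp.comp".splitOn ".")
  ("Mathlib.TopologicalSpace.Compacts.equiv".splitOn ".")  -- true
```

The caching then falls out naturally: the result of a component-against-component match depends only on that pair of strings, so it can be memoized across declarations.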

**Kha** (Member) commented Mar 3, 2022

> To elaborate on this proposal, how about matching name components separately?

Yes, I think this is the natural conclusion. But doing so anywhere within the compound name seems problematic when combined with the "only on ." rule - is it ok if comp does not match something that topsp.comp does? I would have expected that we match at least the last component of the query only against that of the declaration. Do you have a use case in mind where you want to jump to a symbol by inputting only part of its namespace? Again I'm mostly thinking about documentSymbol here because e.g. auto completion works one-component-at-a-time anyway, currently at least.

**gebner** (Member) commented Mar 3, 2022

> But doing so anywhere within the compound name seems problematic when combined with the "only on ." rule - is it ok if comp does not match something that topsp.comp does?

Right, I don't think the "only on ." rule is a good idea. Having two different search strategies selected based on whether a dot is present in the pattern seems like a recipe for confusion.

In my proposal, topsp.comp would match all of the following:

  • Mathlib.TopologicalSpace.Compacts
  • Mathlib.TopologicalSpace.Compacts.equiv
  • TopologicalSpace.Compacts

> I would have expected that we match at least the last component of the query only against that of the declaration. Do you have a use case in mind where you want to jump to a symbol by inputting only part of its namespace?

I don't think the last-component-matches-last-component restriction affects expressivity too much. If you want to search for declarations in a namespace you can just append a dot (e.g., topsp. returns declarations in the Mathlib.TopologicalSpace namespace).

Lifting the restriction is useful for set_option where set_option simp<TAB> should also offer trace.Meta.Tactic.simp.congr as a completion.

It's also useful if you are searching for a theorem about gravel and want to see both gravel_iff_rock and gravel.to_stone. But if we match components separately, then you'll need to try various dot-combinations anyhow (i.e., gravelrock and gravel.rock).

You also type topsp.comp letter-by-letter, and I would at least a priori expect the result set to get smaller with each letter you write (i.e., that topsp. returns a subset of the results of topsp).

To sum up, I think the last-component-matches-last-component restriction could be sensible for declarations. The "only on ." rule is just confusing. (Same goes for the current "case-sensitive if the query contains an upper-case character" rule, which is fortunately gone in this PR.)

**Kha** (Member) commented Mar 4, 2022

Okay, if we want to do component-wise matching, I'd say let's leave out the restriction for now and see if we need it for either performance or relevance (for which it could simply be a score adjustment).

But for what it's worth, Lars' original benchmarks are no slower than ~220ms on my machine (which isn't the slowest, sure) now with my branch, so maybe this is good enough to be merged for now and evaluate performance & relevance in practice?

**gebner** (Member) commented Mar 4, 2022

> But for what it's worth, Lars' original benchmarks are no slower than ~220ms on my machine (which isn't the slowest, sure) now with my branch, so maybe this is good enough to be merged for now and evaluate performance & relevance in practice?

For me it's between 60 and 340 ms with your branch (and 60 and 460 ms with the PR), which I think is still acceptable for interactive usage. I agree we should merge this now; it's much better than what's currently in Lean, and we can iterate on it later.

**larsk21** (Contributor, Author) commented Mar 10, 2022

> Could you re-run your benchmarks on this so we have a comparison on the same hardware as the previous numbers?

Of course, here are the results:

| Benchmark | Time [s] |
| --- | --- |
| Single | 0.694 |
| Short | 0.391 |
| Medium | 0.696 |
| Long | 0.376 |
| Full | 0.003 |

The new version is slightly slower, but in the same range. The Medium benchmark differs the most; I would guess because of the overhead of the complete forward pass.

**larsk21** (Contributor, Author) commented Mar 10, 2022

Currently the fuzzy matching algorithm is used in two places: the workspace symbols request and the (id, dot, option, ...) completion requests. The workspace symbols request tries to match all available declarations, while the completion request keeps the previous step-wise matching. This benefits performance and, I would say, also usability. (Imagine using a naming scheme similar to an imported namespace, resulting in a lot of completion items.)

**larsk21** (Contributor, Author) commented Mar 10, 2022

The threshold for displaying matched items is currently set to 0.2; we probably have to wait and see how this works in practice before changing it, if necessary.

**Kha** (Member) commented Mar 10, 2022

> The Medium benchmark differs the most; I would guess because of the overhead of the complete forward pass.

Makes sense. I suppose there is also the option of mixing the two solutions - a recursive backward search that uses the array for memoization. Not sure if it's worth investigating that direction.
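As a sketch, such a hybrid could use the same recurrence as the forward table above, but evaluated backward on demand, with a flat array as memo table (a hand-written illustration with made-up names; none marks a cache slot that has not been computed yet):

```lean
/-- Backward recursion over (patternIdx, wordIdx), memoized in a flat array.
    Scoring is reduced to counting matched characters. -/
partial def goMemo (p w : Array Char) (pi wi : Nat)
    (cache : Array (Option (Option Nat))) :
    Option Nat × Array (Option (Option Nat)) :=
  let idx := pi * (w.size + 1) + wi
  match cache.get! idx with
  | some r => (r, cache)  -- memo hit: skip the whole subtree
  | none =>
    let (r, cache) :=
      if pi == 0 then (some 0, cache)     -- pattern exhausted: match
      else if wi == 0 then (none, cache)  -- word exhausted: no match
      else
        let (skip, cache) := goMemo p w pi (wi - 1) cache
        let (mtch, cache) :=
          if p.get! (pi - 1) == w.get! (wi - 1) then
            let (r, cache) := goMemo p w (pi - 1) (wi - 1) cache
            (r.map (· + 1), cache)
          else (none, cache)
        let best := match skip, mtch with
          | some a, some b => some (max a b)
          | some a, none   => some a
          | none,   some b => some b
          | none,   none   => none
        (best, cache)
    (r, cache.set! idx (some r))

def fuzzyMemo (pattern word : String) : Option Nat :=
  let p := pattern.toLower.toList.toArray
  let w := word.toLower.toList.toArray
  (goMemo p w p.size w.size (mkArray ((p.size + 1) * (w.size + 1)) none)).1
```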

**larsk21** (Contributor, Author) commented Mar 10, 2022

> I suppose there is also the option of mixing the two solutions - a recursive backward search that uses the array for memoization.

I also thought of this, but haven't had time to try it yet.

larsk21 marked this pull request as ready for review on March 10, 2022, 14:18
Two resolved review threads on src/Lean/Data/FuzzyMatching.lean.
An inline review thread on the following code:

```lean
    return true
  let mut aIt := a.mkIterator
  for i in [:b.bsize] do
    if aIt.curr.toLower == (b.get i).toLower then
```
Member commented:

Note that b.get i returns arbitrary (= 'a') if i is not at a UTF-8 char boundary, so this should reject any Unicode queries. Which is fine I think, and we definitely don't want to pay any extra overhead for a use case we don't even know is relevant at all in practice. In the future, we should introduce a new primitive String.getByte : String.Pos -> UInt8, which would allow us to further optimize the ASCII case and at least accept Unicode queries, even if the results might not be quite as expected.
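To illustrate the byte-level direction (String.getByte does not exist yet; String.toUTF8, which copies the string, stands in purely for demonstration):

```lean
/-- ASCII case folding on a raw byte: 'A'..'Z' → 'a'..'z'. -/
def asciiToLower (b : UInt8) : UInt8 :=
  if 65 ≤ b && b ≤ 90 then b + 32 else b

/-- Does `w` contain the ASCII character `needle`, case-insensitively?
    A byte-level stand-in for the iterator-based loop under review. -/
def containsAsciiChar (needle : Char) (w : String) : Bool := Id.run do
  let bytes := w.toUTF8
  let n := asciiToLower needle.toNat.toUInt8
  for i in [0:bytes.size] do
    if asciiToLower (bytes.get! i) == n then
      return true
  return false
```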

Contributor (Author) replied:

Since Iterator.curr uses get as well, doesn't this mean that any characters outside a UTF-8 char boundary are effectively equal to 'a', 'A', and any other characters outside the boundaries here?

Member replied:

Yes, but note that next only visits char boundaries :)

Contributor (Author) replied:

So next would have to reject the move when the target is not a valid UTF-8 character? But this code increments the position by 1 for non-UTF-8 characters.

```c
/* invalid UTF-8 encoded string */
return lean_box(i+1);
```

(from lean_string_utf8_next)

A resolved review thread on src/Lean/Server/Completion.lean.
**larsk21** (Contributor, Author) commented Mar 11, 2022

I know you already approved the PR, but I just tried to use the combination of backward search and array memoization. It is slightly slower for short patterns, but provides a speedup for longer patterns (compared to the current array implementation).

see fuzzy-matching-cache

Benchmark results:

| Benchmark | Time [s] |
| --- | --- |
| Single | 0.742 |
| Short | 0.399 |
| Medium | 0.423 |
| Long | 0.280 |
| Full | 0.140 |

**Kha** (Member) commented Mar 11, 2022

Interesting. I was actually thinking about this approach as well just now - if we think of the recursion as paths through the array from the bottom right corner to the top left, the only "join points" of paths, i.e. where we should memoize, are the coordinates where needle and haystack agree, no? So could we elide loading and storing the cache at all other coordinates?

**larsk21** (Contributor, Author) commented Mar 11, 2022

If we start in the same cell, the two paths [miss, match] and [match, miss] will reach the same cell (both consuming two word characters and one pattern character), right? Wouldn't this mean that most of the cells reached by the recursive backward search can be reached via different paths? The only thing filtering paths is allowMatch, but I'm not sure whether we can know beforehand, based on this, if we need to store the result in the array.

**larsk21** (Contributor, Author) commented Mar 11, 2022

I tried this just out of curiosity, but using a HashMap with the Nat indices as keys is slower (~0.6s for the Medium benchmark, ~1.2s for the Single benchmark).

@leodemoura leodemoura merged commit e430496 into leanprover:master Mar 12, 2022