
Add string fuzzy matching #1023

Merged: 9 commits into leanprover:master on Mar 12, 2022

Conversation

**larsk21** (Contributor) commented Feb 18, 2022

Fuzzy matching based on the dynamic programming algorithm used in clang/LLVM. Works well, but is painfully slow when used for the workspace symbols request.

closes #960

**gebner** (Member) commented Feb 18, 2022

> but is painfully slow when used for the workspace symbols request.

The search algorithm that we use for the documentation just adds a few weights to the current matching algorithm and seems to work well in practice (and is fast enough for interactive usage): https://github.com/leanprover-community/doc-gen/blob/2b8cb2a3e7471c109f031eda59e5090b1c9b7fe3/search.js#L1-L23
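For illustration, a greedy, weight-based matcher along those lines can be sketched in a few lines of Lean (a hand-written sketch, not a port of search.js; the function name and the concrete weights are made up):

```lean
/-- Greedy fuzzy matching, sketched: scan the word once, consume pattern
    characters on case-insensitive hits, and add a heuristic bonus when a hit
    sits at a "word start" (after a separator or on an upper-case character).
    Returns `none` if not all pattern characters were consumed. -/
def greedyFuzzyScore (pattern word : String) : Option Int := Id.run do
  let pat := pattern.toLower.toList.toArray
  let mut pIdx := 0
  let mut score : Int := 0
  let mut prev : Char := '.'
  for c in word.toList do
    if pIdx < pat.size && c.toLower == pat.get! pIdx then
      score := score + (if prev == '.' || prev == '_' || c.isUpper then 3 else 1)
      pIdx := pIdx + 1
    prev := c
  return if pIdx == pat.size then some score else none
```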

There are certainly some rough edge cases since it's greedy and doesn't find an optimal score, but if the optimal version is already slow on the core Lean code base then it is certainly unusable on any larger project like mathlib.

**larsk21** (Contributor, Author) commented Mar 1, 2022

A benchmark I did:

Routine

  • load mathlib declarations from text file (118'021 declarations)
  • filter declarations by fuzzy matching on pattern
  • print filtered declarations (to /dev/null)

Setup

| Component | Value |
| --- | --- |
| Processor | Intel Core i7-6600U @ 4x 2.6 GHz |
| Memory | 8 GB @ 1867 MHz |
| Storage | Toshiba THNSN5256GPU7 239 GB |
| OS | Debian GNU/Linux 11 |
| Kernel | 5.10.60.1-microsoft-standard-WSL2 |

Results

user time in seconds, as measured by zsh's time, including the printlns

| Benchmark | Time [s] | Pattern | Matching Declarations |
| --- | --- | --- | --- |
| Single | 0.613 | `a` | 7659 |
| Short | 0.291 | `add` | 9688 |
| Medium | 0.353 | `categ_th` | 10998 |
| Long | 0.311 | `category_theory.limits.` | 3329 |
| Full | 0.006 | `algebraic_geometry.PresheafedSpace.is_open_immersion.forget_preserves_limits_of_right` | 1 |

**Kha** (Member) commented Mar 2, 2022

Nice data! And I'm surprised at the speed of your machine: my Ryzen 5600X @ 4.2GHz takes 1.2s on the following quick stand-alone variant of "Single", spending ~70% of the time inside fuzzyMatch:

```lean
import Lean
open Lean

unsafe def main : IO Unit := do
  initSearchPath (← Lean.findSysroot?)
  withImportModules [{module := `Lean}] {} 0 fun env => do
    let mut n := 0
    IO.println s!"{env.constants.size} decls"
    for (c, _) in env.constants.toList do
      if Lean.FuzzyMatching.fuzzyMatch "a" c.toString then n := n + 1
    IO.println s!"{n} matches"
```

I haven't taken a very close look at the code yet, but here is a quick ~20% speedup fix for the "Single" benchmark:

```lean
@[specialize] private def iterateLookaround ...
```

After that, perf report looks like this for me:

```
-   37.73%     8.39%  a.out    a.out               [.] l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
   - 29.34% l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
      + 23.59% l___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_fuzzyMatchRec
      + 5.68% lean_dec_ref_cold
   + 7.67% _start
   + 0.72% 0xffffffffffffffff
-   29.04%     7.25%  a.out    a.out               [.] l_Std_Range_forIn_loop___at___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_reverseStringInfo___spec__2
   - 21.79% l_Std_Range_forIn_loop___at___private_Lean_Data_FuzzyMatching_0__Lean_FuzzyMatching_reverseStringInfo___spec__2
        10.39% lean_alloc_small
      + 9.77% lean_dec_ref_cold
        0.85% l_Char_isAlphanum
   + 7.25% _start
```

We are now spending more time in fuzzyMatchRec than in reverseStringInfo, but most of the overall time still goes to allocating & deallocating, so we should try to reduce allocations first. Here are some rough ideas:

  • The biggest offenders are probably the lists. With arrays, we could reduce them to one allocation each since we know the required capacity at Array.mkEmpty. The recursion would of course get a little less nice, passing around indices instead of lists presumably.
  • With Array CharInfo, the CharInfo.char field starts looking redundant because it is just a less dense representation of the original string (if UTF8 decoding ever becomes a bottleneck, we should probably switch to ASCII/bytes like LLVM. I assume the (very) rare Unicode query would still work, just maybe with less accurate scoring). If we reduce it to Array CharRole, we are down to just 1 allocation per string (enumerations are stored inline)! Further possible optimization: compress into ByteArray for 1/8 the length.
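To make the Array CharRole idea concrete, here is a rough sketch (hypothetical names and classification rules; the PR's actual CharRole may differ):

```lean
/-- Character roles as a plain enumeration: stored inline in an `Array`,
    so the whole role array costs a single allocation. -/
inductive CharRole where
  | head | tail | sep
  deriving Inhabited

/-- Classify every character of `s` in one pass, pre-sizing the array since
    the required capacity is known up front. -/
def charRoles (s : String) : Array CharRole := Id.run do
  let mut roles : Array CharRole := Array.mkEmpty s.length
  let mut prev : Option Char := none
  for c in s.toList do
    let afterSep := match prev with
      | none   => true
      | some p => !p.isAlphanum
    let role :=
      if !c.isAlphanum then CharRole.sep
      else if c.isUpper || afterSep then CharRole.head
      else CharRole.tail
    roles := roles.push role
    prev := some c
  return roles
```

The further ByteArray compression would then amount to storing each role as a byte (or even two bits) instead of a pointer-sized array slot.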

In summary, I'm optimistic that we can at least halve the execution time here, perhaps even more. The big question is, of course: will it be enough? If we can get it up to reasonable speeds, I believe the improvement in search quality would absolutely be worth it. I haven't done a thorough evaluation yet, but here is just one sample query on mathlib-docs where I'd hope this implementation would do a better job: open_imm. I also suspect, without proof, that Lean 4 code, at least Lean itself, might use longer namespaced names than mathlib, which would further penalize any greedy implementation (as you usually want to match something close to the end of the name).

**Kha** (Member) commented Mar 2, 2022

I'm a bit confused about the algorithm implementation though, is this still dynamic programming? I would expect some kind of array like in the LLVM impl that avoids identical recursive calls - your current implementation might be exponential on e.g. matching a against a sequence of as. If you do use a dynamic programming array filled during a forward scan, storing the CharRoles in arrays instead of lists like suggested above might also become an even better fit.
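For comparison, the forward-scan dynamic programming formulation could look roughly like this (a simplified sketch with made-up names; scoring is reduced to counting matched characters):

```lean
/-- Fill a (pattern.length + 1) × (word.length + 1) table of best scores in a
    single forward pass, so each (patternIdx, wordIdx) state is computed
    exactly once instead of being re-derived by overlapping recursive calls. -/
def dpFuzzyScore (pattern word : String) : Option Nat := Id.run do
  let p := pattern.toLower.toList.toArray
  let w := word.toLower.toList.toArray
  let width := w.size + 1
  -- best[i * width + j] = best score matching p[0..i) against w[0..j)
  let mut best : Array (Option Nat) := mkArray ((p.size + 1) * width) none
  for j in [0:width] do
    best := best.set! j (some 0)  -- the empty pattern matches any word prefix
  for i in [1:p.size + 1] do
    for j in [1:width] do
      let skip := best.get! (i * width + (j - 1))  -- don't use w[j-1]
      let mtch :=                                  -- use w[j-1] for p[i-1]
        if p.get! (i - 1) == w.get! (j - 1) then
          (best.get! ((i - 1) * width + (j - 1))).map (· + 1)
        else none
      let cell := match skip, mtch with
        | some a, some b => some (max a b)
        | some a, none   => some a
        | none,   some b => some b
        | none,   none   => none
      best := best.set! (i * width + j) cell
  return best.get! (p.size * width + w.size)
```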

**larsk21** (Contributor, Author) commented Mar 2, 2022

> I'm a bit confused about the algorithm implementation though, is this still dynamic programming? I would expect some kind of array like in the LLVM impl that avoids identical recursive calls - your current implementation might be exponential on e.g. matching a against a sequence of as. If you do use a dynamic programming array filled during a forward scan, storing the CharRoles in arrays instead of lists like suggested above might also become an even better fit.

I guess that's true: the idea is still the same as with dynamic programming, but the implementation is less efficient.

**gebner** (Member) commented Mar 2, 2022

> your current implementation might be exponential on e.g. matching a against a sequence of as.

You probably mean matching aⁿ on a²ⁿ; matching a single a is merely quadratic. 😄

> The biggest offenders are probably the lists. With arrays, we could reduce them to one allocation each since we know the required capacity at Array.mkEmpty. The recursion would of course get a little less nice, passing around indices instead of lists presumably.

The lists were also the first thing that caught my eye. As Wojciech said before, it's also worth trying to skip the precomputation entirely and compute character roles etc. on demand.

Another low-hanging optimization is that fuzzyMatchRec can return immediately if the pattern is longer than the word.
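Sketched as a standalone predicate (the real guard would sit at the top of fuzzyMatchRec; lengths here are counted in characters):

```lean
/-- Early rejection: a pattern with more characters than the word can never be
    embedded in it, so the matcher can bail out before computing any
    character information. -/
def quickReject (pattern word : String) : Bool :=
  pattern.length > word.length
```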

Could you also please commit (or post) the benchmarking code you've been using?

**larsk21** (Contributor, Author) commented Mar 2, 2022

I just re-implemented the fuzzy matching non-recursively using arrays and to my surprise it was slower than before. I pushed it on fuzzy-matching-arrays, if you want to take a look.

The case when wordIdx < patternIdx could be removed, but the index calculation would be more complicated.

**larsk21** (Contributor, Author) commented Mar 2, 2022

My benchmark code is just (with varying patterns and iterations):

```lean
import Lean
open Lean.FuzzyMatching

def main : IO Unit := do
  let mathlibDecls ← IO.FS.lines (System.FilePath.mk "data/mathlib-decls")
  for _ in [:100] do
    IO.println <| mathlibDecls.filter (fun s => fuzzyMatch "categ_th" s)
```

(Output is redirected to /dev/null.)

**larsk21** (Contributor, Author) commented Mar 3, 2022

> I just re-implemented the fuzzy matching non-recursively using arrays and to my surprise it was slower than before. I pushed it on fuzzy-matching-arrays, if you want to take a look.

I also tried pre-initializing the array with ⟨none, none⟩ and using set!, as well as removing the structure MatchResult (using a twice-as-long array of Option Int), both without success (the second one was even slower).

see fuzzy-matching-arrays-plain

**Kha** (Member) commented Mar 3, 2022

That is surprising (though I suppose it's definitely possible for the recursive version to require fewer than pattern.length * word.length steps), but I think it is still a better foundation for further optimizations! Even with the @[specialize] from above, hotspot's flamegraph tells us that most time is still spent on allocations:
[flamegraph screenshot]
The likely allocation source that is left is the Option values, which we can easily avoid by moving the first iteration out of the loop and @[inline]ing charRole: Kha@e7c2e0f.
After that, stringInfo is already more than twice as fast as fuzzyMatchCore, so we should move our focus there next, but suspiciously half of stringInfos time is still spent with allocations. Looking at trace.compiler.ir.result, it turns out that we allocate a 3-tuple (i.e. two nested pairs) per iteration to store the 3 mutable variables, which the ForM abstraction forces us to do. Ideally Lean should be able to eliminate this tuple on the IR level after inlining everything, but right now it doesn't. So since actually accessing the string seems to be pretty cheap, let's just do so redundantly to eliminate prev/curr: 699d5f3.
Now the allocations are actually gone from the loop and stringInfo is pretty fast:
[flamegraph screenshot]

Let's move on to fuzzyMatchCore then. The first thing I noticed in the profiler was a loop (*_forIn_loop_*) taking 7% of the total time, spending 5% of the total time inside Array.push. That must be the initial loop filling the array with nones, which we can reduce to just 0.9% of the total time by using mkArray.
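The fix in isolation (hypothetical element type):

```lean
-- before: one Array.push per element, with amortized reallocation
def initByPush (n : Nat) : Array (Option Nat) := Id.run do
  let mut a : Array (Option Nat) := #[]
  for _ in [0:n] do
    a := a.push none
  return a

-- after: a single, exactly-sized allocation, already filled with `none`
def initByMkArray (n : Nat) : Array (Option Nat) :=
  mkArray n none
```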

After that, I didn't see a clear bottleneck left:
[flamegraph screenshot]

A good amount of time is still spent on allocations, but not overwhelmingly much. Spending 8% total/ 20% relative just on set probably isn't great. I'm not quite sure what to make of the hottest instruction there:
[perf annotate screenshot]
This must be part of Array.set! given the (fortunately skipped, apparently) branch for copying the array. In my experience, the assembly-level annotations by perf can sometimes be off by a few instructions, so this is more likely the array access or perhaps a branch miss (but that branch should be perfectly predictable). It doesn't look like we could improve anything there in either the compiler, runtime, or implementation other than to perhaps decrease the size of array elements, though LLVM's don't look any smaller to me.

I could make some educated guesses at further potential optimizations, like avoiding more Option allocations, but these are not quite as straightforward and the impact is less clear, so I will leave it at this for now. I've pushed my state to https://github.com/Kha/lean4/tree/fuzzy-matching-arrays-plain. @larsk21 Could you re-run your benchmarks on this so we have a comparison on the same hardware as the previous numbers?

**Kha** (Member) commented Mar 3, 2022

There is, of course, always the optimization of decreasing the input size. For example, while @gebner has convinced me in the previous thread that we should not ignore namespaces in e.g. the workspace symbols request, it still isn't clear to me whether that is the correct default, not just for speed but even for relevance of results. A possible compromise could be to include them only if a . is part of the query.

**gebner** (Member) commented Mar 3, 2022

> There is, of course, always the optimization of decreasing the input size. For example, while @gebner has convinced me in the previous thread that we should not ignore namespaces in e.g. the workspace symbols request, it still isn't clear to me whether that is the correct default, not just for speed but even for relevance of results.

There are very few definitions that you can find without looking at the namespace (think of MetaM.run or Int.ofNat or List.filter etc.; all of which require narrowing down on the namespace because there are a lot of functions with that base name). Therefore ignoring the namespace can't be the solution (as tried in the last PR).

> A possible compromise could be to include them only if a . is part of the query.

To elaborate on this proposal, how about matching name components separately? That is, say the query is topsp.comp and the declaration name is Mathlib.TopologicalSpace.Compacts.equiv. Then we would try to match comp against TopologicalSpace, Compacts, equiv separately; and then topsp against Mathlib and TopologicalSpace (looking for an embedding of the components). The nice advantage of this approach is that we can cache the component queries.
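A rough Lean sketch of this component-wise scheme (hypothetical names; a case-insensitive subsequence test stands in for the PR's real per-component matcher):

```lean
/-- Stand-in per-component matcher: case-insensitive subsequence test. -/
def subseqMatch (q w : String) : Bool :=
  go q.toLower.toList w.toLower.toList
where
  go : List Char → List Char → Bool
    | [], _ => true
    | _, [] => false
    | c :: cs, d :: ds => if c == d then go cs ds else go (c :: cs) ds

/-- Look for an order-preserving embedding of the query components into the
    declaration's name components. -/
def matchComponents : List String → List String → Bool
  | [], _ => true
  | _, [] => false
  | q :: qs, n :: ns =>
    (subseqMatch q n && matchComponents qs ns) || matchComponents (q :: qs) ns

#eval matchComponents ("topsp.comp".splitOn ".")
  ("Mathlib.TopologicalSpace.Compacts.equiv".splitOn ".")  -- true
```

The caching then falls out naturally: the result of a component-against-component match depends only on that pair of strings, so it can be memoized across declarations.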

**Kha** (Member) commented Mar 3, 2022

> To elaborate on this proposal, how about matching name components separately?

Yes, I think this is the natural conclusion. But doing so anywhere within the compound name seems problematic when combined with the "only on ." rule - is it ok if comp does not match something that topsp.comp does? I would have expected that we match at least the last component of the query only against that of the declaration. Do you have a use case in mind where you want to jump to a symbol by inputting only part of its namespace? Again I'm mostly thinking about documentSymbol here because e.g. auto completion works one-component-at-a-time anyway, currently at least.

**gebner** (Member) commented Mar 3, 2022

> But doing so anywhere within the compound name seems problematic when combined with the "only on ." rule - is it ok if comp does not match something that topsp.comp does?

Right, I don't think the "only on ." rule is a good idea. Having two different search strategies selected based on whether a dot is present in the pattern seems like a recipe for confusion.

In my proposal, topsp.comp would match all of the following:

  • Mathlib.TopologicalSpace.Compacts
  • Mathlib.TopologicalSpace.Compacts.equiv
  • TopologicalSpace.Compacts

> I would have expected that we match at least the last component of the query only against that of the declaration. Do you have a use case in mind where you want to jump to a symbol by inputting only part of its namespace?

I don't think the last-component-matches-last-component restriction affects expressivity too much. If you want to search for declarations in a namespace you can just append a dot (e.g., topsp. returns declarations in the Mathlib.TopologicalSpace namespace).

Lifting the restriction is useful for set_option where set_option simp<TAB> should also offer trace.Meta.Tactic.simp.congr as a completion.

It's also useful if you are searching for a theorem about gravel and want to see both gravel_iff_rock and gravel.to_stone. But if we match components separately, then you'll need to try various dot-combinations anyhow (i.e., gravelrock and gravel.rock).

You also type topsp.comp letter-by-letter, and I would at least a priori expect the result set to get smaller with each letter you write (i.e., that topsp. returns a subset of the results of topsp).

To sum up, I think the last-component-matches-last-component restriction could be sensible for declarations. The "only on ." rule is just confusing. (Same goes for the current "case-sensitive if the query contains an upper-case character" rule, which is fortunately gone in this PR.)

**Kha** (Member) commented Mar 4, 2022

Okay, if we want to do component-wise matching, I'd say let's leave out the restriction for now and see if we need it for either performance or relevance (for which it could simply be a score adjustment).

But for what it's worth, Lars' original benchmarks are no slower than ~220ms on my machine (which isn't the slowest, sure) now with my branch, so maybe this is good enough to be merged for now and evaluate performance & relevance in practice?

**gebner** (Member) commented Mar 4, 2022

> But for what it's worth, Lars' original benchmarks are no slower than ~220ms on my machine (which isn't the slowest, sure) now with my branch, so maybe this is good enough to be merged for now and evaluate performance & relevance in practice?

For me it's between 60 and 340 ms with your branch (and 60 and 460 ms with the PR), which I think is still acceptable for interactive usage. I agree we should merge this now; it's much better than what's currently in Lean, and we can iterate on it later.

**larsk21** (Contributor, Author) commented Mar 10, 2022

> Could you re-run your benchmarks on this so we have a comparison on the same hardware as the previous numbers?

Of course, here are the results:

| Benchmark | Time [s] |
| --- | --- |
| Single | 0.694 |
| Short | 0.391 |
| Medium | 0.696 |
| Long | 0.376 |
| Full | 0.003 |

The new version is slightly slower, but in the same range. The Medium benchmark differs the most; I would guess because of the overhead of the complete forward pass.

**larsk21** (Contributor, Author) commented Mar 10, 2022

Currently the fuzzy matching algorithm is used in two places: the workspace symbols request and the (id, dot, option, ...) completion requests. The workspace symbols request tries to match all available declarations, while the completion request keeps the previous step-wise matching. This benefits performance and, I would say, also usability. (Imagine using a naming scheme similar to an imported namespace, resulting in a lot of completion items.)

**larsk21** (Contributor, Author) commented Mar 10, 2022

The threshold for displaying matched items is currently set to 0.2; we probably have to wait and see how this works in practice before changing it, if necessary.

**Kha** (Member) commented Mar 10, 2022

> The Medium benchmark differs the most; I would guess because of the overhead of the complete forward pass.

Makes sense. I suppose there is also the option of mixing the two solutions - a recursive backward search that uses the array for memoization. Not sure if it's worth investigating that direction.
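As a sketch, such a hybrid could use the same recurrence as the forward table above, but evaluated backward on demand, with a flat array as memo table (a hand-written illustration with made-up names; none marks a cache slot that has not been computed yet):

```lean
/-- Backward recursion over (patternIdx, wordIdx), memoized in a flat array.
    Scoring is reduced to counting matched characters. -/
partial def goMemo (p w : Array Char) (pi wi : Nat)
    (cache : Array (Option (Option Nat))) :
    Option Nat × Array (Option (Option Nat)) :=
  let idx := pi * (w.size + 1) + wi
  match cache.get! idx with
  | some r => (r, cache)  -- memo hit: skip the whole subtree
  | none =>
    let (r, cache) :=
      if pi == 0 then (some 0, cache)     -- pattern exhausted: match
      else if wi == 0 then (none, cache)  -- word exhausted: no match
      else
        let (skip, cache) := goMemo p w pi (wi - 1) cache
        let (mtch, cache) :=
          if p.get! (pi - 1) == w.get! (wi - 1) then
            let (r, cache) := goMemo p w (pi - 1) (wi - 1) cache
            (r.map (· + 1), cache)
          else (none, cache)
        let best := match skip, mtch with
          | some a, some b => some (max a b)
          | some a, none   => some a
          | none,   some b => some b
          | none,   none   => none
        (best, cache)
    (r, cache.set! idx (some r))

def fuzzyMemo (pattern word : String) : Option Nat :=
  let p := pattern.toLower.toList.toArray
  let w := word.toLower.toList.toArray
  (goMemo p w p.size w.size (mkArray ((p.size + 1) * (w.size + 1)) none)).1
```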

**larsk21** (Contributor, Author) commented Mar 10, 2022

> I suppose there is also the option of mixing the two solutions - a recursive backward search that uses the array for memoization.

I also thought of this, but haven't had time to try it yet.

larsk21 marked this pull request as ready for review on March 10, 2022, 14:18
Two resolved review threads on src/Lean/Data/FuzzyMatching.lean.
An inline review thread on the following code:

```lean
    return true
  let mut aIt := a.mkIterator
  for i in [:b.bsize] do
    if aIt.curr.toLower == (b.get i).toLower then
```
Member commented:

Note that b.get i returns arbitrary (= 'a') if i is not at a UTF-8 char boundary, so this should reject any Unicode queries. Which is fine I think, and we definitely don't want to pay any extra overhead for a use case we don't even know is relevant at all in practice. In the future, we should introduce a new primitive String.getByte : String.Pos -> UInt8, which would allow us to further optimize the ASCII case and at least accept Unicode queries, even if the results might not be quite as expected.
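To illustrate the byte-level direction (String.getByte does not exist yet; String.toUTF8, which copies the string, stands in purely for demonstration):

```lean
/-- ASCII case folding on a raw byte: 'A'..'Z' → 'a'..'z'. -/
def asciiToLower (b : UInt8) : UInt8 :=
  if 65 ≤ b && b ≤ 90 then b + 32 else b

/-- Does `w` contain the ASCII character `needle`, case-insensitively?
    A byte-level stand-in for the iterator-based loop under review. -/
def containsAsciiChar (needle : Char) (w : String) : Bool := Id.run do
  let bytes := w.toUTF8
  let n := asciiToLower needle.toNat.toUInt8
  for i in [0:bytes.size] do
    if asciiToLower (bytes.get! i) == n then
      return true
  return false
```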

Contributor (Author) replied:

Since Iterator.curr uses get as well, doesn't this mean that any characters outside a UTF-8 char boundary are effectively equal to 'a', 'A', and any other characters outside the boundaries here?

Member replied:

Yes, but note that next only visits char boundaries :)

Contributor (Author) replied:

So next would have to reject the move when the target is not a valid UTF-8 character? But this code increments the position by 1 for non-UTF-8 characters.

```c
/* invalid UTF-8 encoded string */
return lean_box(i+1);
```

(from lean_string_utf8_next)

A resolved review thread on src/Lean/Server/Completion.lean.
**larsk21** (Contributor, Author) commented Mar 11, 2022

I know you already approved the PR, but I just tried to use the combination of backward search and array memoization. It is slightly slower for short patterns, but provides a speedup for longer patterns (compared to the current array implementation).

see fuzzy-matching-cache

Benchmark results:

| Benchmark | Time [s] |
| --- | --- |
| Single | 0.742 |
| Short | 0.399 |
| Medium | 0.423 |
| Long | 0.280 |
| Full | 0.140 |

**Kha** (Member) commented Mar 11, 2022

Interesting. I was actually thinking about this approach as well just now - if we think of the recursion as paths through the array from the bottom right corner to the top left, the only "join points" of paths, i.e. where we should memoize, are the coordinates where needle and haystack agree, no? So could we elide loading and storing the cache at all other coordinates?

**larsk21** (Contributor, Author) commented Mar 11, 2022

If we start in the same cell, the two paths [miss, match] and [match, miss] will reach the same cell (both consuming two word characters and one pattern character), right? Wouldn't this mean that most of the cells reached by the recursive backward search can be reached via different paths? The only thing filtering paths is allowMatch, but I'm not sure whether we can know beforehand, based on this, if we need to store the result in the array.

**larsk21** (Contributor, Author) commented Mar 11, 2022

I tried this just out of curiosity, but using a HashMap with the Nat indices as keys is slower (~0.6s for the Medium benchmark, ~1.2s for the Single benchmark).

@leodemoura leodemoura merged commit e430496 into leanprover:master Mar 12, 2022