
fix: improve fuzzy-matching heuristics #1710

Merged: 3 commits merged into master from fuzzy-fix on Oct 14, 2022

Conversation

@rish987 (Contributor) commented Oct 9, 2022

Closes #1546. There were no bugs with the current implementation as far as I could tell, but I noticed three areas that could be improved here:

  1. Penalties should really only apply to the beginning of consecutive runs, and they shouldn't accumulate across namespaces. For example, if we had Lean.AVeryLongNamespace.AnotherVeryLongNamespace.SMap every miss would incur an additional penalty that would overwhelmingly negate the merit of a perfect match with the main identifier and give it no chance to show up in the top 100 results. So, rather than accumulating penalties for every miss, we apply a penalty once at the start of every consecutive run to indicate how "awkward" of a place to start a match this is. This awkwardness factor increases within a namespace but is reset at the beginning of every new namespace. However, we still do add a constant amount of penalty for every namespace so that we prefer less nested namespaces.
  2. Matches in the main identifier (after all of the namespaces) should get a bonus.
  3. Consecutive runs should be valued a lot more. What I've done is make a bonus for every match that is proportional to the length of the current run. So longer matches will get exponentially higher scores than ones that are more split up, which I think is better than a constant difference that could more easily have been negated.

LMK what you think of the above heuristics. After some (very brief) testing it looks like it's working great, but if someone could test-drive this and give me feedback it would be much appreciated!
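To make the three heuristics concrete, here is a minimal greedy sketch in Python rather than the PR's Lean. It is not the actual implementation (which is a dynamic program); the function name `fuzzy_score`, the flat per-namespace penalty of 2, and the main-identifier bonus of 3 are all invented for illustration.

```python
def fuzzy_score(pattern, word):
    """Greedy sketch of the three heuristics; returns None if `pattern`
    is not a (case-insensitive) subsequence of `word`."""
    last_dot = word.rfind(".")  # everything after this is the main identifier
    score, run_length, awkwardness, p = 0, 0, 0, 0
    for w, ch in enumerate(word):
        if ch == ".":
            awkwardness = 0   # (1) awkwardness resets at a namespace boundary...
            score -= 2        # ...but each extra namespace costs a flat penalty
            run_length = 0
            continue
        if p < len(pattern) and ch.lower() == pattern[p].lower():
            if run_length == 0:
                score -= awkwardness  # (1) pay the penalty only at a run's start
            run_length += 1
            score += run_length       # (3) bonus grows with the current run
            if w > last_dot:
                score += 3            # (2) bonus for matching the main identifier
            p += 1
        else:
            awkwardness += 1
            run_length = 0
    return score if p == len(pattern) else None
```

With made-up constants like these, `Lean.Map` outranks `Lean.Data.SMap` for the pattern `map`, and an unbroken match outranks the same match split across the word.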

@github-actions bot commented Oct 9, 2022

Thanks for your contribution! Please make sure to follow our Commit Convention.

@rish987 rish987 force-pushed the fuzzy-fix branch 2 times, most recently from 4684119 to 495d5c6 Compare October 11, 2022 02:12
@leodemoura leodemoura requested a review from Kha October 13, 2022 01:43
Review thread on src/Lean/Data/FuzzyMatching.lean (outdated):

-- TODO: the following code is assuming all characters are ASCII
for patternIdx in [:pattern.length] do
let patternComplete := patternIdx == pattern.length - 1

-- for the IH, it's only necessary to populate a range of length `word.length - pattern.length` at each index
Member:

What's the IH?

rish987 (Contributor Author):

Inductive hypothesis for showing the correctness of the dynamic program -- I've rephrased it.

@rish987 rish987 force-pushed the fuzzy-fix branch 3 times, most recently from ee66ff6 to ba51b46 Compare October 13, 2022 15:03
@rish987 rish987 changed the title fix(lsp): improve fuzzy-matching logic for getting workspace symbols fix: improve fuzzy-matching heuristics Oct 13, 2022
@Kha (Member) commented Oct 13, 2022

@rish987 Could you rebase against master for benchmarking? Locally it looks like there is some overhead of up to 25%, unfortunately.

13:0: 
160678 decls
104155 matches

13:0: 
interpretation of _eval took 1.46s

14:0: 
160678 decls
143691 matches

14:0: 
interpretation of _eval took 3.18s

15:0: 
160678 decls
9890 matches

15:0: 
interpretation of _eval took 1.05s

16:0: 
160678 decls
334 matches

16:0: 
interpretation of _eval took 546ms

@Kha (Member) commented Oct 13, 2022

!bench

@leanprover-bot (Collaborator) commented:

Here are the benchmark results for commit 2bbdd62.
There were significant changes against commit 828aea4:

  Benchmark          Metric         Change
  ===================================================
- workspaceSymbols   branches          16% (2831.8 σ)
- workspaceSymbols   instructions      18% (4354.7 σ)
- workspaceSymbols   task-clock        21%   (72.0 σ)
- workspaceSymbols   wall-clock        21%   (71.7 σ)

@Kha (Member) commented Oct 14, 2022

On the other hand, this is still fast enough that we can test-drive it and then think about further behavior/performance refinements. The improvements should be worth it.

@Kha Kha merged commit 76c4693 into leanprover:master Oct 14, 2022
@rish987 (Contributor Author) commented Oct 14, 2022

At first glance I'm a bit puzzled that it's slower, since I actually made the result table that we have to populate with scores smaller, so there might be some expensive operation somewhere that I've introduced. Or maybe it's just the overhead of the new runLengths table that we're populating in parallel to result? I'll investigate later today.
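For readers unfamiliar with the table layout being discussed, here is a rough Python sketch of a score table populated in parallel with a run-length table. It models only the run-length bonus (heuristic 3), not the actual Lean code; `dp_fuzzy` and all its details are hypothetical, and it exists only to show why a second table means extra work per cell.

```python
NEG = float("-inf")

def dp_fuzzy(pattern, word):
    """DP sketch: after row i, result[j] is the best score matching
    pattern[:i+1] with a match ending exactly at word[j]."""
    P, W = len(pattern), len(word)
    if P == 0 or P > W:
        return None
    result = [NEG] * W  # score table
    runs = [0] * W      # parallel table: run length of the match ending at word[j]
    for i, pc in enumerate(pattern):
        new_result, new_runs = [NEG] * W, [0] * W
        best_prev = 0 if i == 0 else NEG  # best score of pattern[:i] ending before j
        for j in range(W):
            if pc.lower() == word[j].lower():
                # Option 1: extend the run that ended at (i-1, j-1).
                if i > 0 and j > 0 and result[j - 1] != NEG:
                    ext_run = runs[j - 1] + 1
                    ext_score = result[j - 1] + ext_run  # bonus grows with the run
                else:
                    ext_score, ext_run = NEG, 0
                # Option 2: start a fresh run of length 1.
                fresh = best_prev + 1 if best_prev != NEG else NEG
                if ext_score != NEG and ext_score >= fresh:
                    new_result[j], new_runs[j] = ext_score, ext_run
                elif fresh != NEG:
                    new_result[j], new_runs[j] = fresh, 1
            if i > 0:
                best_prev = max(best_prev, result[j])
        result, runs = new_result, new_runs
    best = max(result)
    return None if best == NEG else best
```

A real implementation can also restrict `j` to the indices at which a complete match is still possible (the range-of-length-`word.length - pattern.length` point from the review thread); that is omitted here for brevity.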


Successfully merging this pull request may close these issues.

Unexpected workspace symbol search results