
fix: improve fuzzy-matching heuristics #1710

Merged: 3 commits merged into master from fuzzy-fix on Oct 14, 2022

Conversation

@rish987 (Contributor) commented Oct 9, 2022

Closes #1546. There were no bugs with the current implementation as far as I could tell, but I noticed three areas that could be improved here:

  1. Penalties should really only apply to the beginning of consecutive runs, and they shouldn't accumulate across namespaces. For example, if we had Lean.AVeryLongNamespace.AnotherVeryLongNamespace.SMap every miss would incur an additional penalty that would overwhelmingly negate the merit of a perfect match with the main identifier and give it no chance to show up in the top 100 results. So, rather than accumulating penalties for every miss, we apply a penalty once at the start of every consecutive run to indicate how "awkward" of a place to start a match this is. This awkwardness factor increases within a namespace but is reset at the beginning of every new namespace. However, we still do add a constant amount of penalty for every namespace so that we prefer less nested namespaces.
  2. Matches in the main identifier (after all of the namespaces) should get a bonus.
  3. Consecutive runs should be valued a lot more. What I've done is make a bonus for every match that is proportional to the length of the current run. So longer matches will get exponentially higher scores than ones that are more split up, which I think is better than a constant difference that could more easily have been negated.

LMK what you think of the above heuristics. After some (very brief) testing it looks like it's working great, but if someone could test-drive this and give me feedback it would be much appreciated!
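To make the three heuristics concrete, here is a minimal greedy sketch in Python rather than the PR's Lean. It is not the actual implementation (which is a dynamic program); the function name `fuzzy_score`, the flat per-namespace penalty of 2, and the main-identifier bonus of 3 are all invented for illustration.

```python
def fuzzy_score(pattern, word):
    """Greedy sketch of the three heuristics; returns None if `pattern`
    is not a (case-insensitive) subsequence of `word`."""
    last_dot = word.rfind(".")  # everything after this is the main identifier
    score, run_length, awkwardness, p = 0, 0, 0, 0
    for w, ch in enumerate(word):
        if ch == ".":
            awkwardness = 0   # (1) awkwardness resets at a namespace boundary...
            score -= 2        # ...but each extra namespace costs a flat penalty
            run_length = 0
            continue
        if p < len(pattern) and ch.lower() == pattern[p].lower():
            if run_length == 0:
                score -= awkwardness  # (1) pay the penalty only at a run's start
            run_length += 1
            score += run_length       # (3) bonus grows with the current run
            if w > last_dot:
                score += 3            # (2) bonus for matching the main identifier
            p += 1
        else:
            awkwardness += 1
            run_length = 0
    return score if p == len(pattern) else None
```

With made-up constants like these, `Lean.Map` outranks `Lean.Data.SMap` for the pattern `map`, and an unbroken match outranks the same match split across the word.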

@github-actions bot commented Oct 9, 2022

Thanks for your contribution! Please make sure to follow our Commit Convention.

@rish987 rish987 force-pushed the fuzzy-fix branch 2 times, most recently from 4684119 to 495d5c6 Compare October 11, 2022 02:12
@leodemoura leodemoura requested a review from Kha October 13, 2022 01:43
Review thread on src/Lean/Data/FuzzyMatching.lean (outdated):

-- TODO: the following code is assuming all characters are ASCII
for patternIdx in [:pattern.length] do
let patternComplete := patternIdx == pattern.length - 1

-- for the IH, it's only necessary to populate a range of length `word.length - pattern.length` at each index
Member:

What's the IH?

rish987 (Contributor Author):

Inductive hypothesis for showing the correctness of the dynamic program -- I've rephrased it.

@rish987 rish987 force-pushed the fuzzy-fix branch 3 times, most recently from ee66ff6 to ba51b46 Compare October 13, 2022 15:03
@rish987 rish987 changed the title fix(lsp): improve fuzzy-matching logic for getting workspace symbols fix: improve fuzzy-matching heuristics Oct 13, 2022
@Kha (Member) commented Oct 13, 2022

@rish987 Could you rebase against master for benchmarking? Locally it looks like there is some overhead of up to 25%, unfortunately.

13:0: 
160678 decls
104155 matches

13:0: 
interpretation of _eval took 1.46s

14:0: 
160678 decls
143691 matches

14:0: 
interpretation of _eval took 3.18s

15:0: 
160678 decls
9890 matches

15:0: 
interpretation of _eval took 1.05s

16:0: 
160678 decls
334 matches

16:0: 
interpretation of _eval took 546ms

@Kha (Member) commented Oct 13, 2022

!bench

@leanprover-bot (Collaborator) commented:

Here are the benchmark results for commit 2bbdd62.
There were significant changes against commit 828aea4:

  Benchmark          Metric         Change
  ===================================================
- workspaceSymbols   branches          16% (2831.8 σ)
- workspaceSymbols   instructions      18% (4354.7 σ)
- workspaceSymbols   task-clock        21%   (72.0 σ)
- workspaceSymbols   wall-clock        21%   (71.7 σ)

@Kha (Member) commented Oct 14, 2022

On the other hand, this is still fast enough that we can test-drive it and then think about further behavior/performance refinements. The improvements should be worth it.

@Kha Kha merged commit 76c4693 into leanprover:master Oct 14, 2022
@rish987 (Contributor Author) commented Oct 14, 2022

At first glance I'm a bit puzzled that it's slower, since I actually made the result table that we have to populate with scores smaller, so there might be some expensive operation somewhere that I've introduced. Or maybe it's just the overhead of the new runLengths table that we're populating in parallel to result? I'll investigate later today.
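For readers unfamiliar with the table layout being discussed, here is a rough Python sketch of a score table populated in parallel with a run-length table. It models only the run-length bonus (heuristic 3), not the actual Lean code; `dp_fuzzy` and all its details are hypothetical, and it exists only to show why a second table means extra work per cell.

```python
NEG = float("-inf")

def dp_fuzzy(pattern, word):
    """DP sketch: after row i, result[j] is the best score matching
    pattern[:i+1] with a match ending exactly at word[j]."""
    P, W = len(pattern), len(word)
    if P == 0 or P > W:
        return None
    result = [NEG] * W  # score table
    runs = [0] * W      # parallel table: run length of the match ending at word[j]
    for i, pc in enumerate(pattern):
        new_result, new_runs = [NEG] * W, [0] * W
        best_prev = 0 if i == 0 else NEG  # best score of pattern[:i] ending before j
        for j in range(W):
            if pc.lower() == word[j].lower():
                # Option 1: extend the run that ended at (i-1, j-1).
                if i > 0 and j > 0 and result[j - 1] != NEG:
                    ext_run = runs[j - 1] + 1
                    ext_score = result[j - 1] + ext_run  # bonus grows with the run
                else:
                    ext_score, ext_run = NEG, 0
                # Option 2: start a fresh run of length 1.
                fresh = best_prev + 1 if best_prev != NEG else NEG
                if ext_score != NEG and ext_score >= fresh:
                    new_result[j], new_runs[j] = ext_score, ext_run
                elif fresh != NEG:
                    new_result[j], new_runs[j] = fresh, 1
            if i > 0:
                best_prev = max(best_prev, result[j])
        result, runs = new_result, new_runs
    best = max(result)
    return None if best == NEG else best
```

A real implementation can also restrict `j` to the indices at which a complete match is still possible (the range-of-length-`word.length - pattern.length` point from the review thread); that is omitted here for brevity.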


Successfully merging this pull request may close these issues.

Unexpected workspace symbol search results