New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Revising ranking algorithm #638

Closed
junegunn opened this Issue Aug 19, 2016 · 22 comments

Comments

7 participants
@junegunn
Copy link
Owner

junegunn commented Aug 19, 2016

There is a long-standing FIXME entry in the code

// 0. (FIXME) How to find the shortest match?

// 0. (FIXME) How to find the shortest match?
//    a_____b__c__abc
//    ^^^^^^^^^^  ^^^

That is fzf stops immediately when the first match is found. This gives nice O(n) time complexity but it's not guaranteed to find the best match as shown in the above example. Most users use fzf for finding files, and it's more likely what they're looking for is at the latter part of the string, and we're missing the opportunity to give proper scores to better matches.

Algorithms that find globally best match are known to have O(nm) time complexity where n is the length of the string and m is the length of the pattern. Though it's definitely more involved than the current method, it's not so bad considering that m is expected to be small in this application. Since fzf query responses almost instantly for usual input sizes (a few hundreds of ks), it may make sense to put more effort to give better results.

Clarification: We can still get the best match in O(n) time unless we change our definition of "best"; the substring with the shortest length. Instead of breaking out of the loop as soon as we find the first match, we just keep scanning till the end of the string. O(nm) algorithms are needed only if we want to introduce more sophisticated scoring mechanism.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Aug 20, 2016

I once hacked a small and simple source for deoplete. Due to my exotic requirements, I had to filter results by my own. I think I had the same challenge there: I wanted shortest matches to be displayed first. I didn't code the algorithm myself, I rather used this library: https://bitbucket.org/mrabarnett/mrab-regex.
Specifically, this construction (quote from the docs): (?:foo){i} match "foo", permitting insertions.
This gave me the best results, now the completer puts first exactly those items which I expect. I'm not sure about the complexity of the underlying algorithm, but considering it gives very nice results, it may be worth taking a look at.

As for the current matching algorithm in fzf, I may be biased, I'm more than happy with it.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Aug 31, 2016

FYI: https://github.com/jhawthorn/fzy/blob/master/ALGORITHM.md
Used by: https://github.com/cloudhead/neovim-fuzzy

With my love to fzf, people say sad things :(

If fzy supports streaming I can finally move away from fzf. (jhawthorn/fzy#17)
I'd like to try out fzy as a replacement with fzf for my workflows because of the better ranking (jhawthorn/fzy#22)

reddit discussion: https://www.reddit.com/r/neovim/comments/4zly16/neovimfuzzy_proper_fzf_integration_for_neovim_not/

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Aug 31, 2016

@balta2ar Hey, thanks for the pointers. If they don't need the features and stability of fzf and prefer the result of fzy, well I guess, why not? I can't help but think that they will reconsider if they better understand the power of extended-search mode and smart-case search, but I don't really have to win the internet, so yeah.

What's clear though is that it would be much easier for fzf to steal good stuff from fzy than the other way around :)

fzy does implement interesting scoring mechanism compared to the other fuzzy finders e.g. selecta, pick, etc. And I can surely understand that some users prefer the way it works, especially when they want to search with acronyms. However, it's hard to argue that it gives the best result in all cases, for example consider the following input:

  • such_a_great_day
  • good/bad/worse/worst/ridiculous/pleasant/audition/triangle
  • grim/worst/ridiculous/pleasant/eater/triangle

Suppose I want to search for "great", and since it's a fuzzy finder, I type in "grat", "grt" or "gret" instead of "great", and here are the results from fzy:

> grat
good/bad/worse/worst/ridiculous/pleasant/audition/triangle
such_a_great_day
grim/worst/ridiculous/pleasant/eater/triangle

> grt
grim/worst/ridiculous/pleasant/eater/triangle
good/bad/worse/worst/ridiculous/pleasant/audition/triangle
such_a_great_day

> gret
grim/worst/ridiculous/pleasant/eater/triangle
such_a_great_day
good/bad/worse/worst/ridiculous/pleasant/audition/triangle

All three queries give different results, and they are not giving the best result for my intention, which is obviously "such_a_great_day". If the patterns I typed in were acronyms, then it's probable that fzy is giving the right answer, but it's not. Since it cannot read my mind ("is it an acronym or a fuzzy word?"), it simply guesses using arbitrarily chosen coefficients. The strategy may give the best of both worlds, or may not. It depends.

fzf, on the other hand, gives the consistent results for all three queries. One can argue that the scoring mechanism of fzf is too simplistic, but at least it's consistent and easier to understand. And I can almost always get the right answer with extended-search mode and smart-case search. Oh, I have another idea. How about "smarter-case search", where uppercase letters not only match themselves but also lowercase letters at word boundaries? We can use it to explicitly tell fzf to perform acronym search.

Having said all that, I do see the value of more sophisticated scoring mechanisms. Actually I implemented a prototype of Smith-Waterman algorithm on fzf just after I created this issue and was thinking of adding an option to choose it over the current algorithm (e.g. --algo=sw or --algo=classic, ...) It works, but I'm not confident about the result yet and haven't had time to clean things up and decide how to reconcile the new mechanism with the existing infrastructure, such as extended-search mode and --tiebreak.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Aug 31, 2016

they will reconsider if they better understand the power of extended-search mode and smart-case search

I used to run fzf with extended search, but then I found myself typing ' way too often, so I switched to exact mode. I think I need to give fuzzy search and smart case search another try.

While we are on this page, is it possible to override FZF_DEFAULT_OPTS=-e with fzf +e? If so, it's not immediately clear from this help line: https://github.com/junegunn/fzf/blob/master/src/options.go#L21. Maybe mention "+e" in parenthesis?

How about "smarter-case search", where uppercase letters not only match themselves but also lowercase letters at word boundaries?

I think something similar is implemented in YouCompleteMe. If your request is "gbo", then string "def GetBudgetObject(args" will be prioritised. Some ideas about that are discussed here: Valloric/YouCompleteMe#1757. When I was still using YCM, this strategy worked well for code completion, I noticed it to be helpful many times. Now I'm on deoplete and I'm not sure whether such strategy is there or not. Besides, fzf is not code completion tool. I can't tell in advance whether it's worthwhile or not without using it for a while in practice.

was thinking of adding an option to choose it over the current algorithm

Yes, I had this in my mind as well. Frankly speaking, I was also thinking about a shared library of such algorithms (instead of projects' "stealing" algorithms from each other), where any project could just reuse them as needed. But that obviously requires more effort. And I'm probably way too optimistic to consider it.

@TheZoq2

This comment has been minimized.

Copy link

TheZoq2 commented Sep 6, 2016

was thinking of adding an option to choose it over the current algorithm (e.g. --algo=sw or --algo=classic, ...)

I would love to see that implemented. I really like fzf from a usability standpoint but sometimes the program matches a blatantly wrong result.

For example, my main use of fzf is to cd into folders. I have a folder called Documents/elm and I want to cd into it by typing docelm. However, as can be seen in this screenshot, fzf scores a really long path that contains all those letters instead. of the short Documents/elm which im looking for. For my usecase, an algorithm which priorotises more matches per length would be more usefull.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Sep 6, 2016

My experience is similar to the one of @TheZoq2. Basically I switched to exact mode because it gives the same results in less keystrokes. Yes, fuzzy search can give the same result, but in that case I usually have to type more.

@khalidchawtany

This comment has been minimized.

Copy link

khalidchawtany commented Sep 6, 2016

@junegunn I have been using FZF for everything since its creation. I still use it for a lot. However, currently I prefer fzy for finding files. This is due to the fact that, when I open a file I usually (90% of the times) know where it is. Typing some of the first letters of the path to takes me to it.

Knowing that you have already stated that

What's clear though is that it would be much easier for fzf to steal good stuff from fzy than the other way around :)

Can we have both of the scoring methods and make it changeable through an argument --sorter : sorter_type. This way everybody will be happy :)

@mjwestcott

This comment has been minimized.

Copy link
Contributor

mjwestcott commented Sep 6, 2016

I agree with the sentiments expressed above. fzf is still my favoured tool. Its only weakness is the scoring algorithm.

@junegunn is right that there's no universally acknowledged solution to the scoring problem. My own use of fuzzy finding is based largely on word prefixes. So if the file I'm looking for is:

/Users/mattwestcott/repos/python/trains/models.py

I'm very likely to choose a pattern like

pytrains
pytramod
repotrmod

That is, I naturally go for consecutive characters following word boundaries. I'm not likely to choose

postrans
wetepo

or some other misspelling that is technically a subsequence.

My view is that fzf doesn't support this usage well enough. @TheZoq2 seems to use it the same way.

For some background: a few months ago I proposed a scoring algorithm that emphasises word boundaries and, in particular, gave a penalty to characters of the pattern that matched far from the start of the word.

This led to the current scoring system (via this commit) which was adapted to be more conservative. If I understand correctly, it gives a bonus if the first character of the pattern matches the start of a word, but doesn't penalise patterns that match in the middle or the end of a word. The total match length is the dominant feature of the algorithm.

I understand @junegunn's desire to keep it simple and choose a solution that everyone can understand. But perhaps, as others have suggested, we can start with a more sophisticated solution as an optional flag. The approach of fzy is certainly interesting and in my limited testing I like its results.

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Sep 6, 2016

Thanks for the suggestions. It's over 2AM here in Seoul, I'll take a closer look at your comments later.

Anyway as you guys really seem interested in the direction, I uploaded the alpha binaries with Smith-Waterman algorithm I mentioned above.

https://github.com/junegunn/fzf-bin/releases/tag/alpha

Try it and let me know how it goes for you, but don't expect it to be stable :) I've been pretty busy lately and haven't found time to really clean things up. I'll probably have more time next week, and hopefully wrap this up.

Also note that fzf is still a fuzzy finder even if we do this, not "acronym finder" so it will not rank a______________/b______________/c______________/d______________/e______________/ higher over axcxe on ace. But it may prefer a_______/c_______/e_______/. So that's the hard part, finding "the right" threshold that tends to give better answers in "most" cases. The decision should be made not arbitrarily but based on empirical study and statistics such as median length of words found in file names, etc.

@junegunn junegunn changed the title Find the best match Revising ranking algorithm Sep 6, 2016

@TheZoq2

This comment has been minimized.

Copy link

TheZoq2 commented Sep 6, 2016

I did some small tests with that alpha build and it seems to work just fine for my use case. It feels a bit slower though, especially for really large sets of files.

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Sep 7, 2016

@TheZoq2 Thanks for the feedback. I just updated the binaries with some optimizations, should be noticeably faster. I believe there is more room for improvement but nevertheless the new algorithm will be slower no matter what because what we're doing here is putting more effort to get better results. i.e. go over all occurrences of the pattern instead of stopping immediately after finding the first one.

@mjwestcott

This comment has been minimized.

Copy link
Contributor

mjwestcott commented Sep 7, 2016

Thanks @junegunn. Based on a few hours' use, the alpha version is awesome. Even before your optimisations, I found it usable on 1-2 million filenames. The results are definitely worth the cost.

I'll be interested to check out your implementation.

Also, highlighting only the characters that match is a big improvement.

@khalidchawtany

This comment has been minimized.

Copy link

khalidchawtany commented Sep 7, 2016

I really enjoy using this new feature. 👍
Please can we have a way to toggle the sorting algorithm (maybe using a flag)?

@deathmaz

This comment has been minimized.

Copy link

deathmaz commented Sep 7, 2016

Work's great!

@TheZoq2

This comment has been minimized.

Copy link

TheZoq2 commented Sep 7, 2016

I tried the latest alpha and it feels really good. It's still a bit slow when I run find ~ | fzf... but for my usecase of looking for only directories instead of files it feels instant.

A flag to select algorithm would defenitivley be usefull since it's obviously a bit slower but it works great for me.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Sep 8, 2016

Not really related to the ranking, but what I really love about ' is that it reverses the meaning depending on the mode. If you're in exact mode, 'substring becomes a fuzzy pattern, and vice versa. Awesome!

@junegunn junegunn referenced this issue Sep 11, 2016

Merged

Update algo.go #657

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Sep 11, 2016

To be clear, this issue deals with two orthogonal concerns.

  1. Scoring criteria. The way we estimate the relevancy of the match. With the suggested scheme, the match length is no longer the dominant factor. Change of scoring criteria doesn't affect the performance of fzf. What is difficult is to come up with an objective model to evaluate the effectiveness of the criteria. It is especially hard because the effectiveness usually depends on the context and fzf is a general-purpose filter that can be used with any type of entities.
  2. Finding the best occurrence of the pattern within a match. Instead of stopping immediately after finding the first occurrence, we examine the entire line to find the best one among all occurrences. In this case the performance overhead is unfortunately inevitable.

Performance

I tried really hard to squeeze the last drop of performance out of the new code with inelegant hacks like manually inlining function calls, reusing memory slices, etc, but the performance overhead becomes quite noticeable once the number of lines in the input exceeds a few hundreds of thousands.

There are a few factors that affect the overhead. Most notably, the length of the pattern string and the selectivity of the query and they are usually correlated to some degree in practice. i.e. the longer the query, the better the selectivity. The selectivity matters because the overhead only applies to the matching items.

For a query with extremely bad selectivity (say ./// for locate / output), the new algorithm is over twice as slow as before.

But for a query with pretty good selectivity, the overhead is dwarfed by the other parts of the processing and I could still get better performance than 0.13.3 which was before some optimizations made in #637.

Options

Two concerns, four possible configurations.

  • --sort-by=score|matchlen
  • --algo=v1|v2

I don't feel strongly about --sort-by or any advanced configuration parameters for scoring criteria. I'd like to keep things simple and try to focus on improving the default.

I was also hoping to replace the current algorithm with the new one and be done with it without extra --algo, once there is a general consensus that the quality of the result is worth the cost and the performance overhead becomes insignificant. But the latter is not the case and I can imagine that there are cases where performance does matter, for example locate / | fzf. So we have no choice but to add --algo? Or maybe I'm worrying too much? Should I just ship it and decide after hearing what users say? Maybe there are few users who actually use fzf with millions of lines.

@balta2ar

This comment has been minimized.

Copy link

balta2ar commented Sep 11, 2016

I'd definitely would like fzf to keep the good old exact early-exit matching algorithm as you've just described my frequent use case: locate / and my locate has 4.1M entries.

As for the old version of fuzzy matching, I step aside as I couldn't get used to it — exact matching has always worked better for me. So I'm not sure I personally need the old fuzzy version.

In vim, though, I think it would be nice to have an option to override shell's FZF_DEFAULT_OPTIONS because I hope to start using this smarter algorithm and I would like to enable it in vim settings explicitly (but to have it off for the rest of the system). I couldn't find an option to do that in fzf.vim's docs. Is it there?

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Sep 12, 2016

@balta2ar

I'd definitely would like fzf to keep the good old exact early-exit matching algorithm as you've just described my frequent use case: locate / and my locate has 4.1M entries.

Since you use exact matcher, I suggest that you directly pass the pattern to locate command (locate foobar) to lighten the load of fzf. The performance overhead we're discussing here mainly applies to fuzzy matcher. I also updated the exact matcher alongside to find the best occurrence with higher score, but the overhead is hardly significant in that case.

And yes, Vim allows you to override environment variables in vimrc. let $FZF_DEFAULT_OPTS = '...'

@TheZoq2

This comment has been minimized.

Copy link

TheZoq2 commented Sep 13, 2016

I would say that leaving the old algorithm as an option is a good idea since the new one is a bit slower for large files.

@junegunn

This comment has been minimized.

Copy link
Owner

junegunn commented Sep 18, 2016

I just released 0.15.0 with --algo=[v1|v2] option where v2 is the default. You can use v1 if the performance is unacceptable. Please try it and let me know of any problems – bugs, non-intuitive result, etc – you run into.

The details of the new algorithm can be found here:
https://github.com/junegunn/fzf/blob/0.15.0/src/algo/algo.go

@junegunn junegunn closed this Sep 18, 2016

@junegunn junegunn referenced this issue Oct 26, 2016

Closed

About sort problem in search results #719

4 of 14 tasks complete

@ahmedre ahmedre referenced this issue Nov 15, 2016

Closed

Customizing the scoring algorithm #737

3 of 15 tasks complete

@junegunn junegunn referenced this issue May 24, 2017

Closed

length tiebreak not working as expected #932

3 of 15 tasks complete
@simnalamburt

This comment has been minimized.

Copy link

simnalamburt commented Jan 11, 2018

With my love to fzf, people say sad things :(

If fzy supports streaming I can finally move away from fzf. (jhawthorn/fzy#17)

@balta2ar Please don't get me wrong. I used fzy because fzf did not support windows back then. 😊

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment