ordering of search results is affected by Max Results #89

Open
ijt opened this issue Jul 3, 2019 · 4 comments
Comments

Contributor

ijt commented Jul 3, 2019

Increasing the max results can affect the ordering of the search results. Here is an example.

[Screenshot: Screen Shot 2019-07-03 at 12 02 19]

Having stable ordering of search results would be a useful property and less surprising for users.

Contributor Author

ijt commented Jul 3, 2019

I would be willing to work on this.

Contributor

hanwen commented Jul 3, 2019

How would you do it?

Basically, the search engine shows the best result on top. If you search over a larger corpus, you can find better matches, which displace other results and change the ordering.

Contributor Author

ijt commented Jul 4, 2019

One possibility would be to present the results in the order they occur within the posting lists. Items in posting lists could have an additional weight field corresponding to their estimated general relevance. The posting lists could be sorted according to the weights, and re-sorted as necessary.

That would degrade the ordering though. The question needs some more thought.
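
A minimal sketch of that idea, to make the trade-off concrete. The `Posting` type, its `Weight` field, and both helper functions are hypothetical and are not taken from the project's code. Because results are read off in list order, the first N results are always a prefix of the first M results for M > N, which is the stable property. The cost is that `Weight` does not depend on the query, which is why the ordering degrades.

```go
package main

import (
	"fmt"
	"sort"
)

// Posting is a hypothetical posting-list entry: a document ID plus an
// estimated, query-independent relevance weight.
type Posting struct {
	DocID  uint32
	Weight float64
}

// sortByWeight orders a posting list by descending weight, breaking ties
// by ascending document ID so that the order is deterministic.
func sortByWeight(list []Posting) {
	sort.SliceStable(list, func(i, j int) bool {
		if list[i].Weight != list[j].Weight {
			return list[i].Weight > list[j].Weight
		}
		return list[i].DocID < list[j].DocID
	})
}

// firstN returns the first n postings in list order (or all of them if
// the list is shorter than n).
func firstN(list []Posting, n int) []Posting {
	if n > len(list) {
		n = len(list)
	}
	return list[:n]
}

func main() {
	list := []Posting{{1, 0.2}, {2, 0.9}, {3, 0.5}, {4, 0.9}}
	sortByWeight(list)

	// The smaller result set is a prefix of the larger one.
	fmt.Println(firstN(list, 2))
	fmt.Println(firstN(list, 4))
}
```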

Contributor

hanwen commented Jul 4, 2019

@ijt - If you mean "index shard" when you say "posting list", this is exactly how it works already.

Within a shard, files are ordered by importance (important files first), so, e.g., all other things being equal, you get matches from non-test files before test files.

Then the shards themselves are ordered by a "quality" score, which is mainly driven by the GitHub star count. So matches in github.com/google/guava get preference over matches in android.googlesource.com/platform/external/guava, even though the content is the same.

The problem is that the matches themselves also vary in quality. If you are looking for "idiot", then the word "idiot" in an unimportant shard is a better match than the identifier "bidiOther" in an important shard.

If you increase the result count to include the unimportant shard, inevitably, this will upset the ordering.
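
The effect can be reproduced with a toy model. This is a simplified sketch, not the project's actual search code: the `Match` and `Shard` types, the `search` function, and the scores are all made up for illustration. It assumes shards are visited in priority order, collection stops at the result limit, and the collected matches are then ranked by match quality.

```go
package main

import (
	"fmt"
	"sort"
)

// Match is a hypothetical search hit with a match-quality score.
type Match struct {
	Shard string
	Text  string
	Score float64
}

// Shard holds the hits a query would produce in one index shard.
type Shard struct {
	Name    string
	Matches []Match
}

// search visits shards in the given (priority) order, stops once
// maxResults hits are collected, and then ranks the collected hits by
// descending score.
func search(shards []Shard, maxResults int) []Match {
	var out []Match
collect:
	for _, s := range shards {
		for _, m := range s.Matches {
			if len(out) >= maxResults {
				break collect
			}
			out = append(out, m)
		}
	}
	sort.SliceStable(out, func(i, j int) bool {
		return out[i].Score > out[j].Score
	})
	return out
}

func main() {
	// Shards are listed most important first; the scores are invented.
	shards := []Shard{
		{Name: "important", Matches: []Match{{"important", "bidiOther", 0.3}}},
		{Name: "unimportant", Matches: []Match{{"unimportant", "idiot", 0.9}}},
	}

	// With a limit of 1, only the important shard is reached.
	fmt.Println(search(shards, 1)[0].Text)
	// With a limit of 2, the better match from the unimportant shard
	// is found and moves to the top.
	fmt.Println(search(shards, 2)[0].Text)
}
```

In this model the top result is "bidiOther" with a limit of 1 and "idiot" with a limit of 2, so raising the limit changes which result comes first.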

One way out of this is to have a cheaper way to find quality matches. For example, currently we have a section for file names and a section for file contents. If you add a separate corpus section for symbols (which would be smaller than file contents), you could search more shards with the same amount of CPU.
