
WIP: fuzzy autocomplete #1268

Draft · wants to merge 5 commits into master
Conversation

@missinglink (Member) commented Mar 26, 2019

This PR is WIP and I wouldn't suggest merging it as-is; it's open for testing, discussion and feedback.

One confusing and under-documented feature of the Elasticsearch match query is that it supports type: phrase and also fuzziness: *, but not both at the same time 🤷‍♀️

There are unfortunately no warnings about this, which is what's contributing to the confusion; the same may be true for cutoff_frequency and phrase.

So this PR simply rewrites the top MUST query for most queries generated for autocomplete to remove phrase and enable fuzziness.

I've chosen conservative values for fuzziness in order to avoid increasing the CPU usage significantly. A fuzziness of 1 means the Levenshtein edit distance is 1, so a single character can be added, removed or replaced.

The prefix_length is set to 1, meaning the first character is not considered for edits. The default is 0, which would generate far more permutations and require far more CPU; the tradeoff is that if someone mistypes the first letter then they're doomed from the get-go.

Lastly, max_expansions is set to 10. I'm not 100% sure what the correct setting should be, so I again chose something conservative; my understanding is that a maximum of 10 tokens from the index will be used to generate an 'OR' type condition.

From what I've read online, the vast majority of typos are within Levenshtein distance 1 and rarely involve the first character, so this should catch a wide range of typing errors without increasing the CPU requirements too much.
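For illustration, the rewritten MUST clause ends up looking roughly like this (a sketch only; the field and analyzer names are taken from the existing autocomplete query, and the input text is made up):

{
  "match": {
    "name.default": {
      "analyzer": "peliasQueryPartialToken",
      "query": "bordeax",
      "fuzziness": 1,
      "prefix_length": 1,
      "max_expansions": 10
    }
  }
}

With these settings bordeax can still match bordeaux: the missing u is a single edit, and it isn't in the first character.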

Things I would like to change before merging:

  • Clean up the queries, they are becoming confusing and we might be able to simplify them
  • The other phrase queries now have 'fuzziness': 1 which, as explained above, does nothing and only serves to confuse.
  • Test the performance implications of running this on every keypress for a global index.
  • Consider the implications of not using phrase and whether it and slop are even required here.
  • Move autocomplete_defaults values out of the phrase:* namespace, maybe into a fuzzy:* namespace? A sketch follows this list.
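A minimal sketch of that fuzzy:* namespace in autocomplete_defaults, assuming the same conservative values described above (the exact key names are still up for discussion):

'fuzzy:fuzziness': 1,
'fuzzy:prefix_length': 1,
'fuzzy:max_expansions': 10,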

@mihneadb (Contributor)

@missinglink it might be useful to expose fuzziness as a parameter to autocomplete. E.g. maybe someone is OK with more CPU usage for correcting more input strings?

@missinglink (Member, Author)

Hi @mihneadb, I think it's better to leave it as configurable on startup.

Having it variable per-request could cause quality-of-service issues for cloud providers like ourselves as there would be huge variability in the CPU usage of each request. I think it's important that operations engineers can control this setting.

You can adjust the settings for your docker installation without generating a new docker image by copying the autocomplete_defaults file to your local machine, editing it and then bind-mounting your local copy as a replacement for the file in the container.

If you're not familiar with docker-compose you can look in the docker-compose.yml file and you'll see that we bind-mount the local file pelias.json and make it available in the container using the same method.

You'll then need to bring the API service down and up again to apply that configuration.
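A minimal sketch of what that looks like (the container path for autocomplete_defaults is an assumption on my part; copy the exact paths from the docker-compose.yml in your setup):

services:
  api:
    image: pelias/api
    volumes:
      - "./pelias.json:/code/pelias.json"
      - "./autocomplete_defaults.js:/code/pelias/api/query/autocomplete_defaults.js"

Then bring the service down and up again:

docker-compose stop api
docker-compose up -d api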

@mihneadb (Contributor)

Makes sense, thanks!

@Joxit (Member) commented Apr 17, 2019

Hi @missinglink,

I'm testing this PR, and I found something regarding the prefix_length parameter.
I'm trying to find this address with autocomplete: 40 Rue De L'arsenal, Bordeaux, France.
When I type 40 Rue De L Arsenal Bordeaux (without the '), 40 Rue De Arsenal, Bordeaux (without L'), or 40 Rue Arsenal, Bordeaux (without De L'), I get no results with prefix_length = 1 (but it works with prefix_length = 0).
It is common in French to omit determiners, and it can be a bit strange when we get no results.

We may need some benchmarks to weigh the pros and cons 🤔

@missinglink (Member, Author) commented Apr 17, 2019

Thanks @Joxit, interesting feedback, that's certainly unexpected behaviour.

@orangejulius (Member)

So I tested this out a bit yesterday, and it's quite good! I do think max_expansions needs to be a bit higher than 10, since that's not enough to even support all the possible replacements of a single character (there would be 36 if we are assuming only [a-z0-9]).

@Joxit your example makes sense. With prefix_length = 0, changing or inserting the first character is supported, so larsenal would be one of the generated possibilities when you type arsenal. However, with prefix_length = 1, it would not be. The underlying token for L'arsenal is in fact larsenal, and indeed the following queries don't currently work:

/v1/autocomplete?text=40 Rue De l arsenal, Bordeaux, France
/v1/autocomplete?text=40 Rue De arsenal, Bordeaux, France
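One way to verify the underlying token is Elasticsearch's _analyze API (a sketch; the index name pelias is an assumption, and this assumes the query analyzer applies the same apostrophe folding as the index analyzer):

GET /pelias/_analyze
{
  "analyzer": "peliasQueryPartialToken",
  "text": "L'arsenal"
}

If the analysis behaves as described above, this returns the single token larsenal.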

So I think we should try increasing max_expansions to at least 100 or so, do a bit of performance testing, and look for cases where the fuzzy searching causes otherwise-working queries to fail. If neither of those ends up being a huge concern, I really think this is surprisingly close to 🚢.

Some new query types were added since the fuzziness PR was created; this updates them with proper test expectations.
@orangejulius (Member)

I just rebased this branch against the latest in master, and updated some newly added fixtures to take fuzziness parameters into account. The tests should pass again!

@Joxit (Member) commented May 11, 2019

Hi there,
I'm experimenting with this PR (with some changes) on our production server.

Here are the changes I've made.

-  'fuzzy:prefix_length': 1,
-  'fuzzy:max_expansions': 10,
+  'fuzzy:prefix_length': 0,
+  'fuzzy:max_expansions': 25,

Source: Joxit/pelias-api jawg/v3.63.0-fuzzy.

I will see if we have some performance issues or incorrect responses 😄.

@Joxit (Member) commented Jun 25, 2019

I just found something that can be weird when we use fuzzy: perfect matches may not be in the first positions.
Some work will be required on ES scoring for this.

(This is completely random)

{
  "function_score":{
    "query":{
      "match":{
        "name.default":{
          "analyzer":"peliasQueryPartialToken",
          "query":"vannes",
          "operator":"and"
        }
      }
    },
    "max_boost":20,
    "functions":[
      {
        "weight":1
      }
    ],
    "score_mode":"first",
    "boost_mode":"replace"
  }
}

@ambrusadrianz

Hi @missinglink, is there a way I could help with this PR?

@missinglink (Member, Author) commented Feb 6, 2020

Yes, please build a docker project from pelias/docker and compare the differences in result quality between pelias/api:master and pelias/api:fuzzy_autocomplete.

https://hub.docker.com/layers/pelias/api/fuzzy_autocomplete/images/sha256-9da0db501c473a7079e1df26569657d0bca89a3c97e0976f01a183a92b7a4f9d

Let us know what's better/worse

@missinglink (Member, Author)

This branch also needs to be rebased against master; I had a quick look and it's not a trivial task.

@ambrusadrianz

The issue is that I've already migrated to the Elasticsearch 7 version, so testing it out won't be quick either.

@missinglink (Member, Author)

I'm not sure I follow; why would that be an issue? Are you saying the queries are not compatible with ES7?

@ambrusadrianz

I get this error when running a query:

[parsing_exception] [constant_score] query does not support [query], with { line=1 & col=54 }
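Taking the error at face value (an assumption on my part, not verified against this branch): ES7's constant_score accepts only a filter clause, so a clause shaped like this is rejected:

{
  "constant_score": {
    "query": { "match": { "name.default": "vannes" } }
  }
}

and would need to be rewritten as:

{
  "constant_score": {
    "filter": { "match": { "name.default": "vannes" } }
  }
}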

@itssoc2 commented Oct 7, 2020

Hi, any progress on this PR? I believe this is a very interesting piece of functionality.

@jnorkus commented Feb 21, 2022

Any news on this? I'd love to see this implemented. Are there any issues preventing it from being merged?

@missinglink (Member, Author)

The technical answer is that fuzziness increases recall at the cost of precision, and this alone does not increase relevance.

The main issue is the way Lucene implements edit-distance queries (what it calls fuzziness): the alternative terms are given the same score as exact matches. This is obviously not ideal, since you can spell a word perfectly and still have spelling mistakes ranked higher due to other properties influencing the score.

This is further complicated by the complexity of our queries: often a single query contains multiple sub-queries, each of which targets a different field of the document.

So if you set an edit distance of 2 (for instance), it's not clear how many total edits will be made to the input text. For instance, if the parser detected 4 different classifications (e.g. number, street, locality, country) in the input, the text may be mutated up to 8 times (2 edits in each of the 4 sub-queries)!

I believe that the result of these two issues will only serve to slow down existing queries while also reducing precision, so it's lose-lose.

I'd be happy to find a solution which provides the increased recall without the ranking issues.
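One common mitigation (a sketch only, not something this PR implements) is to pair an exact clause with the fuzzy clause inside a bool query, so perfect spellings score higher than fuzzy expansions:

{
  "bool": {
    "should": [
      { "match": { "name.default": { "query": "vannes", "boost": 2 } } },
      { "match": { "name.default": { "query": "vannes", "fuzziness": 1, "prefix_length": 1 } } }
    ]
  }
}

This doesn't address the multi-sub-query edit budget described above, but it does stop exact matches being outranked by fuzzy ones.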

@misra-jitas

Hello,
Any update on this Pelias fuzziness work?
Is anybody using Pelias fuzziness? =)
