Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

re-enable support for custom multi-word synonyms #457

Merged
merged 2 commits into from Aug 7, 2020

Conversation

missinglink
Copy link
Member

@missinglink missinglink commented Aug 4, 2020

As discussed in #456, the work in #453 had the unexpected consequences of dropping support for multi-word custom synonyms.

My general guidance here is that multi-word synonyms are poorly supported by lucene/elasticsearch and so should be avoided where possible, great care should be taken to ensure they are compatible with the match_phrase queries used by Pelias.

this new elasticsearch doc explains the issues with multi-word synonyms with accompanying graphics:
https://www.elastic.co/guide/en/elasticsearch/reference/7.6/token-graphs.html

Where possible I'd recommend using 'aliases' ie. doc.setNameAlias() instead, this is a reliable method of achieving the same thing, although it's far less convenient because it's on a per-record basis.

So.. having said all that.. this PR re-enables support for multi-word synonyms (in custom synonyms files only) in order to avoid breaking backwards compatibility.

In order to do this I had to move the custom multi-word synonyms outside the multiplexer, apparently multiplexers emit tokens one-by-one to each of their branches, preventing the ability to 'look-ahead' as required by any multi-term analysis within a branch.

Screenshot 2020-08-04 at 10 21 15

resolves #456

@@ -74,6 +75,12 @@ function multiWordCheck(line, logprefix, tokens) {
});
}

function tokenReplacementCheck(line, logprefix) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've also added this additional linter warning.

Technically the => syntax is actually still supported (the linter doesn't rewrite the configs) but I would like to heavily discourage its use.

The reason for this is an entry such as road => rd would only enter [ "rd" ] in the index, so when a user queries for [ "road" ] (or any ngram of), ☠️ it will not match ☠️

Advanced users may opt to write it as road => road, rd to overcome this issue but this is equivalent to road, rd.

I think we will probably completely drop support for => at some point in the future because it's very easy to make mistakes with.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking it might actually be useful to use => sometimes. For example, if we have a single letter abbreviation that we want to ensure is not being applied too often.

c, carrer would replace every instance of c with carrer. However, carrer => career, c would not.

As you mentioned we would probably recommend that all tokens on the left hand side of the replacement also appear on the right, to avoid issues with autocomplete jitter.

@missinglink
Copy link
Member Author

I've added 1baeadf to address an issue where the linter is was using /\s/ instead of [\\s/\\\\-]+ to determine which tokens were multi-word.

As a result I've had to remove a few hyphenated synonyms from the canonical synonym lists.

@orangejulius
Copy link
Member

Nice. Does it make sense to add an integration test for multi-word synonyms?

@orangejulius orangejulius merged commit 78c3cd3 into master Aug 7, 2020
@orangejulius orangejulius deleted the multiword_synonyms branch August 7, 2020 13:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Multi token synonyms appear broken by synonyms upgrade
2 participants