add normalizer for keyword fields #415

missinglink · 2019-12-13T15:36:35Z

cherry-picked from #412 and based on #414.

This PR adds a normalizer, which is the nearest thing to an analyzer for keyword fields.
More info here: elastic/elasticsearch#18064

This allows us to perform some basic normalization to fields such as layer, source and category, forcing them to be lowercased and doing some ICU normalization.

One notable change here is that those fields were previously case-sensitive and will now be case-insensitive, which I think is preferable, even though an existing test covered the old behaviour.

Note that not all keyword fields should have a normalizer specified. For instance, verbatim fields such as bounding_box and addendum are probably best left with the default null normalizer.
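To make the shape of the change concrete, here is a rough sketch of index settings and mappings along these lines, written as a plain Python dict. It is not the actual diff: the normalizer name is taken from the title of #416, the `icu_normalizer` filter assumes the analysis-icu plugin is installed and is only a guess at what "some ICU normalization" means here, and the typeless `mappings.properties` layout is the Elasticsearch 7 style (6.x nests `properties` under a type name).

```python
# Hypothetical sketch of the index body; not copied from this PR's diff.
NORMALIZER_NAME = "peliasKeywordNormalizer"  # name as quoted in the title of #416

# Fields the description says should be normalized.
NORMALIZED_FIELDS = ("layer", "source", "category")
# Verbatim fields left with the default (null) normalizer.
VERBATIM_FIELDS = ("bounding_box", "addendum")


def build_index_body():
    """Return a settings/mappings dict with a normalizer on selected keyword fields."""
    settings = {
        "analysis": {
            "normalizer": {
                NORMALIZER_NAME: {
                    "type": "custom",
                    # Assumption: "icu_normalizer" comes from the analysis-icu plugin.
                    "filter": ["lowercase", "icu_normalizer"],
                }
            }
        }
    }
    properties = {}
    for field in NORMALIZED_FIELDS:
        properties[field] = {"type": "keyword", "normalizer": NORMALIZER_NAME}
    for field in VERBATIM_FIELDS:
        # No "normalizer" key: values are indexed exactly as supplied.
        properties[field] = {"type": "keyword"}
    return {"settings": settings, "mappings": {"properties": properties}}
```

Keeping the two field lists separate makes it explicit which keyword fields opt in to normalization and which stay verbatim.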

Normalizers are applied both at index-time and at query-time.
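As an illustration of why applying the same normalization on both sides matters, here is a small self-contained simulation. It does not talk to Elasticsearch: `normalize_keyword` and `KeywordIndex` are hypothetical names, and Python's `unicodedata` NFKC normalization is only a stand-in for the ICU filter.

```python
import unicodedata


def normalize_keyword(value: str) -> str:
    """Stand-in for the normalizer: Unicode NFKC normalization, then lowercasing."""
    return unicodedata.normalize("NFKC", value).lower()


class KeywordIndex:
    """Toy exact-match index that normalizes at index-time and at query-time."""

    def __init__(self):
        self._postings = {}

    def index(self, doc_id: str, value: str) -> None:
        # Index-time: store the document under the normalized token.
        token = normalize_keyword(value)
        self._postings.setdefault(token, set()).add(doc_id)

    def term_query(self, value: str) -> list:
        # Query-time: normalize the query term the same way before lookup.
        token = normalize_keyword(value)
        return sorted(self._postings.get(token, set()))


idx = KeywordIndex()
idx.index("1", "OpenStreetMap")
idx.index("2", "openstreetmap")
print(idx.term_query("OPENSTREETMAP"))  # ['1', '2']
```

Because both paths go through the same function, a caller that forgets to lowercase a value can no longer cause a silent mismatch, which is the class of bug this is meant to guard against.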

I would like to add some additional filters such as trim and unique, but they are not available until version 6.4 of Elasticsearch, so they will come in a subsequent PR which can be merged independently of this one.

Joxit · 2019-12-13T16:58:24Z

Is this normalizer necessary? The API already lowercases layer and source (category too?), and there are also some checks for ids 🤔

The code looks good 😄

missinglink · 2019-12-16T10:21:10Z

Good point, I was thinking more about preventing bugs by ensuring the tokens are normalized.
This normalization needs to be done at both index-time and query-time, and I could see bugs easily being introduced, especially in the pelias/api code.

It's not a big deal and I'd be fine with not merging this if it comes with a performance hit.

missinglink · 2019-12-16T10:23:45Z

I found this old test case super confusing because it asserts that the keyword field is case-sensitive, even though it should never have been 🤷‍♂

[admission of guilt] it was written by me 😝

Joxit · 2019-12-16T10:58:36Z

Ha ha ha, it's ok, that was 4 years ago, the statute of limitations has passed :p

missinglink force-pushed the normalizers branch from da83b59 to c507a3c Compare December 13, 2019 15:42

missinglink mentioned this pull request Dec 13, 2019: add 'trim' to the 'peliasKeywordNormalizer' filters #416 (Open)

missinglink force-pushed the normalizers branch from c507a3c to af0929e Compare December 13, 2019 15:57

feat(normalizers): add optional normalizer for keyword fields

7a674fa

missinglink force-pushed the normalizers branch from af0929e to 7a674fa Compare December 16, 2019 14:43
