Question: working with accented character #269

guirip · 2017-05-29T18:01:24Z

Hello Oliver,
I hope you're doing fine.

I'm coming back to you with this fiddle, where you can see that some of the data has accented characters, e.g. 'général'.

If the user searches for 'général', the matching entry is returned as expected.
If the user searches for 'genéral', no entry is returned.

How would you advise me to get this working?
Is there a simpler way than removing accents both when building the index and from the user's input?

olivernn · 2017-05-29T20:14:34Z

In a way Lunr is doing the right thing here, as there is no entry in the index for 'genéral', i.e. it thinks 'e' and 'é' are different characters (which technically they are) and so doesn't find anything.

There is a lunr-unicode-normalizer plugin which attempts to normalise characters by removing diacritical marks. It looks like it probably needs upgrading to support Lunr 2, but this shouldn't be too difficult; this guide should hopefully show what is required.

Let me know if that solves the problem.
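
To illustrate what "removing diacritical marks" can mean in practice, here is a minimal sketch. It is not the plugin's code: `foldDiacritics` is a hypothetical helper built on the standard `String.prototype.normalize`, and it only handles accents that decompose into a base letter plus a combining mark.

```javascript
// Hypothetical helper, not part of Lunr or lunr-unicode-normalizer.
// NFD splits "é" into "e" + U+0301 (combining acute accent);
// the regex then drops the combining marks in U+0300 to U+036F.
function foldDiacritics(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

console.log(foldDiacritics("général")); // "general"
console.log(foldDiacritics("genéral") === foldDiacritics("général")); // true
```

Once both the indexed text and the query go through the same function, 'genéral' and 'général' end up as the same term.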

guirip · 2017-05-30T10:36:11Z

Hello

Thanks for the link !

I'm sorry, but I don't have any time to dig in and migrate the plugin (I'm almost alone on a big app with an approaching deadline, and will mostly be getting home at 10pm again), but I copy/pasted most of it (with a link to the original source in the JSDoc) and it works like a charm.
I use it to normalize input when creating the indexes, and to normalize the given 'input' on user search.
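
The approach described here (fold at index time and again at search time) could be packaged as a Lunr 2.x plugin roughly as follows. This is a sketch under assumptions, not the code that was copy/pasted: `stripAccents`, `makeFoldingPlugin` and `foldToken` are hypothetical names, and `lunr` is passed in as a parameter so the snippet makes no assumption about how it is loaded. The folding step is placed before `lunr.trimmer` because, as far as I can tell, the default trimmer strips leading and trailing characters matching `\W`, which in JavaScript includes accented letters.

```javascript
// Hypothetical helper: decompose, then drop combining marks.
function stripAccents(text) {
  return text.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
}

// Hypothetical factory: returns a plugin usable with builder.use().
function makeFoldingPlugin(lunr) {
  // Pipeline function: rewrite the token's string and return the token.
  function foldToken(token) {
    return token.update(function (str) {
      return stripAccents(str);
    });
  }

  // Registering gives the function a label, which Lunr uses
  // when an index is serialised and loaded again.
  lunr.Pipeline.registerFunction(foldToken, "foldToken");

  return function foldingPlugin(builder) {
    // Index side: fold before the trimmer sees the token.
    builder.pipeline.before(lunr.trimmer, foldToken);
    // Query side: the default search pipeline only holds the stemmer.
    builder.searchPipeline.before(lunr.stemmer, foldToken);
  };
}
```

With Lunr loaded, this would be applied inside the index definition with `this.use(makeFoldingPlugin(lunr))`, before any documents are added.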

nekdolan · 2019-01-10T22:27:24Z

Just wanted to say that I made an npm-compatible version of lunr-unicode-normalizer and it seems to be working just fine. It should work in the browser as well, but I didn't try. https://github.com/nekdolan/lunr-unicode-normalizer
It works the same way as the language plugins.

guirip · 2019-01-16T16:06:44Z

@nekdolan Thanks for the tip

olivernn · 2019-01-17T17:43:54Z

@nekdolan Nice work wrapping that up in an NPM module.

One modification you might be interested in: in Lunr 2.x at least, a tokenizer can be specified per index, which should mean that you no longer need to ~~monkey~~ freedom patch lunr.tokenizer.
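
A per-index tokenizer along these lines might look like the sketch below. `makeFoldingTokenizer` is a hypothetical name, and the wrapper simply folds string fields before delegating to whatever tokenizer it is given (for example `lunr.tokenizer`). As far as I understand Lunr 2.x, the builder's tokenizer is only applied to documents, so query strings would still need to be folded separately, for instance in the search pipeline.

```javascript
// Hypothetical wrapper: folds accents, then delegates to the base tokenizer.
function makeFoldingTokenizer(baseTokenizer) {
  return function foldingTokenizer(obj, metadata) {
    // Only plain strings are folded here; arrays, null and other
    // field values are passed through untouched (a simplification).
    const input =
      typeof obj === "string"
        ? obj.normalize("NFD").replace(/[\u0300-\u036f]/g, "")
        : obj;
    return baseTokenizer(input, metadata);
  };
}
```

Inside the index definition this would be set with `this.tokenizer = makeFoldingTokenizer(lunr.tokenizer)`, leaving the global `lunr.tokenizer` unpatched.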

smontlouis · 2019-03-30T01:23:36Z

@nekdolan I was unable to use your code, so I created a unicodeNormalizer compatible with Lunr 2.x: https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77

nekdolan · 2019-03-30T19:48:49Z

@bulby97 I didn't realize that the wiki I used was using version 1 instead of 2. Thanks for the upgrade.

Ecco · 2019-10-25T18:56:27Z

One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which could be a conversion from NFC to NFKD). 🤓

dhdaines · 2023-06-09T23:40:16Z

> One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which could be a conversion from NFC to NFKD). 🤓

Yes, the proper name for it is "character folding": http://www.unicode.org/reports/tr30/tr30-4.html since "normalization" preserves (more or less...) the original glyphs.

Whoosh has very nice documentation: https://whoosh.readthedocs.io/en/latest/stemming.html#character-folding
Lucene has a very fancy implementation: https://lucene.apache.org/core/9_6_0/analysis/icu/index.html
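
The distinction between normalization and folding can be seen directly in JavaScript. The snippet below is only an illustration: `fold` is a hypothetical one-liner that approximates folding by decomposing and dropping combining marks, which is much cruder than the table-driven folding that the linked documents describe.

```javascript
const composed = "\u00e9";    // "é" as a single code point (NFC form)
const decomposed = "e\u0301"; // "e" followed by a combining acute accent (NFD form)

// Normalization converts between equivalent encodings of the same character.
console.log(composed === decomposed);                  // false
console.log(composed.normalize("NFD") === decomposed); // true
console.log(decomposed.normalize("NFC") === composed); // true

// Folding deliberately discards information: the accent is gone.
const fold = (s) => s.normalize("NFD").replace(/[\u0300-\u036f]/g, "");
console.log(fold(composed)); // "e"

// "ø" has no canonical decomposition, so this crude approach leaves it alone.
console.log(fold("\u00f8") === "\u00f8"); // true
```

The last line hints at why folding implementations tend to rely on explicit mapping tables rather than on normalization alone.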

dhdaines · 2023-06-09T23:42:50Z

> @nekdolan I was unable to use your code, so I created a unicodeNormalizer compatible with Lunr 2.x: https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77

Sadly, this is now a 404. I'll do another one and put it on NPM shortly, as I need this (despite its weird and somewhat antiquated API, lunr.js still appears to be the best JavaScript search library out there).

dhdaines · 2023-06-12T03:07:45Z

Here you go: https://www.npmjs.com/package/lunr-folding

pomeloshark · 2023-09-11T02:01:55Z

@dhdaines Are you able to provide a demo of how to use this? Every time I try to incorporate it into my project my existing search function stops returning any results; I'm not super well versed in javascript so I'm not sure exactly where I'm going wrong. Thanks in advance

dhdaines · 2023-10-22T19:02:14Z

> @dhdaines Are you able to provide a demo of how to use this?

Sorry for the delay! There's an example now at https://www.npmjs.com/package/lunr-folding, but due to some JavaScript weirdness that I don't understand at all, it's slightly wrong. You should be able to install it with:

npm install lunr lunr-folding

Then run:

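// Load Lunr, then the folding plugin (note the `.default` on the require).
const lunr = require("lunr");
const folding = require("lunr-folding").default;
// Apply the folding plugin to this lunr instance before building the index.
folding(lunr);

const idx = lunr(function () {
    this.ref("id");
    this.field("text");
    this.add({ id: "1", text: "Étape 1: Collecter des bobettes" });
    this.add({ id: "2", text: "Étape 2: ???" });
    this.add({ id: "3", text: "Étape 3: Profit" });
});
// The unaccented query should still match the accented "Étape 3".
const results = idx.search("etape 3");
console.log(JSON.stringify(results[0]));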

pomeloshark · 2023-10-29T03:26:46Z

@dhdaines Thanks very much!

Hinton mentioned this issue May 17, 2022

[ps-136] Ignore accented characters in vault search bitwarden/jslib#804

Merged

