Question: working with accented character #269

Open
guirip opened this issue May 29, 2017 · 14 comments
Comments

@guirip

guirip commented May 29, 2017

Hello Oliver,
I hope you're doing fine.

I'm coming back to you with this fiddle, where you can see that some of the data contains accented characters, e.g. 'général'.

  • If the user searches for 'général', the matching entry is returned as expected.

  • If the user searches for 'genéral', no entry is returned.

How would you advise me to get this working?
Is there a simpler way than removing accents both when building the indexes and from the user input?

@olivernn
Owner

In a way Lunr is doing the right thing here, as there is no entry in the index for 'genéral', i.e. it thinks 'e' and 'é' are different characters (which technically they are) and so doesn't find anything.

There is a lunr-unicode-normalizer plugin which attempts to normalise characters by removing diacritical marks. It looks like it probably needs upgrading to support Lunr 2, but this shouldn't be too difficult; this guide should hopefully show what is required.

Let me know if that solves the problem.

@guirip
Author

guirip commented May 30, 2017

Hello

Thanks for the link!

I am sorry, but I don't have any time to dig in and migrate the plugin (I'm almost alone on a big app with an approaching deadline, and will mostly be getting home at 10pm again). Instead, I copy/pasted most of it (with a link to the original source in the jsdoc) and it works like a charm.
I use it to normalize the input when creating the indexes, and to normalize the user's input at search time.
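
For readers who want to see the shape of that approach, here is a minimal sketch. It is not the plugin's code and does not use Lunr at all: `foldAccents`, `buildIndex` and `search` are hypothetical names, and the toy `Map`-based index only stands in for a real search index. It assumes that decomposing to NFD and dropping combining marks is good enough for the data, which holds for most accented Latin letters but not for characters such as 'ø' or 'ß' that have no decomposition.

```javascript
// Hypothetical helper: fold accented Latin letters to their base letters.
function foldAccents(text) {
  return text
    .normalize("NFD")                 // split "é" into "e" + U+0301
    .replace(/[\u0300-\u036f]/g, "")  // drop combining diacritical marks
    .toLowerCase();
}

// Toy inverted index (folded word -> set of document ids).
// This only stands in for the real index; it is not how Lunr stores data.
function buildIndex(docs) {
  const index = new Map();
  for (const doc of docs) {
    const words = foldAccents(doc.text).split(/\s+/).filter(Boolean);
    for (const word of words) {
      if (!index.has(word)) index.set(word, new Set());
      index.get(word).add(doc.id);
    }
  }
  return index;
}

// The query goes through the same folding as the indexed text.
function search(index, query) {
  const ids = index.get(foldAccents(query.trim()));
  return ids ? [...ids] : [];
}

const index = buildIndex([
  { id: "1", text: "Le secrétaire général" },
  { id: "2", text: "Une autre entrée" },
]);

console.log(search(index, "général"));
console.log(search(index, "genéral"));
console.log(search(index, "general"));
```

The important design point is that the same function runs on both sides: if only the indexed text or only the query were folded, 'général' and 'genéral' would still fail to match.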

@nekdolan

Just wanted to say that I made an npm-compatible version of the lunr-unicode-normalizer, and it seems to be working just fine. It should work in the browser as well, but I haven't tried. https://github.com/nekdolan/lunr-unicode-normalizer
It works the same as the language plugins.

@guirip
Author

guirip commented Jan 16, 2019

@nekdolan Thanks for the tip

@olivernn
Owner

@nekdolan Nice work wrapping that up in an NPM module.

One modification you might be interested in: in Lunr 2.x at least, a tokenizer can be specified per index, which should mean that you no longer need to monkey patch lunr.tokenizer.

@smontlouis

smontlouis commented Mar 30, 2019

@nekdolan I was unable to use your code, so I created a Lunr 2.x-compatible unicodeNormalizer: https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77

@nekdolan

@bulby97 I didn't realize that the wiki I used was using version 1 instead of 2. Thanks for the upgrade.

@Ecco

Ecco commented Oct 25, 2019

One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which could be a conversion from NFC to NFKD). 🤓

@dhdaines

dhdaines commented Jun 9, 2023

One quick pedantic comment: I don't think this plugin is correctly named. I think it does a transliteration (in this case, from Unicode to ASCII), not a normalization (which could be a conversion from NFC to NFKD). 🤓

Yes, the proper name for it is "character folding": http://www.unicode.org/reports/tr30/tr30-4.html since "normalization" preserves (more or less...) the original glyphs.

Whoosh has very nice documentation: https://whoosh.readthedocs.io/en/latest/stemming.html#character-folding
Lucene has a very fancy implementation: https://lucene.apache.org/core/9_6_0/analysis/icu/index.html
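
To make the distinction between normalization and folding concrete, here is a small sketch using only the standard `String.prototype.normalize` method. The variable names are illustrative, and stripping every character in the Unicode "Mark" category is just one simple way to fold, not the full procedure described in the Unicode report or implemented by Lucene.

```javascript
// The same visible character "é" in two Unicode representations.
const composed = "\u00e9";     // one code point (the NFC form)
const decomposed = "e\u0301";  // "e" + combining acute accent (the NFD form)

// Normalization converts between representations of the SAME text.
const sameBeforeNormalizing = composed === decomposed;
const sameAfterNfd = composed.normalize("NFD") === decomposed;
const sameAfterNfc = decomposed.normalize("NFC") === composed;

// Compatibility normalization (NFKD) also rewrites compatibility
// characters, e.g. the "ﬁ" ligature (U+FB01) becomes "f" + "i".
const ligature = "\ufb01".normalize("NFKD");

// Folding deliberately discards information: decompose, then drop marks.
const folded = composed.normalize("NFD").replace(/\p{M}/gu, "");

console.log({ sameBeforeNormalizing, sameAfterNfd, sameAfterNfc, ligature, folded });
```

After normalization the accent is still there, only encoded differently; after folding it is gone, which is what makes 'genéral' match 'général' in a search index.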

@dhdaines

dhdaines commented Jun 9, 2023

@nekdolan I was unable to use your code, so I created a unicodeNormalizer 2.x compatible : https://gist.github.com/bulby97/7bd05561be91151b38aee3a6204d3e77

Sadly, this is now a 404. I'll do another one and put it on NPM shortly, as I need this (despite its weird and somewhat antiquated API, lunr.js still appears to be the best JavaScript search library out there).

@dhdaines

Here you go: https://www.npmjs.com/package/lunr-folding

@pomeloshark

@dhdaines Are you able to provide a demo of how to use this? Every time I try to incorporate it into my project, my existing search function stops returning any results. I'm not super well versed in JavaScript, so I'm not sure exactly where I'm going wrong. Thanks in advance.

@dhdaines

@dhdaines Are you able to provide a demo of how to use this?

Sorry for the delay! There's an example now at https://www.npmjs.com/package/lunr-folding - but due to some JavaScript weirdness that I don't understand at all, it's slightly wrong. You should be able to install:

npm install lunr lunr-folding

Then run:

const lunr = require("lunr");
const folding = require("lunr-folding").default;
folding(lunr);

const idx = lunr(function () {
    this.ref("id");
    this.field("text");
    this.add({ id: "1", text: "Étape 1: Collecter des bobettes" });
    this.add({ id: "2", text: "Étape 2: ???" });
    this.add({ id: "3", text: "Étape 3: Profit" });
});
const results = idx.search("etape 3");
console.log(JSON.stringify(results[0]));

@pomeloshark

@dhdaines Thanks very much!


7 participants