Feature request: support for case-sensitive and case-insensitive search #331

Open
giuliac89 opened this issue Feb 16, 2018 · 9 comments

Comments

@giuliac89

Hi Oliver,
do you plan to add this feature?

@hoelzro
Contributor

hoelzro commented Mar 6, 2018

@giuliac89 FWIW, you could add this feature in current lunr.js by tweaking the pipeline. I believe the forced lowercasing currently happens in the tokenizer.

@olivernn
Owner

olivernn commented Mar 7, 2018

@hoelzro is right, the current downcasing happens inside lunr.tokenizer. Unfortunately, this means you would need to re-implement the tokenizer just to change that one part.

Do you have a specific use case in mind? How does the current behaviour fall short?
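
For illustration only, here is a minimal sketch of what a re-implemented, case-preserving tokenizer could look like. It is written as plain JavaScript with no lunr dependency: `caseSensitiveTokenize` and `SEPARATOR` are hypothetical names, the tokens are plain objects rather than real lunr.Token instances, and the whitespace/hyphen separator is an assumption about sensible defaults, not a copy of lunr's internals.

```javascript
// Hypothetical separator: split on runs of whitespace or hyphens.
const SEPARATOR = /[\s\-]+/;

// Sketch of a tokenizer that keeps the original casing of every token.
// Returns plain objects shaped like { str, metadata }; a real lunr
// tokenizer would wrap these values in lunr.Token instead.
function caseSensitiveTokenize(text) {
  const str = String(text);
  const tokens = [];
  let sliceStart = 0;

  for (let sliceEnd = 0; sliceEnd <= str.length; sliceEnd++) {
    const char = str.charAt(sliceEnd);
    const sliceLength = sliceEnd - sliceStart;

    // A token ends at a separator character or at the end of the input.
    if (SEPARATOR.test(char) || sliceEnd === str.length) {
      if (sliceLength > 0) {
        tokens.push({
          // Note: no toLowerCase() call here, so "FOO" stays "FOO".
          str: str.slice(sliceStart, sliceEnd),
          metadata: {
            position: [sliceStart, sliceLength],
            index: tokens.length
          }
        });
      }
      sliceStart = sliceEnd + 1;
    }
  }

  return tokens;
}

console.log(caseSensitiveTokenize("FOO bar-Baz").map(function (t) {
  return t.str;
}));
```

How such a function is wired into an index depends on the lunr version in use, so that part is left out here and should be checked against the lunr documentation.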

@giuliac89
Author

giuliac89 commented Mar 8, 2018

I'm implementing a search engine for a research project related to philological editions. http://evt.labcd.unipi.it/

It's important to add this functionality so that the philological studies carried out on these editions can go into more detail.

@olivernn
Owner

olivernn commented Mar 9, 2018

So, in your case, a term, say "FOO", has a different meaning than the downcased term "foo"?

As well as lunr.tokenizer, the query parser also downcases terms. This only affects lunr.Index#search, not lunr.Index#query:

// won't work, gets converted to "foo"
idx.search("FOO") 

// will work, no further processing of the terms done
idx.query(function (q) {
  q.term("FOO")
})

@giuliac89
Author

Yes, the difference between a term "FOO" and a term "foo" could be fundamental for some research studies, and this is why I would like to include this feature in my search engine. So the only thing I can do is re-implement the tokenizer.

Do you think that this feature could be interesting for lunr.js?

@indolering

Do you think that this feature could be interesting for lunr.js?

To be honest, it seems pretty niche. It wouldn't be hard to implement as an all-or-nothing feature of the index (just add it as a config option) but how would you support query time case-sensitivity without blowing up the index size? I think it's important to remember that Lunr is primarily for static websites and size is a big deal....

@giuliac89
Author

Well, I tried developing the feature in my web app, and the index size is not a big problem in this case!
For a document of about 1460 words, the index size (including two types of metadata) is about 121 kB without the case-sensitive feature and about 158 kB with it.
Indexing is done only in "lowercase mode". To handle case sensitivity, I simply wrote a custom tokenizer in which I create a lunr token like this:

new lunr.Token (token, {
   position: [startIndex, tokenLength],
   index: tokens.length,
   originalToken: originalToken
});

So I register the "original token" as metadata:

0: lunr.Token {
   str: "in",
   metadata: {
      index: 0
      originalToken: "In"
      position: (2) [0, 2]
   }
}

This way it is simple to check case sensitivity without increasing the index size considerably.
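
The approach described above (index in lowercase, keep the original casing as token metadata, compare at query time) can be sketched roughly as follows. This is a self-contained illustration under stated assumptions, not the actual implementation from the web app: `tokenizeKeepingOriginal` and `findMatches` are hypothetical helpers, tokens are plain objects instead of lunr.Token instances, and splitting on whitespace is a simplification.

```javascript
// Sketch: lowercase each token for indexing, but remember its original
// casing in the metadata, mirroring the token shape shown above.
function tokenizeKeepingOriginal(text) {
  const tokens = [];
  const pattern = /\S+/g; // assumption: tokens are runs of non-whitespace
  let match;

  while ((match = pattern.exec(text)) !== null) {
    const originalToken = match[0];
    tokens.push({
      str: originalToken.toLowerCase(),
      metadata: {
        position: [match.index, originalToken.length],
        index: tokens.length,
        originalToken: originalToken
      }
    });
  }

  return tokens;
}

// Sketch: match on the lowercased term first, then optionally narrow the
// result to tokens whose original casing equals the query exactly.
function findMatches(tokens, query, caseSensitive) {
  const needle = query.toLowerCase();
  return tokens.filter(function (token) {
    if (token.str !== needle) return false;
    return !caseSensitive || token.metadata.originalToken === query;
  });
}

const tokens = tokenizeKeepingOriginal("In the end, in");
console.log(findMatches(tokens, "in", false).length); // case-insensitive
console.log(findMatches(tokens, "In", true).length);  // case-sensitive
```

Because the index itself stays lowercase, the extra cost is only the stored metadata. In lunr, token metadata is only kept in the index when its key is whitelisted on the builder, so `originalToken` would presumably need to be listed there as well.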

@indolering

Submit a patch!

@olivernn
Owner

@giuliac89 @indolering this seems like a good candidate for a plugin. If so, we could add it to the new list of plugins on the wiki and the website. If someone does the work to package this up, I'm more than happy to feature it.
