
Created an initial pluggable tokenizer with ngram support in order to allow using lunr to drive autocomplete style search boxes. #63

Open · wants to merge 3 commits into base: master
Conversation

wballard

I use this library all the time, thanks for making it available. One use case we run into more and more is client-side autocomplete, and we have found that ngram indexing on the server -- usually Elasticsearch -- gives us the best results. I need that same functionality client side, and in node.js, and don't care to fuss with going out of process to Elasticsearch if I can avoid it.

I tried to follow along with your style and formatting, and hopefully did so to your satisfaction.

This sets up an index-level tokenizer. I didn't dive as far in as #21, since that implies field-level pipelines and tokenizers -- which in turn would mean either extending the pipeline to 'start' with a tokenizer and then stream through multiple filters, or introducing some other field object that combines a tokenizer and a pipeline.

/**
 * A tokenizer tha indexes on character bigrams.


s/tha/that/

@wballard
Author

Thanks -- I can see how I totally copy-pasta'd that same doc error.
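For readers skimming the thread, the character-bigram tokenizer the reviewed doc comment describes can be sketched as a standalone function. This is an illustrative reconstruction, not the patch itself; the name `bigramTokenizer` and the edge-case handling are my assumptions.

```javascript
// Illustrative sketch (not the actual patch): a tokenizer that indexes on
// character bigrams. Name and edge-case behaviour are assumptions.
var bigramTokenizer = function (obj) {
  if (obj == null) return []
  var str = obj.toString().trim().toLowerCase()
  if (str.length === 0) return []
  if (str.length < 2) return [str]      // too short for a bigram
  var tokens = []
  for (var i = 0; i < str.length - 1; i++) {
    tokens.push(str.slice(i, i + 2))    // overlapping 2-character slices
  }
  return tokens
}

console.log(bigramTokenizer('lunr'))    // [ 'lu', 'un', 'nr' ]
```

Indexing overlapping bigrams is what lets a prefix query such as "lu" match mid-word, which is the property autocomplete boxes rely on.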

@olivernn
Owner

Many thanks for taking the time to look into this.

I think that an ngram tokeniser would make a great plugin for lunr. As part of the changes I am making for better i18n support I am adding a very simple plugin system that I think you could take advantage of. It's great to have another potential use case for a plugin so that I make sure the API is flexible enough.

Let me take a closer look through your changes and see if I can make some suggestions of how to extract this as a plugin.

Thanks again!

@hugovincent

Any update on this?

@rowanoulton

Hey, is there an ETA for merging this or the plugin system mentioned? Would love to use it!

@cvan
Contributor

cvan commented Sep 24, 2014

@olivernn can this be merged in or is the plugin system ready yet?

@missinglink

I would also like to contribute ngram analyzers for autocomplete. What is the status of this? It's been open for a year now, so I'm hesitant to do any more work on it.

@olivernn
Owner

The means to add plugins to lunr already exists. The main extension point is to modify an index's text processing pipeline. Each index has its own pipeline, so a plugin can safely modify the pipeline of the index it is being applied to.

In these cases, though, I think the tokenizer needs to be modified. This is possible, but the tokeniser is currently global rather than per index, so all indexes will be forced to use the replacement tokenizer; this may or may not be a problem.

An example:

var myNgramTokenizer = function () {
  lunr.tokenizer = function (obj) {
    // ngram implementation
  }
}

idx.use(myNgramTokenizer)

I'm not sure why the tokenizer is not a property of the index instance; I will take a look at this.
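To make the pattern above concrete, here is a runnable sketch with `lunr` and `use` stubbed out, since the real library isn't loaded here. The stub of `use` mirrors how `lunr.Index#use` passes the index to the plugin, and the bigram implementation is illustrative; treat the details as assumptions.

```javascript
// Self-contained sketch of the plugin pattern described above. `lunr` and
// `index.use` are minimal stubs standing in for the real library so the
// example runs on its own; the bigram implementation is illustrative.
var lunr = {
  // stand-in for the default whitespace tokenizer
  tokenizer: function (obj) {
    return obj.toString().toLowerCase().split(/\s+/)
  }
}

var index = {
  // mirrors lunr.Index#use, which invokes the plugin with the index
  use: function (plugin) {
    plugin.call(this, this)
  }
}

var myNgramTokenizer = function () {
  // swap the global tokenizer for a character-bigram one
  lunr.tokenizer = function (obj) {
    var str = obj.toString().toLowerCase()
    var tokens = []
    for (var i = 0; i < str.length - 1; i++) {
      tokens.push(str.slice(i, i + 2))
    }
    return tokens
  }
}

index.use(myNgramTokenizer)
console.log(lunr.tokenizer('search'))   // [ 'se', 'ea', 'ar', 'rc', 'ch' ]
```

Because the tokenizer is swapped on the global `lunr` object rather than on the index, every index built afterwards picks it up, which is exactly the caveat about the global tokenizer noted in this thread.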

@natcohen

@olivernn Great work! Any chance this could be merged? ngram and edgengram are must-haves nowadays... I'd love to see this built in or as a plugin.

@tienne

tienne commented Feb 16, 2022

Is there anything we can do?
