Using invertedIndex for autocomplete #287

hungerburg · 2017-07-28T11:01:42Z

Not so much an issue with lunr, which is great! More a quick try to get ideas going…

In a shell with the jq utility I pull my terms from the lunr index in advance:
jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json
I can feed that to http://api.jqueryui.com/autocomplete/ widget like below

   function normalize(str) {
      var map = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };
      return str.replace(/[^A-Za-z0-9]/g,
         function(a) { return map[a]||a; }
      );
   }
   $.getJSON('iindex.json', function (tags) {
      $('#query').autocomplete({
         minLength: 3,
         source: function(inp, out) {
            var t = normalize(inp.term);
            var r = $.ui.autocomplete.filter(tags, t);
            out(r);
         }
      });
   });

Not fully nice, but works acceptably so far. Now truly nice would be to create the autocomplete index on the client and have the term to match processed by the indexer instead of that crude normalizer above.

The text was updated successfully, but these errors were encountered:

olivernn · 2017-08-07T20:13:20Z

Sorry for the late reply.

You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don't think it has been updated for lunr 2.

As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I've been suggesting to people.

idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })

  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })

  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})

I disable the pipeline to prevent stemming getting in the way, you would have to experiment if this makes sense for your use case, especially if you wanted to add the unicode normalising plugin.

Additionally, when using the query method lunr won't be doing any tokenisation for you, you can either handle this your self, or borrow the lunr.tokenizer directly, or its regex to split into individual tokens.

hungerburg · 2017-08-08T21:21:48Z

Dear Oliver, no need to be sorry! With the help of your snippet I cobbled together a little more context, so that, in fact, I get around creating an extra iindex upfront but instead can feed the autocomplete from lunr itself and also can get rid of my own normalizer by using the pipeline:

   // LUNR query for autocomplete terms
   function acsearch(searchTerm) {
      var results = index.query(function (q) {
         // exact matches should have the highest boost
         q.term(searchTerm, { boost: 100 })
         // wildcard matches should be boosted slightly
         q.term(searchTerm, { boost: 10, usePipeline: true, wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING })
         // finally, try a fuzzy search, without any boost
         q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
      });
      if (!results.length) { return ""; }
      return results.map(function(v, i, a) { // extract
         return Object.keys(v.matchData.metadata);
      }).reduce(function(a, b) { // flatten
         return a.concat(b);
      }).filter(function(v, i, a) { // uniq
         return a.indexOf(v) === i;
      });
   }

   // install jquery autocomplete widget
   $('#query').autocomplete({
      appendTo: '#dlg',
      minLength: 3,
      source: function(inp, out) {
         out(acsearch(inp.term.toLowerCase()));
      }
   });

The resulting list reads much the same -- though now it is sorted by score and some new fuzzy terms appear. Performance on my desktop is also good. I think this is showing all that can be pulled from an inverted index - anything more would need a mapping to the unstemmed original text and that is beyond the scope of this experiment, at least for now. Thank you very much indeed!

olivernn · 2017-08-15T20:03:29Z

@hungerburg glad you managed to make some progress. Indeed the stemming does make implementing really good autocomplete difficult.

Just thinking aloud, but you can get Lunr to create and store the mapping to the unstemmed words, though it will likely increase the index size considerably.

During indexing you can add a pipeline function that stores the unstemmed words as metadata for a token. The concept is discussed a bit in the guides.

You would need to add a pipeline function before the stemmer:

var storeUnstemmed = function (token) {
  token.metadata['unstemmed'] = token.toString()
  return token
}

The idea would be that you would then get the unstemmed words from the search results, the guide has a more detailed example.

First prize would be able to have an algorithmic unstemmer, but I don't think its possible.

olivernn · 2018-06-06T20:03:26Z

Closing this as it has gone stale, feel free to comment or re-open if you still have any unanswered questions.

chrisbartley · 2019-01-16T20:15:31Z

First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it's working great. For anyone wanting to do the same, this is what worked for me...

I'm indexing like this:

// Store unstemmed term in the metadata.  See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {

   // Define a pipeline function that keeps the unstemmed word
   const pipelineFunction = function(token) {
      token.metadata['unstemmed'] = token.toString();
      return token;
   };

   // Register the pipeline function so the index can be serialised
   lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');

   // Add the pipeline function to both the indexing pipeline and the searching pipeline
   builder.pipeline.before(lunr.stemmer, pipelineFunction);

   // Whitelist the unstemmed metadata key
   builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
   this.use(storeUnstemmed);
   ...
});

And modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:

autoComplete(searchTerm) {
   const results = this._index.query(function(q) {
      // exact matches should have the highest boost
      q.term(searchTerm, { boost : 100 })
      // wildcard matches should be boosted slightly
      q.term(searchTerm, {
         boost : 10,
         usePipeline : true,
         wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
      })
      // finally, try a fuzzy search, without any boost
      q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
   });
   if (!results.length) {
      return "";
   }
   return results.map(function(v, i, a) { // extract unstemmed terms
      const unstemmedTerms = {};
      Object.keys(v.matchData.metadata).forEach(function(term) {
         Object.keys(v.matchData.metadata[term]).forEach(function(field) {
            v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
               unstemmedTerms[word] = true;
            });
         });
      });
      return Object.keys(unstemmedTerms);
   }).reduce(function(a, b) { // flatten
      return a.concat(b);
   }).filter(function(v, i, a) { // uniq
      return a.indexOf(v) === i;
   });
}

Thanks!

Using advice from olivernn/lunr.js#287 to retain the "autocomplete" behavior seen in earlier versions.

olivernn closed this as completed Jun 6, 2018

kennedyrose mentioned this issue Oct 8, 2018

Autosuggest in search escaladesports/2019-website-boilerplate#182

Closed

johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019

Update search to use Lunr 2.3.5

cd793c6

Using advice from olivernn/lunr.js#287 to retain the "autocomplete" behavior seen in earlier versions.

johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019

Update search to use Lunr 2.3.5

53bbea0

Using advice from olivernn/lunr.js#287 to retain the "autocomplete" behavior seen in earlier versions.

johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019

Update search to use Lunr 2.3.5

73abc98

Using advice from olivernn/lunr.js#287 to retain the "autocomplete" behavior seen in earlier versions.

cannin mentioned this issue Jun 14, 2019

Typeahead Search with Lunr cannin/ihop-reach#36

Closed

chrisbartley mentioned this issue Oct 1, 2021

how to do auto suggest? #471

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Using invertedIndex for autocomplete #287

Using invertedIndex for autocomplete #287

hungerburg commented Jul 28, 2017

olivernn commented Aug 7, 2017

hungerburg commented Aug 8, 2017 •

edited

olivernn commented Aug 15, 2017

olivernn commented Jun 6, 2018

chrisbartley commented Jan 16, 2019

Using invertedIndex for autocomplete #287

Using invertedIndex for autocomplete #287

Comments

hungerburg commented Jul 28, 2017

olivernn commented Aug 7, 2017

hungerburg commented Aug 8, 2017 • edited

olivernn commented Aug 15, 2017

olivernn commented Jun 6, 2018

chrisbartley commented Jan 16, 2019

hungerburg commented Aug 8, 2017 •

edited