Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using invertedIndex for autocomplete #287

Closed
hungerburg opened this issue Jul 28, 2017 · 5 comments
Closed

Using invertedIndex for autocomplete #287

hungerburg opened this issue Jul 28, 2017 · 5 comments

Comments

@hungerburg
Copy link

Not so much an issue with lunr, which is great! More a quick try to get ideas going…

In a shell with the jq utility I pull my terms from the lunr index in advance:
jq '[.index.invertedIndex[][0]|scan("^\\w{3,}")]|unique' index.json > iindex.json
I can feed that to http://api.jqueryui.com/autocomplete/ widget like below

   function normalize(str) {
      var map = { "ä": "a", "ö": "o", "ü": "u", "ß": "ss" };
      return str.replace(/[^A-Za-z0-9]/g,
         function(a) { return map[a]||a; }
      );
   }
   $.getJSON('iindex.json', function (tags) {
      $('#query').autocomplete({
         minLength: 3,
         source: function(inp, out) {
            var t = normalize(inp.term);
            var r = $.ui.autocomplete.filter(tags, t);
            out(r);
         }
      });
   });

Not fully nice, but works acceptably so far. Now truly nice would be to create the autocomplete index on the client and have the term to match processed by the indexer instead of that crude normalizer above.

@olivernn
Copy link
Owner

olivernn commented Aug 7, 2017

Sorry for the late reply.

You could definitely wrap that normalise function up into a lunr plugin. There is a similar project, lunr-unicode-normalizer, but I don't think it has been updated for lunr 2.

As for autocomplete, I need to get round to actually putting a demo of this together, but this is what I've been suggesting to people.

idx.query(function (q) {
  // exact matches should have the highest boost
  q.term(searchTerm, { boost: 100 })

  // prefix matches should be boosted slightly
  q.term(searchTerm, { boost: 10, usePipeline: false, wildcard: lunr.Query.wildcard.TRAILING })

  // finally, try a fuzzy search, without any boost
  q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
})

I disable the pipeline to prevent stemming getting in the way, you would have to experiment if this makes sense for your use case, especially if you wanted to add the unicode normalising plugin.

Additionally, when using the query method lunr won't be doing any tokenisation for you, you can either handle this your self, or borrow the lunr.tokenizer directly, or its regex to split into individual tokens.

@hungerburg
Copy link
Author

hungerburg commented Aug 8, 2017

Dear Oliver, no need to be sorry! With the help of your snippet I cobbled together a little more context, so that, in fact, I get around creating an extra iindex upfront but instead can feed the autocomplete from lunr itself and also can get rid of my own normalizer by using the pipeline:

   // LUNR query for autocomplete terms
   function acsearch(searchTerm) {
      var results = index.query(function (q) {
         // exact matches should have the highest boost
         q.term(searchTerm, { boost: 100 })
         // wildcard matches should be boosted slightly
         q.term(searchTerm, { boost: 10, usePipeline: true, wildcard: lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING })
         // finally, try a fuzzy search, without any boost
         q.term(searchTerm, { boost: 1, usePipeline: false, editDistance: 1 })
      });
      if (!results.length) { return ""; }
      return results.map(function(v, i, a) { // extract
         return Object.keys(v.matchData.metadata);
      }).reduce(function(a, b) { // flatten
         return a.concat(b);
      }).filter(function(v, i, a) { // uniq
         return a.indexOf(v) === i;
      });
   }

   // install jquery autocomplete widget
   $('#query').autocomplete({
      appendTo: '#dlg',
      minLength: 3,
      source: function(inp, out) {
         out(acsearch(inp.term.toLowerCase()));
      }
   });

The resulting list reads much the same -- though now it is sorted by score and some new fuzzy terms appear. Performance on my desktop is also good. I think this is showing all that can be pulled from an inverted index - anything more would need a mapping to the unstemmed original text and that is beyond the scope of this experiment, at least for now. Thank you very much indeed!

@olivernn
Copy link
Owner

@hungerburg glad you managed to make some progress. Indeed the stemming does make implementing really good autocomplete difficult.

Just thinking aloud, but you can get Lunr to create and store the mapping to the unstemmed words, though it will likely increase the index size considerably.

During indexing you can add a pipeline function that stores the unstemmed words as metadata for a token. The concept is discussed a bit in the guides.

You would need to add a pipeline function before the stemmer:

var storeUnstemmed = function (token) {
  token.metadata['unstemmed'] = token.toString()
  return token
}

The idea would be that you would then get the unstemmed words from the search results, the guide has a more detailed example.

First prize would be able to have an algorithmic unstemmer, but I don't think its possible.

@olivernn
Copy link
Owner

olivernn commented Jun 6, 2018

Closing this as it has gone stale, feel free to comment or re-open if you still have any unanswered questions.

@chrisbartley
Copy link

First, a HUGE thanks to both @hungerburg and @olivernn for this. I combined both suggestions and it's working great. For anyone wanting to do the same, this is what worked for me...

I'm indexing like this:

// Store unstemmed term in the metadata.  See:
// https://github.com/olivernn/lunr.js/issues/287#issuecomment-322573117
// https://lunrjs.com/guides/customising.html#token-meta-data
const storeUnstemmed = function(builder) {

   // Define a pipeline function that keeps the unstemmed word
   const pipelineFunction = function(token) {
      token.metadata['unstemmed'] = token.toString();
      return token;
   };

   // Register the pipeline function so the index can be serialised
   lunr.Pipeline.registerFunction(pipelineFunction, 'storeUnstemmed');

   // Add the pipeline function to both the indexing pipeline and the searching pipeline
   builder.pipeline.before(lunr.stemmer, pipelineFunction);

   // Whitelist the unstemmed metadata key
   builder.metadataWhitelist.push('unstemmed');
};

const index = lunr(function() {
   this.use(storeUnstemmed);
   ...
});

And modified the autocomplete function suggested by @hungerburg to use the unstemmed words like this:

autoComplete(searchTerm) {
   const results = this._index.query(function(q) {
      // exact matches should have the highest boost
      q.term(searchTerm, { boost : 100 })
      // wildcard matches should be boosted slightly
      q.term(searchTerm, {
         boost : 10,
         usePipeline : true,
         wildcard : lunr.Query.wildcard.LEADING | lunr.Query.wildcard.TRAILING
      })
      // finally, try a fuzzy search, without any boost
      q.term(searchTerm, { boost : 1, usePipeline : false, editDistance : 1 })
   });
   if (!results.length) {
      return "";
   }
   return results.map(function(v, i, a) { // extract unstemmed terms
      const unstemmedTerms = {};
      Object.keys(v.matchData.metadata).forEach(function(term) {
         Object.keys(v.matchData.metadata[term]).forEach(function(field) {
            v.matchData.metadata[term][field].unstemmed.forEach(function(word) {
               unstemmedTerms[word] = true;
            });
         });
      });
      return Object.keys(unstemmedTerms);
   }).reduce(function(a, b) { // flatten
      return a.concat(b);
   }).filter(function(v, i, a) { // uniq
      return a.indexOf(v) === i;
   });
}

Thanks!

johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019
Using advice from olivernn/lunr.js#287 to retain the
"autocomplete" behavior seen in earlier versions.
johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019
Using advice from olivernn/lunr.js#287 to retain the
"autocomplete" behavior seen in earlier versions.
johnfairh added a commit to realm/jazzy that referenced this issue Feb 8, 2019
Using advice from olivernn/lunr.js#287 to retain the
"autocomplete" behavior seen in earlier versions.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants