highlight search keywords #97

Closed
arussel opened this issue Jul 8, 2014 · 20 comments

@arussel

arussel commented Jul 8, 2014

I have a requirement that search keywords be highlighted in the found documents. Is there a way to do this with lunr.js at the moment?

@olivernn
Owner

There are a couple of issues requesting this feature. It involves quite a few changes to the way lunr works. I made a start on an implementation here but I haven't got round to completing it. I'll try and spend some more time with it this week and see if I can get something out for people to try.

@Qvatra

Qvatra commented Aug 21, 2014

Has this issue been solved already? I couldn't find a way to do that with lunr.js...

@olivernn
Owner

Sorry, still don't have a decent answer to this. It involves quite a lot of change to lunr and I'm not entirely convinced with the current implementation I put together (linked in the comment). I need to spend some more time thinking about how best to implement highlighting without sacrificing either index size or performance, and that takes some time!

@jonathanhudak

If it helps anyone I achieved this functionality by using BlastJS http://julian.com/research/blast/

@shobhitg

shobhitg commented Oct 1, 2015

@olivernn I understand you were trying to strike the right balance between index size and performance. Would it be possible for you to describe the approach you were trying or planning in the next branch?

@shobhitg

shobhitg commented Oct 1, 2015

@olivernn Is it possible to find out which stemmed word actually matched?

If yes, then I can easily use that information with the BlastJS library mentioned above by @hudakdidit

@olivernn
Owner

olivernn commented Oct 5, 2015

What I was trying to do was to wrap the token in a lunr.Token which would keep track of any extra metadata about the token that was picked up in the pipeline. One such piece of metadata could have been the position of the token in the original text.

It involved vast changes to the existing way lunr works, and in the end I think I decided that rather than try and retrofit this kind of feature into the existing architecture a bigger rethink was required. The problem with that is getting the time to really work through what a different architecture would look like, time I just haven't had :(

As for getting the stem that matched, you would basically have to re-implement parts of what the search function is currently doing:

  1. Run the pipeline on the search terms, this gives you the stemmed tokens
  2. Find the documents that contain each stemmed token using idx.tokenStore.get(stem)

There is no easier or more efficient way of doing this with the current setup of lunr.
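
The two steps above can be sketched without lunr itself. This is only an illustration under stated assumptions, not lunr's implementation: toyStem is a hypothetical stand-in for the stemmer in the pipeline, and the Map stands in for the token store that idx.tokenStore.get(stem) reads from.

```javascript
// Hypothetical stand-in for lunr's stemmer: strips a few common suffixes.
function toyStem(token) {
  return token.toLowerCase().replace(/(ing|ed|s)$/, '');
}

// Stand-in for the pipeline: split on whitespace, then stem each token.
function runPipeline(text) {
  return text.split(/\s+/).filter(Boolean).map(toyStem);
}

// Stand-in for the token store: maps each stem to the refs of the
// documents that contain it.
function buildTokenStore(docs) {
  var store = new Map();
  docs.forEach(function (doc) {
    runPipeline(doc.body).forEach(function (stem) {
      if (!store.has(stem)) store.set(stem, new Set());
      store.get(stem).add(doc.id);
    });
  });
  return store;
}

// Step 1: stem the query. Step 2: look up each stem in the store.
function matchedStems(store, query) {
  var matches = {};
  runPipeline(query).forEach(function (stem) {
    if (store.has(stem)) matches[stem] = Array.from(store.get(stem));
  });
  return matches;
}

var store = buildTokenStore([
  { id: 'a', body: 'searching the index' },
  { id: 'b', body: 'highlighted results' }
]);
var found = matchedStems(store, 'searched highlights');
```

The keys of found are the stems that matched, which is the information a separate highlighting library would need.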

@julkue

julkue commented Feb 3, 2016

I would like to implement this with a highlighting component. First, however, we need to make sure that the highlighted words and the words matched by lunr are exactly the same, otherwise the result is confusing for users. Therefore I created #200.

@drallgood

@julmot
I did it this way:

var queryTokens = lunr.tokenizer(request.term)
$.each(queryTokens, function(index, token) {
     pageContentElement.jmHighlight(token,{"className":"lunr-match-highlight"});
});

Seemed "good enough" to me. I could have also used the full Lunr pipeline to get the stemmed words, but then the highlights would look weird (e.g. you search for 'Persistence' and it highlights 'persist')

@julkue

julkue commented Feb 11, 2016

@drallgood Thanks for letting me know.

I don't think it would be weird, rather it would be consistent. Imagine a situation where a user searches for "Searched". Lunr will find files containing "searching". But a highlighting component will highlight nothing (different than expected), as there is not the exact term "searched" inside.

Do you know a way to get all the words that were found, including "searching" in this example?

@d0ugal
Contributor

d0ugal commented Mar 10, 2016

This looks like a duplicate of #25

@julkue

julkue commented Mar 10, 2016

@d0ugal There are a couple of issues that are similar here

@nknapp

nknapp commented Jul 18, 2016

@olivernn I'm not sure I understood your comment correctly, so the following may be the same thing you said. I think the Lucene way of highlighting terms is:

  • passing the query through the pipeline in order to get the stemmed search terms, and then
  • passing the found document through the pipeline (again) to match its words against the stemmed search terms. While doing this, it keeps track of the offsets of the matched tokens and uses those to highlight terms and to extract the snippet of the text that matches best.

I have no deep experience with lunr so far, but it seems to me that this approach would not require large refactorings of the code. Or I may be completely mistaken.
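
A rough sketch of that re-analysis approach, written without lunr so it stays self-contained. The names are hypothetical: toyStem stands in for whatever stemmer the pipeline uses, and the tokeniser is a plain regular expression rather than lunr.tokenizer. It covers the "Searched" vs "searching" case raised earlier in the thread.

```javascript
// Hypothetical stand-in for the stemmer in lunr's pipeline.
function toyStem(token) {
  return token.toLowerCase().replace(/(ing|ed|s)$/, '');
}

// Tokenise text while remembering where each word starts in the original string.
function tokensWithOffsets(text) {
  var tokens = [];
  var re = /\w+/g;
  var m;
  while ((m = re.exec(text)) !== null) {
    tokens.push({ word: m[0], start: m.index, stem: toyStem(m[0]) });
  }
  return tokens;
}

// Wrap every word whose stem equals a stemmed query term in <mark> tags.
function highlightStemmed(text, query) {
  var queryStems = new Set(tokensWithOffsets(query).map(function (t) {
    return t.stem;
  }));
  var out = '';
  var cursor = 0;
  tokensWithOffsets(text).forEach(function (t) {
    if (!queryStems.has(t.stem)) return;
    // Copy the untouched text before the match, then the marked word.
    out += text.slice(cursor, t.start) + '<mark>' + t.word + '</mark>';
    cursor = t.start + t.word.length;
  });
  return out + text.slice(cursor);
}

var highlighted = highlightStemmed('Try searching the docs.', 'Searched');
```

The sketch does not HTML-escape the text, so real page content would need escaping before the tags are inserted.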

@Bahar1978

Hello @drallgood ,
Could you please let me know how you could highlight the search terms?

@drallgood

@hajarghaem
Sure.

The basic idea is as follows:

  1. Get all documents that match
  2. Reduce the set to the number of results you'd like to show
  3. Get the content for those matching documents
  4. Use mark.js (formerly known as jmHighlight) to highlight the keywords in those documents
  5. Clean up the documents so that you'll only show a small portion of highlighted text.
  6. Append the resulting content to your search results
  7. repeat 4-6 for all documents in your result

Some code (this is actually embedded in a jQuery autocomplete definition):

      // Stemmed query tokens, so the highlighter looks for what lunr matched
      var queryTokens = lunrIndex.pipeline.run(lunr.tokenizer(request.term));
      // Keep the top 10 results and look up each document by its ref
      var resultSet = _.chain(lunrIndex.search(request.term)).take(10).pluck('ref').map(function(ref) {
        return lunrData.docs[ref];
      }).value();

      // Fetch the pages one at a time and build an excerpt for each
      resultSet.reduce(function(sequence, item) {
        return sequence.then(function() {
          return $.get(item.url);
        }).then(function( data ) {
            item.excerpt = '';
            var pageContent = $.parseHTML(data);
            var pageContentElement = $(pageContent).filter(".doc-body");

            $.each(queryTokens, function(index, token) {
              pageContentElement.jmHighlight(token,{"className":"lunr-match-highlight"});
            });

            // Use the first four highlights, each with some surrounding words
            pageContentElement.find(".lunr-match-highlight").slice(0,4).each(function(index, blastElement){
              // A sibling can be missing or a non-text node, so fall back to ''
              var previousNode = (blastElement.previousSibling && blastElement.previousSibling.nodeValue) || '';
              var nextNode = (blastElement.nextSibling && blastElement.nextSibling.nodeValue) || '';
              var wordsBefore = _.escape(previousNode.split(' ').slice(-10).join(' '));
              var wordsAfter = _.escape(nextNode.split(' ').slice(0,10).join(' '));

              if(nextNode.endsWith(" ")) {
                wordsBefore += " ";
              }

              var text = wordsBefore + blastElement.outerHTML + wordsAfter;
              item.excerpt += '<p class="lunr-match-highlight_result">'+text+"</p>";
            });
        });
      }, $.when()); // start the chain from an already-resolved promise

Probably not the nicest code, but it works ;)
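
Step 5 (showing only a small portion of text around each highlight) can also be done on a plain string, without the DOM. The following is a minimal sketch, not part of the code above: excerptAround and its parameters are hypothetical names, and it assumes the start offset and length of the match are already known.

```javascript
// Return the match at [start, start + length) wrapped in <mark> tags,
// with up to `radius` whitespace-separated words kept on each side.
function excerptAround(text, start, length, radius) {
  var match = text.slice(start, start + length);
  // The last element of `before` is whatever touches the match directly
  // (often ''), so keep radius + 1 elements to get `radius` whole words.
  var before = text.slice(0, start).split(/\s+/).slice(-(radius + 1));
  // Likewise, the first element of `after` touches the match (e.g. a comma).
  var after = text.slice(start + length).split(/\s+/).slice(0, radius + 1);
  return before.join(' ') + '<mark>' + match + '</mark>' + after.join(' ');
}

var text = 'one two three searching, four five six';
var excerpt = excerptAround(text, text.indexOf('searching'), 'searching'.length, 2);
```

Runs of whitespace in the context are collapsed to single spaces, and the output is not HTML-escaped.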

@olivernn
Owner

The latest version of Lunr does provide support for highlighting matches in documents. There is a demo showing this in action.

To be clear, Lunr does not provide the actual highlighting, but it is now able to return the positions of keywords that did match. This should enable the use of other libraries to perform the highlighting of terms in a page.

Please try it out and let me know any feedback.
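
For anyone wiring this up to a highlighter by hand, here is a minimal sketch of turning positions into markup. It assumes each position is a [start, length] pair of character offsets into the original field text, which is how Lunr 2.x reports them as far as I know; markPositions is a hypothetical helper, not part of Lunr.

```javascript
// Wrap each [start, length] span of `text` in <mark> tags.
// Assumes the spans do not overlap.
function markPositions(text, positions) {
  // Sort a copy by start offset so spans can be applied left to right.
  var sorted = positions.slice().sort(function (a, b) {
    return a[0] - b[0];
  });
  var out = '';
  var cursor = 0;
  sorted.forEach(function (pos) {
    var start = pos[0];
    var end = start + pos[1];
    out += text.slice(cursor, start) + '<mark>' + text.slice(start, end) + '</mark>';
    cursor = end;
  });
  return out + text.slice(cursor);
}

var marked = markPositions('green plants and green cars', [[17, 5], [0, 5]]);
```

The text is not HTML-escaped here, so escape it first if it comes from untrusted content.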

@JinxMan25

I can't seem to get the position of the terms returned from the result in the metaData attribute. How can I get the position?

@olivernn
Owner

@clanofnoobs please open a new issue showing what you've tried and I'll take a look.

I'm closing this issue now as there is support for highlighting terms with lunr. If there are problems with getting highlighting to work they should be considered bugs and a new issue should be opened.

@edave

edave commented Nov 26, 2018

For anyone who comes across this, in reference to @clanofnoobs's question, the position must be whitelisted in the metadata when the index is constructed (within the passed-in function), like so:

this.metadataWhitelist = ['position']

From the bottom of https://lunrjs.com/guides/core_concepts.html
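
A slightly fuller sketch of where that line goes and how the positions come back out. The field names id and body and the helper names buildIndex and positionsFor are assumptions made for illustration. lunr is passed in as an argument so the snippet does not depend on how the library is loaded, and sampleResult is a hand-written object in the shape I understand a Lunr 2.x search result to have.

```javascript
// Build an index that records match positions for the `body` field.
function buildIndex(lunr, docs) {
  return lunr(function () {
    this.ref('id');
    this.field('body');
    // Without this whitelist, position metadata is not kept in the index.
    this.metadataWhitelist = ['position'];
    docs.forEach(function (doc) {
      this.add(doc);
    }, this);
  });
}

// Collect every [start, length] pair recorded for `field` in one search result.
function positionsFor(result, field) {
  var metadata = result.matchData.metadata;
  var positions = [];
  Object.keys(metadata).forEach(function (term) {
    var fieldData = metadata[term][field];
    if (fieldData && fieldData.position) {
      positions = positions.concat(fieldData.position);
    }
  });
  return positions;
}

// Hand-written stand-in for one entry of idx.search('searching').
var sampleResult = {
  ref: '1',
  score: 0.5,
  matchData: { metadata: { search: { body: { position: [[4, 9]] } } } }
};
var positions = positionsFor(sampleResult, 'body');
```

With a real index, the same call would be positionsFor(idx.search(query)[0], 'body').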

@manuadappt

@drallgood What is "request.term" in the lunr.tokenizer call in your snippet above?
