When the snippeter ran on a doc containing a word like 'fußball' at the end of the field, it triggered a StringIndexOutOfBoundsException. It happened because during tokenization, ASCIIFoldingFilter changed the 'ß' character into 'ss'. But the Snippeter was calculating the length using the end offset from the expanded string ('fussball'), causing the error.
I added a testcase to demonstrate the problem. Please do a sanity check code read :)
Fixing bug with Snippeter triggered by characters folded by ASCIIFold…
…ingFilter, plus testcase
Originally, we had the same code as this pull. But that was changed for prefix search, can you test that case?
I added a testcase for prefix search "f*" and it passes. Here's the code - am I covering the case you're concerned about?
(added to the end of SnippetSearchTest.testTokenizingChangesTokenLength()):
query = new Query(new PrefixTermQuery("text", "fu"), "fu*", null);
srs = searcher.search(query, 0, 1, 0, ImmutableMap.of("snippet_fields", "text", "snippet_type", "html"));
sr = srs.getResults().iterator().next();
snippet = sr.getField("snippet_text");
assertNotNull("Snippet is null", snippet);
assertTrue("Search term not highlighted", snippet.contains("<b>Fußball</b>"));
assertTrue("Snippet lost space before highlighted term", snippet.contains("der "));
assertTrue("Snippet lost space after highlighted term", snippet.contains(" player"));
However, in my test I lost the ability to do prefix "partial highlighting". For "Fu*" the expected highlight would be: "Fussball".
Should the endOffset be corrected for prefix matching using the original query token.len -1? Or should be prefix "partial highlighting" feature completely removed? In any case, there's a bit more work to do here (clean up or handle prefix).
Edit: matches would need to be more than just a Pair.
I spent some time trying to keep the prefix partial highlighting, like "Fussball". In order to do it correctly in every case, including queries such as "fuß*", I think the Snippeter code would have to know the original (untokenized) query text. This is because tokenization can arbitrary change the length of the terms (ASCIIFoldingFilter). I'm not sure if it's worth that much change.
I think highlighting the whole term that matched due to a prefix query is fine as far as usability goes. What do you think? If so, I can clean it up and remove the partial highlighting case.
Removing prefix search partial highlighting code that is no longer ne…
that last commit should implement what we discussed (removing partial highlighting for prefix matches)