Fixing bug with Snippeter and German 'ß' character #34

merged 2 commits into from Apr 10, 2012

3 participants


When the snippeter ran on a doc containing a word like 'fußball' at the end of the field, it triggered a StringIndexOutOfBoundsException. It happened because during tokenization, ASCIIFoldingFilter changed the 'ß' character into 'ss'. But the Snippeter was calculating the length using the end offset from the expanded string ('fussball'), causing the error.

I added a testcase to demonstrate the problem. Please do a sanity check code read :)


LinkedIn member

Originally, we had the same code as this pull. But that was changed for prefix search, can you test that case?
CC @iperez


I added a testcase for prefix search "f*" and it passes. Here's the code - am I covering the case you're concerned about?

(added to the end of SnippetSearchTest.testTokenizingChangesTokenLength()):

query = new Query(new PrefixTermQuery("text", "fu"), "fu*", null);
srs =, 0, 1, 0, ImmutableMap.of("snippet_fields", "text", "snippet_type", "html"));
sr = srs.getResults().iterator().next();
snippet = sr.getField("snippet_text");
assertNotNull("Snippet is null", snippet);
assertTrue("Search term not highlighted", snippet.contains("<b>Fu&szlig;ball</b>"));
assertTrue("Snippet lost space before highlighted term", snippet.contains("der "));
assertTrue("Snippet lost space after highlighted term", snippet.contains(" player"));
LinkedIn member

Yes, exactly.

However, in my test I lost the ability to do prefix "partial highlighting". For "Fu*" the expected highlight would be: "Fussball".

Should the endOffset be corrected for prefix matching using the original query token.len -1? Or should be prefix "partial highlighting" feature completely removed? In any case, there's a bit more work to do here (clean up or handle prefix).

Edit: matches would need to be more than just a Pair.


I spent some time trying to keep the prefix partial highlighting, like "Fussball". In order to do it correctly in every case, including queries such as "fuß*", I think the Snippeter code would have to know the original (untokenized) query text. This is because tokenization can arbitrary change the length of the terms (ASCIIFoldingFilter). I'm not sure if it's worth that much change.

I think highlighting the whole term that matched due to a prefix query is fine as far as usability goes. What do you think? If so, I can clean it up and remove the partial highlighting case.

LinkedIn member

I agree.


that last commit should implement what we discussed (removing partial highlighting for prefix matches)

@iladriano iladriano merged commit 4a66b93 into linkedin:master Apr 10, 2012

thanks Adrián!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment