Fixing bug with Snippeter and German 'ß' character #34

Merged
merged 2 commits into from

3 participants

@searchify

When the Snippeter ran on a doc containing a word like 'fußball' at the end of the field, it triggered a StringIndexOutOfBoundsException. This happened because, during tokenization, ASCIIFoldingFilter changed the 'ß' character into 'ss'. The Snippeter then computed the end of the highlight from the length of the expanded token ('fussball', 8 characters) rather than from the original text ('fußball', 7 characters), so it read one character past the end of the field.
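The failure described above can be reproduced without Lucene. The sketch below is only an illustration: `FoldedLengthDemo` and its `fold` helper are hypothetical stand-ins (the real ASCIIFoldingFilter handles far more characters), and the two offset helpers mimic the old and new end-offset calculations rather than reproducing the IndexTank code.

```java
import java.util.Locale;

// Hypothetical demo class; not part of IndexTank or Lucene.
class FoldedLengthDemo {

    // Stand-in for ASCIIFoldingFilter: lowercases and expands only U+00DF ('ß') to "ss".
    static String fold(String token) {
        return token.toLowerCase(Locale.ROOT).replace("\u00df", "ss");
    }

    // Old approach (assumed): end = start + length of the folded token text.
    static int endFromFoldedLength(int start, String originalToken) {
        return start + fold(originalToken).length();
    }

    // Fixed approach: end offset measured in the original, unfolded text.
    static int endFromOriginal(int start, String originalToken) {
        return start + originalToken.length();
    }

    public static void main(String[] args) {
        String word = "Fu\u00dfball";
        String text = "der " + word;          // the token sits at the very end of the field
        int start = text.indexOf(word);       // 4

        int goodEnd = endFromOriginal(start, word);      // 4 + 7 = 11, equal to text.length()
        int badEnd = endFromFoldedLength(start, word);   // 4 + 8 = 12, one past the end

        System.out.println("original end offset: " + goodEnd);
        try {
            text.substring(start, badEnd);
        } catch (StringIndexOutOfBoundsException e) {
            System.out.println("folded length overshoots: end " + badEnd
                    + " > text length " + text.length());
        }
    }
}
```

When the token is not the last one in the field, the same miscalculation does not throw; it would instead pull one extra character into the highlight, which is why the test in this pull request also checks the spaces around the highlighted term.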

I added a testcase to demonstrate the problem. Please do a sanity check code read :)

-chris

@iladriano
Owner

Originally, we had the same code as this pull, but that was changed for prefix search. Can you test that case?
CC @iperez

@searchify

I added a testcase for prefix search "fu*" and it passes. Here's the code - am I covering the case you're concerned about?

(added to the end of SnippetSearcherTest.testTokenizingChangesTokenLength()):

query = new Query(new PrefixTermQuery("text", "fu"), "fu*", null);
srs = searcher.search(query, 0, 1, 0, ImmutableMap.of("snippet_fields", "text", "snippet_type", "html"));
sr = srs.getResults().iterator().next();
snippet = sr.getField("snippet_text");
assertNotNull("Snippet is null", snippet);
assertTrue("Search term not highlighted", snippet.contains("<b>Fu&szlig;ball</b>"));
assertTrue("Snippet lost space before highlighted term", snippet.contains("der "));
assertTrue("Snippet lost space after highlighted term", snippet.contains(" player"));
@iladriano
Owner

Yes, exactly.

However, in my test I lost the ability to do prefix "partial highlighting". For "Fu*" the expected highlight would be: "<b>Fu</b>ssball".

Should the endOffset be corrected for prefix matching using the original query token.len - 1? Or should the prefix "partial highlighting" feature be removed completely? In any case, there's a bit more work to do here (clean up or handle prefix).

Edit: matches would need to be more than just a Pair.

@searchify

I spent some time trying to keep the prefix partial highlighting, like "<b>Fu</b>ssball". In order to do it correctly in every case, including queries such as "fuß*", I think the Snippeter code would have to know the original (untokenized) query text. This is because tokenization can arbitrarily change the length of the terms (ASCIIFoldingFilter). I'm not sure if it's worth that much change.
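To make the length problem concrete, here is a small sketch of partial highlighting driven by the folded prefix length. It assumes, as the comment above suggests, that the Snippeter only sees the tokenized query; `PrefixFoldDemo`, `fold`, and `partialHighlight` are hypothetical names, and HTML escaping is left out.

```java
import java.util.Locale;

// Hypothetical demo class; not part of IndexTank or Lucene.
class PrefixFoldDemo {

    // Stand-in for ASCIIFoldingFilter: lowercases and expands only U+00DF ('ß') to "ss".
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT).replace("\u00df", "ss");
    }

    // Highlights as many characters of the original text as the FOLDED prefix is long.
    // This is the step that goes wrong once folding changes the prefix length.
    static String partialHighlight(String text, int tokenStart, String queryPrefix) {
        int len = fold(queryPrefix).length();
        return text.substring(0, tokenStart)
                + "<b>" + text.substring(tokenStart, tokenStart + len) + "</b>"
                + text.substring(tokenStart + len);
    }

    public static void main(String[] args) {
        String text = "der Fu\u00dfball player";
        int start = 4; // offset of the matching token in the original text

        // "fu" folds to "fu" (2 chars): exactly "Fu" is highlighted.
        System.out.println(partialHighlight(text, start, "fu"));

        // "fuß" folds to "fuss" (4 chars), but the prefix covers only 3 chars of the
        // original text, so the highlight swallows the following 'b' as well.
        System.out.println(partialHighlight(text, start, "fu\u00df"));
    }
}
```

Getting the second case right would require mapping folded lengths back to original offsets, or carrying the untokenized query text into the Snippeter.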

I think highlighting the whole term that matched due to a prefix query is fine as far as usability goes. What do you think? If so, I can clean it up and remove the partial highlighting case.
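As a rough sketch of the whole-term approach, the code below matches against the folded token text but cuts the highlight out of the original text using each token's start and end offsets. Everything here is hypothetical scaffolding (`WholeTermHighlightDemo`, `Tok`, the whitespace tokenizer, the simplified `fold`); it does not HTML-escape its output the way the real snippet code does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical demo class; not part of IndexTank or Lucene.
class WholeTermHighlightDemo {

    // Folded token text plus its offsets into the ORIGINAL field text.
    static final class Tok {
        final String folded;
        final int start;
        final int end;

        Tok(String folded, int start, int end) {
            this.folded = folded;
            this.start = start;
            this.end = end;
        }
    }

    // Stand-in for ASCIIFoldingFilter: lowercases and expands only U+00DF ('ß') to "ss".
    static String fold(String s) {
        return s.toLowerCase(Locale.ROOT).replace("\u00df", "ss");
    }

    // Whitespace tokenizer that records original offsets before folding.
    static List<Tok> tokenize(String text) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) i++;
            int start = i;
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) i++;
            if (i > start) out.add(new Tok(fold(text.substring(start, i)), start, i));
        }
        return out;
    }

    // A trailing '*' means prefix match; otherwise the folded terms must be equal.
    static boolean matches(String termInQuery, String termInText) {
        if (termInQuery.endsWith("*")) {
            return termInText.startsWith(termInQuery.substring(0, termInQuery.length() - 1));
        }
        return termInQuery.equals(termInText);
    }

    // Wraps every matching token, in full, using offsets into the original text.
    static String highlight(String text, String termInQuery) {
        StringBuilder buff = new StringBuilder();
        int current = 0;
        for (Tok t : tokenize(text)) {
            if (matches(termInQuery, t.folded)) {
                buff.append(text, current, t.start);
                buff.append("<b>").append(text, t.start, t.end).append("</b>");
                current = t.end;
            }
        }
        buff.append(text, current, text.length());
        return buff.toString();
    }

    public static void main(String[] args) {
        String text = "Clown Ferdinand und der Fu\u00dfball";
        System.out.println(highlight(text, "fussball"));
        System.out.println(highlight(text, "fu*"));
    }
}
```

Because the highlight boundaries never depend on the length of the folded token or of the query, the exact and prefix cases share one code path and cannot run past the end of the field.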

@iladriano
Owner

I agree.

@searchify

that last commit should implement what we discussed (removing partial highlighting for prefix matches)

@iladriano iladriano merged commit 4a66b93 into from
@searchify

thanks Adrián!

Commits on Apr 4, 2012
  1. @clamprecht
Commits on Apr 6, 2012
  1. @clamprecht
11 src/main/java/com/flaptor/indextank/search/SnippetSearcher.java
@@ -157,9 +157,8 @@ private String snippet(Set<TermQuery> terms, String fieldName, String text) {
String termInText = tokens.get(i).getText();
for (String termInQuery : termsForField) {
- if (termInQuery.endsWith("*") && termInText.startsWith(termInQuery.substring(0, termInQuery.length() - 1))) {
- matches.add(new Pair<Integer, Integer>(i, termInQuery.length() - 1));
- } else if (termInQuery.equals(termInText)) {
+ if ((termInQuery.endsWith("*") && termInText.startsWith(termInQuery.substring(0, termInQuery.length() - 1)))
+ || termInQuery.equals(termInText)) {
matches.add(new Pair<Integer, Integer>(i, termInText.length()));
}
}
@@ -189,9 +188,11 @@ private String mark(Window window, String text) {
for (Pair<AToken, Integer> token : window.matches) {
escapeAndAppend(buff, text, current, token.first().getStartOffset());
buff.append(open);
- escapeAndAppend(buff, text, token.first().getStartOffset(), token.first().getStartOffset() + token.last());
+ int start = token.first().getStartOffset();
+ int endOffset = token.first().getEndOffset();
+ escapeAndAppend(buff, text, start, endOffset);
buff.append(close);
- current = token.first().getStartOffset() + token.last();
+ current = endOffset;
}
// let subclasses handle where snippets end
33 src/test/java/com/flaptor/indextank/search/SnippetSearcherTest.java
@@ -28,6 +28,7 @@
import com.flaptor.indextank.index.Document;
import com.flaptor.indextank.index.IndexEngine;
import com.flaptor.indextank.query.ParseException;
+import com.flaptor.indextank.query.PrefixTermQuery;
import com.flaptor.indextank.query.AndQuery;
import com.flaptor.indextank.query.Query;
import com.flaptor.indextank.query.TermQuery;
@@ -138,6 +139,38 @@ public void testEncodesHTMLonEnd() throws IOException, InterruptedException {
assertTrue("less-than signs not encoded!", sr.getField("snippet_text").contains("&lt;"));
}
+ @TestInfo(testType=UNIT)
+ public void testTokenizingChangesTokenLength() throws IOException, InterruptedException, ParseException {
+ double timestampBoost = System.currentTimeMillis() / 1000.0;
+ String docid = "docid";
+ // \u00df is 'LATIN SMALL LETTER SHARP S'
+ // ASCIIFoldingFilter converts it from 'ß' to 'ss'
+ // see http://www.fileformat.info/info/unicode/char/df/index.htm
+ String text = "Clown Ferdinand und der Fu\u00dfball player";
+ Document doc = new Document(ImmutableMap.of("text", text));
+ indexer.add(docid, doc, (int)timestampBoost, Maps.<Integer, Double>newHashMap());
+
+ String queryText = "fussball";
+ Query query = new Query(new TermQuery("text", queryText), queryText, null);
+
+ SearchResults srs = searcher.search(query, 0, 1, 0, ImmutableMap.of("snippet_fields", "text", "snippet_type", "html"));
+ SearchResult sr = srs.getResults().iterator().next();
+ String snippet = sr.getField("snippet_text");
+ assertNotNull("Snippet is null", snippet);
+ assertTrue("Search term not highlighted", snippet.contains("<b>Fu&szlig;ball</b>"));
+ assertTrue("Snippet lost space before highlighted term", snippet.contains("der "));
+ assertTrue("Snippet lost space after highlighted term: " + snippet, snippet.contains(" player"));
+
+ query = new Query(new PrefixTermQuery("text", "fu"), "fu*", null);
+
+ srs = searcher.search(query, 0, 1, 0, ImmutableMap.of("snippet_fields", "text", "snippet_type", "html"));
+ sr = srs.getResults().iterator().next();
+ snippet = sr.getField("snippet_text");
+ assertNotNull("Snippet is null", snippet);
+ assertTrue("Search term not highlighted", snippet.contains("<b>Fu&szlig;ball</b>"));
+ assertTrue("Snippet lost space before highlighted term", snippet.contains("der "));
+ assertTrue("Snippet lost space after highlighted term", snippet.contains(" player"));
+ }
@TestInfo(testType=UNIT)
public void testFetchAll() throws IOException, InterruptedException {