Encoded sequences can clash with separator byte and cause assertion errors #85

danielnaber · 2016-11-05T15:21:29Z

This causes the assert to fail in TrimSuffixEncoder.java:60:

Dictionary d = Dictionary.read(Paths.get("/tmp/org/languagetool/resource/de/german_synth.dict"));
DictionaryLookup dict = new DictionaryLookup(d);
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");  // works
dict.lookup("anfragen|DOESNOTEXIST");  // works
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT:NEB");  // AssertionError at TrimSuffixEncoder.java:60

To reproduce, you can get the german_synth.dict from http://search.maven.org/remotecontent?filepath=de/danielnaber/german-pos-dict/1.0/german-pos-dict-1.0.jar

Stacktrace is this:

java.lang.AssertionError
	at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:60)
	at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
	at org.languagetool.CrashTest.test(CrashTest.java:18)

In a less stripped down case this shows as (at least I assume this is the same issue):

Caused by: java.lang.IndexOutOfBoundsException
	at java.nio.Buffer.checkIndex(Buffer.java:540)
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
	at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:62)
	at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)

The text was updated successfully, but these errors were encountered:

dweiss · 2016-11-06T10:51:28Z

I'll look into this, thanks Daniel.

dweiss · 2016-11-07T09:11:04Z

Confirmed. The encoded length of the suffix to strip clashes with separator character. Thinking on how to fix this.

dweiss · 2016-11-07T09:35:59Z

Daniel, the simplest workaround for this problem is currently to recompile the dictionary from scratch and substitute the separator character for something with a higher value. Good examples would be ',' or '\t' since these fall below 'A' in ascii range and would wrap around to much longer suffixes.

I didn't invent this encoding scheme, it's a legacy from Janek Daciuk's work. A proper fix would involve some bookkeeping on the encoded string so that we don't accidentally trip on separator character (the input must not contain it, this is verified).

dweiss · 2016-11-07T10:08:49Z

Fixed in 2.1.2. Will publish asap. Thanks for a detailed repro, Daniel!

danielnaber · 2016-11-07T12:59:26Z

Thanks for the fast fix, I can confirm it fixes the issue - also without re-building the dictionary, or do you still recommend that?

dweiss · 2016-11-07T13:01:31Z

The fix will work with your existing dictionary if you can/ are willing to upgrade. If you want to stick to 2.1.1, you'll have to rebuild the dictionary with a different separator, unfortunately.

danielnaber mentioned this issue Nov 5, 2016

assertion failure in morfologik languagetool-org/languagetool#586

Closed

dweiss added bug refactoring labels Nov 7, 2016

dweiss added this to the 2.1.2 milestone Nov 7, 2016

dweiss self-assigned this Nov 7, 2016

dweiss changed the title ~~assert in TrimSuffixEncoder.java:60~~ Encoded sequences can clash with separator byte and cause assertion errors Nov 7, 2016

dweiss closed this as completed in efc065b Nov 7, 2016

dweiss added a commit that referenced this issue Nov 7, 2016

GH-85: Add changes entry.

b02b62a

jaumeortola mentioned this issue Nov 29, 2016

Speller: words containing the dictionary separator are not handled properly #86

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Encoded sequences can clash with separator byte and cause assertion errors #85

Encoded sequences can clash with separator byte and cause assertion errors #85

danielnaber commented Nov 5, 2016

dweiss commented Nov 6, 2016

dweiss commented Nov 7, 2016

dweiss commented Nov 7, 2016

dweiss commented Nov 7, 2016

danielnaber commented Nov 7, 2016

dweiss commented Nov 7, 2016

Encoded sequences can clash with separator byte and cause assertion errors #85

Encoded sequences can clash with separator byte and cause assertion errors #85

Comments

danielnaber commented Nov 5, 2016

dweiss commented Nov 6, 2016

dweiss commented Nov 7, 2016

dweiss commented Nov 7, 2016

dweiss commented Nov 7, 2016

danielnaber commented Nov 7, 2016

dweiss commented Nov 7, 2016