Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Encoded sequences can clash with separator byte and cause assertion errors #85

Closed
danielnaber opened this issue Nov 5, 2016 · 6 comments
Assignees
Milestone

Comments

@danielnaber
Copy link
Collaborator

This causes the assert to fail in TrimSuffixEncoder.java:60:

Dictionary d = Dictionary.read(Paths.get("/tmp/org/languagetool/resource/de/german_synth.dict"));
DictionaryLookup dict = new DictionaryLookup(d);
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT");  // works
dict.lookup("anfragen|DOESNOTEXIST");  // works
dict.lookup("anfragen|VER:1:PLU:KJ1:SFT:NEB");  // AssertionError at TrimSuffixEncoder.java:60

To reproduce, you can get the german_synth.dict from http://search.maven.org/remotecontent?filepath=de/danielnaber/german-pos-dict/1.0/german-pos-dict-1.0.jar

Stacktrace is this:

java.lang.AssertionError
	at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:60)
	at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
	at org.languagetool.CrashTest.test(CrashTest.java:18)

In a less stripped down case this shows as (at least I assume this is the same issue):

Caused by: java.lang.IndexOutOfBoundsException
	at java.nio.Buffer.checkIndex(Buffer.java:540)
	at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
	at morfologik.stemming.TrimSuffixEncoder.decode(TrimSuffixEncoder.java:62)
	at morfologik.stemming.DictionaryLookup.lookup(DictionaryLookup.java:217)
@dweiss
Copy link
Member

dweiss commented Nov 6, 2016

I'll look into this, thanks Daniel.

@dweiss
Copy link
Member

dweiss commented Nov 7, 2016

Confirmed. The encoded length of the suffix to strip clashes with separator character. Thinking on how to fix this.

@dweiss dweiss added this to the 2.1.2 milestone Nov 7, 2016
@dweiss dweiss self-assigned this Nov 7, 2016
@dweiss
Copy link
Member

dweiss commented Nov 7, 2016

Daniel, the simplest workaround for this problem is currently to recompile the dictionary from scratch and substitute the separator character for something with a higher value. Good examples would be ',' or '\t' since these fall below 'A' in ascii range and would wrap around to much longer suffixes.

I didn't invent this encoding scheme, it's a legacy from Janek Daciuk's work. A proper fix would involve some bookkeeping on the encoded string so that we don't accidentally trip on separator character (the input must not contain it, this is verified).

@dweiss dweiss changed the title assert in TrimSuffixEncoder.java:60 Encoded sequences can clash with separator byte and cause assertion errors Nov 7, 2016
@dweiss dweiss closed this as completed in efc065b Nov 7, 2016
dweiss added a commit that referenced this issue Nov 7, 2016
@dweiss
Copy link
Member

dweiss commented Nov 7, 2016

Fixed in 2.1.2. Will publish asap. Thanks for a detailed repro, Daniel!

@danielnaber
Copy link
Collaborator Author

Thanks for the fast fix, I can confirm it fixes the issue - also without re-building the dictionary, or do you still recommend that?

@dweiss
Copy link
Member

dweiss commented Nov 7, 2016

The fix will work with your existing dictionary if you can/ are willing to upgrade. If you want to stick to 2.1.1, you'll have to rebuild the dictionary with a different separator, unfortunately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants