Strange issue with VON_NOETEN rule, spell checker deletes dot. #273

Closed
janschreiber opened this Issue Jun 2, 2015 · 10 comments

janschreiber commented Jun 2, 2015

[Screenshot: lt-bug]
Here is a strange issue I found today. If I type "von nöten" into the standalone GUI version of LT 2.9, this triggers both the (new) rule VON_NOETEN and the spelling rule, as expected, because 'Nöten' is a noun and should be uppercase in German.
Here's the issue:
(1) For some reason, the dot at the sentence end is included in the selection for replacement by the spelling rule. This is somewhat unexpected, but as long as the suggestion includes the dot, it doesn't do any harm.
(2) The screenshot shows that only the first two suggestions (which are, strangely, identical btw) include the dot. If I choose one of the other suggestions, the dot is deleted.
This is not a big deal, but it can transform a correct sentence into an incorrect one. And it is strange: Why would the spell checker mark punctuation for replacement at all?

@janschreiber janschreiber added the German label Jun 2, 2015

janschreiber commented Jul 4, 2015

A clarification: This problem actually has nothing to do with the VON_NOETEN rule. It seems to occur whenever a spelling error is the last word of a sentence and the sentence ends with a period. Sentences ending with a question mark or exclamation mark are unaffected. For example, it is triggered by the sentence "Das müssen wir um jeden Preis verhindenn."
It also affects both the GUI version and the web interface on the LT homepage.

Xaratas commented Dec 10, 2015

It's the same with the command line. Even if the word is added to the Hunspell user dictionary spelling.txt, LanguageTool includes the dot in the checked word and marks it as an error.
(Galawain is a god in the game Pillars of Eternity.)

xar@Y4d:~/gitrepos/poe_translation/translation_helper$ echo "Das Haus von Galawain." | java -jar ../../../Downloads/LanguageTool-3.1/languagetool-commandline.jar -l de-DE                     
Expected text language: German (Germany)
Working on STDIN...
1.) Line 1, column 14, Rule ID: GERMAN_SPELLER_RULE
Message: Möglicher Rechtschreibfehler gefunden
Suggestion: Galawain; Galawains; Galawein; Galamain; Galawand; Galamainz; Galawind; Galaweine; Galahain; Galahaine; Galahains; Galamains; Galarain; Galaraine; Galarains; Galasaint; Galawahn; Galawahns; Galawaid; Galawaids
Das Haus von Galawain. 
             ^^^^^^^^^ 

Xaratas commented Dec 10, 2015

It looks like the fix would be to change line 124 of this file:

return (isAlphabetic && !word.equals("--") && hunspellDict.misspelled(word)) || isProhibited(removeTrailingDot(word));

Suggested fix:

return (isAlphabetic && !word.equals("--") && hunspellDict.misspelled(removeTrailingDot(word))) || isProhibited(removeTrailingDot(word));
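
To illustrate why stripping the dot before the lookup matters, here is a minimal sketch. It is not the LanguageTool code: the class `TrailingDotCheck` and its word set are hypothetical stand-ins for the Hunspell dictionary, and `removeTrailingDot` is only assumed to drop a single trailing period.

```java
import java.util.Set;

// Hypothetical stand-in for the speller: a plain word set replaces the Hunspell dictionary.
class TrailingDotCheck {
    private final Set<String> knownWords;

    TrailingDotCheck(Set<String> knownWords) {
        this.knownWords = knownWords;
    }

    // Drops one trailing "." if present (assumed behaviour of the helper named in the thread).
    static String removeTrailingDot(String word) {
        return word.endsWith(".") ? word.substring(0, word.length() - 1) : word;
    }

    // Looks the token up exactly as received, so "Galawain." misses the entry "Galawain".
    boolean misspelledRaw(String word) {
        return !knownWords.contains(word);
    }

    // Strips the dot before the lookup, as in the suggested fix.
    boolean misspelledStripped(String word) {
        return !knownWords.contains(removeTrailingDot(word));
    }

    public static void main(String[] args) {
        TrailingDotCheck check = new TrailingDotCheck(Set.of("Galawain"));
        System.out.println(check.misspelledRaw("Galawain."));      // true
        System.out.println(check.misspelledStripped("Galawain.")); // false
    }
}
```

With the raw lookup, a user-dictionary entry without a dot never matches the token "Galawain.", which is the behaviour reported above.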

danielnaber commented Dec 10, 2015

@Xaratas I cannot reproduce the problem you describe with the current version from git, this seems to be fixed already.

Xaratas commented Dec 10, 2015

@danielnaber Checked against the latest snapshot, LanguageTool-20151209-snapshot.
It's fixed if the word is added to the Hunspell spelling.txt.
A much less important remaining issue: if the word is unknown, the error marker still includes the period.

Das Haus von Galawain. 
             ^^^^^^^^^ 

janschreiber commented Feb 12, 2016

The command line version of Hunspell shows the same behavior, i.e. it effectively deletes the period from the end of the sentence. This was probably introduced about ten years ago to accommodate the automatic period handling in LibreOffice (and OOo). I found this in the Hunspell change log:

src/hunspell/{suggestmgr,hunspell}.*: strip periods from suggestions (restore MySpell's original behaviour)
Rationale: OpenOffice.org has an automatic period handling mechanism and suggestions look better without periods.

f-knorr commented Oct 14, 2016

Here is the problem:
We use this regex to tokenize German text:
((?![ß\-.])[^\p{L}]|(?<=\d)\-|\-(?=\d+))
Because of the . in this regex, a full stop does not split tokens (e.g., "Das ist fasch." is tokenized into "Das, ist, fasch.").
Unfortunately, we get this part of the regex directly from the Hunspell dictionary (we read it from the aff file):
wordChars = "(?![" + hunspellDict.getWordChars().replace("-", "\\-") + "])";

The easy fix that I suggest is to add . as an additional tokenizing character (we could also modify the aff file, but then we would have to remember this change after every update of the dictionary):
((?![ß\-.])[^\p{L}]|(?<=\d)\-|\-(?=\d+)[.])
Now "Das ist fasch." is tokenized into "Das, ist, fasch, ."
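
The effect of the word characters on tokenization can be tried out with a small sketch. It assumes the pattern is used as a separator for `Pattern.split`, which is a guess about how the tokenizer applies it. The class name `DotTokenizerDemo` is hypothetical, and `WITH_DOT` is an assumed variant that adds `.` as one more alternative. Since `split` discards separators, the dot does not show up as a token of its own here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

class DotTokenizerDemo {
    // Pattern quoted in the thread; "\u00DF" is the letter sharp s. Characters listed in the
    // negative lookahead (sharp s, hyphen, dot) are never treated as separators.
    static final Pattern ORIGINAL =
        Pattern.compile("((?![\u00DF\\-.])[^\\p{L}]|(?<=\\d)\\-|\\-(?=\\d+))");

    // Assumed variant: "." added as a further alternative, so it always separates.
    static final Pattern WITH_DOT =
        Pattern.compile("((?![\u00DF\\-.])[^\\p{L}]|(?<=\\d)\\-|\\-(?=\\d+)|[.])");

    // Splits the text at every separator match; trailing empty tokens are dropped by split().
    static List<String> tokenize(Pattern separator, String text) {
        return Arrays.asList(separator.split(text));
    }

    public static void main(String[] args) {
        System.out.println(tokenize(ORIGINAL, "Das ist fasch."));  // [Das, ist, fasch.]
        System.out.println(tokenize(WITH_DOT, "Das ist fasch."));  // [Das, ist, fasch]
    }
}
```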

f-knorr added a commit that referenced this issue Oct 14, 2016

@f-knorr f-knorr closed this Oct 14, 2016

janschreiber commented Oct 14, 2016

Does this change affect the internal handling of abbreviations by Hunspell, such as 'bzw.', 'bzgl.' etc.? I can't test this myself atm, sorry.

f-knorr added a commit that referenced this issue Oct 14, 2016

f-knorr commented Oct 14, 2016

Unfortunately, yes! I have just reverted my change.
These abbreviations would no longer be accepted if we added . as a tokenizing character to the regex.
Hence, we need a more sophisticated fix...
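
A rough sketch of why the abbreviations break follows. It assumes the dictionary stores abbreviations such as 'bzw.' together with their dot. The class `AbbreviationDemo`, the two simplified separator patterns, and the tiny word set are all hypothetical and only illustrate the effect.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

class AbbreviationDemo {
    // Simplified separators (assumption): any non-letter, with or without "." exempted.
    static final Pattern KEEP_DOT = Pattern.compile("(?![.])[^\\p{L}]");
    static final Pattern SPLIT_ON_DOT = Pattern.compile("[^\\p{L}]");

    // Hypothetical dictionary in which the abbreviation is stored with its dot.
    static final Set<String> DICTIONARY = Set.of("Montag", "Dienstag", "bzw.");

    // Returns the non-empty tokens that are not found in DICTIONARY.
    static List<String> unknownWords(Pattern separator, String text) {
        List<String> unknown = new ArrayList<>();
        for (String token : separator.split(text)) {
            if (!token.isEmpty() && !DICTIONARY.contains(token)) {
                unknown.add(token);
            }
        }
        return unknown;
    }

    public static void main(String[] args) {
        System.out.println(unknownWords(KEEP_DOT, "Montag bzw. Dienstag"));     // []
        System.out.println(unknownWords(SPLIT_ON_DOT, "Montag bzw. Dienstag")); // [bzw]
    }
}
```

Once the dot is a separator, the token "bzw" no longer matches the entry "bzw." and is reported as unknown.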

@f-knorr f-knorr reopened this Oct 14, 2016

f-knorr commented Oct 15, 2016

So, here is another attempt to fix the problem:
First, let's check whether the word to be corrected ends with ".". In this case, add a "." to all suggestions that do not end with ".". (We have to do this check as, for instance, "ec." creates the suggestion "etc.", which already ends with "." -> "Wir brauchen noch Wasser, Limo, Bier etc.")
I have added some checks to prevent a repetition of yesterday's fix attempt.
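
The approach described here could look roughly like the following sketch. It is not the code from the commit: the class `SuggestionDotFixer` and the method `restoreTrailingDot` are hypothetical names, and the suggestion lists are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SuggestionDotFixer {
    // If the flagged word ends with ".", appends "." to every suggestion that lacks one.
    // Suggestions that already end with "." (such as "etc.") are left unchanged.
    static List<String> restoreTrailingDot(String flaggedWord, List<String> suggestions) {
        if (!flaggedWord.endsWith(".")) {
            return suggestions;
        }
        List<String> fixed = new ArrayList<>();
        for (String suggestion : suggestions) {
            fixed.add(suggestion.endsWith(".") ? suggestion : suggestion + ".");
        }
        return fixed;
    }

    public static void main(String[] args) {
        // The suggestion lists below are invented examples, not real Hunspell output.
        System.out.println(restoreTrailingDot("verhindenn.", List.of("verhindern", "verhindere")));
        System.out.println(restoreTrailingDot("ec.", List.of("etc.", "es")));
    }
}
```

Checking each suggestion individually avoids producing "etc.." when the speller already returns an abbreviation with its dot.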

@f-knorr f-knorr closed this in 3911b60 Oct 15, 2016
