List with hard-coded spelling suggestions for some cases? #679

Closed
janschreiber opened this issue Feb 28, 2017 · 3 comments

Comments

@janschreiber
Contributor

janschreiber commented Feb 28, 2017

In a few cases, the algorithm for suggesting corrections does not come up with the best suggestion, or with any suggestion at all, usually because the intended word and what the user actually typed are too far apart in terms of Levenshtein distance.
German examples include the word 'analpherbet' (as reported in the forum recently) and the weird but common misspelling 'legendlich' for 'lediglich' (almost 40,000 hits on Google).
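To make the distance argument concrete, here is a minimal sketch of the textbook Levenshtein computation. It is not LanguageTool's or hunspell's actual suggestion code, and the function name `levenshtein` is chosen purely for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the number of single-character insertions, deletions
    and substitutions needed to turn a into b."""
    # Keep the shorter string in the inner loop so the rows stay small.
    if len(a) < len(b):
        a, b = b, a
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


if __name__ == "__main__":
    # The misspelling from this issue is four edits away from the intended word.
    print(levenshtein("legendlich", "lediglich"))
```

Whether a given distance is "too far" depends on the spell checker's own limits, which this sketch does not model.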
I wonder if we could have a manually maintained, semicolon- or tab-separated list for the cases that are brought to our attention. That could either be a separate file, suggestions.txt, or maybe we could reuse the existing file prohibit.txt.
Example:

analpherbet; Analphabet
analpherbeten; Analphabeten; Analphabetin
legendlich; lediglich; leg endlich
mistreiter; Mitstreiter

LanguageTool could look up a misspelled word in the first column and display the words in the remainder of the line as suggestions, ideally in the order they are given in the file. IMO the suggestions that are generated programmatically should be ignored if a misspelled word is found in the list. This would allow us to not only add missing suggestions, but also overrule misleading suggestions the software produces sometimes (mostly weird compounds that aren't really used, such as 'Brustwalze').
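The lookup described above could be sketched roughly as follows. This is an illustration in Python rather than LanguageTool code: the names `parse_suggestions` and `suggest` are hypothetical, and skipping blank lines and `#` comment lines is an assumption that the proposal does not mention.

```python
from typing import Dict, List

SAMPLE = """\
analpherbet; Analphabet
analpherbeten; Analphabeten; Analphabetin
legendlich; lediglich; leg endlich
mistreiter; Mitstreiter
"""


def parse_suggestions(text: str) -> Dict[str, List[str]]:
    """Map each misspelling (first column) to its suggestions, in file order."""
    table: Dict[str, List[str]] = {}
    for raw in text.splitlines():
        line = raw.strip()
        # Assumption: blank lines and '#' comment lines are ignored.
        if not line or line.startswith("#"):
            continue
        fields = [f.strip() for f in line.split(";")]
        fields = [f for f in fields if f]
        if len(fields) < 2:
            continue  # a misspelling without any suggestion is useless
        table[fields[0]] = fields[1:]
    return table


def suggest(word: str, table: Dict[str, List[str]], generated: List[str]) -> List[str]:
    """Return the manual suggestions if the word is listed; otherwise
    fall back to the programmatically generated ones."""
    manual = table.get(word)
    return list(manual) if manual is not None else list(generated)


TABLE = parse_suggestions(SAMPLE)

if __name__ == "__main__":
    # The manual entry replaces whatever the speller generated.
    print(suggest("legendlich", TABLE, ["Legende"]))
    # Unlisted words keep the generated suggestions.
    print(suggest("Hauss", TABLE, ["Haus"]))
```

Replacing rather than merging the generated suggestions matches the wish above to overrule misleading compounds; merging the two lists would be an equally simple variant.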

@milekpl
Member

milekpl commented Apr 25, 2017

There is a very simple solution for Polish already in place:

  1. Use SimpleReplaceRule to create replacements for popular spelling mistakes in rules/replace.txt.
  2. Use ignore.txt in hunspell folder to list these popular spelling mistakes.

You could write a script to populate both files from a list, and a JUnit test to check that you don't get additional replacements.
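A script along those lines might look like the sketch below. The file formats are assumptions, not confirmed in this thread: one `wrong=right1|right2` entry per line for replace.txt and one word per line for ignore.txt. The function name `build_files` is hypothetical, and the sketch returns the file contents as strings instead of writing to the real paths.

```python
from typing import Dict, List, Tuple


def build_files(entries: Dict[str, List[str]]) -> Tuple[str, str]:
    """Build the contents of replace.txt and ignore.txt from one list.

    Assumed formats (check them against the real files before use):
      replace.txt: wrong=right1|right2
      ignore.txt:  one misspelling per line
    """
    replace_lines = [
        f"{wrong}={'|'.join(rights)}" for wrong, rights in entries.items()
    ]
    ignore_lines = list(entries)
    replace_txt = "\n".join(replace_lines) + "\n"
    ignore_txt = "\n".join(ignore_lines) + "\n"
    return replace_txt, ignore_txt


if __name__ == "__main__":
    replace_txt, ignore_txt = build_files({
        "mistreiter": ["Mitstreiter"],
        "legendlich": ["lediglich", "leg endlich"],
    })
    print(replace_txt, end="")
    print(ignore_txt, end="")
```

Generating both files from a single source keeps them in sync, so that every ignored misspelling also has a replacement entry.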

@danielnaber
Member

This issue also shows that our spell checker (often that's just hunspell) is far from perfect. A native speaker can tell what "analpherbet" is supposed to mean, and LT should be able to work that out, too.

@janschreiber
Contributor Author

There is already a solution for that: getAdditionalTopSuggestions() in
languagetool/languagetool-language-modules/de/src/main/java/org/languagetool/rules/de/GermanSpellerRule.java
