Does not work: Dump + Encode with frequency + Dump again #34

Open
milekpl opened this issue Dec 21, 2013 · 6 comments

@milekpl
Member

milekpl commented Dec 21, 2013

Jaume, I dumped the Polish dictionary and re-encoded it using the frequency list. But now I cannot dump the dictionary again, because I get an error:

```
d:\download\LanguageTool-2.4-SNAPSHOT>java -cp languagetool.jar org.languagetool.dev.DictionaryExporter pl_PL.dict >pl_PL.src

Unhandled program error occurred.
Invoke with '--help' for help.
java.lang.RuntimeException: Invalid dictionary entry format (missing separator).
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:59)
        at morfologik.stemming.DictionaryIterator.next(DictionaryIterator.java:15)
        at morfologik.tools.FSADumpTool.dump(FSADumpTool.java:171)
        at morfologik.tools.FSADumpTool.go(FSADumpTool.java:75)
        at morfologik.tools.Tool.go(Tool.java:45)
        at morfologik.tools.FSADumpTool.main(FSADumpTool.java:285)
        at org.languagetool.dev.DictionaryExporter.main(DictionaryExporter.java:40)
```

I think this is an omission on our part in morfologik-speller, but it also shows up in the LT code.
@ghost ghost assigned jaumeortola Dec 21, 2013
@jaumeortola
Member

Hi,
The DictionaryExporter in LT expects the speller dictionary to be inside a "hunspell" folder:

if (new File(filename).getAbsolutePath().contains("hunspell")) {
  FSADumpTool.main("--raw-data", "-d", args[0]);
} else {
  FSADumpTool.main("--raw-data", "-x", "-d", args[0]);
}

If I take the Polish dictionary from the hunspell folder, I can dump it. But I'm not sure whether everything is OK.

@milekpl
Member Author

milekpl commented Dec 22, 2013

Jaume, I tried to dump the dictionary from the current folder, and that is when the error appears. I simply wanted to see if it was encoded properly (because there is an encoding-related bug I discovered).

I don't think hardcoding the folder helps, and -x should work for frequency dictionaries. Otherwise, we cannot say we supply the source, which violates Debian principles - this is why we have documented all decoding procedures so that one could get the original sources. This means, however, that the decoding procedure has to produce readable frequency files, I'm afraid.

See also morfologik/morfologik-stemming#15

@danielnaber
Member

Also see morfologik/morfologik-stemming#35

@danielnaber
Member

So I understand that the problem is that we add the -x option depending on the hard-coded directory name. Instead we need to look inside the .info file and see if the fsa.dict.encoder option is set and only use the -x option if that is the case. Is that correct?
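
A minimal sketch of the check proposed here, not the actual LT code. It assumes the `.info` file is a Java properties file sitting next to the dictionary (`pl_PL.dict` → `pl_PL.info`); the class and method names (`DumpOptions`, `hasEncoder`, `infoFileFor`, `dumpArgs`) are hypothetical, and only the property name `fsa.dict.encoder` and the `FSADumpTool` arguments come from this thread. The call to `FSADumpTool` is left as a comment so the example compiles without morfologik on the classpath.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

final class DumpOptions {

  // Property name as mentioned in this thread; everything else is illustrative.
  static final String ENCODER_KEY = "fsa.dict.encoder";

  /** Returns true if the given .info content declares a non-empty encoder. */
  static boolean hasEncoder(String infoContent) {
    Properties props = new Properties();
    try {
      props.load(new StringReader(infoContent));
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    String encoder = props.getProperty(ENCODER_KEY);
    return encoder != null && !encoder.trim().isEmpty();
  }

  /** Assumption: the metadata file is the dictionary's sibling with an .info extension. */
  static Path infoFileFor(Path dict) {
    String name = dict.getFileName().toString();
    int dot = name.lastIndexOf('.');
    String base = dot < 0 ? name : name.substring(0, dot);
    return dict.resolveSibling(base + ".info");
  }

  /** Builds the dump arguments, adding -x only when an encoder is declared. */
  static String[] dumpArgs(String dictFile, boolean encoded) {
    List<String> args = new ArrayList<>();
    args.add("--raw-data");
    if (encoded) {
      args.add("-x");
    }
    args.add("-d");
    args.add(dictFile);
    return args.toArray(new String[0]);
  }

  public static void main(String[] args) throws IOException {
    if (args.length != 1) {
      System.err.println("Usage: DumpOptions <dictionary>");
      return;
    }
    Path info = infoFileFor(Paths.get(args[0]));
    String content = new String(Files.readAllBytes(info), StandardCharsets.UTF_8);
    String[] toolArgs = dumpArgs(args[0], hasEncoder(content));
    System.out.println(String.join(" ", toolArgs));
    // In DictionaryExporter this would become: FSADumpTool.main(toolArgs);
  }
}
```

This replaces only the directory-name test; it does not change how the entries themselves are decoded.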

@danielnaber
Member

@milekpl Could you maybe help with this, i.e. reply to my question above from 2014-09-24?

@milekpl
Member Author

milekpl commented Oct 8, 2014

@danielnaber: it won't help. The encoder will be set but frequency dictionaries have more data. These data are not dumped properly. I tried to persuade Jaume to add code to dump frequency data but this is not a trivial thing to do, as the source format is XML.

jimregan pushed a commit to jimregan/languagetool that referenced this issue Oct 20, 2019
[ga] replace some generated examples with genuine ones from gaois.ie