Possible bug in algorithm #45

Closed
tbsmark86 opened this issue Nov 30, 2018 · 3 comments
@tbsmark86

I've stumbled over some strange/incorrect hyphenation in some words.

To validate, I tried Pyphen (http://pyphen.org/) and compared the output on a large list of words, which turned up a ton of differences. In that list I found one wrongly hyphenated word (I didn't look any further):
"zweihenklig" should be "zwei-henk-lig" but is "zwei-hen-klig"

It seems there are multiple pattern lists for German available, so to fix this I created a custom de.hpb with the patterns found in the MiKTeX Portable Package (6/30/2018).

BUT: with those patterns, TeX and Hyphenopoly seem to disagree on other words (again, I did not look any further):
"zytosol" => "zyto-s-ol"; in TeX: "zy-to-sol" (which is correct)
"indestructible" => "in-des-t-ruc-tible"; in TeX: "in-de-struc-tible" (not a German word, but this is almost correct)

Your de.hpb results in "zy-to-sol" and "in-de-st-ruc-ti-ble".

Can you look into this?
I would like to avoid making an AJAX request just to get this done with the Python solution.

I can provide a TeX test file and the custom de.hpb if you need them.

@mnater mnater closed this as completed in 3101cf8 Nov 30, 2018
@mnater
Owner

mnater commented Nov 30, 2018

Hi

There are indeed many different pattern lists for German. This is mostly due to a very active community building German patterns (http://projekte.dante.de/Trennmuster) and updating them every once in a while.

I rebuilt the patterns based on the most recent word list from the Trennmuster group and updated de.hpb. Now "zweihenklig" and "zytosol" are hyphenated correctly.

Pattern files are language-specific; it's OK that "indestructible" isn't hyphenated correctly by German patterns.
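
For readers wondering why two pattern lists can split the same word differently, here is a minimal sketch of TeX-style (Liang) pattern matching. It is not Hyphenopoly's actual implementation, and the patterns (`n1k`, `k1l`, `n2kl`) are made up for illustration rather than taken from any real German pattern file. Digits in a pattern are weights for the gap at that position; the highest weight per gap wins, and a break is allowed only where the final weight is odd.

```javascript
// Split a pattern such as "n2kl" into its letters ("nkl") and one weight per
// gap between letters ([0, 2, 0, 0]).
function parsePattern(pattern) {
  const letters = pattern.replace(/\d/g, "");
  const weights = new Array(letters.length + 1).fill(0);
  let pos = 0;
  for (const ch of pattern) {
    if (ch >= "0" && ch <= "9") weights[pos] = Number(ch);
    else pos += 1;
  }
  return { letters, weights };
}

// Apply all patterns to a word and insert "-" at every gap with an odd weight.
// leftMin/rightMin are the minimum number of letters before/after a break.
function hyphenate(word, patterns, leftMin = 2, rightMin = 2) {
  const padded = "." + word.toLowerCase() + "."; // "." marks the word edges
  const levels = new Array(padded.length + 1).fill(0);
  for (const { letters, weights } of patterns.map(parsePattern)) {
    let idx = padded.indexOf(letters);
    while (idx !== -1) {
      weights.forEach((w, k) => {
        levels[idx + k] = Math.max(levels[idx + k], w);
      });
      idx = padded.indexOf(letters, idx + 1);
    }
  }
  let out = "";
  for (let j = 0; j < word.length; j++) {
    // levels[j + 1] is the gap just before word[j]
    const allowed = j >= leftMin && word.length - j >= rightMin;
    if (allowed && levels[j + 1] % 2 === 1) out += "-";
    out += word[j];
  }
  return out;
}

// Hypothetical pattern sets: the second one adds an even weight that
// suppresses the n-k break and a pattern that allows the k-l break instead.
const oldPatterns = ["n1k"];
const newPatterns = ["n1k", "n2kl", "k1l"];
console.log(hyphenate("henklig", oldPatterns)); // hen-klig
console.log(hyphenate("henklig", newPatterns)); // henk-lig
```

Because every break is the result of many overlapping patterns, rebuilding the list from a newer word list can change the outcome for words that were never touched directly.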

I'm very interested in your word list to check the patterns.

Best regards,
Mathias

@mnater mnater self-assigned this Nov 30, 2018
@tbsmark86
Author

Hi again,

thanks for the update! Works better now.

> I'm very interested in your word list to check the patterns.

I simply used the dictionary file from LanguageTool and then diffed the results from Hyphenopoly and Pyphen.
You can find the dicts here: https://github.com/languagetool-org/languagetool/tree/master/languagetool-language-modules/de/src/main/resources/org/languagetool/resource/de/hunspell
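
The comparison described above could be sketched roughly as follows. This is an assumption about the workflow, not the script that was actually used: `parseDic`, `diffHyphenations`, and the two lookup-table hyphenators are hypothetical stand-ins, and the `/A` flag in the sample is invented. In a real run the two callbacks would wrap Hyphenopoly and the Pyphen output, and the text would come from the hunspell `.dic` file.

```javascript
// Extract plain words from hunspell .dic text: the first line is an entry
// count, and each entry may carry affix flags after a slash.
function parseDic(text) {
  return text
    .split(/\r?\n/)
    .slice(1)
    .map((line) => line.split("/")[0].trim())
    .filter((word) => word.length > 0);
}

// Run two hyphenators over the same words and collect every disagreement.
function diffHyphenations(words, hyphenateA, hyphenateB) {
  const differences = [];
  for (const word of words) {
    const a = hyphenateA(word);
    const b = hyphenateB(word);
    if (a !== b) differences.push({ word, a, b });
  }
  return differences;
}

// Stand-in hyphenators backed by lookup tables (values quoted in this issue).
const resultsA = new Map([
  ["zweihenklig", "zwei-hen-klig"],
  ["zytosol", "zy-to-sol"],
]);
const resultsB = new Map([
  ["zweihenklig", "zwei-henk-lig"],
  ["zytosol", "zy-to-sol"],
]);
const hyphenateA = (word) => resultsA.get(word) || word;
const hyphenateB = (word) => resultsB.get(word) || word;

const sampleDic = "2\nzweihenklig/A\nzytosol\n";
const differences = diffHyphenations(parseDic(sampleDic), hyphenateA, hyphenateB);
console.log(differences);
```

A diff like this only shows where the two tools disagree; deciding which side is right still needs a manual check, as with "zweihenklig".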

While testing again I noticed something strange with Node.js: the StringDecoder returns "95,97,98,99,100,101,102,103,104, ..." instead of "_abcdef....", which results in awkward word splitting. Not sure if this is a bug in Node.js or intended. I'm on v8.11.

@mnater
Owner

mnater commented Dec 14, 2018

> While testing again I noticed something strange with Node.js: the StringDecoder returns "95,97,98,99,100,101,102,103,104, ..." instead of "_abcdef....", which results in awkward word splitting. Not sure if this is a bug in Node.js or intended. I'm on v8.11.

Sorry, I missed that part.
Newer versions of Node.js accept TypedArrays as the argument to StringDecoder.write(); older versions require it to be a Buffer.
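
One way to stay compatible with older Node.js versions is to wrap the typed array in a Buffer before decoding. The sketch below illustrates that idea only; it is not necessarily what the fix in 319d13f does, and the helper names `toBuffer` and `decodeUtf8` are hypothetical.

```javascript
const { StringDecoder } = require("string_decoder");

// Wrap a typed array in a Buffer that shares the same memory (no copy).
// Buffers are passed through unchanged.
function toBuffer(bytes) {
  return Buffer.isBuffer(bytes)
    ? bytes
    : Buffer.from(bytes.buffer, bytes.byteOffset, bytes.byteLength);
}

// Decode UTF-8 bytes, always handing StringDecoder a Buffer.
function decodeUtf8(bytes) {
  const decoder = new StringDecoder("utf8");
  return decoder.write(toBuffer(bytes)) + decoder.end();
}

const alphabet = new Uint8Array([95, 97, 98, 99, 100, 101, 102]);
const decoded = decodeUtf8(alphabet);
console.log(decoded); // _abcdef
```

Passing `byteOffset` and `byteLength` matters when the typed array is a view onto a larger ArrayBuffer; without them the Buffer would cover the whole underlying buffer.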

@mnater mnater reopened this Dec 14, 2018
@mnater mnater closed this as completed in 319d13f Dec 14, 2018