Hyphenation patterns update (cleaned up #376) #378
Pinging @roshavagarga and @cramoisi (I can't add you as reviewers...) for a quick look and confirmation that I didn't mess anything up before merging.
I will look into it after work :) |
cr3gui/data/hyph/languages.json
Outdated
"filename": "Latvian.pattern",
"language": "lv",
"left_hyphen_min": 2,
"right_hyphen_min": 2
I missed adding a comma after this line (157)
Yup, just saw that, going to fix it.
c4c57cd
to
3c82047
Compare
I did my once-over and it seems fine to me. I did notice this has one fewer newline than the original, but I have no idea where that missing line comes from. Probably something I missed and you fixed?
Yes, a last trailing blank line in Esperanto.pattern that I removed (to avoid github highlighting it as "not clean"). |
Ah, those always annoyed me, good to know I can get rid of them 👍 |
Added 2 commits to rename the badly named Romanian and Ukrainian pattern files, and some methods to help clean up the frontend readertypography.lua code.
@poire-z Not sure how to add changes to this PR, so I've attached the latest Spanish pattern :) |
61ebe5b
to
86432e4
Compare
OK, added.
If you plan to do more work/fixes like these, and you have a Linux machine and can do git command-line stuff, I can give you a small cheatsheet on how to make simple PRs on new branches (which I could more easily reword, but so could you, and we'll be here to help if you get stuck).
Yeah, the one Linux machine I have is a decrepit single-core tiny notebook from like 7-8 years ago that's running Lubuntu, because nothing else runs well on it :/ |
cr3gui/data/hyph/Bulgarian.pattern
Outdated
<pattern></pattern>
line 7 must be deleted
cr3gui/data/hyph/Esperanto.pattern
Outdated
<pattern>unu3a2nim</pattern>
<pattern>uo2</pattern>
<pattern> uv2u3l</pattern>
<pattern>uzulinterfaco</pattern>
please move line 1964 to the end of the file, and comment it out :)
cr3gui/data/hyph/Irish.pattern
Outdated
<pattern>dtiom5áintí</pattern>
<pattern>thiom5áintí</pattern>
could you move these lines 6085-6086 before your block of commented exceptions ?
cr3gui/data/hyph/Norwegian.pattern
Outdated
<pattern>upp5yver</pattern>
<pattern>ut5ørk</pattern>
<pattern>ut5ørken</pattern>
<!-- <pattern>velan</pattern>
could you gather all your commented lines in one block at the end of the file ?
cr3gui/data/hyph/Slovak.pattern
Outdated
<!-- Exceptions, see https://github.com/koreader/crengine/pull/376
<pattern>dosť</pattern>
-->
commented lines at the end of file ? ;)
cr3gui/data/hyph/Spanish.pattern
Outdated
<pattern> re3aprend</pattern>
<pattern> re3aprénd</pattern>
<pattern> re3apret</pattern>
<pattern> reapríet</pattern>
to be commented out and moved to the end of the file
All 4 ? Or just the last one without any number?
I'll assume it's just the last one.
Oh, right, you answered that below :)
yup. only the non-numbered one. I messed with my line selection, sorry :)
cr3gui/data/hyph/Spanish.pattern
Outdated
<pattern> reu3nia</pattern>
<pattern> reu3ní</pattern>
<pattern> reu3nis</pattern>
<pattern> reunim</pattern>
to be commented out and moved to the end of the file
Only the ones without numbers @ 4098, right?
yes :)
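The review comments above single out pattern entries that carry no digits. In Liang-style patterns the digits are the hyphenation weights, so an entry without any digit adds no weight of its own. A minimal sketch of how such lines could be found mechanically before review follows; `has_weight` and `unnumbered` are hypothetical helper names, not part of crengine, and the check looks only at the text between the `<pattern>` tags.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Returns true if the pattern body contains at least one ASCII digit,
// i.e. at least one hyphenation weight. Bytes of multi-byte UTF-8
// letters (such as 'í') are never ASCII digits, so they are skipped.
bool has_weight(const std::string& pattern) {
    for (unsigned char c : pattern) {
        if (std::isdigit(c)) return true;
    }
    return false;
}

// Collects the entries that carry no weight at all (hypothetical helper).
std::vector<std::string> unnumbered(const std::vector<std::string>& patterns) {
    std::vector<std::string> out;
    for (const auto& p : patterns) {
        if (!has_weight(p)) out.push_back(p);
    }
    return out;
}
```

Run on the first Spanish block quoted above, `unnumbered` would return only ` reapríet`, which matches the reviewer's answer that only the non-numbered entry is concerned.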
crengine/src/textlang.cpp
Outdated
{ "en-GB", "English_GB", "English_GB.pattern", 2, 3 },
{ "en", "English_US", "English_US.pattern", 2, 3 },
{ "eo", "Esperanto", "Esperanto.pattern", 2, 2 },
{ "et", "Estonian", "Estonian.pattern", 2, 3 },
{ "fi", "Finnish", "Finnish.pattern", 2, 2 },
{ "fr", "French", "French.pattern", 2, 1 },
could you add // see French.pattern file
as a comment for french 2,1 ?
(@roshavagarga : I'll do the requested modifications - later this afternoon.) |
OK, some quick notes (built from my own notes, removing the more complex and rare stuff):
Switch from outdated BGOffice pattern to hyphenation.org's newer version. Update quotation marks.
Update right_hyphen_min from 2 to 3. Add commented out exceptions to pattern file, to be investigated.
Update right_hyphen_min from 2 to 3. Add exceptions to pattern file.
Update right_hyphen_min from 2 to 3. Add exceptions to pattern file (some commented out, to be investigated).
Add exceptions to pattern file.
Add exceptions to pattern file (some commented out, to be investigated).
So frontend code can fetch per-language hyphen limits.
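To illustrate what fetching per-language hyphen limits could look like, here is a small sketch built on the table rows from the textlang.cpp excerpt in this review. The struct name, field names, `kEntries` and `find_entry` are assumptions made for the example, and the tag matching rule (exact match, then primary-subtag prefix) is a guess; the actual crengine code may be organized differently.

```cpp
#include <cassert>
#include <string>

// Hypothetical mirror of one row of the table in textlang.cpp.
struct HyphLangEntry {
    const char* lang_tag;
    const char* name;
    const char* filename;
    int left_hyphen_min;
    int right_hyphen_min;
};

// Rows copied from the excerpt reviewed in this PR; order matters,
// more specific tags ("en-GB") come before generic ones ("en").
static const HyphLangEntry kEntries[] = {
    { "en-GB", "English_GB", "English_GB.pattern", 2, 3 },
    { "en",    "English_US", "English_US.pattern", 2, 3 },
    { "eo",    "Esperanto",  "Esperanto.pattern",  2, 2 },
    { "et",    "Estonian",   "Estonian.pattern",   2, 3 },
    { "fi",    "Finnish",    "Finnish.pattern",    2, 2 },
    { "fr",    "French",     "French.pattern",     2, 1 }, // see French.pattern file
};

// Returns the first entry whose tag equals `tag`, or is a prefix of it
// followed by '-' (so "fr-CA" falls back to "fr"). Returns nullptr if
// no entry matches. Matching is case-sensitive in this sketch.
const HyphLangEntry* find_entry(const std::string& tag) {
    for (const HyphLangEntry& e : kEntries) {
        const std::string t(e.lang_tag);
        if (tag == t) return &e;
        if (tag.size() > t.size() && tag.compare(0, t.size(), t) == 0 &&
            tag[t.size()] == '-') {
            return &e;
        }
    }
    return nullptr;
}
```

With this table, `find_entry("fr")->right_hyphen_min` is 1 and `find_entry("en-US")` falls back to the `en` row, while an unlisted tag such as `de` yields `nullptr`, leaving the caller to pick defaults.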
86432e4
to
1ebcd14
Compare
Ok, done. Please have a quick look again as I did this in a hurry. |
ok good for me :)
Is this file sufficient to describe the 'TeX Hyphenation patterns' format? OK, I can pick hyphenation pattern files from KOReader into CoolReader. Thanks to all contributors. (crengine/crengine/src/hyphman.cpp, line 2 in a00d1b8)
And I don't know where Alan got these patterns or how correct they are. But if I make a dump from "English_US_hyphen_(Alan).pdb", I get 295799 bytes instead of the 126141 bytes of English_US.pattern. This makes you wonder which file is better and more correct.
I found that in KOReader the "*_hyphen_(Alan).pdb" files were dropped in #209 and #206, and I guess nobody checked the contents of those files. Maybe we can combine them somehow? What is the best way to do this? Is it OK to just add the omitted patterns from "*_hyphen_(Alan).pdb" to "*.pattern"?
P.S. This tool can be found in my not-yet-merged branch here: https://github.com/virxkane/coolreader/tree/updates-2020.09-1/crengine/Tools/HyphDumper |
I could check the differences if you provide a human-readable dump of the English pdb file. Differences can be legit, as the TeX files are only used as a base for crengine, which by design can't read all TeX rules. So when I changed from pdb to pattern I also removed those rules, as they were useless. But since we are now following the hyphenation patterns from recognised sources, I don't think we need to worry about trying to do better than linguists already did for LibreOffice, etc. I think feedback from native speakers should be enough, as it's somewhat tricky to edit these files without in-depth knowledge of the languages...
How can I make a "human readable dump of English pdb file"? I can only convert
That will do as I'm human and I can read a pattern file with my eyes :) Btw : I checked the content of pdb files back then but only for a few of them. As I said, some differences are legit |
I don't think combining them makes any sense: it's not really "the more items the better", but having a set that is consistent within itself (some patterns with numbers overriding some others with other numbers - mixing sources could possibly have these numbers relations incoherent). |
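The point about numbers overriding each other can be made concrete with a sketch of Liang-style pattern application: at every gap between letters, the highest weight among all matching patterns wins, and only an odd final weight allows a break. This is a simplified illustration and not crengine's hyphman.cpp; `parse` and `hyphenate` are hypothetical names, the word boundary is assumed to be marked by a space (as the leading space in the ` re3aprend` excerpts suggests), and the limits are counted in bytes, so the sketch is only meaningful for ASCII words.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// A pattern split into its letters and one weight per gap:
// weights[k] applies to the gap just before letters[k], and the
// last weight applies to the gap after the final letter.
struct Parsed {
    std::string letters;
    std::vector<int> weights;
};

Parsed parse(const std::string& pattern) {
    Parsed p;
    p.weights.push_back(0);
    for (unsigned char c : pattern) {
        if (std::isdigit(c)) {
            p.weights.back() = c - '0';
        } else {
            p.letters.push_back(static_cast<char>(c));
            p.weights.push_back(0);
        }
    }
    return p;
}

// Inserts '-' at every gap whose combined weight is odd, keeping at
// least left_min characters before and right_min after each break.
std::string hyphenate(const std::string& word,
                      const std::vector<std::string>& patterns,
                      std::size_t left_min, std::size_t right_min) {
    const std::string w = " " + word + " ";   // assumed boundary marker
    std::vector<int> score(w.size() + 1, 0);  // score[g]: gap before w[g]
    for (const std::string& raw : patterns) {
        const Parsed p = parse(raw);
        if (p.letters.empty()) continue;
        for (std::size_t pos = w.find(p.letters); pos != std::string::npos;
             pos = w.find(p.letters, pos + 1)) {
            for (std::size_t k = 0; k < p.weights.size(); ++k) {
                // The highest weight seen for a gap overrides lower ones.
                score[pos + k] = std::max(score[pos + k], p.weights[k]);
            }
        }
    }
    std::string out;
    for (std::size_t i = 0; i < word.size(); ++i) {
        const bool odd = score[i + 1] % 2 == 1;  // gap before word[i]
        if (odd && i >= left_min && word.size() - i >= right_min) {
            out.push_back('-');
        }
        out.push_back(word[i]);
    }
    return out;
}
```

For example, with the made-up patterns `1b` and `a2bc` and limits 1 and 1, `hyphenate("xabcxab", ...)` gives `xabcxa-b`: the break before the first `b` is suppressed because `a2bc` raises that gap from 1 to 2, while the second `b` is untouched. Merging pattern sets from two sources changes exactly these maxima, which is why a mixed set can behave differently from either original.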
Yes, it looks logical.
The dump for "
Is there something useful about these files " |
OK, I checked, and the current patterns are more developed and consistent than the old ones. A good example is that the exceptions introduced by the hacks you mentioned in the pdb files are all processed correctly by the current pattern file. So I guess you could get rid of these _(Alan).pdb files once and for all :-)
@cramoisi Thank you. |
See #376 for the whole discussion and work steps.
This is just #376 with the 33 work commits squashed into 17 logical commits.