Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hyphenation patterns update (cleaned up #376) #378

Merged
merged 20 commits into from Sep 17, 2020

Conversation

poire-z
Copy link
Contributor

@poire-z poire-z commented Sep 17, 2020

See #376 for the whole discussion and work steps.
This is just #376 with the 33 work commits squashed into 17 logical commits.


This change is Reviewable

@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

Pinging @roshavagarga and @cramoisi (I can't add you as reviewers...) for a quick look and confirmation that I didn't mess anything, before merging.

@poire-z poire-z mentioned this pull request Sep 17, 2020
28 tasks
@cramoisi
Copy link
Contributor

I will look into it after work :)

"filename": "Latvian.pattern",
"language": "lv",
"left_hyphen_min": 2,
"right_hyphen_min": 2
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed adding a comma after this line (157)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, just saw that, going to fix it.

@roshavagarga
Copy link
Contributor

I did my once-over and it seems fine to me, though I did notice this has 1 less new line compared to the original, but no idea where that 1 missing line is from, probably something I missed and you fixed?

@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

Yes, a last trailing blank line in Esperanto.pattern that I removed (to avoid github highlighting it as "not clean").

@roshavagarga
Copy link
Contributor

Ah, those always annoyed me, good to know I can get rid of them 👍

@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

Added 2 commits to rename badly named Romanian and Ukrainian pattern files, and some methods to help cleaning up frontend readertypography.lua code.

@roshavagarga
Copy link
Contributor

@poire-z Not sure how to add changes to this PR, so I've attached the latest Spanish pattern :)

Spanish.zip

@poire-z poire-z force-pushed the roshavagarga_hyph_works branch 2 times, most recently from 61ebe5b to 86432e4 Compare September 17, 2020 10:41
@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

OK, added.

I tried learning git some time ago and just gave up out of frustration and stuck with using github's web ui.

If you plan to do more work/fixes like these, and you have a linux machine / can do git command line stuff, I can give you a small cheatsheet on how to proceed to make simple PRs on new branches (that I could more easily reword, but even you, and we'll be here to help if you get stuck).

@roshavagarga
Copy link
Contributor

Yeah, the one Linux machine I have is a decrepit single-core tiny notebook from like 7-8 years ago that's running Lubuntu, because nothing else runs well on it :/
I generally stick to Windows 7, because gaming, but you can still share the cheatsheet and if I do resort to it I can always boot from a usb drive or something.

<pattern></pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

line 7 must be deleted

<pattern>unu3a2nim</pattern>
<pattern>uo2</pattern>
<pattern> uv2u3l</pattern>
<pattern>uzulinterfaco</pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please move the line 1964 at the end of the file, and comment it :)

Comment on lines 6085 to 6086
<pattern>dtiom5áintí</pattern>
<pattern>thiom5áintí</pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you move these lines 6085-6086 before your block of commented exceptions ?

<pattern>upp5yver</pattern>
<pattern>ut5ørk</pattern>
<pattern>ut5ørken</pattern>
<!-- <pattern>velan</pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you gather all your commented lines in one block at the end of the file ?

Comment on lines 2474 to 2476
<!-- Exceptions, see https://github.com/koreader/crengine/pull/376
<pattern>dosť</pattern>
-->
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

commented lines at the end of file ? ;)

<pattern> re3aprend</pattern>
<pattern> re3aprénd</pattern>
<pattern> re3apret</pattern>
<pattern> reapríet</pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be commented and moved at the end of the file

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All 4 ? Or just the last one without any number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll assume it's just the last one.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of, right, you answered that below :)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yup. only the non-numbered one. I messed with my line selection, sorry :)

<pattern> reu3nia</pattern>
<pattern> reu3ní</pattern>
<pattern> reu3nis</pattern>
<pattern> reunim</pattern>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to be commented and move at the end of the file

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only the ones without numbers @ 4098, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes :)

{ "en-GB", "English_GB", "English_GB.pattern", 2, 3 },
{ "en", "English_US", "English_US.pattern", 2, 3 },
{ "eo", "Esperanto", "Esperanto.pattern", 2, 2 },
{ "et", "Estonian", "Estonian.pattern", 2, 3 },
{ "fi", "Finnish", "Finnish.pattern", 2, 2 },
{ "fr", "French", "French.pattern", 2, 1 },
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add // see French.pattern file as a comment for french 2,1 ?

@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

(@roshavagarga : I'll do the requested modifications - later this afternoon.)

@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

OK, some quick notes (build from my own notes, removing more complex and rare stuff):

FORK repo on github

CLONE AND SETUP UPSTREAM :
$ git clone https://github.com/yourname/crengine.git
$ git remote -v
$ git remote add upstream https://github.com/koreader/crengine.git
$ git remote -v
    origin  https://github.com/yourname/crengine.git (fetch)
    origin  https://github.com/yourname/crengine.git (push)
    upstream  https://github.com/koreader/crengine.git (fetch)
    upstream  https://github.com/koreader/crengine.git (push)

NEVER ADD STUFF TO YOUR master BRANCH, it will just be kept up to date
with upstream master as a reference.

To sync your local master branch with upstream:
$ git checkout master
$ git fetch upstream
$ git merge upstream/master
To update your remote branch with your now-sync'ed local branch:
$ git push origin master

----
CREATE A NEW BRANCH to be pushed as a PR:
$ git checkout -b my-mod-branch
<< edit files >>
To see your changed stuff
$ git status
$ git diff

To make a commit with your changed stuff:
$ git add path/to/file
$ git add .    (if you want to add them all)
$ git status
$ git commit
<< lauches vi >>
    Your commit title

    Your commit details if needed.
Check all looks good:
$ git log
$ git log -p
Add more commits if needed.

When ready, push for a PR
$ git push -u origin my-mod-branch         (-u to create it)
And follow the url to github to open a PR.

Add more commits (as above) if needed to answer review comments, and push them:
$ git push origin my-mod-branch

Here, you would have a lot of other options, to rebase, squash, remove commits,
I could detail them when needed.

When PR is merged, cleanup and resync everything:
$ git checkout master
$ git branch -D my-mod-branch
$ git fetch --prune
$ git fetch upstream ; git merge upstream/master
$ git push origin master

Ready for next PR :)

roshavagarga and others added 7 commits September 17, 2020 13:48
Update right_hyphen_min from 2 to 3.
Add exceptions to pattern file (some commented out,
to be investigated).
Add exceptions to pattern file.
Add exceptions to pattern file (some commented out,
to be investigated).
So frontend code can fetch per-language hyphen limits.
@poire-z
Copy link
Contributor Author

poire-z commented Sep 17, 2020

Ok, done. Please have a quick look again as I did this in a hurry.

Copy link
Contributor

@cramoisi cramoisi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok good for me :)

@virxkane
Copy link
Contributor

This file is sufficient to describe the 'TeX Hyphenation patterns' format?
https://github.com/hyphenation/tex-hyphen/blob/master/docs/tb87nemeth.pdf

Ok, I can pick hyphenation patterns files from KOReader into CoolReader. Thanks to all contributors.
But I thought, what to do with the "*_hyphen_(Alan) .pdb" files, just delete it? But what if there is something missing in the "*.pattern" files?
So, I wrote a simple converter (dumper) for "*_hyphen_(Alan).pdb" or "*.pattern" files and saw that some of the patterns that are present in "*_hyphen_(Alan).pdb" are missing in "*.pattern".
I guess that "*_hyphen_(Alan).pdb" files transfered from AlReader program:

\brief AlReader hyphenation manager

And I don't know where Alan got these patterns and how correct they are.
But if I make dump from "English_US_hyphen_(Alan).pdb" I got 295799 bytes instead of 126141 from file English_US.pattern. And this fact makes you wonder which file is better and more correct.
I found that in KOReader "*_hyphen_(Alan).pdb" files dropped in #209 and #206 and I guess nobody checked contents of the "*_hyphen_(Alan).pdb" files.
Maybe we can combine them somehow? What is the best way to do this? Is it ok to just add omitted patterns in "*.pattern" from "*_hyphen_(Alan).pdb"?

P.S. This tool can be found in my not-yet-merged branch here: https://github.com/virxkane/coolreader/tree/updates-2020.09-1/crengine/Tools/HyphDumper
I would be grateful if someone points out the errors.

@cramoisi
Copy link
Contributor

cramoisi commented Sep 20, 2020

I could check the differences if you provide a human readable dump of English pdb file. Différences can be legit as the TeX files are only used as a base for crengine which can't read all TeX rules by design. So when I changed from pdb to pattern I also removed them as they were useless.

But as now we are following the hyphen patterns from recognised sources, so I don't think we need to worry about trying to make better than linguists already did for libreoffice, etc. I think feedback from native speakers should be enough as it's somehow tricky to edit these files without in dept knowledge of the languages...

@virxkane
Copy link
Contributor

How can I make "human readable dump of English pdf file"? I can only convert *_hyphen_(Alan).pdb into *.pattern.

@cramoisi
Copy link
Contributor

That will do as I'm human and I can read a pattern file with my eyes :)

Btw : I checked the content of pdb files back then but only for a few of them. As I said, some differences are legit

@poire-z
Copy link
Contributor Author

poire-z commented Sep 20, 2020

Maybe we can combine them somehow? What is the best way to do this? Is it ok to just add omitted patterns in ".pattern" from "hyphen(Alan).pdb"?

I don't think combining them makes any sense: it's not really "the more items the better", but having a set that is consistent within itself (some patterns with numbers overriding some others with other numbers - mixing sources could possibly have these numbers relations incoherent).
Given that we don't really know who made the _Alan.pdb from what and when, and that the sources used to build the *.pattern are recent, referenced and updated, our *.pattern are a lot more trustable.

@virxkane
Copy link
Contributor

I don't think combining them makes any sense: it's not really "the more items the better", but having a set that is consistent within itself (some patterns with numbers overriding some others with other numbers - mixing sources could possibly have these numbers relations incoherent).
Given that we don't really know who made the _Alan.pdb from what and when, and that the sources used to build the *.pattern are recent, referenced and updated, our *.pattern are a lot more trustable.

Yes, it looks logical.
And finally, for the connoisseurs of English, which I am not, dump 4 files:

  1. https://github.com/koreader/crengine/blob/master/cr3gui/data/hyph/English_GB.pattern ->
    English_GB_dump.pattern.zip
  2. https://github.com/koreader/crengine/blob/master/cr3gui/data/hyph/English_US.pattern ->
    English_US_dump.pattern.zip
  3. https://github.com/buggins/coolreader/blob/master/cr3gui/data/hyph/English_GB_hyphen_(Alan).pdb ->
    English_GB_Alan_dump.pattern.zip
  4. https://github.com/buggins/coolreader/blob/master/cr3gui/data/hyph/English_US_hyphen_(Alan).pdb ->
    English_US_Alan_dump.pattern.zip

The dump for "*.pattern" is the same as original, but all patterns are sorted for ease of comparison.

  • Difference between "English_GB_Alan_dump.pattern" and "English_US_dump.pattern" is less than between "English_US_Alan_dump.pattern" and "English_US_dump.pattern" :).
  • Patterns in "English_GB_hyphen_(Alan).pdb" like "a8s7s8o7c8i8a8t8e8s" look like a hack.

Is there something useful about these files "*_hyphen_(Alan).pdb" ?

@cramoisi
Copy link
Contributor

Ok, I checked and actual patterns are more developed and consistent than the old one. A good example is that exceptions introduced by the hacks you mentioned in the pdb files are all processed correctly in current pattern file. So I guess you could get rid of these _(alan).pdb files once for all :-)

@virxkane
Copy link
Contributor

@cramoisi Thank you.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants