Add typography rules for Russian #557

Merged
poire-z merged 7 commits into koreader:master on Mar 20, 2024


dmalinovsky
Contributor

@dmalinovsky dmalinovsky commented Mar 15, 2024

Russian typography frowns upon having one-letter prepositions and conjunctions hanging at the end of the line.

There are many Russian resources discussing this; I've linked some of the better-known ones:
https://www.artlebedev.ru/kovodstvo/sections/62/
https://gramota.ru/spravka/vopros/294020
https://gramota.ru/spravka/vopros/219773



Russian typography prohibits having one and two letter words hanging at the end of the line.

There are many Russian resources discussing this, I've linked one of the better known ones.

If anyone knows if this is true for Ukrainian and Belarusian as well, please let me know.
@poire-z
Contributor

poire-z commented Mar 15, 2024

Pinging @pkb @virxkane @hius07 @mergen3107 @ssvb to confirm this is the right thing to do, and to ask whether anyone knows if this is true for Ukrainian and Belarusian as well.

(And if it is the right thing to do, why have you been fine without it for so long? :) Because it's only a minor expectation? How does the risk of more hyphenation or wider spacing between words compare to getting this nicety?)

@dmalinovsky
Contributor Author

I stopped using KOREader for a while after device upgrade. Now I'm back to it, and this is pretty noticeable. As an afterthought, I think it's better to scale down this change and prohibit one letter words only, so we'll have a balance between too much spacing and nice typography.

This will balance between too much spacing and nice typography.
@Frenzie
Member

Frenzie commented Mar 15, 2024

@poire-z

(And if it is the right thing to do, why have you been fine without it for so long ? :)

In English and possibly many other languages it'd be preferable to avoid it when reasonably possible as well (not to be confused with prohibiting it :-) but it's a fairly rare occurrence.

@dmalinovsky
Contributor Author

@poire-z

(And if it is the right thing to do, why have you been fine without it for so long ? :)

In English and possibly many other languages it'd be preferable to avoid it when reasonably possible as well (not to be confused with prohibiting it :-) but it's a fairly rare occurrence.

Okay, perhaps “prohibits” is too strong a word. I updated the PR description.

@ssvb

ssvb commented Mar 15, 2024

to confirm this is the right thing to do, and to ask whether anyone knows if this is true for Ukrainian and Belarusian as well.

The Belarusian hyphenation rules can be found here. Basically, in layman's terms:

  • it's incorrect to hyphenate a word in such a way that a single letter is left on a separate line.
  • "дж" and "дз" are digraphs and they shouldn't be broken by hyphenation (except in words that start with the "пад-" or "ад-" prefixes).
  • the hyphenated half of the word on the second line can't start with an apostrophe or the letters "й", "ў", "ь".

@dmalinovsky
Contributor Author

The Belarusian hyphenation rules can be found

I was more curious about whether it's okay to leave single letter prepositions at the end of the line — e.g. "з", "ў", etc.

@ssvb

ssvb commented Mar 15, 2024

I was more curious about whether it's okay to leave single letter prepositions at the end of the line — e.g. "з", "ў", etc.

Ah, sorry, I somehow thought that it was a question about hyphenation. I don't remember any rules regulating one-letter words left hanging at the beginning or at the end of a line. Probably nobody really cares. It's just a matter of aesthetic style, and if your patch makes the text look better, then go for it.

@mergen3107

@poire-z
To answer your questions...
I started using crengine in the form of CoolReader back in 2011 on a Nook Simple Touch with Glowlight. At first, I was very picky about hyphenation patterns, but then I learned that the hyphenation engine isn't perfect, and that it is hard to tweak all these cases.

As time went on, I became less and less picky, up to the point where I am probably illiterate in Russian hyphenation :D so I stopped recognizing these patterns, because I got what I wanted from hyphenation: saved space and straight text boxes.

But thank you @dmalinovsky for bringing this up, I'll revise all of these again and dig up my old notes with complaints :D

@poire-z
Contributor

poire-z commented Mar 16, 2024

Just some warnings, as I can't judge what's preferable, not reading Russian:

This is not about hyphenation, but about where not to wrap a line at a normal space that would usually allow a wrap.

Translated to English, https://www.artlebedev.ru/kovodstvo/sections/62/ says (and I think there is just that about this topic):

Каждый раз нужно вникать в смысл текста и привязывать предлоги и союзы к следующему за ними слову, а частицы — к предыдущему.
Each time you need to delve into the meaning of the text and link prepositions and conjunctions to the word following them, and particles to the previous one.

That's quite little, and hardly reads as "Russian typography frowns upon having one letter words hanging at the end of the line." :)
Maybe prepositions and conjunctions are mostly one-letter words, but couldn't "particles" be too?
And reading Kafka, you would then never see "Joseph K" at the end of a line, but always:

blah blah and Joseph
K reconsidered something
blah blah blah blah blah

And to ensure that, the code may need to more often increase spacing between words:

blah   blah  and  Joseph
K reconsidered something

or hyphenate the following word:

blah blah and Joseph K re-
considered something blah

So it's not a free benefit that automatically looks better.

It also shouldn't be just a question of taste, or it should be a taste shared by many.
For Polish, there is indeed some state/academic documentation that specifies the letters that are prepositions, and K is among them, so they would have this issue with Kafka :) Pinging @ptrm: do such false positives happen often?

The best way to be sure this is worth doing is to check a few books by good Russian publishers: select some text containing such single-letter words, use View HTML, switch to debug view, and see whether these publishers have explicitly put a no-break space after such letters (the debug view makes them visible).
If there are some but not many, and many single letters don't have them, then it really depends on the text/word/meaning/context, and we may not be able to do it automatically and have to expect publishers to add non-breaking spaces at the right places.
I think we saw a lot of them in some Polish book at the time we added it, so it felt safe to ensure it via the code (even if I'm sure it causes false positives).

@dmalinovsky
Contributor Author

And reading Kafka, you would then never see "Joseph K" at the end of a line, but always:

In Russian, initials will have a period added, so it'll be "Joseph K." and won't be affected by my change. As far as I know, only prepositions and conjunctions should have one letter length.

Not all books abide by these rules, unfortunately, but it's considered a sign of good typography. For example, one of the biggest Russian ebook sellers, LitRes, is using non-breaking spaces in FB2 files it produces. They also have an English website, litres.com.

@dmalinovsky
Contributor Author

dmalinovsky commented Mar 16, 2024

Here's a sample FB2 file from LitRes:
70244713.fb2.zip

Note that it's using character code 160 (U+00A0, the no-break space, which is outside ASCII) for non-breaking spaces, so you'll have to use a hex editor or something similar to see them. They're added after 1- or 2-letter prepositions and short conjunctions ("а" and "и").

@dmalinovsky
Contributor Author

Okay, I found one exception which will be broken by my change. As @poire-z correctly noted, we should leave particles at the end of the line, and there's one 1-letter particle, "б".

Let me update the PR to explicitly list prepositions and conjunctions.

Otherwise, a particle "б" will be also included, and it should stay at the end of the line.
@hius07
Member

hius07 commented Mar 16, 2024

Pages from academic classical books.
I personally do not feel it as an issue.

1

2

@dmalinovsky
Contributor Author

Pages from academic classical books. I personally do not feel it as an issue.

Fair enough. If others feel the same, I'll close this PR.

@ptrm

ptrm commented Mar 18, 2024

@poire-z

It also shouldn't be just a question of taste, or it should be a taste shared by many. For Polish, there is indeed some state/academic documentation that specifies the letters that are prepositions, and K is among them, so they would have this issue with Kafka :) Pinging @ptrm: do such false positives happen often?

TL;DR: Cyrillic U+043A к is not Latin U+006B k, and the same goes for uppercase versions, so no risk of a false positive here :) I took a closer look at the comments and PR code only after writing the digressions below, but maybe they'll be of any use for future reference ;)

K is an archaic preposition that started disappearing in the 16th century, so in this case no ;) Also, initials are followed by a dot in Polish, so even "Józef W." or "Z." would not cause a false positive here in the case of contemporary prepositions. I think a false positive here would require a very rare case, because e.g. the "A" team is in quotation marks, etc. ;)

Then most publishers (I can't even think of a counter-example) add a non-breaking space after prepositions, so that never was an issue in my case. Also, dangling prepositions are frowned upon, but not considered errors, in Polish.

@dmalinovsky
Contributor Author

Here's a recommendation from another well known resource about Russian language, its grammar, etc.: https://gramota.ru/spravka/vopros/294020

In general, yes, because it is extremely undesirable to leave one-letter prepositions and conjunctions at the end of a line that begin a sentence. By the way, in book publications it is not recommended to leave single-letter conjunctions and prepositions at the end of a line, even in the middle of sentences (in magazines, newspapers, information publications and publications of operational printing, this is allowed).

@dmalinovsky
Contributor Author

I wish there was a way to make it a local only change with custom hyphenation rules, but alas...

@poire-z
Contributor

poire-z commented Mar 18, 2024

We can try it - there's a few weeks until next KOReader release to see how good or bad it makes things.

Maybe enable it only for "ru": will it be OK to switch the typography to Ukrainian or Belarusian to compare? Or are other typography rules, like hyphenation, different enough that other things will be at play and we won't really be able to compare?

Can you fix the indentation for case 'К': and case 'к': ?

Also, just for the culture of us non-Cyrillic readers, could you add the English meaning of these prepositions, as was kindly done for Polish:

case 'Z': // Meaning in english:
case 'a': // and
case 'i': // and
case 'o': // about
case 'u': // at
case 'w': // in
case 'z': // with

@ptrm: I was just asking about any false positives (not only about Joseph K. :) and maybe there are people whose last name is just a single letter). And maybe if there are false positives, they just go unnoticed.

Also improved indentation and reworded the comment.
@poire-z
Contributor

poire-z commented Mar 18, 2024

I wish there was a way to make it a local only change with custom hyphenation rules, but alas...

Note that we have 3 lang tags we can force-set for Russian, so you could apply your tweaks to only one of ru-GB or ru-US (I don't know whether these reach deep down to our lang_tag here), but then it won't really be tested.

https://github.com/koreader/koreader/blob/9387fcd2d0af29a0b915b53b598f561372522e88/frontend/apps/reader/modules/readertypography.lua#L87-L89

Not advising to do that, just mentioning it in case it gives other ideas.

@dmalinovsky
Contributor Author

Maybe enable it only for "ru": will it be OK to switch the typography to Ukrainian or Belarusian to compare? Or are other typography rules, like hyphenation, different enough that other things will be at play and we won't really be able to compare?

Sure, I think I was too hasty to suggest extending it to other languages as well.

Can you fix the indentation for case 'К': and case 'к': ?

Done.

Also, just for the culture of us non-Cyrillic readers, could you add the English meaning of these prepositions, as was kindly done for Polish:

Done.

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

@poire-z
Contributor

poire-z commented Mar 18, 2024

Thanks.

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

I guess it's fine - it reads fine in Github web, and I guess you compiled and tested it and it works.
(It's just me with my old latin1 environment that will see gibberish - but so is cyrillic to me anyway :))

I may push a PR in the coming days - so I'll merge this one then, if nobody else stops us here - and bump everything into KOReader.

@dmalinovsky
Contributor Author

I guess it's fine - it reads fine in Github web, and I guess you compiled and tested it and it works.

To be on the safe side, I've replaced raw letters with UTF-32 sequences. The same way is used for the quotes in the file anyway.

@ptrm

ptrm commented Mar 18, 2024

@ptrm: I was just asking about any false positives (not only about Joseph K. :) and maybe there are people whose last name is just a single letter). And maybe if there are false positives, they just go unnoticed.

Yeah, and I think I answered about other single letters too, and still can't think of any cases. I think those false positives would be extremely rare, but sure, foreign names may cause such cases :)

And since we're talking corner cases, I guess checking if an uppercase letter is at the beginning of a sentence (otherwise not a preposition for sure) would be hard to do?

@poire-z
Contributor

poire-z commented Mar 18, 2024

Also, is it okay to specify Cyrillic letters as UTF-8? Do I need to do something special for the encoding?

I guess it's fine - it reads fine in Github web, and I guess you compiled and tested it and it works. (It's just me with my old latin1 environment that will see gibberish - but so is cyrillic to me anyway :))

Or I dunno. I remember seeing some non-ASCII char literals quoted with L'x'.
https://en.cppreference.com/w/cpp/language/character_literal
Maybe use 0x0432 like elsewhere, and put the literal Cyrillic char in the // comment?

To be on the safe side, I've replaced raw letters with UTF-32 sequences. The same way is used for the quotes in the file anyway.

Oh, I see you just did that. I think U"" (double quotes) is for a string of multiple chars, and U'' (single quotes) is for a single char.
Maybe just use 0x1234 (without any quotes) to be similar to how non-ASCII chars are done elsewhere in textlang.cpp, just for consistency in style?

@dmalinovsky
Contributor Author

Maybe just use 0x1234 (without any quotes) to be similar to how non-ASCII chars are done elsewhere in textlang.cpp, just for consistency in style?

I looked at line 304, for example, and copied it. Looks like there are 2 styles in the file. :)

@poire-z
Contributor

poire-z commented Mar 18, 2024

I guess checking if an uppercase letter is at the beginning of a sentence (otherwise not a preposition for sure) would be hard to do?

I think so - I also don't want to put too much (any :)) heuristics about grammar (what is a sentence start) and how much to look ahead/behind, what to skip, etc... in that low level code :).

dmalinovsky added a commit to dmalinovsky/koreader that referenced this pull request Mar 18, 2024
@poire-z
Contributor

poire-z commented Mar 18, 2024

I looked at line 304, for example, and copied it. Looks like there are 2 styles in the file. :)

Maybe I copied that list from elsewhere, or I thought this list of quotes could be ready for multi-codepoint quotes when needed, even if there are only single-codepoint quotes in it currently.

@dmalinovsky
Contributor Author

Maybe I copied that list from elsewhere, or I thought this list of quotes could be ready for multi-codepoint quotes when needed, even if there are only single-codepoint quotes in it currently.

Makes sense. I've updated the style to match.

@poire-z poire-z merged commit 8e055dd into koreader:master Mar 20, 2024
1 check passed
@dmalinovsky
Contributor Author

@poire-z, thank you! Can you please also merge koreader/koreader#11570 later? It's a cosmetic change.

@poire-z
Contributor

poire-z commented Mar 20, 2024

Yes I will, just after the PR I'll make to bump all this crengine stuff, to keep things in a logical order.

@poire-z
Contributor

poire-z commented Mar 21, 2024

I'm not sure you tested these changes, so please do :)
Our OTA/nightly build download server is down. In the meantime, you can manually download a zip at
https://gitlab.com/koreader/nightly-builds/-/pipelines by clicking on the right of the topmost (most recent) pipeline.

@dmalinovsky
Contributor Author

dmalinovsky commented Mar 21, 2024

I tested the last nightly, and this change works.

Before:

After:

Note that the circled preposition "в" moved from the end of the line to the next one. Also, the total page number stayed the same, so the effect from this hyphenation restriction is pretty small.


7 participants