Added Serbian hyphenation patterns #372

strn · 2020-08-30T08:21:25Z

This change is

poire-z · 2020-08-30T08:39:53Z

Please also add it among these:

Lines 33 to 36 in 9d86806

    
           } _hyph_dict_table[] = { 
        
               { "bg",    "Bulgarian",     "Bulgarian.pattern",     2, 2 }, 
        
               { "ca",    "Catalan",       "Catalan.pattern",       2, 2 }, 
        
               { "cs",    "Czech",         "Czech.pattern",         2, 2 },

Also, dunno if Serbian shares with Russian these quote behaviours:

crengine/crengine/src/textlang.cpp

Lines 699 to 703 in 9d86806

    
           else if ( LANG_STARTS_WITH(("ru")) ) { // Russian 
        
               has_left_double_quotation_mark_closing = true; 
        
               has_left_double_angle_quotation_mark_opening = true; 
        
               has_right_double_angle_quotation_mark_closing = true; 
        
           }

poire-z · 2020-08-30T10:00:14Z

crengine/src/textlang.cpp

+        has_right_single_quotation_mark_opening = true;
+        has_right_single_quotation_mark_closing = true;


This is wrong: you can't have right_single_quotation_mark ’ both closing and opening here (actually, this is the default behaviour if you don't set anything about it here - so, if ’ can really be used for open & close, don't set anything about it).
Don't feel obliged to set anything, unless you really know about your Serbian typography :) (not much mentionned about it in https://en.wikipedia.org/wiki/Quotation_mark )

Corrected as advised, but also corrected line 404, according to https://en.wikipedia.org/wiki/Quotation_mark (standard primary and secondary quotation marks).

poire-z · 2020-08-30T10:53:34Z

crengine/src/textlang.cpp

+        has_left_double_quotation_mark_closing = true;
+        has_right_double_quotation_mark_closing = true;


You're really sure about these, both “ and ” closing ? (meaning these don't come in pair).
Note that these rules are just to avoid/allow some line breaking.
I think (it's all really complicated) that with this, you won't have a break before “ and ”, whether there is a space or not before them. But you're allowing a break after, even if stuck to a word.
So, better be sure that's really how old and current Serbian have been using these “ and ”.

If you usually don't have a space before such quotes (we do have these spaces in French), and you don't have constructs like good o' stuff (guess that's the reason for has_left_single_quotation_mark_opening=true in english), it's safer to not set anything :)

Yes, I am really sure about closing double quotes in Serbian language - "official" double quotes are „”, but alternate way of writing is „“ so both ” and “ can be closing double quotes. Please see previously quoted Wiki page about quotation marks. Serbian language is my mother tongue.

BTW, what is "old" and "new" Serbian?

Typography rules have a tendency to shift as the language evolves ;).

OK, trusting you on this :)

both ” and “ can be closing

OK, just note that these rules are not so much about "can be" , but are rather saying:
both ” and “ are always closing (and have always been closing, even in 18th century serbian)

BTW, what is "old" and "new" Serbian?

Dunno :) Just that sometimes, some typographical rules might be recent (if "official"), and might give wrong rendering with older text/books that had different punctuation/quotes usage at the time there were written.

OK, since we are all sure and there are no issues, what is preventing PR from being approved?

both ” and “ are always closing

Yes, both “ and ” are always closing quotation marks, just as „ is always opening quotation mark. https://en.wikipedia.org/wiki/Quotation_mark

OK, since we are all sure and there are no issues, what is preventing PR from being approved?

As merging and bumping it up to frontend takes a bit of work and time, we just want you to be sure and that you answer all our (us outsiders) remarks and questions :) so we're sure we're doing the right thing and don't have to correct it tomorrow and redo the bump.

So, another one :) :

just as „ is always opening

Any reason then to not add U+201A_opening ? ... May be just because we don't have it yet :)
We don't have/needed any rules for

SINGLE LOW-9 QUOTATION MARK (U+201A) SINGLE HIGH-REVERSED-9 QUOTATION MARK (U+201B) DOUBLE LOW-9 QUOTATION MARK (U+201E) DOUBLE HIGH-REVERSED-9 QUOTATION MARK (U+201F)

So, if really needed, add them and their handling:

bool has_single_low9_quotation_mark_opening = false; // U+201A ‚ etc...

But yet again: if Serbian doesn't add any space after an opening or before a closing quote (= it always stick to the word that if precedes or follows), most of these are not needed.

Actually, that is what I was wondering about - why there is no lower double quotation mark? German language uses it, and Serbian language follows it when it comes to quotation marks. Russian, OTOH, follows French rules.

I was reluctant to do these changes since I am not familiar with the code, but OK,I can try adding the rules.

If you mean like in Dutch:

Iemand zei: „Hallo.”

It's simply because it's not like French:

Quelqu'un a dit « bonjour ».

Strictly speaking those should be non-breaking spaces, but in practice you'll often find regular spaces as in the sample. It's only for that specific scenario that exceptions are needed, because space means potential line-break. «bonjour» would be just a bunch of characters happily stuck together, no special handling required. ^_^

But yet again: if Serbian doesn't add any space after an opening or before a closing quote (= it always stick to the word that if precedes or follows), most of these are not needed.

Ah, now I understand. If so, no special handling of double quotation marks is required.

…language does not add spaces; added Serbian language to languages that duplicates hyphen on next line if word contains it (composite words)

poire-z · 2020-08-30T12:39:52Z

Looks fine to me. No afterthought ? Ready to merge ?
Will you then bump crengine in a koreader-base PR (git submodule update --remote thirdparty/kpvcrlib/crengine) ? Or should I do it for you ?

strn · 2020-08-30T12:49:50Z

I guess the question is for @Frenzie ?

roshavagarga · 2020-08-30T15:09:57Z

@strn @poire-z The file needs the exceptions from the original version

Also, shouldn't this file be named Serbian-Cyrillic and the commit should also include another .pattern file for Serbian-Latin? Serbian is written in both the Cyrillic and Latin alphabet, as can be read about in this Wikipedia article, so it makes sense to include both.

poire-z · 2020-08-30T15:36:26Z

I guess from the lang tag names:

crengine/crengine/src/textlang.cpp

Lines 402 to 403 in 9d86806

    
           { "sr-latn",  L"\x201e", L"\x201c", L"\x2018", L"\x2018" }, /* „ “ ‘ ‘ */ 
        
           { "sr",       L"\x201e", L"\x201c", L"\x2018", L"\x2018" }, /* „ “ ‘ ‘ */

the default for "sr" is Cyrillic, and Latn is to be added when latin is to be used.

(unlike Uzbek, which looks like it defaults to some kind of latin alphabet)

crengine/crengine/src/textlang.cpp

Lines 418 to 419 in 9d86806

    
           { "uz-cyrl",  L"\x201c", L"\x201d", L"\x2018", L"\x2019" }, /* “ ” ‘ ’ */ 
        
           { "uz",       L"\x201c", L"\x201d", L"\x2019", L"\x2018" }, /* “ ” ’ ‘ */

So, dunno if we should follow this to not mention the script when it is the default, or if we should for both, for clarity.

roshavagarga · 2020-08-30T15:44:41Z

@poire-z Uzbek has those because of historical reasons, but it only uses the Yañalif-based Latin script since the early 90s, so the old ones are probably for older documents that have been digitized?

Serbian uses both writing systems and will continue to do so, so they should both be added, but as far as which one should be the default I'm guessing you'd need a frequency check to see what's more common - a Serbian ebook in Cyrillic or in Latin. As far as naming, I have no idea about other languages, but for Serbian it should probably have the script for clarity's sake.

poire-z · 2020-08-30T16:00:10Z

for older documents that have been digitized?

Or for newer documents that would quote older text with <blockquote lang="uz-cyrl">

which one should be the default I'm guessing you'd need a frequency check to see what's more common

There's no default to have. Everything is driven by:

the lang tags in the HTML
the lang tag of the book (there, there could be some ambiguity with "srp" ?)
the user UI language choice (but there also, we should/might have both translations).

The point is that the specs says "sr" standalone is Cyrillic, "sr-latn" should be used when text is in Serbian-Latin. No decision to make about defaults.
So, if we mention Serbian (Cyrillic) and Serbian (Latin) among the UI languages and Typography languages, we should be just fine without having to thing about defaults.

roshavagarga · 2020-08-30T16:09:00Z

Ah, that makes things easier, not having to choose a default that is. Yeah, I think both should be available for the UI translations and as typography languages as a unique exception.

strn · 2020-08-30T16:18:30Z

To reiterate: default script for Serbian language (sr) is Serbian Cyrillic. Serbian Latin (sr-latn) is a political construct leftover from Serbocroatian language (until 1990, before breakup of Yugoslavia). If there is a need for Serbian written using Croatian Latin alphabet it can be easily added.

Hence, in UI languages it should say:

Serbian (default script: Cyrillic)
[if/when needed] Serbian Latin (script: Croatian Latin)

Let me remind you that sites like Google, Facebook, Twitter, Instagram etc. correctly use only Cyrillic alphabet for their Serbian localization.

roshavagarga · 2020-08-30T16:40:11Z

@strn You should still add the exceptions to your Cyrillic pattern file as well as the Latin pattern file itself, but as poire-z mentioned, there is no "default", whichever is needed is selected based on the file's tags or whatever you'd like to call them.

strn · 2020-08-30T16:53:04Z

What XML tag is used for adding exceptions to .pattern file?

roshavagarga · 2020-08-30T16:57:06Z

@strn Just take a look at the English_GB file and its last few lines and compare them to the original lines here.

strn · 2020-08-30T16:57:08Z

@roshavagarga I was talking about default script for given language - i.e. book having tag "sr-latn" should not be written in Serbian Cyrillic alphabet and vice versa. Of course, book language should be picked from EPUB tags or something like that.

strn · 2020-08-30T17:31:20Z

Also, shouldn't this file be named Serbian-Cyrillic and the commit should also include another .pattern file for Serbian-Latin?

At the moment this commit will not include .pattern file for Serbian (Latin). Link quoted by @roshavagarga refers to Serbo-Croatian (sh-latn) hyphenization. Once someone makes Serbian (Latin) hyphenization file (sr-latn) and checks it in TeX hyphenation repository then it can be added.

roshavagarga · 2020-08-30T17:38:49Z

@strn My mistake, didn't notice that, but in this case you can use the LibreOffice .dic hyphenation file as your source, while option B would be to use a script and substitute the Cyrillic letters in your current pattern with their transliteration Latin equivalents?

strn · 2020-08-30T17:41:59Z

Yes, that would be possible. However, we can do it at a later point in time.

poire-z · 2020-08-31T11:14:39Z

Bumping base for you, koreader/koreader-base#1177 .
Once it's merged (later in the afternoon), you can bump base in your koreader/koreader#6596 with git submodule update --remote base/

poire-z · 2020-08-31T12:35:16Z

@strn : base merged. All ready for your frontend final touches.

Added Serbian hyphenation patterns

37669ae

Added Serbian quotation mark rules

78915c0

poire-z reviewed Aug 30, 2020

View reviewed changes

Corrected Serbian quotation mark rules

a8ab79a

poire-z reviewed Aug 30, 2020

View reviewed changes

Removed special handling of opening/closing double quotes as Serbian …

e078826

…language does not add spaces; added Serbian language to languages that duplicates hyphen on next line if word contains it (composite words)

Added exceptions for Serbian words hyphenization

61cca30

poire-z merged commit 02cd126 into koreader:master Aug 31, 2020

poire-z mentioned this pull request Aug 31, 2020

bump crengine: TextLang, hyphenation: add Serbian koreader/koreader-base#1177

Merged

poire-z mentioned this pull request Sep 1, 2020

bump crengine: TextLang, hyphenation: add Serbian koreader/koreader#6618

Merged

NiLuJe mentioned this pull request Nov 24, 2020

Add Latin hyphenation koreader/koreader#6910

Merged

eevan78 mentioned this pull request May 29, 2024

Combine hyphenation patterns for Serbian Cyrillic and Latin scripts #566

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Added Serbian hyphenation patterns #372

Added Serbian hyphenation patterns #372

strn commented Aug 30, 2020 •

edited by Frenzie

Loading

poire-z commented Aug 30, 2020

poire-z Aug 30, 2020

strn Aug 30, 2020 •

edited

Loading

poire-z Aug 30, 2020

strn Aug 30, 2020 •

edited

Loading

NiLuJe Aug 30, 2020

poire-z Aug 30, 2020

strn Aug 30, 2020

strn Aug 30, 2020 •

edited

Loading

poire-z Aug 30, 2020

strn Aug 30, 2020

Frenzie Aug 30, 2020

strn Aug 30, 2020

poire-z commented Aug 30, 2020

strn commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

poire-z commented Aug 30, 2020 •

edited

Loading

roshavagarga commented Aug 30, 2020

poire-z commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020 •

edited

Loading

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

strn commented Aug 30, 2020 •

edited

Loading

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

poire-z commented Aug 31, 2020

poire-z commented Aug 31, 2020

		has_right_single_quotation_mark_opening = true;
		has_right_single_quotation_mark_closing = true;

		has_left_double_quotation_mark_closing = true;
		has_right_double_quotation_mark_closing = true;

Added Serbian hyphenation patterns #372

Added Serbian hyphenation patterns #372

Conversation

strn commented Aug 30, 2020 • edited by Frenzie Loading

poire-z commented Aug 30, 2020

Choose a reason for hiding this comment

strn Aug 30, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

strn Aug 30, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

strn Aug 30, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

poire-z commented Aug 30, 2020

strn commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

poire-z commented Aug 30, 2020 • edited Loading

roshavagarga commented Aug 30, 2020

poire-z commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020 • edited Loading

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

strn commented Aug 30, 2020 • edited Loading

roshavagarga commented Aug 30, 2020

strn commented Aug 30, 2020

poire-z commented Aug 31, 2020

poire-z commented Aug 31, 2020

strn commented Aug 30, 2020 •

edited by Frenzie

Loading

strn Aug 30, 2020 •

edited

Loading

strn Aug 30, 2020 •

edited

Loading

strn Aug 30, 2020 •

edited

Loading

poire-z commented Aug 30, 2020 •

edited

Loading

strn commented Aug 30, 2020 •

edited

Loading

strn commented Aug 30, 2020 •

edited

Loading