Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added Serbian hyphenation patterns #372

Merged
merged 5 commits into from Aug 31, 2020
Merged

Added Serbian hyphenation patterns #372

merged 5 commits into from Aug 31, 2020

Conversation

strn
Copy link
Contributor

@strn strn commented Aug 30, 2020

This change is Reviewable

@poire-z
Copy link
Contributor

poire-z commented Aug 30, 2020

Please also add it among these:

} _hyph_dict_table[] = {
{ "bg", "Bulgarian", "Bulgarian.pattern", 2, 2 },
{ "ca", "Catalan", "Catalan.pattern", 2, 2 },
{ "cs", "Czech", "Czech.pattern", 2, 2 },

Also, dunno if Serbian shares with Russian these quote behaviours:
else if ( LANG_STARTS_WITH(("ru")) ) { // Russian
has_left_double_quotation_mark_closing = true;
has_left_double_angle_quotation_mark_opening = true;
has_right_double_angle_quotation_mark_closing = true;
}

Comment on lines 706 to 707
has_right_single_quotation_mark_opening = true;
has_right_single_quotation_mark_closing = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong: you can't have right_single_quotation_mark both closing and opening here (actually, this is the default behaviour if you don't set anything about it here - so, if can really be used for open & close, don't set anything about it).
Don't feel obliged to set anything, unless you really know about your Serbian typography :) (not much mentionned about it in https://en.wikipedia.org/wiki/Quotation_mark )

Copy link
Contributor Author

@strn strn Aug 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Corrected as advised, but also corrected line 404, according to https://en.wikipedia.org/wiki/Quotation_mark (standard primary and secondary quotation marks).

Comment on lines 706 to 707
has_left_double_quotation_mark_closing = true;
has_right_double_quotation_mark_closing = true;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're really sure about these, both and closing ? (meaning these don't come in pair).
Note that these rules are just to avoid/allow some line breaking.
I think (it's all really complicated) that with this, you won't have a break before “ and ”, whether there is a space or not before them. But you're allowing a break after, even if stuck to a word.
So, better be sure that's really how old and current Serbian have been using these and .

If you usually don't have a space before such quotes (we do have these spaces in French), and you don't have constructs like good o' stuff (guess that's the reason for has_left_single_quotation_mark_opening=true in english), it's safer to not set anything :)

Copy link
Contributor Author

@strn strn Aug 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I am really sure about closing double quotes in Serbian language - "official" double quotes are „”, but alternate way of writing is „“ so both and can be closing double quotes. Please see previously quoted Wiki page about quotation marks. Serbian language is my mother tongue.

BTW, what is "old" and "new" Serbian?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typography rules have a tendency to shift as the language evolves ;).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, trusting you on this :)

both ” and “ can be closing

OK, just note that these rules are not so much about "can be" , but are rather saying:
both ” and “ are always closing (and have always been closing, even in 18th century serbian)

BTW, what is "old" and "new" Serbian?

Dunno :) Just that sometimes, some typographical rules might be recent (if "official"), and might give wrong rendering with older text/books that had different punctuation/quotes usage at the time there were written.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, since we are all sure and there are no issues, what is preventing PR from being approved?

Copy link
Contributor Author

@strn strn Aug 30, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

both ” and “ are always closing

Yes, both and are always closing quotation marks, just as is always opening quotation mark. https://en.wikipedia.org/wiki/Quotation_mark

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, since we are all sure and there are no issues, what is preventing PR from being approved?

As merging and bumping it up to frontend takes a bit of work and time, we just want you to be sure and that you answer all our (us outsiders) remarks and questions :) so we're sure we're doing the right thing and don't have to correct it tomorrow and redo the bump.

So, another one :) :

just as „ is always opening

Any reason then to not add U+201A_opening ? ... May be just because we don't have it yet :)
We don't have/needed any rules for

SINGLE LOW-9 QUOTATION MARK (U+201A)
SINGLE HIGH-REVERSED-9 QUOTATION MARK (U+201B)
DOUBLE LOW-9 QUOTATION MARK (U+201E)
DOUBLE HIGH-REVERSED-9 QUOTATION MARK (U+201F)

So, if really needed, add them and their handling:

bool has_single_low9_quotation_mark_opening = false; // U+201A ‚
etc...

But yet again: if Serbian doesn't add any space after an opening or before a closing quote (= it always stick to the word that if precedes or follows), most of these are not needed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, that is what I was wondering about - why there is no lower double quotation mark? German language uses it, and Serbian language follows it when it comes to quotation marks. Russian, OTOH, follows French rules.

I was reluctant to do these changes since I am not familiar with the code, but OK,I can try adding the rules.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you mean like in Dutch:

Iemand zei: „Hallo.”

It's simply because it's not like French:

Quelqu'un a dit « bonjour ».

Strictly speaking those should be non-breaking spaces, but in practice you'll often find regular spaces as in the sample. It's only for that specific scenario that exceptions are needed, because space means potential line-break. «bonjour» would be just a bunch of characters happily stuck together, no special handling required. ^_^

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But yet again: if Serbian doesn't add any space after an opening or before a closing quote (= it always stick to the word that if precedes or follows), most of these are not needed.

Ah, now I understand. If so, no special handling of double quotation marks is required.

…language does not add spaces; added Serbian language to languages that duplicates hyphen on next line if word contains it (composite words)
@poire-z
Copy link
Contributor

poire-z commented Aug 30, 2020

Looks fine to me. No afterthought ? Ready to merge ?
Will you then bump crengine in a koreader-base PR (git submodule update --remote thirdparty/kpvcrlib/crengine) ? Or should I do it for you ?

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

I guess the question is for @Frenzie ?

@roshavagarga
Copy link
Contributor

@strn @poire-z The file needs the exceptions from the original version

Also, shouldn't this file be named Serbian-Cyrillic and the commit should also include another .pattern file for Serbian-Latin? Serbian is written in both the Cyrillic and Latin alphabet, as can be read about in this Wikipedia article, so it makes sense to include both.

@poire-z
Copy link
Contributor

poire-z commented Aug 30, 2020

I guess from the lang tag names:

{ "sr-latn", L"\x201e", L"\x201c", L"\x2018", L"\x2018" }, /* „ “ ‘ ‘ */
{ "sr", L"\x201e", L"\x201c", L"\x2018", L"\x2018" }, /* „ “ ‘ ‘ */

the default for "sr" is Cyrillic, and Latn is to be added when latin is to be used.

(unlike Uzbek, which looks like it defaults to some kind of latin alphabet)

{ "uz-cyrl", L"\x201c", L"\x201d", L"\x2018", L"\x2019" }, /* “ ” ‘ ’ */
{ "uz", L"\x201c", L"\x201d", L"\x2019", L"\x2018" }, /* “ ” ’ ‘ */

So, dunno if we should follow this to not mention the script when it is the default, or if we should for both, for clarity.

@roshavagarga
Copy link
Contributor

@poire-z Uzbek has those because of historical reasons, but it only uses the Yañalif-based Latin script since the early 90s, so the old ones are probably for older documents that have been digitized?

Serbian uses both writing systems and will continue to do so, so they should both be added, but as far as which one should be the default I'm guessing you'd need a frequency check to see what's more common - a Serbian ebook in Cyrillic or in Latin. As far as naming, I have no idea about other languages, but for Serbian it should probably have the script for clarity's sake.

@poire-z
Copy link
Contributor

poire-z commented Aug 30, 2020

for older documents that have been digitized?

Or for newer documents that would quote older text with <blockquote lang="uz-cyrl">

which one should be the default I'm guessing you'd need a frequency check to see what's more common

There's no default to have. Everything is driven by:

  • the lang tags in the HTML
  • the lang tag of the book (there, there could be some ambiguity with "srp" ?)
  • the user UI language choice (but there also, we should/might have both translations).

The point is that the specs says "sr" standalone is Cyrillic, "sr-latn" should be used when text is in Serbian-Latin. No decision to make about defaults.
So, if we mention Serbian (Cyrillic) and Serbian (Latin) among the UI languages and Typography languages, we should be just fine without having to thing about defaults.

@roshavagarga
Copy link
Contributor

Ah, that makes things easier, not having to choose a default that is. Yeah, I think both should be available for the UI translations and as typography languages as a unique exception.

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

To reiterate: default script for Serbian language (sr) is Serbian Cyrillic. Serbian Latin (sr-latn) is a political construct leftover from Serbocroatian language (until 1990, before breakup of Yugoslavia). If there is a need for Serbian written using Croatian Latin alphabet it can be easily added.

Hence, in UI languages it should say:

  • Serbian (default script: Cyrillic)
  • [if/when needed] Serbian Latin (script: Croatian Latin)

Let me remind you that sites like Google, Facebook, Twitter, Instagram etc. correctly use only Cyrillic alphabet for their Serbian localization.

@roshavagarga
Copy link
Contributor

@strn You should still add the exceptions to your Cyrillic pattern file as well as the Latin pattern file itself, but as poire-z mentioned, there is no "default", whichever is needed is selected based on the file's tags or whatever you'd like to call them.

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

What XML tag is used for adding exceptions to .pattern file?

@roshavagarga
Copy link
Contributor

@strn Just take a look at the English_GB file and its last few lines and compare them to the original lines here.

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

@roshavagarga I was talking about default script for given language - i.e. book having tag "sr-latn" should not be written in Serbian Cyrillic alphabet and vice versa. Of course, book language should be picked from EPUB tags or something like that.

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

Also, shouldn't this file be named Serbian-Cyrillic and the commit should also include another .pattern file for Serbian-Latin?

At the moment this commit will not include .pattern file for Serbian (Latin). Link quoted by @roshavagarga refers to Serbo-Croatian (sh-latn) hyphenization. Once someone makes Serbian (Latin) hyphenization file (sr-latn) and checks it in TeX hyphenation repository then it can be added.

@roshavagarga
Copy link
Contributor

@strn My mistake, didn't notice that, but in this case you can use the LibreOffice .dic hyphenation file as your source, while option B would be to use a script and substitute the Cyrillic letters in your current pattern with their transliteration Latin equivalents?

@strn
Copy link
Contributor Author

strn commented Aug 30, 2020

Yes, that would be possible. However, we can do it at a later point in time.

@poire-z
Copy link
Contributor

poire-z commented Aug 31, 2020

Bumping base for you, koreader/koreader-base#1177 .
Once it's merged (later in the afternoon), you can bump base in your koreader/koreader#6596 with git submodule update --remote base/

@poire-z
Copy link
Contributor

poire-z commented Aug 31, 2020

@strn : base merged. All ready for your frontend final touches.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants