Combine hyphenation patterns for Serbian Cyrillic and Latin scripts #566

Merged
merged 1 commit into from
Jun 15, 2024


eevan78
Contributor

@eevan78 eevan78 commented May 29, 2024

This pull request continues the work started in pull request #372.
As the Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file. That way, it doesn't matter which script is used, and even texts that mix both scripts will be hyphenated properly. Only the main part (the primary subtag) of the language tag in (X)HTML should be consulted to load the appropriate patterns, so sr, sr-Cyrl, sr-Latn, and regional variants of these (like sr_RS) should all load the same pattern file.
This approach is already successfully implemented in ConTeXt.
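
To illustrate the tag handling described above, here is a minimal Python sketch, not the actual crengine code. The function names and the `PATTERN_FILES` table (including the file names) are hypothetical; the only idea taken from this PR is that everything after the primary subtag is ignored.

```python
from typing import Optional

# Hypothetical table: primary language subtag -> pattern file name.
# The file names are placeholders, not the names used in the repository.
PATTERN_FILES = {
    "sr": "Serbian.pattern",
    "hr": "Croatian.pattern",
}


def primary_subtag(lang_tag: str) -> str:
    """Return the lowercase primary subtag of a tag like 'sr-Cyrl' or 'sr_RS'."""
    # Accept both '-' and '_' as separators, since both show up in practice.
    normalized = lang_tag.strip().replace("_", "-")
    return normalized.split("-", 1)[0].lower()


def pattern_file_for(lang_tag: str) -> Optional[str]:
    """Pick a pattern file from the primary subtag only, ignoring script and region."""
    return PATTERN_FILES.get(primary_subtag(lang_tag))


for tag in ("sr", "sr-Cyrl", "sr-Latn", "sr_RS"):
    print(tag, "->", pattern_file_for(tag))
```

With this kind of lookup, a book tagged `sr-Latn` and one tagged `sr-Cyrl` end up with the same combined file.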

The patterns have been converted from https://devbase.net/dict-sr/, the same ones used in the LibreOffice extension Serbian Spellchecker.
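
The claim that combining is safe because the two scripts occupy different codepoints can be checked mechanically. The sketch below is only an illustration under assumptions: the toy patterns are made up (they are not taken from the real pattern file), and `merge_patterns` is a hypothetical helper, not part of any tool used for this PR.

```python
def letters(patterns):
    """Collect every alphabetic character used in a list of patterns."""
    return {ch for pattern in patterns for ch in pattern if ch.isalpha()}


def merge_patterns(cyrillic, latin):
    """Concatenate two pattern lists, refusing to merge if they share any letter."""
    shared = letters(cyrillic) & letters(latin)
    if shared:
        raise ValueError(f"patterns share letters: {sorted(shared)}")
    return list(cyrillic) + list(latin)


# Made-up toy patterns. The Cyrillic ones are written with escapes so the
# codepoints are unambiguous (U+0430 is Cyrillic 'a', not Latin 'a').
cyrillic_patterns = ["1\u0431\u0430", "\u04301\u0432"]
latin_patterns = ["1ba", "a1v"]

merged = merge_patterns(cyrillic_patterns, latin_patterns)
print(len(merged), "patterns in the combined list")
```

Digits and dots are shared between the two sets, but they only carry hyphenation weights and word boundaries, so they are deliberately left out of the check.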



…ript

Combine the patterns for Cyrillic and Latin scripts.
@Frenzie
Member

Frenzie commented May 29, 2024

> As the Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file.

You mean the Latin one is currently completely absent I presume? As phrased it sounds a bit like you forgot to delete it. :-)

@poire-z
Contributor

poire-z commented May 29, 2024

Pinging @strn @roshavagarga who contributed to #372 for thoughts and approval.

@roshavagarga
Contributor

@poire-z I'd say @strn would be able to give a more valid opinion on whether this is something that should be done, as my understanding of Serbian and of the cultural connotations of the above change is fairly basic.

If it works out-of-the-box and there aren't any cultural reasons not to do this, I don't see an issue.

I would note, however, that I'm not sure how the source(s) used for this compare to the one we currently use for Serbian, so possibly something to compare and/or test? (Taken from here)

@eevan78
Contributor Author

eevan78 commented May 29, 2024

> > As the Serbian language uses two scripts with different codepoints, it is safe to combine the patterns into one file.
>
> You mean the Latin one is currently completely absent I presume? As phrased it sounds a bit like you forgot to delete it. :-)

You are right, they are currently absent. When I read a Serbian book written in the Latin script, I have to change the language to Croatian. That loads the Croatian patterns, which are based on the same Latin script. Otherwise, there is no hyphenation.

@eevan78
Contributor Author

eevan78 commented May 29, 2024

> I would note, however, that I'm not sure how the source(s) used for this compare to the one we currently use for Serbian, so possibly something to compare and/or test? (Taken from here)

Those are the same patterns, made by Dejan Muhamedagić, used in TeX.
I just had to convert them to UTF-8, as the source files use ISO 8859-2 (for the Latin patterns) and ISO 8859-5 (for the Cyrillic patterns) encoding.

Serbian hyphenation patterns are derived from official TeX patterns for Serbocroatian language (Cyrillic and Latin) created by Dejan Muhamedagić, version 2.02 from 22 June 2008 adopted for usage with Hyphen hyphenation library and released under GNU LGPL version 2.1 or later.
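
For reference, the encoding conversion mentioned above amounts to decoding each source with its own codepage and re-encoding as UTF-8. This is a minimal sketch, not the script actually used for this PR: `to_utf8` is a hypothetical helper and the sample patterns are made up. The codec names `iso8859_2` and `iso8859_5` are the ones from Python's standard library.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy codepage and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Made-up sample data standing in for the legacy pattern files.
latin_raw = "1\u010da".encode("iso8859_2")           # Latin pattern containing 'č'
cyrillic_raw = "1\u0431\u0430".encode("iso8859_5")   # Cyrillic pattern 'ба'

# Each source is converted with its own encoding, then the results are
# concatenated into one UTF-8 byte string, one pattern per line.
combined = b"\n".join([
    to_utf8(cyrillic_raw, "iso8859_5"),
    to_utf8(latin_raw, "iso8859_2"),
])
print(combined.decode("utf-8"))
```

Decoding with the wrong codepage would not raise an error here (both are single-byte encodings), it would just produce the wrong letters, so each file has to be paired with the right encoding by hand.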

@poire-z
Contributor

poire-z commented Jun 9, 2024

Pinging again @strn - please give us some feedback.

@strn
Contributor

strn commented Jun 9, 2024

@poire-z , sorry for the late reply.

Yes, if the patterns are the same, then they should be used for hyphenating texts in the Serbian language, regardless of how it is written now.

However, let me just emphasize and remind you once again that only Serbian Cyrillic is a valid alphabet for the Serbian language. Usage of the Croatian Latin alphabet comes from the Yugoslav era and is best left there.

@eevan78
Contributor Author

eevan78 commented Jun 10, 2024

As I've already said, this is just a technical matter that removes the need to change languages when reading books typeset in the Latin script.

@strn Can you please point to some valid reference that supports your claims?
Are you saying that, for example, these are Croatian books? The Cyrillic script is defined as an official script in the Constitution, and both scripts are used in daily correspondence, media, newspapers and publishing, whether we like it or not.
Personally, I'm using the Cyrillic script, but many other people that I know are not.
That's the only reason I'm proposing to unify the patterns in one file, purely as a convenience to the user.

@poire-z poire-z merged commit ab1d541 into koreader:master Jun 15, 2024
1 check passed
Frenzie pushed a commit to koreader/koreader that referenced this pull request Jun 16, 2024
Includes:
- Russian hyphenation: revert "allow hyphens after не" koreader/crengine#568
- Serbian hyphenation: combine patterns for Cyrillic and Latin scripts koreader/crengine#566
- writeNodeEx(): fix handling of multilines attribute values koreader/crengine#569
  See #12004 (comment).
- Add getBalancedHTML() helper

Also includes:
- kobo: add missing blitbuffer library koreader/koreader-base#1823
5 participants