8-bit font encodings in *.ini files #267
Comments
@gmilde @jbezos I guess that Croatian is one of them. However, there are encodings = T1 OT1 LY1 On the other hand, I am not familiar with LY1. Please clarify. |
@gmilde Selecting |
IMO, a font encoding switch suggests itself, if the current encoding only provides a subset of the required encoding. For every language, we may distinguish two sets of encodings:¹
We should consider a suitable way to represent these sets in the *.ini files. Both sets may contain more than one font encoding with variations outside the letters actually used in the respective language. We may use some term or external list for supersets, e.g.
¹There is a grey zone when composite representations have wrong accent glyphs (like Romanian/Latvian characters with comma below in OT1) or misplaced accents (like the comma below in T1).
Currently, 173 babel/locale/*.ini files contain OT1 in the "encodings" list. All of them should be tested for OT1 compatibility. (There is no ogonek in Icelandic but thorn and eth fail with OT1, hyphenation requires T1.) @ivankokan All non-ASCII chars used in Croatian (č ć dž đ lj nj š ž Č Ć DŽ Dž Đ LJ Lj NJ Nj Š Ž) work with OT1 while hyphenation requires T1. (The double-letters are automatically decomposed here but the legacy characters work fine in my example file). IMO, Babel's on-the-fly/imported languages should
|
A choice for the default behavior must be made – prioritize either font or hyphenation. There are ~50 |
@gmilde @jbezos So, the issue with Croatian and OT1 is 99 % with the hyphenation (the other 1 % is about missing Đđ which is at least handled somehow)? I leave you two to decide whether the OT1 should be excluded or not (indifferent on this matter but generally tend to be strict :D). |
Đđ are handled exactly like the other "adorned" characters: T1 has slots for pre-composed characters while in OT1 they are created by superposition of the base character and adornment (haček, acute, stroke, ...). This makes T1 the preferred font encoding for these languages and OT1 a "compatibility font encoding" (it works with some drawbacks). |
Loading a font encoding in the document preamble can be interpreted as a statement that the document author wants to use this font encoding at some place in the document. This is why I propose to switch the preferred font encoding for text parts in a "foreign" language if this font encoding is known. (Avoiding the font-encoding switch is as easy as deleting the respective font encoding from the list of "fontenc" arguments.)
This is why I propose to switch to an "ersatz" font encoding if the preferred font encoding is not known and the current font encoding is not in the list of compatible font encodings. If no compatible font encoding is declared, write a warning and try with the current font encoding (maybe it works because the missing characters are not used or the document provides some other workaround). In case of a compilation error, the combination of the actual error message and the preceding warning can give the user sensible feedback. |
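To make the proposed order of checks easier to discuss, here is a minimal sketch of the selection logic in Python. It is not how babel is implemented. The function name, its parameters, and the reading of "ersatz" as "the first loaded encoding from the compatible list" are assumptions made for illustration only.

```python
import warnings


def choose_font_encoding(current, loaded, preferred, compatible):
    """Pick a font encoding for a text part in some language (hypothetical helper).

    current    -- encoding active at the language switch
    loaded     -- encodings the document loaded via fontenc
    preferred  -- the language's preferred encodings, best first
    compatible -- encodings that work with drawbacks, best first
    """
    # 1. A loaded preferred encoding wins, even if the current one would work.
    for enc in preferred:
        if enc in loaded:
            return enc
    # 2. No preferred encoding is loaded: keep the current one if it is compatible.
    if current in compatible:
        return current
    # 3. Otherwise fall back to an "ersatz": the first loaded compatible encoding
    #    (assumed interpretation of the proposal).
    for enc in compatible:
        if enc in loaded:
            return enc
    # 4. Nothing suitable is loaded: warn and try with the current encoding.
    warnings.warn(
        f"no suitable font encoding loaded (preferred: {list(preferred)}, "
        f"compatible: {list(compatible)}); keeping {current}"
    )
    return current
```

With Croatian modelled as `preferred=["T1"]` and `compatible=["OT1"]`, a document that loads both T1 and OT1 would switch to T1, while a document that loads only OT1 would stay in OT1 without a warning.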
There are several issues with the font encoding switches for languages "imported" from *.ini files or used "on-the-fly":
Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.
Correct hyphenation under 8-bit TeX is only supported for the font encoding used in the pattern file (or compatible ones).¹ This font encoding should be preferred².
¹ There is a list of "8-bit hyphenation encodings" in the "Languages" table in https://hyphenation.org/index.html. It may be a bit outdated and incomplete but may serve as a start. In the table, EC stands for T1 and ASCII for "OT1 or ASCII compatible" (languages that don't use accented characters). It may be helpful to get in touch with the maintainers of https://www.ctan.org/pkg/hyph-utf8.
² It depends on the specific case, whether a font encoding switch is actually better or not:
As some of these encodings have limited font support, a secondary choice may be OK for single words or short quotes.
Marking up a text part in a different language may just serve external issues (spellcheck) but could also imply the author's intention to get correct hyphenation and correct representation of "exotic" characters.
Loading a font encoding with "fontenc" may serve as an indicator that it should be used for languages that have it as "first choice".
The following example shows the problems:
Two errors due to the ogonek accent, wrong characters for the "comma below" accent, suboptimal font encoding choices with respect to hyphenation and use of pre-composed characters.