8-bit font encodings in *.ini files #267

Open
gmilde opened this issue Oct 16, 2023 · 7 comments

@gmilde

gmilde commented Oct 16, 2023

There are several issues with the font encoding switches for languages "imported" from *.ini files or used "on-the-fly":

  • Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.

  • Correct hyphenation under 8-bit TeX is only supported for the font encoding used in the pattern file (or compatible ones).¹ This font encoding should be preferred².

¹ There is a list of "8-bit hyphenation encodings" in the "Languages" table in https://hyphenation.org/index.html. It may be a bit outdated and incomplete but may serve as a start. In the table, EC stands for T1 and ASCII for "OT1 or ASCII compatible" (languages that don't use accented characters). It may be helpful to get in touch with the maintainers of https://www.ctan.org/pkg/hyph-utf8.

² It depends on the specific case, whether a font encoding switch is actually better or not:
As some of these encodings have limited font support, a secondary choice may be OK for single words or short quotes.
Marking up a text part in a different language may just serve external purposes (spellcheck) but could also imply the author's intention to get correct hyphenation and correct representation of "exotic" characters.
Loading a font encoding with "fontenc" may serve as an indicator that it should be used for languages that have it as "first choice".

The following example shows the problems:
it produces two errors due to the ogonek accent, wrong characters for the "comma below" accent, and suboptimal font encoding choices with respect to hyphenation and the use of pre-composed characters.

\documentclass[english]{article}
\usepackage{parskip}
\usepackage[T1,T2A,L7x,QX,OT1]{fontenc}
\usepackage{babel}
\makeatletter
\begin{document}

Default language,  
current font encoding \cf@encoding,
default font encoding \encodingdefault.

\foreignlanguage{bulgarian}{Български,
current font encoding \cf@encoding.}

\foreignlanguage{polish}{Język polski,
current font encoding \cf@encoding.
The \emph{ogonek} accent fails with OT1 (QX, L7x, T1, T2A work).
Correct hyphenation requires QX but font support is limited.}

\foreignlanguage{lithuanian}{Lietuvių kalba,
current font encoding \cf@encoding.
The \emph{ogonek} accent fails with OT1 (QX, L7x, T1, T2A work).
Correct hyphenation requires L7x but font support is limited.}

\foreignlanguage{latvian}{Latviešu valoda,
current font encoding \cf@encoding.
Correct rendering of \emph{comma below} accent (Ģ Ķ Ļ Ņ ģ ķ ļ ņ)
requires L7x or T1,
correct hyphenation requires L7x but font support is limited.}

\end{document}
@ivankokan
Contributor

ivankokan commented Oct 17, 2023

  • Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.

@gmilde @jbezos I guess that Croatian is one of them. However, there are \dj and \DJ implemented in the old days to address this issue, so eventually I think that OT1 can be left in the encodings list:

encodings = T1 OT1 LY1

On the other hand, I am not familiar with LY1. Please clarify.

@jbezos
Contributor

jbezos commented Oct 17, 2023

@gmilde Selecting OT1 as the main encoding and therefore the preferred one effectively makes many other Latin encodings no-op (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and, IIRC, LY1) is a superset. But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).

@gmilde
Author

gmilde commented Oct 17, 2023

Selecting OT1 as the main encoding and therefore the preferred one effectively makes many other Latin encodings no-op (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and, IIRC, LY1) is a superset.

IMO, a font encoding switch suggests itself, if the current encoding only provides a subset of the required encoding.

For every language, we may distinguish two sets of encodings:¹

  • canonical (hyphenation works, drag-and-drop works, characters are correctly represented in print) and
  • substitute (no compilation errors but some characters are composites leading to omissions in hyphenation and possibly errors with drag-and-drop/search from the PDF).

We should consider a suitable way to represent these sets in the *.ini files.

Both sets may contain more than one font encoding with variations outside the letters actually used in the respective language.

We may use some term or external list for supersets, e.g.

  • "canonical" OT1 implies that all standard text encodings and also non-standard but ASCII-compatible ones are "canonical" too (https://hyphenation.org uses the qualifier "ASCII").
  • "canonical" T1 implies that all standard text encodings (as well as LY1 and probably some more) should work as "substitute" font encodings.

¹There is a grey zone when composite representations have wrong accent glyphs (like Romanian/Latvian characters with comma below in OT1) or misplaced accents (like the comma below in T1).


But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).

Currently, 173 babel/locale/*.ini files contain OT1 in the "encodings" list. All of them should be tested for OT1 compatibility. (There is no ogonek in Icelandic but thorn and eth fail with OT1, hyphenation requires T1.)

@ivankokan All non-ASCII chars used in Croatian (č ć dž đ lj nj š ž Č Ć DŽ Dž Đ LJ Lj NJ Nj Š Ž) work with OT1 while hyphenation requires T1. (The double-letters are automatically decomposed here but the legacy characters work fine in my example file).
LY1 is an alternative to the T1 encoding developed by Y&Y and used in their commercial TeX implementation. "encguide.pdf" has an encoding table. For many western European languages it is a "canonical" encoding. For Croatian, it can be used as a "substitute" encoding (as can OT1, T2A and others).


IMO, Babel's on-the-fly/imported languages should

  • switch to a canonical font encoding if one is declared in the document.
  • Otherwise, emit a warning (suggesting to declare one of the canonical font encodings) and select a known substitute font encoding.
  • If no substitute font encoding is declared, emit a warning (suggesting to declare one of the canonical or at least one of the substitute font encodings).
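
The three steps above can be sketched in plain LaTeX. This is only an illustration of the proposed logic, not babel code: the macro names `\pick@declared`, `\picked@enc` and `\selectlangenc` are hypothetical, the encoding lists are passed by hand instead of being read from an *.ini file, and Latin Modern is assumed as a font family that is available in both QX and T1. The test relies on the kernel defining `\T@<enc>` when a font encoding is declared.

```latex
\documentclass{article}
\usepackage[QX,T1]{fontenc}   % T1 is loaded last and becomes the default
\usepackage{lmodern}          % assumption: a family available in QX and T1
\makeatletter
% Hypothetical helper (not a babel macro): set \picked@enc to the first
% encoding in the comma list #1 (no spaces) that has been declared.
% \DeclareFontEncoding{<enc>} defines \T@<enc>, so its existence is the test.
\newcommand*\pick@declared[1]{%
  \let\picked@enc\@empty
  \@for\@tempa:=#1\do{%
    \ifx\picked@enc\@empty
      \@ifundefined{T@\@tempa}{}{\edef\picked@enc{\@tempa}}%
    \fi}}
% #1 = canonical encodings, #2 = substitute encodings
\newcommand*\selectlangenc[2]{%
  \pick@declared{#1}%
  \ifx\picked@enc\@empty
    \@latex@warning{No canonical font encoding declared (one of: #1)}%
    \pick@declared{#2}%
    \ifx\picked@enc\@empty
      \@latex@warning{No substitute font encoding declared either
        (one of: #2); keeping \f@encoding}%
    \fi
  \fi
  \ifx\picked@enc\@empty\else
    \fontencoding{\picked@enc}\selectfont
  \fi}
\makeatother
\begin{document}
{\selectlangenc{QX}{T1,L7x,T2A}Język polski}  % QX is declared: switch to it

{\selectlangenc{L7x}{T1,QX}Lietuvių kalba}    % L7x is not: warn, fall back to T1
\end{document}
```

The sketch only picks an encoding; it does not touch hyphenation patterns or language switching, which babel handles separately.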

@jbezos
Contributor

jbezos commented Oct 18, 2023

A choice for the default behavior must be made – prioritize either font or hyphenation. There are ~50 fd files for QX vs. ~800 for T1, and the manual for babel-polish doesn’t even mention the former. The current rules prioritize fonts because a sudden change is usually meaningful, at the cost of some missing hyphens. The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.). If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.
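
The comment does not name the babel interface for setting a preferred encoding, so the following is only a manual workaround using kernel commands, independent of whatever babel offers. The wrapper `\polishQX` is a hypothetical name, Latin Modern is assumed as a family that provides QX fonts, and Polish is used on the fly as in the example at the top of the thread.

```latex
\documentclass[english]{article}
\usepackage[QX,T1]{fontenc}
\usepackage{lmodern}  % assumption: a font family available in QX
\usepackage{babel}
% Hypothetical wrapper, not a babel command: switch language, then force
% the encoding that matches the Polish hyphenation patterns.
\newcommand*\polishQX[1]{%
  \foreignlanguage{polish}{\fontencoding{QX}\selectfont #1}}
\begin{document}
English text, then \polishQX{Język polski}.
\end{document}
```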

@ivankokan
Contributor

ivankokan commented Oct 18, 2023

@ivankokan All non-ASCII chars used in Croatian (č ć dž đ lj nj š ž Č Ć DŽ Dž Đ LJ Lj NJ Nj Š Ž) work with OT1 while hyphenation requires T1.

@gmilde @jbezos So, the issue with Croatian and OT1 is 99 % with the hyphenation (the other 1 % is about missing Đđ which is at least handled somehow)? I leave you two to decide whether the OT1 should be excluded or not (indifferent on this matter but generally tend to be strict :D).

@gmilde
Author

gmilde commented Oct 19, 2023

@ivankokan

[...] missing Đđ which is at least handled somehow)?

Đđ are handled exactly like the other "adorned" characters: T1 has slots for pre-composed characters while in OT1 they are created by superposition of the base character and adornment (haček, acute, stroke, ...).
The same holds for, e.g., German umlauts (äöü) and French letters with grave and circumflex.
The legacy ligatures are converted to two characters (like in Unicode) already by "inputenc" (cf. utf8enc.dfu).

This makes T1 the preferred font encoding for these languages and OT1 a "compatibility font encoding" (it works with some drawbacks).
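
A minimal sketch of the difference, using explicit accent commands so that the input is the same in both encodings. Both lines should print the same words; in T1 each accented letter is one font slot, while in OT1 it is built from an accent and a base letter, which is what interrupts hyphenation.

```latex
\documentclass{article}
\usepackage[OT1,T1]{fontenc}
\begin{document}
% T1: \v{c}, \v{s} and \"{a} map to single pre-composed glyphs.
{\fontencoding{T1}\selectfont \v{c}a\v{s}a, K\"{a}se}

% OT1: the same input is built from an accent placed over the base letter.
{\fontencoding{OT1}\selectfont \v{c}a\v{s}a, K\"{a}se}
\end{document}
```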

@gmilde
Author

gmilde commented Oct 19, 2023

If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.

Loading a font encoding in the document preamble can be interpreted as a statement that the document author wants to use this font encoding at some place in the document.

This is why I propose to switch the preferred font encoding for text parts in a "foreign" language if this font encoding is known. (Avoiding the font-encoding switch is as easy as deleting the respective font encoding from the list of "fontenc" arguments.)

The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.).

This is why I propose to switch to an "ersatz" font encoding if the preferred font encoding is not known and the current font encoding is not in the list of compatible font encodings. If no compatible font encoding is declared, write a warning and try with the current font encoding (maybe it works because the missing characters are not used or the document provides some other workaround). In case of a compilation error, the combination of the actual error message and the preceding warning can give the user sensible feedback.
