Default to UTF-8 in 8-bit TeX #24

Closed
davidcarlisle opened this Issue Mar 25, 2018 · 17 comments

davidcarlisle commented Mar 25, 2018

Placeholder issue for updating LaTeX to default to

\usepackage[utf8]{inputenc}

processing.

davidcarlisle commented Mar 25, 2018

@aminophen If defaulting to utf8 inputenc causes issues with *platex processing, can you let us know? I did some basic sanity checks, but it is possible I missed something.

aminophen commented Mar 25, 2018

Thanks for the information; OK, I’ll look at it later.

aminophen commented Mar 25, 2018

I checked whether anything incompatible happens and found one case: the following source compiles fine with latex + dvips + ps2pdf on current TeX Live (at least on macOS 10.11)

\documentclass{article}
\usepackage{graphicx}
\begin{document}
[\includegraphics{image-å.eps}]
\end{document}

but if I add \usepackage[utf8]{inputenc} then

LaTeX Warning: File `image-\IeC {\r a}.eps' not found on input line 5.

! Missing \endcsname inserted.
<to be read again> 
                   \unhbox 

The workaround is adding

\usepackage[space]{grffile}
\grffilesetup{
encoding,
inputencoding=utf8,
filenameencoding=utf8,
}

(I don't think anybody would actually have done this; it is just for information.)

davidcarlisle commented Mar 25, 2018

@aminophen thanks for this comment.

There were some plans for a more extensive change that would also make UTF-8 file names work "out of the box" (generally, not just in graphics), but we decided to be a bit more cautious here and
"just" work as if inputenc had been used.

Another way, without using grffile (and one that would also work with \input etc.), would be to use

\includegraphics{\detokenize{image-å}.eps}

But maybe we can adjust things here (at the very least we should document this in the notes for this release). Thanks for the example.
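For reference, the \detokenize workaround can be put into a complete test file along these lines. This is only a minimal sketch: it assumes a file named image-å.eps exists next to the document and that the file is saved as UTF-8. It also relies on the e-TeX primitive \detokenize, which current LaTeX formats provide.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}% the behaviour that the new default emulates
\usepackage{graphicx}
\begin{document}
% \detokenize converts the active UTF-8 bytes of "å" into plain
% character tokens, so the file name is passed on byte for byte
% instead of being expanded to \IeC{\r a}.
[\includegraphics{\detokenize{image-å}.eps}]
\end{document}
```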

aminophen commented Mar 26, 2018

We (the Japanese users) have now concluded that there should be no issue for “valid” input sources for platex and uplatex. In pTeX/upTeX, Latin characters and non-Latin (= CJK) characters are distinguished from each other (based on the \kcatcode value) during tokenization. This should mean that the inputenc status does not affect CJK token processing.

There might be a few users who need to make adjustments; those who have relied on the default will get an error (for example) with Shift_JIS half-width katakana characters. However, the raw input of such characters is no more “valid” than that of Latin-1, so we should ignore that.

davidcarlisle commented Mar 26, 2018

aminophen commented Apr 6, 2018

Slightly off-topic, but let me ask a simple question: do the ucs.sty authors know that utf8x.def (or the package "ucs") is incompatible with the new UTF-8 LaTeX?

There is an old issue that ucs.sty and utf8.def cannot be mixed. It rarely mattered before, but now it does, since utf8.def is already loaded by LaTeX.

\documentclass{article}
\usepackage[utf8x]{inputenc}
\usepackage[LY1]{fontenc}
\begin{document}
La lingvo {\TeX} estas danĝera, ne alproksimiĝu!
\end{document}

This causes a "Missing \begin{document}" error with the new UTF-8 LaTeX.

  • Both utf8.def and utf8x.def (or ucs.sty) provide a command \DeclareUnicodeCharacter, and the usage is incompatible (the utf8x.def one requires a decimal character code instead of a hexadecimal one).
  • utf8.def redefines \DeclareFontEncoding to load a .dfu file, but utf8x.def does not handle that; as a result, the .dfu file's \DeclareUnicodeCharacter calls, which are wrong usage from the point of view of utf8x.def, end up being read.
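To make the difference in \DeclareUnicodeCharacter usage concrete, here is a small sketch of the utf8.def (hexadecimal) form. The choice of U+2603 and the replacement text are arbitrary and only for illustration. According to the description above, the utf8x.def/ucs version would expect the decimal code 9731 in the first argument instead.

```latex
\documentclass{article}
\usepackage[utf8]{inputenc}
% utf8.def syntax: the first argument is the code point in hexadecimal.
% U+2603 (SNOWMAN) has no default definition, so declare one here;
% the replacement text is just an illustrative placeholder.
\DeclareUnicodeCharacter{2603}{\textbf{[snowman]}}
\begin{document}
A snowman: ☃
\end{document}
```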

FrankMittelbach commented Apr 6, 2018

Thanks @aminophen for the heads up. No, they aren't aware, and we overlooked that case. I have added code to handle utf8x specially so the above should work now. Midterm they should change slightly (it might be enough to use \UseRawInputEncoding in utf8x). We'll contact them.

aminophen commented Apr 6, 2018

“I have added code to handle utf8x specially so the above should work now.”

Confirmed, thank you very much.

“Midterm they should change slightly (it might be enough to use \UseRawInputEncoding in utf8x). We'll contact them.”

Sounds good!

aminophen commented Apr 6, 2018

Checked 7e5a425 and noticed that

\RequirePackage[2017/04/15]{latexrelease}
\documentclass{article}
\usepackage[utf8x]{inputenc}

causes "! Undefined control sequence." for \UseRawInputEncoding

FrankMittelbach commented Apr 6, 2018

good catch :-(

aminophen commented Apr 6, 2018

\DeclareOption{utf8x}{\ifdefined\UseRawInputEncoding \UseRawInputEncoding \fi
                      \inputencoding{\CurrentOption}}

maybe?

moewew commented Apr 7, 2018

I was just looking at this for biblatex (plk/biblatex#734) and I was wondering what the expected output for

\UseRawInputEncoding
\documentclass{article}
\begin{document}
\inputencodingname
\end{document}

was. In my tests with TL2018 this somewhat unexpectedly gave "utf8". Should \UseRawInputEncoding maybe also do \let\inputencodingname\@undefined, or am I missing something here?

davidcarlisle commented Apr 7, 2018

Yes, I think it should; we'll get that into the next patch release. As a workaround, if you detect that \inputencodingname is utf8, you could check the catcode of any 8-bit character: if it is 12, then inputenc has been disabled. But \UseRawInputEncoding undefining \inputencodingname (or defining it to raw?) would be more natural.
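A rough sketch of that catcode check, for illustration only. The macro name \myinputstatus is hypothetical, and byte 195 (0xC3) is just one arbitrary choice of 8-bit character. The test is only meaningful in 8-bit engines (latex, pdflatex, *platex); in XeTeX or LuaTeX character 195 is an ordinary letter.

```latex
\UseRawInputEncoding
\documentclass{article}
% \myinputstatus is a hypothetical helper name used only in this sketch.
\newcommand*\myinputstatus{raw}
\ifdefined\inputencodingname
  % With inputenc active, the upper-half bytes are active characters
  % (catcode 13); catcode 12 indicates that inputenc has been disabled.
  \ifnum\catcode195=12
    \renewcommand*\myinputstatus{raw}%
  \else
    \renewcommand*\myinputstatus{\inputencodingname}%
  \fi
\fi
\begin{document}
Input encoding status: \myinputstatus
\end{document}
```

Removing the first line should make the same file report the name of the active encoding instead.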

moewew commented Apr 7, 2018

Ah, good. Thank you for the quick reply.

For biblatex it would be more convenient if \UseRawInputEncoding were to do \let\inputencodingname\@undefined rather than setting it to a value such as raw. But I'm not sure about other packages and their uses for \inputencodingname.

davidcarlisle commented Apr 8, 2018

@moewew We have (with exceptional service from the CTAN team) got a 2018-04-01 patch level 1 release onto CTAN, which undefines \inputencodingname when rolling back via the latexrelease package or when using \UseRawInputEncoding.

moewew commented Apr 8, 2018

Brilliant! Thank you very much.
