non-ascii file name regression #32

edocevoli · 2018-04-08T09:19:53Z

Brief outline of the bug

One of my test cases breaks with the current 2018-04-01 release.

What I have done:

update only the LaTeX base package (MiKTeX: ltxbase)
open a command-prompt window and then:

chcp 65001    
pdflatex tèst.tex

This gives:

This is pdfTeX, Version 3.14159265-2.6-1.40.19 (MiKTeX 2.9.6655 NEXT 64-bit)
entering extended mode
! I can't find file `./t'.
<to be read again> 
                   \global 
<*> ./tè
         st.tex
Please type another input file name: 
! Emergency stop.
<to be read again> 
                   \global 
<*> ./tè
         st.tex
!  ==> Fatal error occurred, no output PDF file produced!
Transcript written on texput.log.

Maybe this is a MiKTeX-specific Windows bug. I will do further tests on macOS and Linux.

Minimal example showing the bug

\documentclass{article}
\begin{document}
Hallo Welt
\end{document}

Log file (required) and possibly PDF file

texput.log


aminophen · 2018-04-08T09:23:25Z

I already reported this, and \detokenize should be used to avoid it (see #24 (comment)).

josephwright · 2018-04-08T09:27:10Z

@aminophen Not quite that simple ... but I have to say I'm surprised that the binaries treat the file name argument at the TeX level (it's not \input tést, after all).

josephwright · 2018-04-08T09:27:54Z

Something like pdflatex \input\detokenize{tést}\relax works, but that's not ideal.
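
Why \detokenize helps can be seen in a minimal plain TeX sketch. It is an illustration only, not the kernel's code: the macro name \safename is hypothetical, "~" merely stands in for an active UTF-8 lead byte, and an e-TeX-enabled engine (for example pdftex) is assumed because \detokenize is an e-TeX primitive.

```tex
% plain TeX sketch; needs an e-TeX-enabled engine such as pdftex
\catcode`\~=13 % "~" stands in for an active UTF-8 lead byte
\def~{\errmessage{active character was expanded}}
% \edef expands its body much as file-name scanning does;
% \detokenize turns "~" into an inert character, so its definition never runs
\edef\safename{t\detokenize{~}st}
\message{\meaning\safename}
\bye
```

Without the \detokenize wrapper, the \edef would execute the definition of "~" and raise the error, which is roughly the situation the command-line file name is in.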

aminophen · 2018-04-08T09:33:53Z

Quoting @josephwright: "the binaries treat the file name argument at the TeX level"

Any argument to *tex is treated as TeX code ;-) When the first token is a character, *tex behaves as if \input were prefixed; when the first token is a control sequence, no \input is prefixed.

josephwright · 2018-04-08T09:42:59Z

@aminophen I have to say I've always imagined the logic differently :) 'If the first char is the escape char, treat as TeX code, otherwise read as a filename'

davidcarlisle · 2018-04-08T09:47:43Z

One thing I had experimented with is starting with

\long\def\UTFviii@two@octets#1#2{%
  \string#1\string#2}

and switching to the main definition "later", but the timing gets tricky, and making the implicit input on the command line work like an explicit \input also isn't as easy as one would hope.

aminophen · 2018-04-08T10:21:36Z

Delaying utf8.def etc. to \everyjob might be the only solution to this (not well tested on all engines; if it is adopted, I will have to adjust platex as well).

(edit: this will change the name of the log file opened by $ pdflatex \\relax from texput.log to utf8.log)

--- latex.ltx.orig	2018-04-07 06:33:45.000000000 +0900
+++ latex.ltx	2018-04-08 19:15:09.000000000 +0900
@@ -8641,12 +8641,6 @@
 \catcode10=12 % ctrl J
 \catcode12=13 % ctrl L
 \catcode13=5  % newline
-\@tempcnta=128
-\loop
-  \catcode\@tempcnta=13
-  \advance\@tempcnta\@ne
-\ifnum\@tempcnta<256
-\repeat
 \def\UseRawInputEncoding{%
 \let\DeclareFontEncoding@\DeclareFontEncoding@saved   % revert
 \let\DeclareUnicodeCharacter\@undefined               % revert
@@ -8669,10 +8663,6 @@
 \repeat
 }
 \let\DeclareFontEncoding@saved\DeclareFontEncoding@
-\edef\inputencodingname{utf8}%
-\input{utf8.def}
-\let\@inpenc@test\@undefined
-\let\saved@space@catcode\@undefined
 \else
 \@tempcnta=0
 \loop
@@ -8793,6 +8783,18 @@
   \endgroup}
 \let\@filelist\@gobble
 \def\@addtofilelist#1{\xdef\@filelist{\@filelist,#1}}%
+\everyjob\expandafter{\the\everyjob
+\@tempcnta=128
+\loop
+  \catcode\@tempcnta=13
+  \advance\@tempcnta\@ne
+\ifnum\@tempcnta<256
+\repeat
+\edef\inputencodingname{utf8}%
+\input{utf8.def}
+\let\@inpenc@test\@undefined
+\let\saved@space@catcode\@undefined
+}
 \makeatother
 \errorstopmode
 \dump
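
The patch above relies on the usual idiom for appending to a token register, \everyjob\expandafter{\the\everyjob ...}. A minimal plain TeX sketch of that idiom follows; \mytoks is a hypothetical register used here in place of \everyjob so the effect can be observed in an ordinary run.

```tex
% plain TeX sketch; \mytoks is a hypothetical register standing in for \everyjob
\newtoks\mytoks
\mytoks{\message{first}}
% \expandafter jumps over "{" and expands \the\mytoks, so the old contents
% are copied in front of the new material before the assignment happens
\mytoks\expandafter{\the\mytoks \message{second}}
\the\mytoks % runs the old code, then the appended code
\bye
```

Appending rather than overwriting keeps whatever the format had already stored in the register.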

davidcarlisle · 2018-04-08T10:39:28Z

@aminophen Yes, I'm currently running some tests with ltfinal changed as follows:

%    \begin{macrocode}
\edef\inputencodingname{utf8}%
\input{utf8.def}
\let\UTFviii@two@octets@@\UTFviii@two@octets
\long\def\UTFviii@two@octets#1#2{\string#1\string#2}
\everyjob\expandafter{\the\everyjob
\let\UTFviii@two@octets\UTFviii@two@octets@@
}
\let\@inpenc@test\@undefined
\let\saved@space@catcode\@undefined
%    \end{macrocode}

This would need the longer cases as well, not just the two-byte one, of course. Delaying the catcode activation until \everyjob would work on the command line, but if we can make it work without that, it may give a path to accepting UTF-8 file names more generally in the document (which did not work in previous releases once inputenc was loaded).
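
The \string idea can be tried in isolation with a plain TeX sketch on an 8-bit engine such as pdfTeX. This is an assumption-laden illustration, not the kernel's definitions: \twooctets and \name are hypothetical names, and only the two bytes of "è" (0xC3 0xA8) are handled.

```tex
% plain TeX sketch for an 8-bit engine; \twooctets and \name are hypothetical
\catcode"C3=13 \catcode"A8=13 % make both bytes of UTF-8 "è" active
% return the two bytes as inert catcode-12 characters
\def\twooctets#1#2{\string#1\string#2}
% the lead byte hands itself and the following byte to \twooctets
\def^^c3{\twooctets^^c3}
\def^^a8{\errmessage{stray continuation byte}}
% \edef is an expansion-only context, comparable to file-name scanning
\edef\name{t^^c3^^a8st}
\message{\meaning\name}
\bye
```

Because the expansion yields only the raw bytes, the name survives intact; the open question in the thread is when to switch back to the real definition that typesets the character.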

aminophen · 2018-04-08T13:40:26Z

724013b works as expected for pdfLaTeX. I committed support for that change in pLaTeX (texjporg/platex@8b6c518), and it works on both pLaTeX and upLaTeX. I'll upload the new version of pLaTeX when LaTeX is ready.

josephwright added the bug and category base (latex) labels Apr 8, 2018

aminophen mentioned this issue Apr 8, 2018

What if UTF-8 becomes the default in LaTeX's inputenc? texjporg/platex#67

Closed

davidcarlisle added a commit that referenced this issue Apr 8, 2018

experimental code to address issue #32

735ce69

davidcarlisle added a commit that referenced this issue Apr 8, 2018

make everyjob code kernel only issue #32

724013b

josephwright closed this as completed Apr 9, 2018
