cosmetics: actually show the invalid UTF-8 byte sequence, format them in hex #62

mirabilos · 2018-08-05T20:09:15Z

… except if numexpr isn’t available, of course, since we cannot decode the sequence then

josephwright · 2018-08-05T20:10:18Z

The kernel requires e-TeX nowadays [and was meant to for many years :)].

also, show invalid bytes in hex, not decimal

davidcarlisle · 2018-09-30T15:05:33Z

I'm not sure about this one.

I suppose in general the information might be useful, but for most people most of the time the internal encoding isn't going to help much: they just need to be told the file is not in UTF-8, and they will re-save the file in the editor or specify `\usepackage[latin1]{inputenc}` or similar to get back to an 8-bit encoding.

As your update to the test suite shows you can get strange artifacts.
The new output from tlb1144 in this PR is

Invalid byte sequence: "E3 "5C "70 "61 "72 "5C "70 "61 "72.

which is ã\par\par which isn't a byte sequence in the input but rather some bytes generated by TeX after passing the internal \par from the blank line through \string.
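To check that reading of the report, here is a minimal Python sketch (not part of the PR; `parse_reported_bytes` is a hypothetical helper name) that turns the TeX-style hex list back into bytes and decodes them as Latin-1, which maps each byte to the code point of the same value:

```python
def parse_reported_bytes(report: str) -> bytes:
    """Turn a TeX-style hex list such as '"E3 "5C' into raw bytes."""
    tokens = report.rstrip(".").split()
    # Each token looks like "E3: strip the leading double quote, parse as hex.
    return bytes(int(tok.lstrip('"'), 16) for tok in tokens)


reported = '"E3 "5C "70 "61 "72 "5C "70 "61 "72.'
raw = parse_reported_bytes(reported)

# Latin-1 decoding never fails, so it is a safe way to eyeball the bytes.
print(raw.decode("latin-1"))  # prints: ã\par\par
```

The bytes after E3 spell out the characters of `\par` twice, which is the `\string` form of a token rather than anything stored in the file.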

It would probably be possible to catch this case, but then the code gets more complicated for what as you say is a cosmetic feature.

I'll ping @FrankMittelbach for review...

mirabilos · 2018-09-30T15:29:20Z

David Carlisle dixit:

> I'm not sure about this one.

OK.

> I suppose in general the information might be useful

It did help me (ok, granted, I’m definitely not the average user) once…

> The new output from tlb1144 in this PR is
> ```
> Invalid byte sequence: "E3 "5C "70 "61 "72 "5C "70 "61 "72.
> ```
> which is `ã\par\par` which isn't a byte sequence in the input but rather some bytes generated by TeX after passing the internal `\par` from the blank line through `\string`.

… but, this. Yes. I saw this, and it’s annoying.

> It would probably be possible to catch this case, but then the code gets more complicated for what as you say is a cosmetic feature.

That being said, the backslash from `\par` is indeed part of the invalid byte sequence. (Or rather, the missing second byte before the end of the actual multibyte sequence, but I’m not sure we can catch that, and the `\par` might have been entered by the user, in which case it’s actually correct.) Perhaps stopping after the first character below "80 would at least limit the “visual damage”?
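The truncation idea can be sketched in Python as follows. This is only an illustration of the rule "stop after the first byte below "80"; the function names are hypothetical and this is not the TeX code from the PR:

```python
def truncate_report(seq: bytes) -> bytes:
    """Keep bytes up to and including the first one below 0x80."""
    for i, b in enumerate(seq):
        if b < 0x80:
            return seq[: i + 1]
    # No ASCII-range byte found: report the sequence unchanged.
    return seq


def format_hex(seq: bytes) -> str:
    """Format bytes in the TeX-style hex notation used by the message."""
    return " ".join(f'"{b:02X}' for b in seq)


# The sequence from the tlb1144 test output, E3 followed by \par\par.
print(format_hex(truncate_report(b"\xe3\\par\\par")))  # prints: "E3 "5C
```

The report then ends at the first byte that cannot be a UTF-8 continuation byte, whatever produced that byte.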

> I'll ping @FrankMittelbach for review...

Thanks. Perhaps we can improve upon this. (For the “simply re-save” consideration, the original encoding must be known. Also, perhaps, the source file had Mojibake or a mix of different encodings (`\input` comes to mind) or was damaged on transport. Situations may require manual fixing, in which case I’d love to have the actual byte/octet sequence to search for in my editor; this is fastest.)

davidcarlisle · 2018-09-30T15:39:10Z

> That being said, the backslash from \par is indeed part of the invalid byte sequence.

There is no \par in the input file: it is a blank line in the input after the E3 byte, which TeX reports as \par, so the actual non-utf8 byte sequence is E3 0A 0A (after TeX's end of line normalisation).

davidcarlisle · 2018-09-30T15:43:31Z

> Situations may require manual fixing, in which case I’d love to have the actual byte/octet sequence to search for in my editor; this is fastest.)

yes you would, so would I, but if we help two people and confuse half a million with spurious "E3 "5C "70 "61 "72 "5C "70 "61 "72. it's not a net win:-)

Perhaps could say "Invalid byte sequence starting from byte E3", which would limit things to the first known bad byte.

mirabilos · 2018-09-30T15:53:16Z

David Carlisle dixit:

> so the actual non-utf8 byte sequence is E3 0A 0A (after TeX's end of

E3 0A, since the 0A terminates it… yes, you’re right, sorry.

> yes you would, so would I, but if we help two people and confuse half a million with spurious `"E3 "5C "70 "61 "72 "5C "70 "61 "72.` it's not a net win:-)

Perhaps "E3 "5C ?

> Perhaps could say "Invalid byte sequence starting from byte E3", which would limit things to the first known bad byte.

But E3 is a valid start byte for, perhaps, 1000 other chars in the file. That would help negatively (i.e. rather hinder) a search for the one corrupt place…

davidcarlisle · 2018-09-30T16:22:16Z

> That would help negatively (i.e. rather hinder) a search for the corrupt

Searching for E3 might have too many hits but would eventually find the bad place.

Searching for E3 5C will fail as that byte sequence isn't in the file, so if the intention is to help you search to find the bad bytes then this will not help. Really all TeX knows at this stage, after end-of-line and \par normalisation and possible macro expansion, is that it is confused; it doesn't have a good handle on what the original byte sequence in the file is.

I'll leave it for a while to give time for other team members to review but my current feeling is not to do this. It could make an interesting extension package, as the code itself works fine, and if a user who understands the output opts in to use it to debug something then that would be a useful feature, but I think it's too low level and too confusing in edge cases for a general user facing error message.
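Since TeX no longer sees the original bytes at that point, the search can also be done on the file itself, outside TeX. A minimal Python sketch, assuming the source is available as raw bytes (`find_invalid_utf8` is a hypothetical helper, not an existing tool):

```python
def find_invalid_utf8(data: bytes):
    """Yield (offset, offending_bytes) for each undecodable span in data."""
    pos = 0
    while pos < len(data):
        try:
            # Strict decoding raises at the first invalid span after pos.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            start, end = pos + err.start, pos + err.end
            yield start, data[start:end]
            pos = end  # resume scanning just past the bad span


# An E3 lead byte followed by a blank line, as in the case discussed above.
sample = b"ok \xe3\n\nmore"
for offset, bad in find_invalid_utf8(sample):
    print(offset, bad.hex(" ").upper())  # prints: 3 E3
```

For a real file, the bytes could be read with `open(path, "rb").read()`; the offset then points at the place to fix in an editor.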

josephwright · 2018-09-30T17:03:34Z

@davidcarlisle I'm with you here: I think the data is sufficiently specialised that it would be best handled in a package 'for the wise'.

davidcarlisle · 2018-10-01T07:09:22Z

> E3 0A, since the 0A terminates it… yes, you’re right, sorry.

A single 0A linebreak in the file would have been reported as E3 20; it is the 0A 0A which triggers the \par token, reported as "E3 "5C "70 "61 "72.

I think (despite weirdness around linebreaks), as noted above, that this would be a useful component in a debugging-inputenc package. It could perhaps be combined with another useful debugging aid for finding control characters and other hard-to-find things in the source: an option to insert "FIX ME HERE" kinds of text into the generated PDF rather than just raising an error, so that people can search for visible, known text in the output to help locate the error.

Closing here. But thanks for this set of PRs especially fixing my embarrassing F4/F5 error!


mirabilos mentioned this pull request Aug 5, 2018

cosmetic: show invalid byte in hex, not decimal #61

Closed

mirabilos changed the title ~~cosmetics: actually show the invalid UTF-8 byte sequence~~ cosmetics: actually show the invalid UTF-8 byte sequence, format them in hex Aug 5, 2018

mirabilos force-pushed the show-invalid-byte-sequence branch from 29a0e26 to 3409b5c Compare August 5, 2018 20:32

cosmetics: actually show the invalid UTF-8 byte sequence

6f6b740

also, show invalid bytes in hex, not decimal

mirabilos force-pushed the show-invalid-byte-sequence branch 2 times, most recently from bb6a05d to 7213ca2 Compare August 5, 2018 20:56

update testsuite with new expected output

19e8ec3

mirabilos force-pushed the show-invalid-byte-sequence branch from 7213ca2 to 19e8ec3 Compare August 5, 2018 21:02

FrankMittelbach assigned davidcarlisle Sep 23, 2018

FrankMittelbach added this to the release 2018-12 milestone Sep 23, 2018

FrankMittelbach added this to To do in upcoming LaTeX2e releases Sep 23, 2018

davidcarlisle requested review from davidcarlisle and FrankMittelbach September 30, 2018 15:05

davidcarlisle closed this Oct 1, 2018

upcoming LaTeX2e releases automation moved this from To do to Done Oct 1, 2018

mirabilos mentioned this pull request Oct 4, 2018

cosmetics: show invalid bytes in hex, not decimal #82

Closed
