cosmetics: actually show the invalid UTF-8 byte sequence, format them in hex #62
The kernel requires e-TeX nowadays [and was meant to for many years :)].
also, show invalid bytes in hex, not decimal
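I'm not sure about this one. I suppose in general the information might be useful, but for most people most of the time the internal encoding isn't going to help much: they just need to be told the file is not in UTF-8, and they will re-save the file in the editor or specify … As your update to the test suite shows, you can get strange artifacts. The new output from tlb1144 in this PR is `Invalid byte sequence: "E3 "5C "70 "61 "72 "5C "70 "61 "72.`, which is `ã\par\par`: not a byte sequence in the input but rather some bytes generated by TeX after passing the internal `\par` from the blank line through `\string`.
It would probably be possible to catch this case, but then the code gets more complicated for what, as you say, is a cosmetic feature. I'll ping @FrankMittelbach for review...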
David Carlisle dixit:

> I'm not sure about this one.

OK.

> I suppose in general the information might be useful

It did help me (ok, granted, I’m definitely not the average user) once…

> The new output from tlb1144 in this PR is
> ```
> Invalid byte sequence: "E3 "5C "70 "61 "72 "5C "70 "61 "72.
> ```
> which is `ã\par\par` which isn't a byte sequence in the input but
> rather some bytes generated by TeX after passing the internal `\par`
> from the blank line through `\string`.

… but, this. Yes. I saw this, and it’s annoying.

> It would probably be possible to catch this case, but then the code
> gets more complicated for what as you say is a cosmetic feature.

That being said, the backslash from `\par` is indeed part of the
invalid byte sequence. (Or rather, the missing second byte before
the end of the actual multibyte sequence, but I’m not sure we can
catch that, and the `\par` might have been entered by the user,
in which case it’s actually correct.)
Perhaps stopping after the first character below "80 would at
least limit the “visual damage”?

> I'll ping @FrankMittelbach for review...

Thanks. Perhaps we can improve upon this.
(For the “simply re-save” consideration, the original encoding
must be known. Also, perhaps, the source file had Mojibake or
a mix of different encodings (`\input` comes to mind) or was
damaged on transport. Situations may require manual fixing, in
which case I’d love to have the actual byte/octet sequence to
search for in my editor; this is fastest.)
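
For the manual-fixing case described above, the raw bytes can also be located outside TeX. The following is only a minimal Python sketch, not part of this PR or of the LaTeX kernel; `find_invalid_utf8`, `hex_bytes` and the sample data are hypothetical names and values chosen for illustration. It relies on the `start` and `end` attributes that Python's `UnicodeDecodeError` provides.

```python
def find_invalid_utf8(data: bytes):
    """Yield (offset, offending_bytes) for each undecodable run in data."""
    pos = 0
    while pos < len(data):
        try:
            # If the remainder decodes cleanly, there is nothing left to report.
            data[pos:].decode("utf-8")
            return
        except UnicodeDecodeError as err:
            # err.start/err.end are relative to the slice, so shift by pos.
            start, end = pos + err.start, pos + err.end
            yield start, data[start:end]
            pos = end  # resume scanning after the bad run


def hex_bytes(seq: bytes) -> str:
    """Format bytes as space-separated upper-case hex, e.g. 'E3 0A'."""
    return " ".join(f"{b:02X}" for b in seq)


# Hypothetical file content: a stray Latin-1 byte E3 followed by a blank line,
# then a line containing a valid UTF-8 sequence (C3 AF).
sample = b"caf\xe3\n\nna\xc3\xafve\n"
hits = [(off, hex_bytes(bad)) for off, bad in find_invalid_utf8(sample)]
print(hits)  # [(3, 'E3')]
```

The byte offset is what makes the result usable in an editor or hex viewer, since the lead byte alone may occur in many valid sequences elsewhere in the file.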
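there is no … so the actual non-utf8 byte sequence is E3 0A 0A (after TeX's end of …
yes you would, so would I, but if we help two people and confuse half a million with spurious `"E3 "5C "70 "61 "72 "5C "70 "61 "72.` it's not a net win :-)
Perhaps could say "Invalid byte sequence starting from byte E3", which would limit things to the first known bad byte.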
David Carlisle dixit:

> so the actual non-utf8 byte sequence is E3 0A 0A (after TeX's end of

E3 0A, since the 0A terminates it… yes, you’re right, sorry.

> yes you would, so would I, but if we help two people and confuse half
> a million with spurious `"E3 "5C "70 "61 "72 "5C "70 "61 "72.` it's
> not a net win:-)

Perhaps "E3 "5C ?

> Perhaps could say " Invalid byte sequence starting from byte E3" which
> would limit things to the first known bad byte.

But E3 is a valid start byte for, perhaps, 1000 other chars in the file.
That would help negatively (i.e. rather hinder) a search for the corrupt
one place…
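
The truncation idea discussed above (stop after the first byte below "80) can be illustrated as follows. This is a Python sketch of the rule only, not the TeX code in this PR; `truncate_report` and `tex_hex` are hypothetical helper names.

```python
def truncate_report(reported: bytes) -> bytes:
    """Keep bytes up to and including the first one below 0x80."""
    for i, b in enumerate(reported):
        if b < 0x80:
            return reported[: i + 1]
    return reported  # no ASCII byte found: keep everything


def tex_hex(seq: bytes) -> str:
    """Format bytes in TeX hex notation, e.g. '"E3 "5C'."""
    return " ".join(f'"{b:02X}' for b in seq)


# The sequence reported by tlb1144 in this PR.
reported = bytes.fromhex("E3 5C 70 61 72 5C 70 61 72")
short = tex_hex(truncate_report(reported))
print(short)  # "E3 "5C
```

This shortens the message to the `"E3 "5C` form suggested in the reply, but the second byte is still whatever TeX produced after its own processing, not necessarily a byte from the file.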
Searching for E3 5C will fail, as that byte sequence isn't in the file, so if the intention is to help you search for the bad bytes then this will not help. Really, all TeX knows at this stage, after end-of-line and `\par` normalisation and possible macro expansion, is that it is confused; it doesn't have a good handle on what the original byte sequence in the file is.
I'll leave it for a while to give other team members time to review, but my current feeling is not to do this. It could make an interesting extension package, as the code itself works fine, and if a user who understands the output opts in to use it to debug something then that would be a useful feature, but I think it's too low-level and too confusing in edge cases for a general user-facing error message.
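
The mismatch between the reported bytes and the file contents can be checked with a few lines of Python. This is an illustrative sketch only; `raw` is a hypothetical file with a stray E3 followed by a blank line, and `reported` is the sequence from the tlb1144 message quoted earlier in this thread.

```python
# Hypothetical raw file content: stray E3, then 0A 0A (a blank line).
raw = b"caf\xe3\n\nnext paragraph\n"

# Bytes from the error message: E3 followed by the characters of \par\par.
reported = bytes.fromhex("E3 5C 70 61 72 5C 70 61 72")

# Read as Latin-1, the reported bytes are the text shown in the thread.
as_text = reported.decode("latin-1")      # 'ã\\par\\par'

# Neither the full report nor its E3 5C prefix occurs in the file ...
full_found = reported in raw              # False
prefix_found = reported[:2] in raw        # False

# ... whereas the sequence that is really on disk does.
actual_found = b"\xe3\n\n" in raw         # True
```

The sketch only models the end result; the end-of-line and `\par` handling that turns 0A 0A into `\par` happens inside TeX and is not reproduced here.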
@davidcarlisle I'm with you here: I think the data is sufficiently specialised that it would be best handled in a package 'for the wise'.
A single 0A linebreak in the file would have been reported as E3 20; it is the 0A 0A which is triggering the `\par` token being reported as …
I think (despite weirdness around linebreaks), as noted above, that this would be a useful component in a debugging-inputenc package, perhaps combined with the other useful debugging aid for finding control characters and other hard-to-find things in the source, which would be an option to insert …
Closing here. But thanks for this set of PRs, especially fixing my embarrassing F4/F5 error!
* support for hang option of footmisc
* typo and missing test update (probably more to show up)
* updating more tests
* update date/version and changes.txt
* attempt to patch a few more styles/classes (that contain \makebox rather than \hb@xt@)
* mumble
* tag note label if hang option is used (this is missing a test!)
* try again with tagging temp disabled
… except if numexpr isn’t available, of course, since we cannot decode the sequence then