cosmetics: actually show the invalid UTF-8 byte sequence, format them in hex #62

Closed
wants to merge 2 commits into from


mirabilos
Contributor

… except if numexpr isn’t available, of course, since we cannot decode the sequence then

@josephwright
Member

The kernel requires e-TeX nowadays [and was meant to for many years :)].

@mirabilos mirabilos changed the title cosmetics: actually show the invalid UTF-8 byte sequence cosmetics: actually show the invalid UTF-8 byte sequence, format them in hex Aug 5, 2018
also, show invalid bytes in hex, not decimal
@mirabilos mirabilos force-pushed the show-invalid-byte-sequence branch 2 times, most recently from bb6a05d to 7213ca2 Compare August 5, 2018 20:56
@davidcarlisle
Member

I'm not sure about this one.

I suppose in general the information might be useful, but for most people most of the time the internal encoding isn't going to help much: they just need to be told the file is not in UTF-8, and they will re-save the file in the editor or specify \usepackage[latin1]{inputenc} or similar to get back to an 8-bit encoding.

As your update to the test suite shows, you can get strange artifacts.
The new output from tlb1144 in this PR is

Invalid byte sequence: "E3 "5C "70 "61 "72 "5C "70 "61 "72.

which is ã\par\par: not a byte sequence in the input, but bytes generated by TeX after passing the internal \par from the blank line through \string.
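The reading of that message can be checked outside TeX. The following is a minimal Python sketch, not part of the PR; the function name `decode_reported_bytes` is hypothetical, and it assumes the message format shown above (a double quote followed by two hex digits per byte, with a trailing full stop).

```python
def decode_reported_bytes(report: str) -> str:
    """Turn a report such as '"E3 "5C "70' back into text, one character per byte."""
    # Each item is a TeX-style hex literal: a double quote followed by two hex digits.
    values = [int(item.lstrip('"'), 16) for item in report.rstrip(".").split()]
    # Latin-1 maps every byte value 0-255 to the code point with the same number,
    # so this shows the bytes as characters without interpreting them as UTF-8.
    return bytes(values).decode("latin-1")


decoded = decode_reported_bytes('"E3 "5C "70 "61 "72 "5C "70 "61 "72.')
print(decoded)
```

The decoded string is ã followed by the eight characters of `\par\par`, which matches the observation that the reported bytes come from the stringified \par tokens rather than from the file.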

It would probably be possible to catch this case, but then the code gets more complicated for what, as you say, is a cosmetic feature.

I'll ping @FrankMittelbach for review...

@mirabilos
Contributor Author

mirabilos commented Sep 30, 2018 via email

@davidcarlisle
Member

That being said, the backslash from \par is indeed part of the invalid byte sequence.

There is no \par in the input file. It is a blank line in the input after the E3 byte that TeX reports as \par, so the actual non-UTF-8 byte sequence is E3 0A 0A (after TeX's end-of-line normalisation).

@davidcarlisle
Member

Situations may require manual fixing, in which case I’d love to have the actual byte/octet sequence to search for in my editor; this is fastest.)

Yes you would, and so would I, but if we help two people and confuse half a million with a spurious "E3 "5C "70 "61 "72 "5C "70 "61 "72. it's not a net win :-)

Perhaps we could say "Invalid byte sequence starting from byte E3", which would limit things to the first known bad byte.

@mirabilos
Contributor Author

mirabilos commented Sep 30, 2018 via email

@davidcarlisle
Member

That would help negatively (i.e. rather hinder) a search for the corrupt
Searching for E3 might have too many hits but would eventually find the bad place.

Searching for E3 5C will fail, as that byte sequence isn't in the file, so if the intention is to help you search for the bad bytes then this will not help. Really, all TeX knows at this stage, after end-of-line and \par normalisation and possible macro expansion, is that it is confused; it doesn't have a good handle on what the original byte sequence in the file is.
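Since TeX no longer sees the original bytes at this point, the search is more reliable when done on the file itself. The following is a rough Python sketch of that idea, not something proposed in this PR; the function name `find_invalid_utf8` and the sample bytes are illustrative assumptions modelled on the E3 0A 0A case discussed above.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, hex string) pairs, one per undecodable span."""
    spans = []
    pos = 0
    while pos < len(data):
        try:
            # Try to decode everything from the current position onwards.
            data[pos:].decode("utf-8")
            break  # the rest of the data is valid UTF-8
        except UnicodeDecodeError as err:
            # err.start and err.end are relative to the slice that was decoded.
            start, end = pos + err.start, pos + err.end
            spans.append((start, " ".join(f"{b:02X}" for b in data[start:end])))
            pos = end  # resume scanning just after the bad span
    return spans


# Illustrative input: a stray E3 byte followed by a blank line.
sample = b"abc\xe3\n\nxyz"
print(find_invalid_utf8(sample))   # [(3, 'E3')]
print(b"\xe3\x5c" in sample)       # False: E3 5C never occurs in the file
```

Re-decoding the remaining slice after every error is quadratic in the worst case, which is acceptable for a quick check of a source file but would need rethinking for large inputs.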

I'll leave it for a while to give time for other team members to review, but my current feeling is not to do this. It could make an interesting extension package, as the code itself works fine, and if a user who understands the output opts in to use it to debug something then that would be a useful feature, but I think it's too low-level and too confusing in edge cases for a general user-facing error message.

@josephwright
Member

@davidcarlisle I'm with you here: I think the data is sufficiently specialised that it would be best handled in a package 'for the wise'.

@davidcarlisle
Member

davidcarlisle commented Oct 1, 2018

E3 0A, since the 0A terminates it… yes, you’re right, sorry.

A single 0A linebreak in the file would have been reported as E3 20; it is the 0A 0A that triggers the \par token, which is reported as E3 "5C "70 "61 "72.

I think (despite the weirdness around linebreaks), as noted above, that this would be a useful component in a debugging-inputenc package. It could perhaps be combined with another useful debugging aid for finding control characters and other hard-to-find things in the source: an option to insert "FIX ME HERE" kinds of text into the generated PDF rather than just raise an error message, so that people can search for visible known text in the output to help locate the error.
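The "visible marker" idea can be illustrated outside TeX. The sketch below is only an analogy in Python, not a design for the suggested package: it registers a decoding error handler through the standard `codecs.register_error` hook, and the handler name `"fixme"`, the function name `mark_invalid_bytes`, and the marker text are all assumptions made for the example.

```python
import codecs


def mark_invalid_bytes(err):
    """Error handler: replace undecodable bytes with a visible, searchable marker."""
    if not isinstance(err, UnicodeDecodeError):
        raise err  # this sketch only handles decoding errors
    bad = err.object[err.start:err.end]
    hex_bytes = " ".join(f"{b:02X}" for b in bad)
    # Return the replacement text and the position at which decoding resumes.
    return (f"[FIX ME HERE: {hex_bytes}]", err.end)


# "fixme" is a hypothetical handler name chosen for this example.
codecs.register_error("fixme", mark_invalid_bytes)

text = b"caf\xe3\n\nnext paragraph".decode("utf-8", errors="fixme")
print(text)
```

The decoded text keeps everything that was valid and carries the marker `[FIX ME HERE: E3]` at the position of the bad byte, so a plain text search of the output leads back to the problem spot.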

Closing here. But thanks for this set of PRs especially fixing my embarrassing F4/F5 error!

upcoming LaTeX2e releases automation moved this from To do to Done Oct 1, 2018
FrankMittelbach added a commit that referenced this pull request Mar 12, 2024
* support for hang option of footmisc

* typo and missing test update (probably more to show up)

* updating more tests

* update date/version and changes.txt

* attempt to patch a few more styles/classes (that contain \makebox rather than \hb@xt@)

* mumble

* tag note label if hang option is used (this is missing a test!)

* try again with tagging temp disabled