Provide context for error: "pandoc.exe: Cannot decode byte" #8884

Closed
friendly opened this issue Jun 1, 2023 · 9 comments

Comments

friendly commented Jun 1, 2023

In this StackOverflow post I describe a problem with pandoc (v. 2.19.2) running under quarto (v. 1.1.189), where I get this error:

pandoc.exe: Cannot decode byte '\x93': Data.Text.Internal.Encoding.decodeUtf8: Invalid UTF-8 stream

I have been persistently unable to find where in my files this occurs, because the error message gives no help in locating its source.

Request:

  • pandoc should return in the error message the name of the file and the location (line number) where such non-UTF-8 characters are found.
  • Is it possible for pandoc to signal a warning rather than an error when such non-UTF-8 characters are found? What problems could arise from this?
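
To illustrate the second request, here is a minimal Haskell sketch of lenient decoding with the `text` library. It is not pandoc's code, and `decodeWithWarning` is a hypothetical name. It suggests one problem with warning instead of failing: every undecodable byte is replaced with U+FFFD, so the original character (byte 0x93 is a left curly quote in Windows-1252) is lost from the output.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8With)
import Data.Text.Encoding.Error (lenientDecode)

-- Hypothetical helper, not part of pandoc.
-- Decodes leniently: invalid bytes become U+FFFD instead of raising an error.
-- The count also includes any U+FFFD characters already present in the input.
decodeWithWarning :: B.ByteString -> (T.Text, Int)
decodeWithWarning bytes = (txt, T.count (T.singleton '\xFFFD') txt)
  where
    txt = decodeUtf8With lenientDecode bytes

main :: IO ()
main = do
  -- 0x93 and 0x94 are curly quotes in Windows-1252 but invalid as UTF-8 here.
  let (txt, n) = decodeWithWarning (B.pack [0x93, 0x68, 0x69, 0x94])
  print txt
  putStrLn ("warning: replaced " ++ show n ++ " undecodable byte(s)")
```

A warning of this kind keeps the conversion running, but the replaced characters would then appear in the rendered document unless the source file is fixed.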
Owner

jgm commented Jun 1, 2023

#4765 was intended to fix this, but there may still be some pathways where the context isn't provided.
At any rate, I'd advise trying with the latest pandoc, as there have been related changes since 2.19.2 and you might find that your problem goes away. If it doesn't, then in order to make progress on this we'd need to have a test case that we can use to reproduce the issue, ideally a minimal one.

Author

friendly commented Jun 1, 2023

I upgraded to the latest pandoc version, 3.1.2 for Windows, but I still get the same error, which I've described in more detail in the SO post, https://stackoverflow.com/questions/76366976/quarto-pandoc-cant-find-source-of-pandoc-error-pandoc-exe-cannot-decode

To help track this down, I've made my repo public, at https://github.com/friendly/Vis-MLM-quarto

I'll try to prepare a smaller version for testing, but I'm not quite sure how. I try to compile the book from within RStudio.

Owner

jgm commented Jun 1, 2023

There's a lot of quarto going on here. I would report this to them first, if none of your source files contain illegal UTF-8.

Author

friendly commented Jun 2, 2023

I did report this also to the quarto folks:
quarto-dev/quarto-cli#5787

We'll see if anything turns up there.

Author

friendly commented Jun 2, 2023

The Quarto folks were unable to help track this down. Are there any pandoc options (e.g., --verbose) to help me track down the file and file location that triggers this error?

Author

friendly commented Jun 2, 2023

I finally found the source of this error -- a bib file outside my project that was imported and not checked. Yet, it took me over 2 days, and could have been avoided if pandoc returned filename and file location information in its error message.

Hence, I'm requesting this enhancement

Enhancement:

pandoc should return in the error message the name of the file and the location (line number) where such non-UTF-8 characters are found.

Owner

jgm commented Jun 2, 2023

The error would be raised by this call:

        readBibtexString Bibtex locale idpred . UTF8.toText $ raw

at line 250 of Text/Pandoc/Citeproc.hs.

Since getRefs is called by getRefsFromBib, which has the filename, it would be feasible to trap the error and give filename information here.
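
The idea of trapping the decoding error where the filename is known can be sketched as follows. This is an illustration using the `text` library's `decodeUtf8'`, not the actual change to pandoc; `decodeNamed` and the file name `refs.bib` are hypothetical.

```haskell
import qualified Data.ByteString as B
import qualified Data.Text as T
import Data.Text.Encoding (decodeUtf8')

-- Hypothetical helper, not pandoc's API.
-- Decodes strictly and, on failure, prefixes the error with the file name
-- so the user can tell which input contained the bad bytes.
decodeNamed :: FilePath -> B.ByteString -> Either String T.Text
decodeNamed fp bytes =
  case decodeUtf8' bytes of
    Right txt -> Right txt
    Left err  -> Left (fp ++ ": " ++ show err)

main :: IO ()
main = do
  -- In real use the bytes would come from B.readFile on the bibliography file.
  let bytes = B.pack [0x40, 0x93]
  either putStrLn (const (putStrLn "decoded without errors"))
         (decodeNamed "refs.bib" bytes)
```

Returning `Either` leaves the caller free to turn the message into whatever error type it already uses.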

Author

friendly commented Jun 2, 2023

Wonderful! Thank you so much.

jgm added a commit that referenced this issue Jun 27, 2023
This is like `Text.Pandoc.UTF8.toText`, except:

- it takes a file path as first argument, in addition to
  bytestring contents
- it raises an informative error with source position if
  the contents are not UTF8-encoded

[API change]

This replaces `utf8ToText` in `Text.Pandoc.App.Input`.

See #8884.
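
As a rough illustration of what "an informative error with source position" could involve, the sketch below locates the first invalid UTF-8 sequence by line and column. It is an assumption-laden stand-in, not the function added in the commit: `firstInvalid` and `toTextAt` are hypothetical names, and the approach (decode line by line, then find the longest decodable prefix of the bad line) was chosen for brevity rather than speed.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as BC
import qualified Data.Text as T
import Data.Either (isRight)
import Data.Text.Encoding (decodeUtf8')

-- Hypothetical helper: 1-based (line, column) of the first invalid sequence.
-- Splitting on byte 0x0A is safe because that byte never occurs inside a
-- multi-byte UTF-8 sequence.
firstInvalid :: B.ByteString -> Maybe (Int, Int)
firstInvalid bytes =
  case [ (n, l) | (n, l) <- zip [1 ..] (B.split 0x0A bytes), not (valid l) ] of
    []         -> Nothing
    (n, l) : _ -> Just (n, column l)
  where
    valid = isRight . decodeUtf8'
    -- The longest prefix of the bad line that still decodes ends just
    -- before the invalid sequence; the column is counted in characters.
    column l =
      case [ t | k <- [B.length l - 1, B.length l - 2 .. 0]
               , Right t <- [decodeUtf8' (B.take k l)] ] of
        t : _ -> T.length t + 1
        []    -> 1

-- Hypothetical wrapper: takes a file path in addition to the contents.
toTextAt :: FilePath -> B.ByteString -> Either String T.Text
toTextAt fp bytes =
  case decodeUtf8' bytes of
    Right txt -> Right txt
    Left _    -> Left (fp ++ maybe "" pos (firstInvalid bytes) ++ ": invalid UTF-8")
  where
    pos (l, c) = ":" ++ show l ++ ":" ++ show c

main :: IO ()
main = do
  -- BC.pack keeps the low 8 bits of each Char, so '\x93' becomes byte 0x93.
  let bytes = BC.pack "@book{x,\n  title = {\x93Hi\x94}\n}"
  either putStrLn (const (putStrLn "decoded without errors"))
         (toTextAt "refs.bib" bytes)
```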

jgm closed this as completed in e0bea47 Jun 27, 2023

Author

Many thanks for this!
