Windows: En dashes break custom Lua output #2101

Closed
jmcphers opened this issue Apr 20, 2015 · 23 comments · Fixed by #2112

Comments

@jmcphers

Repro:

Save this as score.md, with UTF-8 encoding (note that it contains an en dash):

The score was 4—15. 

Run Pandoc using its sample Lua custom writer:

C:\> pandoc score.md -t sample.lua

Result:

pandoc.exe: Cannot decode byte '\xe2': Data.Text.Internal.Encoding.Fusion.streamUtf8: Invalid UTF-8 stream

The same is true for many other Unicode characters (such as non-breaking spaces and many/most other non-ASCII characters). This may be a dupe of #1634; it's not clear why that issue was closed.
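For reference, the byte named in the error message is the lead byte of the en dash's UTF-8 encoding. The following is a minimal Python sketch (Python is used only for illustration and is not part of Pandoc; the helper name `decodes_as_utf8` is hypothetical) showing the three bytes and that a strict UTF-8 decoder rejects the sequence once it is incomplete:

```python
# U+2013 EN DASH encodes to three bytes in UTF-8; the lead byte is 0xE2.
EN_DASH = "\u2013"
encoded = EN_DASH.encode("utf-8")
print(encoded.hex(" "))  # e2 80 93


def decodes_as_utf8(data: bytes) -> bool:
    """Return True if data is a valid UTF-8 byte sequence."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print(decodes_as_utf8(encoded))      # True
print(decodes_as_utf8(encoded[:1]))  # False: a lone 0xE2 is incomplete
```

This only illustrates why a decoder might complain about `\xe2`; it does not show where in Pandoc the bytes get damaged.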

@jgm
Owner

jgm commented Apr 20, 2015

Are you using the latest release (1.13.2.*) or the development version (github master)?

@jmcphers
Author

This is with the latest release, 1.13.2.

@nkalvi

nkalvi commented Apr 20, 2015

@jmcphers I couldn't reproduce it under Windows 8.1 or Windows 7 (or OS X):

D:\Downloads>pandoc --version
pandoc 1.13.2
Compiled with texmath 0.8.0.1, highlighting-kate 0.5.11.1.
Syntax highlighting is supported for the following languages:
...

D:\Downloads>echo "The score was 4—15." > score.md

D:\Downloads>pandoc score.md -t sample.lua
<p>&quot;The score was 4-15.&quot;</p>

I created the sample.lua with

pandoc --print-default-data-file sample.lua > sample.lua

I was initially suspecting codepage issues under Windows, but I'm not sure what causes it on your system. Could you please post your score.md file?

@nkalvi

nkalvi commented Apr 20, 2015

@jmcphers
I'm sorry - it looks like it worked because the file created from the command prompt was saved as ANSI.

I do get the error when I create the file in Notepad and save it as UTF-8.
But running the following in the command prompt to change the code page first

chcp 65001

will process the file without errors.
Could you please test it?

@jmcphers
Author

It's definitely Windows only; I can't reproduce it on any other system.

I think the reason you're not reproducing it is that the Win32 shell is automatically converting your en dash into a regular hyphen:

C:\downloads>echo "The score was 4—15." > score.md
C:\downloads>type score.md
"The score was 4-15."

Here's a gist of score.md, if that helps:

https://gist.github.com/anonymous/bb3c2a7a8aab8afbb311
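One way to confirm that a test file really contains the non-ASCII character is to look at its raw bytes. The sketch below is an illustration in Python, not something from the thread; the helper name `non_ascii_bytes` is hypothetical, and the sample byte strings assume the dash is U+2013 as the report states.

```python
def non_ascii_bytes(data: bytes) -> list[int]:
    """Return the offsets of every byte outside the ASCII range."""
    return [i for i, b in enumerate(data) if b >= 0x80]


# What the shell wrote after substituting a plain hyphen (pure ASCII).
from_echo = b'"The score was 4-15." \r\n'
# What an editor writes when saving the same sentence as UTF-8.
from_editor = "The score was 4\u201315.".encode("utf-8")

print(non_ascii_bytes(from_echo))    # []
print(non_ascii_bytes(from_editor))  # [15, 16, 17]
```

For a file on disk, the same check would be `non_ascii_bytes(pathlib.Path("score.md").read_bytes())`. An empty result means the file cannot trigger the bug, whatever the code page.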

@jmcphers
Author

@nkalvi , now we're getting somewhere! With chcp 65001, I get some output:

<p>The score was 4—15.</p>p>

Not sure why it's malformed, though--perhaps just related to the sample Lua writer?

@nkalvi

nkalvi commented Apr 21, 2015

@jmcphers
Good - could you please test it with a 'real' input too?

It looks like it is related to Windows' default code page - I saw many similar issues with Python apps too.

@jmcphers
Author

Yep, works fine with real input!

I added a workaround on our end that sets the codepage to 65001 temporarily while we run Pandoc.

It seems a little weird that this is the only situation (that I know of) in which Pandoc requires the code page to be set to UTF-8 -- should we expect that this will be changed in Pandoc at some point, or should we just plan to always run Pandoc under that codepage on Windows?
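The temporary code-page workaround described above follows a save, switch, restore pattern. A minimal Python sketch of that pattern is shown here; it is not the code used in the actual workaround. The names `temporary_code_page`, `get_cp`, and `set_cp` are hypothetical, and the console state is faked so the example runs on any platform.

```python
import contextlib

UTF8_CODE_PAGE = 65001


@contextlib.contextmanager
def temporary_code_page(get_cp, set_cp, code_page=UTF8_CODE_PAGE):
    """Switch the code page for the duration of a block, then restore it.

    get_cp and set_cp are caller-supplied callables (hypothetical hooks).
    """
    previous = get_cp()
    set_cp(code_page)
    try:
        yield previous
    finally:
        # Restore even if the wrapped command raises.
        set_cp(previous)


# Demonstration with an in-memory stand-in for the console state.
state = {"cp": 437}
calls = []


def fake_get():
    return state["cp"]


def fake_set(cp):
    calls.append(cp)
    state["cp"] = cp


with temporary_code_page(fake_get, fake_set):
    inside = state["cp"]  # Pandoc would be launched here, e.g. via subprocess.run

print(inside, state["cp"])  # 65001 437
```

On Windows the hooks could plausibly be the Win32 functions `GetConsoleCP` and `SetConsoleCP` (with `GetConsoleOutputCP` and `SetConsoleOutputCP` for output) reached through `ctypes.windll.kernel32`; that wiring is an assumption and is not taken from the thread.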

@jgm
Owner

jgm commented Apr 21, 2015

+++ Jonathan [Apr 21 15 10:00 ]:

It seems a little weird that this is the only situation (that I know
of) in which Pandoc requires the code page to be set to UTF-8 -- should
we expect that this will be changed in Pandoc at some point, or should
we just plan to always run Pandoc under that codepage on Windows?

I'd like to change pandoc, but we need to figure out what to change.

@nkalvi

nkalvi commented Apr 21, 2015

@jgm

Could the readers/writers check the encoding and set it if needed using hGetEncoding/hSetEncoding? https://www.haskell.org/hoogle/?hoogle=hSetEncoding

I wonder whether discussion below has some helpful clues:
http://stackoverflow.com/questions/7371978/haskell-read-in-special-characters-from-console

@jgm
Owner

jgm commented Apr 21, 2015

+++ nkalvi [Apr 21 15 14:47 ]:

@jgm

Could the readers/writers check the encoding and set it if needed using hGetEncoding/hSetEncoding? https://www.haskell.org/hoogle/?hoogle=hSetEncoding

I think this is something specific to the lua writer; we don't see these
problems with other readers/writers.

jgm added a commit that referenced this issue Apr 22, 2015
@jgm
Owner

jgm commented Apr 22, 2015

I'm curious whether 2bca018 helps with this - can one of you compile and test on your Windows setup?

@nkalvi

nkalvi commented Apr 22, 2015

Will try later today and report back.

@nkalvi

nkalvi commented Apr 22, 2015

@jgm it doesn't seem to help - though it no longer aborts with an error message, it removes the en dash:

D:\src>chcp 437
Active code page: 437

D:\src>pandoc score.md -t sample.lua -o t.html

D:\src>type t.html
<p>The score was 415.</p>

I'll see whether there are any other ways to handle this.
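The `415` output above is consistent with a character that has no equivalent in code page 437 being silently dropped during conversion. The Python sketch below reproduces only the symptom; it is an illustration and makes no claim about the code path Pandoc actually takes.

```python
text = "The score was 4\u201315."

# Code page 437 has no en dash, so a strict encode fails...
try:
    text.encode("cp437")
    representable = True
except UnicodeEncodeError:
    representable = False

# ...and a lossy encode that ignores errors drops the character entirely.
lossy = text.encode("cp437", errors="ignore").decode("cp437")

print(representable)  # False
print(lossy)          # The score was 415.
```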

@jgm
Owner

jgm commented Apr 22, 2015

+++ nkalvi [Apr 22 15 15:36 ]:

@jgm it doesn't seem to help - though it no longer aborts with an error
message, it removes the en dash:
D:\src>chcp 437
Active code page: 437

Try without code page 437. You want UTF-8.

@nkalvi

nkalvi commented Apr 22, 2015

It works fine with 65001 - so did the previous version.
I thought this change is to make it work regardless of the codepage setting.

@jgm
Owner

jgm commented Apr 22, 2015

+++ nkalvi [Apr 22 15 15:42 ]:

It works fine with 65001 - so did the previous version.
I thought this change is to make it work regardless of the codepage
setting.

Okay. I don't really know anything about how Windows works.
But I'm glad to hear this change didn't do any harm.

@nkalvi

nkalvi commented Apr 23, 2015

I couldn't find any good solutions so far (I experimented quite a bit with setlocale etc. in both Custom.hs and sample.lua). It looks like an appropriate code page needs to be set before starting Pandoc.

A batch file that runs chcp 65001 before starting Pandoc seems to be an easy solution.
Alternatively, sample.lua can be edited to include a check like this (in the necessary functions):

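-- Code page the sample writer needs (65001 = UTF-8)
local codePage = "65001"
local currentCodePage = ""

-- Query the active console code page once; if it is not UTF-8,
-- switch it and ask the user to run Pandoc again.
function setCodePage()
  if currentCodePage == "" then
    local fh = io.popen("chcp", "r")
    if fh then
      currentCodePage = fh:read() or ""
      fh:close()
    end
  end
  -- Plain-text search; do not overwrite codePage with os.execute's result
  if string.find(currentCodePage, codePage, 1, true) == nil then
    os.execute("chcp " .. codePage)
    io.stderr:write("Appropriate code page has been set. Please run again\n")
    os.exit(1)
  end
end

function Str(s)
  setCodePage()
  return escape(s)
end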
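The check above searches the whole `chcp` output line for the substring `65001`. A slightly stricter variant is to extract the number and compare it. The Python sketch below illustrates that parsing step only; the helper name `parse_code_page` is hypothetical, and the sample line is the one shown in the transcripts in this thread.

```python
import re


def parse_code_page(chcp_output: str):
    """Extract the numeric code page from a line of chcp output, or None."""
    numbers = re.findall(r"\d+", chcp_output)
    return int(numbers[-1]) if numbers else None


print(parse_code_page("Active code page: 437"))    # 437
print(parse_code_page("Active code page: 65001"))  # 65001
```

Taking the last run of digits avoids depending on the surrounding wording, which may differ on localized Windows installations.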

@lierdakil
Contributor

I know little of Windows (BSD/Linux user for the last 10 years or so), and even less about Lua, but since Lua is encoding-agnostic, I would assume that this happens somewhere in the conversion from CString to Haskell String. GHC uses getForeignEncoding internally, so I suppose that's where the dependence on console encoding comes in. One option to sidestep this issue entirely could be to marshal CString directly into ByteString, without intermediate conversion to String. But that's more than a little bit complicated. Another option is to try using setForeignEncoding, which may help force UTF-8.
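To illustrate the kind of mismatch described above, the Python sketch below decodes the same UTF-8 bytes twice: once as code page 437 and once as UTF-8. It is only an analogy for a byte string crossing a boundary with the wrong encoding assumed, not a reproduction of GHC's marshalling.

```python
# Bytes as an encoding-agnostic producer would hand them back.
utf8_bytes = "\u2013".encode("utf-8")

as_cp437 = utf8_bytes.decode("cp437")  # wrong assumption: one char per byte
as_utf8 = utf8_bytes.decode("utf-8")   # right assumption: one en dash

print([hex(ord(c)) for c in as_cp437])  # ['0x393', '0xc7', '0xf4']
print(as_utf8 == "\u2013")              # True
```

Under the wrong assumption one character becomes three unrelated ones, and whatever happens to them afterwards depends on the next conversion step.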

@lierdakil
Contributor

Could someone with easy access to a Windows box please try the following patch? I believe it could solve this issue.

diff --git a/src/Text/Pandoc/Writers/Custom.hs b/src/Text/Pandoc/Writers/Custom.hs
index 3774fdd..77686c3 100644
--- a/src/Text/Pandoc/Writers/Custom.hs
+++ b/src/Text/Pandoc/Writers/Custom.hs
@@ -48,6 +48,7 @@ import Control.Monad (when)
 import Control.Exception
 import qualified Data.Map as M
 import Text.Pandoc.Templates
+import GHC.IO.Encoding (getForeignEncoding,setForeignEncoding, utf8)

 attrToMap :: Attr -> M.Map ByteString ByteString
 attrToMap (id',classes,keyvals) = M.fromList
@@ -158,6 +159,8 @@ instance Exception PandocLuaException
 writeCustom :: FilePath -> WriterOptions -> Pandoc -> IO String
 writeCustom luaFile opts doc@(Pandoc meta _) = do
   luaScript <- UTF8.readFile luaFile
+  enc <- getForeignEncoding
+  setForeignEncoding utf8
   lua <- Lua.newstate
   Lua.openlibs lua
   status <- Lua.loadstring lua luaScript luaFile
@@ -173,6 +176,7 @@ writeCustom luaFile opts doc@(Pandoc meta _) = do
              (fmap toString . inlineListToCustom lua)
              meta
   Lua.close lua
+  setForeignEncoding enc
   let body = toString rendered
   if writerStandalone opts
      then do

@lierdakil
Contributor

BTW, why ByteString is used at all here is a little mysterious to me, since it's only used inside this module -- Scripting.Lua uses plain Haskell Strings, as does Pandoc's AST.

@nkalvi

nkalvi commented Apr 26, 2015

Excellent!

Under the default code page 437, the file was saved as UTF-8 without any errors and the result looks correct.

@lierdakil
Contributor

Great! I'll create a PR then.

lierdakil added a commit to lierdakil/pandoc that referenced this issue Apr 26, 2015
Closes jgm#2101, jgm#1634

Also factored out ByteString, since it's only used as an intermediate
representation.
@jgm jgm closed this as completed in #2112 Apr 26, 2015
cderv added a commit to rstudio/rmarkdown that referenced this issue Apr 28, 2023
This usage of chcp was introduced in 955ebae to fix #134

It was pre pandoc 2.0 and accompanied with an issue upstream jgm/pandoc#2101

I believe pandoc now handles this better directly and this is not needed anymore. Let's see