Windows: En dashes break custom Lua output #2101
Comments
Are you using the latest release (1.13.2.*) or the development version (GitHub master)?
This is with the latest release, 1.13.2.
@jmcphers I couldn't reproduce it under Windows 8.1 or Windows 7 (or OS X):
I created the sample.lua with
I initially suspected code page issues under Windows, but I'm not sure what causes it on your system. Could you please post your score.md file?
@jmcphers I do get the error when I create the file in Notepad and save it as UTF-8.
will process the file without errors.
It's definitely Windows only; I can't reproduce it on any other system. I think the reason you're not reproducing it is that the Win32 shell is automatically converting your en dash into a regular hyphen: C:\downloads>echo "The score was 4—15." > score.md Here's a gist of score.md, if that helps: |
@nkalvi, now we're getting somewhere! With
Not sure why it's malformed, though--perhaps it's just related to the sample Lua writer?
@jmcphers It looks like it is related to Windows' default code page - I saw many similar issues with Python apps too.
Yep, works fine with real input! I added a workaround on our end that sets the code page to 65001 temporarily while we run Pandoc. It seems a little weird that this is the only situation (that I know of) in which Pandoc requires the code page to be set to UTF-8 -- should we expect that this will be changed in Pandoc at some point, or should we just plan to always run Pandoc under that code page on Windows?
+++ Jonathan [Apr 21 15 10:00 ]:
I'd like to change pandoc, but we need to figure out what to change.
Could the readers/writers check the encoding and set it if needed using hGetEncoding/hSetEncoding? https://www.haskell.org/hoogle/?hoogle=hSetEncoding I wonder whether the discussion below has some helpful clues:
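As a rough illustration of that suggestion (a sketch, not pandoc's actual code), a program can inspect a handle's encoding with `hGetEncoding` and switch it with `hSetEncoding`, both from `System.IO`. The helper name `ensureUtf8` is made up for this example, and comparing the encoding by its `show` name is a simplification.

```haskell
import System.IO

-- Hypothetical helper: switch a handle to UTF-8 unless it already uses it.
-- hGetEncoding returns Nothing for a handle in binary mode; in that case
-- this sketch also sets UTF-8, which puts the handle into text mode.
ensureUtf8 :: Handle -> IO ()
ensureUtf8 h = do
  menc <- hGetEncoding h
  case fmap show menc of
    Just "UTF-8" -> return ()
    _            -> hSetEncoding h utf8

main :: IO ()
main = do
  ensureUtf8 stdout
  -- \8211 is U+2013 (en dash); \& separates the escape from the digits.
  putStrLn "The score was 4\8211\&15."
```

This only affects text I/O through Haskell handles; it does not change the Windows console code page itself.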
+++ nkalvi [Apr 21 15 14:47 ]:
I think this is something specific to the lua writer; we don't see these |
I'm curious whether 2bca018 helps with this - can one of you compile and test on your Windows setup?
Will try later today and report back.
@jgm it doesn't seem to help - though it doesn't abort with an error message, it removes the En-space:
I'll see whether there are any other ways to handle this.
+++ nkalvi [Apr 22 15 15:36 ]:
Try without code page 437. You want UTF-8.
It works fine with 65001 - so did the previous version.
+++ nkalvi [Apr 22 15 15:42 ]:
Okay. I don't really know anything about how Windows works.
I couldn't find any good solutions so far (I experimented quite a bit with setlocale etc. in both Custom.hs and Sample.lua). It looks like an appropriate code page needs to be set before starting Pandoc. A batch file with seems to be an easy solution.
codePage = "65001"
currentCodePage = ""
-- Query the active code page once; if it is not the desired one,
-- switch to it and ask the user to run Pandoc again.
function setCodePage()
  if currentCodePage == "" then
    local fh = io.popen("chcp", "r")
    if fh then
      currentCodePage = fh:read() or ""
      fh:close()
    end
  end
  -- Plain (non-pattern) search for the code page number in the chcp output.
  if string.find(currentCodePage, codePage, 1, true) == nil then
    os.execute("chcp " .. codePage)
    io.stderr:write("Appropriate code page has been set. Please run again\n")
    os.exit(1)
  end
end
function Str(s)
  setCodePage()
  return escape(s)
end
I know little of Windows (I've been a BSD/Linux user for the last 10 years or so), and even less about Lua, but since Lua is encoding-agnostic, I would assume that this happens somewhere in the conversion from CString to Haskell String. GHC uses
Could someone with easy access to a Windows box please try the following patch? I believe it could solve this issue.
diff --git a/src/Text/Pandoc/Writers/Custom.hs b/src/Text/Pandoc/Writers/Custom.hs
index 3774fdd..77686c3 100644
--- a/src/Text/Pandoc/Writers/Custom.hs
+++ b/src/Text/Pandoc/Writers/Custom.hs
@@ -48,6 +48,7 @@ import Control.Monad (when)
import Control.Exception
import qualified Data.Map as M
import Text.Pandoc.Templates
+import GHC.IO.Encoding (getForeignEncoding,setForeignEncoding, utf8)
attrToMap :: Attr -> M.Map ByteString ByteString
attrToMap (id',classes,keyvals) = M.fromList
@@ -158,6 +159,8 @@ instance Exception PandocLuaException
writeCustom :: FilePath -> WriterOptions -> Pandoc -> IO String
writeCustom luaFile opts doc@(Pandoc meta _) = do
luaScript <- UTF8.readFile luaFile
+ enc <- getForeignEncoding
+ setForeignEncoding utf8
lua <- Lua.newstate
Lua.openlibs lua
status <- Lua.loadstring lua luaScript luaFile
@@ -173,6 +176,7 @@ writeCustom luaFile opts doc@(Pandoc meta _) = do
(fmap toString . inlineListToCustom lua)
meta
Lua.close lua
+ setForeignEncoding enc
let body = toString rendered
if writerStandalone opts
then do
BTW, why is |
Excellent! Under the default code page 437, the file was saved as UTF-8 without any errors and the result looks correct.
Great! I'll create a PR then.
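The patch saves the current foreign encoding, switches it to UTF-8, and restores it after the Lua state is closed. As a sketch of the same idea (not the code that went into pandoc), the save/restore can be wrapped in `bracket` so that the old encoding is also restored if the action throws. The helper name `withForeignUtf8` is hypothetical; `getForeignEncoding`, `setForeignEncoding` and `utf8` are the `GHC.IO.Encoding` functions the patch already imports.

```haskell
import Control.Exception (bracket)
import GHC.IO.Encoding (getForeignEncoding, setForeignEncoding, utf8)

-- Hypothetical helper: run an action with the foreign (CString) encoding
-- set to UTF-8, restoring the previous encoding even if the action throws.
withForeignUtf8 :: IO a -> IO a
withForeignUtf8 action =
  bracket
    (getForeignEncoding <* setForeignEncoding utf8)  -- save old, then switch
    setForeignEncoding                               -- restore on exit
    (const action)

main :: IO ()
main = do
  before <- getForeignEncoding
  inside <- withForeignUtf8 getForeignEncoding
  after  <- getForeignEncoding
  -- Prints the encoding names before, during and after the wrapped action.
  putStrLn (show before ++ " -> " ++ show inside ++ " -> " ++ show after)
```

Note that the foreign encoding is process-global state, so this sketch assumes no other thread is marshalling strings through the FFI at the same time.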
This usage of chcp was introduced in 955ebae to fix #134. It predates pandoc 2.0 and was accompanied by an upstream issue, jgm/pandoc#2101. I believe pandoc now handles this better directly, so this is no longer needed. Let's see.
Repro:
Save this as score.md, with UTF-8 encoding (note that it contains an en dash):
Run Pandoc using its sample Lua custom writer:
Result:
The same is true for many other Unicode characters (such as non-breaking spaces and many/most other non-ASCII characters). This may be a dupe of #1634; it's not clear why that issue was closed.