Error when using cpphs in some locale environments #6

asr · 2016-08-19T13:13:45Z

Some Agda users have reported an error when installing Agda in their locale environments.

An MWE (adapted from this example) is the following:

$ cat Test.hs
module Main where

main = putStrLn "∀"

$ LC_CTYPE=C cpphs Test.hs > /dev/null
cpphs: Test.hs: hGetContents: invalid argument (invalid byte sequence)

@nad wrote here:

I guess that cpphs uses the standard, locale-aware methods to read files. I think all of our source files use the UTF-8 character encoding, so the problem can perhaps be solved by setting LC_CTYPE to <locale>.UTF-8 before invoking cpphs, for some locale <locale>.UTF-8 that is installed. However, I would not be surprised if it is impossible to do this in a system-independent way. Perhaps it would be better to add a --utf8 flag to cpphs.
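
For illustration, the idea behind a --utf8 option is to fix the handle's encoding explicitly instead of inheriting it from the locale. The following is only a minimal sketch under that assumption; readFileUtf8 is a hypothetical helper name, not an existing cpphs function, and the real fix may look different.

```haskell
import System.IO

-- | Read a whole file, decoding it as UTF-8 regardless of the locale.
-- Hypothetical helper for illustration; not part of cpphs.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = do
  h <- openFile path ReadMode
  hSetEncoding h utf8   -- override the locale-derived default encoding
  hGetContents h        -- lazy read; the handle is closed once the input is consumed
```

With a helper like this, LC_CTYPE=C should no longer matter for the MWE above, because the decoder is chosen by the program rather than by the environment.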

Blocking agda/agda#2112.

malcolmwallace · 2016-08-20T10:11:39Z

I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.

$ LC_CTYPE=C ./cpphs Test.hs 
#line 1 "Test.hs"
module Main where

main = putStrLn "∀"

asr · 2016-08-20T11:22:39Z

Using the file command for determining the file type I got

$ file Test.hs
Test.hs: UTF-8 Unicode text

What do you get?

malcolmwallace · 2016-08-20T13:36:50Z

The same.

asr · 2016-08-20T14:35:58Z

It seems you don't have the C locale installed. What is the output of running

$ locale -a

?

malcolmwallace · 2016-08-20T16:07:32Z

$ locale -a
af_ZA
af_ZA.ISO8859-1
af_ZA.ISO8859-15
af_ZA.UTF-8
am_ET
am_ET.UTF-8
be_BY
be_BY.CP1131
be_BY.CP1251
be_BY.ISO8859-5
be_BY.UTF-8
bg_BG
bg_BG.CP1251
bg_BG.UTF-8
ca_ES
ca_ES.ISO8859-1
ca_ES.ISO8859-15
ca_ES.UTF-8
cs_CZ
cs_CZ.ISO8859-2
cs_CZ.UTF-8
da_DK
da_DK.ISO8859-1
da_DK.ISO8859-15
da_DK.UTF-8
de_AT
de_AT.ISO8859-1
de_AT.ISO8859-15
de_AT.UTF-8
de_CH
de_CH.ISO8859-1
de_CH.ISO8859-15
de_CH.UTF-8
de_DE
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
el_GR
el_GR.ISO8859-7
el_GR.UTF-8
en_AU
en_AU.ISO8859-1
en_AU.ISO8859-15
en_AU.US-ASCII
en_AU.UTF-8
en_CA
en_CA.ISO8859-1
en_CA.ISO8859-15
en_CA.US-ASCII
en_CA.UTF-8
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.US-ASCII
en_GB.UTF-8
en_IE
en_IE.UTF-8
en_NZ
en_NZ.ISO8859-1
en_NZ.ISO8859-15
en_NZ.US-ASCII
en_NZ.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
es_ES
es_ES.ISO8859-1
es_ES.ISO8859-15
es_ES.UTF-8
et_EE
et_EE.ISO8859-15
et_EE.UTF-8
eu_ES
eu_ES.ISO8859-1
eu_ES.ISO8859-15
eu_ES.UTF-8
fi_FI
fi_FI.ISO8859-1
fi_FI.ISO8859-15
fi_FI.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.UTF-8
fr_CA
fr_CA.ISO8859-1
fr_CA.ISO8859-15
fr_CA.UTF-8
fr_CH
fr_CH.ISO8859-1
fr_CH.ISO8859-15
fr_CH.UTF-8
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8
he_IL
he_IL.UTF-8
hi_IN.ISCII-DEV
hr_HR
hr_HR.ISO8859-2
hr_HR.UTF-8
hu_HU
hu_HU.ISO8859-2
hu_HU.UTF-8
hy_AM
hy_AM.ARMSCII-8
hy_AM.UTF-8
is_IS
is_IS.ISO8859-1
is_IS.ISO8859-15
is_IS.UTF-8
it_CH
it_CH.ISO8859-1
it_CH.ISO8859-15
it_CH.UTF-8
it_IT
it_IT.ISO8859-1
it_IT.ISO8859-15
it_IT.UTF-8
ja_JP
ja_JP.SJIS
ja_JP.UTF-8
ja_JP.eucJP
kk_KZ
kk_KZ.PT154
kk_KZ.UTF-8
ko_KR
ko_KR.CP949
ko_KR.UTF-8
ko_KR.eucKR
lt_LT
lt_LT.ISO8859-13
lt_LT.ISO8859-4
lt_LT.UTF-8
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.UTF-8
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.UTF-8
no_NO
no_NO.ISO8859-1
no_NO.ISO8859-15
no_NO.UTF-8
pl_PL
pl_PL.ISO8859-2
pl_PL.UTF-8
pt_BR
pt_BR.ISO8859-1
pt_BR.UTF-8
pt_PT
pt_PT.ISO8859-1
pt_PT.ISO8859-15
pt_PT.UTF-8
ro_RO
ro_RO.ISO8859-2
ro_RO.UTF-8
ru_RU
ru_RU.CP1251
ru_RU.CP866
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8
sk_SK
sk_SK.ISO8859-2
sk_SK.UTF-8
sl_SI
sl_SI.ISO8859-2
sl_SI.UTF-8
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8
sv_SE
sv_SE.ISO8859-1
sv_SE.ISO8859-15
sv_SE.UTF-8
tr_TR
tr_TR.ISO8859-9
tr_TR.UTF-8
uk_UA
uk_UA.ISO8859-5
uk_UA.KOI8-U
uk_UA.UTF-8
zh_CN
zh_CN.GB18030
zh_CN.GB2312
zh_CN.GBK
zh_CN.UTF-8
zh_CN.eucCN
zh_HK
zh_HK.Big5HKSCS
zh_HK.UTF-8
zh_TW
zh_TW.Big5
zh_TW.UTF-8
C
POSIX

malcolmwallace · 2016-08-20T16:10:14Z

I don't know whether the version of ghc might be relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1

asr · 2016-08-20T16:17:36Z

You have the C locale installed. I could reproduce the issue compiling cpphs with GHC 7.6.3. What shell are you using? I'm using

$ echo $SHELL
/bin/bash

nad · 2016-08-23T08:21:33Z

cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?

I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
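
One way to check this on a given machine is to ask the runtime which encoding it picked. A small diagnostic sketch, assuming base 4.5 or later (for getLocaleEncoding); the function names defaultEncodingName and handleEncodingName are made up for this example:

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO

-- | Name of the encoding the runtime derived from the environment.
defaultEncodingName :: IO String
defaultEncodingName = fmap show getLocaleEncoding

-- | Name of the encoding a freshly opened text-mode handle uses for a file.
handleEncodingName :: FilePath -> IO String
handleEncodingName path =
  withFile path ReadMode $ \h ->
    fmap (maybe "binary (no encoding)" show) (hGetEncoding h)
```

Running these under different LC_CTYPE or LC_ALL settings shows whether the locale actually influences the decoder on that platform. The reported names are system-dependent, so no expected output is given here.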

nad · 2016-08-23T08:29:33Z

A simple (system-dependent) test:

$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)

nad · 2016-08-23T08:31:38Z

Certainly, setting LC_CTYPE does not seem to change its behaviour.

Perhaps you've set LC_ALL, which overrides LC_CTYPE.
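
The precedence for the character-type category is LC_ALL, then LC_CTYPE, then LANG; empty values count as unset. A minimal sketch of that lookup follows. The names effectiveCtype and currentCtype are hypothetical, and falling back to "C" when nothing is set is an assumption (the default is implementation-defined).

```haskell
import System.Environment (getEnvironment)

-- | Pick the locale that governs LC_CTYPE from an environment listing:
-- the first set, non-empty value among LC_ALL, LC_CTYPE and LANG.
effectiveCtype :: [(String, String)] -> String
effectiveCtype env =
  case [ v | name <- ["LC_ALL", "LC_CTYPE", "LANG"]
           , Just v <- [lookup name env]
           , not (null v) ] of
    (v : _) -> v
    []      -> "C"   -- assumed fallback when no variable is set

-- | The same lookup applied to the current process environment.
currentCtype :: IO String
currentCtype = fmap effectiveCtype getEnvironment
```

So with LC_ALL set to a UTF-8 locale, a command-line LC_CTYPE=C has no effect, which would explain a failure to reproduce on that account alone.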

malcolmwallace · 2016-08-24T11:28:51Z

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
∀
$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'
∀

malcolmwallace · 2016-08-24T11:29:49Z

I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.

asr · 2016-08-24T13:14:52Z

Which operating system and shell are you using?

asr · 2016-08-24T13:18:37Z

Can you reproduce the issue by running

$ export LC_ALL=C
$ cpphs test

?

malcolmwallace · 2016-08-24T13:24:59Z

ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.

malcolmwallace · 2016-08-24T13:26:08Z

Cannot reproduce the issue, even with LC_ALL=C.

asr · 2016-08-24T13:29:17Z

Did you mean export LC_ALL=C?

asr · 2016-08-24T13:37:49Z

What is the output of

$ locale
$ LC_ALL=C locale

?

malcolmwallace · 2016-08-24T14:26:39Z

$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.

nad · 2016-08-24T16:23:56Z

ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.

I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.

Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.

GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".

I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:

The command-line arguments are treated as before.
The encoding of stderr is only changed in the top-level module. If cpphs is intended to be used as a library, and error messages can contain non-ASCII characters, then the encoding of stderr should perhaps be changed in the applicable library modules.

I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
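
For readers unfamiliar with roundtripping: base lets you append //ROUNDTRIP to an encoding name passed to mkTextEncoding, so that bytes that are not valid in that encoding are decoded to special escape characters and written back as the same bytes on output. The sketch below only illustrates the mechanism and is not taken from the attached patch; copyRoundtrip is a hypothetical name.

```haskell
import System.IO

-- | Copy a file through String using UTF-8 with roundtripping, so that
-- byte sequences that are not valid UTF-8 survive unchanged instead of
-- raising "invalid byte sequence".
copyRoundtrip :: FilePath -> FilePath -> IO ()
copyRoundtrip src dst = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  withFile src ReadMode $ \hin ->
    withFile dst WriteMode $ \hout -> do
      hSetEncoding hin enc
      hSetEncoding hout enc
      -- hPutStr consumes the lazy input fully before either handle is closed
      hGetContents hin >>= hPutStr hout
```

This matters for the Latin-1-in-comments case mentioned above: a plain utf8 decoder would reject such a file, while the roundtripping variant passes the offending bytes through.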

asr · 2016-09-04T16:39:00Z

FYI, I reported the different behaviour on Linux and Mac OS here.

malcolmwallace · 2016-09-05T13:58:00Z

Thanks for the patch, Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF-8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.

asr · 2016-09-05T14:33:10Z

Thanks for fixing the issue (tested on Agda). Could you release a new version, please?

malcolmwallace · 2016-09-05T16:35:58Z

cpphs-1.20.2 released.

The issue related to some locale environments (see malcolmwallace/cpphs#6) was fixed in cpphs 1.20.2.

asr mentioned this issue Aug 19, 2016

Build fails using cpphs in some locale environments agda/agda#2112

Closed

malcolmwallace closed this as completed Aug 24, 2016

asr added a commit to agda/agda that referenced this issue Sep 5, 2016

[ closed #2112 ] Required cpphs 1.20.2.

a25c002

The issue related to some locale environments (see malcolmwallace/cpphs#6) was fixed in cpphs 1.20.2.

carlostome pushed a commit to carlostome/agda that referenced this issue Oct 11, 2016

[ closed agda#2112 ] Required cpphs 1.20.2.

e220ce6

The issue related to some locale environments (see malcolmwallace/cpphs#6) was fixed in cpphs 1.20.2.
