Error when using cpphs in some locale environments #6

Closed
asr opened this issue Aug 19, 2016 · 24 comments

Comments

@asr

asr commented Aug 19, 2016

Some Agda users have reported an error when installing Agda in their locale environments.

An MWE (adapted from this example) is the following:

$ cat Test.hs
module Main where

main = putStrLn "∀"
$ LC_CTYPE=C cpphs Test.hs > /dev/null
cpphs: Test.hs: hGetContents: invalid argument (invalid byte sequence)

@nad wrote here:

I guess that cpphs uses the standard, locale-aware methods to read files. I think all of our source files use the UTF-8 character encoding, so the problem can perhaps be solved by setting LC_CTYPE to <something>.UTF-8 before invoking cpphs, for some locale <something>.UTF-8 that is installed. However, I would not be surprised if it is impossible to do this in a system-independent way. Perhaps it would be better to add a --utf8 flag to cpphs.

Blocking agda/agda#2112.
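
For readers following along, here is a minimal sketch of what "not using the locale-aware default" could look like in Haskell. It is not taken from cpphs; `readFileUtf8` and `writeFileUtf8` are hypothetical helper names, and the file name `utf8-demo.txt` is only an assumption for the demo. The idea is to set the handle encoding explicitly, so that LC_CTYPE and LC_ALL no longer matter.

```haskell
import System.IO

-- | Read a whole file as UTF-8, regardless of the current locale.
-- Hypothetical helper, not part of cpphs.
readFileUtf8 :: FilePath -> IO String
readFileUtf8 path = withFile path ReadMode $ \h -> do
  hSetEncoding h utf8          -- override the locale-derived default
  s <- hGetContents h
  length s `seq` return s      -- force the read before the handle is closed

-- | Write a file as UTF-8, regardless of the current locale.
-- Hypothetical helper, not part of cpphs.
writeFileUtf8 :: FilePath -> String -> IO ()
writeFileUtf8 path s = withFile path WriteMode $ \h -> do
  hSetEncoding h utf8
  hPutStr h s

main :: IO ()
main = do
  hSetEncoding stdout utf8     -- so printing also works under LC_CTYPE=C
  writeFileUtf8 "utf8-demo.txt" "main = putStrLn \"\8704\"\n"  -- U+2200
  readFileUtf8 "utf8-demo.txt" >>= putStr
```

Running the compiled program with and without `LC_CTYPE=C` should behave the same way, because no handle relies on the locale encoding.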

@malcolmwallace
Owner

I can't seem to reproduce the issue with the given steps. cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding? Certainly, setting LC_CTYPE does not seem to change its behaviour.

$ LC_CTYPE=C ./cpphs Test.hs 
#line 1 "Test.hs"
module Main where

main = putStrLn ""

@asr
Author

asr commented Aug 20, 2016

Using the file command to determine the file type, I get

$ file Test.hs
Test.hs: UTF-8 Unicode text

What do you get?

@malcolmwallace
Owner

The same.

@asr
Author

asr commented Aug 20, 2016

It seems you do not have the C locale installed. What is the output of running

$ locale -a

?

@malcolmwallace
Owner

$ locale -a
af_ZA
af_ZA.ISO8859-1
af_ZA.ISO8859-15
af_ZA.UTF-8
am_ET
am_ET.UTF-8
be_BY
be_BY.CP1131
be_BY.CP1251
be_BY.ISO8859-5
be_BY.UTF-8
bg_BG
bg_BG.CP1251
bg_BG.UTF-8
ca_ES
ca_ES.ISO8859-1
ca_ES.ISO8859-15
ca_ES.UTF-8
cs_CZ
cs_CZ.ISO8859-2
cs_CZ.UTF-8
da_DK
da_DK.ISO8859-1
da_DK.ISO8859-15
da_DK.UTF-8
de_AT
de_AT.ISO8859-1
de_AT.ISO8859-15
de_AT.UTF-8
de_CH
de_CH.ISO8859-1
de_CH.ISO8859-15
de_CH.UTF-8
de_DE
de_DE.ISO8859-1
de_DE.ISO8859-15
de_DE.UTF-8
el_GR
el_GR.ISO8859-7
el_GR.UTF-8
en_AU
en_AU.ISO8859-1
en_AU.ISO8859-15
en_AU.US-ASCII
en_AU.UTF-8
en_CA
en_CA.ISO8859-1
en_CA.ISO8859-15
en_CA.US-ASCII
en_CA.UTF-8
en_GB
en_GB.ISO8859-1
en_GB.ISO8859-15
en_GB.US-ASCII
en_GB.UTF-8
en_IE
en_IE.UTF-8
en_NZ
en_NZ.ISO8859-1
en_NZ.ISO8859-15
en_NZ.US-ASCII
en_NZ.UTF-8
en_US
en_US.ISO8859-1
en_US.ISO8859-15
en_US.US-ASCII
en_US.UTF-8
es_ES
es_ES.ISO8859-1
es_ES.ISO8859-15
es_ES.UTF-8
et_EE
et_EE.ISO8859-15
et_EE.UTF-8
eu_ES
eu_ES.ISO8859-1
eu_ES.ISO8859-15
eu_ES.UTF-8
fi_FI
fi_FI.ISO8859-1
fi_FI.ISO8859-15
fi_FI.UTF-8
fr_BE
fr_BE.ISO8859-1
fr_BE.ISO8859-15
fr_BE.UTF-8
fr_CA
fr_CA.ISO8859-1
fr_CA.ISO8859-15
fr_CA.UTF-8
fr_CH
fr_CH.ISO8859-1
fr_CH.ISO8859-15
fr_CH.UTF-8
fr_FR
fr_FR.ISO8859-1
fr_FR.ISO8859-15
fr_FR.UTF-8
he_IL
he_IL.UTF-8
hi_IN.ISCII-DEV
hr_HR
hr_HR.ISO8859-2
hr_HR.UTF-8
hu_HU
hu_HU.ISO8859-2
hu_HU.UTF-8
hy_AM
hy_AM.ARMSCII-8
hy_AM.UTF-8
is_IS
is_IS.ISO8859-1
is_IS.ISO8859-15
is_IS.UTF-8
it_CH
it_CH.ISO8859-1
it_CH.ISO8859-15
it_CH.UTF-8
it_IT
it_IT.ISO8859-1
it_IT.ISO8859-15
it_IT.UTF-8
ja_JP
ja_JP.SJIS
ja_JP.UTF-8
ja_JP.eucJP
kk_KZ
kk_KZ.PT154
kk_KZ.UTF-8
ko_KR
ko_KR.CP949
ko_KR.UTF-8
ko_KR.eucKR
lt_LT
lt_LT.ISO8859-13
lt_LT.ISO8859-4
lt_LT.UTF-8
nl_BE
nl_BE.ISO8859-1
nl_BE.ISO8859-15
nl_BE.UTF-8
nl_NL
nl_NL.ISO8859-1
nl_NL.ISO8859-15
nl_NL.UTF-8
no_NO
no_NO.ISO8859-1
no_NO.ISO8859-15
no_NO.UTF-8
pl_PL
pl_PL.ISO8859-2
pl_PL.UTF-8
pt_BR
pt_BR.ISO8859-1
pt_BR.UTF-8
pt_PT
pt_PT.ISO8859-1
pt_PT.ISO8859-15
pt_PT.UTF-8
ro_RO
ro_RO.ISO8859-2
ro_RO.UTF-8
ru_RU
ru_RU.CP1251
ru_RU.CP866
ru_RU.ISO8859-5
ru_RU.KOI8-R
ru_RU.UTF-8
sk_SK
sk_SK.ISO8859-2
sk_SK.UTF-8
sl_SI
sl_SI.ISO8859-2
sl_SI.UTF-8
sr_YU
sr_YU.ISO8859-2
sr_YU.ISO8859-5
sr_YU.UTF-8
sv_SE
sv_SE.ISO8859-1
sv_SE.ISO8859-15
sv_SE.UTF-8
tr_TR
tr_TR.ISO8859-9
tr_TR.UTF-8
uk_UA
uk_UA.ISO8859-5
uk_UA.KOI8-U
uk_UA.UTF-8
zh_CN
zh_CN.GB18030
zh_CN.GB2312
zh_CN.GBK
zh_CN.UTF-8
zh_CN.eucCN
zh_HK
zh_HK.Big5HKSCS
zh_HK.UTF-8
zh_TW
zh_TW.Big5
zh_TW.UTF-8
C
POSIX

@malcolmwallace
Owner

I don't know whether the version of ghc might be relevant, but in case it is, I'm compiling cpphs with ghc-7.6.1

@asr
Author

asr commented Aug 20, 2016

You have the C locale installed. I could reproduce the issue compiling cpphs with GHC 7.6.3. What shell are you using? I'm using

$ echo $SHELL
/bin/bash

@nad

nad commented Aug 23, 2016

cpphs uses the standard Haskell/ghc System.IO.openFile, which I think trusts the underlying filesystem's metadata about the file's encoding?

I think recent versions of GHC by default use the locale (or code page) to decide what encoding to use.
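
One way to check which encoding GHC's runtime picked up from the environment is to ask the base library directly. The following is a small sketch using `getLocaleEncoding` from `GHC.IO.Encoding` and `hGetEncoding` from `System.IO`; what it prints depends on the platform and on the locale settings, so no particular output is claimed here.

```haskell
import GHC.IO.Encoding (getLocaleEncoding)
import System.IO

main :: IO ()
main = do
  -- The encoding the runtime derived from the locale (or code page).
  enc <- getLocaleEncoding
  putStrLn ("locale encoding: " ++ show enc)
  -- Text-mode handles such as stdin inherit it unless it is overridden.
  henc <- hGetEncoding stdin
  putStrLn ("stdin encoding:  " ++ show henc)
```

Comparing the output of a plain run with a run under `LC_ALL=C` on each system would show whether the locale is being honoured there.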

@nad

nad commented Aug 23, 2016

A simple (system-dependent) test:

$ echo -e '\u2200' > test
$ cat test
∀
$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'
∀
$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'
<interactive>: test: hGetContents: invalid argument (invalid byte sequence)

@nad

nad commented Aug 23, 2016

Certainly, setting LC_CTYPE does not seem to change its behaviour.

Perhaps you've set LC_ALL, which overrides LC_CTYPE.
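
To make the precedence concrete, here is a small sketch of how the effective character-type locale is resolved under the POSIX rules: LC_ALL wins if set and non-empty, then LC_CTYPE, then LANG. `effectiveCtype` is a hypothetical name, and the final fallback to "C" is an assumption (POSIX leaves that default implementation-defined).

```haskell
import Data.Maybe (fromMaybe, listToMaybe, mapMaybe)
import System.Environment (getEnvironment)

-- | Resolve the effective LC_CTYPE value from an environment.
-- Precedence: LC_ALL, then LC_CTYPE, then LANG. Unset or empty
-- variables are skipped. The "C" fallback is an assumption.
effectiveCtype :: [(String, String)] -> String
effectiveCtype env =
  fromMaybe "C" . listToMaybe . filter (not . null) $
    mapMaybe (`lookup` env) ["LC_ALL", "LC_CTYPE", "LANG"]

main :: IO ()
main = getEnvironment >>= putStrLn . effectiveCtype
```

So with `LC_ALL=en_GB.UTF-8` exported, a one-off `LC_CTYPE=C` prefix on the command line would have no effect.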

@malcolmwallace
Owner

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 7.8.4
$ cat test

$ file test
test: UTF-8 Unicode text
$ ghc -e 'putStr =<< readFile "test"'

$ LC_CTYPE=C ghc -e 'putStr =<< readFile "test"'

$ LC_ALL=C ghc -e 'putStr =<< readFile "test"'

@malcolmwallace
Owner

I think I can close this issue, since it appears that neither cpphs nor ghc is at fault.

@asr
Author

asr commented Aug 24, 2016

Which operating system and shell are you using?

@asr
Author

asr commented Aug 24, 2016

Can you reproduce the issue by running

$ export LC_ALL=C
$ cpphs test

?

@malcolmwallace
Owner

ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.

@malcolmwallace
Owner

Cannot reproduce the issue, even with LC_ALL=C.

@asr
Author

asr commented Aug 24, 2016

Did you mean export LC_ALL=C?

@asr
Author

asr commented Aug 24, 2016

What is the output of

$ locale
$ LC_ALL=C locale

?

@malcolmwallace
Owner

$ locale # MacOSX
LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=
$ LC_ALL=C locale
LANG="en_GB.UTF-8"
LC_COLLATE="C"
LC_CTYPE="C"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL="C"

The result is similar on Windows 7, except that the default is en_US.UTF-8 rather than en_GB.UTF-8.

@nad

nad commented Aug 24, 2016

ghc-7.6.1 on MacOSX 10.7.5, with bash.
ghc-7.8.3 on Windows 7 Professional SP1, with bash.

I just discussed this issue with a Mac user, and it seems as if the System.IO functions by default always use UTF-8 under MacOS, while the locale is ignored.

Under Windows I guess that one can use chcp to trigger the problem. Perhaps chcp 1252 would work.

GHC has used UTF-8 as the character encoding for source files since version 6.6 (which was released in 2006), so perhaps cpphs could also use this as the default. Note, however, that the GHC documentation states that "invalid UTF-8 sequences [are] ignored in comments, so it is possible to use other encodings such as Latin-1, as long as the non-comment source code is ASCII only".

I've attached a patch that switches to UTF-8 everywhere (?) in cpphs, with two caveats:

  • The command-line arguments are treated as before.
  • The encoding of stderr is only changed in the top-level module. If cpphs is intended to be used as a library, and error messages can contain non-ASCII characters, then the encoding of stderr should perhaps be changed in the applicable library modules.

I've used the base library's support for roundtripping to handle illegal characters. Feel free to base any changes on this patch.
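
For anyone unfamiliar with roundtripping, the sketch below shows the general idea using `mkTextEncoding "UTF-8//ROUNDTRIP"` from `System.IO`. It is not the attached patch: the helper names (`readWith`, `writeWith`, `readBytes`) and the file names are made up for illustration. A file containing a byte that is not valid UTF-8 is decoded to a `String` and encoded again, and the bytes are then compared.

```haskell
import System.IO

-- Hypothetical helpers: read/write a file with an explicit encoding.
readWith :: TextEncoding -> FilePath -> IO String
readWith enc path = withFile path ReadMode $ \h -> do
  hSetEncoding h enc
  s <- hGetContents h
  length s `seq` return s      -- force the read before the handle is closed

writeWith :: TextEncoding -> FilePath -> String -> IO ()
writeWith enc path s = withFile path WriteMode $ \h -> do
  hSetEncoding h enc
  hPutStr h s

-- Read raw bytes, one Char per byte.
readBytes :: FilePath -> IO String
readBytes path = withBinaryFile path ReadMode $ \h -> do
  s <- hGetContents h
  length s `seq` return s

main :: IO ()
main = do
  enc <- mkTextEncoding "UTF-8//ROUNDTRIP"
  -- 0xFF can never occur in valid UTF-8.
  withBinaryFile "rt-in.bin" WriteMode $ \h -> hPutStr h "A\255B"
  s <- readWith enc "rt-in.bin"      -- decoding does not throw
  writeWith enc "rt-out.bin" s       -- the escaped byte is written back
  bytesIn  <- readBytes "rt-in.bin"
  bytesOut <- readBytes "rt-out.bin"
  putStrLn (if bytesIn == bytesOut then "bytes preserved" else "bytes changed")
```

With plain `utf8` instead of the roundtripping encoding, the read step would fail with the same "invalid byte sequence" error shown at the top of this issue.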

@asr
Author

asr commented Sep 4, 2016

FYI, I reported the different behaviour on Linux and Mac OS here.

@malcolmwallace
Owner

Thanks for the patch, Nils. I rolled something slightly different, to ensure that e.g. #included files also get the UTF8 encoding. I was not previously aware of the roundtripping style of TextEncoding, so that was a useful addition for me.

@asr
Author

asr commented Sep 5, 2016

Thanks for fixing the issue (tested on Agda). Could you release a new version, please?

@malcolmwallace
Owner

cpphs-1.20.2 released.

asr added a commit to agda/agda that referenced this issue Sep 5, 2016
The issue related to some locale environments (see
malcolmwallace/cpphs#6) was fixed in
cpphs 1.20.2.
carlostome pushed a commit to carlostome/agda that referenced this issue Oct 11, 2016
The issue related to some locale environments (see
malcolmwallace/cpphs#6) was fixed in
cpphs 1.20.2.