String.escaped returns strange results in Mac OS X + LANG=ja_JP.UTF-8 #6521

Closed
vicuna opened this issue Aug 26, 2014 · 10 comments

commented Aug 26, 2014

Original bug ID: 6521
Reporter: furuse
Status: closed (set by @damiendoligez on 2015-03-11T19:15:48Z)
Resolution: fixed
Priority: normal
Severity: major
Version: 4.02.0+beta1 / +rc1
Target version: 4.03.0+dev / +beta1
Fixed in version: 4.03.0+dev / +beta1
Category: runtime system and C interface
Tags: junior_job
Related to: #6925
Monitored by: @gasche

Bug description

On Mac OS X, if LANG=ja_JP.UTF-8, String.escaped does not escape some characters >= 0x80. It seems that ISO-8859-1 printable chars are left unescaped in this setting. See janestreet/sexplib#11 for details. String.escaped is LANG dependent, and in ja_JP.UTF-8 (and probably in other UTF-8 locales too) its results are not valid UTF-8. This is strange even considering that OCaml strings are unofficially ISO-8859-1 rather than UTF-8.

The documentation comment of String.escaped does not clearly state which chars are escaped. For a long time I thought it escaped every char that is not ASCII printable, but apparently it does not in the above setting. The function internally calls caml_is_printable(), which uses setlocale(LC_CTYPE, ""). I am not an i18n guru, but the spec of setlocale says:


"C" Same as POSIX.

"" : Specifies an implementation-dependent native environment. For XSI-conformant systems, this corresponds to the value of the associated environment variables, LC_* and LANG; see the XBD specification, Locale and the XBD specification, Environment Variables .


It seems that the result of isprint() is implementation dependent after setlocale(LC_CTYPE, ""). This might explain what we see on Mac OS X + LANG=ja_JP.UTF-8.
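
A minimal probe of this locale dependence, as a sketch: `count_printable_high_bytes` is a hypothetical helper, not part of the OCaml runtime. It selects a locale with setlocale(LC_CTYPE, ...) and counts how many bytes in 0x80..0xFF isprint() accepts. For "C" the count should be 0 on POSIX systems; for "" it depends on the platform and the environment, so no result is claimed for that case.

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Counts the bytes in 0x80..0xFF that isprint() accepts once the given
   locale has been selected for LC_CTYPE. Returns -1 if the locale cannot
   be selected. Note: setlocale() changes process-wide state. */
static int count_printable_high_bytes(const char *locale_name)
{
  if (setlocale(LC_CTYPE, locale_name) == NULL)
    return -1;

  int count = 0;
  for (int c = 0x80; c <= 0xFF; c++) {
    if (isprint(c))
      count++;
  }
  return count;
}
```

Calling it with "" under different LANG values on the affected machine would show whether any high bytes are treated as printable there.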

I propose the following:

  • Clearly document what String.escaped returns. I think many believe that it returns strings that contain only ASCII printable chars.
  • Change setlocale(LC_CTYPE, "") in caml_is_printable to setlocale(LC_CTYPE, "C") so that it becomes implementation independent.
  • Or simply hard-code the ASCII printable check (0x20 <= c && c <= 0x7E).
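
The third option could look roughly like this. It is only a sketch, and `is_ascii_printable` is an illustrative name rather than the runtime's actual function:

```c
/* Locale-independent check: true only for printable ASCII, i.e. the
   range from space (0x20) through '~' (0x7E). Unlike isprint(), the
   result does not depend on setlocale() or on LANG. */
static int is_ascii_printable(int c)
{
  return 0x20 <= c && c <= 0x7E;
}
```

With such a check, every byte >= 0x80 would be escaped on every platform.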

commented Aug 26, 2014

Comment author: furuse

The spec of setlocale I found is here: http://pubs.opengroup.org/onlinepubs/7908799/xsh/setlocale.html

commented Aug 26, 2014

Comment author: @mshinwell

Jun, is this a new bug in 4.02?

commented Aug 26, 2014

Comment author: @dbuenzli

The bug is also present before 4.02.

Reading the documentation of String.escaped, and given the encoding of OCaml's strings, I expect String.escaped to escape only the unprintable characters of ISO-8859-1 (i.e. the gray unlabelled boxes here: http://en.wikipedia.org/wiki/ISO/IEC_8859-1#Codepage_layout). That is, those that are not in (0x20 <= c && c <= 0x7E || 0xA0 <= c && c <= 0xFF).

So the current behaviour with LANG=ja_JP.UTF-8 seems fine to me; it's the behaviour with LANG=C that is not. Now, if historically (what did the original authors expect?) and statistically the function has given us the behaviour of LANG=C, then I also suggest hard-coding the check to ASCII printable characters.

Example with my own locale (the right behaviour to me) and then with LANG=C, on OS X 10.9.4 with OCaml 4.01.0:

echo $LANG
fr_CH.UTF-8
ocaml
OCaml version 4.01.0

let s = String.escaped "\233\171\152";;

val s : string = "?\152"

Char.code s.[0];;

- : int = 233

Char.code s.[1];;

- : int = 171

export LANG=C
ocaml
OCaml version 4.01.0

let s = String.escaped "\233\171\152";;

val s : string = "\233\171\152"
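
The two behaviours in this session can be reproduced with a small C sketch that escapes bytes following OCaml's lexical conventions (\ddd decimal escapes, plus the short forms for double quote, backslash, newline, tab, carriage return and backspace) and takes the printability test as a parameter. All names here (`escape_bytes`, `ascii_printable`, `latin1_printable`, `short_escape`) are hypothetical, and this is not the stdlib implementation.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

typedef int (*printable_fn)(unsigned char);

/* Printable ASCII only: the LANG=C behaviour. */
static int ascii_printable(unsigned char c)
{
  return 0x20 <= c && c <= 0x7E;
}

/* Printable ISO-8859-1: printable ASCII plus 0xA0..0xFF. */
static int latin1_printable(unsigned char c)
{
  return ascii_printable(c) || c >= 0xA0;
}

/* Letter used in a two-character escape for c, or 0 if there is none. */
static char short_escape(unsigned char c)
{
  switch (c) {
    case '"':  return '"';
    case '\\': return '\\';
    case '\n': return 'n';
    case '\t': return 't';
    case '\r': return 'r';
    case '\b': return 'b';
    default:   return 0;
  }
}

/* Writes the escaped, NUL-terminated form of src[0..len) into out.
   Returns the escaped length, or (size_t)-1 if out (of size cap) is
   too small. */
static size_t escape_bytes(const unsigned char *src, size_t len,
                           printable_fn printable, char *out, size_t cap)
{
  size_t n = 0;
  for (size_t i = 0; i < len; i++) {
    unsigned char c = src[i];
    char piece[5];
    size_t w;
    char e = short_escape(c);
    if (e != 0) {
      piece[0] = '\\'; piece[1] = e; w = 2;
    } else if (printable(c)) {
      piece[0] = (char)c; w = 1;
    } else {
      /* Three-digit decimal escape, e.g. \233. */
      snprintf(piece, sizeof piece, "\\%03d", (int)c);
      w = 4;
    }
    if (n + w + 1 > cap) return (size_t)-1;
    memcpy(out + n, piece, w);
    n += w;
  }
  if (n + 1 > cap) return (size_t)-1;
  out[n] = '\0';
  return n;
}
```

For the input bytes 233, 171, 152, passing `ascii_printable` yields the 12-character text \233\171\152, while passing `latin1_printable` yields the raw bytes 233 and 171 followed by \152, matching the Char.code results in the first session.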

commented Aug 26, 2014

Comment author: @dbuenzli

Well actually given the doc:

Return a copy of the argument, with special characters represented by escape sequences, following the lexical conventions of OCaml.

We can say that both answers in the example above are in fact valid, in the sense that, when interpreted by an OCaml compiler, both strings denote the same sequence of bytes.

commented Aug 26, 2014

Comment author: furuse

Yes, both outputs are valid... I think we should choose one of them. It is very confusing that the behaviour of OCaml's runtime is affected by LANG.

I prefer escaping everything that is not ASCII printable, since

  • I have not checked thoroughly, but it seems to be the behaviour on Linux with any LANG, which many have long been used to.
  • Escaping everything that is not ASCII printable makes the result valid as UTF-8 and in other encodings too.
  • Many use OCaml strings to store UTF-8 data, whether or not they know strings are officially ISO-8859-1. Escaping everything that is not ASCII printable is meaningful to them too.
  • I am selfish and I live in Asia :-)
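
The UTF-8 point can be checked mechanically. Below is a simplified structural check, offered as a sketch: it verifies lead and continuation bytes only and deliberately ignores overlong forms and surrogates, and `looks_like_utf8` is a hypothetical helper name.

```c
#include <stddef.h>

/* Returns 1 if s[0..len) is structurally well-formed UTF-8, 0 otherwise.
   Simplified: checks only that each lead byte is followed by the right
   number of continuation bytes (10xxxxxx). */
static int looks_like_utf8(const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len) {
    unsigned char c = s[i];
    size_t need;
    if (c < 0x80)                need = 0;  /* ASCII */
    else if ((c & 0xE0) == 0xC0) need = 1;  /* 110xxxxx */
    else if ((c & 0xF0) == 0xE0) need = 2;  /* 1110xxxx */
    else if ((c & 0xF8) == 0xF0) need = 3;  /* 11110xxx */
    else return 0;                          /* stray continuation or invalid lead */

    if (need > len - i - 1) return 0;       /* sequence truncated */
    for (size_t k = 1; k <= need; k++) {
      if ((s[i + k] & 0xC0) != 0x80) return 0;
    }
    i += need + 1;
  }
  return 1;
}
```

The bytes 233, 171, 152 form a well-formed three-byte sequence on their own. A partially escaped result, the raw bytes 233 and 171 followed by the ASCII text \152, fails the check because the backslash is not a continuation byte, whereas the all-ASCII text \233\171\152 passes.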

We could choose to escape only non-printable ISO-8859-1 chars, but in that case I would like to have escaped_to_ASCII too.

commented Aug 26, 2014

Comment author: @dbuenzli

Yes to everything (even you being selfish).

If we choose one, I'd also be absolutely in favour of escaping all the non-ASCII-printable characters. UTF-8 compatibility of the returned string is the argument. There's no real point against that if we want a forward-looking solution.

commented Aug 27, 2014

Comment author: @alainfrisch

I'm also in favor of the change. FWIW, we already have it in LexiFi's version, to avoid introducing different behaviors between platforms for such a basic function.

commented Aug 28, 2014

Comment author: @damiendoligez

Strings are supposed to be encoding-agnostic, and certainly not officially iso-8859-1.

Definitely change it to escape all non-ascii-printable.

commented Sep 15, 2014

Comment author: @damiendoligez

This will be an incompatible change, so I'm pushing it back to 4.03.

commented Mar 11, 2015

Comment author: @damiendoligez

Fixed in trunk (commit 15901).

Note that I also changed Bytes.escaped and Char.escaped.
