Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Toplevel: only escapes bytes and not strings #1231

Merged
merged 1 commit into from Sep 13, 2017

Conversation

Projects
None yet
10 participants
@Octachron
Copy link
Contributor

commented Jul 8, 2017

Escaping strings when printing them in the toplevel has the disadvantage of making unicode text unreadable:

# "한글";;
- : string = "\237\149\156\234\184\128"

This PR proposes to escape only bytes and not strings:

# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"

This change is not purely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that Ocaml has no support for unicode (and being able to read the corresponding strings in the toplevel benefits all users not able to fluently read raw unicode codepoint).

On a more technical note, in presence of string delimiters inside the printed string, the best string delimiters still available are used:

# "日本\"";;
- : string = {|日本"語|}

In order of preference, the string delimiters used are ", {|, {t|, {top|,{toplevel|, and in the worst case scenario {toplevel%d|:

# "مرسی\n\"|}|t}|top}|toplevel}|toplevel1}|toplevel2}";;
- : string =
{toplevel3|ﻡﺮﺳی
"|}|t}|top}|toplevel}|toplevel1}|toplevel2}|toplevel3}

With this scheme, all strings should be printable as valid string literals.

One disadvantage of this approch is that newline and tabs are not escaped anymore

# "\n\t\n\t\n";;
- : string = "
	
	
"

@Octachron Octachron force-pushed the Octachron:hello_κόσμος branch from d8ee0b7 to 60fe3c1 Jul 8, 2017

@gasche

This comment has been minimized.

Copy link
Member

commented Jul 8, 2017

I like some aspects of the change (the idea that unicode letters are printed back) but not others: not-escaping the whitespace characters leads to a loss of readability, for example. Of course, some time people write string literals with newlines in them, and escaping them instead hurts readability.

I have mixed feelings about the automatic choice of {|..|} for escaping; it's a lesser-known feature that may surprise users, but on the other it makes the feature more self-discoverable. Overall I would say that I like it.

Have people studied the problem of "what is the right choice of which characters to print and which characters to escape" before, and are there solution that do not require more unicode knowledge than available in the OCaml standard library?

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 8, 2017

I'm not really fond of the choices made by this PR. These would be my suggestions:

  1. Always escape control characters (bytes <= 0x1F) and the string delimiter.
  2. Do not use the "{|" syntax, always delimit strings with "
  3. Consult some kind of environment variable to determine if the terminal is UTF-8 aware (this has to be researched or if nothing reliable is available one ocaml specific should be defined). If the terminal is not UTF-8 aware any byte > 0x7E should still be escaped.
  4. (Ideally the string should be checked for UTF-8 validity and if not valid its non printable US-ASCII bytes should be escaped. But this can be improved once/if we get more UTF-8 tools in the stdlib).
  5. I would really like to have the bytes of string and bytes escaped with hexadecimal numbers. Decimal escapes are a huge PITA if you are dealing with UTF-8 and other byte oriented file formats.

(Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)

This is a user interface not an API, as such I think it would be legitimate to depend on the user locale.

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 8, 2017

Concerning point 1 and 2, escaping characters in the C0 control character code set and " may be indeed clearer and it rejoins @gasche remarks.

Concerning point 3 (and 4), I am not completely convinced because it seems much simpler to simply not escape bytes ≥ 0x7F. By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users ) of not escaping control characters in the C1 code set (in particular NEL) and the more exotic LS or PS new lines (on the other hand, it would even give a work-arround for users that really really want non-escaped new lines).

I agree with point 5, but using hexadecimal in escaping sequence does not seems to particularly concern the toplevel, and I think this should be changed globally.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 8, 2017

By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users )

This kind of fortunate coincidence argument doesn't make sense to me. We have been trying to push for sometime now for a model where OCaml strings should be UTF-8 encoded.

I will let the dev team determine if they find it important that the toplevel is still able to function in a 7-bit environment. But I really think that we should have at least an OCAMLTOP_UTF_8 boolean environment variable (that defaults to true), that controls this.

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2017

I have updated this to escape C0 control characters and string delimiters; in other words, quoted string
delimiters are gone and \t and \n are now escaped:

# "серафими\t\"многоꙮчитїи\"";;
- : string = "серафими\t\"многоꙮчитїи\""

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input (except for control characters that indeed should not take control of the toplevel printing).

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

So @pqwy who studied the problem in the context of his notty library tells me that it seems hopeless to try to find a system environment variable to lookup (though LC_ALL, LANG, LC_CTYPE, could be possibly consulted in that order, but that doesn't tell you if the output actually agrees).

@dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence?

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2017

No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E.

I agree that being able to reactivate the escaping of bytes > 0x7E is a undeniable improvement.

Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.

Well, \xC3\xA9 is also a perfectly valid latin-1(é) or SHIFT-JIS (テゥ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

Well, \xC3\xA9 is also a perfectly valid latin-1(é) or SHIFT-JIS (テゥ) byte sequence, so breaking the terminal might be a bit hyperbolic here.

But in that case the user would leave OCAMLTOP_UTF_8 set to true (as strange as it sounds) and would not input hex escapes...

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2017

I have added the OCAMLTOP_UTF_8 variable, man page and manual description included.

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:
the described behavior did not seem valid anymore (and they are no trace of iso_8859_1 nor LC_CTYPE in the code base). Moreover, OCAMLTOP_UTF_8 supersedes such use of LC_CTYPE.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1:

FTR @xavierleroy removed that from the manual in 5d385f9. I think this may have gone away with the resolution by @damiendoligez of MPR6521 in e60a2db.

@@ -74,6 +74,34 @@ let parenthesize_if_neg ppf fmt v isneg =
fprintf ppf fmt v;
if isneg then pp_print_char ppf ')'

(** Escape only C0 control characters (bytes <= 0x1F) and '"' *)
let print_out_string ppf s =

This comment has been minimized.

Copy link
@xavierleroy

xavierleroy Jul 9, 2017

Contributor

0x7F, a.k.a. DEL, is a control character and needs escaping too.

This comment has been minimized.

Copy link
@Octachron

Octachron Jul 9, 2017

Author Contributor

Fixed, thanks.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

This discussion makes me feel younger, because we had pretty much the same discussion back in the early 1990s when Latin-1 support was added to Caml Light and not all environments would support characters above 0x80...

Escaping control characters is absolutely necessary, and not only to display TAB, CR and LF meaningfully. For example, if terminal escape sequences are printed verbatim, the display can be completely messed up.

A desirable property of the toplevel value printer is that the output, once fed back into the toplevel by cut-and-paste, should parse back to the same value (as much as possible). I think it is the case in the latest incarnation of this PR, but make sure it is.

@lpw25

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

Isn't this going to mess up the formatting? Format doesn't understand the widths of these characters. I'm lead to believe that this is quite a tricky problem to solve.

@dbuenzli

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

Isn't this going to mess up the formatting?

It will.

I'm lead to believe that this is quite a tricky problem to solve.

It's worse than that. It's a problem that is impossible to solve without being able to interact with the rendering layer to measure how many cells your UTF-8 encoded string is going to span when rendered -- something no terminal out there will provide you.

You can perform some kind of best effort formatting using either Uucp.tty_width_hint (a form of wcwidth designed by @pqwy for this problem) or Uuseg's pretty-printers but even, none of these solutions are entirely fool proof.

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 9, 2017

Format output will get messy since Format will tend to overestimate the length of the graphical representation of strings.

Some examples, first in Greek:

# String.split_on_char ' ' "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾿ Ἀχαιοῖς ἄλγε᾿ ἔθηκε, πολλὰς δ᾿ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι· Διὸς δ᾿ ἐτελείετο βουλή ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.";;   
  - : string list =
["Μῆνιν"; "ἄειδε"; "θεὰ"; "Πηληϊάδεω";
 "Ἀχιλῆος"; "οὐλομένην,"; ""; "μυρί᾿";
 "Ἀχαιοῖς"; "ἄλγε᾿"; "ἔθηκε,"; "πολλὰς";
 "δ᾿"; "ἰφθίμους"; "ψυχὰς"; "Ἄϊδι";
 "προΐαψεν"; "ἡρώων,"; "αὐτοὺς"; "δὲ";
 "ἑλώρια"; "τεῦχε"; "κύνεσσιν"; "οἰωνοῖσί";
 "τε"; "πᾶσι·"; "Διὸς"; "δ᾿"; "ἐτελείετο";
 "βουλή"; "ἐξ"; "οὗ"; "δὴ"; "τὰ"; "πρῶτα";
 "διαστήτην"; "ἐρίσαντε"; "Ἀτρεΐδης"; "τε";
 "ἄναξ"; "ἀνδρῶν"; "καὶ"; "δῖος";
 "Ἀχιλλεύς."]

Compared to the english version

# String.split_on_char ' ' "Achilles sing, O Goddess! Peleus' son; His wrath pernicious, who ten thousand woes Caused to Achaia's host, sent many a soul Illustrious into Ades premature, And Heroes gave (so stood the will of Jove) To dogs and to all ravening fowls a prey, When fierce   dispute had separated once The noble Chief Achilles from the son Of Atreus, Agamemnon, King of men.";;    
- : string list =
["Achilles"; "sing,"; "O"; "Goddess!"; "Peleus'"; "son;"; "His"; "wrath";
 "pernicious,"; "who"; "ten"; "thousand"; "woes"; "Caused"; "to"; "Achaia's";
 "host,"; "sent"; "many"; "a"; "soul"; Illustrious"; "into"; "Ades";
 "premature,"; "And"; "Heroes"; "gave"; "(so"; "stood"; "the"; "will"; "of";
 "Jove)"; "To"; "dogs"; "and"; "to"; "all"; "ravening"; "fowls"; "a";
 "prey,"; "When"; "fierce"; "dispute"; "had"; "separated"; "once"; "The";
 "noble"; "Chief"; "Achilles"; "from"; "the"; "son"; "Of"; "Atreus,";
 "Agamemnon,"; "King"; "of"; "men."]

Similarly with Japanese

# [
"こぬ人を";
"まつほの浦の";
"夕なぎに";
"やくやもしほの";
"身もこがれつつ"
];;            
- : string list =
["こぬ人を"; "まつほの浦の"; "夕なぎに";
 "やくやもしほの"; "身もこがれつつ"]

or Sanskrit

# [       
"अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम";
"होतारं"; "रत्नधातमम";
"अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत";
""; "देवानेह"; "वक्षति"
];;           
- : string list =
["अग्निमीळे"; "पुरोहितं";
 "यज्ञस्य"; "देवं रत्वीजम";
 "होतारं"; "रत्नधातमम"; "अग्निः";
 "पूर्वेभिर्र्षिभिरीड्यो";
 "नूतनैरुत"; ""; "देवानेह";
 "वक्षति"]
@gasche

This comment has been minimized.

Copy link
Member

commented Jul 9, 2017

On the other hand, any of these examples are completely unreadable with the current pretty-printing scheme, so the output you show (if formatted a bit weirdly compared to the english version) is a strong improvement.

Given that this only affect the toplevel output (and not calls to Format in user programs), I believe that not having a general solution to length formatting is fine.

@DemiMarie

This comment has been minimized.

Copy link
Contributor

commented Jul 9, 2017

I think that Rust’s approach is best long-term one: strings must be in UTF-8 and are immutable. Creating a string that is not valid UTF-8 is undefined behavior.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Jul 14, 2017

Also, if I'm reading the code correctly, backslash is not escaped...

Why not base your implementation on that of String.escaped / Bytes.escaped? That would avoid so many regressions.

@Octachron Octachron force-pushed the Octachron:hello_κόσμος branch from 0b82612 to a9b3057 Jul 14, 2017

@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Jul 14, 2017

There is a cosmetic regression w.r.t the current escaping of strings: CR, LF, TAB and BS are printed with numeric escapes, while previously they were printed \r, \n, \t, \b.

They were not? At least not during my tests? (https://github.com/ocaml/ocaml/blob/trunk/stdlib/char.ml#L29)

Also, if I'm reading the code correctly, backslash is not escaped...

Of this, I am atrociously guilty. I should have added a test on the testsuite covering the whole ascii range.
This lack of test is fixed.

Why not base your implementation on that of String.escaped / Bytes.escaped?

As wished, I have reimplemented the string escape in the style of Bytes.escaped.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Jul 15, 2017

My comment about CR LF and co was based on a wrong reading of the code, just ignore it and apologies about this.

@Octachron Octachron force-pushed the Octachron:hello_κόσμος branch from a9b3057 to c66b9b9 Jul 16, 2017

@xavierleroy
Copy link
Contributor

left a comment

Looks very good to me, with a bit of LaTeX tweaking recommended.

Changes Outdated
@@ -530,6 +530,11 @@ Next major version (4.05.0):
- PR#7060, GPR#1035: Print exceptions in installed custom printers
(Tadeu Zagallo, review by David Allsopp)

- GPR#1231: improved printing of unicode texts in the toplevel,
when OCAMLTOP_UTF_8 is not set to false.

This comment has been minimized.

Copy link
@xavierleroy

xavierleroy Sep 11, 2017

Contributor

"unless OCAMLTOP_UTF_8 is set to false" would read better, I think.

This comment has been minimized.

Copy link
@Octachron

Octachron Sep 12, 2017

Author Contributor

I agree. Fixed.

@@ -126,6 +126,10 @@ The following command-line options are recognized by the "ocaml" command.
\begin{unix}
The following environment variables are also consulted:
\begin{options}
\item["OCAMLTOP_UTF_8"] When printing string values, non-ascii bytes
(>0x7E) are printed as decimal escape sequence if "OCAMLTOP_UTF_8" is

This comment has been minimized.

Copy link
@xavierleroy

xavierleroy Sep 11, 2017

Contributor

(>0x7E) will format poorly in LaTeX, with the > character rendered as upside-down question mark or some such. Please format as a proper math formula:

  ($ {} > "\0x7E" $)

This comment has been minimized.

Copy link
@Octachron

Octachron Sep 12, 2017

Author Contributor

Fixed.

@Octachron Octachron force-pushed the Octachron:hello_κόσμος branch from 17b77ac to 3fb6deb Sep 12, 2017

@gasche

This comment has been minimized.

Copy link
Member

commented Sep 12, 2017

I remarked while reading the code that the "max string length" parameter of the Oval_string node may not be respected, given that escaping (done after this test) may increase the length. However, (1) previous implementations and the Bytes codepath also suffer from this issue and (2) this parameter is currently not fixed by the user (then it would be nice to respect it) but by the Genprintval recursion-depth control code, so it sounds reasonable to only respect it approximately.

@gasche

This comment has been minimized.

Copy link
Member

commented Sep 12, 2017

@Octachron I think you should feel free to rebase (if you want to squash some of the intermediary commits) and merge.

@Octachron Octachron force-pushed the Octachron:hello_κόσμος branch from 3fb6deb to 2e6a78a Sep 12, 2017

toplevel: only escapes bytes and not strings
Escaping strings when printing them in the toplevel has the disadvantage
of mangling unicode text:

```
\# "한글";;
- : string = "\237\149\156\234\184\128"
```

With this commit, strings are not escaped anymore, contrarily to bytes:

```
\# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
\# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"
```

This new behavior can be disabled dynamically by setting the environment
variable OCAMLTOP_UTF_8 to false

This change is not solely aesthetic: the mangling of unicode string may
contribute to the impression of some OCaml newcomers that Ocaml has no
support for unicode.

@Octachron Octachron merged commit 588c231 into ocaml:trunk Sep 13, 2017

2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@Octachron

This comment has been minimized.

Copy link
Contributor Author

commented Sep 13, 2017

Squashed to a single commit and merged.

@nojb

This comment has been minimized.

Copy link
Contributor

commented Oct 5, 2017

There is another place where a similar treatment of strings may be a good idea, namely Printexc.to_string, see here. It currently uses %S which ends up escaping Unicode filenames (which can easily appear as arguments of Sys_error).

@gasche

This comment has been minimized.

Copy link
Member

commented Oct 5, 2017

Do we want to consider relaxing the escaping in %S? The semantic specification, as I understand it, is to print strings in the way that they can be parsed back as OCaml literals. Maybe we can assume that unicode strings are parsed back as OCaml literals in a unicode file, so that non-escaping is justified?

@gasche

This comment has been minimized.

Copy link
Member

commented Oct 5, 2017

(On the other hand, printing unicode on Windows seems to still be an open problem, so maybe escaping is not that bad?)

@damiendoligez

This comment has been minimized.

Copy link
Member

commented Oct 9, 2017

IMO Printexc.to_string is a debugging tool, its output should be seen by developers but not by end users. For a developer, it's probably more convenient to have the escaped version of the string.

@yallop

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

I was surprised just now by the change in escaping behaviour introduced by this PR. Running OCaml in an Emacs subshell, I see this:

# "\255";;
- : string = "\377"

whereas before OCaml 4.06 I see this:

# "\255";;
- : string = "\255"

What's happening: OCaml used to escape characters above 0x80 when printing strings, and printed them using the same decimal format used in string literals. However, since this PR such characters are now printed directly to the terminal instead. Since they can't be displayed directly, the terminal emulator prints them as octal escapes.

@xavierleroy

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2019

I agree the octal escape is confusing. At least, Emacs colors the \377 in red, doesn't it? So, that gives a hint to the (experienced) Emacs user...

@shindere

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2019

@yallop

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

Emacs does indeed colour the \377, and the printed string behaves correctly in other ways: for example, copying and pasting the string produces the original value, and navigation commands treat the displayed \377 as a single character.

@yallop

This comment has been minimized.

Copy link
Member

commented Jan 8, 2019

@shindere: while the colour hint is only useful for sighted users, I wonder whether users who don't rely on the visual interface avoid the confusing issue in the first place. For example, emacspeak seems to treat the printed character as a single unit (which I think it pronounces "y umlaut") rather than reading out the octal escape code. I'm curious whether the interface you use is similarly helpful.

@shindere

This comment has been minimized.

Copy link
Contributor

commented Jan 8, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.