# Toplevel: only escapes bytes and not strings #1231

Merged
merged 1 commit into from Sep 13, 2017

## Conversation

Projects
None yet
10 participants
Contributor

### Octachron commented Jul 8, 2017

 Escaping strings when printing them in the toplevel has the disadvantage of making unicode text unreadable: # "한글";; - : string = "\237\149\156\234\184\128" This PR proposes to escape only bytes and not strings: # let cosmos = "κόσμος";; cosmos : string = "κόσμος" # Bytes.of_string cosmos;; - : bytes = Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130" This change is not purely aesthetic: the mangling of unicode strings may contribute to the impression of some OCaml newcomers that Ocaml has no support for unicode (and being able to read the corresponding strings in the toplevel benefits all users not able to fluently read raw unicode codepoint). On a more technical note, in presence of string delimiters inside the printed string, the best string delimiters still available are used: # "日本\"語";; - : string = {|日本"語|} In order of preference, the string delimiters used are ", {|, {t|, {top|,{toplevel|, and in the worst case scenario {toplevel%d|: # "مرسی\n\"|}|t}|top}|toplevel}|toplevel1}|toplevel2}";; - : string = {toplevel3|ﻡﺮﺳی "|}|t}|top}|toplevel}|toplevel1}|toplevel2}|toplevel3} With this scheme, all strings should be printable as valid string literals. One disadvantage of this approch is that newline and tabs are not escaped anymore # "\n\t\n\t\n";; - : string = " "

Member

### gasche commented Jul 8, 2017

 I like some aspects of the change (the idea that unicode letters are printed back) but not others: not-escaping the whitespace characters leads to a loss of readability, for example. Of course, some time people write string literals with newlines in them, and escaping them instead hurts readability. I have mixed feelings about the automatic choice of {|..|} for escaping; it's a lesser-known feature that may surprise users, but on the other it makes the feature more self-discoverable. Overall I would say that I like it. Have people studied the problem of "what is the right choice of which characters to print and which characters to escape" before, and are there solution that do not require more unicode knowledge than available in the OCaml standard library? (Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.)
Contributor

### dbuenzli commented Jul 8, 2017 • edited

 I'm not really fond of the choices made by this PR. These would be my suggestions: Always escape control characters (bytes <= 0x1F) and the string delimiter. Do not use the "{|" syntax, always delimit strings with " Consult some kind of environment variable to determine if the terminal is UTF-8 aware (this has to be researched or if nothing reliable is available one ocaml specific should be defined). If the terminal is not UTF-8 aware any byte > 0x7E should still be escaped. (Ideally the string should be checked for UTF-8 validity and if not valid its non printable US-ASCII bytes should be escaped. But this can be improved once/if we get more UTF-8 tools in the stdlib). I would really like to have the bytes of string and bytes escaped with hexadecimal numbers. Decimal escapes are a huge PITA if you are dealing with UTF-8 and other byte oriented file formats. (Would we want this behavior to depend on the current user locale? In general it seems that people push for locale-independence these days.) This is a user interface not an API, as such I think it would be legitimate to depend on the user locale.
Contributor Author

### Octachron commented Jul 8, 2017

 Concerning point 1 and 2, escaping characters in the C0 control character code set and " may be indeed clearer and it rejoins @gasche remarks. Concerning point 3 (and 4), I am not completely convinced because it seems much simpler to simply not escape bytes ≥ 0x7F. By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users ) of not escaping control characters in the C1 code set (in particular NEL) and the more exotic LS or PS new lines (on the other hand, it would even give a work-arround for users that really really want non-escaped new lines). I agree with point 5, but using hexadecimal in escaping sequence does not seems to particularly concern the toplevel, and I think this should be changed globally.
Contributor

### dbuenzli commented Jul 8, 2017

 By doing so, we would keep some compatibility with latin-1 and JIS users at the only cost ( for utf-8 users ) This kind of fortunate coincidence argument doesn't make sense to me. We have been trying to push for sometime now for a model where OCaml strings should be UTF-8 encoded. I will let the dev team determine if they find it important that the toplevel is still able to function in a 7-bit environment. But I really think that we should have at least an OCAMLTOP_UTF_8 boolean environment variable (that defaults to true), that controls this.
Contributor Author

### Octachron commented Jul 9, 2017

 I have updated this to escape C0 control characters and string delimiters; in other words, quoted string delimiters are gone and \t and \n are now escaped: # "серафими\t\"многоꙮчитїи\"";; - : string = "серафими\t\"многоꙮчитїи\"" @dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence? Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input (except for control characters that indeed should not take control of the toplevel printing).
Contributor

### dbuenzli commented Jul 9, 2017 • edited

 So @pqwy who studied the problem in the context of his notty library tells me that it seems hopeless to try to find a system environment variable to lookup (though LC_ALL, LANG, LC_CTYPE, could be possibly consulted in that order, but that doesn't tell you if the output actually agrees). @dbuenzli, I am not sure what your proposed OCAMLTOP_UTF_8 environment variable is supposed to do? Disable the escaping of bytes >0x7E but only for utf-8 valid byte sequence? No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E. Anyway, I personally don't dislike the fortunate coincidence that users get back in the toplevel the same string they submitted as input Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal.
Contributor Author

### Octachron commented Jul 9, 2017

 No if OCAMLTOP_UTF_8 is set to false you always escape bytes > 0x7E. I agree that being able to reactivate the escaping of bytes > 0x7E is a undeniable improvement. Indeed, in a non Unicode aware terminal this user wouldn't be able to input UTF-8, so she would e.g. input "\xC3\xA9" and you would output them unescaped which would break her terminal. Well, \xC3\xA9 is also a perfectly valid latin-1(Ã©) or SHIFT-JIS (ﾃｩ) byte sequence, so breaking the terminal might be a bit hyperbolic here.
Contributor

### dbuenzli commented Jul 9, 2017

 Well, \xC3\xA9 is also a perfectly valid latin-1(Ã©) or SHIFT-JIS (ﾃｩ) byte sequence, so breaking the terminal might be a bit hyperbolic here. But in that case the user would leave OCAMLTOP_UTF_8 set to true (as strange as it sounds) and would not input hex escapes...
Contributor Author

### Octachron commented Jul 9, 2017 • edited

 I have added the OCAMLTOP_UTF_8 variable, man page and manual description included. Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1: the described behavior did not seem valid anymore (and they are no trace of iso_8859_1 nor LC_CTYPE in the code base). Moreover, OCAMLTOP_UTF_8 supersedes such use of LC_CTYPE.
Contributor

### dbuenzli commented Jul 9, 2017 • edited

 Note that I have also deleted the paragraph in the ocaml man page about LC_CTYPE=iso_8859_1: FTR @xavierleroy removed that from the manual in 5d385f9. I think this may have gone away with the resolution by @damiendoligez of MPR6521 in e60a2db.

### xavierleroy reviewed Jul 9, 2017

 @@ -74,6 +74,34 @@ let parenthesize_if_neg ppf fmt v isneg = fprintf ppf fmt v; if isneg then pp_print_char ppf ')' (** Escape only C0 control characters (bytes <= 0x1F) and '"' *) let print_out_string ppf s =

#### xavierleroy Jul 9, 2017

Contributor

0x7F, a.k.a. DEL, is a control character and needs escaping too.

#### Octachron Jul 9, 2017

Author Contributor

Fixed, thanks.

Contributor

### xavierleroy commented Jul 9, 2017

 This discussion makes me feel younger, because we had pretty much the same discussion back in the early 1990s when Latin-1 support was added to Caml Light and not all environments would support characters above 0x80... Escaping control characters is absolutely necessary, and not only to display TAB, CR and LF meaningfully. For example, if terminal escape sequences are printed verbatim, the display can be completely messed up. A desirable property of the toplevel value printer is that the output, once fed back into the toplevel by cut-and-paste, should parse back to the same value (as much as possible). I think it is the case in the latest incarnation of this PR, but make sure it is.
Contributor

### lpw25 commented Jul 9, 2017

 Isn't this going to mess up the formatting? Format doesn't understand the widths of these characters. I'm lead to believe that this is quite a tricky problem to solve.
Contributor

### dbuenzli commented Jul 9, 2017

 Isn't this going to mess up the formatting? It will. I'm lead to believe that this is quite a tricky problem to solve. It's worse than that. It's a problem that is impossible to solve without being able to interact with the rendering layer to measure how many cells your UTF-8 encoded string is going to span when rendered -- something no terminal out there will provide you. You can perform some kind of best effort formatting using either Uucp.tty_width_hint (a form of wcwidth designed by @pqwy for this problem) or Uuseg's pretty-printers but even, none of these solutions are entirely fool proof.
Contributor Author

### Octachron commented Jul 9, 2017

 Format output will get messy since Format will tend to overestimate the length of the graphical representation of strings. Some examples, first in Greek: # String.split_on_char ' ' "Μῆνιν ἄειδε θεὰ Πηληϊάδεω Ἀχιλῆος οὐλομένην, ἣ μυρί᾿ Ἀχαιοῖς ἄλγε᾿ ἔθηκε, πολλὰς δ᾿ ἰφθίμους ψυχὰς Ἄϊδι προΐαψεν ἡρώων, αὐτοὺς δὲ ἑλώρια τεῦχε κύνεσσιν οἰωνοῖσί τε πᾶσι· Διὸς δ᾿ ἐτελείετο βουλή ἐξ οὗ δὴ τὰ πρῶτα διαστήτην ἐρίσαντε Ἀτρεΐδης τε ἄναξ ἀνδρῶν καὶ δῖος Ἀχιλλεύς.";; - : string list = ["Μῆνιν"; "ἄειδε"; "θεὰ"; "Πηληϊάδεω"; "Ἀχιλῆος"; "οὐλομένην,"; "ἣ"; "μυρί᾿"; "Ἀχαιοῖς"; "ἄλγε᾿"; "ἔθηκε,"; "πολλὰς"; "δ᾿"; "ἰφθίμους"; "ψυχὰς"; "Ἄϊδι"; "προΐαψεν"; "ἡρώων,"; "αὐτοὺς"; "δὲ"; "ἑλώρια"; "τεῦχε"; "κύνεσσιν"; "οἰωνοῖσί"; "τε"; "πᾶσι·"; "Διὸς"; "δ᾿"; "ἐτελείετο"; "βουλή"; "ἐξ"; "οὗ"; "δὴ"; "τὰ"; "πρῶτα"; "διαστήτην"; "ἐρίσαντε"; "Ἀτρεΐδης"; "τε"; "ἄναξ"; "ἀνδρῶν"; "καὶ"; "δῖος"; "Ἀχιλλεύς."] Compared to the english version # String.split_on_char ' ' "Achilles sing, O Goddess! Peleus' son; His wrath pernicious, who ten thousand woes Caused to Achaia's host, sent many a soul Illustrious into Ades premature, And Heroes gave (so stood the will of Jove) To dogs and to all ravening fowls a prey, When fierce dispute had separated once The noble Chief Achilles from the son Of Atreus, Agamemnon, King of men.";; - : string list = ["Achilles"; "sing,"; "O"; "Goddess!"; "Peleus'"; "son;"; "His"; "wrath"; "pernicious,"; "who"; "ten"; "thousand"; "woes"; "Caused"; "to"; "Achaia's"; "host,"; "sent"; "many"; "a"; "soul"; Illustrious"; "into"; "Ades"; "premature,"; "And"; "Heroes"; "gave"; "(so"; "stood"; "the"; "will"; "of"; "Jove)"; "To"; "dogs"; "and"; "to"; "all"; "ravening"; "fowls"; "a"; "prey,"; "When"; "fierce"; "dispute"; "had"; "separated"; "once"; "The"; "noble"; "Chief"; "Achilles"; "from"; "the"; "son"; "Of"; "Atreus,"; "Agamemnon,"; "King"; "of"; "men."] Similarly with Japanese # [ "こぬ人を"; "まつほの浦の"; "夕なぎに"; "やくやもしほの"; "身もこがれつつ" ];; - : string list = ["こぬ人を"; "まつほの浦の"; "夕なぎに"; "やくやもしほの"; "身もこがれつつ"] or Sanskrit # [ "अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम"; "होतारं"; "रत्नधातमम"; "अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत"; "स"; "देवानेह"; "वक्षति" ];; - : string list = ["अग्निमीळे"; "पुरोहितं"; "यज्ञस्य"; "देवं रत्वीजम"; "होतारं"; "रत्नधातमम"; "अग्निः"; "पूर्वेभिर्र्षिभिरीड्यो"; "नूतनैरुत"; "स"; "देवानेह"; "वक्षति"]
Member

### gasche commented Jul 9, 2017

 On the other hand, any of these examples are completely unreadable with the current pretty-printing scheme, so the output you show (if formatted a bit weirdly compared to the english version) is a strong improvement. Given that this only affect the toplevel output (and not calls to Format in user programs), I believe that not having a general solution to length formatting is fine.
Contributor

### DemiMarie commented Jul 9, 2017

 I think that Rust’s approach is best long-term one: strings must be in UTF-8 and are immutable. Creating a string that is not valid UTF-8 is undefined behavior.
Contributor

### xavierleroy commented Jul 14, 2017 • edited

 Also, if I'm reading the code correctly, backslash is not escaped... Why not base your implementation on that of String.escaped / Bytes.escaped? That would avoid so many regressions.

### Octachronforce-pushed the Octachron:hello_κόσμος branch from 0b82612 to a9b3057Jul 14, 2017

Contributor Author

### Octachron commented Jul 14, 2017

 There is a cosmetic regression w.r.t the current escaping of strings: CR, LF, TAB and BS are printed with numeric escapes, while previously they were printed \r, \n, \t, \b. They were not? At least not during my tests? (https://github.com/ocaml/ocaml/blob/trunk/stdlib/char.ml#L29) Also, if I'm reading the code correctly, backslash is not escaped... Of this, I am atrociously guilty. I should have added a test on the testsuite covering the whole ascii range. This lack of test is fixed. Why not base your implementation on that of String.escaped / Bytes.escaped? As wished, I have reimplemented the string escape in the style of Bytes.escaped.
Contributor

Contributor

### xavierleroy left a comment

 Looks very good to me, with a bit of LaTeX tweaking recommended.
Changes Outdated
 @@ -530,6 +530,11 @@ Next major version (4.05.0): - PR#7060, GPR#1035: Print exceptions in installed custom printers (Tadeu Zagallo, review by David Allsopp) - GPR#1231: improved printing of unicode texts in the toplevel, when OCAMLTOP_UTF_8 is not set to false.

#### xavierleroy Sep 11, 2017

Contributor

"unless OCAMLTOP_UTF_8 is set to false" would read better, I think.

#### Octachron Sep 12, 2017

Author Contributor

I agree. Fixed.

 @@ -126,6 +126,10 @@ The following command-line options are recognized by the "ocaml" command. \begin{unix} The following environment variables are also consulted: \begin{options} \item["OCAMLTOP_UTF_8"] When printing string values, non-ascii bytes (>0x7E) are printed as decimal escape sequence if "OCAMLTOP_UTF_8" is

#### xavierleroy Sep 11, 2017

Contributor

(>0x7E) will format poorly in LaTeX, with the > character rendered as upside-down question mark or some such. Please format as a proper math formula:

  (${} > "\0x7E"$)


#### Octachron Sep 12, 2017

Author Contributor

Fixed.

Member

### gasche commented Sep 12, 2017

 I remarked while reading the code that the "max string length" parameter of the Oval_string node may not be respected, given that escaping (done after this test) may increase the length. However, (1) previous implementations and the Bytes codepath also suffer from this issue and (2) this parameter is currently not fixed by the user (then it would be nice to respect it) but by the Genprintval recursion-depth control code, so it sounds reasonable to only respect it approximately.
Member

### gasche commented Sep 12, 2017

 @Octachron I think you should feel free to rebase (if you want to squash some of the intermediary commits) and merge.

### Octachronforce-pushed the Octachron:hello_κόσμος branch from 3fb6deb to 2e6a78aSep 12, 2017

 toplevel: only escapes bytes and not strings 
Escaping strings when printing them in the toplevel has the disadvantage
of mangling unicode text:


\# "한글";;
- : string = "\237\149\156\234\184\128"


With this commit, strings are not escaped anymore, contrarily to bytes:


\# let cosmos = "κόσμος";;
cosmos : string = "κόσμος"
\# Bytes.of_string cosmos;;
- : bytes =
Bytes.of_string "\206\186\207\140\207\131\206\188\206\191\207\130"


This new behavior can be disabled dynamically by setting the environment
variable OCAMLTOP_UTF_8 to false

This change is not solely aesthetic: the mangling of unicode string may
contribute to the impression of some OCaml newcomers that Ocaml has no
support for unicode.
 2e6a78a 

### Octachron merged commit 588c231 into ocaml:trunk Sep 13, 2017 2 checks passed

#### 2 checks passed

continuous-integration/appveyor/pr AppVeyor build succeeded
Details
continuous-integration/travis-ci/pr The Travis CI build passed
Details
Contributor Author

### Octachron commented Sep 13, 2017

 Squashed to a single commit and merged.
Contributor

### nojb commented Oct 5, 2017

 There is another place where a similar treatment of strings may be a good idea, namely Printexc.to_string, see here. It currently uses %S which ends up escaping Unicode filenames (which can easily appear as arguments of Sys_error).
Member

### gasche commented Oct 5, 2017

 Do we want to consider relaxing the escaping in %S? The semantic specification, as I understand it, is to print strings in the way that they can be parsed back as OCaml literals. Maybe we can assume that unicode strings are parsed back as OCaml literals in a unicode file, so that non-escaping is justified?
Member

### gasche commented Oct 5, 2017

 (On the other hand, printing unicode on Windows seems to still be an open problem, so maybe escaping is not that bad?)

Merged

Member

### damiendoligez commented Oct 9, 2017

 IMO Printexc.to_string is a debugging tool, its output should be seen by developers but not by end users. For a developer, it's probably more convenient to have the escaped version of the string.

Closed

Member

### yallop commented Jan 8, 2019

 I was surprised just now by the change in escaping behaviour introduced by this PR. Running OCaml in an Emacs subshell, I see this: # "\255";; - : string = "\377"  whereas before OCaml 4.06 I see this: # "\255";; - : string = "\255"  What's happening: OCaml used to escape characters above 0x80 when printing strings, and printed them using the same decimal format used in string literals. However, since this PR such characters are now printed directly to the terminal instead. Since they can't be displayed directly, the terminal emulator prints them as octal escapes.
Contributor

### xavierleroy commented Jan 8, 2019

 I agree the octal escape is confusing. At least, Emacs colors the \377 in red, doesn't it? So, that gives a hint to the (experienced) Emacs user...
Contributor

### shindere commented Jan 8, 2019

 Xavier Leroy (2019/01/08 09:31 -0800): I agree the octal escape is confusing. At least, Emacs colors the \377 in red, doesn't it? So, that gives a hint to the (experienced) Emacs user... You mean experienced and sighted, right?
Member

### yallop commented Jan 8, 2019

 Emacs does indeed colour the \377, and the printed string behaves correctly in other ways: for example, copying and pasting the string produces the original value, and navigation commands treat the displayed \377 as a single character.
Member

### yallop commented Jan 8, 2019

 @shindere: while the colour hint is only useful for sighted users, I wonder whether users who don't rely on the visual interface avoid the confusing issue in the first place. For example, emacspeak seems to treat the printed character as a single unit (which I think it pronounces "y umlaut") rather than reading out the octal escape code. I'm curious whether the interface you use is similarly helpful.
Contributor

### shindere commented Jan 8, 2019

 Well unfortunately not. I am using a braille display which does not really have a way to convey the same kind of information, but it's great to read that emacspeak does such a great job, so many thanks for having reported this back!

Open