Expose Oprint.escape_string as String.escape #10457

dmbaturin · 2021-06-15T14:57:53Z

The motivation for is that the original String.escaped is overly aggressive and escapes everything except the printable ASCII range, which is a major inconvenience in a Unicode world.

Add a String.escape_ascii alias for the old String.escaped to simplify its eventual deprecation, in line with other ASCII-only functions (lowercase_ascii etc.).

Adding a deprecation warning to String.escaped ([@@ocaml.deprecated "Use String.escape_ascii instead."]) breaks the stringLabels.ml module when building with warning as an error option, so I only added a deprecation tag to the documentation comment.

The motivation for is that the original String.escaped is overly aggressive and escapes everything except the printable ASCII range, which is a major invoncenience in a Unicode world. Add a String.escape_ascii alias for the old String.escaped to simplify its eventual deprecation.

stdlib/stringLabels.mli

dra27 · 2021-06-15T18:39:18Z

I'm not completely sold on the need to rename String.escaped (mainly because I don't relish another 6 years of seeing the deprecation warnings in builds...) - there's nothing wrong with it, per se? Can't the additional escaping function be added with a new name and String.escaped left alone?

dmbaturin · 2021-06-16T01:10:45Z

Can't the additional escaping function be added with a new name and String.escaped left alone?

I'm also unsure about that. If they were named String.escape (Unicode) and String.escape_ascii from the beginning, I think that would have been a good solution. However, String.escaped isn't going away any soon, so there'are going to be two different functions with a single-letter name difference.

Arguments for renaming String.escape_ascii:

It's probably the best name for what the old String.escaped does.
There are lots of _ascii renames and deprecations, so when the time comes, it will be easier to remove all deprecated functions at once.

Arguments against the renaming:

Deprecation warnings indeed may last a long time, possibly forever.

I don't have a strong opinion here. If you have ideas for a better name for String.escape, please share.

dmbaturin · 2021-06-16T01:49:56Z

Anecdata: from 284 archives in my OPAM download cache, 20 unique packages use String.escape, and I think many of them (like omd) can benefit from updating to the less aggressive version.

xavierleroy

I fully support having functions that escape special characters but not UTF8 characters in strings and in byte arrays in the standard library.

I disagree with the proposed implementation and with the proposed interface.

xavierleroy · 2021-06-18T16:29:28Z

stdlib/string.ml

+let escape s =
+  (* Escape only C0 control characters (bytes <= 0x1F), DEL(0x7F), '\\'
+     and '"' *)
+   let n = ref 0 in
+  for i = 0 to length s - 1 do
+    n := !n +
+      (match unsafe_get s i with
+       | '\"' | '\\' | '\n' | '\t' | '\r' | '\b' -> 2
+       | '\x00' .. '\x1F'
+       | '\x7F' -> 4
+       | _ -> 1)
+  done;
+  if !n = length s then s else begin
+    let s' = B.create !n in
+    n := 0;
+    for i = 0 to length s - 1 do
+      begin match unsafe_get s i with
+      | ('\"' | '\\') as c ->
+          B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n c
+      | '\n' ->
+          B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'n'
+      | '\t' ->
+          B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 't'
+      | '\r' ->
+          B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'r'
+      | '\b' ->
+          B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'b'
+      | '\x00' .. '\x1F' | '\x7F' as c ->
+          let a = Char.code c in
+          B.unsafe_set s' !n '\\';
+          incr n;
+          B.unsafe_set s' !n (Char.chr (48 + a / 100));
+          incr n;
+          B.unsafe_set s' !n (Char.chr (48 + (a / 10) mod 10));
+          incr n;
+          B.unsafe_set s' !n (Char.chr (48 + a mod 10));
+      | c -> B.unsafe_set s' !n c
+      end;
+      incr n
+    done;
+    B.to_string s'
+  end


Why not share an implementation with String.escaped and Bytes.escaped? It's the same loops, all you need is a Boolean to say whether '\x80'...'\xFF' needs escaping or not.

I agree I guess. That function was duplicated in the toplevel implementation, I simply moved it to the standard library.

Do I get your plan right?

Add a new function like val _escaped : bool → string → string with a boolean flag to indicate whether to escape unicode or not, but don't expose it in the module interface.

Add escaped_ascii and escaped_utf8 wrappers for it?

Alias escaped to escape_ascii.

Yes, except that you also need to update the Bytes module... So, the new bool -> string -> string function probably goes in Bytes, and is exported but not documented, so that it can be used from String as well.

What do you think of adding an optional argument ?(escape_unicode=true) to String.escaped and making the new escaped_ascii and escaped_utf8 functions aliases for it with different value of that argument?

I like optional arguments, generally speaking. But here, with an optional argument on escaped, it would be strange to provide escaped_ascii and escaped_utf8 in addition to escaped. Also, putting a deprecation warning on escaped (regardless of optional argument) would make little sense.

I imagine that we could have an optional argument with the parameter escape_unicode set as true by default at first, and then switch to false instead of deprecating the escaped function few version later.

If I were to decide, I'd go for an optional argument.

Also, tangentially related: there's code duplication with Char.escaped. Is there a reason for it, or we can make Char.escaped a building block for this?

xavierleroy · 2021-06-18T16:31:53Z

stdlib/string.mli

+val escape_ascii : string -> string
+(** [escape_ascii s] is [s] with all characters outside the printable US-ASCII
+    range represented by escape sequences.
+
+    All characters outside the US-ASCII printable range \[0x20;0x7E\] are
+    escaped, as well as backslash (0x2F) and double-quote (0x22).
+
+    The function {!Scanf.unescaped} is a left inverse of [escaped],
+    i.e. [Scanf.unescaped (escaped s) = s] for any string [s] (unless
+    [escaped s] fails).
+
+    @raise Invalid_argument if the result is longer than
+    {!Sys.max_string_length} bytes.
+
+    @since 4.14.0 *)
+
+val escape : string -> string
+(** [escape s] is [s] with special characters replaced with escape sequences,
+    following the lexical conventions of OCaml.
+
+    The following characters are escaped:
+    double quote (['\"'], newline (['\n']), carriage return (['\r']),
+    tab (['\t\]), backspace (['\b']), DEL (0x7F),
+    and everything in the ['\x00' .. '\x1F'] range.
+
+    @raise Invalid_argument if the result is longer than
+    {!Sys.max_string_length} bytes.
+
+    @since 4.14.0 *)


I prefer escaped to escape, but opinions from native English speakers are most welcome.

Also, I'd rather have two new functions escaped_ascii and escaped_utf8, with escaped defaulting to escaped_ascii for the time being, then eventually deprecated, then eventually defined as escaped_utf8.

The current naming tends to favor adjectives/past participles, but not consistently. For example, String.trim, String.capitalize, String.uncapitalize are verbs.

This likely needs a broader and more general discussion. I don't have a strong opinion, I'll go with maintainers decision.

I'm curious to hear from native English speakers. I still like the idea of verbs being used for imperative functions (e.g. print, output) and nouns/adjectives for pure functions (e.g. length), but as you point out, we have many exceptions to this idea already.

Related point: naming a function escape_utf8 (verb) would be a bad idea because it sounds like it escapes unicode characters, while escaped_utf8makes it more clear that utf8 is a modifier, not the object of the escape verb (or so I think).

I agree with @xavierleroy in liking the pure/functional nature of escaped (although as he frequently improves and corrects my written English in documentation, I don't put much weight on it being my first language!)

To me, Bytes.escape : bytes -> unit mutates its argument, where Bytes.escaped : bytes -> bytes leaves the argument unmodified.

In terms of the inconsistency, having String.escaped and String.uppercase is one thing, having String.escape and Scanf.unescaped is another.

I always thought of uppercase as an adjective here, not a verb (like escaped).

stdlib/string.mli

xavierleroy · 2021-08-24T17:44:30Z

A number of comments were made, and none was taken into account, as far as I can see. @dmbaturin : do you intend to work on this PR further, or should it be closed?

dmbaturin · 2021-08-24T17:50:50Z

@xavierleroy Sorry, I got a bit too carried away with working on other things! I'll produce a new implementation for your review by next Monday.

…to escaped_*

dmbaturin · 2021-08-30T08:50:56Z

I'm aware that some tests are broken, I'll fetch the current trunk and look into it later today.

Octachron · 2021-08-30T09:37:58Z

You have not updated the computation of the length of the escaped string in function of escape_unicode, the current code cannot work (you can run the tests locally before pushing).

Octachron · 2021-08-30T09:42:21Z

stdlib/bytes.mli

+    @since 4.14.0 *)
+
+(* Intentionally undocumented. *)
+val _escaped : ?escape_unicode:bool -> bytes -> bytes


The name should not be _name, prefixing by _ generally means unused functions. Also, you should move the function after (**/**) (at the same place as unsafe_get). This will remove the function from the documentation.

Also escape_unicode is a very misleading label name, use utf_8.

It was a part of the plan that Xavier Leroy didn't expressly disapprove of, so I assumed it was an acceptable thing to do.

What should I name it if it's not acceptable? It should somehow be marked "do not use".
Duplication may be the cleanest even if not prettiest way to handle the issue.

I would propose to either add an optional argument (for_utf_8?) to escaped , or add a single function escaped_for: [ Utf_8 | Ascii] -> string -> string to Bytes and String that can be exposed directly.

@Octachron I expressed that idea earlier. I'm happy to change the implementation if we agree on it.
It will be compatible with all existing code if the new parameter is set to false, and indeed will avoid the issues with undocumented functions and naming disputes.

Change the type of Bytes.escaped to escaped : ?for_utf_8:bool -> bytes -> bytes.

Use it from the String module to provide String.escaped : ?for_utf_8:bool -> bytes -> bytes.

Remove escaped_ascii and escaped_unicode from the implementation.

If no one objects, I'd like to also add an optional argument for excluding specific characters from escaping, to make it useful in different settings.

Personally I'm still totally unconvinced by the usefulness of this PR. An this whether you add arguments for excluding specific characters or not: the escape syntax will remain the one of OCaml (and "%S" will likely correspond to String.escape forever).

I think this API should rather be left in peace as it is now.

dbuenzli

Since these functions are new. Could we please have these escape use hexadecimal escapes ?

When you work with UTF-8 text (which is to say almost all of the time nowadays) decimal escapes are just a chore.

dbuenzli · 2021-08-30T10:26:13Z

stdlib/bytes.mli

+
+    @since 4.14.0 *)
+
+val escaped_utf8 : bytes -> bytes


This should be utf_8 for consistency with Buffer.add_uchar_utf_8.

Personally I don't think String.escaped_{ascii,utf_8} reads well. From the name I would rather expect them to return a string in which ascii or UTF-8 is escaped. Maybe String.escaped_for_{ascii,utf_8} ?

I fear we have to choose what exactly we make it consistent with, unless we can agree to revamp the other functions. I was aiming for consistency with the rest of the Bytes/String module where we have functions like capitalize_ascii and lowercase_ascii and where _ascii essentially means "assume all characters are ASCII".

like capitalize_ascii and lowercase_ascii and where _ascii essentially means "assume all characters are ASCII".

Personally, that's not the way I read these functions, I read these as: "capitalize ASCII characters".

I don't think there's any need to revamp other functions or make these ones read wrong in the name of consistency. String.escaped_ascii really reads to me as a "string with ASCII characters escaped" which is really not what it does.

Myself, I definitely agree with the idea of hex escapes, and if I were to design the interface, I'd also add an option to exclude certain characters from escaping (if you are formatting strings for something other than OCaml, that becomes relevant). However, I don't know how well it aligns with maintainers' plans.

Anyway, I fixed that sleep deprivation induced blunder that broke the tests, and it may be a good idea to collect all questions about the interface and proposed options now to make it easier to come to a consensus.

If you ask me, I don't find the function added by this PR very useful and pertinent.

I rarely need to escape my bytes according to OCaml syntax. Often it's just quotes or double quotes.

I would much prefer a little infrastructure to easily devise byte escaping functions. For example something like that.

From my point of view, the escaped_for_utf_8 (or utf_8_escaped?) function is just an updated and improved version of the escaped function which was already used in the toplevel. It is indeed quite specific to the OCaml compiler itself.

dbuenzli · 2021-08-30T10:30:14Z

stdlib/bytes.mli

+    @since 4.14.0 *)
+
+(* Intentionally undocumented. *)
+val _escaped : ?escape_unicode:bool -> bytes -> bytes


Also escape_unicode is a very misleading label name, use utf_8.

xavierleroy · 2021-09-01T17:28:25Z

That's quite a discussion we're having on this PR! @dmbaturin : please don't give up, I think this "escape special characters but not UTF-8 encodings" functionality is useful, if only because the OCaml toplevel needs it. Also, escaping using OCaml lexical conventions is what the existing "escape" functions do, and we'll keep doing this even if @dbuenzli finds it useless.

I feel that @Octachron has a precise idea of what the final API should look like. Why don't you push your proposal straight to this PR, using your maintainer privileges, so that everyone can see the API?

dbuenzli · 2021-09-02T13:04:24Z

I think this "escape special characters but not UTF-8 encodings" functionality is useful, if only because the OCaml toplevel needs it.

Just for the record this is already the case since #1231. There's even an environment variable that controls the behaviour for backward compatibility: OCAMLTOP_UTF_8.

Also, escaping using OCaml lexical conventions is what the existing "escape" functions do, and we'll keep doing this even if @dbuenzli finds it useless.

I guess you can always find a use for them. But these kind of functions in the stdlib that are tied to the conventions of the OCaml language tend to lead to bugs in user programs (using int_of_string to parse, say a decimal number, is another example).

They seem to do the right thing for the task at hand and they do but only most of the time; since programmers are lazy (or in a hurry) they end up using it anyway.

Octachron · 2021-09-02T14:47:28Z

@xavierleroy : if we are betting that utf_8 will not get displaced in the next decades (which is a bet that I am willing to take), the optional argument option val escaped: ?for_utf_8:bool -> string -> stringand its Byte sibling seems like the simplest option. And I would propose to keep it simple without any other parameters because as @dbuenzli said this is a specific function that exposes OCaml's vision of what is an escaped string.

xavierleroy · 2021-10-04T13:05:28Z

Apologies for leaving this open.

val escaped: ?for_utf_8:bool -> string -> string and its Byte sibling seems like the simplest option.

I'm fine with that.

dmbaturin · 2021-10-04T13:07:55Z

@xavierleroy @Octachron Ok, I'll change the proposed API to val escaped: ?for_utf_8:bool -> string -> string then.

dmbaturin force-pushed the string-escape branch 4 times, most recently from 7b11fd8 to b77aa74 Compare June 15, 2021 15:21

dmbaturin force-pushed the string-escape branch from b77aa74 to e68832d Compare June 15, 2021 15:36

dra27 reviewed Jun 15, 2021

View reviewed changes

stdlib/stringLabels.mli Outdated Show resolved Hide resolved

dra27 reviewed Jun 15, 2021

View reviewed changes

stdlib/stringLabels.mli Outdated Show resolved Hide resolved

Fix String.escape_ascii name in the documentation.

7063573

dmbaturin force-pushed the string-escape branch from c360d2a to 7063573 Compare June 16, 2021 01:13

xavierleroy reviewed Jun 18, 2021

View reviewed changes

Remove the deprecation flag.

b67015f

dmbaturin force-pushed the string-escape branch 2 times, most recently from 4f7683b to b679dc6 Compare August 30, 2021 05:02

Move the implemenation of escaping to Bytes and rename the functions …

494a0bf

…to escaped_*

dmbaturin force-pushed the string-escape branch from b679dc6 to 494a0bf Compare August 30, 2021 05:03

Octachron reviewed Aug 30, 2021

View reviewed changes

dbuenzli reviewed Aug 30, 2021

View reviewed changes

Fix length calculation in Bytes.escaped

e87981e

Expose Oprint.escape_string as String.escape #10457

Are you sure you want to change the base?

Expose Oprint.escape_string as String.escape #10457

Conversation

dmbaturin commented Jun 15, 2021

dra27 commented Jun 15, 2021

dmbaturin commented Jun 16, 2021

dmbaturin commented Jun 16, 2021 • edited

xavierleroy left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Aug 24, 2021

dmbaturin commented Aug 24, 2021

dmbaturin commented Aug 30, 2021

Octachron commented Aug 30, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dbuenzli left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dbuenzli Aug 30, 2021 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

xavierleroy commented Sep 1, 2021

dbuenzli commented Sep 2, 2021

Octachron commented Sep 2, 2021

xavierleroy commented Oct 4, 2021

dmbaturin commented Oct 4, 2021

dmbaturin commented Jun 16, 2021 •

edited

dbuenzli Aug 30, 2021 •

edited