Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Expose Oprint.escape_string as String.escape #10457

Open
wants to merge 5 commits into
base: trunk
Choose a base branch
from

Conversation

dmbaturin
Copy link
Contributor

The motivation for is that the original String.escaped is overly aggressive and escapes everything except the printable ASCII range, which is a major inconvenience in a Unicode world.

Add a String.escape_ascii alias for the old String.escaped to simplify its eventual deprecation, in line with other ASCII-only functions (lowercase_ascii etc.).

Adding a deprecation warning to String.escaped ([@@ocaml.deprecated "Use String.escape_ascii instead."]) breaks the stringLabels.ml module when building with warning as an error option, so I only added a deprecation tag to the documentation comment.

@dmbaturin dmbaturin force-pushed the string-escape branch 4 times, most recently from 7b11fd8 to b77aa74 Compare June 15, 2021 15:21
The motivation for is that the original String.escaped
is overly aggressive and escapes everything except the printable ASCII range,
which is a major invoncenience in a Unicode world.

Add a String.escape_ascii alias for the old String.escaped
to simplify its eventual deprecation.
stdlib/stringLabels.mli Outdated Show resolved Hide resolved
stdlib/stringLabels.mli Outdated Show resolved Hide resolved
@dra27
Copy link
Member

dra27 commented Jun 15, 2021

I'm not completely sold on the need to rename String.escaped (mainly because I don't relish another 6 years of seeing the deprecation warnings in builds...) - there's nothing wrong with it, per se? Can't the additional escaping function be added with a new name and String.escaped left alone?

@dmbaturin
Copy link
Contributor Author

Can't the additional escaping function be added with a new name and String.escaped left alone?

I'm also unsure about that. If they were named String.escape (Unicode) and String.escape_ascii from the beginning, I think that would have been a good solution. However, String.escaped isn't going away any soon, so there'are going to be two different functions with a single-letter name difference.

Arguments for renaming String.escape_ascii:

  • It's probably the best name for what the old String.escaped does.
  • There are lots of _ascii renames and deprecations, so when the time comes, it will be easier to remove all deprecated functions at once.

Arguments against the renaming:

  • Deprecation warnings indeed may last a long time, possibly forever.

I don't have a strong opinion here. If you have ideas for a better name for String.escape, please share.

@dmbaturin
Copy link
Contributor Author

dmbaturin commented Jun 16, 2021

Anecdata: from 284 archives in my OPAM download cache, 20 unique packages use String.escape, and I think many of them (like omd) can benefit from updating to the less aggressive version.

Copy link
Contributor

@xavierleroy xavierleroy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fully support having functions that escape special characters but not UTF8 characters in strings and in byte arrays in the standard library.

I disagree with the proposed implementation and with the proposed interface.

stdlib/string.ml Outdated
Comment on lines 128 to 169
let escape s =
(* Escape only C0 control characters (bytes <= 0x1F), DEL(0x7F), '\\'
and '"' *)
let n = ref 0 in
for i = 0 to length s - 1 do
n := !n +
(match unsafe_get s i with
| '\"' | '\\' | '\n' | '\t' | '\r' | '\b' -> 2
| '\x00' .. '\x1F'
| '\x7F' -> 4
| _ -> 1)
done;
if !n = length s then s else begin
let s' = B.create !n in
n := 0;
for i = 0 to length s - 1 do
begin match unsafe_get s i with
| ('\"' | '\\') as c ->
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n c
| '\n' ->
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'n'
| '\t' ->
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 't'
| '\r' ->
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'r'
| '\b' ->
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'b'
| '\x00' .. '\x1F' | '\x7F' as c ->
let a = Char.code c in
B.unsafe_set s' !n '\\';
incr n;
B.unsafe_set s' !n (Char.chr (48 + a / 100));
incr n;
B.unsafe_set s' !n (Char.chr (48 + (a / 10) mod 10));
incr n;
B.unsafe_set s' !n (Char.chr (48 + a mod 10));
| c -> B.unsafe_set s' !n c
end;
incr n
done;
B.to_string s'
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not share an implementation with String.escaped and Bytes.escaped? It's the same loops, all you need is a Boolean to say whether '\x80'...'\xFF' needs escaping or not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree I guess. That function was duplicated in the toplevel implementation, I simply moved it to the standard library.

Do I get your plan right?

  • Add a new function like val _escaped : bool → string → string with a boolean flag to indicate whether to escape unicode or not, but don't expose it in the module interface.
  • Add escaped_ascii and escaped_utf8 wrappers for it?
  • Alias escaped to escape_ascii.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, except that you also need to update the Bytes module... So, the new bool -> string -> string function probably goes in Bytes, and is exported but not documented, so that it can be used from String as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you think of adding an optional argument ?(escape_unicode=true) to String.escaped and making the new escaped_ascii and escaped_utf8 functions aliases for it with different value of that argument?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like optional arguments, generally speaking. But here, with an optional argument on escaped, it would be strange to provide escaped_ascii and escaped_utf8 in addition to escaped. Also, putting a deprecation warning on escaped (regardless of optional argument) would make little sense.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I imagine that we could have an optional argument with the parameter escape_unicode set as true by default at first, and then switch to false instead of deprecating the escaped function few version later.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I were to decide, I'd go for an optional argument.

Also, tangentially related: there's code duplication with Char.escaped. Is there a reason for it, or we can make Char.escaped a building block for this?

Comment on lines 267 to 295
val escape_ascii : string -> string
(** [escape_ascii s] is [s] with all characters outside the printable US-ASCII
range represented by escape sequences.

All characters outside the US-ASCII printable range \[0x20;0x7E\] are
escaped, as well as backslash (0x2F) and double-quote (0x22).

The function {!Scanf.unescaped} is a left inverse of [escaped],
i.e. [Scanf.unescaped (escaped s) = s] for any string [s] (unless
[escaped s] fails).

@raise Invalid_argument if the result is longer than
{!Sys.max_string_length} bytes.

@since 4.14.0 *)

val escape : string -> string
(** [escape s] is [s] with special characters replaced with escape sequences,
following the lexical conventions of OCaml.

The following characters are escaped:
double quote (['\"'], newline (['\n']), carriage return (['\r']),
tab (['\t\]), backspace (['\b']), DEL (0x7F),
and everything in the ['\x00' .. '\x1F'] range.

@raise Invalid_argument if the result is longer than
{!Sys.max_string_length} bytes.

@since 4.14.0 *)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer escaped to escape, but opinions from native English speakers are most welcome.

Also, I'd rather have two new functions escaped_ascii and escaped_utf8, with escaped defaulting to escaped_ascii for the time being, then eventually deprecated, then eventually defined as escaped_utf8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current naming tends to favor adjectives/past participles, but not consistently. For example, String.trim, String.capitalize, String.uncapitalize are verbs.

This likely needs a broader and more general discussion. I don't have a strong opinion, I'll go with maintainers decision.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm curious to hear from native English speakers. I still like the idea of verbs being used for imperative functions (e.g. print, output) and nouns/adjectives for pure functions (e.g. length), but as you point out, we have many exceptions to this idea already.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Related point: naming a function escape_utf8 (verb) would be a bad idea because it sounds like it escapes unicode characters, while escaped_utf8makes it more clear that utf8 is a modifier, not the object of the escape verb (or so I think).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @xavierleroy in liking the pure/functional nature of escaped (although as he frequently improves and corrects my written English in documentation, I don't put much weight on it being my first language!)

To me, Bytes.escape : bytes -> unit mutates its argument, where Bytes.escaped : bytes -> bytes leaves the argument unmodified.

In terms of the inconsistency, having String.escaped and String.uppercase is one thing, having String.escape and Scanf.unescaped is another.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I always thought of uppercase as an adjective here, not a verb (like escaped).

stdlib/string.mli Outdated Show resolved Hide resolved
@xavierleroy
Copy link
Contributor

A number of comments were made, and none was taken into account, as far as I can see. @dmbaturin : do you intend to work on this PR further, or should it be closed?

@dmbaturin
Copy link
Contributor Author

@xavierleroy Sorry, I got a bit too carried away with working on other things! I'll produce a new implementation for your review by next Monday.

@dmbaturin dmbaturin force-pushed the string-escape branch 2 times, most recently from 4f7683b to b679dc6 Compare August 30, 2021 05:02
@dmbaturin
Copy link
Contributor Author

I'm aware that some tests are broken, I'll fetch the current trunk and look into it later today.

@Octachron
Copy link
Member

You have not updated the computation of the length of the escaped string in function of escape_unicode, the current code cannot work (you can run the tests locally before pushing).

@since 4.14.0 *)

(* Intentionally undocumented. *)
val _escaped : ?escape_unicode:bool -> bytes -> bytes
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The name should not be _name, prefixing by _ generally means unused functions. Also, you should move the function after (**/**) (at the same place as unsafe_get). This will remove the function from the documentation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also escape_unicode is a very misleading label name, use utf_8.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was a part of the plan that Xavier Leroy didn't expressly disapprove of, so I assumed it was an acceptable thing to do.

What should I name it if it's not acceptable? It should somehow be marked "do not use".
Duplication may be the cleanest even if not prettiest way to handle the issue.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would propose to either add an optional argument (for_utf_8?) to escaped , or add a single function escaped_for: [ Utf_8 | Ascii] -> string -> string to Bytes and String that can be exposed directly.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Octachron I expressed that idea earlier. I'm happy to change the implementation if we agree on it.
It will be compatible with all existing code if the new parameter is set to false, and indeed will avoid the issues with undocumented functions and naming disputes.

  • Change the type of Bytes.escaped to escaped : ?for_utf_8:bool -> bytes -> bytes.
  • Use it from the String module to provide String.escaped : ?for_utf_8:bool -> bytes -> bytes.
  • Remove escaped_ascii and escaped_unicode from the implementation.

If no one objects, I'd like to also add an optional argument for excluding specific characters from escaping, to make it useful in different settings.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I'm still totally unconvinced by the usefulness of this PR. An this whether you add arguments for excluding specific characters or not: the escape syntax will remain the one of OCaml (and "%S" will likely correspond to String.escape forever).

I think this API should rather be left in peace as it is now.

Copy link
Contributor

@dbuenzli dbuenzli left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since these functions are new. Could we please have these escape use hexadecimal escapes ?

When you work with UTF-8 text (which is to say almost all of the time nowadays) decimal escapes are just a chore.


@since 4.14.0 *)

val escaped_utf8 : bytes -> bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be utf_8 for consistency with Buffer.add_uchar_utf_8.

Personally I don't think String.escaped_{ascii,utf_8} reads well. From the name I would rather expect them to return a string in which ascii or UTF-8 is escaped. Maybe String.escaped_for_{ascii,utf_8} ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I fear we have to choose what exactly we make it consistent with, unless we can agree to revamp the other functions. I was aiming for consistency with the rest of the Bytes/String module where we have functions like capitalize_ascii and lowercase_ascii and where _ascii essentially means "assume all characters are ASCII".

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like capitalize_ascii and lowercase_ascii and where _ascii essentially means "assume all characters are ASCII".

Personally, that's not the way I read these functions, I read these as: "capitalize ASCII characters".

I don't think there's any need to revamp other functions or make these ones read wrong in the name of consistency. String.escaped_ascii really reads to me as a "string with ASCII characters escaped" which is really not what it does.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Myself, I definitely agree with the idea of hex escapes, and if I were to design the interface, I'd also add an option to exclude certain characters from escaping (if you are formatting strings for something other than OCaml, that becomes relevant). However, I don't know how well it aligns with maintainers' plans.

Anyway, I fixed that sleep deprivation induced blunder that broke the tests, and it may be a good idea to collect all questions about the interface and proposed options now to make it easier to come to a consensus.

Copy link
Contributor

@dbuenzli dbuenzli Aug 30, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you ask me, I don't find the function added by this PR very useful and pertinent.

I rarely need to escape my bytes according to OCaml syntax. Often it's just quotes or double quotes.

I would much prefer a little infrastructure to easily devise byte escaping functions. For example something like that.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my point of view, the escaped_for_utf_8 (or utf_8_escaped?) function is just an updated and improved version of the escaped function which was already used in the toplevel. It is indeed quite specific to the OCaml compiler itself.

@since 4.14.0 *)

(* Intentionally undocumented. *)
val _escaped : ?escape_unicode:bool -> bytes -> bytes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also escape_unicode is a very misleading label name, use utf_8.

@xavierleroy
Copy link
Contributor

That's quite a discussion we're having on this PR! @dmbaturin : please don't give up, I think this "escape special characters but not UTF-8 encodings" functionality is useful, if only because the OCaml toplevel needs it. Also, escaping using OCaml lexical conventions is what the existing "escape" functions do, and we'll keep doing this even if @dbuenzli finds it useless.

I feel that @Octachron has a precise idea of what the final API should look like. Why don't you push your proposal straight to this PR, using your maintainer privileges, so that everyone can see the API?

@dbuenzli
Copy link
Contributor

dbuenzli commented Sep 2, 2021

I think this "escape special characters but not UTF-8 encodings" functionality is useful, if only because the OCaml toplevel needs it.

Just for the record this is already the case since #1231. There's even an environment variable that controls the behaviour for backward compatibility: OCAMLTOP_UTF_8.

Also, escaping using OCaml lexical conventions is what the existing "escape" functions do, and we'll keep doing this even if @dbuenzli finds it useless.

I guess you can always find a use for them. But these kind of functions in the stdlib that are tied to the conventions of the OCaml language tend to lead to bugs in user programs (using int_of_string to parse, say a decimal number, is another example).

They seem to do the right thing for the task at hand and they do but only most of the time; since programmers are lazy (or in a hurry) they end up using it anyway.

@Octachron
Copy link
Member

@xavierleroy : if we are betting that utf_8 will not get displaced in the next decades (which is a bet that I am willing to take), the optional argument option val escaped: ?for_utf_8:bool -> string -> stringand its Byte sibling seems like the simplest option. And I would propose to keep it simple without any other parameters because as @dbuenzli said this is a specific function that exposes OCaml's vision of what is an escaped string.

@xavierleroy
Copy link
Contributor

Apologies for leaving this open.

val escaped: ?for_utf_8:bool -> string -> string and its Byte sibling seems like the simplest option.

I'm fine with that.

@dmbaturin
Copy link
Contributor Author

@xavierleroy @Octachron Ok, I'll change the proposed API to val escaped: ?for_utf_8:bool -> string -> string then.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants