New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Expose Oprint.escape_string as String.escape #10457
base: trunk
Are you sure you want to change the base?
Conversation
7b11fd8
to
b77aa74
Compare
The motivation for is that the original String.escaped is overly aggressive and escapes everything except the printable ASCII range, which is a major invoncenience in a Unicode world. Add a String.escape_ascii alias for the old String.escaped to simplify its eventual deprecation.
I'm not completely sold on the need to rename |
I'm also unsure about that. If they were named Arguments for renaming
Arguments against the renaming:
I don't have a strong opinion here. If you have ideas for a better name for |
Anecdata: from 284 archives in my OPAM download cache, 20 unique packages use |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fully support having functions that escape special characters but not UTF8 characters in strings and in byte arrays in the standard library.
I disagree with the proposed implementation and with the proposed interface.
stdlib/string.ml
Outdated
let escape s = | ||
(* Escape only C0 control characters (bytes <= 0x1F), DEL(0x7F), '\\' | ||
and '"' *) | ||
let n = ref 0 in | ||
for i = 0 to length s - 1 do | ||
n := !n + | ||
(match unsafe_get s i with | ||
| '\"' | '\\' | '\n' | '\t' | '\r' | '\b' -> 2 | ||
| '\x00' .. '\x1F' | ||
| '\x7F' -> 4 | ||
| _ -> 1) | ||
done; | ||
if !n = length s then s else begin | ||
let s' = B.create !n in | ||
n := 0; | ||
for i = 0 to length s - 1 do | ||
begin match unsafe_get s i with | ||
| ('\"' | '\\') as c -> | ||
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n c | ||
| '\n' -> | ||
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'n' | ||
| '\t' -> | ||
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 't' | ||
| '\r' -> | ||
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'r' | ||
| '\b' -> | ||
B.unsafe_set s' !n '\\'; incr n; B.unsafe_set s' !n 'b' | ||
| '\x00' .. '\x1F' | '\x7F' as c -> | ||
let a = Char.code c in | ||
B.unsafe_set s' !n '\\'; | ||
incr n; | ||
B.unsafe_set s' !n (Char.chr (48 + a / 100)); | ||
incr n; | ||
B.unsafe_set s' !n (Char.chr (48 + (a / 10) mod 10)); | ||
incr n; | ||
B.unsafe_set s' !n (Char.chr (48 + a mod 10)); | ||
| c -> B.unsafe_set s' !n c | ||
end; | ||
incr n | ||
done; | ||
B.to_string s' | ||
end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not share an implementation with String.escaped
and Bytes.escaped
? It's the same loops, all you need is a Boolean to say whether '\x80'...'\xFF'
needs escaping or not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree I guess. That function was duplicated in the toplevel implementation, I simply moved it to the standard library.
Do I get your plan right?
- Add a new function like
val _escaped : bool → string → string
with a boolean flag to indicate whether to escape unicode or not, but don't expose it in the module interface. - Add
escaped_ascii
andescaped_utf8
wrappers for it? - Alias
escaped
toescape_ascii
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, except that you also need to update the Bytes
module... So, the new bool -> string -> string
function probably goes in Bytes
, and is exported but not documented, so that it can be used from String
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think of adding an optional argument ?(escape_unicode=true)
to String.escaped
and making the new escaped_ascii
and escaped_utf8
functions aliases for it with different value of that argument?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like optional arguments, generally speaking. But here, with an optional argument on escaped
, it would be strange to provide escaped_ascii
and escaped_utf8
in addition to escaped
. Also, putting a deprecation warning on escaped
(regardless of optional argument) would make little sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I imagine that we could have an optional argument with the parameter escape_unicode
set as true by default at first, and then switch to false
instead of deprecating the escaped
function few version later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I were to decide, I'd go for an optional argument.
Also, tangentially related: there's code duplication with Char.escaped
. Is there a reason for it, or we can make Char.escaped
a building block for this?
stdlib/string.mli
Outdated
val escape_ascii : string -> string | ||
(** [escape_ascii s] is [s] with all characters outside the printable US-ASCII | ||
range represented by escape sequences. | ||
|
||
All characters outside the US-ASCII printable range \[0x20;0x7E\] are | ||
escaped, as well as backslash (0x2F) and double-quote (0x22). | ||
|
||
The function {!Scanf.unescaped} is a left inverse of [escaped], | ||
i.e. [Scanf.unescaped (escaped s) = s] for any string [s] (unless | ||
[escaped s] fails). | ||
|
||
@raise Invalid_argument if the result is longer than | ||
{!Sys.max_string_length} bytes. | ||
|
||
@since 4.14.0 *) | ||
|
||
val escape : string -> string | ||
(** [escape s] is [s] with special characters replaced with escape sequences, | ||
following the lexical conventions of OCaml. | ||
|
||
The following characters are escaped: | ||
double quote (['\"'], newline (['\n']), carriage return (['\r']), | ||
tab (['\t\]), backspace (['\b']), DEL (0x7F), | ||
and everything in the ['\x00' .. '\x1F'] range. | ||
|
||
@raise Invalid_argument if the result is longer than | ||
{!Sys.max_string_length} bytes. | ||
|
||
@since 4.14.0 *) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer escaped
to escape
, but opinions from native English speakers are most welcome.
Also, I'd rather have two new functions escaped_ascii
and escaped_utf8
, with escaped
defaulting to escaped_ascii
for the time being, then eventually deprecated, then eventually defined as escaped_utf8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The current naming tends to favor adjectives/past participles, but not consistently. For example, String.trim
, String.capitalize
, String.uncapitalize
are verbs.
This likely needs a broader and more general discussion. I don't have a strong opinion, I'll go with maintainers decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious to hear from native English speakers. I still like the idea of verbs being used for imperative functions (e.g. print
, output
) and nouns/adjectives for pure functions (e.g. length
), but as you point out, we have many exceptions to this idea already.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Related point: naming a function escape_utf8
(verb) would be a bad idea because it sounds like it escapes unicode characters, while escaped_utf8
makes it more clear that utf8
is a modifier, not the object of the escape
verb (or so I think).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @xavierleroy in liking the pure/functional nature of escaped
(although as he frequently improves and corrects my written English in documentation, I don't put much weight on it being my first language!)
To me, Bytes.escape : bytes -> unit
mutates its argument, where Bytes.escaped : bytes -> bytes
leaves the argument unmodified.
In terms of the inconsistency, having String.escaped
and String.uppercase
is one thing, having String.escape
and Scanf.unescaped
is another.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I always thought of uppercase
as an adjective here, not a verb (like escaped
).
A number of comments were made, and none was taken into account, as far as I can see. @dmbaturin : do you intend to work on this PR further, or should it be closed? |
@xavierleroy Sorry, I got a bit too carried away with working on other things! I'll produce a new implementation for your review by next Monday. |
4f7683b
to
b679dc6
Compare
b679dc6
to
494a0bf
Compare
I'm aware that some tests are broken, I'll fetch the current trunk and look into it later today. |
You have not updated the computation of the length of the escaped string in function of |
@since 4.14.0 *) | ||
|
||
(* Intentionally undocumented. *) | ||
val _escaped : ?escape_unicode:bool -> bytes -> bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The name should not be _name
, prefixing by _
generally means unused functions. Also, you should move the function after (**/**)
(at the same place as unsafe_get
). This will remove the function from the documentation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also escape_unicode
is a very misleading label name, use utf_8
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It was a part of the plan that Xavier Leroy didn't expressly disapprove of, so I assumed it was an acceptable thing to do.
What should I name it if it's not acceptable? It should somehow be marked "do not use".
Duplication may be the cleanest even if not prettiest way to handle the issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would propose to either add an optional argument (for_utf_8
?) to escaped , or add a single function escaped_for: [
Utf_8 | Ascii] -> string -> string
to Bytes
and String
that can be exposed directly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Octachron I expressed that idea earlier. I'm happy to change the implementation if we agree on it.
It will be compatible with all existing code if the new parameter is set to false
, and indeed will avoid the issues with undocumented functions and naming disputes.
- Change the type of
Bytes.escaped
toescaped : ?for_utf_8:bool -> bytes -> bytes
. - Use it from the
String
module to provideString.escaped : ?for_utf_8:bool -> bytes -> bytes
. - Remove
escaped_ascii
andescaped_unicode
from the implementation.
If no one objects, I'd like to also add an optional argument for excluding specific characters from escaping, to make it useful in different settings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'm still totally unconvinced by the usefulness of this PR. An this whether you add arguments for excluding specific characters or not: the escape syntax will remain the one of OCaml (and "%S" will likely correspond to String.escape
forever).
I think this API should rather be left in peace as it is now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since these functions are new. Could we please have these escape use hexadecimal escapes ?
When you work with UTF-8 text (which is to say almost all of the time nowadays) decimal escapes are just a chore.
|
||
@since 4.14.0 *) | ||
|
||
val escaped_utf8 : bytes -> bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be utf_8
for consistency with Buffer.add_uchar_utf_8
.
Personally I don't think String.escaped_{ascii,utf_8}
reads well. From the name I would rather expect them to return a string in which ascii or UTF-8 is escaped. Maybe String.escaped_for_{ascii,utf_8}
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I fear we have to choose what exactly we make it consistent with, unless we can agree to revamp the other functions. I was aiming for consistency with the rest of the Bytes/String
module where we have functions like capitalize_ascii
and lowercase_ascii
and where _ascii
essentially means "assume all characters are ASCII".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like
capitalize_ascii
andlowercase_ascii
and where_ascii
essentially means "assume all characters are ASCII".
Personally, that's not the way I read these functions, I read these as: "capitalize ASCII characters".
I don't think there's any need to revamp other functions or make these ones read wrong in the name of consistency. String.escaped_ascii
really reads to me as a "string with ASCII characters escaped" which is really not what it does.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Myself, I definitely agree with the idea of hex escapes, and if I were to design the interface, I'd also add an option to exclude certain characters from escaping (if you are formatting strings for something other than OCaml, that becomes relevant). However, I don't know how well it aligns with maintainers' plans.
Anyway, I fixed that sleep deprivation induced blunder that broke the tests, and it may be a good idea to collect all questions about the interface and proposed options now to make it easier to come to a consensus.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you ask me, I don't find the function added by this PR very useful and pertinent.
I rarely need to escape my bytes according to OCaml syntax. Often it's just quotes or double quotes.
I would much prefer a little infrastructure to easily devise byte escaping functions. For example something like that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my point of view, the escaped_for_utf_8
(or utf_8_escaped
?) function is just an updated and improved version of the escaped
function which was already used in the toplevel. It is indeed quite specific to the OCaml compiler
itself.
@since 4.14.0 *) | ||
|
||
(* Intentionally undocumented. *) | ||
val _escaped : ?escape_unicode:bool -> bytes -> bytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also escape_unicode
is a very misleading label name, use utf_8
.
That's quite a discussion we're having on this PR! @dmbaturin : please don't give up, I think this "escape special characters but not UTF-8 encodings" functionality is useful, if only because the OCaml toplevel needs it. Also, escaping using OCaml lexical conventions is what the existing "escape" functions do, and we'll keep doing this even if @dbuenzli finds it useless. I feel that @Octachron has a precise idea of what the final API should look like. Why don't you push your proposal straight to this PR, using your maintainer privileges, so that everyone can see the API? |
Just for the record this is already the case since #1231. There's even an environment variable that controls the behaviour for backward compatibility:
I guess you can always find a use for them. But these kind of functions in the stdlib that are tied to the conventions of the OCaml language tend to lead to bugs in user programs (using They seem to do the right thing for the task at hand and they do but only most of the time; since programmers are lazy (or in a hurry) they end up using it anyway. |
@xavierleroy : if we are betting that utf_8 will not get displaced in the next decades (which is a bet that I am willing to take), the optional argument option |
Apologies for leaving this open.
I'm fine with that. |
@xavierleroy @Octachron Ok, I'll change the proposed API to |
The motivation for is that the original
String.escaped
is overly aggressive and escapes everything except the printable ASCII range, which is a major inconvenience in a Unicode world.Add a
String.escape_ascii
alias for the oldString.escaped
to simplify its eventual deprecation, in line with other ASCII-only functions (lowercase_ascii
etc.).Adding a deprecation warning to
String.escaped
([@@ocaml.deprecated "Use String.escape_ascii instead."]
) breaks thestringLabels.ml
module when building with warning as an error option, so I only added a deprecation tag to the documentation comment.