Support unescaped Uchar.t literals #12696
Comments
Even with this proposal, this will not be possible since |
I'm a bit confused--- |
Right, sorry, I assumed |
Good summary. I think the following rather severe issue with the fourth option ( |
@paurkedal Thanks. The summary was updated with a new confusing case in that design. |
Why
In a nutshell, I want to write code like this:
And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:
The most critical component seems to be supporting literals for Uchar.t ('₀' and '₉'), hence this feature request. I intended to summarize as many technical points as possible, but given the complexity of this issue, the summary here is inevitably incomplete.
Possible syntax choices and considerations
I have collected several “reasonable” notations from existing discussions:
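1. '\u{1F60F}' (escaped) and '😀' (unescaped)
2. ''\u{1F60F}'' (escaped) and ''😀'' (unescaped)
3. '+\u{1F60F}' (escaped) and '+😀' (unescaped); also listed: '++'
4. '\u{1F60F}' (escaped) and '\u{😀}' (unescaped); also listed: '\u{}}', "\u{0}"
5. '\u{1F60F}'u (escaped) and '😀'u (unescaped)
6. u'\u{1F60F}' (escaped) and u'😀' (unescaped)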
The following is a list of major factors that one might want to consider:
Non-graphic characters and marks
As a first cut, I propose only accepting a Unicode character whose general category is L (letter), N (number), P (punctuation), S (symbol), or Zs (space separator). This is what is called a base character, essentially all graphic characters except M (mark). The main reason to exclude marks is that they could potentially be combined with the delimiters. One competing proposal could be allowing an extra ZWNJ (U+200C) to prevent the combination, but I am concerned that this would complicate the proposal a bit too much. (EDIT: The issues about marks can be revisited in the future!)
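The acceptance rule above can be written down as a small predicate. This is only a sketch: the type `general_category` and the function `allowed_unescaped` are hypothetical names introduced here, and mapping an actual `Uchar.t` to its general category is assumed to come from a Unicode database, which the standard library does not provide.

```ocaml
(* The 30 Unicode general categories, named by their two-letter codes. *)
type general_category =
  | Lu | Ll | Lt | Lm | Lo              (* L: letters *)
  | Mn | Mc | Me                        (* M: marks *)
  | Nd | Nl | No                        (* N: numbers *)
  | Pc | Pd | Ps | Pe | Pi | Pf | Po    (* P: punctuation *)
  | Sm | Sc | Sk | So                   (* S: symbols *)
  | Zs | Zl | Zp                        (* Z: separators *)
  | Cc | Cf | Cs | Co | Cn              (* C: other *)

(* Hypothetical predicate: true exactly for the categories the proposal
   would accept unescaped, namely L, N, P, S, and Zs. *)
let allowed_unescaped = function
  | Lu | Ll | Lt | Lm | Lo
  | Nd | Nl | No
  | Pc | Pd | Ps | Pe | Pi | Pf | Po
  | Sm | Sc | Sk | So
  | Zs -> true
  | Mn | Mc | Me | Zl | Zp | Cc | Cf | Cs | Co | Cn -> false
```

Under this rule '₀' (U+2080, category No) would be accepted, while a combining acute accent (U+0301, category Mn) and ZWNJ itself (U+200C, category Cf) would not.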
To improve UX, maybe the compiler can suggest fixes for some of the following cases:
e and ◌́, the compiler can point out that using NFC may solve the problem.
e and ◌́ and the compiler can suggest writing é.
Note: the difference between 2ii and 2iii is that maybe the checking algorithm is easier to implement than the full normalization algorithm.
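A check of the kind a compiler could run before making such suggestions might look like the sketch below. It is an illustration only, not compiler code: `scalar_values`, `is_combining_diacritic`, and `diagnose` are hypothetical names, the mark test covers only the Combining Diacritical Marks block (U+0300..U+036F) rather than the whole M category, and it assumes OCaml 4.14 or later for `String.get_utf_8_uchar`.

```ocaml
(* Decode a UTF-8 string into its Unicode scalar values.
   Malformed bytes decode to U+FFFD and still advance the index. *)
let scalar_values (s : string) : Uchar.t list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
  in
  go 0 []

(* Rough stand-in for a general-category lookup: every character in the
   Combining Diacritical Marks block is a nonspacing mark (Mn). *)
let is_combining_diacritic (u : Uchar.t) : bool =
  let n = Uchar.to_int u in
  0x0300 <= n && n <= 0x036F

type diagnosis =
  | Single of Uchar.t   (* one scalar value that is not such a mark *)
  | Base_then_mark      (* base plus combining mark: NFC may help *)
  | Other

(* Classify the text found between the quotes of a would-be literal. *)
let diagnose (contents : string) : diagnosis =
  match scalar_values contents with
  | [u] when not (is_combining_diacritic u) -> Single u
  | [b; m] when not (is_combining_diacritic b) && is_combining_diacritic m ->
      Base_then_mark
  | _ -> Other
```

With these definitions, `diagnose "\u{E9}"` yields `Single` (the precomposed é), while `diagnose "e\u{301}"` yields `Base_then_mark`, the case where pointing the user to NFC would make sense. Deciding whether NFC actually produces a single character needs normalization data that this sketch does not include.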
Range patterns
It would be great if range patterns such as 'a' .. 'z' also worked for Uchar.t, but I understand this requires range patterns for int (#8504). Maybe the Unicode characters will eventually motivate someone to implement them.
Consistency between escaped and unescaped characters
I believe it would be rather confusing if escaped and unescaped Unicode characters used different notations, for example '\u{1F60F}' and u'😀'. The above table already rules out several options that fail this criterion.
Consistency with string literals
I also believe it is good to match the current \u{NNNN} syntax in string. The unescaped characters arguably will further improve the compatibility with string literals. Note that I put ⚠️ for \u{😀} because we can change the syntax of string to accept \u{😀}.
Backward compatibility
Some choices suggest adding a prefix or suffix u to the literal, which could break existing programs. While I doubt any reasonable program will have code like that, it is theoretically possible. (BTW, is it okay if we reserve all prefixes/suffixes and give a warning for such strange code?)
Syntax overloading and type inference
The first suggestion ('😀') involves type-directed disambiguation for character literals, with a strong bias towards the old char for characters within the ASCII range (such as 'a').
Printing Uchar.t
The resolution of this feature request should give clear guidance for printing Uchar.t (#11999, #12462). In my opinion, the main reason we don't know how to print Unicode characters is that we don't know how to write them.
ISO/IEC 8859-1
This feature request should probably be implemented only after OCaml has declared that UTF-8 is the only valid encoding for the source file, which means ISO/IEC 8859-1 will no longer be considered. However, I think it is still nice to make a plan now.
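To illustrate why the source encoding has to be settled first, here is a small sketch (assuming OCaml 4.14 or later for `String.is_valid_utf_8`; the binding names are made up for this example). The same character é is one byte in ISO/IEC 8859-1 and two bytes in UTF-8, and the one-byte form is not valid UTF-8, so an unescaped literal in a Latin-1 source file could not be read as a scalar value without knowing the encoding.

```ocaml
(* Byte 0xE9 is the letter e with acute accent in ISO/IEC 8859-1. *)
let latin1_e_acute = "\xE9"

(* The same character as UTF-8 is the two bytes 0xC3 0xA9. *)
let utf8_e_acute = "\u{E9}"

(* Only the second string is well-formed UTF-8. *)
let latin1_is_utf8 = String.is_valid_utf_8 latin1_e_acute  (* false *)
let utf8_is_utf8 = String.is_valid_utf_8 utf8_e_acute      (* true *)
```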
Suggested choices
I personally prefer adding '😀' or ''😀'', but would welcome any of the options in the table.
Related links
Uchar.t in toplevel #12462
Changes
**Mark (M)**
can be revisited later.