Support unescaped Uchar.t literals #12696

Open
favonia opened this issue Oct 28, 2023 · 5 comments

@favonia
Contributor

favonia commented Oct 28, 2023

Why

In a nutshell, I want to write code like this:

match c with
| '₀' .. '₉' -> ... (* Unicode numeric subscripts *)
| ...

And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:

if '₀' <= c && c <= '₉'
then ... (* Unicode numeric subscripts *)
else ...

The most critical component seems to be supporting literals for Uchar.t ('₀' and '₉'), hence this feature request. I intended to summarize as many technical points as possible, but given the complexity of this issue, the summary here is inevitably incomplete.

Possible syntax choices and considerations

I have collected several “reasonable” notations from existing discussions:

escaped literals | unescaped literals | consistency between escaped and unescaped literals | consistency with string literals | backward compat | overloading-free? | succinct? | possibly confusing cases
'\u{1F60F}' '😀' ⭐⭐⭐
''\u{1F60F}'' ''😀'' ✅❓ ⭐⭐
'+\u{1F60F}' '+😀' ⭐⭐ '++'
'\u{1F60F}' '\u{😀}' ✅❓ ⚠️ '\u{}}', "\u{0}"
'\u{1F60F}'u '😀'u ⭐⭐
u'\u{1F60F}' u'😀' ⭐⭐

The following is a list of major factors that one might want to consider:

Non-graphic characters and marks

As a first cut, I propose only accepting a Unicode character whose general category is L (letter), N (number), P (punctuation), S (symbol), or Zs (space separator). This is what is called a base character: essentially all graphic characters except M (mark). The main reason to exclude marks is that they could potentially combine with the delimiters. One competing proposal could be allowing an extra ZWNJ (U+200C) to prevent the combination, but I am concerned that this would complicate the proposal a bit too much. (EDIT: The issues about marks can be revisited in the future!)
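
The category filter described above could be sketched as follows. This is only an illustration: the OCaml standard library does not expose Unicode general categories, so the `general_category` type below is a hypothetical stand-in for whatever lookup the compiler would actually use, and `is_base_category` is an invented name.

```ocaml
(* Hypothetical, coarse view of Unicode general categories.
   A real implementation would obtain this from Unicode data tables. *)
type general_category =
  | Letter              (* L *)
  | Mark                (* M *)
  | Number              (* N *)
  | Punctuation         (* P *)
  | Symbol              (* S *)
  | Space_separator     (* Zs *)
  | Line_separator      (* Zl *)
  | Paragraph_separator (* Zp *)
  | Other               (* C: control, format, surrogate, private use, unassigned *)

(* Accept only the categories listed in the proposal: L, N, P, S and Zs. *)
let is_base_category : general_category -> bool = function
  | Letter | Number | Punctuation | Symbol | Space_separator -> true
  | Mark | Line_separator | Paragraph_separator | Other -> false
```

Writing the match without a wildcard means the compiler flags any category added later, which forces an explicit decision about whether it is allowed.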

To improve UX, maybe the compiler can suggest fixes for some of the following cases:

  1. A Unicode character is successfully decoded but does not fall into the allowed categories (e.g., a mark, a character in private use area, etc.). The compiler may suggest using its escaped form.
  2. More than one Unicode character is successfully decoded.
    1. The compiler may show the beginning (2-3?) Unicode characters, possibly using the escaped forms to reduce confusion.
    2. If the sequence is not in NFC and there’s a chance that its NFC may consist of only one Unicode character in the allowed categories, maybe the compiler can suggest using NFC. For example, if the programmer is writing e and ◌́, the compiler can point out that using NFC may solve the problem.
    3. If the sequence is not in NFC and its NFC will consist of only one Unicode character in the allowed categories, the compiler may suggest using NFC. For example, maybe the programmer is writing e and ◌́ and the compiler can suggest writing é.

Note: 2ii and 2iii differ in that checking whether a sequence might normalize to a single character may be easier to implement than the full normalization algorithm.
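
The decoding step behind cases 1 and 2 could look roughly like the sketch below, assuming OCaml 4.14 or later for `String.get_utf_8_uchar`. The type `literal_body` and the functions `decode_utf_8`, `classify_body`, and `escaped` are hypothetical names chosen for illustration; NFC checks are left out because the standard library has no normalization support.

```ocaml
(* Result of inspecting the text found between the literal delimiters. *)
type literal_body =
  | Single of Uchar.t        (* exactly one scalar value *)
  | Several of Uchar.t list  (* case 2: more than one scalar value *)
  | Empty
  | Invalid_utf_8

(* Decode [s] as UTF-8, or return [None] on the first malformed sequence. *)
let decode_utf_8 (s : string) : Uchar.t list option =
  let rec go i acc =
    if i >= String.length s then Some (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d
      then go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
      else None
  in
  go 0 []

let classify_body (s : string) : literal_body =
  match decode_utf_8 s with
  | None -> Invalid_utf_8
  | Some [] -> Empty
  | Some [u] -> Single u
  | Some us -> Several us

(* Escaped spelling that a diagnostic could display, e.g. \u{0301}. *)
let escaped (u : Uchar.t) : string =
  Printf.sprintf "\\u{%04X}" (Uchar.to_int u)
```

For the `e` plus combining acute example, `classify_body "e\u{0301}"` yields `Several` with two scalar values, and `escaped` gives a spelling for the second one that does not visually merge with the surrounding text.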

Range patterns

It would be great if range patterns such as 'a' .. 'z' also worked for Uchar.t, but I understand this requires range patterns for int (#8504). Maybe the Unicode characters will eventually motivate someone to implement them.
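
For comparison, here is a sketch of how such ranges can be matched today, without range patterns on Uchar.t or int: convert to int and use `when` guards. The type `char_class` and the function `classify` are hypothetical names used only for this example.

```ocaml
type char_class = Subscript_digit | Ascii_lowercase | Other_char

let classify (c : Uchar.t) : char_class =
  match Uchar.to_int c with
  (* Unicode numeric subscripts, U+2080 .. U+2089 *)
  | n when 0x2080 <= n && n <= 0x2089 -> Subscript_digit
  (* 'a' .. 'z', spelled via Char.code because the scrutinee is an int *)
  | n when Char.code 'a' <= n && n <= Char.code 'z' -> Ascii_lowercase
  | _ -> Other_char
```

The guards work, but the code points have to be written numerically, which is the readability cost that motivates the literal syntax.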

Consistency between escape and unescaped characters

I believe it would be rather confusing if escaped and unescaped Unicode characters used different notations, for example '\u{1F60F}' and u'😀'. The above table already rules out several options that fail this criterion.

Consistency with string literals

I also believe it is good to match the current \u{NNNN} syntax in strings. The unescaped characters arguably will further improve the compatibility with string literals. Note that I put ⚠️ for \u{😀} because we can change the syntax of strings to accept \u{😀}.

Backward compatibility

Some choices suggest adding a prefix or suffix u to the literal, which could break existing programs. While I doubt any reasonable program will have code like that, it is theoretically possible. (BTW, is it okay if we reserve all prefixes/suffixes and give a warning for such strange code?)

Syntax overloading and type inference

The first suggestion ('😀') involves type-directed disambiguation for character literals, with a strong bias towards the old char for characters within the ASCII range (such as 'a').

Printing Uchar.t

The resolution of this feature request should give clear guidance for printing Uchar.t (#11999, #12462). In my opinion, the main reason we don't know how to print Unicode characters is that we don't know how to write them.
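
As an illustration only, a printer could mirror whichever literal syntax is chosen. The sketch below assumes the first row of the table ('\u{1F60F}' and '😀'); `escaped_form` and `unescaped_form` are hypothetical names, and neither is an existing standard-library printer.

```ocaml
(* Escaped spelling, e.g. '\u{1F60F}'. *)
let escaped_form (u : Uchar.t) : string =
  Printf.sprintf "'\\u{%04X}'" (Uchar.to_int u)

(* Unescaped spelling: the UTF-8 encoding of [u] between single quotes. *)
let unescaped_form (u : Uchar.t) : string =
  let b = Buffer.create 6 in
  Buffer.add_char b '\'';
  Buffer.add_utf_8_uchar b u;
  Buffer.add_char b '\'';
  Buffer.contents b
```

A real printer would presumably pick the escaped form for characters outside the allowed categories, so that printed values can be read back as literals.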

ISO/IEC 8859-1

This feature request should probably be implemented only after OCaml has declared that UTF-8 is the only valid encoding for the source file, which means ISO/IEC 8859-1 will no longer be considered. However, I think it is still nice to make a plan now.
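
A small sketch of why the source encoding matters, assuming OCaml 4.14 or later for `String.is_valid_utf_8`: the same character é is a single byte in ISO/IEC 8859-1 and two bytes in UTF-8, and the single-byte form is not valid UTF-8. The binding names are invented for the example.

```ocaml
(* é as it would appear in an ISO/IEC 8859-1 source file: one byte, 0xE9. *)
let latin1_e_acute = "\xE9"

(* é as it would appear in a UTF-8 source file: two bytes, 0xC3 0xA9. *)
let utf8_e_acute = "\u{E9}"

let latin1_is_utf_8 = String.is_valid_utf_8 latin1_e_acute
let utf8_is_utf_8 = String.is_valid_utf_8 utf8_e_acute
```

A lexer that reads the bytes between the quotes therefore has to know which of the two encodings it is looking at before it can decide what the literal denotes.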

Suggested choices

I personally prefer adding '😀' or ''😀'', but would welcome any of the options in the table.

Related links

Changes

  • Added a point that characters belonging to **Mark (M)** can be revisited later.
  • Clarified the "beginning Unicode characters"
  • Typo: I meant unescaped characters can further improve the compatibility with string literals.
@nojb
Contributor

nojb commented Oct 28, 2023

> And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:
>
> if '₀' <= code && code <= '₉'
> then ... (* Unicode numeric subscripts *)
> else ...

Even with this proposal, this will not be possible since int and Uchar.t will not be compatible; you will still have to pass your literals to Uchar.to_int.

@favonia
Contributor Author

favonia commented Oct 28, 2023

> > And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:
> >
> > if '₀' <= code && code <= '₉'
> > then ... (* Unicode numeric subscripts *)
> > else ...
>
> Even with this proposal, this will not be possible since int and Uchar.t will not be compatible; you will still have to pass your literals to Uchar.to_int.

I'm a bit confused: Stdlib.(<=) is polymorphic (maybe for the wrong reasons). Why do they need to be int? However, I realized I should not write code but c instead. I am revising it now.
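
A sketch of the point about polymorphic comparison, using today's API: Stdlib.(<=) accepts two Uchar.t values directly, so only the literals are missing. The names `sub_zero`, `sub_nine`, and `is_subscript_digit` are made up for this example, and the sketch relies on polymorphic comparison of Uchar.t values following code point order.

```ocaml
(* Stand-ins for the proposed literals '₀' and '₉'. *)
let sub_zero : Uchar.t = Uchar.of_int 0x2080
let sub_nine : Uchar.t = Uchar.of_int 0x2089

(* Both operands of each (<=) are Uchar.t, so no Uchar.to_int is needed. *)
let is_subscript_digit (c : Uchar.t) : bool =
  sub_zero <= c && c <= sub_nine
```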

@nojb
Contributor

nojb commented Oct 28, 2023

> However, I realized I should not write code but c instead. I am revising it now.

Right, sorry, I assumed code was of type int...

@paurkedal

Good summary. I think the following rather severe issue with the fourth option ('\u{😀}') was overlooked on the forum: If the escaping is to be consistent with strings, then '\u{0}', ..., '\u{f}' need to be interpreted as escape sequences, which leaves no way to express the Unicode variants of '0', ..., '9', 'a', ..., 'f' literally.
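
The ambiguity can be made concrete with a sketch of one possible reading of the text between the braces, assuming OCaml 4.14 or later. `interpret_braces` is a hypothetical function, not part of any proposal text: it gives the existing hexadecimal escape priority and only falls back to a single unescaped character otherwise.

```ocaml
let is_hex_digit = function
  | '0' .. '9' | 'a' .. 'f' | 'A' .. 'F' -> true
  | _ -> false

(* Hypothetical interpretation of [body] in '\u{body}'. *)
let interpret_braces (body : string) : Uchar.t option =
  if body = "" then None
  else if String.for_all is_hex_digit body then
    (* Existing escape syntax wins: the body is a hexadecimal code point. *)
    match int_of_string_opt ("0x" ^ body) with
    | Some n when Uchar.is_valid n -> Some (Uchar.of_int n)
    | _ -> None
  else
    (* Otherwise accept exactly one UTF-8 encoded scalar value. *)
    let d = String.get_utf_8_uchar body 0 in
    if Uchar.utf_decode_is_valid d
       && Uchar.utf_decode_length d = String.length body
    then Some (Uchar.utf_decode_uchar d)
    else None
```

Under this reading, `interpret_braces "a"` is U+000A (line feed) rather than the letter a, so the sixteen single hex digits can never be written unescaped in this notation.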

@favonia
Contributor Author

favonia commented Nov 22, 2023

@paurkedal Thanks. The summary was updated with a new confusing case in that design.


3 participants