Support unescaped Uchar.t literals #12696

favonia · 2023-10-28T08:47:03Z

Why

In a nutshell, I want to write code like this:

```ocaml
match c with
| '₀' .. '₉' -> ... (* Unicode numeric subscripts *)
| ...
```

And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:

```ocaml
if '₀' <= c && c <= '₉'
then ... (* Unicode numeric subscripts *)
else ...
```

The most critical component seems to be supporting literals for Uchar.t ('₀' and '₉'), hence this feature request. I intended to summarize as many technical points as possible, but given the complexity of this issue, the summary here is inevitably incomplete.
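For comparison, here is a minimal sketch of how the same check can be written today, without `Uchar.t` literals. The names `subscript_zero`, `subscript_nine`, and `is_subscript_digit` are made up for illustration; the code points have to be spelled numerically, which is the readability problem this request is about.

```ocaml
(* Bounds of the subscript digits, given as code points because there is
   no literal syntax for them yet. *)
let subscript_zero = Uchar.of_int 0x2080 (* U+2080 SUBSCRIPT ZERO *)
let subscript_nine = Uchar.of_int 0x2089 (* U+2089 SUBSCRIPT NINE *)

(* True when c lies in the inclusive range U+2080 .. U+2089. *)
let is_subscript_digit (c : Uchar.t) : bool =
  Uchar.compare subscript_zero c <= 0 && Uchar.compare c subscript_nine <= 0
```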

Possible syntax choices and considerations

I have collected several “reasonable” notations from existing discussions:

| escaped literals | unescaped literals | consistency between escaped and unescaped literals | consistency with `string` literals | backward compat | overloading-free? | succinct? | possibly confusing cases |
| --- | --- | --- | --- | --- | --- | --- | --- |
| `'\u{1F60F}'` | `'😀'` | ✅ | ✅ | ✅ | ❌ | ⭐⭐⭐ | |
| `''\u{1F60F}''` | `''😀''` | ✅ | ✅ | ✅❓ | ✅ | ⭐⭐ | |
| `'+\u{1F60F}'` | `'+😀'` | ✅ | ❌ | ✅ | ✅ | ⭐⭐ | `'++'` |
| `'\u{1F60F}'` | `'\u{😀}'` | ✅❓ | ⚠️ | ✅ | ✅ | ⭐ | `'\u{}}'`, `"\u{0}"` |
| `'\u{1F60F}'u` | `'😀'u` | ✅ | ✅ | ❌ | ✅ | ⭐⭐ | |
| `u'\u{1F60F}'` | `u'😀'` | ✅ | ✅ | ❌ | ✅ | ⭐⭐ | |

The following is a list of major factors that one might want to consider:

Non-graphic characters and marks

As a first cut, I propose only accepting a Unicode character whose general category is L (letter), N (number), P (punctuation), S (symbol), or Zs (space separator). These are the so-called base characters: essentially all graphic characters except M (mark). The main reason to exclude marks is that they could potentially combine with the delimiters. One competing proposal could be allowing an extra ZWNJ (U+200C) to prevent the combination, but I am concerned that this would complicate the proposal a bit too much. (EDIT: The issues about marks can be revisited in the future!)
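The filter above can be sketched as a predicate over a coarse category type. This is only an illustration: `major_category` and `allowed_unescaped` are hypothetical names, and the sketch assumes the general category of a character is obtained elsewhere (the standard library does not provide it).

```ocaml
(* Hypothetical coarse view of the Unicode general category. A real
   implementation would derive this from Unicode data tables. *)
type major_category =
  | Letter           (* L *)
  | Mark             (* M *)
  | Number           (* N *)
  | Punctuation      (* P *)
  | Symbol           (* S *)
  | Space_separator  (* Zs *)
  | Other_separator  (* Zl, Zp *)
  | Other            (* C: control, format, surrogate, private use, unassigned *)

(* Accept exactly the categories proposed above: L, N, P, S, and Zs. *)
let allowed_unescaped : major_category -> bool = function
  | Letter | Number | Punctuation | Symbol | Space_separator -> true
  | Mark | Other_separator | Other -> false
```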

To improve UX, maybe the compiler can suggest fixes for some of the following cases:

1. A Unicode character is successfully decoded but does not fall into the allowed categories (e.g., a mark, a character in a private use area, etc.). The compiler may suggest using its escaped form.
2. More than one Unicode character is successfully decoded.
   1. The compiler may show the first few (2-3?) Unicode characters, possibly using the escaped forms to reduce confusion.
   2. If the sequence is not in NFC and there's a chance that its NFC may consist of only one Unicode character in the allowed categories, maybe the compiler can suggest using NFC. For example, if the programmer writes e followed by ◌́, the compiler can point out that using NFC may solve the problem.
   3. If the sequence is not in NFC and its NFC will consist of only one Unicode character in the allowed categories, the compiler may suggest using NFC. For example, if the programmer writes e followed by ◌́, the compiler can suggest writing é.

Note: the difference between 2.2 and 2.3 is that checking whether NFC might help may be easier to implement than the full normalization algorithm.
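To make the "more than one character" case concrete, here is a rough sketch of how the body of a literal could be decoded and classified. All names (`decode_utf_8`, `escaped`, `verdict`, `check_literal_body`) are hypothetical, the code assumes OCaml 4.14 or later for `String.get_utf_8_uchar`, and it covers neither the category check nor the NFC suggestions, since the standard library has no normalization support.

```ocaml
(* Decode a UTF-8 string into its scalar values; None on malformed input. *)
let decode_utf_8 (s : string) : Uchar.t list option =
  let rec go i acc =
    if i >= String.length s then Some (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then
        go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
      else None
  in
  go 0 []

(* Escaped form used in the diagnostic, at least four hex digits. *)
let escaped (u : Uchar.t) : string =
  Printf.sprintf "\\u{%04X}" (Uchar.to_int u)

type verdict =
  | Single of Uchar.t
  | Empty
  | Malformed
  | Too_many of string list (* escaped forms of at most the first three characters *)

let check_literal_body (s : string) : verdict =
  match decode_utf_8 s with
  | None -> Malformed
  | Some [] -> Empty
  | Some [ u ] -> Single u
  | Some us -> Too_many (List.map escaped (List.filteri (fun i _ -> i < 3) us))
```

With this sketch, the decomposed sequence e followed by U+0301 is reported as two characters, which is where a compiler could attach the NFC hint.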

Range patterns

It would be great if range patterns such as 'a' .. 'z' also worked for Uchar.t, but I understand this requires range patterns for int (#8504). Maybe the Unicode characters will eventually motivate someone to implement them.
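Until such range patterns exist, one possible workaround is to match on the code point with a guard. The function name `subscript_digit_value` is invented for this sketch.

```ocaml
(* Map U+2080 .. U+2089 to 0 .. 9, using a guard in place of a range pattern. *)
let subscript_digit_value (c : Uchar.t) : int option =
  match Uchar.to_int c with
  | n when 0x2080 <= n && n <= 0x2089 -> Some (n - 0x2080)
  | _ -> None
```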

Consistency between escaped and unescaped characters

I believe it would be rather confusing if escaped and unescaped Unicode characters used different notations, for example '\u{1F60F}' and u'😀'. The above table already rules out several options that fail this criterion.

Consistency with string literals

I also believe it is good to match the current `\u{NNNN}` syntax in `string` literals. Unescaped characters arguably further improve the compatibility with `string` literals. Note that I put ⚠️ for `\u{😀}` because we could change the syntax of `string` literals to accept `\u{😀}` as well.
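For reference, this is how the existing `\u{NNNN}` escape behaves inside a `string` literal today: it expands to the UTF-8 encoding of the code point, and getting a `Uchar.t` back requires decoding. The binding names are illustrative, and `String.get_utf_8_uchar` assumes OCaml 4.14 or later.

```ocaml
(* The escape expands to the UTF-8 encoding of U+2080, which is three bytes. *)
let subscript_zero_utf_8 : string = "\u{2080}"
let byte_length : int = String.length subscript_zero_utf_8

(* Recover the scalar value by decoding the first character of the string. *)
let decoded : Uchar.t =
  Uchar.utf_decode_uchar (String.get_utf_8_uchar subscript_zero_utf_8 0)
```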

Backward compatibility

Some choices suggest adding a prefix or suffix u to the literal, which could break existing programs. While I doubt any reasonable program will have code like that, it is theoretically possible. (BTW, is it okay if we reserve all prefixes/suffixes and give a warning for such strange code?)

Syntax overloading and type inference

The first suggestion ('😀') involves type-directed disambiguation for character literals, with a strong bias towards the old char for characters within the ASCII range (such as 'a').

Printing `Uchar.t`

The resolution of this feature request should give clear guidance for printing Uchar.t (#11999, #12462). In my opinion, the main reason we don't know how to print Unicode characters is that we don't know how to write them.
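As an illustration only, here are two printers a resolution might choose between, assuming the notation of the first row of the table. The function names are hypothetical, and nothing here reflects a decision made in #11999 or #12462.

```ocaml
(* UTF-8 encoding of a single scalar value. *)
let utf_8_of_uchar (u : Uchar.t) : string =
  let b = Buffer.create 4 in
  Buffer.add_utf_8_uchar b u;
  Buffer.contents b

(* Escaped rendering, with the code point in at least four hex digits. *)
let escaped_of_uchar (u : Uchar.t) : string =
  Printf.sprintf "'\\u{%04X}'" (Uchar.to_int u)

(* Unescaped rendering: the UTF-8 bytes between single quotes. *)
let unescaped_of_uchar (u : Uchar.t) : string =
  "'" ^ utf_8_of_uchar u ^ "'"
```

A printer would presumably pick the unescaped form only for characters in the allowed categories and fall back to the escaped form otherwise.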

ISO/IEC 8859-1

This feature request should probably be implemented only after OCaml has declared that UTF-8 is the only valid encoding for the source file, which means ISO/IEC 8859-1 will no longer be considered. However, I think it is still nice to make a plan now.
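A small sketch of why the two encodings conflict: the same character U+00E9 is one byte in ISO/IEC 8859-1 but two bytes in UTF-8, and the single Latin-1 byte is not valid UTF-8 on its own. The binding names are illustrative, and `String.is_valid_utf_8` assumes OCaml 4.14 or later.

```ocaml
(* U+00E9 as a single ISO/IEC 8859-1 byte, and as its UTF-8 encoding. *)
let latin1_e_acute : string = "\xE9"
let utf_8_e_acute : string = "\u{E9}"

(* The lone Latin-1 byte is rejected by a UTF-8 validity check. *)
let latin1_is_valid_utf_8 : bool = String.is_valid_utf_8 latin1_e_acute
let utf_8_is_valid_utf_8 : bool = String.is_valid_utf_8 utf_8_e_acute
```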

Suggested choices

I personally prefer adding '😀' or ''😀'', but would welcome any of the options in the table.

Changes

Added a point that characters belonging to **Mark (M)** can be revisited later.
Clarified the "beginning Unicode characters"
Typo: I meant unescaped characters can further improve the compatibility with string literals.


nojb · 2023-10-28T09:11:38Z

> And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:
>
> ```ocaml
> if '₀' <= code && code <= '₉'
> then ... (* Unicode numeric subscripts *)
> else ...
> ```

Even with this proposal, this will not be possible since int and Uchar.t will not be compatible; you will still have to pass your literals to Uchar.to_int.

favonia · 2023-10-28T09:16:25Z

> > And if range patterns are difficult, I believe the following code is still an improvement over what can be done today:
> >
> > ```ocaml
> > if '₀' <= code && code <= '₉'
> > then ... (* Unicode numeric subscripts *)
> > else ...
> > ```
>
> Even with this proposal, this will not be possible since int and Uchar.t will not be compatible; you will still have to pass your literals to Uchar.to_int.

I'm a bit confused: `Stdlib.(<=)` is polymorphic (maybe for the wrong reasons), so why do they need to be `int`? However, I realized I should have written `c` rather than `code`. I am revising it now.
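A quick sketch of the point about polymorphic comparison: `<=` already type-checks on two `Uchar.t` values today, so only the literals are missing. The helper `in_range` is a made-up name.

```ocaml
(* The polymorphic comparison is used at type Uchar.t on both sides.
   Passing an int where a Uchar.t is expected would be a type error. *)
let in_range (lo : Uchar.t) (hi : Uchar.t) (c : Uchar.t) : bool =
  lo <= c && c <= hi
```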

nojb · 2023-10-28T09:22:49Z

> However, I realized I should not write `code` but `c` instead. I am revising it now.

Right, sorry, I assumed code was of type int...

paurkedal · 2023-11-22T21:49:59Z

Good summary. I think the following rather severe issue with the fourth option ('\u{😀}') was overlooked on the forum: if the escaping is to be consistent with strings, then '\u{0}', ..., '\u{f}' need to be interpreted as escape sequences, which leaves no way to express the Unicode variants of '0', ..., '9', 'a', ..., 'f' literally.
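The existing behavior in `string` literals shows the clash: a single hex digit between the braces is already read as a code point, not as the digit character itself. The binding names below are illustrative.

```ocaml
(* In a string literal the braces hold a hexadecimal code point. *)
let nul : string = "\u{0}"       (* U+0000, one byte *)
let line_feed : string = "\u{a}" (* U+000A, the line feed character *)
```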

favonia · 2023-11-22T23:15:30Z

@paurkedal Thanks. The summary was updated with a new confusing case in that design.

favonia added the feature-wish label Oct 28, 2023
