Unicode characters can sometimes not be read in the form they are emitted #1802

triska · 2023-05-04T18:25:02Z

Let's ask Scryer Prolog whether there are any other cases like #1768, where the system itself emits a term that cannot be read back with read/1:

?- length(_, N),
   char_code(Char, N),
   write_term_to_chars(Char, [quoted(true)], Cs0),
   append(Cs0, " .", Cs),
   catch(read_from_chars(Cs, Term), error(syntax_error(_),_), portray_clause(cannot_read(N,Char))),
   false.

With #1799 and #1800 already applied, I get the following 17 remaining cases:

cannot_read(5760,' ').
cannot_read(8192,' ').
cannot_read(8193,' ').
cannot_read(8194,' ').
cannot_read(8195,' ').
cannot_read(8196,' ').
cannot_read(8197,' ').
cannot_read(8198,' ').
cannot_read(8199,' ').
cannot_read(8200,' ').
cannot_read(8201,' ').
cannot_read(8202,' ').
cannot_read(8232,' ').
cannot_read(8233,' ').
cannot_read(8239,' ').
cannot_read(8287,' ').
cannot_read(12288,'　').

For instance, taking the first example:

?- char_code(Char, 5760).
   Char = ' '.
?- Char = ' '.
   error(syntax_error(invalid_single_quoted_character),read_term/3).

My overall impression is that the situation is very hopeful: if these remaining cases cannot be meaningfully detected with one of the available character categorizations, then we can simply treat these few code points as special cases that must be emitted as hexadecimal escape sequences.
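To make the hexadecimal escape idea concrete, here is a minimal sketch of what such output could look like for one of the listed code points. It is only an illustration under stated assumptions, not the way Scryer Prolog implements this: `hex_escaped_chars/2` and `roundtrips/1` are hypothetical helper names, and the sketch relies on `format_//2` from `library(format)` with the `~16r` radix directive and on the default `double_quotes` setting (`chars`).

```prolog
:- use_module(library(charsio)).
:- use_module(library(format)).
:- use_module(library(lists)).

% Hypothetical helper, not part of Scryer Prolog:
% Cs is the quoted, hex-escaped form of the character with code N.
% For N = 5760, Cs consists of the characters '\x1680\'.
hex_escaped_chars(N, Cs) :-
    phrase(format_("'\\x~16r\\'", [N]), Cs).

% Hypothetical check: read the escaped form back with the same
% machinery as in the query above, and compare the character code.
roundtrips(N) :-
    hex_escaped_chars(N, Cs0),
    append(Cs0, " .", Cs),
    read_from_chars(Cs, Term),
    atom_codes(Term, [N]).
```

A query such as `?- roundtrips(5760).` can then be used to see whether the reader accepts the escaped form of a given code point, in contrast to the literal form shown in the failing example.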

Low priority. For syntactic ISO conformance, #1771, #1773, #1778 etc. are far more important.

UWN · 2023-05-04T18:47:44Z

It seems that all these characters make sense within a quoted context. I mean, ¼ has always been a regular printable character in Latin1. Hexadecimal sequences should be reserved for the truly ambiguous cases, like the nbsp.

triska · 2023-05-04T19:34:03Z

I have now found that applying #1799 reduces the list to 17 remaining cases. I have updated the list in the original post!

triska · 2023-05-15T18:19:41Z

Resolved via #1805.

This addresses #1802.

triska added a commit to triska/scryer-prolog that referenced this issue May 14, 2023

extend logic to all control and whitespace characters

49addc7

This addresses mthom#1802.

triska mentioned this issue May 14, 2023

extend logic to all control and whitespace characters #1805

Merged

triska closed this as completed May 15, 2023

mthom pushed a commit that referenced this issue Jun 23, 2023

extend logic to all control and whitespace characters

c2f2623

This addresses #1802.
