Unicode characters can sometimes not be read in the form they are emitted #1802

Closed
triska opened this issue May 4, 2023 · 3 comments

triska commented May 4, 2023

Let's ask Scryer Prolog whether there are any other cases like #1768, where the system itself emits a term that cannot be read back with read/1:

?- length(_, N),
   char_code(Char, N),
   write_term_to_chars(Char, [quoted(true)], Cs0),
   append(Cs0, " .", Cs),
   catch(read_from_chars(Cs, Term), error(syntax_error(_),_), portray_clause(cannot_read(N,Char))),
   false.

With #1799 and #1800 already applied, I get the following 17 remaining cases:

cannot_read(5760,' ').
cannot_read(8192,' ').
cannot_read(8193,' ').
cannot_read(8194,' ').
cannot_read(8195,' ').
cannot_read(8196,' ').
cannot_read(8197,' ').
cannot_read(8198,' ').
cannot_read(8199,' ').
cannot_read(8200,' ').
cannot_read(8201,' ').
cannot_read(8202,' ').
cannot_read(8232,'
').
cannot_read(8233,'
').
cannot_read(8239,' ').
cannot_read(8287,' ').
cannot_read(12288,' ').
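
Because the quoted characters above all render as blank space, it can help to view the same code points in hexadecimal. The following is a minimal sketch, assuming Scryer Prolog with `library(format)` and `library(lists)`; the predicate names `reported_codes/1`, `code_hex/2` and `print_reported_codes/0` are hypothetical helpers, not part of any library.

```prolog
:- use_module(library(format)).
:- use_module(library(lists)).

% The 17 code points from the list above, copied verbatim.
reported_codes([5760,8192,8193,8194,8195,8196,8197,8198,8199,8200,
                8201,8202,8232,8233,8239,8287,12288]).

% code_hex(+N, -Cs): Cs is the text "U+" followed by N in radix 16.
% (Hypothetical helper; ~16r is the radix directive of format_//2.)
code_hex(N, Cs) :-
    phrase(format_("U+~16r", [N]), Cs).

% Print one "decimal = U+hex" line per reported code point.
print_reported_codes :-
    reported_codes(Ns),
    (   member(N, Ns),
        code_hex(N, Cs),
        format("~d = ~s~n", [N, Cs]),
        false
    ;   true
    ).
```

Running `?- print_reported_codes.` lists U+1680, U+2000 through U+200A, U+2028, U+2029, U+202F, U+205F and U+3000. These are Unicode space separators plus the line separator and the paragraph separator, which is why each one looks like a blank or a line break in the output.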

For instance, taking the first example:

?- char_code(Char, 5760).
   Char = ' '.
?- Char = ' '.
   error(syntax_error(invalid_single_quoted_character),read_term/3).

My overall impression is that the situation is very hopeful: if these remaining cases cannot be detected with one of the available character categorizations, we can simply treat these few code points as special cases that are emitted as hexadecimal escape sequences.
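
The special-casing idea can be sketched in Prolog itself. This is only an illustration under stated assumptions, not the change that was eventually made to the system: it assumes Scryer Prolog with `library(charsio)`, `library(format)` and `library(lists)`, and the names `special_code/1`, `quoted_char_chars/2` and `roundtrips/1` are hypothetical.

```prolog
:- use_module(library(charsio)).
:- use_module(library(format)).
:- use_module(library(lists)).

% special_code(+N): N is one of the 17 code points reported in this issue.
special_code(N) :-
    memberchk(N, [5760,8192,8193,8194,8195,8196,8197,8198,8199,8200,
                  8201,8202,8232,8233,8239,8287,12288]).

% quoted_char_chars(+Char, -Cs): Cs is a quoted text for Char.
% Special cases are written as a hexadecimal escape sequence, '\xHEX\';
% every other character is left to write_term_to_chars/3.
quoted_char_chars(Char, Cs) :-
    char_code(Char, N),
    (   special_code(N) ->
        phrase(format_("'\\x~16r\\'", [N]), Cs)
    ;   write_term_to_chars(Char, [quoted(true)], Cs)
    ).

% roundtrips(+Char): the emitted text reads back as the same atom.
% Same test as in the query at the top of this issue.
roundtrips(Char) :-
    quoted_char_chars(Char, Cs0),
    append(Cs0, " .", Cs),
    read_from_chars(Cs, Term),
    Term == Char.
```

For example, `?- char_code(C, 5760), quoted_char_chars(C, Cs).` produces the seven characters `'\x1680\'`, and `roundtrips/1` can then be used to check each of the 17 code points against the reader. An explicit list keeps the sketch simple; a check based on a character category would replace `special_code/1` if a suitable categorization turns out to exist.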

Low priority. For syntactic ISO conformance, #1771, #1773, #1778 etc. are far more important.

UWN commented May 4, 2023

It seems that all these characters make sense within a quoted context. I mean, ¼ has always been a regular printable character in Latin-1. Hexadecimal escape sequences should be reserved for the truly ambiguous cases, like the nbsp.

triska commented May 4, 2023

I have now found that applying #1799 reduces the list to 17 remaining cases. I have updated the list in the original post!

triska added a commit to triska/scryer-prolog that referenced this issue May 14, 2023

triska commented May 15, 2023

Resolved via #1805.

@triska triska closed this as completed May 15, 2023
mthom pushed a commit that referenced this issue Jun 23, 2023