FAQ/unicode_entry Unicode characters that look alike #1444

AlexDaniel · 2017-08-10T16:59:40Z

This issue has been mentioned a couple of times. ∖ and \ are visually too similar. Worse, sometimes they are even rendered identically. And this is not the only case: there are many non-ASCII characters that look like something else. How should we deal with this?

As an example, here are some screenshots of how it's rendered for me:

Emacs:

Firefox:

You'll notice that my emacs screenshot clearly shows that these characters are different. However, there's nothing clever about this:

my recommendation is really stupid. I'm using bitmap fonts for everything. These typically don't cover the whole Unicode range, so rendering falls back to whatever system font has that glyph. This way, the ASCII range renders sharp and everything else is noticeably blurry.

Anyway, I think that the proper solution would be to configure your editor to highlight characters outside the ASCII range (similarly to how people highlight whitespace). After some experiments I'll write about it.
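Until an editor does this for you, a rough stand-in is a small script that lists every character outside the ASCII range along with its position and Unicode name. This is only a sketch: the function name `find_non_ascii` and the sample input are made up for illustration, and it uses Python's standard `unicodedata` module.

```python
import unicodedata


def find_non_ascii(text):
    """Return (line, column, char, name) for every character outside ASCII."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:
                # Fall back to a placeholder for code points without a name.
                name = unicodedata.name(ch, "<unnamed>")
                hits.append((line_no, col_no, ch, name))
    return hits


if __name__ == "__main__":
    # Hypothetical sample: the first line uses U+2216, the second a real backslash.
    sample = "say 1 \u2216 2;\nsay 'a\\b';"
    for line_no, col_no, ch, name in find_non_ascii(sample):
        print(f"{line_no}:{col_no}: U+{ord(ch):04X} {name}")
```

Only the first line of the sample is reported, since the backslash on the second line is plain ASCII.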


toolforger · 2017-08-10T19:25:22Z

Unicode has data about which glyphs are confusable, and recommendations on how to deal with them.
See http://www.unicode.org/reports/tr39/ "Unicode Security Mechanisms".
As usual for Unicode, the TR details what data is available, how to interpret it, and recommendations for various application domains - in this case, for international domain names, for email addresses, for programming-language identifiers, and they added a section on integers since I last looked.

Confusability data will occasionally be extended, i.e. glyphs that are not considered confusable now might be considered confusable by future Unicode revisions.
So if Perl6 rejects Unicode-confusable names, there needs to be a way to switch the confusability check off, or to specify which Unicode version a Perl6 module was written for, or some other mechanism to ensure future compatibility.

toolforger · 2017-08-10T19:28:08Z

BTW on http://unicode.org/cldr/utility/confusables.jsp you can enter a text and see how many confusables the Unicode Consortium detects.
For \, it lists 12 variants, see http://unicode.org/cldr/utility/confusables.jsp?a=%5C&r=None .

JJ · 2018-03-16T06:41:00Z

What is the recommendation? Use a single form? Or something else?

AlexDaniel · 2018-03-16T08:04:22Z

Well, the recommendation is to not use nonsense in your source code, but from time to time you'll stumble upon code (maybe as a form of a joke) that produces some weird error message simply because it contains some character that looks like something else. So the question is how you can configure your editor so that it renders nonsense in a distinguishable way :)

toolforger · 2018-03-16T20:53:07Z

There's also stuff like the obfuscated C contest and Perl golf.
That's all tongue-in-cheek, but the coder in your company who is secretly injecting malware is a serious matter. And yes, this has been tried in the past, and will be tried in the future.

Some fonts are designed to minimize confusability. Slashed or dotted zeroes are an early approach, but with more glyphs there are more confusables, of course.
A second option is an editor that applies the confusability checks and highlights the affected code.
A third: compilers that refuse confusables in the same namespace, i.e. treat confusable names as a name conflict.
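To make the third option concrete, here is a rough sketch of a name-conflict check based on the "skeleton" idea from TR39. Big caveats: the `PROTOTYPES` table is a tiny hand-picked subset that I believe mirrors entries in the real `confusables.txt` (please verify against the published data), the skeleton algorithm is simplified, and the function names are invented for this example.

```python
import unicodedata

# Hypothetical, hand-picked subset for illustration only.
# The real mapping is the confusables.txt data file published with TR39.
PROTOTYPES = {
    "\u2216": "\\",  # SET MINUS -> REVERSE SOLIDUS
    "\u0430": "a",   # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u0435": "e",   # CYRILLIC SMALL LETTER IE -> LATIN SMALL LETTER E
    "\u043e": "o",   # CYRILLIC SMALL LETTER O -> LATIN SMALL LETTER O
}


def skeleton(name):
    """Simplified skeleton: NFD, map each character to its prototype, NFD again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def find_conflicts(names):
    """Group distinct names that share a skeleton, i.e. that look alike."""
    groups = {}
    for name in names:
        groups.setdefault(skeleton(name), set()).add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    # The second name contains a Cyrillic "o" (U+043E).
    for group in find_conflicts(["total", "t\u043etal", "count"]):
        print("confusable names:", [ascii(n) for n in group])
```

A compiler (or a linter) could run a check like this per namespace and report each group as a conflict. Since the table changes between Unicode versions, the result depends on which version of the data is loaded.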

AlexDaniel · 2018-03-28T08:56:15Z

No need to complicate things. There are confusables for almost any character out there, so it doesn't help.

IMO the simplest way is to colorize all non-ASCII characters. Here's an example of how Emacs renders a non-breaking space (by default):

Just adding a slight tint to non-ascii characters should be good enough. And it must be possible in most editors (with custom config).
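For anyone who wants to try the idea outside an editor, here is a minimal sketch that tints non-ASCII characters when printing to a terminal. It assumes a terminal that understands ANSI escape sequences. The yellow background and the name `tint_non_ascii` are arbitrary choices for this example, not something any editor does by default.

```python
# ANSI escape sequences: yellow background on, then reset all attributes.
TINT_ON = "\x1b[43m"
TINT_OFF = "\x1b[0m"


def tint_non_ascii(line):
    """Wrap every character outside the ASCII range in a background tint."""
    out = []
    for ch in line:
        if ord(ch) > 0x7F:
            out.append(f"{TINT_ON}{ch}{TINT_OFF}")
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    # U+2216 and the non-breaking space (U+00A0) get tinted; the backslash does not.
    print(tint_non_ascii("1 \u2216 2 vs 1 \\ 2, and a no-break\u00a0space"))
```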

AlexDaniel · 2018-07-07T15:55:06Z

Well, looking at rakudo/rakudo#2003, I think this isn't that much of a doc issue then. Although ∖ and \ most likely won't be covered when that issue is resolved.

We can take the list of editors and start submitting tickets (for highlighting of non-ascii chars). After that's done we'd still need a FAQ entry for those who use editors that don't have this feature (yet).

AlexDaniel added the wishlist label ("nice to have" issues; might require a lot of work or a big change or be low priority) Aug 10, 2017

AlexDaniel self-assigned this Aug 10, 2017

JJ mentioned this issue Jun 24, 2018

[Trap] Some Unicode operator might be mistaken #2119

Closed

AlexDaniel mentioned this issue Jul 1, 2018

Should we allow identifiers with different scripts? rakudo/rakudo#2003

Open

AlexDaniel added the external label ("Depends on another ticket (likely in another repo)") Jul 7, 2018

AlexDaniel removed their assignment Nov 4, 2020
