FAQ/unicode_entry Unicode characters that look alike #1444

Open
AlexDaniel opened this issue Aug 10, 2017 · 7 comments
Labels
external (depends on another ticket, likely in another repo), wishlist ("nice to have" issues; might require a lot of work or a big change or be low priority)

Comments

@AlexDaniel
Member

This issue has been mentioned a couple of times. ∖ and \ are visually too similar. Worse, sometimes they are even rendered identically. But this is not the only case: there are many non-ASCII characters that look like something else. How should we deal with this?

As an example, here are some screenshots of how it's rendered for me:

Emacs:
emacs screenshot

Firefox:
firefox screenshot

You'll notice that my emacs screenshot clearly shows that these characters are different. However, there's nothing clever about this:

my recommendation is really stupid. I'm using bitmap fonts for everything. These typically don't cover the whole Unicode range, so rendering falls back to whatever font on the system has the glyph. This way, the ASCII range renders sharp and everything else is noticeably blurry.

Anyway, I think the proper solution would be to configure your editor to highlight characters outside the ASCII range (similar to how people highlight whitespace). After some experiments I'll write about it.
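
As a rough, editor-independent illustration of the "flag everything outside ASCII" idea, here is a minimal Python sketch. The function name `find_non_ascii` and the sample string are made up for this example; it is not tied to any editor's API and only shows the kind of check an editor plugin would perform.

```python
import unicodedata


def find_non_ascii(source: str):
    """Yield (line, column, char, code point, name) for each non-ASCII character."""
    for line_no, line in enumerate(source.splitlines(), start=1):
        for col_no, ch in enumerate(line, start=1):
            if ord(ch) > 0x7F:  # anything above U+007F is outside ASCII
                name = unicodedata.name(ch, "<unnamed>")
                yield line_no, col_no, ch, f"U+{ord(ch):04X}", name


# Hypothetical sample: the operator between the two lists is U+2216, not a backslash.
sample = "my $a = (1, 2) \u2216 (2,);\nsay $a;"
for hit in find_non_ascii(sample):
    print(hit)
```

Reporting the code point and the Unicode name next to the position makes it clear which character is actually in the file, even when the font renders it identically to an ASCII one.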

AlexDaniel added the wishlist label Aug 10, 2017
AlexDaniel self-assigned this Aug 10, 2017
@toolforger

Unicode has data about which glyphs are confusable, and recommendations on how to deal with them.
See http://www.unicode.org/reports/tr39/ "Unicode Security Mechanisms".
As usual for Unicode, the TR details what data is available and how to interpret it, and gives recommendations for various application domains: in this case, international domain names, email addresses, and programming-language identifiers. They have also added a section on integers since I last looked.

Confusability data will occasionally be extended, i.e. glyphs that are not considered confusable now might be considered confusable by future Unicode revisions.
So if Perl6 rejects Unicode-confusable names, there needs to be a way to switch the confusability check off, or to specify which Unicode version a Perl6 module was written for, or some other mechanism to ensure future compatibility.

@toolforger

BTW, on http://unicode.org/cldr/utility/confusables.jsp you can enter a text and see how many confusables the Unicode Consortium detects.
For \, it lists 12 variants, see http://unicode.org/cldr/utility/confusables.jsp?a=%5C&r=None .
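
To make the idea of a confusability check more concrete, here is a simplified Python sketch in the spirit of the "skeleton" comparison described in TR39: decompose, map each character to a prototype, then compare. The `PROTOTYPES` table is a tiny hand-written subset chosen for illustration and is an assumption of this example; a real check would load the full confusables data file that accompanies the TR.

```python
import unicodedata

# Illustrative, hand-written subset (an assumption of this sketch).
# A real implementation would load the complete confusables data instead.
PROTOTYPES = {
    "\u2216": "\\",  # SET MINUS -> REVERSE SOLIDUS (backslash)
    "\u0430": "a",   # CYRILLIC SMALL LETTER A -> LATIN SMALL LETTER A
    "\u03BF": "o",   # GREEK SMALL LETTER OMICRON -> LATIN SMALL LETTER O
}


def skeleton(text: str) -> str:
    """Simplified skeleton: NFD, map to prototypes, NFD again."""
    decomposed = unicodedata.normalize("NFD", text)
    mapped = "".join(PROTOTYPES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)


def confusable(a: str, b: str) -> bool:
    """Two different strings are confusable if their skeletons are equal."""
    return a != b and skeleton(a) == skeleton(b)


print(confusable("\u2216", "\\"))  # the pair discussed in this issue
```

Because the table is data-driven, pinning it to a specific Unicode version (or switching the check off) only requires choosing which table to load.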

@JJ
Contributor

JJ commented Mar 16, 2018

What is the recommendation? Use a single form? Or something else?

@AlexDaniel
Member Author

Well, the recommendation is to not use nonsense in your source code, but from time to time you'll stumble upon code (maybe as a form of a joke) that produces some weird error message simply because it contains a character that looks like something else. So the question is how you can configure your editor so that it renders nonsense in a distinguishable way :)

@toolforger

There's also stuff like the obfuscated C contest, Perl golf.
That's all tongue-in-cheek, but the coder in your company who is secretly injecting malware is a serious matter. And yes this has been tried in the past, and will be tried in the future.

Some fonts are designed to minimize confusability. Slashed or dotted zeroes are an early approach, but with more glyphs there are more confusables of course.
The other option is to have an editor that applies the confusability checks and highlights the affected code.
The third: compilers that refuse confusables in the same namespace, i.e. treat confusable names as a name conflict.
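
The third option could look roughly like the Python sketch below: fold every declared name to a canonical form and report any group of distinct names that fold to the same thing. The `LOOKALIKES` table and the names `fold` and `confusable_conflicts` are hypothetical, and the table is a small hand-picked subset rather than real Unicode confusability data.

```python
from collections import defaultdict
import unicodedata

# Hand-picked subset for illustration only (assumption): Cyrillic а, е, о.
LOOKALIKES = {"\u0430": "a", "\u0435": "e", "\u043E": "o"}


def fold(name: str) -> str:
    """Map a name to a canonical form in which look-alikes collapse together."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(LOOKALIKES.get(ch, ch) for ch in decomposed)


def confusable_conflicts(names):
    """Return groups of distinct names in one namespace that fold identically."""
    groups = defaultdict(set)
    for name in names:
        groups[fold(name)].add(name)
    return [sorted(group) for group in groups.values() if len(group) > 1]


# "t\u043Etal" uses a Cyrillic о and collides with the all-Latin "total".
print(confusable_conflicts(["total", "t\u043Etal", "count"]))
```

A compiler doing this would raise the returned groups as redeclaration-style errors instead of printing them.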

@AlexDaniel
Member Author

AlexDaniel commented Mar 28, 2018

No need to complicate things. There are confusables for almost any character out there, so it doesn't help.

IMO the simplest way is to colorize all non-ASCII characters. Here's an example of how Emacs renders a non-breaking space (by default):

image

Just adding a slight tint to non-ascii characters should be good enough. And it must be possible in most editors (with custom config).
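
For anyone who wants to see the effect outside an editor, here is a small Python sketch that tints non-ASCII characters in terminal output using ANSI escape sequences. The function name `tint_non_ascii` and the choice of a yellow background are arbitrary assumptions of this example, and it assumes a terminal that understands ANSI colors; it is not how Emacs does it.

```python
TINT_ON = "\x1b[43m"  # ANSI escape: yellow background
TINT_OFF = "\x1b[0m"  # ANSI escape: reset all attributes


def tint_non_ascii(text: str) -> str:
    """Wrap every character outside the ASCII range in a background tint."""
    pieces = []
    for ch in text:
        if ord(ch) > 0x7F:
            pieces.append(f"{TINT_ON}{ch}{TINT_OFF}")
        else:
            pieces.append(ch)
    return "".join(pieces)


# Hypothetical sample with a non-breaking space (U+00A0) and a set minus (U+2216).
print(tint_non_ascii("say 1\u00A0+ 2;  (1, 2) \u2216 (2,)"))
```

The same rule ("tint if the code point is above U+007F") is what an editor configuration would express with its own highlighting mechanism.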

@AlexDaniel
Member Author

Well, looking at rakudo/rakudo#2003, I think this isn't that much of a doc issue then. Although ∖ and \ most likely won't be covered when that issue is resolved.

We can take the list of editors and start submitting tickets (for highlighting of non-ascii chars). After that's done we'd still need a FAQ entry for those who use editors that don't have this feature (yet).

AlexDaniel added the external label Jul 7, 2018
AlexDaniel removed their assignment Nov 4, 2020
3 participants