Add document for Unicode casemapping #272
documentation/rfc7700.md
Outdated
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates username from hostname.
Fix: separates nickname from username
documentation/rfc7700.md
Outdated
Nicknames cannot contain the following characters:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
This & identical entries below seem unnecessary; `:` only has special meaning as the first character of a parameter.
It does, but since it's already disallowed and I could see it causing possible confusion with libraries that split parameters strangely, figured it was better to disallow it. If we figure it's not required I can definitely remove it though.
Might as well forbid `:` in privmsgs, topics, etc. A library blindly splitting on `:` is not "strange", it's outright buggy.
Hmm, that's fair. In that case, I can just note that the first letter of one can't be `:`? (since if, e.g., a nickname started with `:` then you wouldn't be able to use it in normal messages)
Yeah, it should be fine under the "first character" list. (For channels it's already implied by CHANTYPE.)
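To illustrate the point made in this thread, here is a minimal sketch of IRC parameter splitting in which `:` is only special as the first character of a parameter. The function name `split_params` is hypothetical and not taken from any particular library.

```python
def split_params(raw: str) -> list[str]:
    """Split the parameter portion of an IRC line into parameters."""
    params = []
    rest = raw
    while rest:
        if rest.startswith(":"):
            # A leading ':' marks the trailing parameter: everything after
            # it, spaces and further colons included, is one parameter.
            params.append(rest[1:])
            break
        head, _, rest = rest.partition(" ")
        if head:
            # A ':' in the middle of a parameter is just an ordinary character.
            params.append(head)
    return params
```

With this splitting, `nick:name #chan` yields two parameters, while a name that starts with `:` swallows the rest of the line, which is why only the first character needs restricting.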
documentation/rfc7700.md
Outdated
These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted.

If a name does contain a restricted character (whether disallowed by the [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2.2) or this document), it MUST be rejected by the server and MUST NOT be propogated to other clients. This is done through the appropriate numeric for the command which tried to set or use the invalid name such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate.
At this point better use the named link syntax:
The `rfc7700` casemapping uses the PRECIS [Nickname profile][precis] as defined in [Section 2 of RFC 7700][precis].
[precis]: https://tools.ietf.org/html/rfc7700#section-2
Cool, I changed the other links above to use the named link syntax since they were all referring to the same URL. The link here and the link to rfc7700 up the top differ from the others since they're linking to different sections (and the url's only replicated once throughout the doc).
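As a rough sketch of the rejection behaviour quoted above, a server-side check might look like the following. The restricted set is taken from the outdated diff excerpts in this review (not the final document), the reply text is the conventional one for numeric 432, and `check_nick` is a hypothetical helper.

```python
ERR_ERRONEUSNICKNAME = 432

# Assumption: characters listed in the outdated excerpts under review.
RESTRICTED = set(" ,*?!")

def check_nick(server: str, client: str, nick: str):
    """Return an error reply line if the nick is rejected, else None."""
    if RESTRICTED & set(nick):
        # Rejected names are answered with a numeric and never propagated.
        return f":{server} {ERR_ERRONEUSNICKNAME} {client} {nick} :Erroneous nickname"
    return None
```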
documentation/rfc7700.md
Outdated
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('.', 0x2E)` - Denotes a server name.
* `('!', 0x21)` - Separates username from hostname.
Same as previously
documentation/rfc7700.md
Outdated
Hostnames cannot contain the following charactes:

* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.
IPv6 IPs need `:` in the hostname (spotted by @jobe1986)
They even technically can have it as the first character. Which is a bit problematic for 352 RPL_WHOREPLY and 311 RPL_WHOISUSER.
Servers add a `0` prefix to IPv6 IPs beginning with `:` so that's not a problem.
My mistake, meant to remove those with another change. This has been removed.
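The `0` prefix mentioned in this thread can be sketched as follows. `irc_safe_host` is a hypothetical name, and the equivalence check uses Python's standard `ipaddress` module only to show that the prefixed form denotes the same address.

```python
import ipaddress

def irc_safe_host(host: str) -> str:
    """Prefix hosts that start with ':' so they cannot be mistaken for a
    trailing parameter in replies such as RPL_WHOREPLY."""
    if host.startswith(":"):
        return "0" + host
    return host

def same_address(a: str, b: str) -> bool:
    """Compare two IPv6 address strings by value rather than by spelling."""
    return ipaddress.ip_address(a) == ipaddress.ip_address(b)
```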
@DanielOaks Could you add some examples, particularly ones that illustrate how comparisons work?
documentation/rfc7700.md
Outdated
* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
Is mask matching in channel names a thing? `*` and `?` are valid channel characters at the moment, this seems overly restrictive. (spotted by @jobe1986)
Cool, removed those
We (InspIRCd) use glob matching on channel names in various places.
@SaberUK Example? Do you also forbid those characters in channel names or is there just no way to specify them without accidentally over-globbing?
Actually, I just tested and was able to create a channel on Insp with both `*` and `?`. I think the recommendation should probably not go against existing valid characters.
@jwheare We don't presently forbid them although they are used in various places like e.g.
https://github.com/inspircd/inspircd/blob/master/docs/conf/modules.conf.example#L714
This does unfortunately result in some problems like what you mentioned though.
It is possible to block them, as documented: https://github.com/inspircd/inspircd/blob/insp20/docs/conf/modules.conf.example#L417-L419
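The over-globbing problem discussed here can be shown with a small sketch. It uses Python's `fnmatch` as a stand-in for a server's glob matcher (an assumption; real IRCd matchers differ in detail), and the channel list is made up for illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical channel list, including names that contain glob characters.
CHANNELS = ["#help", "#help*", "#helpdesk", "#h?lp"]

def matching(pattern: str) -> list[str]:
    """Return every channel whose name matches the glob pattern."""
    return [name for name in CHANNELS if fnmatchcase(name, pattern)]
```

Asking for the literal channel `#help*` also selects `#help` and `#helpdesk`, so a configuration entry meant for one channel silently applies to others.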
For clarification: Does PRECIS affect only comparisons or display as well? If it affects display, does the PRECIS case-folding rule mean that it's impossible to use mixed-case nicknames (since they get mapped to lowercase)?
@grawity As I understand it, it only affects comparisons; if adopting it meant losing upper case characters in nicks then it would be a step backwards.
documentation/rfc7700.md
Outdated
* `('6', 0x36)` - Disallowed.
* `('7', 0x37)` - Disallowed.
* `('8', 0x38)` - Disallowed.
* `('9', 0x39)` - Disallowed.
Not allowing numbers as the first char of a nick shouldn't be in the spec for these reasons:
- Servers already change the nick of clients to nicks starting with a number e.g. in case of collision and with this restriction that is a violation of the spec.
- Presently most (or all) servers don't allow nicks starting with numbers but in the future servers should be able to relax this restriction without updating the casemapping.
There's nothing stopping servers from accepting a subset of nicks allowed by this spec (they can send an invalid nick numeric for any nick they don't like) so servers can still disallow digits if they want but they cannot allow more nicks than what this spec allows. Also clients must be prepared to see nicks starting with digits.
@grawity @attilamolnar Correct, PRECIS (and casemapping) does not affect display, similarly to how casemapping works currently.
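A minimal sketch of "comparison only, not display": names are looked up by their folded form, but the spelling the user chose is what gets stored and shown. Python's `str.casefold` stands in for the real casemapping here (an assumption, since the standard library has no PRECIS implementation), and `NickRegistry` is a hypothetical class.

```python
class NickRegistry:
    def __init__(self, fold=str.casefold):
        self._fold = fold    # comparison function; swap in the real casemapping
        self._by_key = {}    # folded key -> nick as originally typed

    def claim(self, nick: str) -> bool:
        """Register a nick; fail if an equivalent nick is already in use."""
        key = self._fold(nick)
        if key in self._by_key:
            return False
        self._by_key[key] = nick
        return True

    def display(self, nick: str):
        """Look up by any equivalent spelling; return the stored display form."""
        return self._by_key.get(self._fold(nick))
```

Claiming `DanielOaks` blocks `danieloaks`, yet a lookup under any casing still returns the mixed-case `DanielOaks`.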
Made the Disallowed Characters section recommended instead of required, as suggested by @attilamolnar, threw
PRECIS (RFC 7564) defines two classes, To quote from 7564 (with emphasis added by me):
With that context out of the way, here's my question: Should we be re-using the Nickname profile (RFC 7700) for channel names as well as nicks and usernames? It would make more sense to me to restrict channel names to the
That's a good point... Using multiple algorithms (one for chans, one for nicks, and/or something similar), imo is just begging for trouble but I'll certainly have a closer look into and read of that, thanks for pointing it out.
documentation/rfc7700.md
Outdated
With the large numbers of new characters allowed comes the risk of introducing confusion for users. The PRECIS framework (much like the earlier framework [stringprep](https://tools.ietf.org/html/rfc3454)) aims to avoid this through mapping confusable characters to a single base character, and by allowing specific known-good characters.

The PRECIS framework represents the most modern standardized solution today for doing this sort of mapping and handling of internationalized names, and should mitigate most of the issues around this.
I think this is a highly misleading statement.
Reading Section 12.5 (Security Considerations - Visually Similar Characters) of RFC 7564 it says:
Because PRECIS-compliant strings can contain almost any properly encoded Unicode code point, it can be relatively easy to fake or mimic some strings in systems that use the PRECIS framework. The fact that some strings are easily confused introduces security vulnerabilities of the kind that have also plagued the World Wide Web, specifically the phenomenon known as phishing.
[...]
Because it is impossible to map visually similar characters without a great deal of context (such as knowing the font families used), the PRECIS framework does nothing to map similar-looking characters together, nor does it prohibit some characters because they look like others.
[...]
The challenges inherent in supporting the full range of Unicode code points have in the past led some to hope for a way to programmatically negotiate more restrictive ranges based on locale, script, or other relevant factors; to tag the locale associated with a particular string; etc. As a general-purpose internationalization technology, the PRECIS framework does not include such mechanisms.
Unless I'm mistaken, I believe this would be covered by the rules of the Nickname profile itself here (specifically, 3+4+5). Regardless, I'll have another read over both those documents and probably adjust the text here to make it more clear exactly what I'm referring to, thanks for pointing this out.
documentation/rfc7700.md
Outdated
Names being prepared MUST apply the following rules in the order shown:

1. Preperation using the PRECIS [Nickname profile][precis].
s/Preperation/Preparation/
documentation/rfc7700.md
Outdated
period: "2016"
email: "daniel@danieloaks.net"
---
This document describes a unicode-aware casemapping for IRC, based on the recommendations in [RFC 7700](https://tools.ietf.org/html/rfc7700).
Unicode is a proper noun so it should be capitalised.
Using an IdentifierClass, as pointed out by @M2Ys4U, is much better than using a FreeformClass.
Yo @M2Ys4U, now using UsernameCaseMapped (an IdentifierClass profile) for everything. In my tests... seems to work fine, and if it's better locked-down than the Nickname class then all the better.
On IRC we discussed the use case of emoji in channel names (#🥕 for example) - irccloud and others allow this in production right now. It seems UsernameCaseMapped might disallow such channel names. @DanielOaks is investigating the difficulty of a custom precis profile that permits such modifications to other profiles.
Hmm. I can't find any C library that has PRECIS and those profiles. But that could also be my current lack of knowledge with regards to unicode (and utf8). In any case, the availability of a library or drop-in code that various IRCd's could use for checking "is this nick permitted?" and "are these nicks the same?" would make implementing this much more doable, possibly even crucial for success. And of course, not just for IRC servers but also for services and (I suppose) clients. Also, I read that as of October 2017 RFC8265 obsoletes RFC7613 and RFC8266 obsoletes RFC7700.
Yeah, there's some trouble with this approach around confusable characters, so I've got this specification 'on hold' until I work out those issues. Once I've got those issues worked out I'll change this spec from 7613 to one of the newer RFC numbers. To be specific, PRECIS doesn't in any way attempt to map confusable characters to a single codepoint. Well, it does, really, but only certain confusable characters, and not others. Which means you can actually get two nicknames that look exactly the same following this method. See also, section 12.5 - Visually Similar Characters of RFC 7564.
I only saw your edit just now:
That is disappointing. I guess I misunderstood what PRECIS does then (could be because I didn't read it :D). I must say that the Security Considerations in your draft gave me a bit of a false sense of security.. it starts with saying it has considerable security impact but then outlines the avoiding of confusing characters etc. etc... it sounded quite reassuring. So you may want to reword that or, better, see if a solution is possible (see next). I think for something workable on IRC you would have to "solve" the problem of identical looking UTF8 nicks as well. Or give suggestions about what should be done in the IRCd. Don't you agree?
For sure, yeah. I assumed PRECIS protected against that as well (because it does map a fair few of those characters together, just not the identical-looking ones). I wrote up the spec, then someone demonstrated certain pairs of characters that look identical, but the PRECIS UsernameCaseMapped profile keeps separate, so yeah. Don't you love Unicode? I'm planning on something along those lines as well, similar to the PRECIS suggestions around possibly only allowing one script or similar (as much as that feels like a copout).
@DanielOaks: I tried to contact you a while back via email (27 Nov 11:53 UTC) from syzop@vunscan.org. I could put part of that here in the open: I've added experimental UTF8 support in set::allowed-nickchars in UnrealIRCd which allows the admin to allow certain utf8 characters in nick names. In the release notes I mention that, like the original set::allowed-nickchars, it does not do any special CASEMAPPING or "similar looking character detection", and I sum up the known problems with the lack of such support. I also noticed that for example anope does not seem to allow such characters which further limits the current use. In my opinion the goal of IRCv3, or in any case the IRC community in general, should be to add a new CASEMAPPING in some standard way/library/tables so the same casemapping (and other stuff PRECIS does) is applied the same way to irc servers and services (and clients). If every software implementation is going to choose its own casemapping it's rather annoying and confusing. This is especially notable in the servers vs services case where f.e. account names are compared. The spec is just as important as having common code/lib/implementations. In my email to you I also ask for some technical suggestions with regards to that.
Heyo @syzop! Sorry for not responding, emails have fallen behind a little with lots going on at work and home. I'm thinking of changing this proposal slightly to better integrate with existing servers that use something like Totally agree with the standard way to do such a thing, that's been mostly the intention of this since it started coming up and since I threw this proposal in. I'll respond to your email and either later this week or over the weekend throw those changes into this specification to clear things up and make it easier for existing servers to implement. Thanks for the push with this and I'm excited to see what comes of Unreal's new experimental char support :)
Great. And no problem at all! Glad to see your continued interest and look forward to working with you.
…ibility with servers, laid out Visually Similar Characters section
s/preperation/preparation/g Should there be a way to specify an allowed list of characters/sets as described in the visually similar section? Another ISUPPORT token?
Thanks, will fix that spelling issue. Nah, no way to do something like that. It'd end up being a multi-KB (maybe even MB) blob sent to the client on connection/registration, and we'd be inventing the format from scratch. Since the server can just refuse that channel/nick name they can just refuse it with an appropriate error message in the numeric if they want to.
Fair enough. Just wondering if something like Unreal's allowed-nickchars labels might work instead of a huge list of characters https://www.unrealircd.org/docs/Nick_Character_Sets hebrew-utf8, etc aren't really standard labels but maybe something similar exists?
You could try, but they wouldn't be able to be used as anything more than a rough "Hey this is what we sorta allow". In addition to standard character sets, since the issue is around confusable characters you'd also need to have the ability to list explicit characters as well, which combinations of characters specifically are allowed and which combinations aren't allowed together. For instance, in my (admittedly fairly lax) server I'm looking at doing an interesting approach around the confusables lists distributed by Unicode and somewhat-heuristically along with character sets determining whether specific names are allowed. Those are the sorta details you can't really codify. Given the above, I think just plain leaving it to the server and them explicitly telling the user why that nick/channel name isn't allowed would be the best option.
As a note, there's a running implementation of this on It should be noted that aside from the banned characters in the spec, it doesn't implement any sort of additional protections recommended (it's planned, just a lot of work to either build up whitelists or build up some sort of blacklist system based around Unicode's confusables list). This means it's pretty simple to get similar-looking nicks using this homograph attack generator, but that's just what you get when you don't implement proper protections. I don't consider this a spec issue because the spec leaves those specific protection mechanisms up for debate. We can't legislate a single good one because there isn't any defined good one. Even browsers change how they do it regularly, and they've had to deal with it a lot longer than us.
Yo @slingamn if you could take a look over the two commits just added that'd be ace. You're the one that's looked deeper into the skeletonisation requirements so if I've got anything wrong just let me know ;)
@@ -73,7 +73,9 @@ As noted in the [Visually Similar Characters section](https://tools.ietf.org/htm

With the new allowed Unicode characters comes the ability to use characters that look the same. For example, `E (0x45)`, `Ε (U+0395)` (Greek Capital Letter Epsilon), and `Е (U+0415)` (Cyrillic Capital Letter IE) look the same in most fonts, but are treated as separate characters by this casemapping. More examples of these can be found in Unicode's [Confusables document](https://www.unicode.org/Public/security/latest/confusables.txt).

To combat this, we recommend only allowing characters from a single character set or locale to be used in names, or for the allowed characters to be a specific list of known, non-confusable characters. Other recommendations are available in the [Visually Similar Characters section](https://tools.ietf.org/html/rfc8264#section-12.5) of the PRECIS framework specification. Names that have the opportunity to be confusing SHOULD be disallowed by servers.
Unicode skeletonisation is the method we recommend to combat this. For each identifier (nick/channel name) on the server, a 'skeleton' is generated by taking the **casefolded** name, and then applying to it the transformations described in the [Unicode Security Mechanisms document](http://unicode.org/reports/tr39/#Confusable_Detection). These skeletons, if used, MUST ONLY be used for comparison, and not as any user-visible identifier (as they intentionally contain complicated mixes of scripts and characters). When users change nicknames or create new channels, the casefolded names should be compared and the skeletons should also be compared to ensure that both are globally-unique (with any non-unique names rejected outright). This seems to be the most reliable method as of right now, but does require storing the skeletons of all in-use names for comparison purposes.
This isn't what we implemented in oragono --- we skeletonize the unfolded, original identifier (the one that is displayed to the users), and only then apply a round of width and case normalization. The rationale is that an initial round of casefolding may lose information about visual confusability. Hypothetically, you could have a non-Latin character with both uppercase and lowercase forms, such that its uppercase form is visually confusable with a Latin character but its lowercase form is visually distinct. Casefolding first would allow an impersonation attack using the uppercase character.
To be honest, I think it would help to see how this plays out in the wild --- try to get real-world user stories from people using non-Latin scripts, also see if we can get some Unicode experts to play with the implementation and try to break it.
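The ordering described in this comment (skeletonise the original name first, then normalise width and case) can be sketched roughly as below. This is not Oragono's code: the two-entry `CONFUSABLES` table is a hand-picked excerpt for illustration only, where a real implementation would load the full mapping from Unicode's confusables.txt, and the function names are hypothetical.

```python
import unicodedata

# Illustrative excerpt only; assumed to mirror entries in confusables.txt.
CONFUSABLES = {
    "\u0395": "E",  # Greek capital letter epsilon
    "\u0415": "E",  # Cyrillic capital letter ie
}

def skeleton(name: str) -> str:
    """TR39-style skeleton: decompose, map confusables, decompose again."""
    decomposed = unicodedata.normalize("NFD", name)
    mapped = "".join(CONFUSABLES.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFD", mapped)

def comparison_key(name: str) -> str:
    """Skeletonise the unfolded name, then apply width and case folding."""
    return unicodedata.normalize("NFKC", skeleton(name)).casefold()
```

A name that begins with the Cyrillic `Е` and the all-Latin `Eve` casefold to different strings, but they share a `comparison_key`, so the second registration would be rejected as confusable. The key is for comparison only and is never shown to users.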
I think the only thing to fix with this spec is that we described our skeletonisation a bit incorrectly, but otherwise this should describe our implementation pretty well. As far as a general i18n name solution, I like the prospect of display names more than this spec because of the simpler implementation (once you've got Metadata at least) and the flexibility. It takes a lot to keep Unicode identifiers unique, evident from our skeletonisation description above. We're gonna keep this implementation in Oragono, but I might publish this as a vendor thing on our site instead of keeping it as a PR here. Particularly as PRECIS libraries that IRC servers written in C (and the like) don't really seem to be available. |
Unicode names have been wanted for a while, and have been used in experimental implementations as well as in certain bouncers integrating with other messaging systems.
This document outlines a method based on RFC 7700 which should represent a reasonable, modern solution for those projects that wish to allow Unicode characters and casemap them appropriately.
There's previous discussion around this in #259.
This casemapping does not specify any sort of backwards-compatibility measures. Being compatible with clients and servers that cannot correctly handle Unicode has been brought up many times during discussions about Unicode casemappings. Below are some of the most reasonable suggestions, and why I haven't included them in this specification:
Encoding names so non-rfc7700 servers can accept them
This suggestion revolves around the client encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the server (allowing them to be used on every server out there today). When receiving these encoded names, other Unicode-aware clients will decode them to their proper Unicode counterpart before displaying them.
Pros
Cons
Because of the security implications this would bring up, I think this is an extremely bad idea.
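For readers unfamiliar with the idea, here is a sketch of what such an encoding could look like. Punycode is used purely as an assumption to make the example concrete; the proposal above does not name an encoding, and `encode_name` / `decode_name` are hypothetical helpers.

```python
def encode_name(name: str) -> str:
    """Encode a Unicode name into ASCII using Python's built-in punycode codec."""
    return name.encode("punycode").decode("ascii")

def decode_name(encoded: str) -> str:
    """Reverse encode_name, recovering the original Unicode name."""
    return encoded.encode("ascii").decode("punycode")
```

The encoded form is itself an ordinary ASCII string, so a server that is unaware of the scheme treats it as just another name.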
Encoding names so that non-Unicode-aware clients can accept them
This suggestion revolves around the server encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the client (allowing them to be accepted by every client out there). When receiving these encoded names, Unicode-aware clients will decode them into their proper Unicode counterpart before displaying them.
Pros
Cons
I don't think this is required because getting this casemapping widely implemented will take time. By the time this casemapping gets into large enough use to warrant worrying about legacy clients, I think a large majority of the clients currently in use will support Unicode names without issues. As well, a number of clients already successfully accept Unicode names.
Because of the complexity this process adds, and how much of a non-issue I expect it to be by the time this is implemented, I don't think this should happen. I believe it's more effort than it's worth and would cause more problems than it would solve.