Add document for Unicode casemapping #272

Open
@DanielOaks
Member

Unicode names have been wanted for a while, and are already used in experimental implementations as well as in certain bouncers that integrate with other messaging systems.

This document outlines a method based on RFC 7700 which should represent a reasonable, modern solution for those projects that wish to allow unicode characters and casemap them appropriately.

There's previous discussion around this in #259.

This casemapping does not specify any sort of backwards-compatibility measures. Being compatible with clients and servers that cannot correctly handle unicode has been brought up many times during discussions about unicode casemappings. Below are some of the most reasonable suggestions, and why I haven't included them in this specification:

Encoding names so non-rfc7700 servers can accept them

This suggestion revolves around the client encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the server (allowing them to be used on every server out there today). When receiving these encoded names, other unicode-aware clients will decode them to their proper unicode counterpart before displaying them.

Pros

  • Unicode nicknames and channel names can be used on servers that don't natively support unicode.
  • Non-unicode-aware clients can connect to servers that are unicode-aware.

Cons

  • Non-unicode-aware servers will allow nicknames that look like duplicates, because of the encoding required and because the server cannot enforce the name preparation described above.
  • Names can be duplicated by encoding names that contain only IRC-friendly characters (the alternative is strict client-side checking, which is likely to be misinterpreted or misimplemented).

Because of the security implications this would bring up, I think this is an extremely bad idea.
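
To make the duplicate problem concrete, here is a minimal sketch. It assumes a hypothetical client-side scheme built on Python's built-in `punycode` codec with an invented `u--` marker prefix; nothing like this is specified anywhere, and `legacy_fold` only approximates how a non-unicode-aware server with ASCII casemapping compares names.

```python
PREFIX = "u--"  # hypothetical marker for encoded names (an assumption)


def encode_name(name: str) -> str:
    """Encode a Unicode name into ASCII, as a client might under this scheme."""
    if name.isascii():
        return name
    return PREFIX + name.encode("punycode").decode("ascii")


def decode_name(wire: str) -> str:
    """Decode a name received from the server back into Unicode for display."""
    if wire.startswith(PREFIX):
        return wire[len(PREFIX):].encode("ascii").decode("punycode")
    return wire


def legacy_fold(wire: str) -> str:
    """ASCII-only case folding, roughly what a legacy server applies."""
    return wire.lower()


a, b = "Ünï", "ünï"
# A unicode-aware comparison treats these as the same name...
same_for_humans = a.casefold() == b.casefold()
# ...but the legacy server only sees two unrelated ASCII strings.
same_for_server = legacy_fold(encode_name(a)) == legacy_fold(encode_name(b))
```

Here `same_for_humans` is true while `same_for_server` is false, so both names can be registered at once. The second con follows from the same code: anyone can type the ASCII output of `encode_name` directly as a nickname, and decoding clients will display it as the Unicode name.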

Encoding names so that non-unicode-aware clients can accept them

This suggestion revolves around the server encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the client (allowing them to be accepted by every client out there). When receiving these encoded names, unicode-aware clients will decode them into their proper unicode counterpart before displaying them.

Pros

  • We can be assured that any client, including ones that can't do unicode, will be able to accept the names.

Cons

  • Due to the encoding, these encoded nicknames are not going to be easily readable by non-unicode-aware clients, and are going to appear as a blob of unreadable text.
  • Even if we only encode names that contain special characters, that complicates message sending in ways that are likely to irritate server authors into not implementing this.
  • The decoding/encoding required by this (particularly if only certain names are encoded) complicates client programming in ways that are likely to be misimplemented.

I don't think this is required because getting this casemapping widely implemented will take time. By the time this casemapping gets into large enough use to warrant worrying about legacy clients, I think a large majority of the clients currently in use will support unicode names without issues. As well, a number of clients already successfully accept unicode names.

Because of the complexity this process adds, and because I see it as a non-issue by the time this is implemented, I don't think this should happen. I believe it's more effort than it's worth, and that this measure would cause more problems than it would solve.

documentation/rfc7700.md
+* `(',', 0x2C)` - Used as a separator.
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
+* `('!', 0x21)` - Separates username from hostname.
@attilamolnar
attilamolnar Sep 15, 2016 Member

Fix: separates nickname from username

documentation/rfc7700.md
+
+These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted.
+
+If a name does contain a restricted character (whether disallowed by the [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2.2) or this document), it MUST be rejected by the server and MUST NOT be propogated to other clients. This is done through the appropriate numeric for the command which tried to set or use the invalid name such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate.
@grawity
grawity Sep 15, 2016 edited Contributor

At this point better use the named link syntax:

The `rfc7700` casemapping uses the PRECIS [Nickname profile][precis] as defined in [Section 2 of RFC 7700][precis].

[precis]: https://tools.ietf.org/html/rfc7700#section-2
@DanielOaks
DanielOaks Sep 15, 2016 Member

Cool, I changed the other links above to use the named link syntax since they were all referring to the same URL. The link here and the link to rfc7700 up the top differ from the others since they're linking to different sections (and the url's only replicated once throughout the doc).

documentation/rfc7700.md
+Nicknames cannot contain the following characters:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(':', 0x3A)` - Separates trailing parameter.
@grawity
grawity Sep 15, 2016 edited Contributor

This & identical entries below seem unnecessary; : only has special meaning as the first character of a parameter.

@DanielOaks
DanielOaks Sep 15, 2016 Member

It does, but since it's already disallowed and I could see it causing possible confusion with libraries that split parameters strangely, figured it was better to disallow it. If we figure it's not required I can definitely remove it though.

@grawity
grawity Sep 15, 2016 Contributor

Might as well forbid : in privmsgs, topics, etc. A library blindly splitting on : is not "strange", it's outright buggy.

@DanielOaks
DanielOaks Sep 15, 2016 Member

Hmm, that's fair. In that case, I can just note that the first character of a name can't be :? (since if e.g. a nickname started with : then you wouldn't be able to use it in normal messages)

@grawity
grawity Sep 15, 2016 Contributor

Yeah, it should be fine under the "first character" list. (For channels it's already implied by CHANTYPES.)
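
For reference, the "only special as the first character of a parameter" rule can be sketched as follows. This is a simplified illustration, not code from any particular library: `split_params` is a hypothetical helper, and it assumes the line has no message tags and no source prefix.

```python
def split_params(line: str) -> list[str]:
    """Split an IRC line (no tags, no source prefix) into command and parameters."""
    params: list[str] = []
    rest = line
    while rest:
        if rest.startswith(":"):
            # A leading ':' marks the trailing parameter: keep the rest verbatim.
            params.append(rest[1:])
            break
        head, _, rest = rest.partition(" ")
        if head:  # skip empty pieces produced by repeated spaces
            params.append(head)
    return params


middle_colon = split_params("PRIVMSG ni:ck :hello: world")
leading_colon = split_params("PRIVMSG :nick hello")
```

A `:` in the middle of `ni:ck` is carried through untouched, whereas a target starting with `:` swallows the rest of the line as a single trailing parameter, which is why only the first character needs restricting.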

documentation/rfc7700.md
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
+* `('.', 0x2E)` - Denotes a server name.
+* `('!', 0x21)` - Separates username from hostname.
@attilamolnar
attilamolnar Sep 15, 2016 Member

Same as previously

documentation/rfc7700.md
+Hostnames cannot contain the following charactes:
+
+* `(' ', 0x20)` - Separates parameters.
+* `(':', 0x3A)` - Separates trailing parameter.
@attilamolnar
attilamolnar Sep 15, 2016 edited Member

IPv6 IPs need : in the hostname (spotted by @jobe1986)

@jwheare
jwheare Sep 15, 2016 edited Member

They even technically can have it as the first character. Which is a bit problematic for 352 RPL_WHOREPLY and 311 RPL_WHOISUSER.

@attilamolnar
attilamolnar Sep 15, 2016 Member

Servers add a 0 prefix to IPv6 IPs beginning with : so that's not a problem.
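
A minimal sketch of that prefixing, assuming a hypothetical `wire_host` helper applied just before a host is written into a reply (the exact place servers do this is not stated in this thread):

```python
import ipaddress


def wire_host(host: str) -> str:
    """Prefix '0' to hosts starting with ':' so they cannot be read as a trailing parameter."""
    return "0" + host if host.startswith(":") else host


loopback = wire_host("::1")
# The prefixed form still parses as the same IPv6 address.
same_address = ipaddress.ip_address(loopback) == ipaddress.ip_address("::1")
```

Hosts that do not begin with `:` pass through unchanged.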

@DanielOaks
DanielOaks Sep 15, 2016 Member

My mistake, meant to remove those with another change. This has been removed.

@attilamolnar
Member

@DanielOaks Could you add some examples, particularly ones that illustrate how comparisons work?

documentation/rfc7700.md
+* `(' ', 0x20)` - Separates parameters.
+* `(',', 0x2C)` - Used as a separator.
+* `('*', 0x2A)` - Used in mask matching.
+* `('?', 0x3F)` - Used in mask matching.
@jwheare
jwheare Sep 15, 2016 edited Member

Is mask matching in channel names a thing? * and ? are valid channel characters at the moment, this seems overly restrictive. (spotted by @jobe1986)

@DanielOaks
DanielOaks Sep 15, 2016 Member

Cool, removed those

@SaberUK
SaberUK Sep 15, 2016 Contributor

We (InspIRCd) use glob matching on channel names in various places.

@jwheare
jwheare Sep 15, 2016 Member

@SaberUK Example? Do you also forbid those characters in channel names or is there just no way to specify them without accidentally over-globbing?

Actually, I just tested and was able to create a channel on Insp with both * and ?. I think the recommendation should probably not go against existing valid characters.

@SaberUK
SaberUK Sep 15, 2016 Contributor

@jwheare We don't presently forbid them, although they are used in various places, e.g.

https://github.com/inspircd/inspircd/blob/master/docs/conf/modules.conf.example#L714

This does unfortunately result in some problems like what you mentioned though.
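
The over-globbing problem can be illustrated with a short sketch. It uses Python's `fnmatchcase` as a stand-in for a server's glob matcher, which is an assumption: `fnmatch` also treats `[` specially, so the example avoids that character, and the channel list is made up.

```python
from fnmatch import fnmatchcase

# Hypothetical channel list, including names that contain glob characters.
channels = ["#a*", "#abc", "#a?", "#ab"]


def matching(pattern: str, names: list[str]) -> list[str]:
    """Return every name the glob pattern matches (case-sensitive)."""
    return [name for name in names if fnmatchcase(name, pattern)]


star_hits = matching("#a*", channels)
question_hits = matching("#a?", channels)
```

Trying to refer to the channel literally named `#a*` matches every channel beginning with `#a`, so there is no way to single it out unless the matcher offers an escape syntax.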

@grawity
Contributor
grawity commented Sep 15, 2016 edited

For clarification: Does PRECIS affect only comparisons or display as well? If it affects display, does the PRECIS case-folding rule mean that it's impossible to use mixed-case nicknames (since they get mapped to lowercase)?

@attilamolnar
Member

@grawity As I understand it, it only affects comparisons. If adopting it meant losing upper case characters in nicks then it would be a step backwards.

documentation/rfc7700.md
+* `('6', 0x36)` - Disallowed.
+* `('7', 0x37)` - Disallowed.
+* `('8', 0x38)` - Disallowed.
+* `('9', 0x39)` - Disallowed.
@attilamolnar
attilamolnar Sep 15, 2016 Member

Not allowing numbers as the first char of a nick shouldn't be in the spec for these reasons:

  • Servers already change the nick of clients to nicks starting with a number e.g. in case of collision and with this restriction that is a violation of the spec.
  • Presently most (or all) servers don't allow nicks starting with numbers but in the future servers should be able to relax this restriction without updating the casemapping.
@attilamolnar
attilamolnar Sep 15, 2016 Member

There's nothing stopping servers from accepting a subset of nicks allowed by this spec (they can send an invalid nick numeric for any nick they don't like) so servers can still disallow digits if they want but they cannot allow more nicks than what this spec allows. Also clients must be prepared to see nicks starting with digits.

@DanielOaks
Member
DanielOaks commented Sep 15, 2016 edited

@grawity @attilamolnar Correct, PRECIS (and casemapping) does not affect display, similarly to how casemapping works currently.

@DanielOaks
Member

Made the Disallowed Characters section recommended instead of required, as suggested by @attilamolnar, threw * and ? back into the usernames section, various other minor edits of the copy.

@M2Ys4U
Contributor
M2Ys4U commented Sep 15, 2016 edited

PRECIS (RFC 7564) defines two classes, IdentifierClass and FreeformClass; the Nickname profile (RFC 7700) builds upon the latter.

To quote from 7564 (with emphasis added by me):

IdentifierClass: a sequence of letters, numbers, and some symbols that is used to identify or address a network entity such as a user account, a venue (e.g., a chatroom), an information source (e.g., a data feed), or a collection of data (e.g., a file); the intent is that this class will minimize user confusion in a wide variety of application protocols, with the result that safety has been prioritized over expressiveness for this class.

FreeformClass: a sequence of letters, numbers, symbols, spaces, and other characters that is used for free-form strings, including passwords as well as display elements such as human-friendly nicknames for devices or for participants in a chatroom; the intent is that this class will allow nearly any Unicode character, with the result that expressiveness has been prioritized over safety for this class. Note well that protocol designers, application developers, service providers, and end users might not understand or be able to enter all of the characters that can be included in the FreeformClass -- see Section 12.3 for details.

With that context out of the way, here's my question:

Should we be re-using the Nickname profile (RFC 7700) for channel names as well as nicks and usernames?

It would make more sense to me to restrict channel names to the IdentifierClass, however I can see the appeal of using a single algorithm for all IRC identifiers.

@DanielOaks
Member
DanielOaks commented Sep 15, 2016 edited

That's a good point... Using multiple algorithms (one for chans, one for nicks, and/or something similar) is, imo, just begging for trouble, but I'll certainly take a closer look at that. Thanks for pointing it out.

documentation/rfc7700.md
+
+With the large numbers of new characters allowed comes the risk of introducing confusion for users. The PRECIS framework (much like the earlier framework [stringprep](https://tools.ietf.org/html/rfc3454)) aims to avoid this through mapping confusable characters to a single base character, and by allowing specific known-good characters.
+
+The PRECIS framework represents the most modern standardized solution today for doing this sort of mapping and handling of internationalized names, and should mitigate most of the issues around this.
@M2Ys4U
M2Ys4U Sep 15, 2016 Contributor

I think this is a highly misleading statement.

Reading Section 12.5 (Security Considerations - Visually Similar Characters) of RFC 7564 it says:

Because PRECIS-compliant strings can contain almost any properly encoded Unicode code point, it can be relatively easy to fake or mimic some strings in systems that use the PRECIS framework. The fact that some strings are easily confused introduces security vulnerabilities of the kind that have also plagued the World Wide Web, specifically the phenomenon known as phishing.

[...]

Because it is impossible to map visually similar characters without a great deal of context (such as knowing the font families used), the PRECIS framework does nothing to map similar-looking characters together, nor does it prohibit some characters because they look like others.

[...]

The challenges inherent in supporting the full range of Unicode code points have in the past led some to hope for a way to programmatically negotiate more restrictive ranges based on locale, script, or other relevant factors; to tag the locale associated with a particular string; etc. As a general-purpose internationalization technology, the PRECIS framework does not include such mechanisms.

@DanielOaks
DanielOaks Sep 15, 2016 Member

Unless I'm mistaken, I believe this would be covered by the rules of the Nickname profile itself here (specifically, 3+4+5). Regardless, I'll have another read over both those documents and probably adjust the text here to make it more clear exactly what I'm referring to, thanks for pointing this out.
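
To show what case mapping and normalization do and do not catch, here is a small sketch. It approximates those rules with the standard library (NFKC plus case folding), which is an assumption and not a full implementation of the Nickname profile.

```python
import unicodedata


def approx_fold(name: str) -> str:
    """Assumed approximation of the case-mapping and normalization rules."""
    return unicodedata.normalize("NFKC", name).casefold()


latin = "paypal"
fullwidth = "\uff30aypal"       # starts with U+FF30 FULLWIDTH LATIN CAPITAL LETTER P
cyrillic = "p\u0430yp\u0430l"   # uses U+0430 CYRILLIC SMALL LETTER A

compat_collapses = approx_fold(fullwidth) == approx_fold(latin)
lookalike_collapses = approx_fold(cyrillic) == approx_fold(latin)
```

Compatibility variants such as fullwidth letters fold onto the plain form, but cross-script lookalikes such as the Cyrillic `а` stay distinct, which matches the passage of RFC 7564 quoted above.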

documentation/rfc7700.md
+
+Names being prepared MUST apply the following rules in the order shown:
+
+1. Preperation using the PRECIS [Nickname profile][precis].
@SaberUK
SaberUK Sep 15, 2016 Contributor

s/Preperation/Preparation/

documentation/rfc7700.md
+ period: "2016"
+ email: "daniel@danieloaks.net"
+---
+This document describes a unicode-aware casemapping for IRC, based on the recommendations in [RFC 7700](https://tools.ietf.org/html/rfc7700).
@SaberUK
SaberUK Sep 15, 2016 Contributor

Unicode is a proper noun so it should be capitalised.

@jwheare jwheare added the protocol label Jan 7, 2017
@jwheare jwheare modified the milestone: Roadmap Jan 7, 2017
@DanielOaks DanielOaks changed the title from Add document for rfc7700 casemapping to Add document for Unicode casemapping Jan 13, 2017
@DanielOaks DanielOaks unicode_casemapping: 7700 -> 7613. Now using UsernameCaseMapped.
Using an IdentifierClass, as pointed out by @M2Ys4U, is much better than using a FreeformClass.
0d6435d
@DanielOaks
Member

Yo @M2Ys4U, now using UsernameCaseMapped (an IdentifierClass profile) for everything. In my tests... seems to work fine, and if it's better locked-down than the Nickname class then all the better.

@CarrotCodes
CarrotCodes commented Jan 16, 2017 edited

On IRC we discussed the use case of emoji in channel names (#🥕 for example) - irccloud and others allow this in production right now. It seems UsernameCaseMapped might disallow such channel names.

@DanielOaks is investigating the difficulty of a custom precis profile that permits such modifications to other profiles.
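
A rough way to see why emoji fall outside an IdentifierClass profile: the sketch below accepts only printable ASCII (U+0021 to U+007E) and the LetterDigits general categories listed in RFC 7564. It is a deliberate simplification (it ignores exceptions, contextual rules, and compatibility characters), and `roughly_identifier_class` is a hypothetical helper, not a PRECIS implementation.

```python
import unicodedata

# General categories that make up LetterDigits in RFC 7564.
LETTER_DIGITS = {"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"}


def roughly_identifier_class(name: str) -> bool:
    """Approximate IdentifierClass check: printable ASCII or a letter/digit category."""
    return all(
        0x21 <= ord(ch) <= 0x7E or unicodedata.category(ch) in LETTER_DIGITS
        for ch in name
    )


accented_ok = roughly_identifier_class("#café")
emoji_ok = roughly_identifier_class("#\U0001F955")  # the carrot channel from the comment above
```

Emoji are symbols rather than letters or digits, so a channel such as `#🥕` fails this check, whereas a FreeformClass profile would admit it.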
