Add document for Unicode casemapping #272

Open
wants to merge 11 commits into
from

@DanielOaks
Member

DanielOaks commented Sep 15, 2016

Unicode names have been wanted for a while, and have been used in experimental implementations as well as in certain bouncers integrating with other messaging systems.

This document outlines a method based on RFC 7700 which should represent a reasonable, modern solution for those projects that wish to allow unicode characters and casemap them appropriately.

There's previous discussion around this in #259.

This casemapping does not specify any sort of backwards-compatibility measures. Being compatible with clients and servers that cannot correctly handle Unicode has been brought up many times during discussions about Unicode casemappings. Below are some of the most reasonable suggestions, and why I haven't included them in this specification:

Encoding names so non-rfc7700 servers can accept them

This suggestion revolves around the client encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the server (allowing them to be used on every server out there today). When receiving these encoded names, other unicode-aware clients will decode them to their proper unicode counterpart before displaying them.

Pros

  • Unicode nicknames and channel names can be used on servers that don't natively support unicode.
  • Non-unicode-aware clients can connect to servers that are unicode-aware.

Cons

  • Non-unicode-aware servers will allow nicknames that look like duplicates, due to the encoding required and the server not being able to enforce the name preparation described above.
  • Possible duplication of names by encoding names that contain only irc-friendly characters (or otherwise, strict client-side checking that is likely to be misinterpreted or misimplemented).

Because of the security implications this would bring up, I think this is an extremely bad idea.
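
To illustrate the first con, here is a minimal sketch. The proposal does not name an encoding, so Punycode and the helper name `encode_for_legacy_server` are assumptions used purely for illustration. The point is that a legacy server sees only the encoded ASCII, so two names that a Unicode-aware casemapping treats as equal can be registered side by side.

```python
def encode_for_legacy_server(name):
    """Hypothetical client-side encoding; Punycode is an assumed stand-in."""
    return name.encode("punycode").decode("ascii")


upper = "Ärger"
lower = "ärger"

# A Unicode-aware comparison treats the two names as the same name.
same_when_unicode_aware = upper.casefold() == lower.casefold()

# A legacy server only sees the encoded ASCII and, at most, folds ASCII case.
encoded_upper = encode_for_legacy_server(upper).lower()
encoded_lower = encode_for_legacy_server(lower).lower()
same_on_legacy_server = encoded_upper == encoded_lower

print(same_when_unicode_aware, same_on_legacy_server)
```

Any other ASCII-only encoding would likely show the same effect, since the server cannot apply the name preparation to text it cannot decode.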

Encoding names so that non-unicode-aware clients can accept them

This suggestion revolves around the server encoding nicknames and channel names into currently IRC-friendly characters before it sends them to the client (allowing them to be accepted by every client out there). When receiving these encoded names, unicode-aware clients will decode them into their proper unicode counterpart before displaying them.

Pros

  • We can be assured that any client, including ones that can't do unicode, will be able to accept the names.

Cons

  • Due to the encoding, these encoded nicknames are not going to be easily readable by non-unicode-aware clients, and are going to appear as a blob of unreadable text.
  • Even if we only encode names that contain special characters, that complicates message sending in ways that are likely to irritate server authors into not implementing this.
  • The decoding/encoding required by this (particularly if only certain names are encoded) complicates client programming in ways that are likely to be misimplemented.

I don't think this is required because getting this casemapping widely implemented will take time. By the time this casemapping gets into large enough use to warrant worrying about legacy clients, I think a large majority of the clients currently in use will support unicode names without issues. As well, a number of clients already successfully accept unicode names.

Because of the complexity this process adds, and because I expect it to be a non-issue by the time this is implemented, I don't think this should happen. I believe it's more effort than it's worth, and that this measure would cause more problems than it would solve.

documentation/rfc7700.md (outdated)
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('!', 0x21)` - Separates username from hostname.

@attilamolnar

attilamolnar Sep 15, 2016

Member

Fix: separates nickname from username

documentation/rfc7700.md (outdated)
Nicknames cannot contain the following characters:
* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.

@grawity

grawity Sep 15, 2016

Contributor

This & identical entries below seem unnecessary; : only has special meaning as the first character of a parameter.

@DanielOaks

DanielOaks Sep 15, 2016

Member

It does, but since it's already disallowed and I could see it causing possible confusion with libraries that split parameters strangely, figured it was better to disallow it. If we figure it's not required I can definitely remove it though.

@grawity

grawity Sep 15, 2016

Contributor

Might as well forbid : in privmsgs, topics, etc. A library blindly splitting on : is not "strange", it's outright buggy.

@DanielOaks

DanielOaks Sep 15, 2016

Member

Hmm, that's fair. In that case, I can just note that the first letter of one can't be :? (since if, e.g., a nickname started with : then you wouldn't be able to use it in normal messages)

@grawity

grawity Sep 15, 2016

Contributor

Yeah, it should be fine under the "first character" list. (For channels it's already implied by CHANTYPES.)
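
The point made in this thread, that `:` is only special as the first character of a parameter, can be sketched with a small parameter splitter. This is an illustrative sketch rather than any particular library's parser; the function name `split_params` is hypothetical, and message tags and prefixes are deliberately not handled.

```python
def split_params(line):
    """Split the command and parameters of an IRC line (no tags, no prefix)."""
    params = []
    while line:
        if line.startswith(":"):
            # A leading ':' marks the trailing parameter: take the rest verbatim.
            params.append(line[1:])
            break
        head, _, line = line.partition(" ")
        if head:  # skip empty pieces produced by repeated spaces
            params.append(head)
    return params


print(split_params("PRIVMSG #chan :hello: world"))
print(split_params("NICK dan:iel"))
print(split_params("NICK :daniel"))
```

A `:` in the middle of `dan:iel` survives intact, while a name that starts with `:` loses its first character, which is why only the first-character restriction is needed.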

documentation/rfc7700.md (outdated)
These steps MUST happen in the order shown, or else the restricted characters check may miss characters that should be legitimately restricted.
If a name does contain a restricted character (whether disallowed by the [Nickname profile](https://tools.ietf.org/html/rfc7700#section-2.2) or this document), it MUST be rejected by the server and MUST NOT be propogated to other clients. This is done through the appropriate numeric for the command which tried to set or use the invalid name such as `ERR_ERRONEUSNICKNAME`, `ERR_NOSUCHCHANNEL`, or whichever numeric is most appropriate.

@grawity

grawity Sep 15, 2016

Contributor

At this point better use the named link syntax:

The `rfc7700` casemapping uses the PRECIS [Nickname profile][precis] as defined in [Section 2 of RFC 7700][precis].

[precis]: https://tools.ietf.org/html/rfc7700#section-2

@DanielOaks

DanielOaks Sep 15, 2016

Member

Cool, I changed the other links above to use the named link syntax since they were all referring to the same URL. The link here and the link to rfc7700 up the top differ from the others since they're linking to different sections (and the url's only replicated once throughout the doc).

documentation/rfc7700.md (outdated)
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.
* `('.', 0x2E)` - Denotes a server name.
* `('!', 0x21)` - Separates username from hostname.

@attilamolnar

attilamolnar Sep 15, 2016

Member

Same as previously

documentation/rfc7700.md (outdated)
Hostnames cannot contain the following charactes:
* `(' ', 0x20)` - Separates parameters.
* `(':', 0x3A)` - Separates trailing parameter.

@attilamolnar

attilamolnar Sep 15, 2016

Member

IPv6 IPs need : in the hostname (spotted by @jobe1986)

@jwheare

jwheare Sep 15, 2016

Member

They even technically can have it as the first character. Which is a bit problematic for 352 RPL_WHOREPLY and 311 RPL_WHOISUSER.

@attilamolnar

attilamolnar Sep 15, 2016

Member

Servers add a 0 prefix to IPv6 IPs beginning with : so that's not a problem.

@DanielOaks

DanielOaks Sep 15, 2016

Member

My mistake, meant to remove those with another change. This has been removed.
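
The `0` prefix mentioned earlier in this thread can be sketched as follows. The helper name `wire_safe_host` is hypothetical, and real servers may implement this differently; the sketch only shows that the prefixed form is still the same IPv6 address while no longer starting with `:`.

```python
import ipaddress


def wire_safe_host(host):
    """Prefix a leading ':' with '0' so the host cannot be read as a trailing parameter."""
    return "0" + host if host.startswith(":") else host


loopback = wire_safe_host("::1")
print(loopback)

# The prefixed text still parses to the same address.
print(ipaddress.ip_address(loopback) == ipaddress.ip_address("::1"))

# Addresses that do not start with ':' are left untouched.
print(wire_safe_host("2001:db8::1"))
```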

@attilamolnar

attilamolnar Sep 15, 2016

Member

@DanielOaks Could you add some examples, particularly ones that illustrate how comparisons work?

documentation/rfc7700.md (outdated)
* `(' ', 0x20)` - Separates parameters.
* `(',', 0x2C)` - Used as a separator.
* `('*', 0x2A)` - Used in mask matching.
* `('?', 0x3F)` - Used in mask matching.

@jwheare

jwheare Sep 15, 2016

Member

Is mask matching in channel names a thing? * and ? are valid channel characters at the moment, this seems overly restrictive. (spotted by @jobe1986)

@DanielOaks

DanielOaks Sep 15, 2016

Member

Cool, removed those

@SaberUK

SaberUK Sep 15, 2016

Contributor

We (InspIRCd) use glob matching on channel names in various places.

@jwheare

jwheare Sep 15, 2016

Member

@SaberUK Example? Do you also forbid those characters in channel names or is there just no way to specify them without accidentally over-globbing?

Actually, I just tested and was able to create a channel on Insp with both * and ?. I think the recommendation should probably not go against existing valid characters.

@SaberUK

SaberUK Sep 15, 2016

Contributor

@jwheare We don't presently forbid them although they are used in various places like e.g.

https://github.com/inspircd/inspircd/blob/master/docs/conf/modules.conf.example#L714

This does unfortunately result in some problems like what you mentioned though.
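
The over-globbing problem discussed in this thread can be reproduced with a short sketch. It uses Python's `fnmatch` and `glob.escape` as stand-ins; InspIRCd's own matcher and its escaping rules are not shown here and may well differ, so treat the bracket escaping as an assumption specific to this example.

```python
from fnmatch import fnmatchcase
from glob import escape

# Three channels, one of which is literally named "#h*".
channels = ["#h*", "#help", "#hello"]


def matching(pattern):
    """Return the channels that match a glob pattern."""
    return [name for name in channels if fnmatchcase(name, pattern)]


# Meant as the literal channel "#h*", but read as a glob it matches everything.
over_globbed = matching("#h*")

# Escaping the wildcard restricts the match to the literal channel name.
literal_only = matching(escape("#h*"))

print(over_globbed)
print(literal_only)
```

Without some escaping mechanism in the matcher, there is no way to refer to the literal channel alone, which is the trade-off raised above.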

@grawity

grawity Sep 15, 2016

Contributor

For clarification: Does PRECIS affect only comparisons or display as well? If it affects display, does the PRECIS case-folding rule mean that it's impossible to use mixed-case nicknames (since they get mapped to lowercase)?

@attilamolnar

attilamolnar Sep 15, 2016

Member

@grawity As I understand it, it only affects comparisons; if adopting it meant losing upper case characters in nicks then it would be a step backwards.

documentation/rfc7700.md (outdated)
* `('6', 0x36)` - Disallowed.
* `('7', 0x37)` - Disallowed.
* `('8', 0x38)` - Disallowed.
* `('9', 0x39)` - Disallowed.

@attilamolnar

attilamolnar Sep 15, 2016

Member

Not allowing numbers as the first char of a nick shouldn't be in the spec for these reasons:

  • Servers already change the nick of clients to nicks starting with a number e.g. in case of collision and with this restriction that is a violation of the spec.
  • Presently most (or all) servers don't allow nicks starting with numbers but in the future servers should be able to relax this restriction without updating the casemapping.

@attilamolnar

attilamolnar Sep 15, 2016

Member

There's nothing stopping servers from accepting a subset of nicks allowed by this spec (they can send an invalid nick numeric for any nick they don't like) so servers can still disallow digits if they want but they cannot allow more nicks than what this spec allows. Also clients must be prepared to see nicks starting with digits.

@DanielOaks

DanielOaks Sep 15, 2016

Member

@grawity @attilamolnar Correct, PRECIS (and casemapping) does not affect display, similarly to how casemapping works currently.
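
As a rough illustration of "comparison only, not display", the sketch below keeps the name exactly as the user typed it and uses a folded key only for lookups. The `casemap` function here is an assumed stand-in (NFC normalization plus Unicode case folding), not an implementation of any PRECIS profile, and `NickTable` is a hypothetical name.

```python
import unicodedata


def casemap(name):
    """Stand-in comparison key: NFC plus case folding. NOT a PRECIS profile."""
    return unicodedata.normalize("NFC", name).casefold()


class NickTable:
    """Stores display names under their casemapped key."""

    def __init__(self):
        self._by_key = {}

    def register(self, nick):
        """Return True if the nick was free, False if it collides."""
        key = casemap(nick)
        if key in self._by_key:
            return False
        self._by_key[key] = nick  # the display form is stored untouched
        return True

    def display(self, nick):
        """Look up the stored display form for any equivalent spelling."""
        return self._by_key.get(casemap(nick))


table = NickTable()
print(table.register("DanielOaks"))   # first registration succeeds
print(table.register("danieloaks"))   # same key, so this collides
print(table.display("DANIELOAKS"))    # mixed case is preserved for display
```

Under this stand-in, `Straße` and `STRASSE` also share a key, which is the kind of comparison example requested earlier in the conversation.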

@DanielOaks

DanielOaks Sep 15, 2016

Member

Made the Disallowed Characters section recommended instead of required, as suggested by @attilamolnar, threw * and ? back into the usernames section, various other minor edits of the copy.

@M2Ys4U

M2Ys4U Sep 15, 2016

Contributor

PRECIS (RFC 7564) defines two classes, IdentifierClass and FreeformClass; the Nickname profile (RFC 7700) builds upon the latter.

To quote from 7564 (with emphasis added by me):

IdentifierClass: a sequence of letters, numbers, and some symbols that is used to identify or address a network entity such as a user account, a venue (e.g., a chatroom), an information source (e.g., a data feed), or a collection of data (e.g., a file); the intent is that this class will minimize user confusion in a wide variety of application protocols, with the result that safety has been prioritized over expressiveness for this class.

FreeformClass: a sequence of letters, numbers, symbols, spaces, and other characters that is used for free-form strings, including passwords as well as display elements such as human-friendly nicknames for devices or for participants in a chatroom; the intent is that this class will allow nearly any Unicode character, with the result that expressiveness has been prioritized over safety for this class. Note well that protocol designers, application developers, service providers, and end users might not understand or be able to enter all of the characters that can be included in the FreeformClass -- see Section 12.3 for details.

With that context out of the way, here's my question:

Should we be re-using the Nickname profile (RFC 7700) for channel names as well as nicks and usernames?

It would make more sense to me to restrict channel names to the IdentifierClass, however I can see the appeal of using a single algorithm for all IRC identifiers.

@DanielOaks

DanielOaks Sep 15, 2016

Member

That's a good point... Using multiple algorithms (one for chans, one for nicks, and/or something similar), imo is just begging for trouble, but I'll certainly take a closer look at that and read through it, thanks for pointing it out.

documentation/rfc7700.md (outdated)
With the large numbers of new characters allowed comes the risk of introducing confusion for users. The PRECIS framework (much like the earlier framework [stringprep](https://tools.ietf.org/html/rfc3454)) aims to avoid this through mapping confusable characters to a single base character, and by allowing specific known-good characters.
The PRECIS framework represents the most modern standardized solution today for doing this sort of mapping and handling of internationalized names, and should mitigate most of the issues around this.

@M2Ys4U

M2Ys4U Sep 15, 2016

Contributor

I think this is a highly misleading statement.

Reading Section 12.5 (Security Considerations - Visually Similar Characters) of RFC 7564 it says:

Because PRECIS-compliant strings can contain almost any properly encoded Unicode code point, it can be relatively easy to fake or mimic some strings in systems that use the PRECIS framework. The fact that some strings are easily confused introduces security vulnerabilities of the kind that have also plagued the World Wide Web, specifically the phenomenon known as phishing.

[...]

Because it is impossible to map visually similar characters without a great deal of context (such as knowing the font families used), the PRECIS framework does nothing to map similar-looking characters together, nor does it prohibit some characters because they look like others.

[...]

The challenges inherent in supporting the full range of Unicode code points have in the past led some to hope for a way to programmatically negotiate more restrictive ranges based on locale, script, or other relevant factors; to tag the locale associated with a particular string; etc. As a general-purpose internationalization technology, the PRECIS framework does not include such mechanisms.

@DanielOaks

DanielOaks Sep 15, 2016

Member

Unless I'm mistaken, I believe this would be covered by the rules of the Nickname profile itself here (specifically, 3+4+5). Regardless, I'll have another read over both those documents and probably adjust the text here to make it more clear exactly what I'm referring to, thanks for pointing this out.

documentation/rfc7700.md (outdated)
Names being prepared MUST apply the following rules in the order shown:
1. Preperation using the PRECIS [Nickname profile][precis].

@SaberUK

SaberUK Sep 15, 2016

Contributor

s/Preperation/Preparation/

documentation/rfc7700.md (outdated)
period: "2016"
email: "daniel@danieloaks.net"
---
This document describes a unicode-aware casemapping for IRC, based on the recommendations in [RFC 7700](https://tools.ietf.org/html/rfc7700).

@SaberUK

SaberUK Sep 15, 2016

Contributor

Unicode is a proper noun so it should be capitalised.

@jwheare jwheare added the protocol label Jan 7, 2017

@jwheare jwheare modified the milestone: Roadmap Jan 7, 2017

@DanielOaks DanielOaks changed the title from Add document for rfc7700 casemapping to Add document for Unicode casemapping Jan 13, 2017

unicode_casemapping: 7700 -> 7613. Now using UsernameCaseMapped.
Using an IdentifierClass, as pointed out by @M2Ys4U, is much better than using a FreeformClass.

@DanielOaks

DanielOaks Jan 13, 2017

Member

Yo @M2Ys4U, now using UsernameCaseMapped (an IdentifierClass profile) for everything. In my tests... seems to work fine, and if it's better locked-down than the Nickname class then all the better.

@CarrotCodes

CarrotCodes Jan 16, 2017

On IRC we discussed the use case of emoji in channel names (#🥕 for example) - irccloud and others allow this in production right now. It seems UsernameCaseMapped might disallow such channel names.

@DanielOaks is investigating the difficulty of a custom precis profile that permits such modifications to other profiles.
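
For context on why an IdentifierClass profile is likely to reject `#🥕`: as I read RFC 7564, IdentifierClass disallows symbol characters outside ASCII, while FreeformClass permits them. The sketch below only inspects Unicode general categories, which is a rough approximation and not the full PRECIS derived-property calculation; `non_ascii_symbols` is a hypothetical helper name.

```python
import unicodedata

# General categories for symbols: math, currency, modifier, other.
SYMBOL_CATEGORIES = {"Sm", "Sc", "Sk", "So"}


def non_ascii_symbols(name):
    """List the non-ASCII symbol characters in a name (approximation only)."""
    return [
        ch
        for ch in name
        if ord(ch) > 0x7F and unicodedata.category(ch) in SYMBOL_CATEGORIES
    ]


carrot_channel = "#\U0001F955"  # '#' followed by the carrot emoji
print(non_ascii_symbols(carrot_channel))
print(non_ascii_symbols("#carrots"))
```

A custom profile that permits emoji would need to carve out such characters explicitly, which is the modification being investigated.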

@syzop

syzop Nov 19, 2017

Hmm. I can't find any C library that has PRECIS and those profiles. But that could also be my current lack of knowledge with regards to unicode (and utf8). In any case, the availability of a library or drop-in code that various IRCd's could use for checking "is this nick permitted?" and "are these nicks the same?" would make implementing this much more doable, possibly even crucial for success. And of course, not just for IRC servers but also for services and (I suppose) clients.

Also, I read that as of October 2017 RFC8265 obsoletes RFC7613 and RFC8266 obsoletes RFC7700.

@DanielOaks

DanielOaks Nov 19, 2017

Member

Yeah, there's some trouble with this approach around confusable characters, so I've got this specification 'on hold' until I work out those issues. Once I've got those issues worked out I'll change this spec from 7613 to one of the newer RFC numbers.

To be specific, PRECIS doesn't in any way attempt to map confusable characters to a single codepoint. Well, it does, really, but only certain confusable characters, and not others. Which means you can actually get two nicknames that look exactly the same following this method. See also, section 12.5 - Visually Similar Characters of RFC 7564.
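
A small sketch of the "only certain confusable characters" behaviour described here. It uses NFKC normalization plus case folding as an approximation of the width and case mapping steps, which is an assumption for illustration and not a PRECIS implementation; `approx_fold` is a hypothetical name.

```python
import unicodedata


def approx_fold(name):
    """Approximation only: NFKC (covers width mapping) plus case folding."""
    return unicodedata.normalize("NFKC", name).casefold()


latin = "paypal"
fullwidth = "\uFF50aypal"  # starts with FULLWIDTH LATIN SMALL LETTER P
cyrillic = "p\u0430ypal"   # second letter is CYRILLIC SMALL LETTER A

# Width variants are folded together with their ASCII counterparts...
fullwidth_collides = approx_fold(fullwidth) == approx_fold(latin)

# ...but cross-script lookalikes remain distinct names.
cyrillic_collides = approx_fold(cyrillic) == approx_fold(latin)

print(fullwidth_collides, cyrillic_collides)
```

The second pair typically renders identically, yet compares as two different names, which is the gap described in section 12.5 of RFC 7564.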

@syzop

syzop Nov 23, 2017

I only saw your edit just now:

To be specific, PRECIS doesn't in any way attempt to map confusable characters to a single codepoint. Well, it does, really, but only certain confusable characters, and not others. Which means you can actually get two nicknames that look exactly the same following this method.

That is disappointing. I guess I misunderstood what PRECIS does then (could be because I didn't read it :D). I must say that the Security Considerations in your draft gave me a bit of a false sense of security.. it starts with saying it has considerable security impact but then outlines the avoiding of confusing characters etc. etc... it sounded quite reassuring. So you may want to reword that or, better, see if a solution is possible (see next).

I think for something workable on IRC you would have to "solve" the problem of identical looking UTF8 nicks as well. Or give suggestions about what should be done in the IRCd. Don't you agree?
As a UTF8 noob I'm not really in the position to do this but perhaps suggesting only to allow certain scripts or only certain specific combinations...

@DanielOaks

DanielOaks Nov 24, 2017

Member

For sure, yeah. I assumed PRECIS protected against that as well (because it does map a fair few of those characters together, just not the identical-looking ones). I wrote up the spec, then someone demonstrated certain pairs of characters that look identical, but the PRECIS UsernameCaseMapped profile keeps separate, so yeah.

Don't you love Unicode?

I'm planning on something along those lines as well, similar to the PRECIS suggestions around possibly only allowing one script or similar (as much as that feels like a copout).

syzop Dec 6, 2017

@DanielOaks: I tried to contact you a while back via email (27 Nov 11:53 UTC) from syzop@vunscan.org. I could put part of that here in the open:

I've added experimental UTF8 support in set::allowed-nickchars in UnrealIRCd, which lets the admin allow certain UTF8 characters in nick names. In the release notes I mention that, like the original set::allowed-nickchars, it does not do any special CASEMAPPING or "similar looking character detection", and I sum up the known problems with the lack of such support. I also noticed that, for example, anope does not seem to allow such characters, which further limits the current use.
So, I'm not happy with the present state. In practice, for serious networks, it's not really usable. It's more of an experimental thing so users can play around, hoping to get that UTF8 ball rolling a bit. It's sad to see that UTF8 nick name support is still lacking in IRC in 2017.

In my opinion the goal of IRCv3, or in any case the IRC community in general, should be to add a new CASEMAPPING in some standard way/library/tables so the same casemapping (and other stuff PRECIS does) is applied the same way to IRC servers and services (and clients). If every software implementation is going to choose its own casemapping it's rather annoying and confusing. This is especially notable in the servers vs services case where, for example, account names are compared. The spec is just as important as having common code/lib/implementations.

In my email to you I also ask for some technical suggestions with regards to that.
Just checking you received it. If you did and don't think you have anything useful to reply, don't want to or don't have time, that's fine too of course. Just checking.. would be a pity if an opportunity for collaboration would be missed just by some misunderstanding / some mail ending up in Junk mail.

DanielOaks Dec 6, 2017

Member

Heyo @syzop! Sorry for not responding, emails have fallen behind a little with lots going on at work and home. I'm thinking of changing this proposal slightly to better integrate with existing servers that use something like CASEMAPPING=rfc1459, as well as clarifying more precisely in the spec the issues with this folding method and how to avoid those issues.

Totally agree with the standard way to do such a thing, that's been mostly the intention of this since it started coming up and since I threw this proposal in.

I'll respond to your email and either later this week or over the weekend throw those changes into this specification to clear things up and make it easier for existing servers to implement. Thanks for the push with this and I'm excited to see what comes of Unreal's new experimental char support :)
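
For anyone following along, this is roughly what existing servers advertising CASEMAPPING=rfc1459 do today. It is a minimal sketch with a hypothetical function name, not code from any particular IRCd: ASCII letters are lowercased, and `[]\~` are treated as the uppercase forms of `{}|^`. Everything outside ASCII passes through untouched, which is the gap a Unicode-aware mapping has to fill.

```python
# rfc1459 casemapping: A-Z fold to a-z, and [ ] \ ~ fold to { } | ^
_RFC1459_TABLE = str.maketrans(
    "ABCDEFGHIJKLMNOPQRSTUVWXYZ[]\\~",
    "abcdefghijklmnopqrstuvwxyz{}|^",
)

def rfc1459_fold(name: str) -> str:
    # Hypothetical helper: returns the form used to compare two names.
    return name.translate(_RFC1459_TABLE)

print(rfc1459_fold("Nick[away]"))      # nick{away}
print(rfc1459_fold("\u00c9milie"))     # non-ASCII 'É' is left as-is
```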

syzop Dec 6, 2017

Great. And no problem at all! Glad to see your continued interest and look forward to working with you.

DanielOaks added some commits Dec 25, 2017

unicode_casemapping: 7613 -> 8265. UTF8MAPPING to allow better compatibility with servers, laid out Visually Similar Characters section

jwheare Jan 3, 2018

Member

s/preperation/preparation/g

Should there be a way to specify an allowed list of characters/sets as described in the visually similar section? Another ISUPPORT token?

DanielOaks Jan 3, 2018

Member

Thanks, will fix that spelling issue.

Nah, no way to do something like that. It'd end up being a multi-KB (maybe even MB) blob sent to the client on connection/registration, and we'd be inventing the format from scratch. The server can simply refuse that channel/nick name, with an appropriate error message in the numeric if it wants to.

jwheare Jan 3, 2018

Member

Fair enough. Just wondering if something like Unreal's allowed-nickchars labels might work instead of a huge list of characters: https://www.unrealircd.org/docs/Nick_Character_Sets

hebrew-utf8, etc aren't really standard labels but maybe something similar exists?

DanielOaks Jan 3, 2018

Member

You could try, but they wouldn't be able to be used as anything more than a rough "Hey this is what we sorta allow".

Since the issue is around confusable characters, standard character sets wouldn't be enough on their own: you'd also need the ability to list explicit characters, and to say which specific combinations of characters are allowed together and which aren't.

For instance, in my (admittedly fairly lax) server I'm looking at an approach that combines the confusables lists distributed by Unicode with character sets, somewhat heuristically, to determine whether specific names are allowed. Those are the sorta details you can't really codify.

Given the above, I think just plain leaving it to the server and them explicitly telling the user why that nick/channel name isn't allowed would be the best option.
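
As a rough illustration of the kind of server-side check meant here, below is a small Python sketch of a single-script restriction. It is an assumption-laden toy, not what any server actually ships: Python's standard library does not expose the Unicode Script property, so `scripts_used` (a hypothetical helper) guesses the script from the first word of each character's Unicode name. A real implementation would use Unicode's script data and confusables list instead.

```python
import unicodedata

def scripts_used(name: str) -> set:
    # Crude heuristic: take the first word of the character name,
    # e.g. "LATIN SMALL LETTER A" -> "LATIN". Non-letters are ignored.
    scripts = set()
    for ch in name:
        if ch.isalpha():
            scripts.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return scripts

def is_single_script(name: str) -> bool:
    # Allow a name only if all of its letters appear to share one script.
    return len(scripts_used(name)) <= 1

print(is_single_script("paypal"))        # all Latin
print(is_single_script("p\u0430ypal"))   # Latin mixed with Cyrillic U+0430
```

A check like this rejects mixed-script lookalikes but does nothing about names written entirely in one script that imitates another, which is part of why the exact policy is hard to put in a spec.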

DanielOaks Feb 12, 2018

Member

As a note, there's a running implementation of this on testnet.oragono.io for anyone interested in giving it a shot and seeing how it works.

It should be noted that aside from the banned characters in the spec, it doesn't implement any of the additional recommended protections (it's planned, just a lot of work to either build up whitelists or build up some sort of blacklist system based around Unicode's confusables list). This means it's pretty simple to get similar-looking nicks using this homograph attack generator, but that's just what you get when you don't implement proper protections. I don't consider this a spec issue because the spec leaves those specific protection mechanisms up for debate. We can't legislate a single good one because there isn't any defined good one. Even browsers change how they do it regularly, and they've had to deal with it a lot longer than us.
