Input format of idn-email and idn-hostname (and presumably irn and irn-reference, too) #247

Open
danini-the-panini opened this issue Nov 1, 2021 · 5 comments
Labels
question Further information is requested

Comments

@danini-the-panini
Contributor

I'm wondering, what is the expected input for any of the internationalized email/hostname/url types?

I'm busy building the type annotation parsing for kdl-rb and I'm not really sure how to go about this.

I started with taking idn-hostname in as the punycode format (e.g. xn--9ckb.com) and converting it to unicode, but then when I got to email, it seems the local part is kept as-is, regardless of the domain. So I'm wondering, which of the following would we expect to see in a document?

ASCII format?

node (idn-hostname)"xn--9ckb.com"

or Unicode format?

node (idn-hostname)"ツッ.com"
@danini-the-panini danini-the-panini added the question Further information is requested label Nov 1, 2021
@tabatkins
Contributor

tabatkins commented Nov 4, 2021

I skipped these built-ins for kdl-py partially because I wasn't sure how to parse them, but reviewing https://datatracker.ietf.org/doc/html/rfc5890, I think the most reasonable answer is that it should allow both a-labels (all ascii, with the xn-- prefix) and u-labels (unicode NFC, must contain at least one non-ASCII character). If your lang doesn't have a built-in for this type (which, uh, none that I know of do), I'd encourage your result class to support either form in the output.
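
A minimal sketch of such a result class in Python, assuming nothing about how kdl-py or kdl-rb actually do it: the `IdnHostname` name and the `as_ascii` / `as_unicode` properties are hypothetical. It relies only on the standard library's raw `punycode` codec and skips the IDNA mapping and validation steps entirely, so it illustrates "accept either form, expose both" rather than a conforming implementation.

```python
from dataclasses import dataclass


def label_to_ascii(label: str) -> str:
    """Return the xn-- form of one label; ASCII labels pass through unchanged."""
    if label.isascii():
        return label
    return "xn--" + label.encode("punycode").decode("ascii")


def label_to_unicode(label: str) -> str:
    """Return the Unicode form of one label; labels without xn-- pass through."""
    if label[:4].lower() == "xn--":
        return label[4:].encode("ascii").decode("punycode")
    return label


@dataclass(frozen=True)
class IdnHostname:  # hypothetical result class, not part of any KDL library
    source: str  # the string exactly as it appeared in the document

    @property
    def as_ascii(self) -> str:
        # Convert label by label so mixed hostnames work too.
        return ".".join(label_to_ascii(part) for part in self.source.split("."))

    @property
    def as_unicode(self) -> str:
        return ".".join(label_to_unicode(part) for part in self.source.split("."))
```

Keeping `source` untouched means a document can be written back out in whichever form the author used.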

Meanwhile, the hostname type should allow exactly the NR-LDH labels (that is, all-ASCII, with the third and fourth characters not being "--", and the length of each segment <= 63 characters).
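
The label rules described above could be sketched roughly like this in Python. The function names are made up for illustration, and the check is deliberately shallow: it looks only at the shape of each label (ASCII or not, `xn--` prefix, `--` in positions three and four, length, NFC) and does not verify that an `xn--` label decodes to valid Punycode or passes the full IDNA2008 rules.

```python
import re
import unicodedata

# Letters, digits, hyphen; no leading or trailing hyphen; 1 to 63 characters.
LDH_LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def classify_label(label: str) -> str:
    """Return 'nr-ldh', 'a-label', 'u-label' or 'invalid' for one label."""
    if label.isascii():
        if not LDH_LABEL.fullmatch(label):
            return "invalid"
        if label[2:4] == "--":
            # "--" in positions 3-4 is reserved; only xn-- is treated as an A-label here.
            return "a-label" if label[:4].lower() == "xn--" else "invalid"
        return "nr-ldh"
    # Non-ASCII: only accept it as a U-label candidate if it is already in NFC.
    if unicodedata.normalize("NFC", label) != label:
        return "invalid"
    return "u-label"


def is_hostname(value: str) -> bool:
    """Plain hostname: every label must be an NR-LDH label."""
    return all(classify_label(part) == "nr-ldh" for part in value.split("."))


def is_idn_hostname(value: str) -> bool:
    """idn-hostname: NR-LDH labels, A-labels and U-labels are all accepted."""
    return all(classify_label(part) != "invalid" for part in value.split("."))
```

With this sketch both `xn--9ckb.com` and `ツッ.com` are accepted as `idn-hostname`, while only all-NR-LDH names such as `example.com` are accepted as `hostname`.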

@tabatkins
Contributor

I've been looking into the email RFCs and, uh, I can't make heads nor tails of them.

I presume they're saying that email vs idn-email use the same distinction for the part after the @, but I haven't a clue about the part before.

@rustonaut

rustonaut commented Nov 6, 2022

> but I haven't a clue about the part before.

The local part is pretty much up to the interpretation of the email server and should be forwarded "as-is" (as utf8 string).

Theoretically there are some syntax limitations, but a lot more is allowed than people realize:

  1. Assuming internationalized mail is supported: any non-US-ASCII character is allowed anywhere alphabetic characters are allowed in the next two points; this includes non-US-ASCII whitespace.

  2. Without quoting, any of ALPHA, DIGIT, and !#$%&'*+-/=?^_`{|}~ is allowed. A dot (.) is allowed too, but not as the first or last character, and it must not repeat (e.g. no a..b, i.e. only a single dot between two non-dot characters).

  3. With quoting, it starts and ends with a " and allows any ASCII printable character or space except \ and ", plus an escape mechanism using \ (followed by any ASCII printable character, space, backslash, or double quote).
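
As a rough illustration of the three points above, here is a Python sketch of a local-part syntax check. The function name and the `internationalized` flag are hypothetical, and "any non-US-ASCII character" is approximated as the whole range U+0080 to U+10FFFF. It checks syntax only; as noted elsewhere in this thread, what a local part means is up to the receiving server.

```python
import re

# Point 2: characters allowed without quoting (dot is handled separately).
ATEXT = r"A-Za-z0-9!#$%&'*+\-/=?^_`{|}~"
# Point 1: with internationalized mail, non-ASCII characters are allowed too.
NON_ASCII = r"\u0080-\U0010FFFF"


def is_valid_local_part(local: str, internationalized: bool = True) -> bool:
    extra = NON_ASCII if internationalized else ""
    atext = ATEXT + extra
    # Unquoted form: runs of allowed characters separated by single dots.
    dot_atom = rf"[{atext}]+(?:\.[{atext}]+)*"
    # Point 3: quoted form. Printable ASCII or space except \ and ",
    # or a backslash followed by any printable ASCII character or space.
    quoted = rf'"(?:[ !#-\[\]-~{extra}]|\\[ -~{extra}])*"'
    return re.fullmatch(rf"(?:{dot_atom}|{quoted})", local) is not None
```

For example, `foo.bar` and `"a b"` pass, while `a..b`, `.a` and an unquoted `a b` are rejected.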

There is no (relevant) standard for encoding UTF-8 local parts into US-ASCII. Some servers did support Punycode in local parts, but that is non-standard and is equivalent to a mail alias, similar to how for some, but not all, mail servers foo.bar@... is the same as foobar@....

Furthermore as far as I remember servers don't have to treat "a"@... and a@... as semantically equivalent.

Practically, things are also much messier: quite a lot of software imposes arbitrary limitations on the local part that do not conform to the standard, for example:

  1. some systems don't accept UTF-8 email addresses, kinda insane given that they are very common in Asia

  2. some systems reject quoted local parts, I have never seen quoted local parts having any relevance so this is fine, might even be a good idea

  3. some systems reject some of the allowed characters, e.g. any of !#$%&'*/=?^`{|}, which probably is fine too, maybe. Though rejecting +, _, - or ~ is inviting trouble.

  4. some systems accept things that are not part of the standard; it's normally a bug, but it still exists

Additionally, for the host name any (syntactically) valid domain is an option, but converting from/to Punycode on the fly can lead to problems for in-transit edge cases, so it shouldn't be done there; it should be done only when displaying the domain to users. Also, while mail addresses directly on a top-level domain are rare, they do still exist.

Lastly, mail theoretically allows a host part which is not a domain, e.g. you can send mail to an IP address. This was used in some data-center-internal networking systems, but has lost most(?) relevance today, so it's often fine to ignore that syntax.

All in all, for many systems it's best to treat mail addresses as an opaque string, where you might check that the part after the last @ is a syntactically valid domain and might Punycode-decode it for displaying it to users (but only then). And then do mail validation like always, i.e. send a "click this link to validate mail" link.
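
A minimal Python sketch of that approach, with hypothetical function names: split on the last @, do a shallow syntax check on the host, and decode `xn--` labels only when producing a display string. It uses the standard library's raw `punycode` codec, does not validate address literals such as `[127.1.2.3]` beyond their brackets, and leaves the stored address untouched.

```python
import re

# Shallow LDH check: letters, digits, hyphen, no hyphen at either end, max 63 chars.
LABEL = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,61}[A-Za-z0-9])?")


def split_address(address: str) -> tuple[str, str]:
    """Split at the LAST @, since a quoted local part may itself contain @."""
    local, at, host = address.rpartition("@")
    if not at or not local or not host:
        raise ValueError(f"not a mail address: {address!r}")
    return local, host


def host_looks_valid(host: str) -> bool:
    if host.startswith("[") and host.endswith("]"):
        return True  # address literal; contents are not checked in this sketch
    for label in host.split("."):
        if label.isascii():
            if not LABEL.fullmatch(label):
                return False
        elif not label:
            return False
    return True


def display_address(address: str) -> str:
    """Decode xn-- labels for display only; never store or send this form."""
    local, host = split_address(address)
    shown = []
    for label in host.split("."):
        if label[:4].lower() == "xn--":
            try:
                label = label[4:].encode("ascii").decode("punycode")
            except UnicodeError:
                pass  # leave labels that do not decode exactly as they are
        shown.append(label)
    return local + "@" + ".".join(shown)
```

The local part is passed through unchanged in every case, which matches the "forward as-is" advice above.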

Hope that helps.

@rustonaut

rustonaut commented Nov 6, 2022

So yes: "(ノ°Д°)ノ︵ ┻━┻"@[127.1.2.3] is theoretically a valid mail address 😉

Edit: Or (ノ°Д°)ノ︵┻━┻@[127.1.2.3] using unicode brackets to avoid needing quoting.

@rustonaut

The relevant rfcs are:

Also, theoretically the mail syntax used by MIME (the email payload format) allows even more unusual edge cases no one cares about. But any mail which isn't compatible with SMTP can't be sent, so it doesn't matter.
