IDN support #85

Open
Johann150 opened this issue Nov 7, 2021 · 5 comments
@Johann150
Contributor

IDN (Internationalized Domain Name) support is not yet implemented when parsing domain names.

However, there is a concern that adding it could break existing posts. For example, the following could be parsed incorrectly:

@syuilo@misskey.ioありがとう 

One idea would be to recognize IDNs only if they are followed by a space, but that might not always work.
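
The ambiguity can be seen with two simplified patterns. This is only a rough sketch, not the parser's actual grammar: both regular expressions and their names (`asciiOnly`, `unicodeAware`) are made up for illustration.

```javascript
// The post from the example above, without the trailing space.
const post = '@syuilo@misskey.ioありがとう';

// Hypothetical pattern: host limited to ASCII letters, digits, dots and hyphens.
const asciiOnly = /@([a-z0-9_]+)@([a-z0-9.-]+)/i;

// Hypothetical pattern: host may also contain any Unicode letter or number.
const unicodeAware = /@([a-z0-9_]+)@([\p{L}\p{N}_.-]+)/iu;

console.log(asciiOnly.exec(post)[2]);    // "misskey.io"
console.log(unicodeAware.exec(post)[2]); // "misskey.ioありがとう"
```

With the Unicode-aware pattern the following Japanese text becomes part of the host, which is the breakage described above.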

I am not sure how best to solve this, but I think leaving IDNs unimplemented is a very inelegant state of affairs that needs to be fixed. I have already seen a few people who were irritated by the lack of proper IDN support.

@Johann150
Contributor Author

see also misskey-dev/misskey#5826

@marihachi
Contributor

marihachi commented Nov 9, 2021

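Under the current specification, non-ASCII characters cannot be used directly in the host name of a mention.
To represent non-ASCII characters, they have to be converted to Punycode.

Do you mean changing this so that the host name part can also be written in Unicode?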

@Johann150
Contributor Author

Yes, the parser should understand Unicode host names. But I think it would make sense for the output to contain punycoded domains where necessary.
So, for example, the input

@somebody@みすきー.テスト

MENTION('somebody', 'xn--w8jxa7itv.xn--zckzah', '@somebody@みすきー.テスト')
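
The conversion in this example could be done along these lines. This is only a sketch under assumptions: `toAsciiHost` and `mentionNode` are hypothetical helper names, the node shape is invented for illustration, and the conversion relies on the WHATWG `URL` class built into browsers and Node.js rather than on a dedicated Punycode library.

```javascript
// Hypothetical helper: convert a Unicode host name to its ASCII (Punycode)
// form by letting the WHATWG URL parser apply IDNA processing.
function toAsciiHost(host) {
  try {
    return new URL(`http://${host}`).hostname;
  } catch {
    return null; // the URL parser rejected the host name
  }
}

// Hypothetical node shape mirroring MENTION(username, host, acct) above:
// the host is stored punycoded, the original text is kept as written.
function mentionNode(username, host, acct) {
  return { type: 'mention', props: { username, host: toAsciiHost(host), acct } };
}

const node = mentionNode('somebody', 'みすきー.テスト', '@somebody@みすきー.テスト');
console.log(node.props.host); // ASCII-only host; both labels start with "xn--"
console.log(node.props.acct); // "@somebody@みすきー.テスト"
```

ASCII host names such as misskey.io pass through this helper unchanged.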

@Johann150
Contributor Author

I looked at how Mastodon does it, and these are the regular expressions they use to recognize a mention:

USERNAME_RE   = /[a-z0-9_]+([a-z0-9_\.-]+[a-z0-9_]+)?/i
MENTION_RE    = /(?<=^|[^\/[:word:]])@((#{USERNAME_RE})(?:@[[:word:]\.\-]+[[:word:]]+)?)/i

https://github.com/mastodon/mastodon/blob/1114935e6486caaae6e4ba98b51ab803317acb03/app/models/account.rb#L61-L62

Since pegjs does not support \p{L} or \p{N}, which would be needed to express the same meaning as [:word:] in Ruby, it might be simpler to handle mentions with IDNs before the parser starts and convert them into punycoded domains. Then the parser itself would not have to be changed.

Mastodon's regular expression for mentions translates to JavaScript as follows (replacing [:word:] with the JavaScript equivalent \p{L}\p{N}_, adding the u Unicode flag, and removing unnecessary parentheses):

/(?<=^|[^\/\p{L}\p{N}_])@[a-z0-9_]+(?:[a-z0-9_\.-]+[a-z0-9_]+)?(?:@[\p{L}\p{N}_\.-]+[\p{L}\p{N}_]+)?/iu
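
A pre-processing pass along those lines might look like the sketch below. It is an illustration under assumptions, not a proposed implementation: `punycodeMentions` is a hypothetical name, two capture groups and the `g` flag were added to the expression above, and the host is converted with the built-in WHATWG `URL` class.

```javascript
// The translated Mastodon pattern, with capture groups for username and host
// and the g flag so that every mention in the text is visited.
const MENTION_RE =
  /(?<=^|[^\/\p{L}\p{N}_])@([a-z0-9_]+(?:[a-z0-9_\.-]+[a-z0-9_]+)?)(?:@([\p{L}\p{N}_\.-]+[\p{L}\p{N}_]+))?/giu;

// Hypothetical pre-processing step: rewrite non-ASCII mention hosts to
// Punycode so that an ASCII-only parser can handle them afterwards.
function punycodeMentions(text) {
  return text.replace(MENTION_RE, (match, username, host) => {
    // Leave local mentions and plain ASCII hosts exactly as written.
    if (host === undefined || /^[\x00-\x7f]*$/.test(host)) return match;
    try {
      const asciiHost = new URL(`http://${host}`).hostname;
      return `@${username}@${asciiHost}`;
    } catch {
      return match; // not a valid host name: keep the original text
    }
  });
}

console.log(punycodeMentions('hi @somebody@みすきー.テスト !'));
console.log(punycodeMentions('@syuilo@misskey.io thanks')); // unchanged
```

This does not resolve the ambiguity from the first comment: in @syuilo@misskey.ioありがとう the trailing text still matches as part of the host and would be punycoded along with it.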

@marihachi
Contributor

https://github.com/mathiasbynens/idn-allowed-code-points-regex
I found an IDNA implementation.

Johann150 added the "enhancement" label on Feb 5, 2022
marihachi added the "Feature" label and removed the "enhancement" label on Feb 9, 2022
2 participants