diff --git a/api/client-server/registration.yaml b/api/client-server/registration.yaml index 9bd6aa75062..a01d25599b2 100644 --- a/api/client-server/registration.yaml +++ b/api/client-server/registration.yaml @@ -45,6 +45,16 @@ paths: If the client does not supply a ``device_id``, the server must auto-generate one. + The server SHOULD register an account with a User ID based on the + ``username`` provided, if any. Note that the grammar of Matrix User ID + localparts is restricted, so the server MUST either map the provided + ``username`` onto a ``user_id`` in a logical manner, or reject + ``username``\s which do not comply with the grammar, with + ``M_INVALID_USERNAME``. + + Matrix clients MUST NOT assume that the localpart of the registered + ``user_id`` matches the provided ``username``. + The returned access token must be associated with the ``device_id`` supplied by the client or generated by the server. The server may invalidate any access token previously associated with that device. See @@ -86,7 +96,7 @@ paths: username: type: string description: |- - The local part of the desired Matrix ID. If omitted, + The basis for the localpart of the desired Matrix ID. If omitted, the homeserver MUST generate a Matrix ID local part. example: cheeky_monkey password: @@ -121,7 +131,11 @@ paths: properties: user_id: type: string - description: The fully-qualified Matrix ID that has been registered. + description: |- + The fully-qualified Matrix user ID (MXID) that has been registered. + + Any user ID returned by this API must conform to the grammar given in the + `Matrix specification `_. access_token: type: string description: |- diff --git a/changelogs/client_server.rst b/changelogs/client_server.rst index 03cd2fddaa0..546bf37a866 100644 --- a/changelogs/client_server.rst +++ b/changelogs/client_server.rst @@ -92,6 +92,9 @@ - Add some clarifying notes on the behaviour of rooms with no ``m.room.power_levels`` event (`#1026 `_).
+ - Clarify the relationship between ``username`` and ``user_id`` in the + ``/register`` API + (`#1032 `_). r0.2.0 ====== diff --git a/specification/appendices/identifier_grammar.rst b/specification/appendices/identifier_grammar.rst new file mode 100644 index 00000000000..d53f61ac938 --- /dev/null +++ b/specification/appendices/identifier_grammar.rst @@ -0,0 +1,225 @@ +.. Copyright 2016 Openmarket Ltd. +.. Copyright 2017 New Vector Ltd. +.. +.. Licensed under the Apache License, Version 2.0 (the "License"); +.. you may not use this file except in compliance with the License. +.. You may obtain a copy of the License at +.. +.. http://www.apache.org/licenses/LICENSE-2.0 +.. +.. Unless required by applicable law or agreed to in writing, software +.. distributed under the License is distributed on an "AS IS" BASIS, +.. WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +.. See the License for the specific language governing permissions and +.. limitations under the License. + +Identifier Grammar +------------------ + +Server Name +~~~~~~~~~~~ + +A homeserver is uniquely identified by its server name. This value is used in a +number of identifiers, as described below. + +The server name represents the address at which the homeserver in question can +be reached by other homeservers. The complete grammar is:: + + server_name = dns_name [ ":" port] + dns_name = host + port = *DIGIT + +where ``host`` is as defined by `RFC3986, section 3.2.2 +`_. + +Examples of valid server names are: + +* ``matrix.org`` +* ``matrix.org:8888`` +* ``1.2.3.4`` (IPv4 literal) +* ``1.2.3.4:1234`` (IPv4 literal with explicit port) +* ``[1234:5678::abcd]`` (IPv6 literal) +* ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port) + + +Common Identifier Format +~~~~~~~~~~~~~~~~~~~~~~~~ + +The Matrix protocol uses a common format to assign unique identifiers to a +number of entities, including users, events and rooms. 
Each identifier takes +the form:: + + &localpart:domain + +where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of +the homeserver which allocated the identifier, and ``localpart`` is an +identifier allocated by that homeserver. + +The sigil characters are as follows: + +* ``@``: User ID +* ``!``: Room ID +* ``$``: Event ID +* ``#``: Room alias + +The precise grammar defining the allowable format of an identifier depends on +the type of identifier. + +User Identifiers +++++++++++++++++ + +Users within Matrix are uniquely identified by their Matrix user ID. The user +ID is namespaced to the homeserver which allocated the account and has the +form:: + + @localpart:domain + +The ``localpart`` of a user ID is an opaque identifier for that user. It MUST +NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``, +``_``, ``=``, ``-``, and ``/``. + +The ``domain`` of a user ID is the `server name`_ of the homeserver which +allocated the account. + +The length of a user ID, including the ``@`` sigil and the domain, MUST NOT +exceed 255 characters. + +The complete grammar for a legal user ID is:: + + user_id = "@" user_id_localpart ":" server_name + user_id_localpart = 1*user_id_char + user_id_char = DIGIT + / %x61-7A ; a-z + / "-" / "." / "=" / "_" / "/" + +.. admonition:: Rationale + + A number of factors were considered when defining the allowable characters + for a user ID. + + Firstly, we chose to exclude characters outside the basic US-ASCII character + set. User IDs are primarily intended for use as an identifier at the protocol + level, and their use as a human-readable handle is of secondary + benefit. Furthermore, they are useful as a last-resort differentiator between + users with similar display names. Allowing the full Unicode character set + would make it very difficult for a human to distinguish two similar user IDs.
The + limited character set used has the advantage that even a user unfamiliar with + the Latin alphabet should be able to distinguish similar user IDs manually, if + somewhat laboriously. + + We chose to disallow upper-case characters because we do not consider it + valid to have two user IDs which differ only in case: indeed it should be + possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However, + user IDs are necessarily used in a number of situations which are inherently + case-sensitive (notably in the ``state_key`` of ``m.room.member`` + events). Forbidding upper-case characters (and requiring homeservers to + downcase usernames when creating user IDs for new users) is a relatively simple + way to ensure that ``@USER:matrix.org`` cannot refer to a different user to + ``@user:matrix.org``. + + Finally, we decided to restrict the allowable punctuation to a very basic set + to reduce the possibility of conflicts with special characters in various + situations. For example, "*" is used as a wildcard in some APIs (notably the + filter API), so it cannot be a legal user ID character. + + The length restriction is derived from the limit on the length of the + ``sender`` key on events; since the user ID appears in every event sent by the + user, it is limited to ensure that the user ID does not dominate over the actual + content of the events. + +Matrix user IDs are sometimes informally referred to as MXIDs. + +Historical User IDs +<<<<<<<<<<<<<<<<<<< + +Older versions of this specification were more tolerant of the characters +permitted in user ID localparts. There are currently active users whose user +IDs do not conform to the permitted character set, and a number of rooms whose +history includes events with a ``sender`` which does not conform. 
In order to +handle these rooms successfully, clients and servers MUST accept user IDs with +localparts from the expanded character set:: + + extended_user_id_char = %x21-7E + +Mapping from other character sets +<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< + +In certain circumstances it will be desirable to map from a wider character set +onto the limited character set allowed in a user ID localpart. Examples include +a homeserver creating a user ID for a new user based on the username passed to +``/register``, or a bridge mapping user IDs from another protocol. + +.. TODO-spec + + We need to better define the mechanism by which homeservers can allow users + to have non-Latin login credentials. The general idea is for clients to pass + the non-Latin username in the ``username`` field to ``/register`` and ``/login``, and + the HS then maps it onto the MXID space when turning it into the + fully-qualified ``user_id`` which is returned to the client and used in + events. + +Implementations are free to do this mapping however they choose. Since the user +ID is opaque except to the implementation which created it, the only +requirement is that the implementation can perform the mapping +consistently. However, we suggest the following algorithm: + +1. Encode character strings as UTF-8. + +2. Convert the bytes ``A-Z`` to lower-case. + + * In the case where a bridge must be able to distinguish two different users + with IDs which differ only by case, escape upper-case characters by + prefixing with ``_`` before downcasing. For example, ``A`` becomes + ``_a``. Escape a real ``_`` with a second ``_``. + +3. Encode any remaining bytes outside the allowed character set, as well as + ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#`` + becomes ``=23``; ``á`` becomes ``=c3=a1``. + +..
admonition:: Rationale + + The suggested mapping is an attempt to preserve human-readability of simple + ASCII identifiers (unlike, for example, base-32), whilst still allowing + representation of *any* character (unlike punycode, which provides no way to + encode ASCII punctuation). + + +Room IDs and Event IDs +++++++++++++++++++++++ + +A room has exactly one room ID. A room ID has the format:: + + !opaque_id:domain + +An event has exactly one event ID. An event ID has the format:: + + $opaque_id:domain + +The ``domain`` of a room/event ID is the `server name`_ of the homeserver which +created the room/event. The domain is used only for namespacing to avoid the +risk of clashes of identifiers between different homeservers. There is no +implication that the room or event in question is still available at the +corresponding homeserver. + +Event IDs and Room IDs are case-sensitive. They are not meant to be human +readable. + +.. TODO-spec + What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389 + +Room Aliases +++++++++++++ + +A room may have zero or more aliases. A room alias has the format:: + + #room_alias:domain + +The ``domain`` of a room alias is the `server name`_ of the homeserver which +created the alias. Other servers may contact this homeserver to look up the +alias. + +Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the +domain). + +.. TODO-spec + - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391 diff --git a/specification/index.rst b/specification/index.rst index ae6611c06e9..5739ce09782 100644 --- a/specification/index.rst +++ b/specification/index.rst @@ -27,17 +27,17 @@ Voice over IP (VoIP) signalling, Internet of Things (IoT) communication, and bri together existing communication silos - providing the basis of a new open real-time communication ecosystem. -`Introduction to Matrix `_ provides a full introduction to Matrix and the spec. 
- Matrix APIs ----------- -The following APIs are documented in this specification: +The specification consists of the following parts: + +`Introduction to Matrix `_ provides a full introduction to Matrix and the spec. {{apis}} -`Appendices `_ with supplemental information not specific to -one of the above APIs are also available. +The `Appendices `_ contain supplemental information not specific to +one of the above APIs. Specification Versions ---------------------- diff --git a/specification/intro.rst b/specification/intro.rst index 1d3dfafe304..1c795cea1ef 100644 --- a/specification/intro.rst +++ b/specification/intro.rst @@ -157,9 +157,8 @@ allocated the account and has the form:: @localpart:domain -See the `Identifier Grammar`_ section for full details of the structure of -user IDs. - +See the `appendices `_ for full details of +the structure of user IDs. Devices ~~~~~~~ @@ -242,8 +241,8 @@ There is exactly one room ID for each room. Whilst the room ID does contain a domain, it is simply for globally namespacing room IDs. The room does NOT reside on the domain specified. -See the `Identifier Grammar`_ section for full details of the structure of -a room ID. +See the `appendices `_ for full details of +the structure of a room ID. The following conceptual diagram shows an ``m.room.message`` event being sent to the room ``!qporfwt:matrix.org``:: @@ -318,8 +317,8 @@ Each room can also have multiple "Room Aliases", which look like:: #room_alias:domain -See the `Identifier Grammar`_ section for full details of the structure of -a room alias. +See the `appendices `_ for full details of +the structure of a room alias. A room alias "points" to a room ID and is the human-readable label by which rooms are publicised and discovered. The room ID the alias is pointing to can @@ -387,221 +386,6 @@ dedicated API. The API is symmetrical to managing Profile data. Would it really be overengineered to use the same API for both profile & private user data, but with different ACLs? 
- -Identifier Grammar ------------------- - -Server Name -~~~~~~~~~~~ - -A homeserver is uniquely identified by its server name. This value is used in a -number of identifiers, as described below. - -The server name represents the address at which the homeserver in question can -be reached by other homeservers. The complete grammar is:: - - server_name = dns_name [ ":" port] - dns_name = host - port = *DIGIT - -where ``host`` is as defined by `RFC3986, section 3.2.2 -`_. - -Examples of valid server names are: - -* ``matrix.org`` -* ``matrix.org:8888`` -* ``1.2.3.4`` (IPv4 literal) -* ``1.2.3.4:1234`` (IPv4 literal with explicit port) -* ``[1234:5678::abcd]`` (IPv6 literal) -* ``[1234:5678::abcd]:5678`` (IPv6 literal with explicit port) - - -Common Identifier Format -~~~~~~~~~~~~~~~~~~~~~~~~ - -The Matrix protocol uses a common format to assign unique identifiers to a -number of entities, including users, events and rooms. Each identifier takes -the form:: - - &localpart:domain - -where ``&`` represents a 'sigil' character; ``domain`` is the `server name`_ of -the homeserver which allocated the identifier, and ``localpart`` is an -identifier allocated by that homeserver. - -The sigil characters are as follows: - -* ``@``: User ID -* ``!``: Room ID -* ``$``: Event ID -* ``#``: Room alias - -The precise grammar defining the allowable format of an identifier depends on -the type of identifier. - -User Identifiers -++++++++++++++++ - -Users within Matrix are uniquely identified by their Matrix user ID. The user -ID is namespaced to the homeserver which allocated the account and has the -form:: - - @localpart:domain - -The ``localpart`` of a user ID is an opaque identifier for that user. It MUST -NOT be empty, and MUST contain only the characters ``a-z``, ``0-9``, ``.``, -``_``, ``=``, and ``-``. - -The ``domain`` of a user ID is the `server name`_ of the homeserver which -allocated the account. 
- -The length of a user ID, including the ``@`` sigil and the domain, MUST NOT -exceed 255 characters. - -The complete grammar for a legal user ID is:: - - user_id = "@" user_id_localpart ":" server_name - user_id_localpart = 1*user_id_char - user_id_char = DIGIT - / %x61-7A ; a-z - / "-" / "." / "=" / "_" - -.. admonition:: Rationale - - A number of factors were considered when defining the allowable characters - for a user ID. - - Firstly, we chose to exclude characters outside the basic US-ASCII character - set. User IDs are primarily intended for use as an identifier at the protocol - level, and their use as a human-readable handle is of secondary - benefit. Furthermore, they are useful as a last-resort differentiator between - users with similar display names. Allowing the full unicode character set - would make very difficult for a human to distinguish two similar user IDs. The - limited character set used has the advantage that even a user unfamiliar with - the Latin alphabet should be able to distinguish similar user IDs manually, if - somewhat laboriously. - - We chose to disallow upper-case characters because we do not consider it - valid to have two user IDs which differ only in case: indeed it should be - possible to reach ``@user:matrix.org`` as ``@USER:matrix.org``. However, - user IDs are necessarily used in a number of situations which are inherently - case-sensitive (notably in the ``state_key`` of ``m.room.member`` - events). Forbidding upper-case characters (and requiring homeservers to - downcase usernames when creating user IDs for new users) is a relatively simple - way to ensure that ``@USER:matrix.org`` cannot refer to a different user to - ``@user:matrix.org``. - - Finally, we decided to restrict the allowable punctuation to a very basic set - to ensure that the identifier can be used as-is in as wide a number of - situations as possible, without requiring escaping. 
For instance, allowing - "%" or "/" would make it harder to use a user ID in a URI. "*" is used as a - wildcard in some APIs (notably the filter API), so it also cannot be a legal - user ID character. - - The length restriction is derived from the limit on the length of the - ``sender`` key on events; since the user ID appears in every event sent by the - user, it is limited to ensure that the user ID does not dominate over the actual - content of the events. - -Matrix user IDs are sometimes informally referred to as MXIDs. - -Historical User IDs -<<<<<<<<<<<<<<<<<<< - -Older versions of this specification were more tolerant of the characters -permitted in user ID localparts. There are currently active users whose user -IDs do not conform to the permitted character set, and a number of rooms whose -history includes events with a ``sender`` which does not conform. In order to -handle these rooms successfully, clients and servers MUST accept user IDs with -localparts from the expanded character set:: - - extended_user_id_char = %x21-7E - -Mapping from other character sets -<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<<< - -In certain circumstances it will be desirable to map from a wider character set -onto the limited character set allowed in a user ID localpart. Examples include -a homeserver creating a user ID for a new user based on the username passed to -``/register``, or a bridge mapping user ids from another protocol. - -.. TODO-spec - - We need to better define the mechanism by which homeservers can allow users - to have non-Latin login credentials. The general idea is for clients to pass - the non-Latin in the ``username`` field to ``/register`` and ``/login``, and - the HS then maps it onto the MXID space when turning it into the - fully-qualified ``user_id`` which is returned to the client and used in - events. - -Implementations are free to do this mapping however they choose. 
Since the user -ID is opaque except to the implementation which created it, the only -requirement is that the implemention can perform the mapping -consistently. However, we suggest the following algorithm: - -1. Encode character strings as UTF-8. - -2. Convert the bytes ``A-Z`` to lower-case. - - * In the case where a bridge must be able to distinguish two different users - with ids which differ only by case, escape upper-case characters by - prefixing with ``_`` before downcasing. For example, ``A`` becomes - ``_a``. Escape a real ``_`` with a second ``_``. - -3. Encode any remaining bytes outside the allowed character set, as well as - ``=``, as their hexadecimal value, prefixed with ``=``. For example, ``#`` - becomes ``=23``; ``á`` becomes ``=c3=a1``. - -.. admonition:: Rationale - - The suggested mapping is an attempt to preserve human-readability of simple - ASCII identifiers (unlike, for example, base-32), whilst still allowing - representation of *any* character (unlike punycode, which provides no way to - encode ASCII punctuation). - - -Room IDs and Event IDs -++++++++++++++++++++++ - -A room has exactly one room ID. A room ID has the format:: - - !opaque_id:domain - -An event has exactly one event ID. An event ID has the format:: - - $opaque_id:domain - -The ``domain`` of a room/event ID is the `server name`_ of the homeserver which -created the room/event. The domain is used only for namespacing to avoid the -risk of clashes of identifiers between different homeservers. There is no -implication that the room or event in question is still available at the -corresponding homeserver. - -Event IDs and Room IDs are case-sensitive. They are not meant to be human -readable. - -.. TODO-spec - What is the grammar for the opaque part? https://matrix.org/jira/browse/SPEC-389 - -Room Aliases -++++++++++++ - -A room may have zero or more aliases. 
A room alias has the format:: - - #room_alias:domain - -The ``domain`` of a room alias is the `server name`_ of the homeserver which -created the alias. Other servers may contact this homeserver to look up the -alias. - -Room aliases MUST NOT exceed 255 bytes (including the ``#`` sigil and the -domain). - -.. TODO-spec - - Need to specify precise grammar for Room Aliases. https://matrix.org/jira/browse/SPEC-391 - - License ------- diff --git a/specification/targets.yaml b/specification/targets.yaml index 0d64815b183..fb68e13d4e4 100644 --- a/specification/targets.yaml +++ b/specification/targets.yaml @@ -34,6 +34,7 @@ targets: - appendices.rst - appendices/base64.rst - appendices/signing_json.rst + - appendices/identifier_grammar.rst - appendices/threat_model.rst - appendices/test_vectors.rst groups: # reusable blobs of files when prefixed with 'group:'
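
Reviewer note: the mapping algorithm suggested in the new ``appendices/identifier_grammar.rst`` (encode as UTF-8, lower-case ``A-Z``, then ``=``-escape every remaining byte outside the allowed set) can be sketched roughly as below. This is only an illustration under stated assumptions: the function name ``map_to_localpart`` and the ``case_sensitive`` flag are hypothetical and appear in neither the specification nor any homeserver, and the allowed set is taken from the ``user_id_char`` grammar added in this change.

```python
# Hypothetical sketch of the suggested username -> localpart mapping.
# Not part of the specification; names here are illustrative only.

# Bytes permitted unescaped, per the user_id_char grammar in this change.
# "=" is legal in a localpart but is always escaped by the mapping,
# because it doubles as the escape prefix.
ALLOWED = frozenset(b"abcdefghijklmnopqrstuvwxyz0123456789._-/")


def map_to_localpart(name: str, case_sensitive: bool = False) -> str:
    """Map an arbitrary string onto the restricted localpart character set.

    With case_sensitive=True (the bridge case described in step 2), upper-case
    letters are prefixed with "_" before downcasing, and a literal "_" is
    doubled so the result stays unambiguous.
    """
    out = []
    for byte in name.encode("utf-8"):  # step 1: work on UTF-8 bytes
        ch = chr(byte)
        if case_sensitive and ch == "_":
            out.append("__")  # escape a real underscore
        elif "A" <= ch <= "Z":
            # step 2: downcase, optionally marking the original case
            out.append(("_" if case_sensitive else "") + ch.lower())
        elif byte in ALLOWED:
            out.append(ch)
        else:
            # step 3: hex-escape everything else, including "="
            out.append(f"={byte:02x}")
    return "".join(out)


if __name__ == "__main__":
    for sample in ("Alice", "#", "á", "a=b"):
        print(sample, "->", map_to_localpart(sample))
```

The sketch does not enforce the non-empty rule or the 255-character limit on the full user ID; a caller would need to check those separately, or reject the ``username`` with ``M_INVALID_USERNAME`` as the ``/register`` text in this change allows.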