Clarify that size limits are in bytes #1234

deepbluev7 · 2022-09-09T11:34:42Z

In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec.

fixes #1001

Signed-off-by: Nicolas Werner nicolas.werner@hotmail.de

See also:

matrix-org/synapse#13710
matrix-org/gomatrixserverlib#338

Preview: https://pr1234--matrix-spec-previews.netlify.app

In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec. fixes matrix-org#1001 Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>

deepbluev7 · 2022-09-09T11:35:40Z

I probably missed some and some of these are basically reundant, because it says "Needs to be [a-z]" or similar right in front of it, but some people still disagree and bytes should be clear enough (UTF-8 encoding is implicit by the spec already).

deepbluev7 · 2022-09-09T11:38:34Z

(I also probably missed some)

Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>

Sorunome · 2022-09-11T12:05:43Z

This looks like it should be an MSC for a future room version

deepbluev7 · 2022-09-11T20:12:58Z

It depends on how the spec clarification turns out, but in general this shouldn't break anything, considering in pretty much all instances it restricts the character set to ASCII before talking about codepoints or characters. In which case bytes and characters are equivalent.

turt2live

just ftr: we're looking into these to determine if they're all able to be changed to bytes as some are not immediately obvious or likely require an MSC on their own. However, some do look valid, which is good.

It will take a while to go through all of these - please be patient and understanding with us.

deepbluev7 · 2022-09-12T21:39:17Z

From a pure spec perspective these would be fine, imo, since usually the charset is constrained to ASCII before it is talked about codepoints. Now if that works in reality with implementations in the wild is a different story, so take the time you need.

turt2live · 2022-09-12T21:44:08Z

Some of the usages of "characters" and such look very deliberate is all.

richvdh

[@deepbluev7] some of these are basically reundant

yes.

[@turt2live] Some of the usages of "characters" and such look very deliberate is all.

also this.

I feel like I'd be happier reviewing this if we removed the noise from places where characters and bytes are the same thing (possibly to a separate PR), so that we can better focus on the places it makes a difference.

In general, where we are actually changing things, I don't think its sufficient to simply s/characters/bytes/ without specifying which encoding we are talking about. Generally we need to say "the UTF-8 encoding must not exceed 255 bytes" or words to that effect.

Also: it would be good to add something like this to each change:

{{% changed-in v="1.10" %}} Clarified that length limit is in bytes, not codepoints.

richvdh · 2024-02-27T13:26:15Z

data/api/server-server/definitions/event-schemas/m.direct_to_device.yaml

            description: Unique ID for the message, used for idempotence.  Arbitrary
-              utf8 string, of maximum length 32 codepoints.
+              utf8 string, of maximum length 32 bytes.


this is a pre-existing problem, but message_id should not be a UTF-8 string (or rather, it does not need to be one and should not be interpreted as one). It should be a regular unicode string, just like all the other strings in matrix. Assuming it is then sent over an HTTP request which uses a JSON body with UTF-8 encoding, it will be encoded as UTF-8.

In particular: if it were a UTF-8 string, then bytes such as 0xC0 would be invalid. In practice, 0xC0 is a valid value in this string; it is "LATIN CAPITAL LETTER A WITH GRAVE".

In short, this text needs rewording, rather than simply s/codepoints/bytes/:

Suggested change

description: Unique ID for the message, used for idempotence. Arbitrary

utf8 string, of maximum length 32 codepoints.

utf8 string, of maximum length 32 bytes.

description: Unique ID for the message, used for idempotence. Arbitrary

string, whose UTF-8 encoding must not exceed 32 bytes.

richvdh · 2024-05-24T09:29:00Z

@deepbluev7 are you still interested in working on this?

richvdh · 2024-06-12T08:16:36Z

I think that the important bits of #1001 have now been solved by #1850, and, since this is bit-rotting anyway, I'm going to go ahead and close it.

As ever, there are bits of the spec that could still use clarification, so please do feel free to open new PRs!

deepbluev7 requested a review from a team as a code owner September 9, 2022 11:34

Add changelogs

a2d4f1b

Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>

ghost approved these changes Sep 9, 2022

View reviewed changes

turt2live added the blocked Something needs to be done before action can be taken on this PR/issue. label Sep 11, 2022

turt2live reviewed Sep 12, 2022

View reviewed changes

richvdh requested changes Feb 27, 2024

View reviewed changes

richvdh removed the blocked Something needs to be done before action can be taken on this PR/issue. label Feb 27, 2024

richvdh closed this Jun 12, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Clarify that size limits are in bytes #1234

Clarify that size limits are in bytes #1234

deepbluev7 commented Sep 9, 2022 •

edited by github-actions bot

Loading

deepbluev7 commented Sep 9, 2022

deepbluev7 commented Sep 9, 2022

Sorunome commented Sep 11, 2022

deepbluev7 commented Sep 11, 2022

turt2live left a comment

deepbluev7 commented Sep 12, 2022

turt2live commented Sep 12, 2022

richvdh left a comment

richvdh Feb 27, 2024

richvdh commented May 24, 2024

richvdh commented Jun 12, 2024

Clarify that size limits are in bytes #1234

Clarify that size limits are in bytes #1234

Conversation

deepbluev7 commented Sep 9, 2022 • edited by github-actions bot Loading

deepbluev7 commented Sep 9, 2022

deepbluev7 commented Sep 9, 2022

Sorunome commented Sep 11, 2022

deepbluev7 commented Sep 11, 2022

turt2live left a comment

Choose a reason for hiding this comment

deepbluev7 commented Sep 12, 2022

turt2live commented Sep 12, 2022

richvdh left a comment

Choose a reason for hiding this comment

richvdh Feb 27, 2024

Choose a reason for hiding this comment

richvdh commented May 24, 2024

richvdh commented Jun 12, 2024

deepbluev7 commented Sep 9, 2022 •

edited by github-actions bot

Loading