Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clarify that size limits are in bytes #1234

Closed

Conversation

deepbluev7
Copy link
Contributor

@deepbluev7 deepbluev7 commented Sep 9, 2022

In almost all instances we are already talking about UTF-8 encoded ASCII. As such this is not a behavioural change apart from in theory legacy identifiers, that never were part of the spec.

fixes #1001

Signed-off-by: Nicolas Werner nicolas.werner@hotmail.de

See also:

matrix-org/synapse#13710
matrix-org/gomatrixserverlib#338

Preview: https://pr1234--matrix-spec-previews.netlify.app

In almost all instances we are already talking about UTF-8 encoded
ASCII. As such this is not a behavioural change apart from in theory
legacy identifiers, that never were part of the spec.

fixes matrix-org#1001

Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>
@deepbluev7 deepbluev7 requested a review from a team as a code owner September 9, 2022 11:34
@deepbluev7
Copy link
Contributor Author

I probably missed some and some of these are basically reundant, because it says "Needs to be [a-z]" or similar right in front of it, but some people still disagree and bytes should be clear enough (UTF-8 encoding is implicit by the spec already).

@deepbluev7
Copy link
Contributor Author

(I also probably missed some)

Signed-off-by: Nicolas Werner <nicolas.werner@hotmail.de>
@Sorunome
Copy link
Contributor

This looks like it should be an MSC for a future room version

@deepbluev7
Copy link
Contributor Author

It depends on how the spec clarification turns out, but in general this shouldn't break anything, considering in pretty much all instances it restricts the character set to ASCII before talking about codepoints or characters. In which case bytes and characters are equivalent.

@turt2live turt2live added the blocked Something needs to be done before action can be taken on this PR/issue. label Sep 11, 2022
Copy link
Member

@turt2live turt2live left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just ftr: we're looking into these to determine if they're all able to be changed to bytes as some are not immediately obvious or likely require an MSC on their own. However, some do look valid, which is good.

It will take a while to go through all of these - please be patient and understanding with us.

@deepbluev7
Copy link
Contributor Author

From a pure spec perspective these would be fine, imo, since usually the charset is constrained to ASCII before it is talked about codepoints. Now if that works in reality with implementations in the wild is a different story, so take the time you need.

@turt2live
Copy link
Member

Some of the usages of "characters" and such look very deliberate is all.

Copy link
Member

@richvdh richvdh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[@deepbluev7] some of these are basically reundant

yes.

[@turt2live] Some of the usages of "characters" and such look very deliberate is all.

also this.

I feel like I'd be happier reviewing this if we removed the noise from places where characters and bytes are the same thing (possibly to a separate PR), so that we can better focus on the places it makes a difference.

In general, where we are actually changing things, I don't think its sufficient to simply s/characters/bytes/ without specifying which encoding we are talking about. Generally we need to say "the UTF-8 encoding must not exceed 255 bytes" or words to that effect.

Also: it would be good to add something like this to each change:

{{% changed-in v="1.10" %}} Clarified that length limit is in bytes, not codepoints.

Comment on lines 45 to +46
description: Unique ID for the message, used for idempotence. Arbitrary
utf8 string, of maximum length 32 codepoints.
utf8 string, of maximum length 32 bytes.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a pre-existing problem, but message_id should not be a UTF-8 string (or rather, it does not need to be one and should not be interpreted as one). It should be a regular unicode string, just like all the other strings in matrix. Assuming it is then sent over an HTTP request which uses a JSON body with UTF-8 encoding, it will be encoded as UTF-8.

In particular: if it were a UTF-8 string, then bytes such as 0xC0 would be invalid. In practice, 0xC0 is a valid value in this string; it is "LATIN CAPITAL LETTER A WITH GRAVE".

In short, this text needs rewording, rather than simply s/codepoints/bytes/:

Suggested change
description: Unique ID for the message, used for idempotence. Arbitrary
utf8 string, of maximum length 32 codepoints.
utf8 string, of maximum length 32 bytes.
description: Unique ID for the message, used for idempotence. Arbitrary
string, whose UTF-8 encoding must not exceed 32 bytes.

@richvdh richvdh removed the blocked Something needs to be done before action can be taken on this PR/issue. label Feb 27, 2024
@richvdh
Copy link
Member

richvdh commented May 24, 2024

@deepbluev7 are you still interested in working on this?

@richvdh
Copy link
Member

richvdh commented Jun 12, 2024

I think that the important bits of #1001 have now been solved by #1850, and, since this is bit-rotting anyway, I'm going to go ahead and close it.

As ever, there are bits of the spec that could still use clarification, so please do feel free to open new PRs!

@richvdh richvdh closed this Jun 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Pick a unit for describing string length
4 participants