Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

The specific limits on keys and values are unclear. #3706

Closed
xorgy opened this issue Sep 20, 2022 · 11 comments
Closed

The specific limits on keys and values are unclear. #3706

xorgy opened this issue Sep 20, 2022 · 11 comments

Comments

@xorgy
Copy link

xorgy commented Sep 20, 2022

In #2025 I tried asking for what the specific limitations on key and value sizes are. These limits are often quoted as "255 characters" or "255 chars", which is ambigous. Some people interpret this as 255 UTF-16 code units, some interpret it as 255 UTF-8 bytes, some interpret it as 255 Unicode Scalar Values.

@mmd-osm points out that the function unicode_strlen in openstreetmap-cgimap is responsible for determining the length, which is checked against 255 in order to accept or reject a key or value. This raises issues because unicode_strlen relies on std::mbsrtowcs, which will give a length that is dependent on the type of wchar_t. My understanding is that wchar_t is UTF-16LE/uint16_t on Windows, but it is implementation-dependent, and can be a 32-bit quantity instead, representing a whole Unicode Scalar Value.

I would like to have a clear understanding of what specific limits are placed on the length and content of keys and values in changeset tags.

@mmd-osm
Copy link
Contributor

mmd-osm commented Sep 20, 2022

My understanding is that wchar_t is UTF-16LE/uint16_t on Windows

Code is actually running on Ubuntu 22.04 in production, and I'd prefer if we could check the assumption for that platform. We don't support Windows for cgimap, and have no plans to do so.

@xorgy
Copy link
Author

xorgy commented Sep 20, 2022

@mmd-osm I tried on your godbolt instance, https://cpp.godbolt.org/z/K1TbqbMhK and it shows that in your C++11 environment, sizeof(wchar_t) is 4 bytes. I believe C++11 std asks for wchar_t to be 4 bytes (a 32-bit unsigned integer representing a Unicode Scalar), and it may be 2 bytes (uint16_t/UTF-16) in other systems.

@xorgy
Copy link
Author

xorgy commented Sep 20, 2022

@tomhughes said:

I mean probably it works fine given the setlocale I've just always steered clear of them myself.

If it works as intended then it's counting codepoints though (which is good) not UTF16 values.

setlocale doesn't save you in this case, because std:mbsrtowcs behaviour depends on both the locale and the size of wchar_t. I think it is standardized in C++11 but implementation-defined before then. Outside C++11, wchar_t may be a UTF-16 code unit, whereas in C++11 I believe it will be a Unicode Scalar Value in a 4-byte native endian integer.

@mmd-osm
Copy link
Contributor

mmd-osm commented Sep 20, 2022

I should mention, we also have this test case: https://github.com/zerebubuth/openstreetmap-cgimap/blob/master/test/test_parse_osmchange_input.cpp#L445-L468 using 😎 in UTF-8, which is represented by a 4 byte character sequence.

I think it is standardized in C++11 but implementation-defined before then

We're on C++17.

@tomhughes
Copy link
Member

If this ticket is about cgimap then it's on the wrong repository... Let me see if I can move it.

@tomhughes
Copy link
Member

No I can't sadly, so I'll have to close it here as it's not a rails issue,

@tomhughes tomhughes closed this as not planned Won't fix, can't repro, duplicate, stale Sep 20, 2022
@xorgy
Copy link
Author

xorgy commented Sep 20, 2022

Well, there's also a rails validation, which I think goes by codepoints, the cgimap one seems to, as a result of the environment, go by codepoints as well.

@tomhughes
Copy link
Member

To be clear rails will count characters, which I'm pretty sure is codepoints (what I think you mean by "scalar values") which is I believe correct and is what cgimap is trying to do.

@xorgy
Copy link
Author

xorgy commented Sep 20, 2022

On a scale of 0-9, how unwelcome would a PR clarifying the language in comments around this be?

@mmd-osm
Copy link
Contributor

mmd-osm commented Sep 20, 2022

Are you asking for Rails port or CGImap? You can always send in PRs for CGImap, which is over at: https://github.com/zerebubuth/openstreetmap-cgimap

@xorgy
Copy link
Author

xorgy commented Sep 20, 2022

Are you asking for Rails port or CGImap? You can always send in PRs for CGImap, which is over at: https://github.com/zerebubuth/openstreetmap-cgimap

I'm thinking in both cases. If there are references to "characters" I would like to clarify that this is scalars/codepoints. The average person reading the word "character" is unlikely to sense that it's not a grapheme, or not a byte.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants