iluuu1994
Member

Alternative to GH-10870. Use atomic writes for adding the IS_STR_VALID_UTF8 flag to UTF-8-verified interned strings in ext-mbstring. x86 and other architectures guarantee atomic reads and writes for aligned variables up to the size of size_t, which we already rely on, notably for zend_op.handler being swapped out by the JIT. The atomic write is only needed here so that we don't drop other bits written concurrently (of which there currently are none). We use the GCC atomic and sync builtins because they don't require annotating the modified variable with the C11 _Atomic keyword.
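
To illustrate the idea, here is a minimal sketch of OR-ing a flag into a plain (non-`_Atomic`) 32-bit word with the GCC builtins. The names `demo_string`, `demo_add_flag` and the `DEMO_FLAG_*` values are hypothetical and chosen for illustration; this is not the code from the PR, and the real flag values in php-src differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag bits, for illustration only (not PHP's real values). */
#define DEMO_FLAG_INTERNED   (1u << 0)
#define DEMO_FLAG_VALID_UTF8 (1u << 1)

/* Hypothetical stand-in for a shared string header. Note: no _Atomic. */
typedef struct {
    uint32_t type_info;
} demo_string;

/* Atomically OR `flag` into the flag word and return the previous value.
 * Bits that are already set, or set concurrently by others, are kept. */
static inline uint32_t demo_add_flag(demo_string *s, uint32_t flag)
{
    assert(s != NULL);
#if defined(__ATOMIC_SEQ_CST)
    /* GCC/Clang __atomic builtin, operating on an ordinary integer. */
    return __atomic_fetch_or(&s->type_info, flag, __ATOMIC_SEQ_CST);
#else
    /* Older __sync builtin as a fallback; acts as a full barrier. */
    return __sync_fetch_and_or(&s->type_info, flag);
#endif
}
```

Calling `demo_add_flag(&s, DEMO_FLAG_VALID_UTF8)` on a string that already carries `DEMO_FLAG_INTERNED` leaves both bits set, which is the property the plain `|=` cannot guarantee under concurrency.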
@bwoebi
Member

bwoebi commented Oct 7, 2025

Could ... we export that as ZEND_API or inline header function in zend_string.h ... maybe?

I do have quite some interest in using this API from extension code.

@nielsdos
Member

nielsdos commented Oct 7, 2025

I know it's early, but I believe that AcqRel consistency should be good enough and full sequential consistency is "too much".

@iluuu1994
Member Author

> Could ... we export that as ZEND_API or inline header function in zend_string.h ... maybe?

I'm open to it. The only gotcha is that mark_zstr_as_utf8() is currently not safe to call when SHM is unprotected (because it will protect it on exit).

> I know it's early, but I believe that AcqRel consistency should be good enough and full sequential consistency is "too much".

I'm open to it. FWIU even relaxed should be enough, given this bit is effectively "fire and forget" and order isn't crucial in this or any other thread. But this should also be a very rare operation (max one per string per lifetime of process), so I didn't consider it particularly important.
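
A rough sketch of why relaxed ordering might suffice, assuming the string bytes are immutable and the flag is purely a cache of a validation result: a reader that misses the flag only pays for a re-validation. All names here (`demo_string`, `demo_validate`, `demo_is_valid_utf8`, `DEMO_FLAG_VALID_UTF8`) are hypothetical, and the validator is a toy that accepts 7-bit ASCII only, not a real UTF-8 check.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_FLAG_VALID_UTF8 (1u << 1) /* hypothetical bit value */

typedef struct {
    uint32_t type_info; /* shared flag word, not declared _Atomic */
    size_t len;
    const char *val;
} demo_string;

/* Toy validator: accepts 7-bit ASCII only. Stand-in for a real UTF-8 check. */
static bool demo_validate(const demo_string *s)
{
    for (size_t i = 0; i < s->len; i++) {
        if ((unsigned char)s->val[i] > 0x7F) {
            return false;
        }
    }
    return true;
}

static bool demo_is_valid_utf8(demo_string *s)
{
    assert(s != NULL);
    /* Relaxed load: a stale "flag not set" only costs a re-validation. */
    if (__atomic_load_n(&s->type_info, __ATOMIC_RELAXED) & DEMO_FLAG_VALID_UTF8) {
        return true;
    }
    if (!demo_validate(s)) {
        return false;
    }
    /* Relaxed read-modify-write: still indivisible, so other bits in the
     * word are preserved; it just imposes no ordering on other memory. */
    __atomic_fetch_or(&s->type_info, DEMO_FLAG_VALID_UTF8, __ATOMIC_RELAXED);
    return true;
}
```

The memory order only affects how this write is ordered relative to other memory accesses; the atomicity of the OR itself does not depend on it.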

@alexdowad
Contributor

It's been a few years since I worked on issues related to the IS_STR_VALID_UTF8 flag. Can you remind me why we originally didn't set the flag on interned strings? (Except for the few canonical strings like the canonical empty string...)

@iluuu1994
Member Author

@alexdowad Data races. Interned strings are shared across processes and threads. If two threads do a non-atomic read-modify-write on the same flags word at the same time, one of the two updates can be lost. That wouldn't be critical in this case, and we don't currently add any other bits, so a direct write would probably also be OK. Regardless, if we ever tried to do this with some other flag, it's better not to have to go hunt down data races.
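
The lost update can be replayed deterministically. The sketch below is single-threaded on purpose: it spells out one bad interleaving of two plain `word |= flag` updates step by step, then shows the same two updates done with an atomic fetch-or. `DEMO_FLAG_A`, `DEMO_FLAG_B` and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_FLAG_A (1u << 0) /* hypothetical flag written by "thread A" */
#define DEMO_FLAG_B (1u << 1) /* hypothetical flag written by "thread B" */

/* Replays one racy interleaving of two plain `word |= flag` updates. */
static uint32_t demo_racy_interleaving(uint32_t word)
{
    uint32_t seen_by_a = word;      /* A loads the word */
    uint32_t seen_by_b = word;      /* B loads the same old value */
    word = seen_by_a | DEMO_FLAG_A; /* A stores its result */
    word = seen_by_b | DEMO_FLAG_B; /* B stores, overwriting A's bit */
    return word;
}

/* Each fetch-or is one indivisible load-OR-store, so the interleaving
 * above cannot happen between the load and the store of one update. */
static uint32_t demo_atomic_updates(uint32_t word)
{
    __atomic_fetch_or(&word, DEMO_FLAG_A, __ATOMIC_RELAXED);
    __atomic_fetch_or(&word, DEMO_FLAG_B, __ATOMIC_RELAXED);
    return word;
}

/* Small self-check of both paths, starting from an empty flag word. */
static void demo_self_check(void)
{
    assert(demo_racy_interleaving(0) == DEMO_FLAG_B);
    assert(demo_atomic_updates(0) == (DEMO_FLAG_A | DEMO_FLAG_B));
}
```

In the racy replay only `DEMO_FLAG_B` survives, while the atomic version ends with both bits set.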
