Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings by jordikroon · Pull Request #21224 · php/php-src

jordikroon · 2026-02-15T22:57:20Z

alexdowad · 2026-02-15T23:28:30Z

@jordikroon, thanks for the PR. Some questions:

Where did you first encounter this crash? Did you find it in actual, production code, or by fuzzing PHP to look for crashes? Does the crash only occur when running under ASAN, or also when just running PHP "normally" (i.e. not compiling with ASAN)?
On what platform did you observe the crash? Do you know if it can be reproduced on other platforms?
The reproducer uses an array with 50,000 strings. What is the actual "threshold" size at which crashes occur?
What is the performance impact?

If there is no performance impact, then I would definitely agree with moving to emalloc. If there is non-negligible performance impact, then I would be more interested in the possibility of de-duplicating the $encodings array.

Speaking of which... is there performance impact to running mb_guess_encoding with duplicate values in the $encodings array (apart from the possibility of getting a crash from ASAN)?

jordikroon · 2026-02-16T00:12:33Z

@alexdowad I hope this answers your questions.

I encountered this problem by fuzzing the mbstring specifically under ASAN. But running normally the given code example will give a Segmentation fault (core dumped).
Confirmed to be an issue on MacOS and Linux. So I assume this will hit on all platforms.
The threshold seems vary slightly. On my MacOS machine it seems to crash with a list of 208_629. On Linux 209_192. Not sure where the difference comes from.

Lastly as for the performance impact. It seems non-negligible and both before and after seem be even in performance. Sometimes one is faster than the other. But the longer the input, the slower it will become.

Execution time: 0.009391069~ (before)
Execution time: 0.009022951~ (after)

That said. Duplicate values do seem to affect performance. As in a single UTF-8 is notably faster than a list of 100.000.

alexdowad · 2026-02-16T00:17:55Z

@jordikroon Hmm. You said for macOS you found a threshold of 208,629? But your test code uses an array with 50,000 copies of the string "UTF-8".

jordikroon · 2026-02-16T02:51:50Z

Where did you see 50.000 copies? The test file and example show 500.000 copies.

alexdowad · 2026-02-16T04:36:42Z

Where did you see 50.000 copies? The test file and example show 500.000 copies.

Ah 😅 That explains it.
Sorry for the mistake.

When I have some time, I intend to benchmark this change myself as an additional point of reference to assess the impact.

replace alloca with safe_emalloc in mb_guess_encoding_for_strings

afeac38

jordikroon requested review from alexdowad and youkidearitai as code owners February 15, 2026 22:57

github-actions bot added the Extension: mbstring label Feb 15, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224
jordikroon wants to merge 1 commit intophp:masterfrom
jordikroon:fix/gh-21223

jordikroon commented Feb 15, 2026

Uh oh!

alexdowad commented Feb 15, 2026

Uh oh!

jordikroon commented Feb 16, 2026

Uh oh!

alexdowad commented Feb 16, 2026

Uh oh!

jordikroon commented Feb 16, 2026

Uh oh!

alexdowad commented Feb 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

jordikroon commented Feb 15, 2026

Uh oh!

alexdowad commented Feb 15, 2026

Uh oh!

jordikroon commented Feb 16, 2026

Uh oh!

alexdowad commented Feb 16, 2026

Uh oh!

jordikroon commented Feb 16, 2026

Uh oh!

alexdowad commented Feb 16, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants