Skip to content

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224

Open
jordikroon wants to merge 1 commit intophp:masterfrom
jordikroon:fix/gh-21223
Open

Fix GH-21223: ext/mbstring replace alloca with safe_emalloc in mb_guess_encoding_for_strings#21224
jordikroon wants to merge 1 commit intophp:masterfrom
jordikroon:fix/gh-21223

Conversation

@jordikroon
Copy link
Contributor

Fixes #21223

@alexdowad
Copy link
Contributor

@jordikroon, thanks for the PR. Some questions:

  • Where did you first encounter this crash? Did you find it in actual, production code, or by fuzzing PHP to look for crashes? Does the crash only occur when running under ASAN, or also when just running PHP "normally" (i.e. not compiling with ASAN)?
  • On what platform did you observe the crash? Do you know if it can be reproduced on other platforms?
  • The reproducer uses an array with 50,000 strings. What is the actual "threshold" size at which crashes occur?
  • What is the performance impact?

If there is no performance impact, then I would definitely agree with moving to emalloc. If there is non-negligible performance impact, then I would be more interested in the possibility of de-duplicating the $encodings array.

Speaking of which... is there performance impact to running mb_guess_encoding with duplicate values in the $encodings array (apart from the possibility of getting a crash from ASAN)?

@jordikroon
Copy link
Contributor Author

@alexdowad I hope this answers your questions.

  • I encountered this problem by fuzzing the mbstring specifically under ASAN. But running normally the given code example will give a Segmentation fault (core dumped).
  • Confirmed to be an issue on MacOS and Linux. So I assume this will hit on all platforms.
  • The threshold seems vary slightly. On my MacOS machine it seems to crash with a list of 208_629. On Linux 209_192. Not sure where the difference comes from.

Lastly as for the performance impact. It seems non-negligible and both before and after seem be even in performance. Sometimes one is faster than the other. But the longer the input, the slower it will become.

Execution time: 0.009391069~ (before)
Execution time: 0.009022951~ (after)

That said. Duplicate values do seem to affect performance. As in a single UTF-8 is notably faster than a list of 100.000.

@alexdowad
Copy link
Contributor

@jordikroon Hmm. You said for macOS you found a threshold of 208,629? But your test code uses an array with 50,000 copies of the string "UTF-8".

@jordikroon
Copy link
Contributor Author

Where did you see 50.000 copies? The test file and example show 500.000 copies.

@alexdowad
Copy link
Contributor

Where did you see 50.000 copies? The test file and example show 500.000 copies.

Ah 😅 That explains it.
Sorry for the mistake.

When I have some time, I intend to benchmark this change myself as an additional point of reference to assess the impact.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

ext/mbstring: stack overflow in mb_guess_encoding called via mb_detect_encoding

2 participants