Optimize PHP html_entity_decode function #18092

ArtUkrainskiy · 2025-03-16T17:38:04Z

Improvements affect the C function traverse_for_entities:

Use memchr to search for '&' instead of scanning character by character.
Use memchr to locate ';' to determine potential entity boundaries instead of process_named_entity_html, avoiding unnecessary per-character validations.
Use memcpy instead of character-by-character copying.
Refactor code for improved structure and readability.
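
For readers who want to see the shape of the approach, here is a minimal standalone sketch of the memchr/memcpy scan described above. It is not the code from ext/standard/html.c: the four-entry entity table, `MAX_NAME_LEN`, `lookup_entity` and `decode_entities` are all hypothetical names invented for illustration, and the real decoder handles numeric entities, charsets and the full entity maps.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical bound on entity name length, used to limit the ';' search. */
#define MAX_NAME_LEN 32

/* Hypothetical four-entry table; the real decoder uses PHP's full entity maps. */
struct entity { const char *name; size_t len; char ch; };
static const struct entity entities[] = {
    { "amp", 3, '&' }, { "lt", 2, '<' }, { "gt", 2, '>' }, { "quot", 4, '"' },
};

/* Return the decoded character for name[0..len), or 0 if the name is unknown. */
static char lookup_entity(const char *name, size_t len)
{
    for (size_t i = 0; i < sizeof(entities) / sizeof(entities[0]); i++) {
        if (entities[i].len == len && memcmp(entities[i].name, name, len) == 0) {
            return entities[i].ch;
        }
    }
    return 0;
}

/* Decode src[0..len) into dst (at least len + 1 bytes); returns output length. */
static size_t decode_entities(const char *src, size_t len, char *dst)
{
    const char *p = src, *end = src + len;
    char *q = dst;

    while (p < end) {
        /* Jump to the next '&' with memchr instead of testing each byte. */
        const char *amp = memchr(p, '&', (size_t)(end - p));
        if (amp == NULL) {
            amp = end;
        }
        /* Copy the literal run before '&' with a single memcpy. */
        memcpy(q, p, (size_t)(amp - p));
        q += amp - p;
        p = amp;
        if (p == end) {
            break;
        }

        /* Search a bounded window after '&' for the terminating ';'. */
        size_t rest = (size_t)(end - (p + 1));
        size_t window = rest < MAX_NAME_LEN + 1 ? rest : MAX_NAME_LEN + 1;
        const char *semi = memchr(p + 1, ';', window);
        char ch = semi ? lookup_entity(p + 1, (size_t)(semi - (p + 1))) : 0;

        if (ch != 0) {
            *q++ = ch;      /* known entity: emit the decoded character */
            p = semi + 1;
        } else {
            *q++ = '&';     /* not an entity: keep '&' and continue after it */
            p++;
        }
    }
    *q = '\0';
    return (size_t)(q - dst);
}
```

In this sketch every '&' costs at least one memchr call plus a lookup, which is consistent with the pattern in the benchmark below: inputs with few '&' characters gain the most, while inputs made almost entirely of '&' pay the per-call overhead repeatedly.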

Benchmark branch - https://github.com/ArtUkrainskiy/php-src/tree/html_entity_decode/benchmark

------------------------------------------------------------------------------
|                      Test |     old avg(ns) |     new avg(ns) |    diff(%) |
------------------------------------------------------------------------------
|                      4k & |            5949 |           21115 |    -71.98% |
------------------------------------------------------------------------------
|             only entities |            8279 |           10202 |    -18.80% |
------------------------------------------------------------------------------
|        400 valid entities |            6439 |            5861 |      7.80% |
------------------------------------------------------------------------------
|        200 valid entities |            4891 |            3178 |     38.12% |
------------------------------------------------------------------------------
|        200 invalid entity |            4777 |            3181 |     37.29% |
------------------------------------------------------------------------------
|             200 ampersand |            4809 |            1221 |    198.35% |
------------------------------------------------------------------------------
|        100 valid entities |            4188 |            1777 |    124.49% |
------------------------------------------------------------------------------
|         50 valid entities |            2885 |             979 |    193.50% |
------------------------------------------------------------------------------
|        String ends with & |            2428 |             176 |   1221.69% |
------------------------------------------------------------------------------

As you can see, the speedup depends on the number of entities and & characters in the string — the fewer there are, the more noticeable the performance improvement.

In edge cases, where the string consists entirely of & characters or valid HTML entities, performance actually worsens. However, I don't think this is a common scenario.

Either way, I plan to continue optimizing and implement & scanning using SIMD instructions, which should significantly improve performance even in high-entity-density cases.
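
To illustrate what SIMD scanning for '&' could look like, here is a rough sketch using SSE2 intrinsics with a memchr fallback. This is only an assumption about one possible approach, not code from this PR or from php-src: `find_ampersand` is a hypothetical helper, and the fast path relies on GCC/Clang (`__SSE2__`, `__builtin_ctz`). Note that many libc memchr implementations are already vectorized, so any gain would need to be measured.

```c
#include <stddef.h>
#include <string.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* Hypothetical helper: pointer to the first '&' in s[0..len), or NULL. */
static const char *find_ampersand(const char *s, size_t len)
{
    size_t i = 0;
#if defined(__SSE2__)
    const __m128i needle = _mm_set1_epi8('&');
    /* Compare 16 bytes at a time; each set mask bit marks a matching byte. */
    for (; i + 16 <= len; i += 16) {
        __m128i chunk = _mm_loadu_si128((const __m128i *)(s + i));
        int mask = _mm_movemask_epi8(_mm_cmpeq_epi8(chunk, needle));
        if (mask != 0) {
            /* The lowest set bit is the first match within this chunk. */
            return s + i + __builtin_ctz((unsigned)mask);
        }
    }
#endif
    /* Tail shorter than 16 bytes, or the whole input when SSE2 is unavailable. */
    return memchr(s + i, '&', len - i);
}
```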

ext/standard/html.c

bukka · 2025-03-17T14:18:21Z

@Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Girgias · 2025-03-17T14:23:11Z

> @Girgias are you going to review the logic as well? Just checking if I should look into this or if you are happy to handle it all?

Please do review the logic, I only had a cursory glance :)

bukka · 2025-03-17T14:32:04Z

Ok I will check it out next week if no one is quicker.

dragoonis · 2025-03-25T11:36:09Z

Nice idea @ArtUkrainskiy :-) 👍

bukka

I think it looks reasonable, except for the introduction of the valid_entity boolean and the checks on it everywhere, which doesn't look like a performance optimization to me. I understand that it was probably meant to make the code more readable, but I'm not sure it's worth it in this case. It might be worth checking whether it has any impact.

ext/standard/html.c

bukka

This looks better. The comments are pretty much only about minor issues and optimizations. Overall I think it looks good.

ext/standard/html.c

bukka · 2025-06-15T12:13:07Z

I went through this again and think it looks good. It doesn't make sense to hold it up because of a few nits, which I can easily address during the merge. I will try to do a bit of testing in about two weeks' time and merge it then.

Optimize scanning for '&' and ';' using memchr. Use memcpy instead of character-by-character copying language. Closes phpGH-18092

bukka · 2025-07-07T16:26:24Z

I have done a bit of testing. I also fixed a few nits and squashed and rebased it, so I think it should be good enough. I will do one last round of testing in a couple of weeks and, if I don't find anything, I will merge it.

bukka · 2025-07-21T18:59:11Z

I did a bit more checking and it all seems fine, so I merged it to master. Thanks for the contribution!

ArtUkrainskiy requested a review from bukka as a code owner March 16, 2025 17:38

github-actions bot added the Extension: standard label Mar 16, 2025

ArtUkrainskiy closed this Mar 16, 2025

ArtUkrainskiy reopened this Mar 16, 2025

ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from d166abe to 66f5709 Compare March 16, 2025 17:52

Girgias reviewed Mar 17, 2025

ArtUkrainskiy force-pushed the html_entity_decode/improve-memchr branch from 9b3e96d to f093c30 Compare March 17, 2025 16:30

ArtUkrainskiy requested a review from Girgias March 26, 2025 19:04

ArtUkrainskiy marked this pull request as draft March 26, 2025 19:05

ArtUkrainskiy marked this pull request as ready for review March 29, 2025 11:23

bukka reviewed Mar 29, 2025

ArtUkrainskiy requested a review from bukka March 30, 2025 18:43

bukka reviewed Apr 16, 2025

Refactor traverse_for_entities for unescape_html_entities

10589dc

Optimize scanning for '&' and ';' using memchr. Use memcpy instead of character-by-character copying language. Closes phpGH-18092

bukka force-pushed the html_entity_decode/improve-memchr branch from 5f8363b to 10589dc Compare July 7, 2025 16:24

bukka closed this in e0c3f46 Jul 21, 2025
