nikic
Member

@nikic nikic commented Mar 21, 2019

This is the implementation for https://bugs.php.net/bug.php?id=77744. If PREG_LENGTH_CAPTURE is used, captured strings are replaced with their length instead. Generally this is only useful in conjunction with PREG_OFFSET_CAPTURE, in which case the offset + length together allow you to extract the captured string manually.

The motivation is to avoid copying large captured substrings if not necessary.
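
For illustration, here is a minimal userland sketch of the result shape described above, assuming the flag would return `[length, offset]` pairs where `PREG_OFFSET_CAPTURE` alone returns `[string, offset]` pairs. The helper name `length_capture` is hypothetical and not part of this PR. Because it is built on the existing `PREG_OFFSET_CAPTURE` flag, it still copies the substrings internally, so it only shows the interface, not the performance benefit.

```php
<?php
// Hypothetical helper (not part of the PR): emulates the assumed result
// shape of PREG_LENGTH_CAPTURE | PREG_OFFSET_CAPTURE in userland.
function length_capture(string $pattern, string $subject, int $offset = 0): ?array
{
    if (preg_match($pattern, $subject, $m, PREG_OFFSET_CAPTURE, $offset) !== 1) {
        return null; // no match, or a PCRE error
    }
    // Each entry of $m is [captured string, byte offset];
    // replace the string with its byte length.
    return array_map(fn(array $g) => [strlen($g[0]), $g[1]], $m);
}

$subject = 'key=value';
$groups  = length_capture('/(\w+)=(\w+)/', $subject);

// Extract the second group manually, only when it is actually needed.
[$len, $off] = $groups[2];
$value = substr($subject, $off, $len);
```

With this input, `$groups[2]` holds length 5 at byte offset 4, so the `substr()` call recovers `"value"` without the caller ever holding the other captures as strings.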

@nikic nikic added the Feature label Mar 21, 2019
@nikic
Member Author

nikic commented May 9, 2019

@cscott Based on your benchmarks, guess I can drop this one as not really worthwhile?

@cmb69
Member

cmb69 commented Jul 19, 2021

What is the status here? @cscott?

@iluuu1994
Member

Closing as there was no response.

@iluuu1994 iluuu1994 closed this Apr 18, 2022
@cscott
Contributor

cscott commented Jul 3, 2025

Sorry for the unresponsiveness. PHP tokenizers still suffer a performance penalty compared to JS. One reason is that we don't have the equivalent of mb_ord_at($string, $offset) to allow multi-byte character comparisons to be done numerically without creating a bunch of 1- to 4-byte substrings.

But the preg_match interface could be improved, too. preg_match() is optimized if the match array is omitted, but the only way to use the $offset parameter is by providing a match array. PREG_LENGTH_CAPTURE would still help by avoiding the more expensive part of creating a match array. Something like preg_match_at() might also help by avoiding the creation of the array altogether, but you still have to advance "one character" once you have the match, and that requires knowing the match length for UTF-8 strings. So I think this option would still be very beneficial.
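
To make the tokenizer case concrete, here is a rough sketch of the kind of loop in question, using only the existing preg_match() API. The function name `tokenize` and the token pattern are illustrative assumptions, not code from any real tokenizer. Note that the loop needs only the byte length of each match to advance, yet it has to receive a copy of the matched text in `$m[0]` to get it.

```php
<?php
// Hypothetical tokenizer loop; the pattern is an illustration only.
// Returns a list of [byte offset, byte length] pairs.
function tokenize(string $input): array
{
    $tokens = [];
    $pos    = 0;
    $end    = strlen($input);
    while ($pos < $end) {
        // \G anchors the match at $pos. With the "u" modifier, "." matches
        // one whole UTF-8 character, which may be 1 to 4 bytes long.
        if (preg_match('/\G(?:\s+|\w+|.)/us', $input, $m, 0, $pos) !== 1) {
            break; // e.g. invalid UTF-8 in the input
        }
        // $m[0] is a copy of the matched text; only its length is used.
        $matchLen = strlen($m[0]);
        $tokens[] = [$pos, $matchLen];
        $pos     += $matchLen;
    }
    return $tokens;
}

$tokens = tokenize("é = 42");
```

The first token here spans two bytes because "é" is a two-byte UTF-8 sequence, which is why the loop cannot simply advance by one byte after a single-character match.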
