Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Different preg_match result with -d pcre.jit=0 #11374

Closed
mvorisek opened this issue Jun 5, 2023 · 16 comments
Closed

Different preg_match result with -d pcre.jit=0 #11374

mvorisek opened this issue Jun 5, 2023 · 16 comments

Comments

@mvorisek
Copy link
Contributor

mvorisek commented Jun 5, 2023

Description

related with PHP-CS-Fixer/PHP-CS-Fixer#6997

How to reproduce:

  1. clone https://github.com/PHP-CS-Fixer/PHP-CS-Fixer.git (latest master or 3.17 tag)
  2. run composer update
  3. run php -d pcre.jit=0 vendor/phpunit/phpunit/phpunit --filter TypeExpressionTest
  4. the tests will fail, and notice, with -d pcre.jit=1 the tests are passing

Expected result:

pcre.jit config should have no effect on the preg_match result

The problematic regex is probably https://github.com/PHP-CS-Fixer/PHP-CS-Fixer/blob/v3.17.0/src/DocBlock/TypeExpression.php#L32 - is there any easy workaround until php-src/PCRE is fixed?

PHP Version

PHP 7.4 - 8.2

Operating System

Windows and Unix

@damianwadley
Copy link
Member

Huh. It's backwards: normally these sorts of bugs happen with jit=1 and turning it off allows the regex to work...

That's a gigantic regex. Can you boil it down to something manageable that still reproduces?

And either way, does it also happen with the pcretest utility? If so then it's an upstream bug, which is my suspicion, and not PHP itself.
https://www.pcre.org/original/doc/html/pcretest.html

@damianwadley
Copy link
Member

damianwadley commented Jun 5, 2023

is there any easy workaround until php-src/PCRE is fixed?

Sure: turn on JIT 😉

Otherwise, maybe? You could experiment with the regex - add/remove optimizations, that sort of thing - to see if you can trigger some different internal code paths to bypass wherever the bug is. It'd be easier to do that if you could narrow down the source of the problem first, of course.

And there's always the option of breaking it down into smaller regexes... I mean really, that's 100+ lines of (formatted) regex in there. May or may not help, but it might be a nice gain regardless.

@Wirone
Copy link

Wirone commented Jun 5, 2023

Sure: turn on JIT 😉

In my case there's PCRE JIT Support => not compiled in so I can't enable it (asdf's PHP on M1 Mac). I can workaround it with Docker setup, but I just wanted to point that it's not always possible to use PCRE JIT.

@iluuu1994
Copy link
Member

iluuu1994 commented Jun 5, 2023

The issue is indeed cause by JIT, not the other way around. Simpler reproducible case:

$regex = '
    (?<types>
        (?:
            (?:\{ (?&types) \})
            | (a)
        )
        (\*?)
    )
';

var_dump(preg_match('{^' . $regex . '$}x', '{a}', $matches), $matches);

This is the correct output, with pcre.jit=0.

int(1)
array(5) {
  [0]=>
  string(3) "{a}"
  ["types"]=>
  string(3) "{a}"
  [1]=>
  string(3) "{a}"
  [2]=>
  string(1) "a"
  [3]=>
  string(0) ""
}

This is the incorrect output, with pcre.jit=1.

int(1)
array(5) {
  [0]=>
  string(3) "{a}"
  ["types"]=>
  string(3) "{a}"
  [1]=>
  string(3) "{a}"
  [2]=>
  string(0) ""
  [3]=>
  string(0) ""
}

Note the missing capture for group 2.

Looking at the offsets, without JIT:

0 - 3
0 - 3
1 - 2
3 - 3

With JIT:

0 - 3
0 - 3
18446744073709551615 - 18446744073709551615
3 - 3

So, the issue seems to originate from PCRE. I'll check if the issue was already fixed/reported upstream.

Edit: Things are a bit more complicated. Removing the (\*?) captures 2 groups instead of the original 4, which makes no sense to me.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jun 5, 2023

Here is online repro - https://3v4l.org/W4UkB, problem is present somewhere between PCRE >8.41 and <= 10.32 versions.

@iluuu1994
Copy link
Member

Yeah so it does seem JIT is the odd one out after all, although the output doesn't make sense to me.

@remicollet
Copy link
Member

remicollet commented Jun 5, 2023

problem is present somewhere between PCRE >8.41 and <= 10.32 versions.

  • 8.x is libpcre used by PHP <= 7.2
  • 10.x is libpcre2 used by PHP >= 7.3

@iluuu1994
Copy link
Member

#include <stdio.h>
#include <string.h>
#include <stdbool.h>

#define PCRE2_CODE_UNIT_WIDTH 8
#include <pcre2.h>

void test(bool use_jit) {
	const char *pattern = "(?<all>(?:(?:a(?&all))|(b))(c?))";
	const char *subject = "aabc";

	int errnumber;
	PCRE2_SIZE erroffset;
	pcre2_code *re = pcre2_compile((PCRE2_SPTR)pattern, strlen(pattern), 0, &errnumber, &erroffset, NULL);
	if (use_jit) {
		pcre2_jit_compile(re, PCRE2_JIT_COMPLETE);
	}

	pcre2_match_data *match_data = pcre2_match_data_create_from_pattern(re, NULL);
	int count;
	if (!use_jit) {
		count = pcre2_match(re, (PCRE2_SPTR)subject, strlen(subject), 0, 0, match_data, NULL);
	} else {
		count = pcre2_jit_match(re, (PCRE2_SPTR)subject, strlen(subject), 0, 0, match_data, NULL);
	}

	PCRE2_SIZE *offsets = pcre2_get_ovector_pointer(match_data);
	printf("%s\n", subject);
	printf("%d\n", count);
	for (uint32_t i = 0; i < count * 2; i += 2) {
		printf("%lu - %lu\n", offsets[i], offsets[i + 1]);
	}

	pcre2_match_data_free(match_data);
	pcre2_code_free(re);
}

int main(void) {
	test(false);
	test(true);
	return 0;
}
aabc
4
0 - 4
0 - 4
2 - 3
4 - 4
aabc
4
0 - 4
0 - 4
18446744073709551615 - 18446744073709551615
4 - 4

The problem is reproducible with pcre-only.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jun 6, 2023

@iluuu1994 like in https://3v4l.org/5YhJP I would like to test parsing /wo and /w PCRE JIT in phpunit tests, but I cannot change the regexes (as they are not defined in the tests) - what are the options I can convince php to not cache the regexes, can I clear the regex cache by some php/userland function?

@iluuu1994
Copy link
Member

iluuu1994 commented Jun 6, 2023

@mvorisek An implementation that proves there's no performance hit. That's very unlikely. A function to clear the cache might be reasonable. Alternatively we might incorporate PCRE_G(jit) in the cache-key, or use separate caches.

@mvorisek
Copy link
Contributor Author

mvorisek commented Jun 6, 2023

@mvorisek Alternatively we might incorporate PCRE_G(jit) in the cache-key, or use separate caches.

In general clearing cache should not be needed if the cache is not mutated during matching, but when the php.ini is changed, no cache with the old setting should be used. So either incorporate PCRE_G(jit) in the cache key or delete the current cache when pcre.jit php.init is changed. (should imply zero perfomance drop, as the php.ini is normally not changed during request)

For current php versions, I coded the cache clearing by dummy matching 4096 regexes - https://3v4l.org/NrsHc

--

the cache key is build in: https://github.com/php/php-src/blob/32968f8de0/ext/pcre/php_pcre.c#L615-L619

the pcre.jit ini is set in: https://github.com/php/php-src/blob/32968f8de0/ext/pcre/php_pcre.c#L358

@iluuu1994
Copy link
Member

@mvorisek As you have probably seen I had to revert the last PR due to bit performance implications. I also closed #11511 for the same reason. The performance drop is smaller but still more than I'd expect. The alternative approach would be to clear the cache, but that would require at least an e-mail to the mailing list. I don't think it's worth it, given this is an edge case and should be fixed in PCRE. If you'd like to start this discussion, please send a mail to the internals list. Here's a patch:

Patch

diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
index 6ad0b6eb76..b4a94a6ee4 100644
--- a/ext/pcre/php_pcre.c
+++ b/ext/pcre/php_pcre.c
@@ -2961,6 +2961,30 @@ PHP_FUNCTION(preg_last_error_msg)
 }
 /* }}} */
 
+PHP_FUNCTION(preg_cache_clear)
+{
+	ZEND_PARSE_PARAMETERS_NONE();
+
+	zend_hash_clean(&PCRE_G(pcre_cache));
+}
+
+PHP_FUNCTION(preg_cache_remove)
+{
+	zend_string *regex;
+
+	ZEND_PARSE_PARAMETERS_START(1, 1)
+		Z_PARAM_STR(regex)
+	ZEND_PARSE_PARAMETERS_END();
+
+	zend_hash_del(&PCRE_G(pcre_cache), regex);
+
+	if (BG(ctype_string)) {
+		zend_string *key = zend_string_concat2(ZSTR_VAL(BG(ctype_string)), ZSTR_LEN(BG(ctype_string)), ZSTR_VAL(regex), ZSTR_LEN(regex));
+		zend_hash_del(&PCRE_G(pcre_cache), regex);
+		zend_string_release(key);
+	}
+}
+
 /* {{{ module definition structures */
 
 zend_module_entry pcre_module_entry = {
diff --git a/ext/pcre/php_pcre.stub.php b/ext/pcre/php_pcre.stub.php
index 1b06075885..de361bf1dc 100644
--- a/ext/pcre/php_pcre.stub.php
+++ b/ext/pcre/php_pcre.stub.php
@@ -140,3 +140,7 @@ function preg_grep(string $pattern, array $array, int $flags = 0): array|false {
 function preg_last_error(): int {}
 
 function preg_last_error_msg(): string {}
+
+function preg_cache_clear(): void {}
+
+function preg_cache_remove(string $pattern): void {}
diff --git a/ext/pcre/php_pcre_arginfo.h b/ext/pcre/php_pcre_arginfo.h
index a4132e28e5..b9cdbeb1f9 100644
--- a/ext/pcre/php_pcre_arginfo.h
+++ b/ext/pcre/php_pcre_arginfo.h
@@ -1,5 +1,5 @@
 /* This is a generated file, edit the .stub.php file instead.
- * Stub hash: 7f27807e45df9c9b5011aa20263c9789896acfbc */
+ * Stub hash: cd384188597be4586ac0df414a2d831ed5b31edd */
 
 ZEND_BEGIN_ARG_WITH_RETURN_TYPE_MASK_EX(arginfo_preg_match, 0, 2, MAY_BE_LONG|MAY_BE_FALSE)
 	ZEND_ARG_TYPE_INFO(0, pattern, IS_STRING, 0)
@@ -62,6 +62,13 @@ ZEND_END_ARG_INFO()
 ZEND_BEGIN_ARG_WITH_RETURN_TYPE_INFO_EX(arginfo_preg_last_error_msg, 0, 0, IS_STRING, 0)
 ZEND_END_ARG_INFO()
 
+ZEND_BEGIN_ARG_WITH_RETURN_TYPE_INFO_EX(arginfo_preg_cache_clear, 0, 0, IS_VOID, 0)
+ZEND_END_ARG_INFO()
+
+ZEND_BEGIN_ARG_WITH_RETURN_TYPE_INFO_EX(arginfo_preg_cache_remove, 0, 1, IS_VOID, 0)
+	ZEND_ARG_TYPE_INFO(0, pattern, IS_STRING, 0)
+ZEND_END_ARG_INFO()
+
 
 ZEND_FUNCTION(preg_match);
 ZEND_FUNCTION(preg_match_all);
@@ -74,6 +81,8 @@ ZEND_FUNCTION(preg_quote);
 ZEND_FUNCTION(preg_grep);
 ZEND_FUNCTION(preg_last_error);
 ZEND_FUNCTION(preg_last_error_msg);
+ZEND_FUNCTION(preg_cache_clear);
+ZEND_FUNCTION(preg_cache_remove);
 
 
 static const zend_function_entry ext_functions[] = {
@@ -88,6 +97,8 @@ static const zend_function_entry ext_functions[] = {
 	ZEND_FE(preg_grep, arginfo_preg_grep)
 	ZEND_FE(preg_last_error, arginfo_preg_last_error)
 	ZEND_FE(preg_last_error_msg, arginfo_preg_last_error_msg)
+	ZEND_FE(preg_cache_clear, arginfo_preg_cache_clear)
+	ZEND_FE(preg_cache_remove, arginfo_preg_cache_remove)
 	ZEND_FE_END
 };

@iluuu1994 iluuu1994 closed this as not planned Won't fix, can't repro, duplicate, stale Jun 27, 2023
@mvorisek
Copy link
Contributor Author

mvorisek commented Jun 27, 2023

What about:

diff --git a/ext/pcre/php_pcre.c b/ext/pcre/php_pcre.c
index 6249a80..c381706 100644
--- a/ext/pcre/php_pcre.c
+++ b/ext/pcre/php_pcre.c
@@ -638,10 +638,15 @@ PHPAPI pcre_cache_entry* pcre_get_compiled_regex_cache_ex(zend_string *regex, in
 	   back the compiled pattern, otherwise go on and compile it. */
 	zv = zend_hash_find(&PCRE_G(pcre_cache), key);
 	if (zv) {
-		if (key != regex) {
-			zend_string_release_ex(key, 0);
+		pcre_cache_entry *pce = (pcre_cache_entry*)Z_PTR_P(zv);
+		if (!(pce->preg_options & PREG_JIT) == !PCRE_G(jit)) {
+			if (key != regex) {
+				zend_string_release_ex(key, 0);
+			}
+			return (pcre_cache_entry*)Z_PTR_P(zv);
+		} else {
+			zend_hash_del(&PCRE_G(pcre_cache), key);
 		}
-		return (pcre_cache_entry*)Z_PTR_P(zv);
 	}
 
 	p = ZSTR_VAL(regex);

It should have almost zero performance effect and will keep cache for usecases which switch the PCRE JIT flag for one regex and then restore the PCRE JIT flag back.

@iluuu1994
Copy link
Member

if (!(pce->preg_options & PREG_JIT) == !PCRE_G(jit)) {

This will repeatedly try jitting regexes that have failed JIT compilation. You need a separate flag as in #11511 to avoid that.

Let's see if this is faster than #11511. It avoids an additional PCRE_G(jit) check, so it might be slightly faster.

@mvorisek
Copy link
Contributor Author

This will repeatedly try jitting regexes that have failed JIT compilation.

as always/currently ;-) (PCRE JIT flag is enabled by default)

@iluuu1994
Copy link
Member

iluuu1994 commented Jun 27, 2023

as always/currently ;-) (PCRE JIT flag is enabled by default)

No. Currently, regexes are compiled, then JIT is attempted, and then executed with JIT if succeeded. Otherwise it is interpreted. Later executions don't retry compilation.

nielsdos pushed a commit that referenced this issue Oct 27, 2023
nielsdos added a commit that referenced this issue Oct 27, 2023
* PHP-8.1:
  Fix GH-11374: Different preg_match result with -d pcre.jit=0
nielsdos added a commit that referenced this issue Oct 27, 2023
* PHP-8.2:
  Fix GH-11374: Different preg_match result with -d pcre.jit=0
nielsdos added a commit that referenced this issue Oct 27, 2023
* PHP-8.3:
  Fix GH-11374: Different preg_match result with -d pcre.jit=0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants