-
Notifications
You must be signed in to change notification settings - Fork 7.9k
adds PREG_NO_UTF8_CHECK flag #2035
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@@ -32,6 +32,7 @@ | |||
#define PREG_PATTERN_ORDER 1 | |||
#define PREG_SET_ORDER 2 | |||
#define PREG_OFFSET_CAPTURE (1<<8) | |||
#define PREG_NO_UTF8_CHECK PCRE_NO_UTF8_CHECK |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I couldn't grasp the reasoning behind the various existing PREG_*
constants, so I just used PCRE's value.
Please tell me if this should be changed.
Fixes bug #72685. This commit adds a way to set the PCRE_NO_UTF8_CHECK flag for `preg_match`, `preg_match_all` and `preg_split` php functions. Without the ability to set this flag, matching repeatedly against the same input in utf-8 mode yields quadratic performance instead of linear, since PCRE then checks the whole subject for UTF-8 validity on every call.
8f07584
to
20fd688
Compare
@ju1ius thanks for the report. I would suggest to turn your PHP code into the test and add to the PR. I also think it is required to add tests with invalid UTF-8 strings. While this option can be of course useful, it looks dangerous to be allowed just as is. The PCRE man page warns explicitly about possible bad consequences for the case this option is used incorrectly. Not sure, maybe hashing the subject string before were an option, if utf-8 check is disabled. Or some other way to ensure same subject string is passed over again. Thanks. |
Yep, just doing that ATM 😉 |
@weltling Wouldn't hashing the string on each invocation kind of defeat the point ? Also, no pun intended but PHP already has an history of trying to be automagically safe, failing at that, thus creating security holes and (what's worth) long lasting bad habits among it's community (remember magic_quotes, safe_mode, et al.) We already have a function to check for UTF-8 validity: so IMO the right solution would be to document this flag with a big, bold, yellow warning like:
And include a small code sample like: if (!mb_check_encoding($input, 'utf-8')) {
throw new \InvalidArgumentException('Invalid UTF-8 string');
}
// Now match a sequence of whitespace separated words.
$pos = 0;
$words = [];
while(preg_match('/\G\s*(\w+)\s*/u', $input, $matches, PREG_NO_UTF8_CHECK, $pos)) {
$pos += strlen($matches[0]);
$words[] = $matches[1];
} What do you think? |
@weltling I just added a few tests based on what I could find inside the
I'm not sure how to do that, the above code is more a benchmark. Can you point me to existing performance tests ? |
Yeah, it is a clear that the validation has to happen before just once. What I'm talking about is the case where no validation happened before. PCRE promises undefined behavior or even a crash. Imho not a good idea just opening a rabbit hole without any defense. Documentation is a good thing, but maybe more could be done. Hashing was just the first thing that came into my mind - it iis supposed to be faster than utf-8 check. A smarter idea is welcome :) Thanks. |
With the benchmark test - you can pick a criteria. Say the time difference is expected to be not more that X ms, etc. Or even seconds. Just to be sure that even the rough expectations are ensured. Thanks. |
Yes, I think just ensuring a roughly linear complexity should be enough without slowing down the builds too much. |
could php-src somehow remember which strings already have been "validated" and set this flag automatically instead of defering this decision to userland? |
Just like @weltling I'm apprehensive about adding this. The documentation is pretty explicit on the point:
We cannot expose options that cause memory unsafety. If executing PHP code can cause memory errors (-so), this is always a bug in PHP and not a problem with the user doing something they're not supposed to do. @staabm In principle possible. PCRE can keep an HT from str addr to str of validated strings. This would allow skipping validation if the same string instance is passed. It's a tradeoff though, because this would be keeping alive strings. Would require a mechanism to ensure we don't keep too many strings etc. |
@nikic thx for the explanation. maybe we could add this capability to interned strings only (and add a flag to those or something). interned strings which tend to get re-used very often would benefit most from such a optimization anywhere, right? |
@staabm Interned strings likely wouldn't help for this particular situation though. Furthermore it would require doing an UTF-8 check on all interned strings upfront, as they live in protected memory. For non-interned strings, adding a (GC) flag directly on the string would be a possibility, but requires care to ensure that this flag is always reset if the string is modified. If we integrate this with forget_hash_val() it might work with little code modification. It feels ugly to include PCRE concerns into the string flags though :/ |
agree. I guess it would only make sense in case it makes a big difference in pcre perf. |
Also note that we'd still have to make sure that the position is valid -- but as UTF-8 is a self-synchronizing encoding this is cheap. |
Well, sorry guys, from this point I can't help anymore. The only little knowledge I have of the internals come from @nikic's (great) blog posts... There should be more of them! 😉 As you might have guessed, the reason I submitted this PR is for matching a lot of times (I mean tens of thousands of calls) against moderately large inputs (up to a few hundreds of kilobytes) for lexical analysis. I tried to use the So I sincerely hope somebody can come up with some wizardry (or redesign the whole preg utf8 api 😆) to make this happen! Have a nice day! ☀️ |
I like your proposal a lot. PCRE supports UTF-8 only. You may use
to validate UTF-8 string before processing. It's used by php_escape_html_entities_ex(), etc. It may be good idea validate encoding at first, then do not validate encoding in PCRE by default. We need a benchmark to enable pre UTF-8 encoding validation always. BTW, I'm thinking adding 2 new features to PHP that validates input encoding. If these are implemented, UTF-8 encoding validation in preg by default will be redundant. Therefore, I'm not 100% sure if pre UTF-8 encoding validation in preg is good to have. When preg functions are called multiple times on the same string, encoding is validated over and over. Anyway, one feature is string validation filter for filter module that validates
Another is type affinity
With these features, user may pass UTF-8 string safely to PCRE. One of the reason why I haven't implemented them is encoding check. These should support various encodings, but both intl/mbstring is external code and cannot use them always. We may say "we support only UTF-8 when default_charset is set to UTF-8", if it's ok :) |
@yohgaki I've been doing some smoke tests by throwing large amounts of random binary data to A short/mid-term solution would be to give a lot more love to the Oniguruma integration, and make it a real first-class citizen in future releases. The advantage being that php core devs can draw inspiration from the work that's already been done by Ruby (even maybe port the |
@ju1ius PCRE should work correctly when input string is valid UTF-8 and PCRE_NO_UTF8_CHECK, shouldn't it? It should not matter much if PCRE crashes on binary or not. All we need is huge sign of warning in the manual. IMO. Or it may validate string as UTF-8 then apply PCRE_NO_UTF8_CHECK, if input is invalid, return false just like when backtrack/stack limit is reached. |
It does indeed.
Crashing on binary files means it will crash on other multibytes encoding it doesn't recognize.
I tend to agree with you on this. After all nothing prevents you from calling
That's how it's done in Side note: |
I'm closing this because the proposed patch cannot address the issue safely. |
@ju1ius Just to make sure, did you check that the segfaults you saw were not caused by stack overflows? Those are "expected" and can happen whether or not the data is valid UTF-8. |
No 100% sure but given that it did not crash in ascii mode, nor did it crash in utf mode with non-backtracking patterns, and that oniguruma did not crash at all, I took that as a strong indication. |
@nikic Just for the record, could you elaborate on this? |
It sounds crashing by excessive number of recursive calls. If this is the case, crashes shouldn't matter.
PCRE has stack and backtrack limits. Excessive amount of backtracks should result in FALSE in PHP. You seem getting crashes because you are calling PCRE API from C directly. It seems you don't have to close this PR, but it seems good enough to be merged. |
Did you disable JIT? Did you limit stack and backtracks? Users MUST pass valid UTF-8 regardless of PCRE protections. I'm proposing RFC for this purpose. |
No, I used C for Oniguruma only (to bypass Without going into too much details, what I did was basically this: <?php
// words.php
$pattern = $argv[1];
$errors = [
PREG_NO_ERROR => "PREG_NO_ERROR",
PREG_INTERNAL_ERROR => "PREG_INTERNAL_ERROR",
PREG_BACKTRACK_LIMIT_ERROR => "PREG_BACKTRACK_LIMIT_ERROR",
PREG_RECURSION_LIMIT_ERROR => "PREG_RECURSION_LIMIT_ERROR",
PREG_BAD_UTF8_ERROR => "PREG_BAD_UTF8_ERROR",
PREG_BAD_UTF8_OFFSET_ERROR => "PREG_BAD_UTF8_OFFSET_ERROR",
];
do {
$str = fread(STDIN, 4096);
$res = preg_match_all($pattern, $str, $matches, PREG_NO_UTF8_CHECK);
if ($res === false) {
echo $errors[preg_last_error()], PHP_EOL;
}
} while(!feof(STDIN)); then #!/bin/bash
# test.sh
find "$1" -type f | while read name
do
./php-src/sapi/cli/php ./words.php "$2" < "$name" 1>>./preg_errors.txt
if [[ $? != 0 ]]; then
echo "Crashed on: $name"
fi
done $ ./test.sh /usr/bin '/.*\w{10}/u' Of course, I also tested without |
Well, the bug is still open 😉
But feel free to reopen if you think you should be ! |
We currently use the native stack for (non-JIT) PCRE, which can cause a stack overflow. If you have a pattern that causes excessive stack usage, you'll get an "expected" crash. But as you tested without /u and did not see a crash, that seems unlikely. |
Hi,
This PR is related to Bug #72685.
It adds a
PREG_NO_UTF8_CHECK
flag topreg_match
,preg_match_all
andpreg_split
functions to disable the implicit UTF-8 validity checking performed by PCRE by default when the/u
option is enabled.Without the ability to disable this check, matching repeatedly against the same input in utf-8 mode yields quadratic performance , instead of linear.
I'm targetting this branch because it fixes a bug and it is a BC change, but it works on master too.
This is my first PR to php-src, so feel free to point me in the right direction if needed.
Thanks.
Following is a small PHP script that can be ran against this branch to test the new behavior:
Sample output: