adds PREG_NO_UTF8_CHECK flag #2035

Closed · wants to merge 2 commits

Conversation

@ju1ius (Contributor, Author) commented Jul 27, 2016

Hi,
This PR is related to Bug #72685.
It adds a PREG_NO_UTF8_CHECK flag to the preg_match, preg_match_all and preg_split functions to disable the implicit UTF-8 validity check that PCRE performs by default when the /u modifier is enabled.

Without the ability to disable this check, matching repeatedly against the same input in UTF-8 mode yields quadratic performance instead of linear.

I'm targeting this branch because it fixes a bug and is a BC change, but it works on master too.
This is my first PR to php-src, so feel free to point me in the right direction if needed.

Thanks.

The following small PHP script can be run against this branch to test the new behavior:

function do_test ($input_size, $flags = 0) {
  $str = str_repeat('a', $input_size);

  $start = microtime(true);
  $pos = 0;
  while(preg_match('/\G\w/', $str, $m, $flags, $pos)) ++$pos;
  $end = microtime(true);
  echo '>>> NO u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;

  $start = microtime(true);
  $pos = 0;
  while(preg_match('/\G\w/u', $str, $m, $flags, $pos)) ++$pos;
  $end = microtime(true);
  echo '>>> WITH u flag: ', number_format(($end - $start)*1000, 6), 'ms', PHP_EOL;
}

echo 'WITHOUT PREG_NO_UTF8_CHECK', PHP_EOL;
echo 'Input size = 1e4 ', PHP_EOL;
do_test(1e4, 0);
echo 'Input size = 1e5 ', PHP_EOL;
do_test(1e5, 0);
echo str_repeat('-', 80), PHP_EOL;
echo 'WITH PREG_NO_UTF8_CHECK', PHP_EOL;
echo 'Input size = 1e4 ', PHP_EOL;
do_test(1e4, PREG_NO_UTF8_CHECK);
echo 'Input size = 1e5 ', PHP_EOL;
do_test(1e5, PREG_NO_UTF8_CHECK);

Sample output:

WITHOUT PREG_NO_UTF8_CHECK
Input size = 1e4
>>> NO u flag:    14.562130ms
>>> WITH u flag: 243.849993ms
Input size = 1e5
>>> NO u flag:      111.440897ms
>>> WITH u flag: 23,129.899025ms
--------------------------------------------------------------------------------
WITH PREG_NO_UTF8_CHECK
Input size = 1e4
>>> NO u flag:   11.296988ms
>>> WITH u flag: 11.363029ms
Input size = 1e5
>>> NO u flag:   110.867023ms
>>> WITH u flag: 111.830950ms

@@ -32,6 +32,7 @@
 #define PREG_PATTERN_ORDER 1
 #define PREG_SET_ORDER 2
 #define PREG_OFFSET_CAPTURE (1<<8)
+#define PREG_NO_UTF8_CHECK PCRE_NO_UTF8_CHECK
@ju1ius (Contributor, Author) commented on the diff:

I couldn't grasp the reasoning behind the values of the various existing PREG_* constants, so I just used PCRE's value.
Please tell me if this should be changed.

Fixes bug #72685.
This commit adds a way to set the PCRE_NO_UTF8_CHECK flag for
`preg_match`, `preg_match_all` and `preg_split` php functions.

Without the ability to set this flag, matching repeatedly
against the same input in utf-8 mode yields quadratic performance
instead of linear, since PCRE then checks the whole subject
for UTF-8 validity on every call.
@ju1ius force-pushed the pcre_no_utf_check branch from 8f07584 to 20fd688 on July 27, 2016 07:37
@weltling (Contributor) commented:

@ju1ius thanks for the report. I would suggest turning your PHP code into a test and adding it to the PR. I also think it is required to add tests with invalid UTF-8 strings.

While this option can of course be useful, it looks dangerous to allow it just as is. The PCRE man page warns explicitly about possible bad consequences when this option is used incorrectly. Not sure; maybe hashing the subject string beforehand would be an option when the UTF-8 check is disabled, or some other way to ensure the same subject string is being passed again.

Thanks.

@ju1ius (Contributor, Author) commented Jul 27, 2016

I also think it is required to add tests with invalid UTF-8 strings.

Yep, just doing that ATM 😉

@ju1ius (Contributor, Author) commented Jul 27, 2016

While this option can of course be useful, it looks dangerous to allow it just as is.
Maybe hashing the subject string beforehand would be an option when the UTF-8 check is disabled.

@weltling Wouldn't hashing the string on each invocation kind of defeat the point?
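To put numbers on it: a hash has to read every byte, so a per-call hash is O(n) just like the validity check it would replace, and n calls still cost O(n²) overall. A quick sketch (timings are machine-dependent; md5 stands in for whatever hash would actually be used):

<?php
// md5() must read every byte, so its cost grows linearly with the
// subject, exactly like the UTF-8 validity check it would replace.
foreach ([10 ** 4, 10 ** 5, 10 ** 6] as $size) {
    $str = str_repeat('a', $size);
    $start = microtime(true);
    for ($i = 0; $i < 100; $i++) {
        md5($str);
    }
    $end = microtime(true);
    printf("size 1e%d: %.3f ms per hash\n", log10($size), ($end - $start) * 10);
}
// Hashing before each of the n preg_match() calls in the lexing loop
// would therefore still cost n * O(n) = O(n^2) in total.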

Also, no pun intended, but PHP already has a history of trying to be automagically safe, failing at that, and thus creating security holes and (what's worse) long-lasting bad habits in its community (remember magic_quotes, safe_mode, et al.).

We already have a function to check for UTF-8 validity: mb_check_encoding! 🎉

So IMO the right solution would be to document this flag with a big, bold, yellow warning like:

Use this flag when you need to match repeatedly against the same input in UTF-8 mode.
Be sure, however, to call mb_check_encoding($my_input, 'utf-8') before matching, otherwise things will blow up and a fairy will die.

And include a small code sample like:

if (!mb_check_encoding($input, 'utf-8')) {
  throw new \InvalidArgumentException('Invalid UTF-8 string');
}
// Now match a sequence of whitespace separated words.
$pos = 0;
$words = [];
while(preg_match('/\G\s*(\w+)\s*/u', $input, $matches, PREG_NO_UTF8_CHECK, $pos)) {
  $pos += strlen($matches[0]);
  $words[] = $matches[1];
}

What do you think?
Thanks for your attention btw! 😉

@ju1ius
Copy link
Contributor Author

ju1ius commented Jul 27, 2016

@weltling I just added a few tests based on what I could find inside the ext/pcre/tests directory.

I would suggest turning your PHP code into a test and adding it to the PR.

I'm not sure how to do that; the above code is more of a benchmark. Can you point me to existing performance tests?

@weltling (Contributor) commented:

Yeah, it is clear that the validation has to happen just once, beforehand. What I'm talking about is the case where no validation happened before. PCRE promises undefined behavior or even a crash. IMHO it's not a good idea to just open a rabbit hole without any defense. Documentation is a good thing, but maybe more could be done. Hashing was just the first thing that came to my mind - it is supposed to be faster than the UTF-8 check. A smarter idea is welcome :)

Thanks.

@weltling (Contributor) commented:

With the benchmark test - you can pick a criterion: say, the time difference is expected to be no more than X ms, or even seconds. Just to ensure that at least the rough expectations hold.

Thanks.

@ju1ius (Contributor, Author) commented Jul 27, 2016

With the benchmark test - you can pick a criterion: say, the time difference is expected to be no more than X ms, or even seconds. Just to ensure that at least the rough expectations hold.

Yes, I think just ensuring roughly linear complexity should be enough without slowing down the builds too much.
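For instance, something like this rough sketch (it assumes this patch is applied so PREG_NO_UTF8_CHECK exists; the factor-of-30 threshold is arbitrary):

<?php
// Time the \G-anchored scan for a given input size, using the new flag.
function scan_time(int $size): float {
    $str = str_repeat('a', $size);
    $start = microtime(true);
    $pos = 0;
    // PREG_NO_UTF8_CHECK is the constant added by this PR.
    while (preg_match('/\G\w/u', $str, $m, PREG_NO_UTF8_CHECK, $pos)) {
        ++$pos;
    }
    return microtime(true) - $start;
}

$t1 = scan_time(10000);
$t2 = scan_time(100000);
// Linear scaling gives a ratio near 10; quadratic scaling pushes it
// toward 100. A loose threshold of 30 avoids flaky absolute timings.
echo ($t2 / $t1 < 30) ? "PASS (roughly linear)\n" : "FAIL (super-linear)\n";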

@staabm (Contributor) commented Jul 27, 2016

Could php-src somehow remember which strings have already been "validated" and set this flag automatically, instead of deferring this decision to userland?

@nikic (Member) commented Jul 27, 2016

Just like @weltling, I'm apprehensive about adding this. The documentation is pretty explicit on this point:

When PCRE_NO_UTF8_CHECK is set, the effect of passing an invalid string as a subject or an invalid value of startoffset is undefined. Your program may crash or loop.

We cannot expose options that cause memory unsafety. If executing PHP code can cause memory errors, this is always a bug in PHP and not a problem with the user doing something they're not supposed to do.

@staabm In principle, possible. PCRE could keep an HT from string address to string for validated strings. This would allow skipping validation if the same string instance is passed. It's a tradeoff though, because this would keep strings alive, and it would require a mechanism to ensure we don't keep too many strings, etc.

@staabm (Contributor) commented Jul 27, 2016

@nikic thx for the explanation. Maybe we could add this capability to interned strings only (and add a flag to those or something). Interned strings, which tend to get re-used very often, would benefit most from such an optimization anyway, right?

@nikic (Member) commented Jul 27, 2016

@staabm Interned strings likely wouldn't help for this particular situation though. Furthermore, it would require doing a UTF-8 check on all interned strings upfront, as they live in protected memory.

For non-interned strings, adding a (GC) flag directly on the string would be a possibility, but it requires care to ensure that the flag is always reset if the string is modified. If we integrate this with forget_hash_val() it might work with little code modification. It feels ugly to include PCRE concerns into the string flags though :/

@staabm (Contributor) commented Jul 27, 2016

It feels ugly to include PCRE concerns into the string flags though :/

Agreed. I guess it would only make sense if it makes a big difference in PCRE performance.
Not sure whether a "this interned string is valid UTF-8" flag could make sense for other areas.

@nikic (Member) commented Jul 27, 2016

Also note that we'd still have to make sure that the position is valid -- but as UTF-8 is a self-synchronizing encoding this is cheap.
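In userland terms, the check is just "the byte at the offset must not be a continuation byte (10xxxxxx)": a sketch of the idea:

<?php
// In UTF-8, every continuation byte matches 10xxxxxx, so an offset is
// a character boundary iff the byte there is not a continuation byte
// (or the offset equals the string length).
function is_utf8_boundary(string $str, int $pos): bool {
    if ($pos < 0 || $pos > strlen($str)) {
        return false;
    }
    if ($pos === strlen($str)) {
        return true;
    }
    return (ord($str[$pos]) & 0xC0) !== 0x80;
}

var_dump(is_utf8_boundary("héllo", 2)); // false: second byte of 'é'
var_dump(is_utf8_boundary("héllo", 3)); // true: start of the first 'l'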

@ju1ius (Contributor, Author) commented Jul 27, 2016

Well, sorry guys, from this point on I can't help anymore. The only little knowledge I have of the internals comes from @nikic's (great) blog posts... There should be more of them! 😉

As you might have guessed, the reason I submitted this PR is to match a large number of times (I mean tens of thousands of calls) against moderately large inputs (up to a few hundred kilobytes) for lexical analysis.
Non-UTF-8 preg_match calls (without /u) make it possible to do that in a reasonable amount of time, but with the quadratic complexity imposed by repeated UTF-8 validation (n calls, each revalidating the whole subject), it's just ridiculously slow.

I tried to use the mb_ereg_search API, which does just that: it validates once in mb_ereg_search_init (using php_mb_check_encoding under the hood) so that you can call mb_ereg_search repeatedly at very little cost. It's even slightly faster than non-UTF-8 preg_match in my benchmarks.
Unfortunately, the Oniguruma integration is kinda flaky. I found a number of little bugs that render it unusable for what I need ATM. 💔

So I sincerely hope somebody can come up with some wizardry (or redesign the whole preg utf8 api 😆) to make this happen!
In the meantime I'll file bugs against the mbstring extension... 🙏

Have a nice day! ☀️

@yohgaki (Contributor) commented Jul 28, 2016

I like your proposal a lot. PCRE supports UTF-8 only. You may use

/* {{{ php_next_utf8_char
 * Public interface for get_next_char used with UTF-8 */
PHPAPI unsigned int php_next_utf8_char(
        const unsigned char *str,
        size_t str_len,
        size_t *cursor,
        int *status)

to validate a UTF-8 string before processing. It's used by php_escape_html_entities_ex(), etc. It may be a good idea to validate the encoding first, then not validate the encoding in PCRE by default. We'd need a benchmark before always enabling pre-validation of UTF-8 encoding.

BTW, I'm thinking of adding 2 new features to PHP that validate input encoding. If these are implemented, UTF-8 encoding validation in preg by default will be redundant. Therefore, I'm not 100% sure that pre-validation of UTF-8 encoding in preg is good to have. When preg functions are called multiple times on the same string, the encoding is validated over and over.

Anyway, one feature is a string validation filter for the filter module that validates:

  • disallow control chars by default
  • char encoding validation/conversion (not an auto encoding detection & conversion)
  • char length (min/max)
  • used chars (numeric only, alphanumeric only, single line (i.e. no newline), etc.)
  • I'll probably add a validation filter list so that multiple validation filters can be applied to a string.

Another is type affinity.

With these features, users may pass UTF-8 strings safely to PCRE.

One of the reasons I haven't implemented them yet is the encoding check. These should support various encodings, but intl and mbstring are both external code and we cannot always use them. We may say "we support only UTF-8 when default_charset is set to UTF-8", if that's OK :)
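For UTF-8 specifically there is also an mbstring-free option: PCRE itself validates the subject when matching the empty pattern with /u. A minimal sketch:

<?php
// A valid UTF-8 subject matches the empty /u pattern; an invalid one
// makes preg_match() return false with PREG_BAD_UTF8_ERROR.
function is_valid_utf8(string $str): bool {
    return preg_match('//u', $str) === 1;
}

var_dump(is_valid_utf8("héllo"));    // true
var_dump(is_valid_utf8("\xC3\x28")); // false: 0xC3 starts a 2-byte
                                     // sequence, 0x28 cannot continue it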

@ju1ius (Contributor, Author) commented Jul 29, 2016

@yohgaki I've been doing some smoke tests by throwing large amounts of random binary data at preg_match with PCRE_NO_UTF8_CHECK. It causes segfaults when the pattern involves a lot of backtracking, like /.*\w{10}/u. 😕
I was secretly hoping that it would just bail out internally, but no, it can crash the whole process (although only under rather extreme conditions).

A short/mid-term solution would be to give a lot more love to the Oniguruma integration and make it a real first-class citizen in future releases. The advantage is that PHP core devs can draw inspiration from the work that's already been done by Ruby (maybe even port the strscan module? ❤️).

@yohgaki (Contributor) commented Jul 30, 2016

@ju1ius PCRE should work correctly when the input string is valid UTF-8 and PCRE_NO_UTF8_CHECK is set, shouldn't it? It should not matter much whether PCRE crashes on binary data or not. All we need is a huge warning sign in the manual, IMO. Or it may validate the string as UTF-8 and then apply PCRE_NO_UTF8_CHECK; if the input is invalid, return false just as when the backtrack/stack limit is reached.

@ju1ius (Contributor, Author) commented Jul 30, 2016

PCRE should work correctly when the input string is valid UTF-8 and PCRE_NO_UTF8_CHECK is set, shouldn't it?

It does indeed.

It should not matter much whether PCRE crashes on binary data or not.

Crashing on binary files means it will crash on any other multibyte encoding it doesn't recognize.

All we need is a huge warning sign in the manual.

I tend to agree with you on this. After all, nothing prevents you from calling eval on $_GET parameters without validation. However, I totally understand @nikic's concerns about memory safety.

Or it may validate the string as UTF-8 and then apply PCRE_NO_UTF8_CHECK; if the input is invalid, return false just as when the backtrack/stack limit is reached.

That's how it's done in preg_match_all currently. The thing is, the preg_match API isn't designed for that the way the mb_ereg_search API is. When you call mb_ereg_search_init, the input string is validated and cached in a global struct. Subsequent calls to mb_ereg_search et al. do not revalidate.
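For comparison, a rough sketch of that flow with the mb_ereg_search API (tokenizing on whitespace; the encoding is validated only once, inside mb_ereg_search_init):

<?php
// Tokenize the subject; its encoding is validated exactly once,
// inside mb_ereg_search_init(), not on every search call.
mb_regex_encoding('UTF-8');
$input = "héllo wörld foo";

mb_ereg_search_init($input, '\S+');
$tokens = [];
while (($m = mb_ereg_search_regs()) !== false) {
    $tokens[] = $m[0];   // $m[0] holds the whole match
}
print_r($tokens); // héllo, wörld, foo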

Side note:
For my tests, I basically threw all the binaries in my /usr/bin and WINDOWS/system32 folders at both preg_match with PCRE_NO_UTF8_CHECK and an Oniguruma script written in C (to bypass the encoding validation of the mb_ereg functions). PCRE returned gibberish results and crashed when the pattern involved a fair amount of backtracking. Oniguruma never crashed and never errored either (although it was slow on some large files - but that was expected since it actually tries to return correct results).

@ju1ius (Contributor, Author) commented Aug 1, 2016

I'm closing this because the proposed patch cannot address the issue safely.
IMO the correct solution is to improve the mb_ereg functions until they are on par with the preg_ functions.
I still think there should be a big warning in the docs about the quadratic behavior exposed here.

@ju1ius closed this Aug 1, 2016
@nikic (Member) commented Aug 1, 2016

@ju1ius Just to make sure, did you check that the segfaults you saw were not caused by stack overflows? Those are "expected" and can happen whether or not the data is valid UTF-8.

@ju1ius (Contributor, Author) commented Aug 1, 2016

Just to make sure, did you check that the segfaults you saw were not caused by stack overflows?

Not 100% sure, but given that it did not crash in ASCII mode, nor in UTF-8 mode with non-backtracking patterns, and that Oniguruma did not crash at all, I took that as a strong indication.

@ju1ius (Contributor, Author) commented Aug 5, 2016

Those are "expected" and can happen whether or not the data is valid UTF-8.

@nikic Just for the record, could you elaborate on this?

@yohgaki (Contributor) commented Aug 8, 2016

Just to make sure, did you check that the segfaults you saw were not caused by stack overflows?
Those are "expected" and can happen whether or not the data is valid UTF-8.

@nikic Just for the record, could you elaborate on this?

It sounds like the crashes are caused by an excessive number of recursive calls. If that is the case, the crashes shouldn't matter.

PCRE returned gibberish results and crashed when the pattern involved a fair amount of backtracking.

PCRE has stack and backtrack limits. An excessive amount of backtracking should result in FALSE in PHP. You seem to be getting crashes because you are calling the PCRE API from C directly.

It seems you don't have to close this PR; it looks good enough to be merged.

@yohgaki (Contributor) commented Aug 8, 2016

Just to make sure, did you check that the segfaults you saw were not caused by stack overflows?

Not 100% sure, but given that it did not crash in ASCII mode, nor in UTF-8 mode with non-backtracking patterns, and that Oniguruma did not crash at all, I took that as a strong indication.

Did you disable JIT? Did you limit the stack and backtracks?
We get crash reports for PCRE on occasion. Hitting the stack and backtrack limits is not a problem at all.

Users MUST pass valid UTF-8 regardless of PCRE protections. I'm proposing an RFC for this purpose:
https://wiki.php.net/rfc/add_validate_functions_to_filter

@ju1ius (Contributor, Author) commented Aug 8, 2016

You seem to be getting crashes because you are calling the PCRE API from C directly.

No, I used C for Oniguruma only (to bypass mb_check_encoding calls).

Without going into too much detail, what I did was basically this:

<?php
// words.php
$pattern = $argv[1];
$errors = [
  PREG_NO_ERROR => "PREG_NO_ERROR",
  PREG_INTERNAL_ERROR => "PREG_INTERNAL_ERROR",
  PREG_BACKTRACK_LIMIT_ERROR => "PREG_BACKTRACK_LIMIT_ERROR",
  PREG_RECURSION_LIMIT_ERROR => "PREG_RECURSION_LIMIT_ERROR",
  PREG_BAD_UTF8_ERROR => "PREG_BAD_UTF8_ERROR",
  PREG_BAD_UTF8_OFFSET_ERROR => "PREG_BAD_UTF8_OFFSET_ERROR",
];

do {
  $str = fread(STDIN, 4096);
  $res = preg_match_all($pattern, $str, $matches, PREG_NO_UTF8_CHECK);
  if ($res === false) {
    echo $errors[preg_last_error()], PHP_EOL;
  }
} while(!feof(STDIN));

then

#!/bin/bash
# test.sh
find "$1" -type f | while read name
do
  ./php-src/sapi/cli/php ./words.php "$2" < "$name" 1>>./preg_errors.txt
  if [[ $? != 0 ]]; then
    echo "Crashed on: $name"
  fi
done

and run it like:

$ ./test.sh /usr/bin '/.*\w{10}/u'

Of course, I also tested without PREG_NO_UTF8_CHECK and without /u, which gave me no segfaults. All with a default php.ini (recursion_limit === backtrack_limit === 100000, no JIT since it's PHP 5.6).

@ju1ius (Contributor, Author) commented Aug 8, 2016

It seems you don't have to close this PR; it looks good enough to be merged.

Well, the bug is still open 😉
I closed this PR because:

  1. It does not solve the issue on its own. It needs either some internal magic or your RFC, or a new API similar to mb_ereg_search, in which case this PR is pointless.
  2. We have another regexp engine that handles this very well.

But feel free to reopen if you think it should be!

@nikic (Member) commented Aug 8, 2016

@nikic Just for the record, could you elaborate on this?

We currently use the native stack for (non-JIT) PCRE, which can cause a stack overflow. If you have a pattern that causes excessive stack usage, you'll get an "expected" crash. But as you tested without /u and did not see a crash, that seems unlikely.
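For anyone bitten by this, a rough mitigation sketch is to lower pcre.recursion_limit so the engine reports an error before the native stack is exhausted (the right value depends on the thread stack size):

<?php
// Non-JIT PCRE recurses on the native C stack while backtracking.
// Lowering pcre.recursion_limit makes pathological patterns fail with
// a preg error instead of overflowing the stack; the safe value
// depends on the stack size and per-frame usage, so 10000 is only
// an illustrative guess.
ini_set('pcre.recursion_limit', '10000');

$res = preg_match('/(a*)*b/', str_repeat('a', 100000));
if ($res === false) {
    $err = preg_last_error();
    echo $err === PREG_RECURSION_LIMIT_ERROR ? "recursion limit hit\n"
       : ($err === PREG_BACKTRACK_LIMIT_ERROR ? "backtrack limit hit\n"
       : "other PCRE error\n");
}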
