Add support for PCRE callouts #3970

Closed
wants to merge 1 commit into from

nikic
Member

@nikic nikic commented Mar 21, 2019

This PR adds support for PCRE callouts, which are documented at https://www.pcre.org/current/doc/html/pcre2callout.html. Callouts provide a way to call a custom function during pattern matching, which can inspect the current matching state and has limited control over pattern matching.

A callout function is specified using preg_callout_function(), which also returns the previously set callout function. A typical usage would thus be:

$oldCallout = preg_callout_function(function($info) {
    var_dump($info);
    return 0; // Continue match
});
preg_match($regex, $subject);
preg_callout_function($oldCallout);

The callout function is passed a single array as argument which contains various information about the matching state. For the regular expression '/((foo)(bar))(?C42)baz/' against 'foobarbaz' the contents of $info are:

array(8) {
  ["callout"]=>
  int(42)
  ["subject"]=>
  string(9) "foobarbaz"
  ["captures"]=>
  array(4) {
    [0]=>
    array(2) {
      [0]=>
      NULL
      [1]=>
      int(-1)
    }
    [1]=>
    array(2) {
      [0]=>
      string(6) "foobar"
      [1]=>
      int(0)
    }
    [2]=>
    array(2) {
      [0]=>
      string(3) "foo"
      [1]=>
      int(0)
    }
    [3]=>
    array(2) {
      [0]=>
      string(3) "bar"
      [1]=>
      int(3)
    }
  }
  ["capture_last"]=>
  int(1)
  ["start_match"]=>
  int(0)
  ["current_position"]=>
  int(6)
  ["pattern_position"]=>
  int(18)
  ["next_item_length"]=>
  int(1)
}

The meaning of the contents is the same as described in the PCRE documentation. In particular, the following information is exposed: the callout name/number, the subject string, the subpatterns captured up to this point (including offsets), which subpattern was captured last, the position at which matching started, the current position in the subject, the current position in the pattern, and the length of the next item in the pattern.

The callout function can exert limited control over pattern matching through the return value. This is an integer that is one of:

  • == 0: Continue matching as usual.
  • > 0: Abort the current matching attempt, but still backtrack (equivalent to a failing lookahead assertion).
  • < 0: Abort match entirely.
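
As a rough illustration of these return values (not code from the PR), a callout can inspect a capture and fail only the current path. The sketch assumes the $info layout shown in the dump above, where each capture is a [text, offset] pair. The closure name $octetCheck and the sample pattern are hypothetical, and the registration step only runs on a build that includes this patch.

```php
<?php
// Hypothetical callout: fail the current path when group 1 is above 255.
$octetCheck = function (array $info): int {
    // Each capture is assumed to be [text, offset], as in the dump above.
    $text = $info['captures'][1][0] ?? null;
    if ($text === null) {
        return 0; // group 1 not set yet: continue matching as usual
    }
    // A positive value fails this path only, so the engine still backtracks.
    return ((int) $text > 255) ? 1 : 0;
};

// Hand-built $info arrays let the closure be exercised without the patch.
$accepted = $octetCheck(['captures' => [[null, -1], ['200', 0]]]);
$rejected = $octetCheck(['captures' => [[null, -1], ['300', 0]]]);

// preg_callout_function() exists only with this PR applied, hence the guard.
if (function_exists('preg_callout_function')) {
    $old = preg_callout_function($octetCheck);
    preg_match('/(\d+)(?C1)\./', '300.1');
    preg_callout_function($old);
}
```

Returning a negative value instead of 1 would abort the whole match rather than just the current attempt.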

Additionally a new modifier /C is added, which enables AUTO_CALLOUT mode. This will insert callouts into many points in the pattern, allowing you to easily trace the pattern matching behavior. See callout_example_1.phpt for an example of a simple tracing function that can be used.
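
A tracing callback along these lines might look as follows. This is a minimal sketch and not the function from callout_example_1.phpt: formatTraceLine() is a hypothetical helper, and it assumes that pattern_position indexes the pattern without its delimiters, which is what the sample dump above suggests (18 points at the "b" of "baz").

```php
<?php
// Hypothetical helper: render one trace line from a callout $info array.
function formatTraceLine(string $pattern, array $info): string
{
    // Assumption: pattern_position is relative to the pattern without delimiters.
    $next = substr($pattern, $info['pattern_position'], $info['next_item_length']);
    $matched = substr(
        $info['subject'],
        $info['start_match'],
        $info['current_position'] - $info['start_match']
    );
    return sprintf('callout %s: matched "%s", next item "%s"', $info['callout'], $matched, $next);
}

// Values copied from the sample dump above.
$sample = [
    'callout' => 42,
    'subject' => 'foobarbaz',
    'start_match' => 0,
    'current_position' => 6,
    'pattern_position' => 18,
    'next_item_length' => 1,
];
$line = formatTraceLine('((foo)(bar))(?C42)baz', $sample);
// $line is: callout 42: matched "foobar", next item "b"

// preg_callout_function() and /C exist only with this PR applied, hence the guard.
if (function_exists('preg_callout_function')) {
    $pattern = 'a+b';
    $old = preg_callout_function(function (array $info) use ($pattern): int {
        echo formatTraceLine($pattern, $info), "\n";
        return 0; // never interfere with the match, only observe it
    });
    preg_match('/' . $pattern . '/C', 'aaab');
    preg_callout_function($old);
}
```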

PCRE2_SIZE *ovector = block->offset_vector;
zval info, matches, retval;
uint32_t i;
(void) user_data;
Contributor

Should user_data probably be picked up as well? One could pass a user-defined object there. That would require preg_callout_function() to be extended to save it. It could get complicated with GC, though.

Thanks.

Member Author

I don't think we need user data arguments anymore nowadays, because you can just use() it in the closure instead.
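
A minimal sketch of that idea, with hypothetical names ($stats, $countSteps): state that a C-style API would pass through user_data is captured by the closure instead. The callback is invoked by hand here, so the snippet does not depend on the patch. With the patch applied it would be registered through preg_callout_function().

```php
<?php
// State that a C API would pass as user_data is captured with use() instead.
$stats = new stdClass();
$stats->steps = 0;

$countSteps = function (array $info) use ($stats): int {
    $stats->steps++; // objects are captured by handle, so the caller sees this
    return 0;        // continue matching as usual
};

// Simulate three callout invocations without the patch.
foreach ([1, 2, 3] as $n) {
    $countSteps(['callout' => $n]);
}
// $stats->steps is now 3
```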

Contributor

Ack, makes sense :) Thanks.

zend_fcall_info fci = {0};
zend_fcall_info_cache fcc;

if (zend_parse_parameters(ZEND_NUM_ARGS(), "|f!", &fci, &fcc) == FAILURE) {
Contributor

I'm curious: Why don't you use the ZEND_PARSE_PARAMETERS_* macros?

Member Author

This is a rarely used and not performance sensitive API, so using zend_parse_parameters avoids unnecessary codesize increase.

@@ -3001,6 +3111,10 @@ ZEND_BEGIN_ARG_INFO_EX(arginfo_preg_grep, 0, 0, 2)
ZEND_ARG_INFO(0, flags)
ZEND_END_ARG_INFO()

ZEND_BEGIN_ARG_INFO_EX(arginfo_preg_callout_function, 0, 0, 0)
ZEND_ARG_INFO(0, new_function)
Contributor

What do you think about typing the $new_function argument, for instance with ZEND_ARG_CALLABLE_INFO?

Member Author

I'm holding off on adding arginfo types until we can start adding them wholesale, to avoid behavior discrepancies. https://wiki.php.net/rfc/consistent_type_errors was the largest piece of preparation for that, but there's some more work necessary before we can start doing that.

Contributor

👍

@derickr
Member

derickr commented Mar 21, 2019

I like the idea, but not that it requires global state to set the callout function. I think it would be better if it were a specific option per preg_match call, so that people can't forget to disable it again.

@nikic
Member Author

nikic commented Mar 21, 2019

@derickr This functionality is not specific to preg_match, it can also be used in conjunction with all the other preg_* functions. We'd have to add a new argument to all functions, and the signatures are already pushing on the limits of what's reasonable.

@weltling
Contributor

It would probably make more sense if pcre had an OOP interface, then a callout function could be attached to a concrete object. Another option could be to attach to the pattern structs, but that would increase the pattern cache size and also likely bring issues with GC, so not an option.

Thanks.

@nikic nikic added the Feature label Mar 21, 2019
@colinodell
Contributor

This feature would be extremely helpful for identifying and optimizing poorly performing regular expressions! Because PHP does not currently expose this information to userland, I'm resorting to capturing the patterns and subjects manually and running them through external tools to calculate the number of steps 😧 I would love it if I could do all of this natively in PHP.

Any chance that efforts on this feature might resume at some point? I'd be willing to help however I can, although I don't know C very well so I'd need a good bit of guidance and support.

@iluuu1994
Member

This seems to have gone stale. Feel free to reopen if you'd like to pick this back up.
