This repository has been archived by the owner on Sep 20, 2021. It is now read-only.

attempt to allow look-behind assertions in tokens #6

Closed
wants to merge 1 commit

Conversation

CircleCode
Member

Instead of consuming the matched text, the offset is increased
and regex matching is tested from that offset.

Note: since we remove the '^' anchor in the preg_match test,
this can lead to degraded performance.

This allows, for example, the following pp:

%skip       SPACE       \s
%token      FOO         foo(?= bar)
%token      BAR         bar
%token      BAZ         (?<=bar )baz

#doc:
    <FOO> <BAR> <BAZ>

with the following string:

foo bar baz

Maybe this can help with #5.
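For illustration only (this is not the PR code; the subject and the offset 8 are hypothetical, and the BAZ token is taken from the pp above), here is a minimal sketch of why keeping the whole text and moving an offset matters for look-behind tokens:

<?php

// Consuming approach: once 'foo bar ' has been cut off with substr(),
// the look-behind in BAZ has nothing left to inspect, so it fails.
var_dump(preg_match('#^(?<=bar )baz#u', 'baz')); // int(0)

// Offset approach: the whole text is kept, so the look-behind can see
// the characters before the offset and the token matches.
var_dump(preg_match('#(?<=bar )baz#u', 'foo bar baz', $m, PREG_OFFSET_CAPTURE, 8)); // int(1)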

@hoaproject
Collaborator

Hello :-),

Actually, preg_match has an $offset argument, but it does not behave as we expect. We would like to keep the ^ anchor to save memory and CPU while lexing. Here, we have to use the PREG_OFFSET_CAPTURE constant and then check the resulting offset to see whether the match starts where we expect. This was nearly the situation before the patch 39e8900 (well, we used strpos in addition, clearly not a good idea). But, because we would no longer create n substrings, it is possible that we save more memory and CPU. Could you estimate the balance?
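For illustration (the subject 'foo bar' and the token are hypothetical), a minimal sketch of the behaviour described above: the ^ anchor still asserts the start of the whole subject even when an $offset is passed, so the position of the match has to be checked instead.

<?php

$subject = 'foo bar';

// With the ^ anchor kept, passing an offset never matches mid-string,
// because ^ refers to the start of the subject, not to the offset.
var_dump(preg_match('#^bar#u', $subject, $m, 0, 4)); // int(0)

// Without ^, PREG_OFFSET_CAPTURE reports where the match starts, and that
// position can be compared with the position we are lexing from.
var_dump(preg_match('#bar#u', $subject, $m, PREG_OFFSET_CAPTURE, 4)); // int(1)
var_dump($m[0][1] === 4); // bool(true): the match starts exactly at the offset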

@CircleCode
Member Author

I used the following tests:

offset.php

<?php

$repetitions = 1000;

$start_string = str_repeat("foo", 10000);
$time_start = microtime(true);

for ($i=$repetitions; $i >0; $i--) {
    $offset = 0;
    $string = $start_string;
    while ($offset < strlen($string)) {
        $found = preg_match('#(?:' . 'foo' . ')#u', $string, $matches, PREG_OFFSET_CAPTURE, $offset);
        $success = $offset === $matches[0][1];
        $offset += strlen($matches[0][0]);
    }
}

$time_end = microtime(true);
$time = $time_end - $time_start;

$memory = memory_get_peak_usage(true);

function convert($size) {
    $unit=array('b','kb','mb','gb','tb','pb');
    return @round($size/pow(1024,($i=floor(log($size,1024)))),2).' '.$unit[$i];
}

$memory_2 = convert($memory);

print "test with offset\n";
print "time for $repetitions executions :\t$time seconds\n";
print "memory for $repetitions executions :\t$memory bytes ($memory_2)\n";

substr.php

<?php

$repetitions = 1000;

$start_string = str_repeat("foo", 10000);
$time_start = microtime(true);

for ($i=$repetitions; $i >0; $i--) {
    $string = $start_string;
    while (0 < strlen($string)) {
        $found = preg_match('#^(?:' . 'foo' . ')#u', $string, $matches);
        $string = substr($string, strlen($matches[0]));
    }
}

$time_end = microtime(true);
$time = $time_end - $time_start;

$memory = memory_get_peak_usage(true);

function convert($size) {
    $unit=array('b','kb','mb','gb','tb','pb');
    return @round($size/pow(1024,($i=floor(log($size,1024)))),2).' '.$unit[$i];
}

$memory_2 = convert($memory);

print "test with substring\n";
print "time for $repetitions executions :\t$time seconds\n";
print "memory for $repetitions executions :\t$memory bytes ($memory_2)\n";

Results:

$ php -v
PHP 5.4.16 (cli) (built: Jun  6 2013 09:20:50) 
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.4.0, Copyright (c) 1998-2013 Zend Technologies
    with Xdebug v2.2.3, Copyright (c) 2002-2013, by Derick Rethans
$ php test/offset.php && php test/substr.php 
test with offset
time for 1000 executions : 366.7652630806 seconds
memory for 1000 executions :   524288 bytes (512 kb)
test with substring
time for 1000 executions : 210.43934512138 seconds
memory for 1000 executions :   524288 bytes (512 kb)

Even if these tests are not totally realistic, they tend to confirm that using substring and the ^ anchor is really faster (around 75% faster here).

I am not really surprised by the anchor being faster, but the difference seems huge.

@CircleCode
Member Author

By the way, it seems my test does not show whether there is a memory consumption difference, since I suppose the peak is due to the string itself… Maybe these scripts could be updated to better show the memory consumption of the matching.

@hoaproject
Collaborator

Curiously, I have significantly different results. Please see the following benchmarks:

<?php

require '/usr/local/lib/Hoa/Core/Core.php';

from('Hoa')
-> import('Bench.~');

$bench  = new Hoa\Bench();
$memory = array();
$string = str_repeat('foo', 10000);

$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while(0 < strlen($_string)) {

    preg_match('#^(?:foo)#u', $_string, $matches);
    $_string = mb_substr($_string, mb_strlen($matches[0]));
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string = $string;
$offset  = 0;
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while($offset < strlen($_string)) {

    preg_match('#(?:foo)#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset);
    $offset += mb_strlen($matches[0][0]);
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];


echo $bench;
print_r($memory);

Here is the result:

substr  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||   830ms, 100.0%
offset  ||||||||||||||||                                           233ms,  28.1%
Array
(
    [substr] => 35448
    [offset] => 1704
)

Using offset saves a lot of CPU and memory. Please note that I used the mb_* functions to be closer to a real situation. And here is my PHP version:

PHP 5.6.0-dev (cli) (built: Jul 16 2013 08:44:18)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.6.0-dev, Copyright (c) 1998-2013 Zend Technologies

Thoughts?

@hoaproject
Collaborator

Other stats from @vonglasow: https://gist.github.com/vonglasow/6007324.
And here are some other stats on a lot of PHP versions (80+): http://3v4l.org/ifgWi.

With this last link, we see that the offset bench always gives better timing results than the substr bench. But for PHP 5.3.*, the offset bench uses more memory than substr. This situation is reversed with PHP 5.4 and 5.5 (and 5.6, as my own test shows), where offset uses significantly less memory than substr.

In a version without Hoa, the results were very different. This is because Hoa\Core sets mb_internal_encoding('UTF-8') and mb_regex_encoding('UTF-8') to force ext/mbstring to compute with UTF-8 in mind, which simulates real-world use cases.

I think we can conclude?

@hoaproject
Collaborator

Oh, I have added a new optimisation to the offset benchmark: I use a precomputed $maxoffset instead of calling strlen($_string) on every iteration. Here is the new code:

$bench  = new Hoa\Bench();
$memory = array();
$string = str_repeat('foo', 10000);

$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while(0 < strlen($_string)) {

    preg_match('#^(?:foo)#u', $_string, $matches);
    $_string = mb_substr($_string, mb_strlen($matches[0]));
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string          = $string;
$offset           = 0;
$maxoffset        = strlen($_string);
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while($offset < $maxoffset) {

    preg_match('#(?:foo)#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset);
    $offset += mb_strlen($matches[0][0]);
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];


echo $bench;
print_r($memory);

My result:

substr  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||   819ms, 100.0%
offset  ||||||||||||||||                                           229ms,  27.9%
Array
(
    [substr] => 2448
    [offset] => 1704
)

And the new big benchmark: http://3v4l.org/mN9Ir. Results are quite similar, but offset is a little bit faster (around 40ms, which is not negligible).

@hoaproject
Collaborator

Here are the results from @osaris: http://pastie.org/private/splp4ujcgu4kadt67zlwa. They reinforce our previous results.

But one last thing: here we always test the same token. In a grammar with 10 tokens, 1 will match; what is the cost of the 9 failures? I propose you make a patch and we will test on real-world data.

@hoaproject
Collaborator

I have added a new test with more than one token:

<?php

require '/usr/local/lib/Hoa/Core/Core.php';

from('Hoa')
-> import('Bench.~');

$bench  = new Hoa\Bench();
$memory = array();
$string = null;
$tokens = array('foo', 'bar', 'baz', 'qux', 'gordon', 'freeman');
$_      = count($tokens) - 1;

for($i = 0; $i < 10000; ++$i)
    $string .= $tokens[mt_rand(0, $_)];


$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while(0 < strlen($_string)) {

    foreach($tokens as $token) {

        if(0 === preg_match('#^(?:' . $token . ')#u', $_string, $matches))
            continue;

        $_string = mb_substr($_string, mb_strlen($matches[0]));
        break;
    }
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string          = $string;
$offset           = 0;
$maxoffset        = mb_strlen($_string);
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while($offset < $maxoffset) {

    foreach($tokens as $token) {

        if(0 === preg_match('#(?:' . $token . ')#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset))
            continue;

        if($offset !== $matches[0][1])
            continue;

        $offset += mb_strlen($matches[0][0]);
        break;
    }
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];


echo $bench;
print_r($memory);

Here is my result:

substr  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||  1571ms, 100.0%
offset  ||||||||||||||||||||||||||||||||||||||||                  1132ms,  72.0%
Array
(
    [substr] => 35856
    [offset] => 1704
)

Results are quite similar. Even if the time difference between offset and substr is not as big as before, the saved memory is still huge.

We can notice that with a small $string (set $i to 1000 or 100), computing times are very close, but the memory results stay the same (better for offset).

@hoaproject
Collaborator

I have optimized the substr algorithm. I have replaced all the mb_* functions by their non-multibyte counterparts, because it does not matter after all: regular expressions support Unicode, but if we consider the string as a plain array of bytes, we can use strlen and substr. In this case, the substr algorithm is twice as fast as offset, but it uses 20 times more memory!

Now I am trying to optimize the offset algorithm. I have considerably reduced the time of offset by using the regex ^.{offset}(?:token) (so we replace the $offset argument of preg_match with the ^ anchor), but preg_match(…, …, $matches) then puts the substring from the beginning of the text up to the end of the token at index 0, so it increases memory.
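As an illustration of the memory effect described above (the subject 'foobarbaz', the token 'bar' and the offset 3 are hypothetical):

<?php

$subject = 'foobarbaz';
$offset  = 3;

// Encoding the offset in the pattern keeps the ^ anchor, but the whole
// prefix ends up in $matches[0], which is what costs memory.
preg_match('#^.{' . $offset . '}(?:bar)#u', $subject, $matches);
var_dump($matches[0]); // string(6) "foobar"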

@hoaproject
Copy link
Collaborator

I have been writing and experimenting with many algorithms and ways to solve this for many hours… but all the memory benchmarks are wrong. Calling memory_get_usage slows everything down and gives wrong numbers for the first algorithm, so I added an empty benchmark to initialize everything.

So, let's sum up. I have removed the memory benchmark, since I don't understand why the numbers did not change even when the string changed. Let's talk about computing time for now.

substr  |||||||||||||||||||||||||||||                              561ms,  51.3%
offset  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||  1094ms, 100.0%

What do we see? substr is twice as fast as offset. This is for a string of length around 41k. Let's take a look at more “classical” strings, say Praspel contracts, rule expressions or even mathematical expressions (for Hoa\Praspel, Hoa\Ruler and Hoa\Math respectively). Such a string has a length of around 500. Here are the new statistics:

substr  ||||||||||||||||||||||||||||||||||||||                       1ms,  67.2%
offset  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||     1ms, 100.0%

substr is still faster than offset, but by how much? Not even 1ms. So this is totally negligible, isn't it?

And even if the memory numbers did not change, here they are:

Array
(
    [substr] => 1400
    [offset] => 1704
)

For offset, I use the expression (?:token) together with the $offset argument. I was able to consume less memory (1430) by using ^(?:.{offset})\K(token) without the $offset argument, but with a huge increase in computing time (around 5 times).
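A small sketch of the \K variant mentioned above (same hypothetical subject, token and offset as before): \K resets the reported start of the match, so the prefix is no longer copied into $matches[0].

<?php

$subject = 'foobarbaz';
$offset  = 3;

// \K drops everything matched so far from the reported match,
// so only the token itself ends up in $matches[0].
preg_match('#^(?:.{' . $offset . '})\K(?:bar)#u', $subject, $matches);
var_dump($matches[0]); // string(3) "bar"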

So the question, finally, is: how to measure the memory? I will search tomorrow.

@CircleCode
Member Author

See https://github.com/CircleCode/perfotests for some tests based on xhprof.
It seems the difference between the two approaches is really small.

@hoaproject
Collaborator

Thanks for the benchmark!
Here is the result (with default parameters):

  • anchored: 1759ms (CPU) and 6632b (memory);
  • offset: 1290ms (CPU) and 5240b (memory).

Can we conclude that offset is better?

Edit: my php -v:

PHP 5.6.0-dev (cli) (built: Sep  4 2013 17:00:16)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.6.0-dev, Copyright (c) 1998-2013 Zend Technologies
    with Xdebug v2.3.0dev, Copyright (c) 2002-2013, by Derick Rethans

@CircleCode
Member Author

Before concluding, I think we need to run these tests in different environments…

It would be cool if some people could run the tests in different environments (essentially different PHP versions) and post their results here.

@osaris
Member

osaris commented Sep 11, 2013

My results:

PHP 5.4.19 (cli) (built: Sep 11 2013 13:43:44)

  • offset
Total Incl. Wall Time (microsec):   1,176,995 microsecs
Total Incl. CPU (microsecs):    1,176,815 microsecs
Total Incl. MemUse (bytes): 5,224 bytes
Number of Function Calls:   44,909
  • anchored
Total Incl. Wall Time (microsec):   1,535,765 microsecs
Total Incl. CPU (microsecs):    1,535,295 microsecs
Total Incl. MemUse (bytes): 6,632 bytes
Number of Function Calls:   64,909

PHP 5.5.3 (cli) (built: Sep 11 2013 13:58:04)

  • offset
Total Incl. Wall Time (microsec):   1,314,553 microsecs
Total Incl. CPU (microsecs):    1,314,387 microsecs
Total Incl. MemUse (bytes): 5,224 bytes
Number of Function Calls:   44,961
  • anchored
Total Incl. Wall Time (microsec):   1,678,744 microsecs
Total Incl. CPU (microsecs):    1,678,382 microsecs
Total Incl. MemUse (bytes): 6,632 bytes
Number of Function Calls:   64,961

@CircleCode
Member Author

I have some strange results regarding memory with 5.5.3…

PHP 5.3.10-1ubuntu3.4 with Suhosin-Patch (cli) (built: Sep 12 2012 18:59:41)
  • offset
Total Incl. Wall Time (microsec):   1,207,569 microsecs
Total Incl. CPU (microsecs):    1,207,275 microsecs
Total Incl. MemUse (bytes): 7,456 bytes
Number of Function Calls:   45,019
  • anchored
Total Incl. Wall Time (microsec):   1,665,031 microsecs
Total Incl. CPU (microsecs):    1,663,304 microsecs
Total Incl. MemUse (bytes): -32,255 bytes
Number of Function Calls:   65,019

@hoaproject
Collaborator

Absolutely normal :-].

@CircleCode
Member Author

Nevertheless, the ratio seems to stay the same, so we can conclude that offset is the better approach.

I'll update my PR to match the tip of origin, and update the doc accordingly.

@CircleCode
Member Author

With small data (the parsed string is between 200 and 250 chars):

./test.sh 10 50
PHP 5.5.3 (cli) (built: Aug 22 2013 05:36:52) 
  • offset
Total Incl. Wall Time (microsec):   2,065 microsecs
Total Incl. CPU (microsecs):    1,900 microsecs
Total Incl. MemUse (bytes): 5,240 bytes
Number of Function Calls:   228
  • anchored
Total Incl. Wall Time (microsec):   2,117 microsecs
Total Incl. CPU (microsecs):    1,900 microsecs
Total Incl. MemUse (bytes): 6,667 bytes
Number of Function Calls:   328

updated

@CircleCode
Member Author

Interesting (or not) question: does the overhead come from the number of function calls, or from the regex itself?

@hoaproject
Copy link
Collaborator

In what situations?
With a lot of data, we have a lot of mb_* calls (with UTF-8 enabled), so it lags. In this case, regex is much faster. But with a small amount of data, regex is still better, if only for memory. Moreover, the regex is compiled and cached…

@CircleCode
Member Author

I have updated the commit (the doc is still to be done) so that the new code can be tested. For history, the old commit is still available: CircleCode/Compiler@46dda6f.

Instead of consuming the matched text, the offset is increased
and regex matching is tested from that offset.

if(null === $nextToken)
    throw new \Hoa\Compiler\Exception\UnrecognizedToken(
        'Unrecognized token "%s" at line 1 and column %d:' .
        "\n" . '%s' . "\n" . str_repeat(' ', $offset) . '↑',
-       0, array(mb_substr($this->_text, 0, 1), $offset + 1, $text),
+       0, array(substr($this->_text, $this->_text[$offset], 1), $offset + 1, $text),
Member Author


substr and not mb_substr, because $offset is computed in raw byte length and not in UTF-8 character length
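A tiny sketch of the byte-versus-character distinction behind this choice (the ✓ token from the test pp below is used as an example):

<?php

// preg_match() reports and consumes byte offsets, so positions must be
// advanced and sliced in bytes, not in UTF-8 characters.
var_dump(strlen('✓'));             // int(3): three bytes in UTF-8
var_dump(mb_strlen('✓', 'UTF-8')); // int(1): one character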

@CircleCode
Member Author

I've updated my PR and added some comments in it about the remaining problems.

The following pp was used for the tests:

%skip       SPACE       \s
%token      FOO         foo
// note the Lookbehind in BAR :-)
%token      BAR         (?<=foo )bar
%token      BAZ         baz
%token      CHECK       ✓
%token      POUET       pouet
%token      UNEXPECTED  unexpected

#doc:
    <FOO> <BAR> <CHECK> <POUET> <BAZ>

with the following strings:

  • valid one: foo bar ✓ pouet baz
  • invalid one (to see offset errors): foo bar ✓ unexpected baz

@hoaproject
Collaborator

Nice!

My own test:

%token foo  foo|FOO
%token fbar (?<=foo)bar
%token Fbar (?<=FOO)bar
%token baz  baz

#root:
  <foo> ( <fbar> #lower | <Fbar> #upper ) <baz>

This is not a real use case, but it tests the lexer.

Please, see the patch 4dad944. You can close your PR (resolved on IRC).
Thanks a lot for this issue! Now we support look-behind tokens and we have a new lexer algorithm!

@CircleCode
Member Author

Seems OK to me, thanks for the implementation :)

@CircleCode CircleCode closed this Oct 14, 2013
@Savageman
Member

Hey, I'm just now replying to a 4-month-old message. Here it is:

The first comment of @hoaproject in this pull request mentions an "issue" with the ^ assertion and the $offset argument of preg_match(). Couldn't using \G instead of ^ be the appropriate solution for this?

More information can be found on this page of the documentation: http://www.php.net/manual/en/regexp.reference.escape.php

Maybe it's not relevant anymore, because the PR has been closed since.

@hoaproject
Collaborator

Does \G allow the use of look-behind syntax?

@Savageman
Member

I just did some testing and yes, you can use look-behind to perform checks on the characters prior to the provided offset.
Another piece of good news: when using \G, the $0 match does NOT include the characters from the beginning of the string! The $0 match starts at the specified offset :)
I will now do some benchmarks.
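For illustration (the subject 'foo bar', the offset 4 and the look-behind are hypothetical, not the benchmark), a minimal sketch of the \G behaviour described above:

<?php

$subject = 'foo bar';

// \G anchors the match at the $offset passed to preg_match(); the
// look-behind can still inspect the characters before that offset,
// and $matches[0] starts at the offset rather than at the beginning.
preg_match('#\G(?<=foo )bar#u', $subject, $matches, PREG_OFFSET_CAPTURE, 4);

var_dump($matches[0][0]); // string(3) "bar"
var_dump($matches[0][1]); // int(4)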

@CircleCode
Member Author

For your benchmark, you can use perfotests (see the previous comments) to compare against the previous and current implementations.

@hoaproject
Collaborator

Seems promising!

@Savageman
Member

Thanks, but I'm currently on Windows with an old PHP version. We'll need to wait a bit longer until I can test this properly.

Still, from what I saw, the computation time is identical, while the memory consumption is better.

@Hywan
Member

Hywan commented Nov 3, 2013

Do we reopen the issue or open a new one?

@CircleCode
Member Author

I ran the tests with perfotests (branch anchor_g) and it seems that it decreases memory consumption and computation time a little bit, so this seems good to me (PHP 5.5).

@Hywan
Member

Hywan commented Nov 25, 2013

Can you test with PHP 5.3.3?

@CircleCode
Member Author

Thanks to @camael24, we have the following results with PHP 5.3.27:

52931a2c95221.offset_assertion_10.xhprof 2013-11-25 10:40:40

Overall Summary
Total Incl. Wall Time (microsec): 4,520,651 microsecs
Total Incl. CPU (microsecs): 4,524,283 microsecs
Total Incl. MemUse (bytes): 3,568 bytes
Number of Function Calls: 55,170

52931a2c93f31.offset_aggregate_10.xhprof 2013-11-25 10:40:31

Overall Summary
Total Incl. Wall Time (microsec): 4,551,622 microsecs
Total Incl. CPU (microsecs): 4,543,084 microsecs
Total Incl. MemUse (bytes): 3,896 bytes
Number of Function Calls: 55,170

52931a2c947a2.anchored_aggregate_10.xhprof 2013-11-25 10:38:26

Overall Summary
Total Incl. Wall Time (microsec): 3,136,697 microsecs
Total Incl. CPU (microsecs): 3,128,996 microsecs
Total Incl. MemUse (bytes): -33,858 bytes
Number of Function Calls: 75,170

php -v
PHP 5.3.27-1~dotdeb.0 with Suhosin-Patch (cli) (built: Jul 25 2013 20:17:25)
Copyright (c) 1997-2013 The PHP Group
Zend Engine v2.3.0, Copyright (c) 1998-2013 Zend Technologies

@Hywan Hywan reopened this Nov 30, 2013
@Hywan
Member

Hywan commented Nov 30, 2013

Hello :-),

Thank you for this performance update. Can you make a PR (on Hoa\Compiler, not perfotest ;-))?

@Savageman
Member

Yes. I'm not so sure about the patch; would this commit be correct?
Savageman@f20b2fc

@Hywan
Member

Hywan commented Dec 1, 2013

Seems ok for me. /cc @CircleCode

@CircleCode
Member Author

Sounds OK to me too.

@Hywan
Member

Hywan commented May 14, 2014

What about this PR then?

@Hywan
Member

Hywan commented Aug 22, 2014

I think we can close this PR, no?

@CircleCode
Member Author

Sorry, I do not remember exactly what has been done…
As far as I remember, this feature has not been merged, even in another PR, so I don't think this can be closed.

@Hywan
Member

Hywan commented Aug 25, 2014

But we are now able to look behind in tokens. Make a new test; it works.

@Hywan Hywan closed this Mar 27, 2015