attempt to allow look-behind assertions in tokens #6
Conversation
Hello :-). Actually, I used the following tests:

offset.php:

```php
<?php
$repetitions  = 1000;
$start_string = str_repeat("foo", 10000);
$time_start   = microtime(true);

for ($i = $repetitions; $i > 0; $i--) {
    $offset = 0;
    $string = $start_string;

    while ($offset < strlen($string)) {
        $found   = preg_match('#(?:' . 'foo' . ')#u', $string, $matches, PREG_OFFSET_CAPTURE, $offset);
        $success = $offset === $matches[0][1];
        $offset += strlen($matches[0][0]);
    }
}

$time_end = microtime(true);
$time     = $time_end - $time_start;
$memory   = memory_get_peak_usage(true);

function convert($size) {
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');

    return @round($size / pow(1024, ($i = floor(log($size, 1024)))), 2) . ' ' . $unit[$i];
}

$memory_2 = convert($memory);

print "test with offset\n";
print "time for $repetitions executions:\t$time seconds\n";
print "memory for $repetitions executions:\t$memory bytes ($memory_2)\n";
```

substr.php:

```php
<?php
$repetitions  = 1000;
$start_string = str_repeat("foo", 10000);
$time_start   = microtime(true);

for ($i = $repetitions; $i > 0; $i--) {
    $string = $start_string;

    while (0 < strlen($string)) {
        $found  = preg_match('#^(?:' . 'foo' . ')#u', $string, $matches);
        $string = substr($string, strlen($matches[0]));
    }
}

$time_end = microtime(true);
$time     = $time_end - $time_start;
$memory   = memory_get_peak_usage(true);

function convert($size) {
    $unit = array('b', 'kb', 'mb', 'gb', 'tb', 'pb');

    return @round($size / pow(1024, ($i = floor(log($size, 1024)))), 2) . ' ' . $unit[$i];
}

$memory_2 = convert($memory);

print "test with substring\n";
print "time for $repetitions executions:\t$time seconds\n";
print "memory for $repetitions executions:\t$memory bytes ($memory_2)\n";
```

Results:
Even if these tests are not totally realistic, they tend to confirm that using substring and anchor is really faster (around 75% faster). I am not really surprised that the anchor is faster, but the difference seems huge. |
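There is a subtlety worth making explicit here: with the `$offset` argument of `preg_match()`, the `^` assertion still means "start of the subject", not "start at the offset". That is why the offset variant above cannot keep the anchor and must instead verify the match position with `PREG_OFFSET_CAPTURE`. A minimal sketch of both behaviours (plain PCRE, no Hoa code involved):

```php
<?php
$subject = 'barfoo';

// Anchored pattern + offset: no match, because '^' only matches at position 0
// of the subject, regardless of the offset passed to preg_match().
$anchored = preg_match('#^foo#', $subject, $m, 0, 3);
var_dump($anchored); // int(0)

// Unanchored pattern + offset: matches, but the match could also start *after*
// the offset, hence the explicit position check against PREG_OFFSET_CAPTURE.
preg_match('#foo#', $subject, $m, PREG_OFFSET_CAPTURE, 3);
var_dump($m[0][1] === 3); // bool(true)
```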
btw, it seems my test does not show whether there is a memory consumption difference, since I suppose the peak is due to the string… maybe these scripts could be updated to show memory consumption |
Curiously, I have significantly different results. Please see the following benchmark:

```php
<?php
require '/usr/local/lib/Hoa/Core/Core.php';

from('Hoa')
-> import('Bench.~');

$bench  = new Hoa\Bench();
$memory = array();
$string = str_repeat('foo', 10000);

$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while (0 < strlen($_string)) {
    preg_match('#^(?:foo)#u', $_string, $matches);
    $_string = mb_substr($_string, mb_strlen($matches[0]));
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string          = $string;
$offset           = 0;
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while ($offset < strlen($_string)) {
    preg_match('#(?:foo)#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset);
    $offset += mb_strlen($matches[0][0]);
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];

echo $bench;
print_r($memory);
```

Here is the result:
Using offset saves a lot of CPU and memory. Please note that I used mb_* functions to be in a realistic situation. And here is my PHP version:
Thoughts? |
Other stats from @vonglasow: https://gist.github.com/vonglasow/6007324. With this last link, we see that in a version without Hoa, the results were very different. Can we conclude from this? |
Oh, I have added a new optimisation: strlen($_string) is now computed once, outside of the loop.

```php
$bench  = new Hoa\Bench();
$memory = array();
$string = str_repeat('foo', 10000);

$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while (0 < strlen($_string)) {
    preg_match('#^(?:foo)#u', $_string, $matches);
    $_string = mb_substr($_string, mb_strlen($matches[0]));
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string          = $string;
$offset           = 0;
$maxoffset        = strlen($_string);
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while ($offset < $maxoffset) {
    preg_match('#(?:foo)#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset);
    $offset += mb_strlen($matches[0][0]);
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];

echo $bench;
print_r($memory);
```

My result:
And the new big benchmark: http://3v4l.org/mN9Ir. Results are quite similar but |
Here are the results from @osaris: http://pastie.org/private/splp4ujcgu4kadt67zlwa. They reinforce our previous results. But one last thing: here we test the same tokens. In a grammar with 10 tokens, 1 will match. What is the cost of the 9 failures? I propose you make a patch and we will test on real-world data. |
I have added a new test with more than one token:

```php
<?php
require '/usr/local/lib/Hoa/Core/Core.php';

from('Hoa')
-> import('Bench.~');

$bench  = new Hoa\Bench();
$memory = array();
$string = null;
$tokens = array('foo', 'bar', 'baz', 'qux', 'gordon', 'freeman');
$_      = count($tokens) - 1;

for ($i = 0; $i < 10000; ++$i)
    $string .= $tokens[mt_rand(0, $_)];

$_string          = $string;
$memory['substr'] = memory_get_usage();
$bench->substr->start();

while (0 < strlen($_string)) {
    foreach ($tokens as $token) {
        if (0 === preg_match('#^(?:' . $token . ')#u', $_string, $matches))
            continue;

        $_string = mb_substr($_string, mb_strlen($matches[0]));

        break;
    }
}

$bench->substr->stop();
$memory['substr'] = memory_get_usage() - $memory['substr'];

unset($matches);
unset($_string);

$_string          = $string;
$offset           = 0;
$maxoffset        = mb_strlen($_string);
$memory['offset'] = memory_get_usage();
$bench->offset->start();

while ($offset < $maxoffset) {
    foreach ($tokens as $token) {
        if (0 === preg_match('#(?:' . $token . ')#u', $_string, $matches, PREG_OFFSET_CAPTURE, $offset))
            continue;

        if ($offset !== $matches[0][1])
            continue;

        $offset += mb_strlen($matches[0][0]);

        break;
    }
}

$bench->offset->stop();
$memory['offset'] = memory_get_usage() - $memory['offset'];

echo $bench;
print_r($memory);
```

Here is my result:
Results are quite similar. Even if the difference between We can notice that with a small |
I have optimized the |
I've been writing and experimenting with many algorithms and ways to solve issues for many hours… but all the memory benchmarks are false. So, let's sum up. I have removed the bench about memory, since I don't understand why the numbers didn't change even when the string changes. So, let's talk about computing time for now.
What do we see?
And even if the numbers didn't change for memory, here they are:
So the question, finally, is: how to measure the memory? I will search tomorrow. |
See https://github.com/CircleCode/perfotests for some tests based on XHProf. |
Thanks for the benchmark!
Can we conclude that Edit: my
|
Before concluding, I think it is required that we launch these tests in different environments… it would be cool if some people could run the test under different environments (PHP version, essentially) and post their results here |
My results :
|
I have some strange results regarding memory with 5.5.3…
|
Absolutely normal :-]. |
Nevertheless, the ratio seems to stay identical, so we can conclude that offset is the better approach. I'll update my PR to match the tip of origin, and update the doc accordingly |
with small data (the parsed string is between 200 and 250 chars):
updated |
Interesting (or not) question: does the overhead come from the number of function calls, or from the regex itself? |
In what situations? |
Updated the commit (doc is still to be done) so that the new code can be tested (for history, the old commit is still available: CircleCode/Compiler@46dda6f) |
Instead of consuming matched text, offset is increased and regex matching is tested from offset.
|
```diff
 if(null === $nextToken)
     throw new \Hoa\Compiler\Exception\UnrecognizedToken(
         'Unrecognized token "%s" at line 1 and column %d:' .
         "\n" . '%s' . "\n" . str_repeat(' ', $offset) . '↑',
-        0, array(mb_substr($this->_text, 0, 1), $offset + 1, $text),
+        0, array(substr($this->_text, $this->_text[$offset], 1), $offset + 1, $text),
```
`substr` and not `mb_substr`, because `$offset` is computed in raw (byte) length and not in UTF-8 length.
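That remark deserves a concrete illustration. `PREG_OFFSET_CAPTURE` reports positions in bytes, even with the `/u` modifier, so indexing back into the subject must use `substr()` rather than `mb_substr()`. A small sketch (not Hoa code, just plain PHP with a hypothetical subject string):

```php
<?php
$subject = 'éfoo'; // 'é' is 2 bytes in UTF-8

preg_match('#foo#u', $subject, $m, PREG_OFFSET_CAPTURE);

var_dump($m[0][1]);                             // int(2): byte offset
var_dump(mb_strpos($subject, 'foo', 0, 'UTF-8')); // int(1): character offset
var_dump(substr($subject, $m[0][1], 3));        // string(3) "foo"
```

Using `mb_substr($subject, $m[0][1], …)` here would be off by one character for every multi-byte character before the match.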
I've updated my PR and added some comments in it for the remaining problems. The following pp was used for the tests:
with following strings :
|
Nice! My own test:
This is not a real use case, but it tests the lexer. Please, see the patch 4dad944. You can close your PR (resolved on IRC). |
seems ok for me, thanks for the implementation :) |
Hey, I'm just posting now about a 4-month-old message. Here it is: the first comment of @hoaproject in this pull request relates an "issue" with the ^ assertion and the $offset argument of preg_match(). Couldn't using \G instead of ^ be the appropriate solution? More information can be found on this page of the documentation: http://www.php.net/manual/en/regexp.reference.escape.php. Maybe it's not relevant anymore, because the PR has been closed since. |
Does |
I just did some testing, and yes, you can use look-behind to perform checks on the characters prior to the provided offset. |
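A quick check confirming this (assumption: plain PCRE behaviour, the subject string is made up for illustration): a look-behind in a pattern matched with an offset can inspect characters before that offset.

```php
<?php
$subject = 'foobarbaz';

// Match 'baz' at offset 6, but only if it is preceded by 'bar'
// (bytes 3-5, i.e. *before* the offset).
$r = preg_match('#(?<=bar)baz#', $subject, $m, 0, 6);
var_dump($r);    // int(1)
var_dump($m[0]); // string(3) "baz"

// With a look-behind that does not match what precedes the offset,
// the same call fails.
var_dump(preg_match('#(?<=foo)baz#', $subject, $m, 0, 6)); // int(0)
```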
For your benchmark, you can use perfotest (see previous comments) to compare with previous and current implementations. |
Seems promising! |
Thanks, but I'm currently on Windows with an old PHP version. We'll need to wait more until I can test this properly. Still, from what I saw, the computation time is identical, while the memory consumption is better. |
Do we reopen the issue or open a new one? |
I ran the tests with perfotests (branch anchor_g) and it seems that it decreases memory consumption and computation time a little bit, so this seems good to me (PHP 5.5) |
Can you test with PHP 5.3.3? |
thanks to @camael24, we have the following results with PHP 5.3.27:

- 52931a2c95221.offset_assertion_10.xhprof (2013-11-25 10:40:40) Overall Summary
- 52931a2c93f31.offset_aggregate_10.xhprof (2013-11-25 10:40:31) Overall Summary
- 52931a2c947a2.anchored_aggregate_10.xhprof (2013-11-25 10:38:26) Overall Summary

php -v |
Hello :-), Thank you for this performance update. Can you make a PR (on |
Yes. I'm not so sure about the patch; would this commit be correct? |
Seems ok for me. /cc @CircleCode |
sounds ok for me too |
What about this PR then? |
I think we can close this PR, no? |
Sorry, I do not remember exactly what has been done… |
But we are able to look behind now in tokens. Make a new test, it works. |
Instead of consuming matched text, the offset is increased
and regex matching is tested from the offset.

Note: since we remove the '^' anchor in the preg_match test,
this can lead to degraded performance.

This allows, for example, the following pp:

with the following string:

maybe this can help #5