Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

optimize Dumper::encodeString for long strings #223

Closed
wants to merge 4 commits into from

Conversation

@schlndh
Copy link

schlndh commented Sep 18, 2016

  • feature
  • issues - none
  • documentation - not needed
  • BC break - yes

The purpose of this PR is to speedup dumping of large strings for reasonable maxLength. The performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) which scans the whole string. By preferring mb_substr it is possible to truncate the string before running it through the regex and thus potentially saving a lot of time for large strings. However this is a slight BC break, because now it won't recognize some binary strings it did recognize before, but since the binary part will be outside the truncated string I think it is worth it.

As you can see from the benchmark I made the performance impact for short strings or large strings with unlimited maxLength is minimal or none, however the performance gain for long strings with small maxLength is massive.

The benchmark compares current encodeString implementation (encodeStringOrig) with implementation from this PR (encodeStringChanged) and just running the preg_match on the whole string. The first column (N) is the number of characters and the values of other columns are average times for one call on given string in nanoseconds.

Benchmarking was done on Linux with PHP 7.0.10

@JanTvrdik

This comment has been minimized.

Copy link
Contributor

JanTvrdik commented Sep 18, 2016

How will mb_substr handle binary (not UTF-8) strings?


EDIT: I tried it myself and it seems to work fine.

$shortened = TRUE;
break;
}
} while (isset($s[++$i]));

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

This code block should be only done when the string was not already shortened by mb_substr, i.e.

} elseif ($maxLength && $s !== '') {

should be changed to

} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {
if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
$s = strtr($s, $table);
}
} elseif (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

The speed of the preg_match call be made faster by inverting the regexp, i.e. changing

preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()

to

!preg_match('#^[\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]*+\z#u', $s) || preg_last_error()

This comment has been minimized.

Copy link
@schlndh

schlndh Sep 18, 2016

Author

It is virtually the same for all test cases in my benchmark except for the test case where the whole string is 1 multibyte character repeated N times which is measurably faster with your regexp. I don't really understand why that is, but thanks.

}
if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
$s = strtr($s, $table);
}

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

This code duplication can be avoided if you change the following elseif back to if

if (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
$s = strtr($s, $table);
}
} elseif (preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) || preg_last_error()) {
if ($shortened = ($maxLength && strlen($s) > $maxLength)) {
$s = substr($s, 0, $maxLength);
}
$s = strtr($s, $table);

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

After changing the elseif statement back to if, you wil need to change the $shortened variable only when the string is futher shortend, i.e.

if ($maxLength && strlen($s) > $maxLength) {
    $s = substr($s, 0, $maxLength);
    $shortened = TRUE;
}
$s = strtr($s, $table);
$shortened = $s !== $tmp;
} else {
$shortened = false;
}

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

This can be simplified by initializing $shortend variable first to FALSE (note the uppercase letters)

$shortened = FALSE;
if ($maxLength && strlen($s) > $maxLength && function_exists('mb_substr')) {
    $s = mb_substr($tmp = $s, 0, $maxLength, 'UTF-8');
    $shortened = $s !== $tmp;
}
@schlndh

This comment has been minimized.

Copy link
Author

schlndh commented Sep 18, 2016

@JanTvrdik Thanks for your comments. It didn't think of checking mb_substr with binary strings before, but it sort of works:

$lengths = [];
for ($i < 0; $i < 1000; ++$i) {
    $s = openssl_random_pseudo_bytes(1000);
    @$lengths[strlen(mb_substr($s, 0, 150, 'UTF-8'))]++;
}
ksort($lengths);
foreach ($lengths as $k => $val) {
    echo $k . " " . str_repeat('|', $val) . "\n";
}

It makes the substring mostly around 200 bytes long, which is fine I guess.

schlndh added 2 commits Sep 18, 2016
@schlndh

This comment has been minimized.

Copy link
Author

schlndh commented Sep 18, 2016

I updated the benchmark with latest changes to this PR.

}
} while (isset($s[++$i]));
}
} elseif ($maxLength && $s !== '' && !function_exists('mb_substr')) {

This comment has been minimized.

Copy link
@JanTvrdik

JanTvrdik Sep 18, 2016

Contributor

$s !== '' part of the condition can be changed to strlen($s) > $maxLength

@dg dg closed this in 5b83fa4 Sep 26, 2016
@schlndh schlndh deleted the schlndh:speedup-Dumper-encodeString branch Sep 27, 2016
@JanTvrdik

This comment has been minimized.

Copy link
Contributor

JanTvrdik commented Sep 27, 2016

@dg Your implementation is a lot slower for long UTF-8 strings.

@dg

This comment has been minimized.

Copy link
Member

dg commented Sep 27, 2016

As I understand, „the performance problem is in preg_match('#[^\x09\x0A\x0D\x20-\x7E\xA0-\x{10FFFF}]#u', $s) which scans the whole string“, my implementation scans only shortened strings. So it should be faster, or not?

@JanTvrdik

This comment has been minimized.

@schlndh

This comment has been minimized.

Copy link
Author

schlndh commented Sep 27, 2016

@dg I don't have time to test it right now, maybe in the evening, but I think the problem is that you still call preg_match('##u', $s) on the whole string which can be very long, that's why I truncated it first so that I can be sure that it is reasonably short before calling preg_match on it.

@dg

This comment has been minimized.

Copy link
Member

dg commented Sep 27, 2016

I see, your intention is to remove utf-8 checking at all before shortening.

@schlndh

This comment has been minimized.

Copy link
Author

schlndh commented Sep 27, 2016

@dg Yes.

dg added a commit that referenced this pull request Sep 27, 2016
…d string [Closes #223]

thanks to @schlndh and @JanTvrdik
@dg

This comment has been minimized.

Copy link
Member

dg commented Sep 27, 2016

Repushed, now it should be faster when even mb_substr is disabled.

dg added a commit that referenced this pull request Sep 27, 2016
…d string [Closes #223]

thanks to @schlndh and @JanTvrdik
@JanTvrdik

This comment has been minimized.

Copy link
Contributor

JanTvrdik commented Sep 27, 2016

Repushed, now it should be faster when even mb_substr is disabled.

Usually, but if you try string str_repeat("\x80", 1e5) with mb_substr disabled it is still slow. Simple fix is to not allow $i > $maxLength * 4

@dg

This comment has been minimized.

Copy link
Member

dg commented Sep 27, 2016

I know, it is acceptable.

@JanTvrdik

This comment has been minimized.

Copy link
Contributor

JanTvrdik commented Sep 27, 2016

Accepting issue which can be fixed with single line (e.g. $s = substr($s, 0, $maxLength << 2);) of code is unnecessary

dg added a commit that referenced this pull request Sep 27, 2016
…d string [Closes #223]

thanks to @schlndh and @JanTvrdik
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.