New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix #76827: Introduce a new function str_truncate to make fast truncation #3483

Open
wants to merge 1 commit into
base: master
from

Conversation

@Zhang-Junzhi

Zhang-Junzhi commented Aug 31, 2018

This function is provided as a far more sufficient alternative of
truncating a large string to a still-very-large string.
For example,
str_truncate($str, 1024 * 1024) vs $str = substr($str, 0, 1024 * 1024)
Obviously the former works far more sufficiently than the latter.

Fix #76827: Introduce a new function str_truncate to make fast trunca…
…tion

This function is provided as a far more sufficient alternative of
truncating a large string to a still-very-large string.
For example,
str_truncate($str, 1024 * 1024) vs $str = substr($str, 0, 1024 * 1024)
Obviously the former works far more sufficiently than the latter.
@carusogabriel

carusogabriel suggested changes Aug 31, 2018 edited

I believe this will need a small https://wiki.php.net/rfc.

Also, can you provide some benchmarks?

echo "\nDone";
?>
--EXPECTF--

This comment has been minimized.

@carusogabriel

carusogabriel Aug 31, 2018

Contributor

We can use EXPECT here

@carusogabriel

carusogabriel Aug 31, 2018

Contributor

We can use EXPECT here

@@ -2804,6 +2809,7 @@ static const zend_function_entry basic_functions[] = { /* {{{ */
PHP_FE(substr_compare, arginfo_substr_compare)
PHP_FE(utf8_encode, arginfo_utf8_encode)
PHP_FE(utf8_decode, arginfo_utf8_decode)
PHP_FE(str_truncate, arginfo_str_truncate)

This comment has been minimized.

@carusogabriel

carusogabriel Aug 31, 2018

Contributor

Miss aligned

@carusogabriel

carusogabriel Aug 31, 2018

Contributor

Miss aligned

@Majkl578

When you say "truncate", I immediately imagine a function that truncates the string and appends ellipsis when too long.
But there is ftruncate() which empties a file, so naming is consistent with this one, not behavior though (ftruncate appends null bytes filling up to the specified length if contents are shorter).
Right now I see three problems here: a) works by reference, b) bad naming, c) not multibyte-safe.

@Majkl578

This comment has been minimized.

Show comment
Hide comment
@Majkl578

Majkl578 Aug 31, 2018

Contributor
str_cut(string $s, int $len) : string

would make much more sense IMHO.

Contributor

Majkl578 commented Aug 31, 2018

str_cut(string $s, int $len) : string

would make much more sense IMHO.

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Aug 31, 2018

Contributor

@Zhang-Junzhi Thanks for not only requesting a feature, but for also providing a respective pull request! However, this needs at least discussion on the internals@ mailing list, and likely an RFC. Please go ahead!

Wrt. the implementation: direct modification of the passed parameter causes issues; consider:

$str1 = $str2 = 'hello world';
str_truncate($str1, 5);
var_dump($str1, $str2);

See, for instance, array_sort() on how this can be done properly. Note, though, that a working implementation is not necessary to start the RFC process.

@Majkl578 The sole purpose of this function is to avoid copying a potentially large part of the string. Otherwise substr($str, 0, $len) would do the trick.

Contributor

cmb69 commented Aug 31, 2018

@Zhang-Junzhi Thanks for not only requesting a feature, but for also providing a respective pull request! However, this needs at least discussion on the internals@ mailing list, and likely an RFC. Please go ahead!

Wrt. the implementation: direct modification of the passed parameter causes issues; consider:

$str1 = $str2 = 'hello world';
str_truncate($str1, 5);
var_dump($str1, $str2);

See, for instance, array_sort() on how this can be done properly. Note, though, that a working implementation is not necessary to start the RFC process.

@Majkl578 The sole purpose of this function is to avoid copying a potentially large part of the string. Otherwise substr($str, 0, $len) would do the trick.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 1, 2018

Thank you all for your comments.

I am going to start a discussion on the mailing list and a RFC first, before actually updating the code and adding benchmarks.

As a non-native speaker, I also prefer the name |str_cut|, it looks short and pretty to me. So my proposal will go with |str_cut|.

And as @cmb69 said, the main purpose is to avoid copying a large part of a string. IMHO, if this function were to take care of mutlbyte safety of a large string, it would significantly lower efficiency, which might run counter to the function's main purpose.

Zhang-Junzhi commented Sep 1, 2018

Thank you all for your comments.

I am going to start a discussion on the mailing list and a RFC first, before actually updating the code and adding benchmarks.

As a non-native speaker, I also prefer the name |str_cut|, it looks short and pretty to me. So my proposal will go with |str_cut|.

And as @cmb69 said, the main purpose is to avoid copying a large part of a string. IMHO, if this function were to take care of mutlbyte safety of a large string, it would significantly lower efficiency, which might run counter to the function's main purpose.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 1, 2018

I tried to submit the subscription request of the internals@ mailing list twice two hours ago, but haven't yet received any confirmation email. Is it normal?

If it's abnormal, could somebody manually help me subscribe the internals@? My email address is zjz@zjz.name .

Thanks.

Zhang-Junzhi commented Sep 1, 2018

I tried to submit the subscription request of the internals@ mailing list twice two hours ago, but haven't yet received any confirmation email. Is it normal?

If it's abnormal, could somebody manually help me subscribe the internals@? My email address is zjz@zjz.name .

Thanks.

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Sep 1, 2018

Contributor

Did you try to subscribe via the web form, or by sending a mail to internals-subscribe@? The former is likely still broken, but the latter worked at least until the mailing list server went down.

Contributor

cmb69 commented Sep 1, 2018

Did you try to subscribe via the web form, or by sending a mail to internals-subscribe@? The former is likely still broken, but the latter worked at least until the mailing list server went down.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 1, 2018

I tried it via the web form, I just sent a mail to internals-subscribe@, now it works. Thanks.

Zhang-Junzhi commented Sep 1, 2018

I tried it via the web form, I just sent a mail to internals-subscribe@, now it works. Thanks.

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Sep 1, 2018

Contributor

Great! Note that broken webform has already been reported.

Contributor

cmb69 commented Sep 1, 2018

Great! Note that broken webform has already been reported.

@@ -2499,6 +2499,11 @@ ZEND_END_ARG_INFO()
ZEND_BEGIN_ARG_INFO_EX(arginfo_utf8_decode, 0, 0, 1)
ZEND_ARG_INFO(0, data)
ZEND_END_ARG_INFO()
ZEND_BEGIN_ARG_INFO_EX(arginfo_str_truncate, 0, 0, 2)
ZEND_ARG_INFO(0, str)

This comment has been minimized.

@nikic

nikic Sep 1, 2018

Member

The 0 here should be a 1 to pass the param by ref.

@nikic

nikic Sep 1, 2018

Member

The 0 here should be a 1 to pass the param by ref.

ZEND_PARSE_PARAMETERS_END();
if (new_len < ZSTR_LEN(str) && new_len >= 0) {
ZSTR_LEN(str) = new_len;

This comment has been minimized.

@nikic

nikic Sep 1, 2018

Member

This just sets the length, but will not reduce the allocation of the string. IIRC we have a zend_string_truncate API.

@nikic

nikic Sep 1, 2018

Member

This just sets the length, but will not reduce the allocation of the string. IIRC we have a zend_string_truncate API.

This comment has been minimized.

@sgolemon

sgolemon Sep 5, 2018

Contributor

Also, throw a warning on invalid value (e.g. less than 0, greater than current len)

@sgolemon

sgolemon Sep 5, 2018

Contributor

Also, throw a warning on invalid value (e.g. less than 0, greater than current len)

This comment has been minimized.

@cmb69

cmb69 Sep 6, 2018

Contributor

@nikic We have zend_string_truncate(), but it'll always copy the string, which would defeat the purpose of str_truncate().

@cmb69

cmb69 Sep 6, 2018

Contributor

@nikic We have zend_string_truncate(), but it'll always copy the string, which would defeat the purpose of str_truncate().

@nikic

This comment has been minimized.

Show comment
Hide comment
@nikic

nikic Sep 1, 2018

Member

It would be nice to see some real-world examples where this operation would provide a meaningful improvement. It's clear that it can provide a memory usage and performance improvement in theory, but it's not obvious in which context this would really make a practical difference. Especially once you take into account that the by-reference pass (which will involve the allocation and potentially deallocation of a reference wrapper) may easily make this function slower than substr for small strings.

Member

nikic commented Sep 1, 2018

It would be nice to see some real-world examples where this operation would provide a meaningful improvement. It's clear that it can provide a memory usage and performance improvement in theory, but it's not obvious in which context this would really make a practical difference. Especially once you take into account that the by-reference pass (which will involve the allocation and potentially deallocation of a reference wrapper) may easily make this function slower than substr for small strings.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 1, 2018

Thank you for your guide. @nikic

I actually encountered the real-world issue before I submitted the feature request and the PR.

I was doing command line job, In my use case, a large string was being built by repeatedly doing the concentrating operations(i.e. =.), however, it would also often encounter senarios that would require rolling back some trailing characters in the process of concentrations, and those senarios were not able to be predicted before hand.

So substr would be obviously undesirable in this case. One might suggest using some hacky workaround, I might be able to avoid using substr, for example, I could "fake-cut" the string and then continue "fake-concentrations" by byte-by-byte overwrite. but unfortunately, it was still not so efficient, and further, it just made the code look more complicated and counter-intuitive.

And one another usage of this function I currently have come up with is:
It can be used prevent the echo command from printing a over-long content.

Zhang-Junzhi commented Sep 1, 2018

Thank you for your guide. @nikic

I actually encountered the real-world issue before I submitted the feature request and the PR.

I was doing command line job, In my use case, a large string was being built by repeatedly doing the concentrating operations(i.e. =.), however, it would also often encounter senarios that would require rolling back some trailing characters in the process of concentrations, and those senarios were not able to be predicted before hand.

So substr would be obviously undesirable in this case. One might suggest using some hacky workaround, I might be able to avoid using substr, for example, I could "fake-cut" the string and then continue "fake-concentrations" by byte-by-byte overwrite. but unfortunately, it was still not so efficient, and further, it just made the code look more complicated and counter-intuitive.

And one another usage of this function I currently have come up with is:
It can be used prevent the echo command from printing a over-long content.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 1, 2018

Oh, and I just came up with a third usage: Delete some characters in-between a large string, and then re-concatenate the two disconnected parts.

// Supposing the $strip_from is large.

$str1 = substr($str, $strip_to);
str_cut($str, $strip_from);
$str .= $str1;

Zhang-Junzhi commented Sep 1, 2018

Oh, and I just came up with a third usage: Delete some characters in-between a large string, and then re-concatenate the two disconnected parts.

// Supposing the $strip_from is large.

$str1 = substr($str, $strip_to);
str_cut($str, $strip_from);
$str .= $str1;
@nikic

This comment has been minimized.

Show comment
Hide comment
@nikic

nikic Sep 2, 2018

Member

I was doing command line job, In my use case, a large string was being built by repeatedly doing the concentrating operations(i.e. =.), however, it would also often encounter senarios that would require rolling back some trailing characters in the process of concentrations, and those senarios were not able to be predicted before hand.

This makes me wonder if the better solution wouldn't be to introduce something akin to StringBuffer in Java. While a concatenation of the form $str .= $str2 can be reused, assuming that $str is not referenced elsewhere and that there is space to extend the allocation in-place, it would still be more efficient to directly allocate a larger buffer that is successively populated (what we do internally with smart_str). This class could also support in-place deletions (similar to StringBuffer again). This would also make it more obvious that the excess capacity will not be released.

This would be quite a bit more API surface though, I'm not sure how worthwhile it is.

Member

nikic commented Sep 2, 2018

I was doing command line job, In my use case, a large string was being built by repeatedly doing the concentrating operations(i.e. =.), however, it would also often encounter senarios that would require rolling back some trailing characters in the process of concentrations, and those senarios were not able to be predicted before hand.

This makes me wonder if the better solution wouldn't be to introduce something akin to StringBuffer in Java. While a concatenation of the form $str .= $str2 can be reused, assuming that $str is not referenced elsewhere and that there is space to extend the allocation in-place, it would still be more efficient to directly allocate a larger buffer that is successively populated (what we do internally with smart_str). This class could also support in-place deletions (similar to StringBuffer again). This would also make it more obvious that the excess capacity will not be released.

This would be quite a bit more API surface though, I'm not sure how worthwhile it is.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 2, 2018

I agree. Having StringBuffer or some such is definitely a better solution, I'd love to see it implemented, but yet will also be non-trival, and maybe in the predictable short term, something like StringBuffer won't be introduced into PHP, at least I personally haven't heard of any sign on that yet.

Right now, my proposal can be seem as a lightweight solution that can be implemented quickly.

Zhang-Junzhi commented Sep 2, 2018

I agree. Having StringBuffer or some such is definitely a better solution, I'd love to see it implemented, but yet will also be non-trival, and maybe in the predictable short term, something like StringBuffer won't be introduced into PHP, at least I personally haven't heard of any sign on that yet.

Right now, my proposal can be seem as a lightweight solution that can be implemented quickly.

@krakjoe

This comment has been minimized.

Show comment
Hide comment
@krakjoe

krakjoe Sep 5, 2018

Member

This can't be merged without discussion on the mailing list, please link back to the discussion here (via externals.io preferably) ...

If it transpires from discussion that an RFC is not required, that's fine, but for now, let's assume this will need an RFC.

Member

krakjoe commented Sep 5, 2018

This can't be merged without discussion on the mailing list, please link back to the discussion here (via externals.io preferably) ...

If it transpires from discussion that an RFC is not required, that's fine, but for now, let's assume this will need an RFC.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

Discussion was filed on the internals@ mailing list on Sep. 1st.

I am going to also create an RFC, but I am busy these days, I'll do it when I have time.

Zhang-Junzhi commented Sep 5, 2018

Discussion was filed on the internals@ mailing list on Sep. 1st.

I am going to also create an RFC, but I am busy these days, I'll do it when I have time.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi commented Sep 5, 2018

Discussion on internals@:

https://externals.io/message/103113

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Sep 5, 2018

Contributor

FWIW: I've hacked together a StringBuilder extension which may serve as base for futher exploration (see the test/ directory on how to use).

Contributor

cmb69 commented Sep 5, 2018

FWIW: I've hacked together a StringBuilder extension which may serve as base for futher exploration (see the test/ directory on how to use).

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

Amazing work, I am excited to give it a test soon!

Zhang-Junzhi commented Sep 5, 2018

Amazing work, I am excited to give it a test soon!

@rybakit

This comment has been minimized.

Show comment
Hide comment
@rybakit

rybakit Sep 5, 2018

It would be nice to see some real-world examples where this operation would provide a meaningful improvement.

I think it can be helpful for parses/(de)coders which process large strings. For example, in msgpack.php I use substr here and here. I can imagine that replacing substr with something that doesn't copy the entire string would make these methods faster, but this needs to be proven by benchmarking various buffer lengths :)

rybakit commented Sep 5, 2018

It would be nice to see some real-world examples where this operation would provide a meaningful improvement.

I think it can be helpful for parses/(de)coders which process large strings. For example, in msgpack.php I use substr here and here. I can imagine that replacing substr with something that doesn't copy the entire string would make these methods faster, but this needs to be proven by benchmarking various buffer lengths :)

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

I just wrote a test, as a rough benchmark for the StringBuilder extension provided by @cmb69 in the previous comment

The results I produced on my computer were all exactly as expected, both in CLI and FPM.

I am interested to see if these expected results would still be reproduced on your computers.

<?php
// Test by adjusting total length
echo "Expected: The bigger the total length is, the more performance gain StringBuilder usually has.\n\n";
for ($total_mb = 1; $total_mb <= 5; $total_mb++)
{
	echo "Testcase: total length {$total_mb}MB; interval length 1KB; step 8B\n";
	measure($total_mb * 1024 * 1024, 1024, 8);
	echo "\n";
}
echo "-----------------------------------------------------------\n\n\n\n";

// Test by adjusting interval length
echo "Expected: The samller the interval length is, the more performance gain StringBuilder usually has.\n\n";
for ($interval_kb = 1; $interval_kb <= 16; $interval_kb *= 2)
{
	echo "Testcase: total length 1MB; interval length {$interval_kb}KB; step 8B\n";
	measure(1024 * 1024, $interval_kb * 1024, 8);
	echo "\n";
}
echo "-----------------------------------------------------------\n\n\n\n";

// Test by adjusting step length
echo "Expected: The bigger the step length is, the more performance gain StringBuilder usually has.\n\n";
for ($step_byte = 1; $step_byte <= 16; $step_byte *= 2)
{
	echo "Testcase: total length 1MB; interval length 1KB; step {$step_byte}B\n";
	measure(1024 * 1024, 1024, $step_byte);
	echo "\n";
}

/*
 * This builds a string which has $total_len bytes in total, using two methods:
 * 1) Regular string concatenation and chopping
 * 2) StringBuilder concatenation and chopping
 * And then prints and analyzes the time spent on each method.
 *
 * In the process of building, each concatenated string is of $step_len bytes.
 * Each time the string that's being built reaches a multiple of $interval_len
 * bytes, it makes a chopping.
 *
 * Expected rule:
 * The bigger $total_len is, the more performance gain StringBuilder has.
 * The samller $interval_len is, the more performance gain StringBuilder has.
 * The bigger $step_len is, the more performance gain StringBuilder has.
 * The third expected rule is caused by function calls which are typically more
 * expensive than native string concatenation operations.
 */
function measure(int $total_len, int $interval_len, int $step_len)
{
	// Check validity of the parameters.
	if ($step_len <= 0)
		throw new Exception('The step length can only be positive');
	if ($interval_len < $step_len)
		throw new Exception('The interval length must be bigger than the step length');
	if ($total_len < $interval_len)
		throw new Exception('The total length must be bigger than the interval length');
	if ($interval_len % $step_len)
		throw new Exception('The interval length must be a multiple of the step length');
	if ($total_len % $interval_len)
		throw new Exception('The total length must be a multiple of the interval length');

	$str_unit = str_pad('c', $step_len);
	$total_loop_count = $total_len / $step_len;
	$internal_loop_count = $interval_len / $step_len;

	// Measure time spent on regular string.
	$str = '';
	$start_time = microtime(true);
	for ($i = 0; $i <= $total_loop_count; $i++)
	{
		$str .= $str_unit;
		if (0 === $i % $internal_loop_count)
			$str = substr($str, 0, 0 - $step_len) . $str_unit;
	}
	$end_time = microtime(true);
	$time_on_regular = (int)(($end_time - $start_time) * 1000000);
	echo "Microseconds on regular string: $time_on_regular\n";

	// Measure time spent on regular string.
	$builder = new StringBuilder;
	$start_time = microtime(true);
	for ($i = 0; $i <= $total_loop_count; $i++)
	{
		$builder->append($str_unit);
		if (0 === $i % $internal_loop_count)
		{
			$builder->chop($step_len);
			$builder->append($str_unit);
		}
	}
	// This following line doesn't have meaning on its own,
	// but it's intentionally written here for the sake of fairness of comparison.
	$builder->toString();
	$end_time = microtime(true);
	$time_on_builder = (int)(($end_time - $start_time) * 1000000);
	echo "Microseconds on string builder: $time_on_builder\n";

	$efficiency_ratio = round($time_on_regular / $time_on_builder, 2);
	echo "Efficiency ratio beween regular string and string builder: 1 / $efficiency_ratio\n";
}

Zhang-Junzhi commented Sep 5, 2018

I just wrote a test, as a rough benchmark for the StringBuilder extension provided by @cmb69 in the previous comment

The results I produced on my computer were all exactly as expected, both in CLI and FPM.

I am interested to see if these expected results would still be reproduced on your computers.

<?php
// Test by adjusting total length
echo "Expected: The bigger the total length is, the more performance gain StringBuilder usually has.\n\n";
for ($total_mb = 1; $total_mb <= 5; $total_mb++)
{
	echo "Testcase: total length {$total_mb}MB; interval length 1KB; step 8B\n";
	measure($total_mb * 1024 * 1024, 1024, 8);
	echo "\n";
}
echo "-----------------------------------------------------------\n\n\n\n";

// Test by adjusting interval length
echo "Expected: The samller the interval length is, the more performance gain StringBuilder usually has.\n\n";
for ($interval_kb = 1; $interval_kb <= 16; $interval_kb *= 2)
{
	echo "Testcase: total length 1MB; interval length {$interval_kb}KB; step 8B\n";
	measure(1024 * 1024, $interval_kb * 1024, 8);
	echo "\n";
}
echo "-----------------------------------------------------------\n\n\n\n";

// Test by adjusting step length
echo "Expected: The bigger the step length is, the more performance gain StringBuilder usually has.\n\n";
for ($step_byte = 1; $step_byte <= 16; $step_byte *= 2)
{
	echo "Testcase: total length 1MB; interval length 1KB; step {$step_byte}B\n";
	measure(1024 * 1024, 1024, $step_byte);
	echo "\n";
}

/*
 * This builds a string which has $total_len bytes in total, using two methods:
 * 1) Regular string concatenation and chopping
 * 2) StringBuilder concatenation and chopping
 * And then prints and analyzes the time spent on each method.
 *
 * In the process of building, each concatenated string is of $step_len bytes.
 * Each time the string that's being built reaches a multiple of $interval_len
 * bytes, it makes a chopping.
 *
 * Expected rule:
 * The bigger $total_len is, the more performance gain StringBuilder has.
 * The samller $interval_len is, the more performance gain StringBuilder has.
 * The bigger $step_len is, the more performance gain StringBuilder has.
 * The third expected rule is caused by function calls which are typically more
 * expensive than native string concatenation operations.
 */
function measure(int $total_len, int $interval_len, int $step_len)
{
	// Check validity of the parameters.
	if ($step_len <= 0)
		throw new Exception('The step length can only be positive');
	if ($interval_len < $step_len)
		throw new Exception('The interval length must be bigger than the step length');
	if ($total_len < $interval_len)
		throw new Exception('The total length must be bigger than the interval length');
	if ($interval_len % $step_len)
		throw new Exception('The interval length must be a multiple of the step length');
	if ($total_len % $interval_len)
		throw new Exception('The total length must be a multiple of the interval length');

	$str_unit = str_pad('c', $step_len);
	$total_loop_count = $total_len / $step_len;
	$internal_loop_count = $interval_len / $step_len;

	// Measure time spent on regular string.
	$str = '';
	$start_time = microtime(true);
	for ($i = 0; $i <= $total_loop_count; $i++)
	{
		$str .= $str_unit;
		if (0 === $i % $internal_loop_count)
			$str = substr($str, 0, 0 - $step_len) . $str_unit;
	}
	$end_time = microtime(true);
	$time_on_regular = (int)(($end_time - $start_time) * 1000000);
	echo "Microseconds on regular string: $time_on_regular\n";

	// Measure time spent on regular string.
	$builder = new StringBuilder;
	$start_time = microtime(true);
	for ($i = 0; $i <= $total_loop_count; $i++)
	{
		$builder->append($str_unit);
		if (0 === $i % $internal_loop_count)
		{
			$builder->chop($step_len);
			$builder->append($str_unit);
		}
	}
	// This following line doesn't have meaning on its own,
	// but it's intentionally written here for the sake of fairness of comparison.
	$builder->toString();
	$end_time = microtime(true);
	$time_on_builder = (int)(($end_time - $start_time) * 1000000);
	echo "Microseconds on string builder: $time_on_builder\n";

	$efficiency_ratio = round($time_on_regular / $time_on_builder, 2);
	echo "Efficiency ratio beween regular string and string builder: 1 / $efficiency_ratio\n";
}
@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

My command line results:

Expected: The bigger the total length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 33877
Microseconds on string builder: 5511
Efficiency ratio beween regular string and string builder: 1 / 6.15

Testcase: total length 2MB; interval length 1KB; step 8B
Microseconds on regular string: 125907
Microseconds on string builder: 11610
Efficiency ratio beween regular string and string builder: 1 / 10.84

Testcase: total length 3MB; interval length 1KB; step 8B
Microseconds on regular string: 517920
Microseconds on string builder: 18526
Efficiency ratio beween regular string and string builder: 1 / 27.96

Testcase: total length 4MB; interval length 1KB; step 8B
Microseconds on regular string: 1405775
Microseconds on string builder: 24434
Efficiency ratio beween regular string and string builder: 1 / 57.53

Testcase: total length 5MB; interval length 1KB; step 8B
Microseconds on regular string: 2452805
Microseconds on string builder: 31685
Efficiency ratio beween regular string and string builder: 1 / 77.41


Expected: The samller the interval length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 34637
Microseconds on string builder: 5940
Efficiency ratio beween regular string and string builder: 1 / 5.83

Testcase: total length 1MB; interval length 2KB; step 8B
Microseconds on regular string: 19047
Microseconds on string builder: 5667
Efficiency ratio beween regular string and string builder: 1 / 3.36

Testcase: total length 1MB; interval length 4KB; step 8B
Microseconds on regular string: 12784
Microseconds on string builder: 5475
Efficiency ratio beween regular string and string builder: 1 / 2.33

Testcase: total length 1MB; interval length 8KB; step 8B
Microseconds on regular string: 8584
Microseconds on string builder: 5614
Efficiency ratio beween regular string and string builder: 1 / 1.53

Testcase: total length 1MB; interval length 16KB; step 8B
Microseconds on regular string: 7694
Microseconds on string builder: 5442
Efficiency ratio beween regular string and string builder: 1 / 1.41


Expected: The bigger the step length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 1B
Microseconds on regular string: 69142
Microseconds on string builder: 44003
Efficiency ratio beween regular string and string builder: 1 / 1.57

Testcase: total length 1MB; interval length 1KB; step 2B
Microseconds on regular string: 48466
Microseconds on string builder: 23295
Efficiency ratio beween regular string and string builder: 1 / 2.08

Testcase: total length 1MB; interval length 1KB; step 4B
Microseconds on regular string: 38584
Microseconds on string builder: 11835
Efficiency ratio beween regular string and string builder: 1 / 3.26

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 33806
Microseconds on string builder: 5916
Efficiency ratio beween regular string and string builder: 1 / 5.71

Testcase: total length 1MB; interval length 1KB; step 16B
Microseconds on regular string: 31181
Microseconds on string builder: 2972
Efficiency ratio beween regular string and string builder: 1 / 10.49

Zhang-Junzhi commented Sep 5, 2018

My command line results:

Expected: The bigger the total length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 33877
Microseconds on string builder: 5511
Efficiency ratio beween regular string and string builder: 1 / 6.15

Testcase: total length 2MB; interval length 1KB; step 8B
Microseconds on regular string: 125907
Microseconds on string builder: 11610
Efficiency ratio beween regular string and string builder: 1 / 10.84

Testcase: total length 3MB; interval length 1KB; step 8B
Microseconds on regular string: 517920
Microseconds on string builder: 18526
Efficiency ratio beween regular string and string builder: 1 / 27.96

Testcase: total length 4MB; interval length 1KB; step 8B
Microseconds on regular string: 1405775
Microseconds on string builder: 24434
Efficiency ratio beween regular string and string builder: 1 / 57.53

Testcase: total length 5MB; interval length 1KB; step 8B
Microseconds on regular string: 2452805
Microseconds on string builder: 31685
Efficiency ratio beween regular string and string builder: 1 / 77.41


Expected: The samller the interval length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 34637
Microseconds on string builder: 5940
Efficiency ratio beween regular string and string builder: 1 / 5.83

Testcase: total length 1MB; interval length 2KB; step 8B
Microseconds on regular string: 19047
Microseconds on string builder: 5667
Efficiency ratio beween regular string and string builder: 1 / 3.36

Testcase: total length 1MB; interval length 4KB; step 8B
Microseconds on regular string: 12784
Microseconds on string builder: 5475
Efficiency ratio beween regular string and string builder: 1 / 2.33

Testcase: total length 1MB; interval length 8KB; step 8B
Microseconds on regular string: 8584
Microseconds on string builder: 5614
Efficiency ratio beween regular string and string builder: 1 / 1.53

Testcase: total length 1MB; interval length 16KB; step 8B
Microseconds on regular string: 7694
Microseconds on string builder: 5442
Efficiency ratio beween regular string and string builder: 1 / 1.41


Expected: The bigger the step length is, the more performance gain StringBuilder usually has.

Testcase: total length 1MB; interval length 1KB; step 1B
Microseconds on regular string: 69142
Microseconds on string builder: 44003
Efficiency ratio beween regular string and string builder: 1 / 1.57

Testcase: total length 1MB; interval length 1KB; step 2B
Microseconds on regular string: 48466
Microseconds on string builder: 23295
Efficiency ratio beween regular string and string builder: 1 / 2.08

Testcase: total length 1MB; interval length 1KB; step 4B
Microseconds on regular string: 38584
Microseconds on string builder: 11835
Efficiency ratio beween regular string and string builder: 1 / 3.26

Testcase: total length 1MB; interval length 1KB; step 8B
Microseconds on regular string: 33806
Microseconds on string builder: 5916
Efficiency ratio beween regular string and string builder: 1 / 5.71

Testcase: total length 1MB; interval length 1KB; step 16B
Microseconds on regular string: 31181
Microseconds on string builder: 2972
Efficiency ratio beween regular string and string builder: 1 / 10.49

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

@rybakit Your code seems to have to deal with sub strings from non-zero start positions.

Right now, both this PR by me and the string builder by @cmb69 aim to only those sub strings starting from zero position.

I believe in future, with more complicated algorithms, the string builder class can do the complicated work(i.e. provide a function like StringBuilder.Remove in JAVA). It would be very promising.

Zhang-Junzhi commented Sep 5, 2018

@rybakit Your code seems to have to deal with sub strings from non-zero start positions.

Right now, both this PR by me and the string builder by @cmb69 aim to only those sub strings starting from zero position.

I believe in future, with more complicated algorithms, the string builder class can do the complicated work(i.e. provide a function like StringBuilder.Remove in JAVA). It would be very promising.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 5, 2018

I am also wondering if the StringBuilder can go into the agenda of PHP interval. If so, is this str_cut still needed anyway?

Is StringBuilder just enough? Or both of them(StringBuilder and str_cut) are still okay to be provided?

Any thoughts?

I am happy to see either.

Zhang-Junzhi commented Sep 5, 2018

I am also wondering if the StringBuilder can go into the agenda of PHP interval. If so, is this str_cut still needed anyway?

Is StringBuilder just enough? Or both of them(StringBuilder and str_cut) are still okay to be provided?

Any thoughts?

I am happy to see either.

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Sep 5, 2018

Contributor

I'm not (yet) convinced that a StringBuilder would be generally useful. I've seen similar performance improvements like those which you've measured when chopping, but only appending strings didn't make much of a difference (sometimes the performance was worse than using string concatenation). I tend to prefer such a class over the suggested str_cut() function, though. Yet another alternative might be to stick with what we already have (e.g. to use arrays of strings, and to build the final string with implode(), what needs more memory, but should be quite fast as well).

Wrt. the current implementation of StringBuilder: it uses smart_str which is (presently) mainly made for appending strings. I've added the ::chop() method so it could be compared for the suggested use case. It would certainly be possible to add further methods (such as ::insert(), ::delete() etc.) The implementation might not even use smart_str (but some other concept). However, I'm not sure if it would make much sense to have these for PHP since we're usually character set agnostic (contrary to Java).

Contributor

cmb69 commented Sep 5, 2018

I'm not (yet) convinced that a StringBuilder would be generally useful. I've seen similar performance improvements like those which you've measured when chopping, but only appending strings didn't make much of a difference (sometimes the performance was worse than using string concatenation). I tend to prefer such a class over the suggested str_cut() function, though. Yet another alternative might be to stick with what we already have (e.g. to use arrays of strings, and to build the final string with implode(), what needs more memory, but should be quite fast as well).

Wrt. the current implementation of StringBuilder: it uses smart_str which is (presently) mainly made for appending strings. I've added the ::chop() method so it could be compared for the suggested use case. It would certainly be possible to add further methods (such as ::insert(), ::delete() etc.) The implementation might not even use smart_str (but some other concept). However, I'm not sure if it would make much sense to have these for PHP since we're usually character set agnostic (contrary to Java).

// Should not take effect, it cannot untruncate back.
str_truncate($string, strlen('Hello'));
var_dump($string);

This comment has been minimized.

@sgolemon

sgolemon Sep 5, 2018

Contributor

Add a test case to ensure no CoW variables are modified unexpectedly.

$string = "Hello World";
$cow = $string;
str_tuncate($string, strlen("Hello"));
var_dump($string); // string(5) "Hello"
var_dump($cow); // string(11) "Hello World"

Spoiler alert; In your patch's current state, this will fail as $cow will have been modified as well.

@sgolemon

sgolemon Sep 5, 2018

Contributor

Add a test case to ensure no CoW variables are modified unexpectedly.

$string = "Hello World";
$cow = $string;
str_tuncate($string, strlen("Hello"));
var_dump($string); // string(5) "Hello"
var_dump($cow); // string(11) "Hello World"

Spoiler alert; In your patch's current state, this will fail as $cow will have been modified as well.

@sgolemon

This comment has been minimized.

Show comment
Hide comment
@sgolemon

sgolemon Sep 5, 2018

Contributor

@cmb69 Instead of using the smart_str API, which reallocs in slabs as the string grows, have you considered storing fragments as they're appended (the allocations already exist, so the push events are quick), then doing a final allocation and copy when the string is materialized? This generally just defers the load until later, but on a highly fragmented, very longs string could probably result in fewer total copies as those reallocs are avoided.

Contributor

sgolemon commented Sep 5, 2018

@cmb69 Instead of using the smart_str API, which reallocs in slabs as the string grows, have you considered storing fragments as they're appended (the allocations already exist, so the push events are quick), then doing a final allocation and copy when the string is materialized? This generally just defers the load until later, but on a highly fragmented, very longs string could probably result in fewer total copies as those reallocs are avoided.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 6, 2018

@cmb69 IMO, even if sometimes the performance could be worse because chopping doesn't take place, StringBuilder still makes sense. Because it's stable in time complexity which means you don't need to worry about safety of time complexity in corner cases, while substr is unstable in time complexity.

Array + implode is also time complexity stable like StringBuilder, but not very intuitive. Sometimes, finding a paticular substring to chop requiring iterating over the entire array, and things become complicated if different parts of the paticular substring cross over array elements.

As to multibyte safety, it's fortunate that UTF-8 is the predominant encoding nowadays, finding the UTF-8 character boundary for trailing characters isn't hard. Anyway, substr or array+implode gains no benefit in this respect.

These are my personal rationales for adding StringBuilder.

Zhang-Junzhi commented Sep 6, 2018

@cmb69 IMO, even if sometimes the performance could be worse because chopping doesn't take place, StringBuilder still makes sense. Because it's stable in time complexity which means you don't need to worry about safety of time complexity in corner cases, while substr is unstable in time complexity.

Array + implode is also time complexity stable like StringBuilder, but not very intuitive. Sometimes, finding a paticular substring to chop requiring iterating over the entire array, and things become complicated if different parts of the paticular substring cross over array elements.

As to multibyte safety, it's fortunate that UTF-8 is the predominant encoding nowadays, finding the UTF-8 character boundary for trailing characters isn't hard. Anyway, substr or array+implode gains no benefit in this respect.

These are my personal rationales for adding StringBuilder.

@Zhang-Junzhi

This comment has been minimized.

Show comment
Hide comment
@Zhang-Junzhi

Zhang-Junzhi Sep 6, 2018

@sgolemon Yes, good spot. @cmb69 also pointed it out in this comment.

The current code is rough and isn't right, will need to go through a revision.

For now, I am not sure if this proposal is going to be accepted, so let's just keep the PR unchanged for now.

Zhang-Junzhi commented Sep 6, 2018

@sgolemon Yes, good spot. @cmb69 also pointed it out in this comment.

The current code is rough and isn't right, will need to go through a revision.

For now, I am not sure if this proposal is going to be accepted, so let's just keep the PR unchanged for now.

@cmb69

This comment has been minimized.

Show comment
Hide comment
@cmb69

cmb69 Sep 6, 2018

Contributor

@sgolemon

Instead of using the smart_str API, which reallocs in slabs as the string grows, have you considered storing fragments as they're appended

That would be one of the “other concepts” I mentioned above (another would be to let the internal string grow by a factor, instead of linearly). Both concepts might be useful for smart_str, but since they're not implemented, they may not be desireable.

@Zhang-Junzhi

Anyway, substr or array+implode gains no benefit in this respect [multibyte safety].

Well, if programmers use array+implode “manually” they're in full control, and could use mb_substr() instead of substr(), etc.

Contributor

cmb69 commented Sep 6, 2018

@sgolemon

Instead of using the smart_str API, which reallocs in slabs as the string grows, have you considered storing fragments as they're appended

That would be one of the “other concepts” I mentioned above (another would be to let the internal string grow by a factor, instead of linearly). Both concepts might be useful for smart_str, but since they're not implemented, they may not be desireable.

@Zhang-Junzhi

Anyway, substr or array+implode gains no benefit in this respect [multibyte safety].

Well, if programmers use array+implode “manually” they're in full control, and could use mb_substr() instead of substr(), etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment