Skip to content

zend_string changes and additions #1353

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 2 commits into from

Conversation

flaupretre
Copy link
Contributor

This PR initially proposed to suppress the 'persistent' argument of zend_string realloc/extend/truncate functions. As this would break already-ported extensions, we now propose additions only.

Add zend_string_resize() function

This function is similar to zend_string_realloc() but doesn't take the 'persistent' arg, as a zend_string is always re-allocated in the memory space it was created in.

Add a collection of zend_string_append() functions

These function allow to append to a zend_string from :

  • another zend_string,
  • a pair (char *, len),
  • a literal string,
  • or a null-terminated string (using strlen()).

The 'append' functions are all based on zend_string_resize().

I also fixed what I consider as a bug, as long as none gives me a good reason to do so : when trucating a string via zend_string_realloc() or zend_string_truncate(), the resulting string is not null-terminated.

@flaupretre flaupretre force-pushed the zend_string_mods branch 2 times, most recently from f026bac to 1230765 Compare June 22, 2015 01:14
@weltling
Copy link
Contributor

IMO changing the existing API is already a breakage for 7.0, quite some extensions are already ported and could be broken again. Probably not what we want at the current stage. But adding some useful APIs/macros with consistent naming should be no issue.

Thanks.

@flaupretre flaupretre force-pushed the zend_string_mods branch 2 times, most recently from 6bbf0d4 to c8e1c20 Compare June 22, 2015 15:36
Also fix a bug: when trucating a zend_string, the resulting string is not null-terminated.
Add ZVAL_STR_RESIZE() macro
@flaupretre
Copy link
Contributor Author

Right. The PR was modified to propose only additions, without any BC break.

@weltling
Copy link
Contributor

I'm running some tests but how it currently looks there's no regression. I wouldn't push this into the next release as it's tagged tomorrow. But after alpha2 there'll be the last chance to add this, also we'll have the whole two weeks to ensure it's worky in master. Btw, and what about Z_STR_VAL() and co mentioned in the thread?

@dstogov how does the code look for you?

Thanks.

@flaupretre
Copy link
Contributor Author

About ZSTR_xxx() macros, I am currently restructuring the ‘zstrict’ PR (#1348) which will include this proposal for a clean zend_string API with separate getters/setters.

I hope I can propose something clean tomorrow (UTC). I’ll send a message to you as soon as there’s something to review.

@dstogov
Copy link
Member

dstogov commented Jun 23, 2015

I see no problem with renaming and additional "more consistent" API, but I don't like behavior changes, like setting s->val[s->len] = "\0"; Note, that the whole PHP sources work fine without this. It's just useless in most frequent use cases, because we use memcopy(). It also cause non-linear data access locality, and most probably will lead to CPU miss.

@nikic
Copy link
Member

nikic commented Jun 23, 2015

Please also implement usages of these new APIs, to show that these are actually common. Personally I have some doubts about the usefulness of mutable append APIs, as this is normally handled by smart_str. We do have code for (non-mutable) concats lying around, e.g. zend_concat3.

@flaupretre
Copy link
Contributor Author

@dstogov - I am not against the decision of considering that the zend_string API does not ensure that the strings it returns are null-terminated (even if it would probably cause many issues) but we must be consistent : The whole zend_string API takes special care to maintain the terminating null-byte except zend_string_truncate() and zend_string_realloc() (when trucating) which will always return a non-null-terminated string. So, do we consider that the API maintains this null byte, or is it up to the user to set it every time he receives a strings from a zend_string_xxx() function ? Actually, the question may be expressed as : how important is it that zval strings are null-terminated ?

The message of saying : "Some functions maintain the terminating null and others don't" is not the right one, IMHO.

@flaupretre
Copy link
Contributor Author

@nikic - smart_str have a set of append functions but, IMO, zend_string will gradually replace smart_str. smart_str is slightly slower and doesn't provide much more, except a set of append functions. So, once we implement a pair of missing features in zend_string, I don't see why both systems would coexist much longer than the time for extension maintainers to migrate their code. That's something that can disappear in 8.0.

About the fact that append is common or not : it was considered common enough for smart_str, why would it be different ? I think I'll even extend it to provide exactly the same options as smart_str.

@nikic
Copy link
Member

nikic commented Jun 23, 2015

@flaupretre zend_string and smart_str are for two different purposes. zend_string is essentially a finished string that will not be changed anymore. smart_str is what you use to get the zend_string based on multiple append operations. smart_str is specially designed around that usage, by using allocation policies that provide good performance if you need to do many appends to generate the final string.

@flaupretre
Copy link
Contributor Author

@nikic I don’t have enough time now but I am pretty sure we can combine the benefits of both systems without loosing performance (integrate smart_str mem allocation policy in zend_string).

Anyway, we already have zend_string realloc/extend/truncate. If I understand your point, these shouldn’t have been defined. The main point is that there is no easy tranformation between zval and smart_str, when zend_string is the underlying mechanism of zval. So, do you mean that a string zval is “essentially a finished string that will not be changed anymore” ? If this is the case, we should probably deprecate the zend_string realloc/extend/truncate functions. If a string zval can be modified, the zend_string layer can provide ways to perform these modifications.

This will go to another PR as it's unrelated to this branch's purpose
@flaupretre
Copy link
Contributor Author

@dstogov Just removed the additional null byte set in realloc/truncate. But I'll revive this in another PR :) because I still see it as a bug.

@nikic
Copy link
Member

nikic commented Jun 26, 2015

@flaupretre I'll try to clarify what I mean here. First off, smart_str is internally based on a zend_string. Note that there is also smart_string, which is essentially the same but working on a char* -- I'm not talking about that one.

Normally zend_strings will be allocated to be only as long as needed to store their content. In that situation doing append operations is inefficient because you would have to do a (slow) reallocation of the string on every single append.

smart_str improves this by preallocating a larger buffer (starting with approximately 256 characters) so multiple appends can happen without reallocation. If the buffer is exceeded if will be enlarged to approximately 4096 bytes and so on. The sizes are tweaked to integrate well into our custom allocator. When the string is finished it will usually either be copied into a new (precisely allocated) buffer or truncated.

You suggest that we might be able to combine the benefits of both approaches. That would be nice (by way of unifying the API) but I don't see any obvious way to do this, as the allocation policies (smallest possible vs overly large) are pretty contradictory.

You have correctly noted that the existence of realloc etc. functions goes somewhat against my point here. These are necessary for multiple reasons. E.g. smart_str itself uses zend_strings so requires a reallocation mechanism for it. But there are also other places that resize strings and where using smart_str is not reasonable. For example a simple concat $a .= $b that goes through concat_function will try to reuse the string buffer for $a if this is possible. As this is only an individual concat and we know the resulting length in advance, using smart_str here would be counterproductive. Quickly looking at a few results on LXR, another popular usage seems to encoding functions, which pessimistically overallocate at first and then truncate to the actual length.

I'm not saying that mutable append APIs are useless -- I don't know how useful they'd be -- which is why I asked for them to be used in php-src where applicable (it should be possible to find all possible application points by checking the usages of the functions you listed). This would both test your implementation and allow us to gauge how often these are useful. I would not want to add them without knowing this, especially if it might cause extension authors to use these in cases where using smart_str would have been more appropriate.

@flaupretre
Copy link
Contributor Author

@nikic Thanks for your reply. The more I think about it, the more I think the solution is to combine both. The challenge is to add the flexibility of smart_str to zend_string without loosing performance nor wasting too much memory. Today, the main problem I see with maintaining both APIs is that it is quite impractical and slow to use the smart_str API on a zval (must convert back and forth).

Just thinking out loud :

The first step is to add a zend_string field containing the allocated size.

Then, a zend_string can be marked as 'over-allocating' (using the smart_str allocation policy) at creation time or even after it is created. This will affect the way it is created and extended. Another function can over-allocate by an amount passed as argument (when we have an idea in advance of the memory we'll need).

When the string is considered as 'finished', an opposite function can mark it as 'strictly-allocated'. This function would also optionally reclaim the over-allocated memory. This would allow switching from one allocation scheme to the other as often as we want, providing a lot of flexibility.. This would cost a pair of additional checks but it would avoid realloc() calls and the compiler should be able to suppress most of them (these are all inlined functions and they are often called from the same function).

Anyway, that should be performance-tested.

@flaupretre flaupretre closed this Jan 2, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants