Efficient alternate representations for lists of characters (strings). #24
Comments
Should there be a specialized syntactic form for the introduction of partial strings? Something like |
if you want to have something special for it, then use a |
But most of the time there will be built-ins (like those for Another option is to produce a compact representation during garbage collection. But rather leave this for a much later phase. |
Ok, from what I was able to understand by reading this, #95 and the source code at different points in time, Ideally we could just interpret a pointer into the heap as either a A better way may be to represent a Footnotes
|
Partial strings were never allocated in the heap, no.. pointers in the heap owned string instances that were allocated with the global allocator. |
This makes more sense, thanks for the correction. I think this just adds to my point that implementing this as intended is hard. |
Or maybe that indicates that inlining the string is the wrong approach. I can see a lot of benefits that an indirection provides:
While I'm at it, I also think it would be good to store the partial string length somewhere, instead of representing it implicitly with a NUL byte. If we use a |
The key advantage of heap allocation is fast reclamation of memory on backtracking, using constant time! |
... and the other advantage of null-terminated strings on the heap is that a DCG for parsing does not require auxiliary memory. Not like #1714 |
Ok, now that I've researched it a bit I think it's completely legal in Rust to just straight up interpret a part of the heap as a And maybe this simple implementation is not actually enough, because it ties the lifetime of the Maybe it would be best to not create a reference here at all and just work with My proof of concept also has the unfortunate implementation detail that it needs to traverse the string twice, once for making a |
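To make the "traverses the string twice" point concrete, here is a minimal sketch in safe Rust. It assumes the heap can be viewed as a plain byte slice; `str_at` and its signature are hypothetical and not taken from the Scryer Prolog source. One pass finds the NUL sentinel, a second pass validates UTF-8.

```rust
use std::str::{self, Utf8Error};

/// Hypothetical helper: views the bytes of `heap` from `start` up to (but not
/// including) the next NUL byte as a `&str`. If there is no NUL byte, the view
/// runs to the end of `heap`. Panics if `start` is past the end of `heap`.
fn str_at(heap: &[u8], start: usize) -> Result<&str, Utf8Error> {
    let bytes = &heap[start..];
    // First pass: locate the NUL sentinel to learn the length.
    let len = bytes.iter().position(|&b| b == 0).unwrap_or(bytes.len());
    // Second pass: check that the bytes are well-formed UTF-8.
    str::from_utf8(&bytes[..len])
}
```

The returned reference borrows from `heap`, so it carries exactly the lifetime coupling discussed above: the heap cannot be mutated while the `&str` is alive.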
Thinking more about this, I think using |
The internal encoding should be fixed to UTF-8, since supporting different internal encoding variants for strings is too error-prone. Conversion to a Rust string may only be needed in comparatively few places (example use case: predicates from An interesting design question is: How can the code be structured so that as few places as possible need to take the specialized string representation into account? The more places are affected, the easier it is to accidentally forget that the representation must be handled. |
I was more thinking about "partial lists of bytes/octets", not alternative encodings for strings. This would be useful for things that need to operate on raw bytes. I agree that UTF-8 should be the only builtin encoding for strings. |
There is one aspect which caught my attention recently. The key element of partial strings is the use of null-terminated UTF-8 strings, which raises the question of how to represent the zero character. So far, I believed that resorting to Prolog terms would be the best representation, but another option might be to use Modified UTF-8, which assigns a two-byte encoding to the null character. So that might make things a bit easier. |
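As a rough illustration of the Modified UTF-8 idea: U+0000 is written as the two bytes 0xC0 0x80, so a zero byte never occurs inside the string and stays available as the terminator. The function names below are hypothetical, and the sketch adopts only the NUL rule (the Modified UTF-8 used by Java also treats supplementary characters differently, which is ignored here).

```rust
/// Encodes `s` as UTF-8, except that U+0000 becomes 0xC0 0x80.
/// The output therefore never contains a zero byte.
fn encode_nul_free(s: &str) -> Vec<u8> {
    let mut out = Vec::with_capacity(s.len());
    for &b in s.as_bytes() {
        if b == 0 {
            // In valid UTF-8 a zero byte can only be the character U+0000.
            out.extend_from_slice(&[0xC0, 0x80]);
        } else {
            out.push(b);
        }
    }
    out
}

/// Inverse of `encode_nul_free`. Returns `None` if the result is not valid UTF-8.
fn decode_nul_free(bytes: &[u8]) -> Option<String> {
    let mut out = Vec::with_capacity(bytes.len());
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i] == 0xC0 && bytes.get(i + 1) == Some(&0x80) {
            out.push(0); // the two-byte form stands for U+0000
            i += 2;
        } else {
            out.push(bytes[i]);
            i += 1;
        }
    }
    String::from_utf8(out).ok()
}
```

The cost is that the stored bytes are no longer strict UTF-8, so any place that hands them to Rust as a `&str` would need the decoding step.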
Compared to using UTF-8 characters, this would reduce their overhead at best by a factor of two. Worth the effort? |
Personally, I think the space saving is not worth the effort, but there is a different interesting advantage that a dedicated encoding specifically for octets could yield, if it guarantees that every byte can be processed in the same amount of time: This could be a very useful property for cryptographic routines that must not reveal any properties of secret data in any way, not even through differences in the time it takes to process it. If processing the byte This is only worth considering if the number of places in the Rust parts of Scryer Prolog that have to take such an encoding into account is kept to an acceptably small number, certainly much smaller than it is currently and where even "normal" strings are not yet correctly implemented (#1969). |
Or we could just store the length of the inlined string somewhere and avoid this and many other problems automatically. Conceptual example for term
This simplifies this case, but a string with a NUL character is an edge case. However, having a length is very useful in many other places, so that we don't have to traverse the string and/or keep track of the length separately. NUL terminated strings are widely considered to be one of C's greatest defects, and Zig and Rust have builtin slices for this reason. I don't think that an extra 8 bytes of overhead in the representation of the partial string is too much of a cost to pay for this, especially for large strings which are the ones that benefit the most from this.
Another use would be to mmap a file as binary, so that we could efficiently reason about its contents with DCGs. |
There is no such space. That is it would produce garbage all the time, like #1714 |
I don't understand, could you elaborate? Can't we encode it like I showed? |
Just look at
And now use your representation with "abc". In the first inference, Whereas the null-terminated version just increments the pointer (and checks that the next element is not \0). |
Ok, that makes complete sense, thanks for the explanation. I still wonder if there is a way to avoid sentinels without space concerns, but with this I'm convinced that inlining the length doesn't make sense. |
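For readers following along, this is how I understand the argument, as a sketch under assumptions: the heap is a byte slice, a sentinel string is just an offset, and a length-carrying string is a hypothetical `(offset, length)` cell. None of these names come from the actual implementation. With the sentinel, taking the tail only moves an offset; with a stored length, every tail needs a fresh cell.

```rust
/// Width in bytes of the UTF-8 sequence starting with `lead`
/// (assumes `lead` is a valid leading byte).
fn utf8_width(lead: u8) -> usize {
    match lead {
        0x00..=0x7F => 1,
        0xC0..=0xDF => 2,
        0xE0..=0xEF => 3,
        _ => 4,
    }
}

/// Sentinel representation: returns the first character and the offset of the
/// tail, or `None` at the NUL sentinel. Nothing is written anywhere.
fn step_sentinel(heap: &[u8], at: usize) -> Option<(char, usize)> {
    let lead = *heap.get(at)?;
    if lead == 0 {
        return None; // reached the sentinel: the known prefix ends here
    }
    let end = at + utf8_width(lead);
    let c = std::str::from_utf8(heap.get(at..end)?).ok()?.chars().next()?;
    Some((c, end))
}

/// Hypothetical length-carrying cell: byte offset plus remaining length.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Slice {
    at: usize,
    len: usize,
}

/// Length-carrying representation: returns the first character and the index
/// of a newly pushed cell describing the tail.
fn step_with_length(heap: &[u8], cells: &mut Vec<Slice>, cell: usize) -> Option<(char, usize)> {
    let Slice { at, len } = *cells.get(cell)?;
    if len == 0 {
        return None; // empty string: no first character
    }
    let width = utf8_width(*heap.get(at)?);
    let rest = len.checked_sub(width)?;
    let c = std::str::from_utf8(heap.get(at..at + width)?).ok()?.chars().next()?;
    // The tail has no header of its own, so one must be allocated.
    cells.push(Slice { at: at + width, len: rest });
    Some((c, cells.len() - 1))
}
```

Walking "abc" with `step_with_length` leaves three extra cells behind, one per inference, while `step_sentinel` leaves none.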
I just thought about an optimization: if we have an inlined partial string that contains zero bytes, we don't need two allocations, we can reuse that zero byte as a sentinel. For
Here I'm assuming that the "arguments" to "Cons" can be of arbitrary size, not just 2. I don't know how this is actually implemented or if there are any problems with this, but it allows skipping an indirection by having "PartialStringPointer" always be followed by the tail. This could make something like #251 work with arbitrary UTF-8 strings (or even arbitrary mmapped binary files, as I talked about above). |
(I don't get your point) What happens with |
It avoids an indirection and the need for a
Well, in the example I gave with Also, the indexes in my example before were wrong, I corrected them. |
When does what operation happen? You are not referring to the |
One general comment about this issue, also directed to authors of other Prolog systems so that they can learn from the experience we gain here: Implementing or retaining an ad hoc workaround for such an elementary design issue may eventually cost more resources than actually solving it. As a recent example, and ca. 6 years after first discussing this issue, we have #2356 which again needed addressing on its own. With partial strings implemented correctly, all the writing can be moved to Prolog and the time can be used instead to further strengthen the core. |
Another such case is #2381. |
The gift that keeps on giving yields another interesting mistake due to so many special cases that currently occur and all need to be handled correctly in the code:
?- Xs = "abc", Xs = [F|_], assertz(F).
   error(type_error(callable,a),assertz/1), unexpected.
See #2529. |
I stopped testing as it does not make sense with #1714 open. |
The current |
We should only close it when it really works correctly! There are a few remaining mistakes, notably #2554. |
Consider adding an internal string data type as an alternate representation for lists of characters:
Complete strings
The tagged value is a byte-aligned pointer to a null-terminated sequence of UTF-8 characters on the heap (copy stack). This means that the full address space of 32/64 bits cannot be used, but one or two bits less than that. In many systems this is not a problem, since operating systems already make similar assumptions anyway.
A list of n characters
[C1, ..., CN]
would thus occupy between n+1 and n+8 bytes, depending on alignment, whereas the naive representation as a list of chars would take n · (3 · 8) bytes.
In this manner, the "tail" of the string can be computed rapidly (in constant time), which is key for fast parsing with DCGs. Note that SWI7's built-in string requires for
sub_string(String, 1, _, 0, Tail)
time and space proportional to the length of String.
Then, all operations from unification upwards need to be able to handle both lists of chars and that string type. That is indeed the more challenging part.
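The size comparison above can be checked with a little arithmetic. The sketch below assumes 8-byte words and single-byte (ASCII) characters, so that n characters are n bytes; the function names are made up for illustration.

```rust
/// Bytes per heap word on a 64-bit system (assumption).
const WORD: usize = 8;

/// Bytes used by the packed form: n bytes plus one NUL terminator,
/// rounded up to a whole number of words.
fn packed_bytes(n: usize) -> usize {
    (n + WORD) / WORD * WORD
}

/// Bytes used by the naive list of chars: three words per element,
/// as stated in the issue text.
fn naive_bytes(n: usize) -> usize {
    n * 3 * WORD
}
```

For example, `packed_bytes(7)` is 8 (the n+1 case) and `packed_bytes(8)` is 16 (the n+8 case), while `naive_bytes(8)` is 192.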
Partial strings
This would take the idea one step further. Instead of representing a list of known characters, a partial list of known characters would be represented compactly. E.g.
Xs0 = [a,b,c,d|Xs]
would be represented as a null-terminated string "abcd" plus padding up to the next word, which represents Xs. This would make library(pio) very fast!
Just as a remark: Strings that contain the null character can still be represented as regular lists of characters.
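A possible layout for the partial string described above, as a sketch only: the heap is modelled as a `Vec<u8>`, words are assumed to be 8 bytes, and the tail is stored as an opaque `u64` cell. All names here are hypothetical.

```rust
/// Bytes per heap word (assumption).
const WORD: usize = 8;

/// Appends `prefix`, a NUL terminator, padding up to the next word boundary,
/// and then one word holding `tail`. Returns (offset of the first byte,
/// offset of the tail word), or `None` if `prefix` contains a NUL byte, in
/// which case the string stays a regular list of characters.
fn push_partial_string(heap: &mut Vec<u8>, prefix: &str, tail: u64) -> Option<(usize, usize)> {
    if prefix.as_bytes().contains(&0) {
        return None;
    }
    let start = heap.len();
    heap.extend_from_slice(prefix.as_bytes());
    heap.push(0); // sentinel
    while heap.len() % WORD != 0 {
        heap.push(0); // padding
    }
    let tail_at = heap.len();
    heap.extend_from_slice(&tail.to_le_bytes());
    Some((start, tail_at))
}

/// Given the offset of the NUL terminator, the tail word sits at the next
/// word boundary after it.
fn tail_cell_after(nul_at: usize) -> usize {
    (nul_at / WORD + 1) * WORD
}
```

For `Xs0 = [a,b,c,d|Xs]` on an empty heap, the bytes of "abcd" occupy offsets 0 to 3, the sentinel is at offset 4, and the cell for `Xs` starts at offset 8.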