Fixed-length string with terminating character #13
Comments
Yeah, that might be a good idea - I like the approach that you've suggested. It also has a very important feature: this way, data types can be both read (what we're doing now) and written (what we'll hopefully be doing in the future, i.e. serialization of objects into binary streams) without any ambiguity. |
Returning to this proposal: I'd like to suggest a slight modification: do not reuse
The parsing idea is to read the full field at once and then strip as many right-padding bytes (from the right side of the string) as we can. The generation (serialization) idea is to append right-padding bytes to the given string's byte representation until it reaches the full desired length of the field. The semantics are slightly different than |
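A minimal Ruby sketch of the two directions described above, just to make the round trip concrete. The helper names `strip_right_pad` and `pad_right` are made up for illustration and are not part of any runtime API.

```ruby
# Hypothetical helpers; the names are illustrative, not runtime API.

# Parsing: drop every trailing byte equal to pad_byte.
def strip_right_pad(bytes, pad_byte)
  data = bytes.b
  new_len = data.bytesize
  new_len -= 1 while new_len > 0 && data.getbyte(new_len - 1) == pad_byte
  data.byteslice(0, new_len)
end

# Generation: append pad_byte until the field is exactly `size` bytes long.
def pad_right(bytes, size, pad_byte)
  data = bytes.b
  raise ArgumentError, "string longer than field" if data.bytesize > size
  data + pad_byte.chr * (size - data.bytesize)
end

field = pad_right("ABC", 8, 0x20)      # 8-byte field, space-padded
name = strip_right_pad(field, 0x20)    # back to the 3-byte payload
```

One consequence of these semantics: a value that legitimately ends in the padding byte (say `"AB "` with space padding) does not survive the round trip, because the trailing space is indistinguishable from padding.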
I'd still prefer the terminator, myself. There are cases where the rest of the data in the buffer after the terminator is simply gibberish. Consider that the string buffer would often be allocated using:
char* str = (char*) malloc(16);
strcpy(str, "ABC");
Possible result:
|
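As a made-up illustration of that situation (the garbage bytes below are invented, not taken from any real buffer), here is a small Ruby sketch comparing the two strategies on a 16-byte field. Both helper names are hypothetical.

```ruby
# Invented 16-byte buffer: "ABC", a zero terminator, then leftover garbage.
buf = "ABC\x00".b + "\x7F\x13\xE2\x01qZ\x00\xFF\x08\x40\x99\x02".b

# Terminator strategy: keep everything before the first terminator byte.
def cut_at_terminator(bytes, term)
  data = bytes.b
  idx = data.index(term.chr.b)
  idx ? data.byteslice(0, idx) : data
end

# Padding strategy: only trailing pad bytes are removed.
def strip_right_pad(bytes, pad_byte)
  data = bytes.b
  len = data.bytesize
  len -= 1 while len > 0 && data.getbyte(len - 1) == pad_byte
  data.byteslice(0, len)
end

by_terminator = cut_at_terminator(buf, 0)   # "ABC"
by_padding = strip_right_pad(buf, 0)        # garbage is still attached
```

With this input, padding removal leaves the whole buffer untouched because the last byte is not zero, while the terminator approach recovers just `ABC`.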
Ok, then we actually need both - they're different, albeit linked concepts. I don't really like the idea of mixing different string reading algorithms (i.e.
  - id: my_str
    type: strz
    # terminator: 0 by default
    encoding: UTF-8
    size: 16
instead of:
seq:
  - id: my_str
    type: terminated_str
    size: 16
types:
  terminated_str:
    seq:
      - id: value
        type: strz
        # terminator: 0 by default
        encoding: UTF-8
|
Yeah, works for me. But I'm starting to see the two different string readers merge into one. If
seq:
  - id: my_str
    type: str # performs the role of both str and strz
    # at least one of size and terminator must be specified
    size: 16 # optional if terminator is specified (strz behaviour)
    terminator: 0 # doesn't default to 0 if not specified
    # other str/strz options
    encoding: ASCII
    pad-right: 0x20
    ...
While we're talking about strings, what do you think about adding a default encoding to the meta section:
meta:
  endian: le
  encoding: UTF-8
seq:
  - id: str1
    type: str
    size: 32
    # use the default encoding
  - id: str2
    type: str
    size: 64
    encoding: ASCII # encoding can still be overridden
|
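The override rule in that proposal is easy to state in code. This is only a sketch of the lookup, assuming the spec has been loaded into plain Ruby hashes; `effective_encoding` is a hypothetical name, and how the compiler would actually resolve it is not specified here.

```ruby
# Hypothetical in-memory form of the spec above.
spec = {
  "meta" => { "endian" => "le", "encoding" => "UTF-8" },
  "seq" => [
    { "id" => "str1", "type" => "str", "size" => 32 },
    { "id" => "str2", "type" => "str", "size" => 64, "encoding" => "ASCII" },
  ],
}

# A field-level encoding wins; otherwise fall back to meta/encoding.
def effective_encoding(spec, field)
  encoding = field["encoding"] || spec.fetch("meta", {})["encoding"]
  raise ArgumentError, "no encoding for field #{field['id']}" if encoding.nil?
  encoding
end

encodings = spec["seq"].map { |f| [f["id"], effective_encoding(spec, f)] }.to_h
```

Raising when neither level specifies an encoding is a design choice made for this sketch, not something the thread settles.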
I don't think that the two essential string methods will ever merge, but it's probably a good idea to unify the APIs. I support the idea of having
def read_str_eos(encoding, right_pad)
  buf = read_till_eof
  postprocess_str(buf, encoding, right_pad)
end
def read_str(size, encoding, right_pad)
  buf = read_buf(size)
  postprocess_str(buf, encoding, right_pad)
end
def read_strz(encoding, term, include_term, consume_term, eos_error, right_pad)
  buf = read_till_terminator(term, include_term, consume_term, eos_error)
  postprocess_str(buf, encoding, right_pad)
end
def postprocess_str(buf, encoding, right_pad)
  # 1. remove padding
  # 2. convert byte buffer -> string using specified encoding
end
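For anyone who wants to play with the outline above, here is a runnable sketch on top of `StringIO`. Only the method names and parameter lists come from the pseudocode; the `SketchStream` class, the error handling, and the choice to return UTF-8 strings are assumptions made for the example, not the actual runtime.

```ruby
require "stringio"

# Sketch only: method names follow the pseudocode above, the rest is assumed.
class SketchStream
  def initialize(bytes)
    @io = StringIO.new(bytes.b)
  end

  def read_str_eos(encoding, right_pad)
    postprocess_str(@io.read, encoding, right_pad)
  end

  def read_str(size, encoding, right_pad)
    buf = @io.read(size)
    raise EOFError, "wanted #{size} bytes" if buf.nil? || buf.bytesize < size
    postprocess_str(buf, encoding, right_pad)
  end

  def read_strz(encoding, term, include_term, consume_term, eos_error, right_pad)
    buf = String.new # empty binary string
    loop do
      byte = @io.getbyte
      if byte.nil?
        raise EOFError, "terminator #{term} not found" if eos_error
        break
      end
      if byte == term
        buf << byte.chr if include_term
        @io.ungetbyte(byte) unless consume_term
        break
      end
      buf << byte.chr
    end
    postprocess_str(buf, encoding, right_pad)
  end

  private

  def postprocess_str(buf, encoding, right_pad)
    data = buf.b
    # 1. remove padding (skipped when right_pad is nil)
    unless right_pad.nil?
      len = data.bytesize
      len -= 1 while len > 0 && data.getbyte(len - 1) == right_pad
      data = data.byteslice(0, len)
    end
    # 2. convert byte buffer -> string using the specified encoding
    data.force_encoding(encoding).encode("UTF-8")
  end
end

stream = SketchStream.new("ABC" + " " * 5 + "hello\x00" + "tail")
a = stream.read_str(8, "ASCII", 0x20)
b = stream.read_strz("UTF-8", 0, false, true, true, nil)
c = stream.read_str_eos("UTF-8", nil)
```

The three calls read a space-padded fixed field, a zero-terminated string, and the remainder of the stream, all funnelled through the same `postprocess_str` step.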
Great idea :) Added as #34. Yet another "while we're here" proposal: maybe we should rethink
|
Out of interest, are UTF-16 and UTF-32 supported encodings for null-terminated strings? |
I haven't ever seen a real-life example of null-terminated strings in 2- or 4-byte per char encodings. The same actually goes for "string-limited-with-number-of-characters" (as opposed to number of bytes) and "string-terminated-with-a-character" (as opposed to a byte). I understand that, in theory, one might want to read, for example, "a string terminated with ܀, which is encoded in UTF-8 as |
I've never encountered a real-world example either, so I agree that it's probably not something to worry about. If it's really required then the user can revert to reading an array of integers using It does seem like there may be cases of UTF-16 strings being terminated by
|
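To make the byte-versus-character distinction concrete, here is a small Ruby sketch of why a single zero byte cannot serve as the terminator for UTF-16 data. `cut_at_utf16_null` is a hypothetical helper written for this example; nothing here claims to describe what the runtimes do.

```ruby
text = "AB".encode("UTF-16LE")        # bytes: 41 00 42 00
buf = text.b + "\x00\x00".b           # followed by a two-byte null terminator

# A byte-wise search finds a zero inside the first character.
first_zero = buf.index("\x00".b)      # 1, i.e. the high byte of "A"

# Hypothetical helper: look for a zero code unit on 2-byte boundaries only.
def cut_at_utf16_null(bytes)
  data = bytes.b
  (0...data.bytesize - 1).step(2) do |i|
    return data.byteslice(0, i) if data.getbyte(i) == 0 && data.getbyte(i + 1) == 0
  end
  data
end

decoded = cut_at_utf16_null(buf).force_encoding("UTF-16LE").encode("UTF-8")
```

Cutting at `first_zero` would leave a single byte, which is not even a whole UTF-16 code unit, whereas the aligned search recovers both characters.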
Ok, half a year later I've got to this. My proposal to change is quite a big one, so I'd like to at least outline it:
Implementation-wise, it means:
Actually, I've got a basic refactored implementation up and running (for Java and Ruby, at least), everything else is done ~60-70%. I expect to finish and push this one soon. |
And, yeah, @LogicAndTrick, now |
Well, it's mostly done and committed. Please check, test and comment. Currently we're missing implementations for PHP and Perl, everything else should be fine. |
@GreyCat Should anything be done for PHP? I see that the mentioned methods were already implemented for the PHP runtime. |
I've submitted PHP changes for compiler & runtime a few hours ago - so no more work in that department required (although feel free to make them faster, more optimized, etc). |
Just completed the Perl implementation, which more or less completes this task. Probably some more tests on border cases won't hurt (i.e. @sergeyzelenyuk, please take a look: this could probably be implemented in a much more efficient manner? |
Fixed PositionInSeq, InstanceIoUser, InstanceStdArray, IfStruct
Add quick start guide for Node.js
I can't work out if this is possible currently, but it doesn't seem obvious: Many formats reserve a fixed space for a file name but also have a terminating character (usually \0) to support variable length names. For example:
Right now there are two options available that I can think of:
Option 1. Use str with a fixed length and the consumer has to deal with the terminator manually:
Option 2. Use strz with a terminator and then a second value for the rest of the array:
It'd be nice to support an optional terminator in the str type to automatically truncate the string in the runtime:
Alternatively, a size option in the strz type would work as well, but I think it'd be more suited for the str type because the data is still fixed-width in the end.
You can see a real-world example in the doom_wad.ksy format (boxes = null characters):
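A rough Ruby sketch of the requested behaviour, using two consecutive 8-byte name fields as sample data (the byte layout is an assumption for illustration, not copied from a real WAD file). `read_fixed_str` is a hypothetical helper: it always consumes the full field width, then truncates at the terminator.

```ruby
require "stringio"

# Hypothetical helper: fixed-width read, then cut at the first terminator.
def read_fixed_str(io, size, terminator)
  buf = io.read(size)
  raise EOFError, "short read" if buf.nil? || buf.bytesize < size
  idx = buf.index(terminator.chr.b)
  idx ? buf.byteslice(0, idx) : buf
end

# Two 8-byte, null-padded name fields back to back.
data = "E1M1\x00\x00\x00\x00THINGS\x00\x00".b
io = StringIO.new(data)

first = read_fixed_str(io, 8, 0)
pos_after_first = io.pos              # stream stays aligned to the field width
second = read_fixed_str(io, 8, 0)
```

Because the whole field is consumed before truncation, the second field starts at the right offset without the extra "rest of the array" value that Option 2 needs.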