Fixed-length string with terminating character #13

LogicAndTrick · 2016-07-29T10:26:49Z

I can't work out if this is possible currently, but it doesn't seem obvious: Many formats reserve a fixed space for a file name but also have a terminating character (usually \0) to support variable length names.

For example:

typedef struct {
    char name[256];
} example;

example e;
e.name = "test";
writeToFile(e); // Writes { 'T', 'E', 'S', 'T', '\0', ... [+ 251 more bytes] }

Right now there's two options currently available that I can think of:

Option 1. Use str with a fixed length and the consumer has to deal with the terminator manually:

KSY:
      - id: name
        type: str
        size: 256
        encoding: ASCII

Code:
    var e = Example.FromFile("...");
    e.Name = FixTerminatedString(e.Name);

Option 2. Use strz with a terminator and then a second value for the rest of the array:

      - id: name
        type: strz
        encoding: ASCII
        terminator: 0x00
        consume: false
      - id: junk
        size: 256 - name.length

It'd be nice to support an optional terminator in the str type to automatically truncate the string in the runtime:

KSY:
      - id: name
        type: str
        size: 256
        encoding: ASCII
        terminator: 0x00

Runtime:
    public string ReadStrByteLimit(long length, bool terminated, byte terminator, string encoding)
    {
        var bytes = ReadBytes(length);
        if (terminated) bytes = bytes.TakeWhile(b => b != terminator).ToArray();
        return System.Text.Encoding.GetEncoding(encoding).GetString(bytes);
    }

Alternatively, a size option in the strz type would work as well, but I think it'd be more suited for the str type because the data is still fixed-width in the end.

You can see a real-world example in the doom_wad.ksy format (boxes = null characters):

The text was updated successfully, but these errors were encountered:

GreyCat · 2016-07-29T11:54:22Z

Yeah, that might be a good idea - I like the approach that you've suggested. It also has a very important feature - thus data types can be both read (what we're doing now) and written (what we'll hopefully be doing in the future - i.e. serialization of objects into binary streams) without any ambiguility.

GreyCat · 2016-10-16T12:26:54Z

Returning to this proposal: I'd like to suggest slight modification: do not reuse terminator per se, but add another option like pad-right, which can be used to specify right padding bytes. I've seen two real-life use-cases:

many formats use 0 designate end of string + pad the rest of the field with 0s as well
some formats (ISO9660, majority of bank/card billing formats) use 0x20 (ASCII space) for right-padding of the field

The parsing idea is to read full field at once and then strip as many right-padding bytes (from the right side of the string) as we can. The generation (serialization) idea is to add right padding bytes to the given string's byte representation to make it up for full desired length of the field.

The semantics are slightly different than terminator, so I think that reusing terminator is a bad idea, it's better to add distinct new parameter. What do you think of it?

LogicAndTrick · 2016-10-17T01:31:05Z

I'd still prefer the terminator, myself. There's cases where the rest of the data in the buffer after the terminator is simply gibberish, consider that the string buffer would often be allocated using malloc and strcpy:

char* str = (char*) malloc(16);
strcpy(str, "ABC");

Possible result:

41 42 43 00 C5 92 A3 A6 10 0F 57 C4 25 68 D2 6F 
A  B  C  .  |----- uninitialised memory ------|

GreyCat · 2016-10-17T19:14:02Z

Ok, then we actually need both - they're different, albeit linked concepts. I don't really like the idea of mixing different string reading algorithms (i.e. str and strz, though). May be we could just add size: X limiter argument for strz? That fits current "workaround with true object" well too, i.e.:

- id: my_str
  type: strz
  # terminator: 0 is by default
  encoding: UTF-8
  size: 16

instead of:

seq:
  - id: my_str
    type: terminated_str
    size: 16
types:
  seq:
    - id: value
      type: strz
      # terminator: 0 is by default
      encoding: UTF-8

LogicAndTrick · 2016-10-18T01:45:31Z

Yeah, works for me. But I'm starting to see the two different string readers merge into one. If terminator isn't specified (and had no default) and size is specified, you could replicate the str method using strz. Only having one string reading method would simplify the interface and also reduce duplicated work when adding new features such as the pad-right option.

seq:
  - id: my_str
    type: str      # performs the role of both str and strz
    # at least one of size and terminator must be specified
    size: 16       # optional if terminator is specified (strz behaviour)
    terminator: 0  # doesn't default to 0 if not specified
    # other str/strz options
    encoding: ASCII
    pad-right: 0x20
    ...

While we're talking about strings, what do you think about adding a default encoding to the meta section, like the endianness? Often when a file uses a particular encoding, the encoding will be the same for the entire file.

meta:
  endian: le
  encoding: UTF-8
seq:
  - id: str1
    type: str
    size: 32
    # use the default encoding
  - id: str2
    type: str
    size: 64
    encoding: ASCII   # encoding can still be overridden

GreyCat · 2016-10-18T05:26:47Z

two different string readers merge into one.

I don't think that essential two string methods will ever merge, but it's probably a good idea to unify APIs. I support the idea of having type: strz to be a handy synonym for type: str + terminator: 0. As for the implementation, unfortunately, these two are sufficiently different. Many languages include some sort of distinct native read_strz implementation which we'd want to reuse. OTOH, I agree that we should probably avoid repetition and extract all postporcessing, i.e. something like (as now "postprocessing" becomes more and more complex:

def read_str_eos(encoding, right_pad)
  buf = read_till_eof
  postporcess_str(buf, encoding, right_pad)
end

def read_str(size, encoding, right_pad)
  buf = read_buf(size)
  postprocess_str(buf, encoding, right_pad)
end

def read_strz(encoding, term, include_term, consume_term, eos_error, right_pad)
  buf = read_till_terminator(term, include_term, consume_term, eos_error)
  postprocess_str(buf, encoding, right_pad)
end

def postprocess_str(buf, encoding, right_pad)
  # 1. remove padding
  # 2. convert byte buffer -> string using specified encoding
end

what do you think about adding a default encoding to the meta section

Great idea :) Added as #34.

Yet another "while we're here" proposal: may be we should rethink read_* API names, so bytes vs strs would be more alike. Right now they're:

read_str <-> read_bytes
read_str_eos <-> read_bytes_full
read_strz <-> no read-bytes-until-terminator call

LogicAndTrick · 2016-10-18T05:54:23Z

read_till_terminator(term, include_term, consume_term, eos_error)

Out of interest, are UTF-16 and UTF-32 supported encodings for null-terminated strings?

GreyCat · 2016-10-18T14:16:50Z

I haven't ever seen a real-life example of null-terminated strings in 2- or 4-byte per char encodings. The same actually goes to "string-limited-with-number-of-characters" (as opposed to number of bytes) and "string-terminated-with-a-character" (as opposed to byte). I understand that, in theory, one might want to read, for example, "a string terminated with ܀, which is encoded in UTF-8 as dc 80", but normal machine-readable formats usually do not work this way.

LogicAndTrick · 2016-10-19T00:08:17Z

I've never encountered a real-world example either, so I agree that it's probably not something to worry about. If it's really required then the user can revert to reading an array of integers using repeat-until to terminate and do manual decoding of the string.

It does seem like there may be cases of UTF-16 strings being terminated by 00 00, according to Wikipedia:

UTF-16 uses 2-byte integers and since either byte may be zero, cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 characters, terminated by a 16-bit NUL character.

GreyCat · 2017-02-14T07:54:24Z

Ok, half a year later I've got to this.

My proposal to change is quite a big one, so I'd like to at least outline it:

Remove distinct string types from DataType. That is, there would be no more 3 distinct StrByteLimitType, StrEosType and StrZType. Instead, we'll use the same model as EnumType uses: there would be single StrFromBytesType, which will be based on some BytesType and will carry only encoding.
Make 3 distinct BytesType implementations:
- BytesLimitType
- BytesEosType
- BytesTerminatedType <- the one that was missing, StrZType-like strings will base on this one
Change size-based byte types (i.e. BytesLimitType and BytesEosType) to support terminator, include and right-pad
- NOTE: no consume and eos-error:
  - consume is absolutely pointless given that we know some byte size a priori that we'll consume anyway
  - eos-error should be considered to be always false in this case: there's no real "reading" happens, and if anyone would want to check whether it actually hit the terminator, it's quite easy to do so by a simple "resulting array ends with terminator byte?" check

Implementation-wise, it means:

All runtimes lose read_str_byte_limit, read_str_eos, read_strz
All runtimes get read_bytes_term (which is basically a slightly stripped down version of read_strz without encoding stuff)
All runtimes need to implement something like static methods:
- bytes_strip_right(bytes, pad_right)
- bytes_terminate(bytes, term, include)
All compilers need to implement bytesPadTermExpr, which will do calls to aforementioned bytes_* static methods
All translators need to implement bytesToStr, which converts byte array into a string given an encoding (which would be used extensively to read strings and it will be available as bytearray.to_s(encoding) in expression language as well)

Actually, I've got a basic refactored implementation up and running (for Java and Ruby, at least), everything else is done ~60-70%. I expect to finish and push this one soon.

GreyCat · 2017-02-14T07:55:46Z

And, yeah, @LogicAndTrick, now str and strz is basically the same thing. The only difference is strz means implicit terminator: 0 if terminator is not mentioned.

GreyCat · 2017-02-14T17:42:09Z

Well, it's mostly done and committed. Please check, test and comment. Currently we're missing implementations for PHP and Perl, everything else should be fine.

ghost · 2017-02-15T11:07:57Z

@GreyCat Should be done anything for PHP? I see that the mentioned methods were already implemented for the PHP runtime.

GreyCat · 2017-02-15T11:34:40Z

I've submitted PHP changes for compiler & runtime a few hours ago - so no more work in that department required (although feel free to make them faster, more optimized, etc).

GreyCat · 2017-02-15T20:05:41Z

Just completed Perl implementation, that kind of completes this task. Probably some more tests on border cases won't hurt (i.e. size + terminator without str, or having no string removed as padding, or full string removed as padding, etc), but generally it works.

@sergeyzelenyuk — please take a look, probably this could be implemented in much more efficient manner?

Fixed PositionInSeq, InstanceIoUser, InstanceStdArray, IfStruct

Add quick start guide for Node.js

GreyCat added the enhancement label Jul 29, 2016

GreyCat mentioned this issue Oct 18, 2016

Default encoding for strings #34

Closed

LogicAndTrick mentioned this issue Dec 1, 2016

Add Quake MDL format kaitai-io/kaitai_struct_formats#7

Merged

GreyCat mentioned this issue Jan 20, 2017

Fix-length zero-padded string handling #91

Closed

GreyCat self-assigned this Feb 14, 2017

GreyCat added this to the v0.7 milestone Feb 14, 2017

GreyCat closed this as completed Feb 15, 2017

dreckard mentioned this issue Jun 20, 2017

Strz type support for UTF-16 and UTF-32 #187

Closed

DL4PD pushed a commit to DL4PD/kaitai_struct that referenced this issue Mar 13, 2019

Merge pull request kaitai-io#13 from morpho-os/master

789f0b4

Fixed PositionInSeq, InstanceIoUser, InstanceStdArray, IfStruct

krisutofu pushed a commit to krisutofu/kaitai_struct that referenced this issue Jan 2, 2022

Merge pull request kaitai-io#13 from kaitai-io/js-quick-start-guide

a30cb67

Add quick start guide for Node.js

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixed-length string with terminating character #13

Fixed-length string with terminating character #13

LogicAndTrick commented Jul 29, 2016 •

edited

Loading

GreyCat commented Jul 29, 2016

GreyCat commented Oct 16, 2016

LogicAndTrick commented Oct 17, 2016

GreyCat commented Oct 17, 2016

LogicAndTrick commented Oct 18, 2016

GreyCat commented Oct 18, 2016

LogicAndTrick commented Oct 18, 2016

GreyCat commented Oct 18, 2016

LogicAndTrick commented Oct 19, 2016

GreyCat commented Feb 14, 2017

GreyCat commented Feb 14, 2017

GreyCat commented Feb 14, 2017

ghost commented Feb 15, 2017

GreyCat commented Feb 15, 2017

GreyCat commented Feb 15, 2017

Fixed-length string with terminating character #13

Fixed-length string with terminating character #13

Comments

LogicAndTrick commented Jul 29, 2016 • edited Loading

GreyCat commented Jul 29, 2016

GreyCat commented Oct 16, 2016

LogicAndTrick commented Oct 17, 2016

GreyCat commented Oct 17, 2016

LogicAndTrick commented Oct 18, 2016

GreyCat commented Oct 18, 2016

LogicAndTrick commented Oct 18, 2016

GreyCat commented Oct 18, 2016

LogicAndTrick commented Oct 19, 2016

GreyCat commented Feb 14, 2017

GreyCat commented Feb 14, 2017

GreyCat commented Feb 14, 2017

ghost commented Feb 15, 2017

GreyCat commented Feb 15, 2017

GreyCat commented Feb 15, 2017

LogicAndTrick commented Jul 29, 2016 •

edited

Loading