Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed-length string with terminating character #13

Closed
LogicAndTrick opened this issue Jul 29, 2016 · 15 comments
Closed

Fixed-length string with terminating character #13

LogicAndTrick opened this issue Jul 29, 2016 · 15 comments
Assignees
Milestone

Comments

@LogicAndTrick
Copy link
Collaborator

LogicAndTrick commented Jul 29, 2016

I can't work out if this is possible currently, but it doesn't seem obvious: Many formats reserve a fixed space for a file name but also have a terminating character (usually \0) to support variable length names.

For example:

typedef struct {
    char name[256];
} example;

example e;
e.name = "test";
writeToFile(e); // Writes { 'T', 'E', 'S', 'T', '\0', ... [+ 251 more bytes] }

Right now there's two options currently available that I can think of:

Option 1. Use str with a fixed length and the consumer has to deal with the terminator manually:

KSY:
      - id: name
        type: str
        size: 256
        encoding: ASCII

Code:
    var e = Example.FromFile("...");
    e.Name = FixTerminatedString(e.Name);

Option 2. Use strz with a terminator and then a second value for the rest of the array:

      - id: name
        type: strz
        encoding: ASCII
        terminator: 0x00
        consume: false
      - id: junk
        size: 256 - name.length

It'd be nice to support an optional terminator in the str type to automatically truncate the string in the runtime:

KSY:
      - id: name
        type: str
        size: 256
        encoding: ASCII
        terminator: 0x00

Runtime:
    public string ReadStrByteLimit(long length, bool terminated, byte terminator, string encoding)
    {
        var bytes = ReadBytes(length);
        if (terminated) bytes = bytes.TakeWhile(b => b != terminator).ToArray();
        return System.Text.Encoding.GetEncoding(encoding).GetString(bytes);
    }

Alternatively, a size option in the strz type would work as well, but I think it'd be more suited for the str type because the data is still fixed-width in the end.

You can see a real-world example in the doom_wad.ksy format (boxes = null characters):

image

@GreyCat
Copy link
Member

GreyCat commented Jul 29, 2016

Yeah, that might be a good idea - I like the approach that you've suggested. It also has a very important feature - thus data types can be both read (what we're doing now) and written (what we'll hopefully be doing in the future - i.e. serialization of objects into binary streams) without any ambiguility.

@GreyCat
Copy link
Member

GreyCat commented Oct 16, 2016

Returning to this proposal: I'd like to suggest slight modification: do not reuse terminator per se, but add another option like pad-right, which can be used to specify right padding bytes. I've seen two real-life use-cases:

  • many formats use 0 designate end of string + pad the rest of the field with 0s as well
  • some formats (ISO9660, majority of bank/card billing formats) use 0x20 (ASCII space) for right-padding of the field

The parsing idea is to read full field at once and then strip as many right-padding bytes (from the right side of the string) as we can. The generation (serialization) idea is to add right padding bytes to the given string's byte representation to make it up for full desired length of the field.

The semantics are slightly different than terminator, so I think that reusing terminator is a bad idea, it's better to add distinct new parameter. What do you think of it?

@LogicAndTrick
Copy link
Collaborator Author

I'd still prefer the terminator, myself. There's cases where the rest of the data in the buffer after the terminator is simply gibberish, consider that the string buffer would often be allocated using malloc and strcpy:

char* str = (char*) malloc(16);
strcpy(str, "ABC");

Possible result:

41 42 43 00 C5 92 A3 A6 10 0F 57 C4 25 68 D2 6F 
A  B  C  .  |----- uninitialised memory ------|

@GreyCat
Copy link
Member

GreyCat commented Oct 17, 2016

Ok, then we actually need both - they're different, albeit linked concepts. I don't really like the idea of mixing different string reading algorithms (i.e. str and strz, though). May be we could just add size: X limiter argument for strz? That fits current "workaround with true object" well too, i.e.:

- id: my_str
  type: strz
  # terminator: 0 is by default
  encoding: UTF-8
  size: 16

instead of:

seq:
  - id: my_str
    type: terminated_str
    size: 16
types:
  seq:
    - id: value
      type: strz
      # terminator: 0 is by default
      encoding: UTF-8

@LogicAndTrick
Copy link
Collaborator Author

Yeah, works for me. But I'm starting to see the two different string readers merge into one. If terminator isn't specified (and had no default) and size is specified, you could replicate the str method using strz. Only having one string reading method would simplify the interface and also reduce duplicated work when adding new features such as the pad-right option.

seq:
  - id: my_str
    type: str      # performs the role of both str and strz
    # at least one of size and terminator must be specified
    size: 16       # optional if terminator is specified (strz behaviour)
    terminator: 0  # doesn't default to 0 if not specified
    # other str/strz options
    encoding: ASCII
    pad-right: 0x20
    ...

While we're talking about strings, what do you think about adding a default encoding to the meta section, like the endianness? Often when a file uses a particular encoding, the encoding will be the same for the entire file.

meta:
  endian: le
  encoding: UTF-8
seq:
  - id: str1
    type: str
    size: 32
    # use the default encoding
  - id: str2
    type: str
    size: 64
    encoding: ASCII   # encoding can still be overridden

@GreyCat
Copy link
Member

GreyCat commented Oct 18, 2016

two different string readers merge into one.

I don't think that essential two string methods will ever merge, but it's probably a good idea to unify APIs. I support the idea of having type: strz to be a handy synonym for type: str + terminator: 0. As for the implementation, unfortunately, these two are sufficiently different. Many languages include some sort of distinct native read_strz implementation which we'd want to reuse. OTOH, I agree that we should probably avoid repetition and extract all postporcessing, i.e. something like (as now "postprocessing" becomes more and more complex:

def read_str_eos(encoding, right_pad)
  buf = read_till_eof
  postporcess_str(buf, encoding, right_pad)
end

def read_str(size, encoding, right_pad)
  buf = read_buf(size)
  postprocess_str(buf, encoding, right_pad)
end

def read_strz(encoding, term, include_term, consume_term, eos_error, right_pad)
  buf = read_till_terminator(term, include_term, consume_term, eos_error)
  postprocess_str(buf, encoding, right_pad)
end

def postprocess_str(buf, encoding, right_pad)
  # 1. remove padding
  # 2. convert byte buffer -> string using specified encoding
end

what do you think about adding a default encoding to the meta section

Great idea :) Added as #34.

Yet another "while we're here" proposal: may be we should rethink read_* API names, so bytes vs strs would be more alike. Right now they're:

  • read_str <-> read_bytes
  • read_str_eos <-> read_bytes_full
  • read_strz <-> no read-bytes-until-terminator call

@LogicAndTrick
Copy link
Collaborator Author

read_till_terminator(term, include_term, consume_term, eos_error)

Out of interest, are UTF-16 and UTF-32 supported encodings for null-terminated strings?

@GreyCat
Copy link
Member

GreyCat commented Oct 18, 2016

I haven't ever seen a real-life example of null-terminated strings in 2- or 4-byte per char encodings. The same actually goes to "string-limited-with-number-of-characters" (as opposed to number of bytes) and "string-terminated-with-a-character" (as opposed to byte). I understand that, in theory, one might want to read, for example, "a string terminated with ܀, which is encoded in UTF-8 as dc 80", but normal machine-readable formats usually do not work this way.

@LogicAndTrick
Copy link
Collaborator Author

I've never encountered a real-world example either, so I agree that it's probably not something to worry about. If it's really required then the user can revert to reading an array of integers using repeat-until to terminate and do manual decoding of the string.

It does seem like there may be cases of UTF-16 strings being terminated by 00 00, according to Wikipedia:

UTF-16 uses 2-byte integers and since either byte may be zero, cannot be stored in a null-terminated byte string. However, some languages implement a string of 16-bit UTF-16 characters, terminated by a 16-bit NUL character.

@GreyCat
Copy link
Member

GreyCat commented Feb 14, 2017

Ok, half a year later I've got to this.

My proposal to change is quite a big one, so I'd like to at least outline it:

  • Remove distinct string types from DataType. That is, there would be no more 3 distinct StrByteLimitType, StrEosType and StrZType. Instead, we'll use the same model as EnumType uses: there would be single StrFromBytesType, which will be based on some BytesType and will carry only encoding.
  • Make 3 distinct BytesType implementations:
    • BytesLimitType
    • BytesEosType
    • BytesTerminatedType <- the one that was missing, StrZType-like strings will base on this one
  • Change size-based byte types (i.e. BytesLimitType and BytesEosType) to support terminator, include and right-pad
    • NOTE: no consume and eos-error:
      • consume is absolutely pointless given that we know some byte size a priori that we'll consume anyway
      • eos-error should be considered to be always false in this case: there's no real "reading" happens, and if anyone would want to check whether it actually hit the terminator, it's quite easy to do so by a simple "resulting array ends with terminator byte?" check

Implementation-wise, it means:

  • All runtimes lose read_str_byte_limit, read_str_eos, read_strz
  • All runtimes get read_bytes_term (which is basically a slightly stripped down version of read_strz without encoding stuff)
  • All runtimes need to implement something like static methods:
    • bytes_strip_right(bytes, pad_right)
    • bytes_terminate(bytes, term, include)
  • All compilers need to implement bytesPadTermExpr, which will do calls to aforementioned bytes_* static methods
  • All translators need to implement bytesToStr, which converts byte array into a string given an encoding (which would be used extensively to read strings and it will be available as bytearray.to_s(encoding) in expression language as well)

Actually, I've got a basic refactored implementation up and running (for Java and Ruby, at least), everything else is done ~60-70%. I expect to finish and push this one soon.

@GreyCat
Copy link
Member

GreyCat commented Feb 14, 2017

And, yeah, @LogicAndTrick, now str and strz is basically the same thing. The only difference is strz means implicit terminator: 0 if terminator is not mentioned.

@GreyCat GreyCat self-assigned this Feb 14, 2017
@GreyCat GreyCat added this to the v0.7 milestone Feb 14, 2017
@GreyCat
Copy link
Member

GreyCat commented Feb 14, 2017

Well, it's mostly done and committed. Please check, test and comment. Currently we're missing implementations for PHP and Perl, everything else should be fine.

@ghost
Copy link

ghost commented Feb 15, 2017

@GreyCat Should be done anything for PHP? I see that the mentioned methods were already implemented for the PHP runtime.

@GreyCat
Copy link
Member

GreyCat commented Feb 15, 2017

I've submitted PHP changes for compiler & runtime a few hours ago - so no more work in that department required (although feel free to make them faster, more optimized, etc).

@GreyCat
Copy link
Member

GreyCat commented Feb 15, 2017

Just completed Perl implementation, that kind of completes this task. Probably some more tests on border cases won't hurt (i.e. size + terminator without str, or having no string removed as padding, or full string removed as padding, etc), but generally it works.

@sergeyzelenyuk — please take a look, probably this could be implemented in much more efficient manner?

@GreyCat GreyCat closed this as completed Feb 15, 2017
DL4PD pushed a commit to DL4PD/kaitai_struct that referenced this issue Mar 13, 2019
Fixed PositionInSeq, InstanceIoUser, InstanceStdArray, IfStruct
krisutofu pushed a commit to krisutofu/kaitai_struct that referenced this issue Jan 2, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants