Discussions on the upcoming MessagePack spec that adds the string type to the protocol. #128

Closed
kiyoto opened this Issue Feb 24, 2013 · 220 comments


kiyoto commented Feb 24, 2013

This issue continues the discussion that started in this issue, which has grown interminably long.

Here is to a fruitful, invigorating, productive thread! Hooray chums!

Contributor

kazuho commented Feb 25, 2013

@cabo

Thank you for responding and for explaining the possibilities, the things that might or might not happen at the IETF. It helps existing users a lot to understand what would happen in terms of compatibility once the specification is brought to the IETF.

From the tone of your writing, I can assume that you are trying to be as precise as possible rather than trying to cheat (I do not mean to offend you by saying this; I mean that since some people do such things, it makes me feel that you are not one of them, and it makes your words more trustworthy to me). I really appreciate it.

@frsyuki

It seems that even if we succeed in agreeing on how to introduce a string type to MessagePack, there would be continuing pressure to add more and more types.

The problem with the current MessagePack is that it does not define a solid way to extend the protocol without sacrificing backwards compatibility. Unless we resolve the issue now, it is likely that we will have another protocol update that breaks compatibility (i.e. old decoders refusing to work since they cannot decode a newly introduced type).

So I hereby propose to slightly modify @frsyuki's proposal https://gist.github.com/frsyuki/5022569

I would like to request that, instead of adding a "binary" type, we add a blob with a type tag (hereafter referred to as the "Extended" type) that can be used to indicate the type of the payload.

What I basically mean is that once we succeed in defining a string type, we should start using a type-length-value design (http://en.wikipedia.org/wiki/Type-length-value) for other types to be added. The merit of the type-length-value approach is that applications can preserve and copy unknown types of data. So even if we introduce new types of data in the future, we can guarantee that existing applications will continue to work without modification, handling data of unknown types as a pair of type-id and opaque octets.

Changes to the format

The list below shows the mapping of the Extended types (please check the differences from https://gist.github.com/frsyuki/5022569).

0xc4-0xc9,0xd4,0xd5 FixExtended (0bytes - 7bytes extended type)
0xd6 Extended 8 (extended type)  // new
0xd7 Extended 16 (extended type)  // new
0xd8 Extended 32 (extended type)  // new

0xd9 string 8 (String type)  // new
0xda string 16 (String type)  // changed from raw 16
0xdb string 32 (String type)  // changed from raw 32

The definition adds FixExtended types for any type of data up to 7 bytes in length. Data of 8 bytes or longer would go into one of the Extended 8, 16, or 32 types, depending on its length.

The payload of each Extended type would start with a single byte (hereafter referred to as the ExtendedType) that designates the type of the data, and the succeeding bytes would be the actual payload.

ExtendedType is defined as follows.

0x00      - binary
0x01-0xef - reserved
0xf0-0xff - private extension

For now, the only defined Extended type would be "binary". The uppermost 16 slots are reserved for private extensions so that people can play with new ideas. The rest are reserved for future possibilities.

For example, if we ever wanted to add a time_t type to MessagePack (using ExtendedType=0x31), the encoded data would look like:

0xc8 0x31 0x51 0x2a 0xd5 0xb0 ; is Feb 25 03:08:32 2013 (0x512ad5b0)

Or if we ever wanted to introduce a string type using the Shift_JIS encoding (which is a legacy but still very common encoding in Japan, with some incompatibilities with Unicode in terms of actual usage) using ExtendedType 0x32, then it would look like

0xd6 0x0a 0x32 0x82 0xb1 0x82 0xf1 0x82 0xc9 0x82 0xbf 0x82 0xcd ; 3 byte header, next 10 bytes say "Hello" in Japanese

As I mentioned before, introduction of these types (if they ever get introduced) would not break implementations that do not know or care about the introduced types; they can just handle the data as a set of unknown type-id and octets.

As a footnote, an FAQ against this proposal might be why I defined the ExtendedType to be a single byte, instead of defining it as a variable-length value. The reason is to keep the proposal as simple as possible (so that it can be accepted by as many people as possible). If we ever come close to running out of reserved types, we could start adding multibyte ExtendedTypes, and even if we did so, existing codecs would not break, since the succeeding bytes of the multibyte ExtendedTypes would be considered part of the opaque binary octets that will be preserved.
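
For illustration, a minimal sketch of an encoder for the proposed layout, assuming 0xc4-0xc9 cover 0-5 byte payloads, 0xd4/0xd5 cover 6-7 bytes, and the length fields of Extended 8/16/32 count the data only, excluding the ExtendedType byte (as in the Shift_JIS example above); none of this is normative:

    def encode_extended(type_id: int, data: bytes) -> bytes:
        n = len(data)
        if n <= 5:
            return bytes([0xC4 + n, type_id]) + data        # FixExtended, 0-5 byte data
        if n <= 7:
            return bytes([0xD4 + (n - 6), type_id]) + data  # FixExtended, 6-7 byte data
        if n <= 0xFF:
            return bytes([0xD6, n, type_id]) + data         # Extended 8
        if n <= 0xFFFF:
            return bytes([0xD7]) + n.to_bytes(2, "big") + bytes([type_id]) + data  # Extended 16
        return bytes([0xD8]) + n.to_bytes(4, "big") + bytes([type_id]) + data      # Extended 32

Under these assumptions, encode_extended(0x31, (0x512AD5B0).to_bytes(4, "big")) reproduces the time_t example above.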

cabo commented Feb 25, 2013

@kazuho Thanks for the kind words.

I completely agree that some form of type-tagging is needed for long-term extensibility.
Please see the appendix in the draft I mentioned yesterday (http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre2.txt) for the approach I'm currently favoring.

Basically, I believe the text string/raw string dichotomy is too basic to leave it to a type-tagging scheme. Tagged types can then be built from what we have, including text strings and raw strings, but also ints etc. I also arrived at single-byte type tags, because I believe these should be introduced very sparingly.

Owner

frsyuki commented Feb 25, 2013

@kazuho It seems a good idea.

I needed to think about user-defined types as well, as I mentioned in the first comment.

cabo commented Feb 25, 2013

Oh, and one experience we made in the IETF is that the code point range that you call "private extensions" should best be called "experimental values", so it is clear these aren't up for grabs. Otherwise Company X will come and claim 0xf0 for their purposes, do production deployments with that, and then another company will claim 0xf1, and after a short while all of the code points are gone, circumventing the tight control that should be exercised around handing out new code points.

cabo commented Feb 25, 2013

Another comment: never use up all code points. E.g., a future extension like the Half floating point I suggested would no longer be possible once all code points are used up. Nobody knows what we'll want to do in five years from now, so we always should have some free space for new requirements.

Contributor

kazuho commented Feb 25, 2013

@cabo

Thank you for your positive comment.

I believe the text string/raw string dichotomy is too basic to leave it to a type-tagging scheme.

Yes I agree with that. My proposal does not use type-tagging for strings. It's a variant of @frsyuki's proposal. Strings will be stored in the "raw" area of the current spec.

Another comment: never use up all code points. E.g., a future extension like the Half floating point I suggested would no longer be possible once all code points are used up. Nobody knows what we'll want to do in five years from now, so we always should have some free space for new requirements.

I disagree. If there is a requirement to store a large number of very short values as scalars (for example half precision floating point numbers, as you mentioned), MessagePack would not be a good format no matter how you extended it, since you would need 3 bytes (1 byte of overhead) for every half precision floating point number. The best way in general to handle such short values is to store them in typed arrays (e.g. HalfFloatArray). And for such purposes my proposal would work just fine.

And for data types that require more space than half-floats and the like, my proposal is space-efficient, since by using the remaining bytes it adds the FixExtended types to minimize the encoded size, while leaving a much greater possibility to add new types than using the very few remaining tags.
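
To illustrate the typed-array point with a concrete sketch (the ExtendedType id 0x7f and the framing are assumptions for illustration, not part of the proposal):

    import struct

    HALF_FLOAT_ARRAY = 0x7F  # hypothetical ExtendedType id

    def pack_half_float_array(values):
        payload = b"".join(struct.pack(">e", v) for v in values)  # ">e" = big-endian binary16
        if len(payload) > 0xFF:
            raise ValueError("use Extended 16/32 for longer arrays")
        # 3 header bytes (marker, length, type id) amortized over the whole array,
        # instead of a per-element overhead
        return bytes([0xD6, len(payload), HALF_FLOAT_ARRAY]) + payload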

Contributor

kazuho commented Feb 25, 2013

@cabo

Oh, and one experience we made in the IETF is that the code point range that you call "private extensions" should best be called "experimental values", so it is clear these aren't up for grabs. Otherwise Company X will come and claim 0xf0 for their purposes, do production deployments with that, and then another company will claim 0xf1, and after a short while all of the code points are gone, circumventing the tight control that should be exercised around handing out new code points.

Thank you for the comment. I agree that it should be worded as such if my proposal ever gets updated or gets merged to somewhere else.

Contributor

kazuho commented Feb 25, 2013

@frsyuki

Thank you for your comment. I am very glad to hear that.

Owner

frsyuki commented Feb 25, 2013

I created a second proposal (incomplete, though):
https://gist.github.com/frsyuki/5028082

I think everyone understands this, but again: I don't think my proposals are mature (meaning that I don't think these proposals are ready to become an established standard for now). We likely have different ideas. For now I don't have any intention to give approval to my drafts nor to drafts at the IETF. We cannot assume that the articles I already posted are likely the next msgpack for now.

cabo commented Feb 25, 2013

An example for how I would do tagging:

Without tagging, a UNIX time would use the uint32 (uint64 from 2106) type:

    ce 51 2b 0a 01

If we reserve a tag nn for tagging date/times, we could define the combination of nn and uint32 to mean UNIX time.
This now looks like this:

    c1 nn ce 51 2b 0a 01

Receivers can always decode this as a uint32 -- the tagging adds the semantic information.

I think the combinability of tagging with the existing type system is better than limiting tagging to raw strings only -- it will generally provide for a more meaningful decoding.
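
A minimal sketch of that degradation, assuming 0xc1 as the tag marker and handling only the uint32 case:

    import struct

    def decode(buf: bytes):
        if buf[0] == 0xC1:                       # tagged value
            tag, value = buf[1], decode(buf[2:])
            return (tag, value)                  # semantics ride on top of the plain value
        if buf[0] == 0xCE:                       # existing uint32 marker
            return struct.unpack(">I", buf[1:5])[0]
        raise ValueError("marker not covered by this sketch")

Even a receiver that does not understand the tag still recovers the inner uint32.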

Owner

frsyuki commented Feb 25, 2013

@cabo I see. It's also an interesting idea.

Contributor

kazuho commented Feb 25, 2013

@cabo

I agree that tagging is a good idea. And time_t is indeed a good example in which it works very well. But I think the approach you proposed has a drawback in terms of footprint (and that is actually the reason why I did not take the route in my proposal).

My assumption is that most of the types that would ever be added to MessagePack would not be types that could be represented using a pair of a tag and a single primitive that exists in the specification (except for binaries).

On the other hand, if we used the tagging approach as you mentioned, each tagged type would take an extra two bytes of space; an overhead that is better avoided (since the small footprint is one of the reasons people choose MessagePack).

For example, consider adding a "Point" class that would store two 16-bit floating point numbers.

With my proposal, it would be encoded as

c8 NN AA AA BB BB ; NN for the type tag, AAAA and BBBB are the values

With @cabo's proposal, it needs to be encoded as

c1 NN d5 04 AA AA BB BB ; NN for the type, AAAA and BBBB for the values

As you can see, there is a 33% overhead in memory footprint in this case. I think this overhead is a bigger problem than the merits that can be achieved by introducing tagging for general types, and thus I decided to take the way I proposed.

Contributor

kazuho commented Feb 25, 2013

@cabo

My assumption is that most of the types that would ever be added to MessagePack would not be types that could be represented using a pair of a tag and a single primitive that exists in the specification (except for binaries).

Sorry I did not show an example that explains my assumption.

As an example, consider adding a date type to MessagePack. Some languages use time_t, but others do not. ECMAScript uses a different format which can only be represented as an IEEE 754 double precision floating point number. I do not think there is a common answer to what the internal representation of a date type should be. And there is also the problem of where the epoch should be if we choose a number to represent date types, not to mention how we should handle timezones.

So if we were to add support for date types, I think the best way would be to use some kind of structure (of multiple values) as the internal representation instead of trying to use a set of type tag and some primitive number.

And if many of the implementations agree on what the internal representation should be, why wouldn't we just use that representation? If we were to transmit time_t values, I think using int64_t (or uint32_t) would be the right answer.

Maybe I have confused you by using time_t in my example (I just thought that using such a well-known and primitive type would make it easier for others to understand the format), but this is how I think. Sorry for that.

Owner

frsyuki commented Feb 25, 2013

@cabo @kazuho

I think the potential advantage of @cabo's idea is that applications can know the partial meaning of the deserialized object (stored with the tag). But I think applications can do nothing with such objects except holding them in memory and/or writing them to another place as-is.

I mean that if applications know the meaning of the objects (and applications want to deal with the objects), they know how to decode the object from a byte array. The only issue is how to implement the decoder. Then MessagePack libraries can provide utility APIs to implement them.

On the other hand, @kazuho's idea has an advantage in terms of the serialized size.

cabo commented Feb 25, 2013

One of the design principles that has made messagepack so successful is the separation of structure and semantics. You can always decode a msgpack instance without referring to a schema or IDL file.

I think there is a danger in the tagging discussion that we are leaving that path. Having the deserializer rely on information in the (extensible) tags for deriving the internal structure of the tagged information comes dangerously close to that. At the end of this path, there are ASN.1 PER and XML EXI, and the beauty of msgpack is that it provides exactly the other end of the scale.

I used date/time as an example because so far I have heard only two proposals for data types that should be added: date/time and UUID. I think UUIDs are best represented as a 128-bit binary object (in msgpack, this would be a raw string), so there is little need for adding a type. Date/time, however, may benefit from being explicitly tagged.

Please contrast this to the other discussion around adding an IEEE 754 Half: this would be for an existing type with well-known semantics (number), but a more compact representation. If (big if) we add that, the only reasonable place is right beside 0xca and 0xcb. This is exactly not what I had in mind when writing up the tagging proposal.

Re footprint: not having a tag in a binary string saves a byte. I believe that the untagged use will be the most common one. (But optimization comes after getting the structure right.)

Contributor

kazuho commented Feb 25, 2013

@cabo

Re footprint: not having a tag in a binary string saves a byte. I believe that the untagged use will be the most common one. (But optimization comes after getting the structure right.)

I agree with your assumption that untagged binaries would be the most commonly used. But that does not mean that adding an extra byte for such usage is inefficient in terms of footprint.

In general, I assume that untagged binary data would be fairly large. AFAIK the request for such a type comes from people wanting to store images. For storing such large objects, a difference of one byte is not an issue.

On the other hand, tagged types would be much smaller in size. The examples we have discussed so far (date/time, half-float, Point) are all small. Those small, tagged binary objects are the ones we should try to encode as compactly as possible, since they are what would bloat the size of serialized data in terms of ratio (which is the metric we should look at when we talk about footprint).

And to repeat, my proposal is good in the fact that it ensures backwards compatibility (i.e. adding new types would not cause existing decoders to fail).

cabo commented Feb 25, 2013

Most uses I have for binary strings are things like cryptographic hashes, MACs, IP and MAC addresses (yes, the other meaning of MAC) etc. These are quite small (but not as small as my average strings); a byte would still make a difference.

I don't think there is a way to add representation alternatives like Half in a forward-compatible way. So that would best be done now or never. For future backward and forward compatibility, I think always having tags on binary only, and having tags for a wider set of data types, are about equivalent.

So the remaining difference in footprint is
-- always spending a byte for binary, vs.
-- spending two bytes for a tag, only where a tag is desired.

I sure can live with both ways, but would prefer the smaller footprint of untagged binary and the ability to tag not only binary strings but also numbers and text strings.

Midar commented Feb 25, 2013

I have to say that I like @frsyuki's first proposal / BinaryPack1pre2 best. @kazuho's proposal / @frsyuki's second proposal would hurt the most common case, as it has no FixString. @cabo's proposal of adding tags seems like yet another case where layers would not be separated well: whether something is a time_t or a uint32_t/uint64_t is not really important, as both are the same and decoded the same way. This is not about storing it, it's about how to use it - which is something that belongs in a schema IMHO. And that schema should be external and not embedded, as embedding only wastes space.

To bring in a completely different side of the discussion: I disagree with having an "extension type" at all. If a new type gets added, old parsers won't parse it. But that's ok! Why not just have versions? You could say "Generate for version 1.0" if you want to be backwards compatible, and you could say "Generate for 1.1" once enough parsers have been updated. This is how other formats work as well. Saying that we always need to be compatible with parsers that implement an old version of the protocol means that we will be seriously limited. It means we could never add half-float. It means we could not add (u)int128, etc. We could only add new extensions which are encoded in an inefficient way. That would mean that only the types from the first protocol version are first-class citizens and all other types waste space. We would end up with an encoding that is no better than BSON when it comes to space efficiency or clean design.

Therefore my plea is to have versioned protocols and break compatibility on purpose: old parsers don't need to be able to read data from new protocols, but new parsers need to be able to read data from old protocols.

If the MessagePack people are going for the extensions, I hope at least @cabo will reconsider the tags so that we have at least one format that does the right thing.

Contributor

kazuho commented Feb 25, 2013

@cabo

Thank you for providing real use-cases. Let's use SHA-1 (or HMAC-SHA1; 20 bytes), IPv4 and IPv6 addresses (4 bytes / 16 bytes), and MAC addresses (6 bytes) as examples to evaluate the efficiency of the approaches.

a) use the remaining type tags

  • SHA-1 - 21 bytes
  • IPv4 - 5 bytes
  • IPv6 - 17 bytes
  • MAC address - 7 bytes

Introducing these types would cause backwards incompatibility. We would sooner or later use up all the slots, and it would become impossible to add further extensions.

b) using my proposal (tag on binary only)

  • SHA-1 - 23 bytes
  • IPv4 - 6 bytes
  • IPv6 - 19 bytes
  • MAC addresses - 8 bytes

Introducing these types would not cause backwards incompatibility. There would be no limit for adding new types.

c) using @cabo's proposal (make all types annotatable)

  • SHA-1 - 24 bytes
  • IPv4 - 7 bytes
  • IPv6 - 20 bytes
  • MAC addresses - 10 bytes

Introducing these types would not cause backwards incompatibility. There would be no limit for adding new types.

If we compare the approaches by the sum of the bytes required, it would be:

a) 50 bytes
b) 56 bytes
c) 61 bytes

As you can see, if we take approach b, with 10%+ overhead we can have infinite extension slots while guaranteeing backwards compatibility, compared to a. I think these examples do illustrate that this is the way we should take.

I don't think there is a way to add representation alternatives like Half in a forward compatible way.

As I said before, I think using MessagePack as a format for storing half-floats is a bad idea. Even if you use the few remaining tags there would still be 50% overhead. If such usage does matter, I think using something like 4/5 encodings (2 bits at minimum for a type tag) would be a better approach. Besides, IMO the general use case for half-floats is to store many of them at once, and for that case we can introduce things like HalfFloatArray if we take either of the approaches a or b, and that would save space.

So the remaining difference in footprint is
-- always spending a byte for binary, vs.
-- spending two bytes for a tag, only where a tag is desired.

No, it is as follows, and the numbers above show the sizes under the use-cases you are interested in.

-- spend an extra byte for binaries below 8 bytes, and save one byte for tagged types below 8 bytes, vs.
-- spend two extra bytes for tagged types

catwell commented Feb 25, 2013

@kazuho

0xc4-0xc9,0xd4,0xd5

Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.

Something else: while we're discussing changes to MessagePack we could add typed collections to the discussion.

For instance, starting with @kazuho's point type proposal:

c8 NN AA AA BB BB

This has the following structure:

[tagged type header + length] [tag] [data]

If you want to store for instance a polygon as an array of points you will have to write:

[array header + length] [tagged type header + length] [tag] [data1] [tagged type header + length] [tag] [data2] ...

I think this use case (similarly typed collections) is frequent and the current MessagePack encoding for it is rather wasteful. It would be interesting if we could write something like this instead:

[typed array header + length] [tagged type header + length] [tag] [data1] [data2] ...

In that case for large collections it results in a 33% space gain.
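
As a back-of-the-envelope check of the 33% figure (the 3-byte array header is an assumption for illustration):

    # per Point: 6 bytes tagged (header+length, tag, 4 data bytes) vs 4 raw data bytes
    n = 1000
    plain = 3 + n * 6      # array header (assumed 3 bytes) + one tagged header per element
    typed = 3 + 3 + n * 4  # one shared tagged-type header, then raw element data
    print(1 - typed / plain)  # -> ~0.33 for large n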

Do you think this is something that could be added to MessagePack?

cabo commented Feb 25, 2013

@kazuho I wouldn't normally tag the binaries, so the numbers would be:

  • SHA-1 - 22 bytes
  • IPv4 - 6 bytes
  • IPv6 - 18 bytes
  • MAC addresses - 8 bytes

@catwell Indeed, homogeneous arrays provide an opportunity for optimization.
Is that optimization needed?

All we should be trying to do here is design the structure right, and then make sure we don't waste bytes unnecessarily. But outright designing for optimization leads towards EXI and PER (or HDF5, or ...), not towards a better msgpack.

Owner

frsyuki commented Feb 25, 2013

I disagree about having an "extension type" as a whole

as @Midar says. And I will add another example of why allowing future extensions has disadvantages and could cause hesitation to adopt msgpack.

Suppose an application uses the Integer type of msgpack to represent times, and a receiver reads a time object as an Integer (or Raw), and it works. This is the assumption. Then, if msgpack added a time type (regardless of its format; it could be via an extension tag or a new header byte assignment), the sender would send the same object using that newly added time type.

A) If an old deserializer maps the newly added time type into Integer (or Raw) (here I assume the old deserializer can still read the newly added value thanks to a predefined trick in the format, such as extension tags), the receiver still works, because the object the receiver receives is the same as the expected one. This is ok.

B) But if the old deserializer restores it into a byte array or a tuple of type tag and binary (or integer), it doesn't work.

Adding a data type could break working applications horribly. The reason adding a string (or binary) type still seems ok is that we don't have to consider case B for strings (or binaries).

Midar commented Feb 25, 2013

@frsyuki Agreed.

Another point is that the only type extensions that would make sense - because they are really a new type and not schema embedded into data - are the ones which need extra support from the parser anyway and could never be parsed by an old parser and would not return something meaningful if treated as binary. An example would be (u)int128_t/(u)int256_t numbers (which are used by SSE/AVX) or halffloats etc. All these need special parsing and no extension type whatsoever would help an old parser. Because if an old parser would support it, it would sometimes be a number (because it was small enough to fit into one of the existing types) and sometimes be binary. What good would that be?!

Owner

frsyuki commented Feb 25, 2013

However, we don't have the problem I mentioned if applications don't use added types implicitly (because this part doesn't happen: "the sender will send the same object using that newly added time type").

I mean that having extension tags (@kazuho's idea) and not adding types are compatible if the extension tags are used only when applications clearly specify to use the type (meaning that new serializers don't use the Extended type automatically).

Contributor

kazuho commented Feb 25, 2013

@catwell

0xc4-0xc9,0xd4,0xd5

Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.

I totally agree that it is confusing. But there are no contiguous slots left any more.

Do you think this is something that could be added to MessagePack?

Yes!!! That's the entire reason I am proposing the extension to introduce tags. The reason it is taking so long to add a string type is that adding types breaks existing applications.

If we introduce the ability to add extended types, it would be much easier, since adding such types would not cause other applications (middleware) to break. For example, a middleware that transfers MessagePack objects by looking at the "to" field would continue to work if you add new types.

So, there would be much less rallying against adding new types; the parties interested in having such types (for example HalfFloatArray) can just register the type id for the types they need, and share the implementation instead of reinventing the types they all use on top of the binary type.

It is like how TCP/IP, XML, or ASN.1 works. The IP protocol has a "protocol number", and lets others invent new protocols without destroying the entire IP protocol. A recent example in this area is SCTP, which is trying to become a better alternative to TCP.

By defining such an extension point, we would help people share more code for encoding / decoding data, which actually is what MessagePack is all about.

Contributor

kazuho commented Feb 25, 2013

@frsyuki

However, we don't have the problem I mentioned if applications don't use added types implicitly (because this part doesn't happen: "the sender will send the same object using that newly added time type").

My idea behind the proposal is that the libraries should never store data using the extended type unless specified explicitly by the programmer.

As I explained using the example of TCP/IP, it is a "separation of layers" problem. Extended types should always be used explicitly at the MessagePack codec level. Though, of course, people are allowed to use a wrapper (or combine wrappers) that handles the conversions to encode / decode the extended types, or some MessagePack implementations may allow developers to explicitly plug in the use of such extensions.

Owner

frsyuki commented Feb 25, 2013

@catwell @kazuho

0xc4-0xc9,0xd4,0xd5
Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.
I totally agree that it is confusing. But there are no contiguous slots left any more.

I agree....
Another possible idea is to assign 0xd4 to "8bytes Extended" and 0xd5 to "16 bytes Extended":

    0xc4-0xc9 FixExtended (0bytes - 5bytes Extended type)  // new
    0xd4 8bytes extended (8bytes Extended type)  // new
    0xd5 16bytes extended (16bytes Extended type)  // new
    0xd6 extended 8 (Extended type)  // new
    0xd7 extended 16 (Extended type)  // new
    0xd8 extended 32 (Extended type)  // new

The assumption here is that 8-byte and 16-byte binaries would be used more often than 6- or 7-byte ones (this could be wrong).
This format is still confusing, though... and slightly more complex for implementing serializers.

Contributor

kazuho commented Feb 25, 2013

@cabo

@kazuho I wouldn't normally tag the binaries, so the numbers would be:

  • SHA-1 - 23 bytes
  • IPv4 - 6 bytes
  • IPv6 - 19 bytes
  • MAC addresses - 8 bytes

Sorry, I misunderstood that you would not tag the binaries. Would you mind explaining which spec the numbers are calculated from?

If it is https://gist.github.com/frsyuki/5022569, then I think the numbers would be:

  • SHA-1 - 22 bytes
  • IPv4 - 6 bytes
  • IPv6 - 18 bytes
  • MAC addresses - 8 bytes

and that would be 54 bytes in total. My proposal was 56 bytes, so the additional overhead is 3.7%... I think that is something we can afford in exchange for opening the possibility to extend types without sacrificing interoperability.

cabo commented Feb 25, 2013

@kazuho: The numbers you calculated from 5022569 are correct, actually I thought I wrote exactly those above and not the ones you quote? github must be strange today.

The other numbers are almost as good because you are spending more of the reserved code points.
I continue to believe we should keep some breathing space there (and spend one for 16-bit IEEE 754, but that is a different discussion).

Owner

frsyuki commented Feb 25, 2013

I agree with @kazuho's idea, if all serializers use the extended types only when users clearly specify to use them.

Owner

frsyuki commented Feb 25, 2013

Another idea is that applications are advised to use the tags in this order: 0x7f, 0x7e, 0x7d, 0x7c, ..., 1.

Then applications will not use the most significant bit. Thus we might be able to use that bit to signal that the type tag is not a 1-byte integer but a variable-length integer (coded in variable byte coding).
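
A sketch of what such a tag reader could look like (the multibyte form is purely hypothetical, using MSB-first continuation bits):

    def read_type_id(buf: bytes):
        if buf[0] < 0x80:
            return buf[0], 1              # single-byte tag, as in the current proposal
        # hypothetical future form: MSB set means "another byte follows"
        value, i = 0, 0
        while buf[i] & 0x80:
            value = (value << 7) | (buf[i] & 0x7F)
            i += 1
        value = (value << 7) | buf[i]
        return value, i + 1               # (type id, bytes consumed)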

Contributor

kazuho commented Feb 25, 2013

@frsyuki

Another idea is that applications are advised to use the tags in this order: 0x7f, 0x7e, 0x7d, 0x7c, ..., 1.

Then applications will not use the most significant bit. Thus we might be able to use that bit to signal that the type tag is not a 1-byte integer but a variable-length integer (coded in variable byte coding).

I assume that you are talking about how the ExtendedType should be numbered. Am I right? If that's the case, it sounds like a very good idea, since IMO it would be too early to determine how the type-ids would / should be used. If we make such a guideline, it would not hurt people using the ExtendedTypes should we ever decide to introduce an official usage of the ExtendedTypes (except for binary (0x00)); we could number them from 1 in ascending order.

cabo commented Feb 25, 2013

Oh, I was hoping you were considering opening up the fixnums as a quarry for applications...

cabo commented Feb 25, 2013

(link to variable byte coding: SDNV, RFC 6256 http://tools.ietf.org/rfc/rfc6256.txt)

Midar commented Feb 25, 2013

Btw, if we are going to break compatibility in order to be extendable in the future, how about we get rid of the current integer types and encode them in LEB128? That would allow integers up to 128 bit while we just need a single type. For example, we could say that the first bit being a one means it's an integer, the second bit is used as a continuation bit, and the remaining bits as value. If the continuation bit is set, the next byte is read as follows: if the most significant bit is set, this is a continuation bit. The lower 7 bits are the next 7 bits of the value. This could be done until we have enough bytes to hold the value.

This is pretty much LEB128, with the exception that for the first byte, it's moved by one bit.

This sounds like a really good way to encode numbers and would make codepoints free. We could free even more codepoints if we would say that it has to start with 11 or something like that.
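
For reference, a minimal sketch of plain unsigned LEB128 (ignoring the proposed one-bit shift in the first byte):

    def leb128_encode(value: int) -> bytes:
        # little-endian base-128: 7 value bits per byte, MSB = continuation
        out = bytearray()
        while True:
            byte = value & 0x7F
            value >>= 7
            if value:
                out.append(byte | 0x80)
            else:
                out.append(byte)
                return bytes(out)

leb128_encode(300) gives ac 02, the classic two-byte example.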

cabo commented Feb 25, 2013

@Midar LEB128 is just SDNV done the other way round. msgpack is MSB-first throughout, so SDNV would be a better fit. But there are so many ways to skin this cat... So far, msgpack has been able to use type-length-value throughout, where the length of the length (or, for short things, the length itself) is encoded in the type byte.

It is 2am in Japan, so I expect I'll do the Internet-Draft for Orlando based on what we have so far. I think 5022569 is more widely understood at this point than 5028082, so I'll stick with 5022569 for now. Unless somebody speaks up with a great idea in the next hour or so... (Of course, that doesn't stop us from further exploring 5028082 here.)

Owner

frsyuki commented Feb 25, 2013

@cabo

Oh, I was hoping you were considering opening up the fixnums as a quarry for applications...

Do you mean you thought we can grab bytes from positive/negative fixnum formats and assign them to other types even though it breaks compatibility...?
Well, I believe you think working code is important and you would never adopt such an idea. And I will never agree with such a completely incompatible change.

cabo commented Feb 25, 2013

Sorry, was trying to be facetious. Should remember to put in smilies...

Owner

frsyuki commented Feb 25, 2013

Regarding the RFC, I do not really understand why msgpack needs to go through the IETF's standardization process at this time.

Owner

frsyuki commented Feb 25, 2013

I said serializers should not use these types by default but I agree with the basic concept of @kazuho's extension idea.

Application-specific extension types are useful in a case like this, for example:

An application wants to deal with type X transparently (meaning without a schema). One idea is to just give up and use a schema or something. Another idea is to introduce application-specific extension types.
If the deserializers provide applications with a way to convert objects carrying application-specific type information into an instance of type X, applications can restore type X transparently.
If the serializers provide applications with a way to convert an instance of type X into the Extended type, applications can store type X transparently.

One implementation idea is to provide an API to register functions like these:

    # register a callback function to read a typed array
    unpacker.register_type(:type_id=>0x7f) {|data|
        array = data.read_int16.times.map { data.read_int16 }
        TypedArray16.new(array)
    }

    # register a callback function to write a typed array
    packer.register_type(:class=>TypedArray16) {|obj,buffer|
        buffer.write_type_id(0x7f)
        buffer.write_int16(obj.size)
        obj.each {|element| buffer.write_int16(element) }
    }

cabo commented Feb 25, 2013

We don't have to (and won't) start the process in earnest. But it is good for the IETF to know that something is happening here that might obviate the need to invent something else. It is also good for msgpack to get some more scrutiny from people who have a wider perspective on doing Internet protocols. The time is about right to start this communication process, as there is a lot of interest in JSON right now in the IETF. There will be a JSON BOF in Orlando, and I'd like to answer the inevitable question of what a binary version of JSON might be with the prospect that a future version of msgpack might be it. And I want to continue discussing protocols that build on this vision.

cabo commented Feb 25, 2013

@frsyuki I would welcome this flexibility.

I would also expect people to come up with canned (= "standard") extension types, and it would be nice to maintain some commonality between applications. So if someone comes up with a good, reusable way to do timestamps, it is better if that is not 0x53 in one app and 0x47 in another.

I also wouldn't want to force applications to tag things in the format — a JSON-like environment might provide information on what an integer or a blob means in a better way.

Owner

frsyuki commented Feb 25, 2013

@cabo
I explained the problem that changing the behavior of serializers breaks working applications and could cause hesitation to adopt msgpack.

kenn commented Feb 25, 2013

For FixExtended, I suggest assigning 0xc4-0xc7, that is:

0xc4 - 11000100 - FixExtended 0 bytes
0xc5 - 11000101 - FixExtended 1 bytes
0xc6 - 11000110 - FixExtended 2 bytes
0xc7 - 11000111 - FixExtended 3 bytes

This means that FixExtended works much like the other Fix- types - you only need a bitmask to get the length, no if-branches. That way, the specification would look much simpler, and implementations can benefit from faster processing, as three branches (0xc4-0xc9, 0xd4, 0xd5) could mess up branch prediction and make the CPU pipeline stall. To me, it seems too granular to pay such a penalty.

Still, where we have sparse data and NULLs show up repetitively, those NULLs are represented in the shortest form.

This also saves some reserved types from being completely used up.
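
In code, the point looks like this (a sketch, not taken from any implementation):

    def fix_extended_len(marker: int) -> int:
        # 0xc4-0xc7 share the top six bits, so a mask suffices; no branch per length
        assert (marker & 0xFC) == 0xC4
        return marker & 0x03   # low two bits are the payload length (0-3 bytes)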

cabo commented Feb 25, 2013

@frsyuki I'm not quite sure what your last comment was commenting on. I think it is clear that future serializers/deserializers will run with parameters. Apart from registering extensions, parameters could also indicate a msgpack version (think 0.9/1.0/1.1, whatever). And/or you could put a version in the data. None of these are wonderful, but evolution happens, and it is better to embrace it (within reason) than to deny it.

cabo commented Feb 25, 2013

@kenn what application do you have in mind for 0-3 byte binary blobs? (All my binary blobs are not much, but a bit larger, so I'm just wondering.)

kenn commented Feb 25, 2013

@cabo Any types that have variable length, for instance. Shift_JIS, maybe? With such a crafted representation, it's likely that the most frequently used values are encoded to the shortest forms.

But I can see your point; probably the only thing I really care about is NULL - defining only FixNil might be satisfactory.

cabo commented Feb 25, 2013

@kenn I thought we had 0xc0 already? What do we need a second nil for?

kenn commented Feb 25, 2013

@cabo Are you saying that String should not have 0 bytes and use 0xc0 instead for empty string? :) You can't retrieve the type information from 0xc0.

cabo commented Feb 25, 2013

@kenn Sorry, I misunderstood this as a real nil. Yes, an empty blob is indeed useful. I'm just not sure that it occurs often enough to merit spending a code point.

kenn commented Feb 25, 2013

@cabo that of course depends on the data distribution and histogram, but I can say from my usage that sparse data is a real thing, particularly when we deal with denormalized relational data. Without FixExtended or FixNil, a null is going to be 4 bytes; with them, 2 bytes. Now we're talking about up to 2x space efficiency for null-oriented extended data.

cabo commented Feb 25, 2013

@kenn That kind of sparse data looks like a perfect usage for 0xc0 nil, though.

Contributor

kazuho commented Feb 25, 2013

@kenn

For FixExtended, I suggest assigning 0xc4-0xc7

I disagree for three reasons.

Reason 1) efficiency in terms of footprint. I agree with @cabo that the types that would go into FixExtended would likely be bigger than 3 bytes in general (except for strings), so such a design would be less efficient than my proposal.

Reason 2) the difference in implementation cost between my proposal and yours is negligible. It is just one if statement or something alike.

Reason 3) there is no positive reason to leave reserved bytes (note: even if we leave reserved bytes, we can never use them without making incompatible changes). I believe that the most important thing for a data format to guarantee is long-term compatibility, and I think that many others think the same way. But if we are to introduce strings (I understand we are), some incompatible change is inevitable. That would raise fear among existing / potential users that the format may once again break compatibility. I think this is very bad. Personally, I would never trust a data format that makes frequent changes.

In other words, I think we should guarantee to users, when we break compatibility by introducing strings, that we will never ever break it again. Guaranteeing that by design is a good thing.

cabo commented Feb 25, 2013

@kazuho I think it is important to distinguish backward and forward compatibility. Breaking backward compatibility occasionally is almost inevitable in any real-world system. The world simply changes. Now, msgpack doesn't have 128-bit integers today — do you suggest we add them now because we might need them later? YAGNI is a good advisor…. It pays to be prepared for change.

Breaking forward compatibility is worse. The current need to do that, at least in a partial sense, is caused by two things:

  • divergence in existing implementations. Some have been using raw for text strings, some for binary blobs. This wasn't a healthy situation, and cleaning it up was going to cause problems anyway. But the other problem is:
  • we don't have enough code points left that we could spend to solve this problem in the optimal way. Using up more code points means that it will be even harder to solve any future problems.

Painting yourself into a corner is rarely the best way to stay agile.

kenn commented Feb 25, 2013

First, I'd stress that I'm fine with @kazuho's spec. That said, I'll point out some aftereffects of this change that might have been missed.

  1. When 3 bytes aren't enough, probably 7 bytes aren't enough either, particularly when we can't store 8 bytes. Perhaps 4 bytes could be used often, but I don't think 5, 6 and 7 bytes will be used much. Where 5-, 6- and 7-byte values are observed, it's probably a variable-length format, and any arbitrary size would work proportionally.
  2. The performance implication could be negligible, but complicating the specification is never negligible IMO, as it's something that application users like myself read. An artificially complicated spec indicates the spec has a complicated history. Contrary to your concern, and inadvertently, it's a built-in way of telling newcomers that the spec was unstable.
  3. I fail to see why using up reserved types would help. I can see @cabo's point on int128. When we introduce a breaking change, things break no matter what. By not leaving reserved bytes, you're encouraging an earlier fork of msgpack that breaks "some" compatibility but works just fine for future (thus unknown right now) killer use cases. We have two classes of types (native types and extended types) and we still have an extension point in the extended area that's not used up, so what you're really saying is "we used up the first-class types and anything else should go to the second." I still fail to see why using up only the 1st-class types is a good thing.

Again, I'm fine either way - just wanted to put everything on the table. :)

kenn commented Feb 25, 2013

@cabo

That kind of sparse data looks like a perfect usage for 0xc0 nil, though.

Iff we have a schema or an IDL with it. Otherwise the type information for the value is lost forever. Again, we would have an empty string as 0 bytes of string rather than 0xc0.

cabo commented Feb 25, 2013

Enjoy: http://tools.ietf.org/html/draft-bormann-apparea-bpack-01

I tried to word the abstract to make sure that people understand this is a snapshot.

At least it is better than (and now finally replaces) the previous snapshot from October, -00.

Recommended reading, because it contains more information than just the codepoints.

Contributor

kazuho commented Feb 25, 2013

@kenn

When we introduce a breaking change, things break no matter what. By not leaving reserved bytes, you're encouraging an earlier fork of msgpack that breaks "some" compatibility but works just fine for future (thus unknown right now) killer use cases.

I am afraid your understanding is wrong.

My proposal does not prohibit adding new types; it adds the possibility to add 256 new types with 2 bytes of overhead (for small-sized data), or an infinite number of new types, without breaking interoperability.

We only have 8 remaining reserved bytes in @frsyuki's previous proposal (see https://gist.github.com/frsyuki/5022569). So the possibility to add new types using those bytes is pretty limited (and it also destroys interoperability).

But if we move to the approach I proposed, we can store many more types of data without breaking interoperability, with low overhead in terms of footprint, since it uses the few remaining reserved bytes to represent the FixExtended types.

Even if you do not hesitate to break compatibility when adding new types, this is really the last chance to introduce such a possibility, providing a way to add a certain number of new types in a compact form (as FixExtended in my proposal does).

kenn commented Feb 25, 2013

@kazuho

My proposal does not prohibit adding new types; it adds the possibility to add 256 new types with 2 bytes of overhead (for small-sized data), or an infinite number of new types, without breaking interoperability.

I understand what you mean.

My question boils down to: are we sure that there's no future in which 2 bytes of overhead is too much?

If the answer is yes, then I agree to use up and give every bit to FixExtended.

Still, the other part of my concern remains - the spec would look ugly and less human-friendly. That might alienate users from understanding it. Also there will be bumpy speed characteristics - it could get surprisingly and unnaturally slow when there is a lot of FixExtended data of 6 or 7 bytes length - branch hazards are real at the micro scale.

kenn commented Feb 25, 2013

Another idea is to have 0-5 bytes for FixExtended.

0xc4 - 11000100 - FixExtended 0 bytes
0xc5 - 11000101 - FixExtended 1 bytes
0xc6 - 11000110 - FixExtended 2 bytes
0xc7 - 11000111 - FixExtended 3 bytes
0xc8 - 11001000 - FixExtended 4 bytes
0xc9 - 11001001 - FixExtended 5 bytes

We can no longer use a bitmask, but instead we can just subtract 0xc4 to get the length. Less consistent and less human-friendly than 0xc4-0xc7, but two more lengths, and the performance will be predictable. Just food for thought.
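
As a sketch, the decode side of this variant:

    def fix_extended_len(marker: int) -> int:
        # 0xc4-0xc9: no single mask recovers the length, but a subtraction does
        assert 0xC4 <= marker <= 0xC9
        return marker - 0xC4   # payload length, 0-5 bytes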

Contributor

kazuho commented Feb 25, 2013

@kenn

Are we sure that there's no future in which 2 bytes of overhead is too much?

If you ever come up against such a situation, I think you should create a completely different format, as I have suggested to @cabo. MessagePack is a very compact and proven protocol, but the format is already complicated, maybe due to the fact that (if I understand correctly) it derived from one of @frsyuki's projects.

There are many possibilities if you redesign the protocol from the ground up. On this thread, I have already pointed out the possibility of using 4/5 encoding and the like for representing small numbers as scalars. You can also add typed arrays. The key type of maps could be limited to strings (if the assumed use-case is Web APIs). Strings can have shorter representations.

These approaches are no longer possible when extending MessagePack, since it is already a working protocol, and when upgrading a working protocol, what matters most is compatibility.

Still, the other part of my concern remains - the spec would look ugly, and that might alienate users from understanding it. Also there will be bumpy speed characteristics - it could get surprisingly and unnaturally slow when there is a lot of data of 6 or 7 bytes length - branch hazards are real at the micro scale.

I disagree. In the case of copying deserializers, the bottleneck will be calls to malloc (or other kinds of memory allocation functions), and the copying code can be inline-expanded for each fixed size. In the case of non-copying deserializers, all you need to do is set up a pointer. In either case, copying 6 or 7 bytes would not be a problem. And if you are talking about unaligned access, MessagePack is a byte-level protocol and there is already a 50% chance of copying between odd addresses and even addresses. My proposal does not add any new problem in that aspect.

And your claim that MessagePack might be considered to have been unstable is also wrong IMO. The sad fact is that MessagePack is unstable now, although many have expected the contrary for several years (aren't we trying to introduce incompatibility now by introducing strings?). And my position is that we should make MessagePack stable by design, so that we can guarantee interoperability this time.

Contributor

kazuho commented Feb 26, 2013

@cabo

Now, msgpack doesn't have 128-bit integers today — do you suggest we add them now because we might need them later?

No, what I am proposing is to use this opportunity (of adding a string type) to let such things happen more easily. Under my proposal, adding int128_t in the future would be quite easy, since it would not break backwards compatibility nor take one of the very few remaining slots (only 9 left in the previous version of @frsyuki's proposal, the one which you seem to prefer).

If you want to add int128_t and uint128_t by using the reserved slots, you would need to win a very intense battle over what should get into the remaining slots (only 9 left). And even if you succeed, the game would become tougher and tougher as time progresses and the reserved slots get consumed.

On the other hand, if we take the way I proposed, it would be much easier to add new types, since there are at least 256 slots, and introducing a new type does not break backwards compatibility.

Contributor

kazuho commented Feb 26, 2013

@cabo

Actually I expect you would be one of those who would benefit most from my proposal, which has now been merged into @frsyuki's (https://gist.github.com/frsyuki/5028082).

With this format and the revised ExtendedType numbering scheme proposed by @frsyuki in #128 (comment), it would be easy to create upper-layer protocols with the types you would need for supporting sensors (I assume that is what you are interested in). You can add your own types by using the ExtendedType IDs starting from 0x7f in descending order... and nobody will complain about it.

IMO it would help you a lot, since you could concentrate on solving your task instead of having tough discussions on how the few remaining slots should be used, or whether existing assignments of slots should be revoked (which @frsyuki clearly states he would never agree to in #128 (comment)).

PS. And if any of the types you add seems to have much broader usage than only in the areas you are working on (I assume int128_t and uint128_t would be good candidates), then we could discuss moving the type ID to the lower values (which would be considered application-independent), and the fruits of your work would be shareable with people working in other areas as well.

yappo commented Feb 26, 2013

I strongly wish to keep the new version backward compatible with the current MessagePack format.
And I hope that general users of the MessagePack library will not be confused.

mattn commented Feb 26, 2013

For example, if there is a scripting language which does not keep a string length in its string object (a.k.a. char*), I wonder whether this change will rescue it from the problems it will get. If not, such languages will have to provide both get_string / get_raw for the time being.
And they will have to provide a special class or something similar that simulates a string class to marshal/unmarshal messages. I'd rather not hope for incompatible changes. So if you at least plan to add a new string identifier to separate raw from string, I'll agree with that. Of course, I know the real issue is in the languages' specs.

Additional comment:
I'm not a heavy user of msgpack.

cabo commented Feb 26, 2013

@kazuho Let's be clear what your proposal is about.

You are adding a length-type-value (LTV) layer on top of the type-length-value (TLV) mechanism that msgpack has.
You gain another 256 code points, the encoding of which doesn't need to consider length, because that's done in the new LTV layer. You lose the ability to introduce any new short data elements at the TLV layer (I'm keeping up the Half example, but that is just an example). Even byte strings already have to pay the LTV tax.

You win forward "interoperability" in the sense that all new types are forced into the Procrustes bed of the LTV-TLV scheme — a receiver knows the length of any new type without actually understanding what it means. If I start sending you data in my wonderful new type, your deserializer won't give up, but you still won't know what it means. Yes, there is a value to the deserializer not giving up, but I would like to point out that it is limited: there still is no actual interoperability if you don't know what the data means.

(The type tagging scheme in the current Internet-Draft actually has all the same benefits, just a slightly different allocation of inefficiencies and above all the potential to do more meaningful interpretation of the data if the second-layer tag is unknown. And it keeps orthogonal the question whether to spend the top-level codepoints now or later.)

If the LTV-TLV scheme (or the tagging scheme) is going forward, I'd recommend spending time on two issues:

  • How is the new code point space actually curated. Right now it seems you want both free for all and careful allocation. You can't have all the benefits of both at the same time. RFC 5226 http://tools.ietf.org/html/rfc5226 may be a relatively dry document, but it points out a number of ways to manage a codepoint space.
  • Make the LTV/tagging scheme itself reasonably efficient (are seven-byte objects really that frequent?). That isn't possible without some data, and some vision of how the new space will be used. I think it would be useful to collect information about this from the various groups of current users (including those, like binarypack and msgpack-js, who already have a binary string type).
Contributor

kazuho commented Feb 26, 2013

@cabo

You are adding a length-type-value (LTV) layer on top of the type-length-value (TLV) mechanism that msgpack has.

Your understanding is wrong. First of all, MessagePack is not TLV. Int32 is encoded as shown below. There is no length.

0xd2 XX XX XX XX

To be correct, MessagePack is a TV mechanism that partially uses length to encode variable-length structures. And that is why you cannot extend the protocol without breaking backwards compatibility. If it were LTV or TLV, then such a problem would never arise in the first place.

And the encoding I am proposing also uses the same approach. It uses the first byte to show that it is an extended type (and, for short data, encodes the length in the same byte as well), and the succeeding byte to show the actual type (i.e. the subtype of the extended byte). It could be called MajorType-Length-SubType-Value encoding, but it is not an LTV layer on TLV.
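
A sketch of the TV point, assuming nothing beyond the int32 example above:

    import struct

    def decode_int32(buf: bytes) -> int:
        # the marker alone fixes both type and size; there is no length field to read
        assert buf[0] == 0xD2
        return struct.unpack(">i", buf[1:5])[0]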

  • How is the new code point space actually curated. Right now it seems you want both free for all and careful allocation. You can't have all the benefits of both at the same time. RFC 5226 http://tools.ietf.org/html/rfc5226 may be a relatively dry document, but it points out a number of ways to manage a codepoint space.

I understand the problem, but the question is: should we define it now? Aren't rough consensus and running code more important than discussing how we should maintain the extended type IDs? As proposed, people can start with application-dependent types, and then, should we reach any consensus, assign them an official ID. It's the same way the IETF handles port numbers. Things mature eventually, and the proposed extended type system has the capability to support such progress.

  • Make the LTV/tagging scheme itself reasonably efficient (are seven-byte objects really that frequent?). That isn't possible without some data, and some vision of how the new space will be used. I think it would be useful to collect information about this from the various groups of current users (including those, like binarypack and msgpack-js, who already have a binary string type).

I partially agree. However, to fix the problem you'd need to answer this question: would it be possible to collect the right data? It is hard if not impossible to predict what kind of data would be stored as extended types (we could name some possibilities, but it is impossible IMO to guess the relationship between probability and size). Besides, it would be harder to implement if the set of lengths that get the short form became non-contiguous.

These two points are the reason why I assigned the short form to the smallest sizes: absent any assumptions, it is the best way to decrease the overhead ratio of the objects.

rasky commented Feb 26, 2013

@methane @frsyuki please reclassify Python as a strong-string language. If you keep it as a weak-string language, the new specification says:

Serializer:

  • store the object in the String type if it can't know clearly whether the object represents a string or a byte array
  • don't have to validate a string on storing it
  • should store the object in the Binary type if users set a marker which means "this is a byte array" on the object

This would be a disaster for Python. Python already has a very clear way to distinguish between strings and byte arrays: strings are the "unicode" type, and byte arrays are the "bytes" type (aka "str" in Python 2.x). If the Python 2.x implementation is allowed to store arbitrary "str" into the new String type, a disaster would ensue and Python msgpack would be totally useless after all the effort of adding the String type.

On the other hand, for strong-string languages, the specification says:

Serializers:

  • store byte arrays using the Binary type
  • store strings using the String type
  • may implement an option to store byte arrays in the String type to keep the backward compatibility with current msgpack implementations

This is exactly what the Python binding for messagepack should do, and the only way to make it useful for Python programmers. Currently, msgpack-python is broken as I've shown in the example in my initial ticket.

Python has a strong distinction between bytes and unicode, and it's totally wrong to list it in the weak-string category.

Member

methane commented Feb 26, 2013

@rasky

I've written a new design of the msgpack-python API in Japanese. I'll translate it into English later.

I'll add options to the packer and unpacker. Users can select which type bytes are packed into.
The default is raw, since a lot of existing software accepts only raw.
(As you know, string will take over raw in the new format.)

The unpacker provides options for which type binary is unpacked to and which type string is unpacked to.
The default value of both options will be bytes (this may change a few years from now).
You will be able to select bytearray for binary and unicode for string, too.

If you want transparent string/binary separation, the packer will be able to pack bytes to binary
and the unpacker will be able to unpack string to unicode.
But that is not the default, for compatibility reasons; msgpack is not a format only for Python.
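A rough illustration of this plan (the option names here are hypothetical placeholders, not the final msgpack-python API):

    import msgpack

    # Hypothetical options sketching the plan above; real names may differ.
    packer = msgpack.Packer(pack_bytes_as='raw')     # default: raw, since much software accepts only raw
    unpacker = msgpack.Unpacker(binary_type=bytes,   # what binary unpacks to (default: bytes)
                                string_type=bytes)   # what string unpacks to (default: bytes, for now)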

Owner

frsyuki commented Feb 26, 2013

@rasky Thank you for your comment. Actually, I don't have a strong opinion on Python 2 because I don't use Python, so let me put "Python 2?" in both of the categories for now, because @methane mentioned a different thing.

I don't think the proposal is perfect. There may be unknowns (meaning we might not be able to classify Python clearly and might need other options). I think refining/rethinking the proposal plus experimental implementations is the next step.

rasky commented Feb 26, 2013

@methane I disagree with your reasoning. Once msgpack has the new string type (so both string and raw), it will be very clear what must be done: unpack string into unicode, unpack raw into bytes. That's the only serious solution for Python, and I think it MUST be the default. I don't see why you would choose a different default.

After all, it's the msgpack format that has decided to go this way. @frsyuki decided that by default all existing serialized msgpack data containing "raw" will be interpreted as "string" in the future. So I think msgpack-python should follow this, and not provide an additional backward-compatibility layer on top of the specification.

So: I'm OK with having both modes, but the one that closely follows the specification of messagepack must be the default in my opinion. Just do a major version release bump, and tell people about it. That's how messagepack is evolving.

Member

methane commented Feb 26, 2013

@rasky
The reason I chose bytes for the unpacker's default is compatibility with current msgpack-python.
I'll deprecate relying on the default option in the near future (or when the new format is supported) and change it later.

I don't have a schedule for changing the packer's default, because I can't forecast how fast the new format will spread in the world.
It will happen by the time msgpack-python reaches 1.0.

rasky commented Feb 26, 2013

@methane if you need compatibility, then @frsyuki's plan of changing the current raw into string is wrong. @frsyuki designed this plan to minimize compatibility issues, but in reality there will be lots of compatibility issues for strong-string languages (including Python 2).

The fact that you are using a different default means that you are basically trying to paper over the migration plan's problems within your binding. That is wrong in my opinion.

@frsyuki do you agree that a Python binding should BY DEFAULT unpack FixString and FixRaw into the non-string "bytes" container, and never use the "string" container by default? Even if you don't know Python, I think you can see that this is totally wrong, and that it means either your specification is wrong (and doesn't allow bindings to implement it correctly because of backward-compatibility concerns), or the binding is wrong.

dalle commented Feb 26, 2013

Do we really need 4 ways of storing an empty string?

0xa0
0xd9 0x00
0xda 0x00 0x00
0xdb 0x00 0x00 0x00 0x00

cabo commented Feb 26, 2013

@kazuho I don't have time for a full response, but two data points:

Almost all real-world TLV schemes have cases where the L is implied by the T. msgpack is pretty much a classic TLV encoding in this respect, except that many other TLV encodings have more commonality in the way they express lengths. Telling me you would like the words in my message to have different meanings doesn't mean my message is wrong; it just means you didn't invest the effort to read it the way I meant it.

I pointed to RFC 5226 to make you think about how to manage code point space. Port numbers aren't the only number resource managed by the IETF, and RFC 5226 points out other ways to manage a number space. Saying "it's like port numbers" is actually wrong, because the port number space has quite different characteristics from the code point space we are talking about. I recommend you indeed have a look at RFC 5226.

cabo commented Feb 26, 2013

@dalle msgpack doesn't try to enforce one way. Consider the number of ways you can write a 0!

dalle commented Feb 26, 2013

@cabo I was just wondering whether it could be of any use to pack data even further. Since 0xa0-0xbf covers 0 - 31 byte strings, the length parameter of 0xd9 could be offset by 32 bytes, giving a 32 - 287 byte string. Perhaps unnecessary: the data would be only slightly more packed, and it increases the complexity of the protocol. I imagine there are a number of occasions when a string (or perhaps even a binary) is exactly 256 bytes long; this way we wouldn't need to go to the full 16-bit length string (0xda) to store it.
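A sketch of what I mean (illustrative only; lengths above 16 bits are omitted):

    import struct

    # Biased "string 8": FixString already covers 0-31 bytes, so 0xd9
    # could store length-32 and reach up to 287 bytes in one length byte.
    def pack_str_biased(s):
        data = s.encode('utf-8')
        n = len(data)
        if n < 32:
            return bytes([0xa0 | n]) + data            # FixString
        if n < 32 + 256:
            return b'\xd9' + bytes([n - 32]) + data    # biased string 8
        return b'\xda' + struct.pack('>H', n) + data   # string 16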

cabo commented Feb 26, 2013

@dalle Yes, we do that in other encodings (look up the encoding of CoAP). It doesn't strike me as the msgpack way… While technically your way is superior, it also causes confusion with implementers who don't see the same approach in other data types. You do have a point that this is more useful with an 8-bit length (which msgpack so far didn't have(*)) than with the jump from 4/5 bit to 16 bit we have in other places. On the other hand, how many strings are between 256 and 287 bytes? 1 %? And you still only save one byte for each of them, 0.4 %. So, in total, you saved 0.004 % for confusing some implementers. Worth it? I don't know.

(*) well actually it has 0xcc and 0xd0 8-bit integers where the same would apply even more (half of the 0xcc and 5/8 of the 0xd0 space is useless)… and it hasn't been done there either.

cabo commented Feb 26, 2013

@rasky as usual in such a transition, there are two considerations:

  • What is the right way. Of course, all languages that distinguish byte strings from text strings (and that includes JavaScript, Python 2/3, Ruby 1.9/2.0 (*)) should make use of this fact in the default configuration.
  • What is backwards compatible. Old code that just happens to be upgraded to a new version of the library shouldn't break. This is in direct conflict with the previous point. You don't want to be forced to fix up dozens of .to_msgpack method calls into .to_msgpack(:legacy_mode), even if it is the right thing in the long run… So, for these applications, there is a need for the default to support the legacy approach instead of the desirable one.

In summary, there needs to be some thinking about managing the transition. From a technical point of view, managing that isn't too hard. But it also must look good and avoid alienating people…

(*) I don't understand the "string-strong" etc. terminology. The question should be about how people actually code, not some theoretical concept. The languages I cited all make people think about the difference between text and binary and all make that difference available pervasively to the programmer. In other languages, that is still a problem being actively worked on, see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html for a nice example. (Languages that don't address this issue at all are probably dead or completely unsuitable for either text processing or work on binary data.)

rasky commented Feb 26, 2013

@cabo it's the new msgpack specification that mandates that "raw" will become "string". This means that, in @frsyuki's view, there is no big backward-compatibility problem to be handled in this transition, as most msgpack users will have UTF-8 strings stored in existing messagepacks. What I can't understand is why we should first draft a specification that breaks compatibility (by deciding that "raw" becomes "string" for existing msgpack), and then implement a workaround at the bindings level, thus making all bindings do the wrong thing by default (that is, by default they would implement the msgpack specification incorrectly). It looks like a completely wrong plan to me.

I see two possibilities:

  • If "most" msgpacks out there use strings as "raw", then the current plan is correct, and bindings should convert by default raw into strings. For instance, in Python, u"foobar" == "foobar" for all ASCII strings, so for all ASCII strings there would be absolutely no difference even if msgpack-python did the correct thing by default.
  • If "most" msgpacks out there use non-strings as "raw", then the current plan is wrong and should be adjusted accordingly, that is keeping the current "raw" as "raw", and adding a separate code for "strings".

I would also notice:

  • There is no need to change all calls to unpackers to add a boolean argument. It is sufficient if msgpack-python implements a global call setCompatibilityMode(True) or setSpecificationVersion(1) to be called at the beginning of existing programs (see the sketch after this list). This would be totally acceptable. Moreover, existing msgpack users would know that there is a msgpack migration going on because they follow msgpack news, and they would know that upgrading msgpack-python to the new major version breaks their code (because they would read changelogs before upgrading).
  • With the proposed solution, the migration will never happen. You are binding new generations of Python messagepack users to create wrong messagepacks by default forever, unless they can be convinced to read all the documentation, find out that X months/years ago there was no string type (and this will be hilarious 2-3 years from now), find out that msgpack-python is migrating by doing the wrong thing by default, and that they must explicitly set a new argument "implement_correct_specification=True" on all calls. It's totally backward. It's old users that should migrate their code, not new users.
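For illustration, the global call could look like this (the function name is just my suggestion, not an existing msgpack-python API):

    import msgpack

    # Hypothetical global compatibility switch: one line at program
    # start keeps the pre-string-type behavior for legacy programs.
    msgpack.setSpecificationVersion(1)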

najeira commented Feb 26, 2013

I want to read currently msgpacked data with new readers without changing my code.
I can accept making the new spec the default mode after a deprecation period of 6 months or a year.

rasky commented Feb 26, 2013

@najeira if you want to do that, then just don't update to the new msgpack-python that supports the new msgpack specification. I don't think it's reasonable to ask that all of the following things hold at the same time:

  • You don't need the new specification
  • You don't want to change your code
  • You want to update msgpack-python to the newer version

What is the benefit of updating to a version that implements a new specification you don't want to use, while demanding not to add even one line of code? My proposal requires one line of code to be added to your program.

catwell commented Feb 26, 2013

@rasky Since it's Python, I think that's not how it works. Your software may depend on a system version of MessagePack so you don't really control when your users decide to update.

cabo commented Feb 26, 2013

@rasky we sometimes have to upgrade code for other reasons, e.g. because of security bugs. If that suddenly breaks code, that's not good. So when the semantics of the library change, it's probably (speaking Ruby)

    require 'msgpack/modern'

or

    require 'msgpack'
    MessagePack::IM_NOT_STUCK_IN_THE_PAST = true

or some such.

najeira commented Feb 26, 2013

@rasky thank you for your comment.

I want to update msgpack libraries to get future bug fixes.
I will be able to update with no problems, because I know the new spec and I will read the changelog.

But I think somebody will miss the compatibility information in the changelog.
We should be careful about breaking backward compatibility.

During the deprecation period, you could use the global "setSpecificationVersion(2)".

Member

methane commented Feb 26, 2013

@rasky
Since msgpack is a binary format, users may be slow to notice that updating msgpack-python has corrupted their data.
I think changing the default behavior without a deprecation process is not Pythonic.

I'll change the default behavior of the Unpacker after a deprecation period.
And I'll change the default behavior of the Packer after most applications using msgpack support the new format.

I've translated my current plan for supporting string-ext:
https://gist.github.com/methane/5022403

Midar commented Feb 26, 2013

Maybe we should indeed add string as a new type and not change raw to string.

That way, old data will stay exactly the same if you update. Old parsers can read data from a new generator just fine as long as no strings are used. Only new data won't work with old parsers.

rasky commented Feb 26, 2013

OK, I still disagree for the stated reasons, but I won't push this any further, as we won't reach consensus. Either plan is fine, it's just how you want to manage your community, so it's up to you.

Member

methane commented Feb 26, 2013

@rasky Thank you for understanding me.
Yes, it's an issue of msgpack-python's API-change policy.
It's not a format issue.

Member

methane commented Feb 26, 2013

@Midar Generally speaking, I feel you're right: adding a clean string type is good design.
A clean string must be UTF-8, and decoders can just reject non-UTF-8 data.

But there are some reasons why it's difficult to adopt such a design.
The main reason is described in @frsyuki's proposal: forcing encoders to validate UTF-8 would be difficult,
or would hurt performance, in languages that don't have a native unicode type (C, Perl, PHP and Ruby).
So we cannot prohibit invalid UTF-8 in the string type.

And there are some other reasons:

One big use of msgpack is logging, and log data contains a lot of "maybe string" or "almost string" values.
Adding a clean string type to msgpack doesn't help such applications.
But adding a non-string type may help log processors: for example, console output can skip non-string data,
and regex matching can be skipped for non-string data.

Many applications use msgpack as a compact JSON now.
They already use raw as a clean string type.
Adding another clean string type would confuse them.

On the other hand, adding a binary type helps most applications.
If you want a clean string type, you can put only validated UTF-8 into string.
Decoders should have an option to reject invalid UTF-8 data in string.

Contributor

ugorji commented Feb 27, 2013

I started following the msgpack string support thread a few hours ago.

I wrote the Go language msgpack implementation at
https://github.com/ugorji/go-msgpack
http://blog.ugorji.net/2012/04/announcing-go-msgpack.html
This is currently the de-facto msgpack implementation within the Go community.

Initially, I believed that native string support was not necessary. After much use,
I have come to understand that it is, especially when interop is necessary.

My concerns with the current proposals on the table are:

  • Let current RawBytes now represent utf-8 strings (because many of the libraries work this way).
    I think this is an error. Imagine an organization that uses SHIFT-JIS and keeps their strings as
    SHIFT-JIS encoded raw bytes. Now, the stored data is read using a python3 client, which takes the
    raw bytes and converts them to a str (using bytes0.decode("utf-8")). This change hoses them completely.
  • We are taking backward compatibility (of stored data) off the table
    (especially in terms of the semantic meaning).

Let's take a step back, and say that we want to achieve everything in a way that will have
minimal impact on the libraries and on stored data, including:

  • backward compatibility
  • forward compatibility
  • differentiation between (utf-8) strings and byte arrays
  • support for general-purpose extensions
  • support for private extensions

Let's look at this possible solution. Different folks proposed different parts of this earlier:

    1. All libraries should ignore (skip) all the reserved words that they cannot handle
    2. utf-8 string is now explicitly defined as 0xc1 representation_for_byte_array
    3. Spec-Defined Extensions are explicitly defined as 0xd9 representation_for_array
      representation_for_array is an array where the first element is a tag, and all other
      elements represent the value.
      2 spec-defined extensions I can think of are non-utf8-strings and datetime.
      These are pretty common things which people will want to store, or have already been
      storing in different non-interoperable ways.
    4. Private extensions are defined similarly to spec-defined extensions, but using 0xd8

Let's see some examples:

  • utf-8 strings, because they are everywhere
    0xc1 representation_for_byte_array
  • time representation as 2 integers: seconds since epoch, and nanosecond offset, and timezone (e.g. UTC)
    (Assume extension code for datetime is 0x01)
    0xd9 followed by representation for [0x01, 1234568, 199929939, "GMT"]
  • string in SHIFT-JIS
    (Assume the extension code for non-utf8-string is 0x02,
    and the byte array representing the string in SHIFT-JIS is [7,1,2,6,3])
    0xd9 followed by representation for [0x02, "SHIFT-JIS", [7,1,2,6,3] ]
  • my application Point class, which can be represented by 2 double-precision co-ordinates
    (Assume private extension code for Point is 0x01)
    0xd8 followed by representation for [0x01, 23.25, 32.12]
  • my application currency which is represented by currency denomination and double-precision value
    (Assume private extension code for currency is 0x02)
    0xd8 followed by representation for [0x02, 25.50, "USD"]
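For concreteness, here is a sketch of how such a tagged-array extension could be emitted, riding on the existing array encoding (0xd9 and the tag values are the placeholders from the examples above):

    import msgpack

    # Spec-defined extension: marker byte, then an ordinary msgpack
    # array whose first element is the extension tag.
    def pack_spec_extension(tag, *values):
        return b'\xd9' + msgpack.packb([tag] + list(values))

    datetime_ext = pack_spec_extension(0x01, 1234568, 199929939, "GMT")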

Note:

  • In terms of implementations, both Java and Python3 want to decode before storing as a string,
    but some other languages like C++, Go, etc just store the raw bytes in a string holder (and don't decode).
Owner

frsyuki commented Feb 27, 2013

@rasky

it's the new msgpack specification that mandates that "raw" will become "string". This means that, in @frsyuki's view, there is no big backward-compatibility problem to be handled in this transition, as most msgpack users will have UTF-8 strings stored in existing messagepacks. What I can't understand is why we should first draft a specification that breaks compatibility (by deciding that "raw" becomes "string" for existing msgpack), and then implement a workaround at the bindings level, thus making all bindings do the wrong thing by default (that is, by default they would implement the msgpack specification incorrectly). It looks like a completely wrong plan to me.

Yes, I think I'm on the same page with you.

If "most" msgpacks out there use strings as "raw", then the current plan is correct, and bindings should convert by default raw into strings. For instance, in Python, u"foobar" == "foobar" for all ASCII strings, so for all ASCII strings there would be absolutely no difference even if msgpack-python did the correct thing by default.
If "most" msgpacks out there use non-strings as "raw", then the current plan is wrong and should be adjusted accordingly, that is keeping the current "raw" as "raw", and adding a separate code for "strings".

I agree. I'm not sure about the Python specifics, but that's a concern in strong-string languages.

There is no need to change all calls to unpackers to add a boolean argument. It is sufficient if msgpack-python implements a global call setCompatibilityMode(True) or setSpecificationVersion(1) to be called at the beginning of existing programs. This would be totally acceptable.

Global settings don't work if a user is writing a library that uses msgpack; actually, a "scoped" setting is necessary. Some languages might have this kind of scope, but many languages don't, I think.

With the proposed solution, the migration will never happen. You are binding new generations of Python messagepack users to create wrong messagepacks by default forever, unless they can be convinced to read all the documentation, find out that X months/years ago there was no string type (and this will be hilarious 2-3 years from now), find out that msgpack-python is migrating by doing the wrong thing by default, and that they must explicitly set a new argument "implement_correct_specification=True" on all calls. It's totally backward. It's old users that should migrate their code, not new users.

I got it; I understand what you mean. That's why I wrote that deserializers should, by default, return different types for strings and binaries in a new major version, despite the difficulty.
As @najeira mentioned, the current implementation could have bugs. That means maintainers need to keep two series updated (the new and old major versions). However, I don't think that's easy, so it's difficult to require them to release a new major version with the new default behavior soon. New users still have the option to turn the new feature on.

As @methane mentioned, it's an issue of msgpack-python's development policy.

Owner

frsyuki commented Feb 27, 2013

@cabo @kazuho

I think one of @kazuho's points is whether old deserializers can accept unknown types or not (I don't care whether msgpack is TLV or not).
Current deserializers reject all unknown types and just throw an exception, because they don't know the length of the following data.

Thus applications can't introduce new types: users need to modify both deserializer and serializer code, and the data stored by the modified serializers is not accepted by other deserializers.
@kazuho's idea keeps msgpack implementations away from application-specific (upper-layer) matters.

Owner

frsyuki commented Feb 27, 2013

@ugorji Interesting.

  1. All libraries should ignore (skip) all the reserved words that they cannot handle

This doesn't work because deserializers can't know the length of the following data (meaning we can't add types without breaking compatibility). The "Extended type" idea works, as @kazuho says:

The problem of current MessagePack is that it does not define a solid way to extend the protocol without sacrificing backwards compatibility. Unless we resolve the issue now, it is likely that we would have a protocol update again that would break compatilibity (i.e. old decoders refusing to work since it cannot decode the newly introduced type).


Spec-Defined Extensions are explicitly defined as 0xd9 representation_for_array representation_for_array is an array where the first element is a tag, and all other elements represent the value.

Similar to this idea, but I think the following looks very good: deserializers return a tuple of (type, data) where type is an integer and data is a byte array.
Type == 0 is the only pre-defined type, meaning that the data represents a byte array. The other types are for application-specific (upper-layer) extensions, and deserializers never restore the object into an instance of a certain class automatically.

Java and Python3 want to decode before storing as a string,

Well, the current (and new) Java implementation doesn't decode UTF-8 in the deserializer, because Java is a statically-typed language (basically; aside: it has a feature to deserialize a byte array and convert its type at the same time, as a performance optimization).

Java programs have pre-declared type information, and the msgpack library can provide a way to convert deserialized objects (whose class is something special defined in the msgpack library) into the declared types. This type-conversion process decodes UTF-8 if the destination type is String. But if the declared type is a byte array, it doesn't decode UTF-8.

In other words, in statically-typed languages, a msgpack library can use a schema. I think other statically-typed languages can provide the same mechanism (as far as I know, at least the current C++ implementation has the same feature).

Thus we basically don't have to care about the string/binary problem of deserialization in statically-typed languages. As the proposal mentions, though, serializers could cause compatibility problems in weak-string languages even if they are statically typed.

Owner

frsyuki commented Feb 27, 2013

@ugorji Sorry, I misunderstood one point:

All libraries should ignore (skip) all the reserved words that they cannot handle
This doesn't work because deserializers can't know the length of the following data (meaning we can't add types without breaking compatibility). The "Extended type" idea works, as @kazuho says:

This works, because you meant 0xC1 as an annotation on objects defined by the existing spec.

cabo commented Feb 27, 2013

@frsyuki The proposal @kazuho made and the one in 1pre2 (including the appendix) are actually almost isomorphic.

They differ in two places where we seem to have different judgement:

  • I believe untagged binary is the most important addition. So tagging is a bit less efficient in my proposal, and untagged is more efficient.
  • @kazuho believes this is the last extension we'll ever make at the TLV layer, so he can spend more code points and make some cases more efficient. I believe it is good to be able to embrace change again at a future time. This is actually orthogonal to the previous point. (Because extensions at the TLV layer are pretty much equivalent, except for coding efficiency, to new tags, that may indeed be possible, if we believe we will never need to add an efficiently encodable type. I'm keeping up my Half thing to suggest otherwise…)

One more difference: I also believe tagging is useful beyond binary. Tagging an integer or an array/table might be more natural for some new "extended" types. So 1pre2 Appendix B makes that possible in addition to tagging binary.

cabo commented Feb 27, 2013

A while ago, I wrote:

Breaking forward compatibility is worse. The current need to do that at least in a partial sense is caused by two reasons:

  • divergence in existing implementations. Some have been using raw for text strings, some for binary blobs. This wasn't a healthy situation, and cleaning it up was going to cause problems anyway. But the other problem is:
  • we don't have enough code points left that we could spend to solve this problem in the optimal way. […]

This is the reason why there cannot be an entirely smooth transition. You just get to choose which of the various groups of implementations is hurt the least (actually: unhurt). (Or you can choose to ignore the issue, but I think we are beyond that.)

Member

methane commented Feb 27, 2013

@cabo Adding a new type at the first-class LTV layer will break backward compatibility.
I agree that we should have forward compatibility this time, and never break backward compatibility again.

Contributor

ugorji commented Mar 3, 2013

@cabo

The examples you give mostly have been designed 15 years ago or earlier.

That is a false statement. You can only make that claim for the String.getBytes example. All the others are relatively recent and went through the same thinking that we are doing now. (I'm cautious about getting into tangents wrt python3 and JSON and unicode, but I'm well aware of the history there also.)

Can you extend your gist with some examples?

I'm not sure how to give a better example. We know how arrays, ints, floats, and Raw are currently represented in msgpack. All the new "types" just piggy-back on those, i.e.:

ExplicitString is one byte (e.g. 0xd4) + representation of Raw
ExplicitBinary is one byte (e.g. 0xd5) + Representation of Raw
Timestamp is one byte (e.g. 0xd6) + representation of an array containing 1, 2 or 3 elements
PrivateExtension is one byte (e.g. 0xd7) + representation of an array containing 2 or more elements, with first element being a Tag (FixNum)

cabo commented Mar 3, 2013

@ugorji Actually, you are right, Unicode support in MySQL is only about 10 years old, not 15. I gave the link to the Python3 internals — the design is quite recent, but I understand it has been done this contorted way to support some backward compatibility with existing extensions. More generally, system-internal interfaces will always provide more options than modern on-the-wire formats.

But that is indeed a side discussion.

With the examples, I now understand your proposal better. (I'm just not sure whether you'd have a 0x92 after the 0xd7 or whether that is implied.)

I don't have a strong opinion on the timestamp. Sure, this can be done this way. (No timezone for floats? Not a big problem.)

I think the ternary approach to the text/bytes dichotomy is just offloading some of the work to places where it doesn't belong. It may somewhat ease the transition of MessagePack, but it prolongs the pain of that transition into the indefinite future.

I prefer immediate pain to long-term pain. That is certainly a matter of philosophy.

I also would prefer single-byte overhead for JSON-style short UTF-8 strings.

Contributor

ugorji commented Mar 3, 2013

@cabo

I also would prefer single-byte overhead for JSON-style short UTF-8 strings.

Yes, I forgot about that one. That's the Raw8 that @frsyuki defined in his proposal as String8.

I'm just not sure whether you'd have a 0x92 after the 0xd7 or whether that is implied.

Great point. I haven't thought much about how to make the format more compact; I just wanted to discuss the high-level design and ideas first and make sure we think through all the tradeoffs carefully. That will definitely be a great optimization if we can do it.

No timezone for floats? Not a big problem.

This can be discussed further, but the idea is that if you want to specify a timezone, you use the larger array style. The most compact will be int/float/double seconds (@methane's and @Midar's idea).

I prefer immediate pain to long-term pain. That is certainly a matter of philosophy.

I'm with you - I'm exactly the same way. But we both know we can't make that determination for everyone, especially people who have committed to msgpack for the last few years (e.g. Pinterest-style customers, etc). If my app was live and this changed, I would have been upset (I was planning to use Raw as binary in Go).

P.S.

This is personal for me, because I'm planning on storing all my data in my datastore and caching layer as MsgPack encoded objects in NoSQL KV-style storage, and have hopes of terabyte-sized data. Timestamps come from different locations (which is why I need the timezone).

Contributor

ugorji commented Mar 8, 2013

Hi folks,

I've updated my thoughts at https://gist.github.com/ugorji/5077089

I went through my library and updated it as a proof of concept to see how things work, how much churn the library has to make, and how legacy mode for serializers and deserializers is still supported. It works very well with the changes I made to the idea, easily supporting legacy mode, new mode, mixed mode (e.g. Raw is string, and binary is explicit), extensions, etc.

Please take a look and share your thoughts.

Owner

frsyuki commented Mar 11, 2013

I updated my proposal!: MessagePack update proposal v3
https://gist.github.com/frsyuki/5131535

@ugorji @cabo
I agree that msgpack should accept the ambiguity of strings and byte arrays, and that the Raw type represents that ambiguity.
But I think we can assume that applications don't use ambiguity-tolerant code (which accepts and handles the ambiguity correctly) and ambiguity-strict code (which doesn't accept the ambiguity and assumes transparency) at the same time.

I would like you to read the "Solution" section of my updated proposal.

The proposal also includes guidelines for the future possible extensions including of the time type.

Points of the proposal are:

  • It doesn't obsolete the current implementation of msgpack at all. Thus it provides users with an option to keep perfect compatibility with existing data and applications
  • It accepts the existence of ambiguity between strings and byte arrays.
  • It suggests reasonable default behavior
  • It achieves the important purposes:
    • msgpack provides users with a mechanism to deserialize/serialize string objects transparently without causing incompatibility
    • msgpack provides upper-layer code with a mechanism to define original types without changing the msgpack spec
    • msgpack keeps compatibility and doesn't cause any impact on existing code even if new types are defined in msgpack in the future
Contributor

kazuho commented Mar 11, 2013

@frsyuki
Thank you for writing down the proposal.

I fully agree with the three purposes (objectives), and I think the proposal is a clever approach.

One little thing, how about phrasing the section "guidelines of new releases of msgpack implementations", as:

existing implementations that are already capable of handling binaries stored in "raw" should not enable "binary_extension" by default without users' consent in future releases, since such an action would break source-level compatibility

My understanding is that for other cases there would be no compatibility concerns due to the transition; and actually that is the reason I favor v3 over v2.

Midar commented Mar 11, 2013

@kazuho Wouldn't your issue that you need typed arrays be solvable by using tagged binary, as in @cabo's proposal? Anyway, I still don't think MsgPack would be the right format for that: it sounds like they store these values in a file. You want them aligned correctly so you can just mmap() them; MsgPack does not know anything about alignment, so that sounds like a bad idea. MsgPack is not so useful as a file format for huge data, but useful as a network format for small data. File formats for huge data tend to align stuff correctly, because a few wasted bytes on the hard drive don't matter while performance (i.e. being able to just mmap() it) is much more important there. Especially in 3D graphics! This is why I think such extensions are not too useful.

@ugorji

One major reason why I think we should allow options for character encoding type, is because this is done even in databases. For example, IBM DB2 unicode DB advises customers that they can either use UTF-8 or UTF-16 for storage and performance recommendations. For example, COBOL, Java use native UTF-16, so converting bytes to this format is cheap. However, for Japanese text, UTF-8 uses about 50% more space than UTF-16, so UTF-16 may be better option. So let the (de)serializer hint define that.

Given that msgpack is all about compact performant storage and (de)serializing, this is a natural option.

This is not compatible with the lightweight approach of MsgPack. Even for Asian languages, there's only a little benefit to UTF-16 over UTF-8, while almost all other languages are at a disadvantage with UTF-16. Therefore UTF-8 is the only choice that makes sense. And supporting non-Unicode character sets will just unleash hell on earth ;).

Oh, and don't forget that for UTF-16, we need to either decide on an endianness, have two UTF-16 string types (BE and LE), or always use a BOM (wastes 2 bytes!).
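The size trade-off is easy to check (CPython):

    # Japanese BMP characters: 3 bytes each in UTF-8, 2 in UTF-16,
    # so UTF-8 is ~50% larger; ASCII doubles in size under UTF-16.
    s = u'こんにちは'
    len(s.encode('utf-8'))              # 15
    len(s.encode('utf-16-le'))          # 10
    len(u'hello'.encode('utf-16-le'))   # 10, vs. 5 in UTF-8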

Regarding using a double to store timestampz, I think the state of time management has moved beyond using just a single value with lost precision to represent time.

Care to elaborate? You think more than microsecond precision for the next 10000 years is not enough? Because that is what you get with a double. Today, you even get nanosecond precision IIRC.

@frsyuki

https://gist.github.com/frsyuki/5131535

This format does not seem to have explicit string support, only for raw and extensions. Does this mean if you want to have a guaranteed string and not "maybe a string" you need to use extensions? If so, maps with strings as keys will become insanely inefficient.

Owner

frsyuki commented Mar 11, 2013

@Midar
The proposal v3 has string support. It clearly allows deserializers to assume that the Raw type contains only UTF-8 strings, and requires serializers to store only strings using the Raw type. It also suggests doing that by default.
On the other hand, it doesn't force serializers to store only valid UTF-8 byte sequences in the Raw type. The reasons are performance, weak-string code, and compatibility with existing data.

My intention is that the format spec itself doesn't force the choice of whether applications should assume the type contains only valid UTF-8 strings, given those downsides. The format spec suggests that implementations may assume valid UTF-8 strings by default, and that they should provide the other option as well.

rasky commented Mar 11, 2013

Oh no, it seems we're going backward. Didn't we say that we must only support UTF-8 because anything else must die, die, die, die among flames? Why have you now dropped the UTF-8 requirement? The new spec is basically useless if implementations are not forced to ONLY use UTF-8 for both serialization and deserialization, and specifically they MUST NOT encode any binary buffer as the string type.

yrashk commented Mar 11, 2013

@rasky existing implementations rely on the fact that the current standard defines those raw types as binaries; therefore, if you make them UTF-8 only, this will break backward compatibility.

rasky commented Mar 11, 2013

We discussed this already for weeks. Currently serialized msgpacks are a mishmash of binary strings (e.g. image files), UTF-8 strings and non-UTF-8 strings. There is no way you can preserve backward compatibility; it's a dead horse. We have to decide on the path of least breakage. My suggestion has always been to treat the current Raw type as "black-box binary data" and add a separate "UTF-8 string", but @frsyuki is vocal about converting the current Raw type into String.

Now, I personally don't care about the migration plan. Anything goes for me. BUT it's paramount that we end up with a String type that encodes UTF-8 strings ONLY, and that bindings and applications can assume (for both serialization and deserialization) that that String type means "a Unicode string encoded in UTF-8 format". I don't care if we get there in 3 years or tomorrow, but it's important that we agree on the goal.

Having a specification where the Raw type is maybe UTF-8 or maybe not, depending on the bindings, on the language, on the configuration options, and on the moon phase, isn't solving ANY problem. It's only making things worse, much worse.

EDIT: fixed a few mistakes

yrashk commented Mar 11, 2013

@rasky I agree, a specific UTF-8 string type would be very useful. As the author of https://github.com/yrashk/exmsgpack I am personally in favour of treating the current Raw type as "black-box binary data" and adding a separate "UTF-8 string" :)

cabo commented Mar 11, 2013

I'm with @rasky and @yrashk on this. It is very important to nail down things and come up with an unambiguous semantics. If the coding efficiency starts to hurt, we can always swap out the encoding layer as long as the semantics and structure (and thus the API) stay constant. (Of course, it should not become ridiculously inefficient, but optimizing the byte stream may be on a more application dependent layer than the more global activity of standardizing on the structure.)

Owner

frsyuki commented Mar 12, 2013

@rasky @cabo @yrashk

The v3 is probably too neutral on the choice of ambiguity.
What do you think about this v3.5, which makes the position clearer?

MessagePack update proposal v3.5: https://gist.github.com/frsyuki/5139552

Contributor

kazuho commented Mar 12, 2013

@Midar

Wouldn't your issue that you need typed arrays be solvable by using tagged binary, like in @cabo's proposal?

It can be done either way. The differences between the approaches are: a) whether or not some slots are left as "reserved," and b) space efficiency.

My understanding is that the idea behind introducing an "extension type" is that MessagePack should be open to application-level extensions in a way that does not cause interoperability problems.

If that (the italic part) is true, then it is not necessary to leave some slots as "reserved" (since changing their definition would introduce interoperability problems); using all the slots to create a space-efficient encoding is preferable and will be more attractive to application developers.

OTOH, if we decide not to care about interoperability problems when application developers add new types (note: I am opposed to such an approach), IMO we should not introduce any kind of type-extension mechanism at all, since it would be a waste of the few remaining slots.

It sounds like they store these values in a file. You want them aligned correctly so you can just mmap() them. MsgPack does not know anything about alignment, so that sounds like a bad idea.

I had the same question and actually asked about it; the answer was that copying the data once (from file to memory) is not an issue for them.

Besides, there is a potential benefit to copying the data: by copying, the programmers can guarantee that the data exists in memory (swap-outs are generally ignored in game programming), whereas a call to mmap has no such guarantee. Guaranteeing latency is important for games, and IMO copying data instead of mmapping it is an easy way to solve the issue.

EDIT: PS. I forgot to mention the potential advantage of @cabo's approach: applications can have partial knowledge of the type (i.e. a "time" type could be an annotation on "int"). My impression is that although the idea is elegant in design, I wonder if there would be any actual merit in having partial knowledge. If there is no such benefit, limiting type-tagging to binary data would not be a drawback.

Contributor

ugorji commented Mar 12, 2013

In general, I really like the new proposal. It's a very nice solution IMO. I really like the fact that applications can serialize their types however they want to, and are not limited by the msgpack spec.

I went through and implemented support for the new spec to see how things would work in practice, and make sure we're not missing anything.

The only concern I found was that extensions typically first need to serialize into a temp byte buffer, and then pass that to an underlying Encoder. This allows them to find the length of the encoded value. It seems there isn't any way around it.
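The pattern looks roughly like this (a sketch; byte values are placeholders, and the payload is assumed to fit a one-byte length):

    import io
    import struct

    # Extension encoding has to buffer the payload first so its length
    # is known before the header can be written.
    def pack_ext(subtype, encode_payload):
        buf = io.BytesIO()
        encode_payload(buf)     # application serializes into a temp buffer
        payload = buf.getvalue()
        return b'\xd6' + struct.pack('>B', len(payload)) + bytes([subtype]) + payload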

I would suggest the following:

  • Reserve a subset for future spec'ed-out extensions: 0 - 31 (0x00 - 0x1f). Applications can use 32 - 255 (0x20 - 0xff).
  • Keep Raw (don't change to String). This way, it's clear what is changing in the spec; spec changes are all additions and clarifications. We can just state that Raw is used strictly for UTF-8 strings and binary is encoded as an extension in the Application Profile, whereas Raw is ambiguous in the Basic Profile.
  • Can we pre-define time.Time as an extension? Right now, everyone implements time in their own way (like JSON) and it's not interoperable. (If we can't make this happen in a short time, that's OK. But we should try.)

To @Midar comments:
The benefit of UTF-16 over UTF-8 for Asian and many other languages is significant for data size (about 50% on average). However, the downsides are endianness, the BOM and supplementary-plane/surrogate pairs.

Having said that, I'm fine with going with just UTF-8. It was enough for me to put the issues on the table and for the team to make an informed decision.

Regarding using a double for timestampz, I was really talking about the need for storing the timezone. I just re-read what I wrote and admit that the first sentence was confusing and out of place (happens when I multi-task sometimes). Floats/doubles are generally OK (I referenced them continuously in notes/posts thereafter).

rasky commented Mar 12, 2013

@frsyuki version 3.5 looks better, thanks. Some comments:

  • I don't think it's clear enough that ambiguity-tolerant behavior is provided only for backward-compatibility. I think you should state more clearly that:
    • All new msgpack users MUST only use ambiguity-strict behavior. This is very important for interoperability.
    • Bindings must default to ambiguity-strict behavior after a transition time (up to them)
  • The option that you named "binary_extension" is confusing in my opinion. If I read it, I understand "do I need the binary extension", while you are overloading it with the meaning "do I want ambiguity-strict behaviour". I think that's wrong. New users don't need backward compatibility, so they will think this option only means "do I need the binary extension". Please rename it to something clearer. Call it "string_backward_compatibility" or something like that.
  • You wrote "Applications may agree that String represents byte arrays as well if they desire (ambiguity-tolerant behavior)". I think this is confusing again. I think it should be worded "Applications may need to load msgpack files in the old format, and they will be able to do it with the string_backward_compatibility option".

Midar commented Mar 12, 2013

@frsyuki The 3.5 proposal is better than the 3.0 proposal. However, I'm starting to get confused by all these proposals: there is no string 8 type, and it talks about string where before it talked about raw. Can we maybe have complete tables, like in @cabo's I-D, instead of diffs all the time? That would also allow easily diffing them against each other with the diff tool of choice.

Also, I think this is still not strict enough in saying that it should be UTF-8. Maybe we can use the same definitions of MUST, SHOULD, MAY etc. that the IETF uses and write them in uppercase?

"A string SHOULD be encoded in UTF-8 and a decoder MAY decode invalid UTF-8 as binary" for example would be quite clear. SHOULD means you must if you don't have good reasons - in this case backwards compatibility.

@kazuho Copying should not really be necessary, as there are MAP_POPULATE, __builtin_prefetch(), etc. to load the data while still avoiding the enormous overhead of copying it.

My impression is that although the idea is elegant in design, I wonder if there would be any actual merit in having partial knowledge.

A good example you brought up is dates: If your deserializer does not know about dates, it can still give you a double (or whatever we agree to use) and the application can do something useful with it.

@ugorji

In general, I really like the new proposal. It's a very nice solution IMO. I really like the fact that applications can serialize their types however they want to, and are not limited by the msgpack spec.

IMHO this is a very bad thing about it. There should be no room for interpretation if it is to be interoperable.

Keep Raw (don't change to String). This way, it's clear what is changing in the spec; spec changes are all additions and clarifications. We can just state that Raw is used strictly for UTF-8 strings and binary is encoded as an extension

This is way better, as there is no room for interpretation. It basically means "Hey, that was always intended as a UTF-8 string, but it was underspecified. If you are using binary data in it, you should migrate to the new extension types".

I was really talking about the need for storing the timezone

Why is the timezone necessary if we decide that all timestamps are in UTC?

Contributor

ugorji commented Mar 12, 2013

@Midar

In general, I really like the new proposal. It's a very nice solution IMO. I really like the fact that applications can serialize their types however they want to, and are not limited by the msgpack spec.

IMHO this is a very bad thing about it. There should be no room for interpretation if it should be interoperable.

The whole idea of extensions is for user-defined types to be supported, where applications can decide on theirs. For example, I have a bitset that is represented using a byte array in code:

    type Bitset []byte

However, storing this is not performant, because there are a lot of holes full of zeros. I can store it in my custom format where the holes are represented by a tagged count somehow; think of how compressors work, etc.
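A sketch of one possible payload encoding (purely illustrative; the run-length scheme is mine, not part of any proposal):

    # Run-length encode the zero holes: (0, count) compresses a run of
    # zero bytes, (1, value) carries a literal non-zero byte.
    def bitset_payload(bits):
        out, i = bytearray(), 0
        while i < len(bits):
            j = i
            while j < len(bits) and bits[j] == 0 and j - i < 255:
                j += 1
            if j > i:
                out += bytes([0, j - i])    # run of zeros
                i = j
            else:
                out += bytes([1, bits[i]])  # literal byte
                i += 1
        return bytes(out)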

That's what I like about it. An organization or group can agree to use msgpack as they like, defining extensions for their custom types. It's interoperable within their use case. I think that's why @frsyuki called it the "Application Profile".

The globally interoperable ones will be the basic types and the spec-defined extension types (currently only Binary, i.e. tag 0). That's why I asked @frsyuki to consider reserving 0-31 for spec additions of globally interoperable extensions (e.g. time).

Why is the timezone necessary if we decide that all timestamps are in UTC?

Regarding timezone, UTC just gives a global way of representing timezones by using an offset of UTC. For example, EST/EDT is UTC - 5, PST/PDT is UTC-8, etc. Thus, you can say EST or say UTC-5 as your time zone (loosely speaking). Many languages now have been adding support for timezone in their time representation, databases have added timestampz support, etc.

@frsyuki

About keeping Raw. I liked the v3.0 because it was clear that it was additions and clarifications, and that old format was in tact. I think you can resolve the issues raised by folks previously without changing the denotation as Raw.

Midar commented Mar 12, 2013

@ugorji

The whole idea of extensions is for user-defined types to be supported where applications can decide on theirs

But raw/string is not an extension and should be interoperable!

Regarding timezone, UTC just gives a global way of representing timezones by using an offset of UTC. For example, EST/EDT is UTC - 5, PST/PDT is UTC-8, etc. Thus, you can say EST or say UTC-5 as your time zone (loosely speaking). Many languages now have been adding support for timezone in their time representation, databases have added timestampz support, etc.

UTC is also a time. If you have a time in UTC, you can convert it to your local time. Thus I don't see a reason to store a time zone?!

Contributor

ugorji commented Mar 12, 2013

@Midar

But raw/string is not an extension and should be interoperable!

That's the idea behind requesting that we reserve 0-31 for spec-defined, globally-interoperable extensions. Currently, 0 is reserved.

UTC is also a time. If you have a time in UTC, you can convert it to your local time. Thus I don't see a reason to store a time zone?!

UTC is a time system (speaking loosely). The reason for timezones is not that you want to change the time to your local one, but that you are recording an aware time, defined as time and place. In this respect, UTC is just a reference for others to define timezones from (like the equator is a reference for defining latitude, as an analogy).

I had some links above on aware time, but you can google timezones and handling time for more information.

Contributor

kazuho commented Mar 13, 2013

@Midar

Copying should not really be necessary, as there are MAP_POPULATE, __builtin_prefetch(), etc. to load the data while still avoiding the enormous overhead of copying it.

You are right that copying is not necessary. What I am saying is that the game developers told me that copying once or twice at load time is not an issue for them, whereas the inflation in size is (25% in the case of typed arrays of floats vs. non-typed arrays). I do not know if all game developers agree on the idea, but to me the answer seems to have grounds.

Thank you for pointing out MAP_POPULATE (a Linux extension); I had only checked POSIX options. OTOH, I doubt that __builtin_prefetch() (of GCC) is useful in such a case, since it is defined as an advisory intrinsic, meaning that it would not cause any faults (i.e. no SEGV when an invalid address is given), and since the intrinsic may be compiled to / executed as a noop. But anyway, that's off-topic IMO, given that there are other ways to guarantee that mmapped data is loaded into memory.

My impression is that although the idea is elegant in design, I wonder if there would be any actual merit in having partial knowledge.

A good example you brought up is dates: If your deserializer does not know about dates, it can still give you a double (or whatever we agree to use) and the application can do something useful with it.

I disagree. Using partial knowledge means that you can do some calculations with the value (instead of treating it only as opaque data).
In the case of annotating an integral value as a DateTime type, its accuracy might be 1 second or 1 millisecond (the epochs might be different as well). So if you wanted to use the partial knowledge, e.g. to calculate a time difference, knowing only that the type is an integral value is not enough; you need the concrete type information. Partial knowledge does not help.

@ugorji

Reserve a subset for future spec'ed-out extensions: 0 - 31 (0x00 - 0x1f). Applications can use 32 - 255 (0x20 - 0xff).

There was a discussion regarding this issue before you came (starting from #128 (comment)). Personally, I agree with you that it would be wise to leave some tags as reserved. It would be great if you could take a look at the comments around the link and weigh in.

@ugorji @Midar

Can we pre-define time.Time as an extension? Right now, everyone implements time in their own way (like JSON) and it's not interoperable. (If we can't make this happen in a short time, that's OK. But we should try.)

I agree that introducing a DateTime type (as an extended type) would be beneficial for some. By defining it as an extended type, we would not need to spend too much effort reaching a single definition of the type; i.e. there could be a DateTime type fully compatible with JavaScript and another compatible with SQL92, should the interested parties disagree on a single definition of the type.

Contributor

ugorji commented Mar 13, 2013

@kazuho

I read the previous comments before I joined the thread. This latest proposal seems to put some stakes in the ground.

I had previously read the proposal as defining the tag as an integer in msgpack format. However, on further reading, @frsyuki said it's a single byte. If it's an integer, I'd suggest we reserve the [-32,0] range, and if a single byte, we reserve [0,31]. Either way is fine IMO.

Regarding time, this problem is mostly solved. Aware time is now typically represented precisely by seconds since the epoch, fractional seconds (in nanoseconds), and a timezone (e.g. EST, EDT, etc.). We've discussed different ways of representing it: floats, doubles, a tuple of 2 integers, etc., with a timezone string. It would be nice if it were specified, so serializers provide two extensions out of the box (binary and time). Applications can still define their own custom one if they want, but I think this will cover most of the ground that folks look for OOTB.

In supporting this spec in the msgpack Go library, I just serialized time as 2 integers and a string (which could be nil, to indicate UTC/GMT timezone) with extension tag 2.

rasky commented Mar 13, 2013

@ugorji why do you need a string for the timezone? Using a delta in minutes seems enough (e.g. +180, -90), and strings are harder to parse, require tables of constants, and so on. If the user wants to convey a human-readable representation of a timezone, he/she can surely use a separate string within the same msgpack, like "Eastern Zone" or "EDT". But for an interoperable timestamp specification, we need something that is more easily machine-parsable.

BTW: does your implementation support dates in the past, before the UNIX epoch? If not, I don't think the timezone is very useful. Basically, there are two different needs:

  1. Representation of a moment in history, bound to a place on the planet (date / time / timezone). This must also work for historical dates in the past.
  2. Representation of a monotonic machine timestamp to mark an event. This doesn't need to work for dates in the past, and doesn't need a timezone; UTC is more than enough, exactly like UTF-8 is more than enough for Unicode strings.

Having a representation that allows for timezones but not for dates in the past doesn't look right to me. I think that msgpack might be better off simply with the second (machine timestamp), so basically a 64-bit integer for the timestamp (let's not make it 32-bit, since we're getting closer to the overflow) plus a 32-bit integer for sub-second precision.

Midar commented Mar 13, 2013

@ugorji

UTC is a time system (speaking loosely). The reason for timezones is not that you want to change it to your local one, but that you are recording an aware time, defined as time and place. In this respect, UTC is just a reference for others to define timezones from (like the equator is a reference for defining latitude, as an analogy).

I had some links above on aware time, but you can google timezones and handling time for more information.

I don't see why you need to store the timezone in the date. You can store all dates in UTC and convert them to whatever timezone you want. If you want to specify what timezone is used by an application/user, that should be a different field and not part of the timestamp.

@kazuho

You are right that copying is not necessary. What I am saying is that the game developers told me that doing a copy once or twice at load time is not an issue for them, whereas the inflation of size is (25% in the case of typed arrays of floats vs. non-typed arrays). I do not know if all game developers agree on the idea, but to me the answer seems to have grounds.

Yes, having untyped arrays and a type before every float definitely is not suitable for this. But IMHO, having unaligned data is not suitable either. And aligned data is contrary to the concept of MsgPack to be as small as possible.

OTOH, I doubt that __builtin_prefetch() (of GCC) is useful in such a case, since it is defined as an advisory intrinsic, meaning that it would not cause any faults (i.e. no SEGV when an invalid address is given), and since the intrinsic may be compiled to / executed as a noop.

Usually, this generates an instruction, e.g. PREFETCH on x86 and PLD on ARM. This should be enough I think :).

I disagree. Using partial knowledge means that you can do some calculations with the value (instead of treating the value only as opaque data).
In the case of annotating an integral value as a DateTime type, its accuracy might be 1 second or 1 millisecond (the epochs might differ as well). So if you wanted to use the partial knowledge, e.g. to calculate a time difference, knowing only that the type is an integral value is not enough; you need the concrete type information. Partial knowledge does not help.

What I meant here is that the decoder is missing knowledge about the type of this field, but the application knows about this field. Now instead of having to patch the decoder, the application could just use this knowledge.

@ugorji

In supporting this spec in the msgpack Go library, I just serialized time as 2 integers and a string (which could be nil, to indicate UTC/GMT timezone) with extension tag 2.

I hope that by integer, you mean a 64-bit integer? Because with 32 bits, we will get a problem in 2038. This would be a step back from all other formats.

@rasky

I think that msgpack might be better off simply with the second (machine timestamp), so basically a 64-bit integer for the timestamp (let's not make it 32-bit, since we're getting closer to the overflow) plus a 32-bit integer for sub-second precision.

Agreed. This might have the advantage over a double that it works well on hardware without a floating-point unit. HOWEVER, I think we should make the second 32-bit integer (the nanoseconds, I assume?) optional, as often second-precision is more than enough.

Contributor

ugorji commented Mar 13, 2013

@rasky

Exactly. I agree with you on everything you have said, and this is the discussion we need to have.

You are right that using a delta in minutes (maybe even quarter-hours) should be adequate, and is the better model for representing the timezone. With quarter-hours, it will typically range from -12×4 to +14×4 (-48 to +56). We can even encode this as a positive fixnum by adding 48.

My position on time is that the timezone should be stored if provided by the app/language, and skipped if not. Using your idea, the time can be stored as 3 integers: int64, int32, fixnum.

Also, many languages and databases now support this data type (timestamptz in SQL, aware datetime in Python, etc.). Times in Go always have a timezone attached to them, and it would suck to lose those.
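To illustrate the quarter-hour idea in Go (a hypothetical helper; the range and the +48 shift follow the reasoning above):

  // encodeZoneOffset converts a UTC offset in seconds to a quarter-hour byte.
  // Offsets span -12h..+14h, i.e. -48..+56 quarter hours; adding 48 shifts
  // them into 0..104, which fits in a positive fixnum.
  func encodeZoneOffset(offsetSeconds int) (byte, bool) {
      q := offsetSeconds / (15 * 60)
      if q < -48 || q > 56 {
          return 0, false
      }
      return byte(q + 48), true
  }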

Contributor

ugorji commented Mar 13, 2013

@Midar

@rasky just explained above what a timezone represents. I think there is some reasoning for timezones that you are not aware of. And to your second question, I did. Time is typically int64 seconds and int32 nanoseconds.

@rasky

In thinking further, it may be best to set our reference time/epoch as January 1, year 1, 00:00:00.000000000 UTC (i.e. the first field is seconds since this epoch). For some explanation of why, I used the reasoning behind Go's internal time representation. See info here:
http://golang.org/src/pkg/time/time.go?s=24945:24982#L133

Midar commented Mar 13, 2013

@ugorji I still don't get why we need a timezone included in the date. What is the problem with having that as a separate type if you really care? You did not give an explanation of why it has to be included in the date type.

About using 01.01.0001 as the reference date, I strongly disagree. The world has agreed on using 01.01.1970 as the reference date, so using something else just causes confusion and unnecessary processing - most systems use 01.01.1970 as their internal reference date, so we should use it. If the designers of the Go language decided they need to be special, that's honestly their problem and should not affect MsgPack.

Contributor

ugorji commented Mar 13, 2013

@Midar

Fair enough.

It's fair to use the Unix Epoch as the reference date. I'm ok with that. That's what I've been pushing. I just wanted to put this on the table, with a justification that others who have thought long and hard about this problem have come around to.

IMO, Unix Epoch will work fine too, but I'd still like to think it through and read up more on time representation.

To your other question of why put in a timezone: it's because it's part of the time. A displayed time has a timezone in its representation, e.g. Mon Jan 2 15:04:05 MST 2006, or Mon Jan 02 15:04:05 -0700 2006. It's naive to believe that the only reason for that is to display it in the local server's or user's timezone. With rasky's idea of just storing the offset in quarter-hours as defined by http://en.wikipedia.org/wiki/List_of_time_zones_by_UTC_offset, it will take an extra byte (positive fixnum). For languages and applications that include this in their data type, we shouldn't just drop it.

Midar commented Mar 13, 2013

@ugorji Yes, but that's something that's only useful for displaying to the user, and thus I think not every date should contain it. I think we'd need 3 date types:

  • Seconds since epoch
  • Seconds since epoch + nanoseconds since the full second
  • Seconds since epoch + nanoseconds since the full second + time zone

Anyway, shouldn't we create a new ticket for this?

Contributor

ugorji commented Mar 13, 2013

@Midar agreed.

I think it's possible to use a single type for this, though. Since it's an extension, the extension can read one, two, or three integer values from the byte stream and decode appropriately; or we can treat it as either a single integer value or an array of 2 or 3 integer values.

I've created #130 for it.

kamipo commented Mar 13, 2013

Hi, I'm msgpack user.

I think v3 and v3.5 are very good proposals that take compatibility into consideration.

These proposals are highly compatible for msgpack users who use it like JSON.

I think I can choose behavior as follows:

  • If I can assume that existing data stores only valid UTF-8 encoded strings in raw, I can choose ambiguity-strict behavior for the deserializer.
  • Even if I choose ambiguity-strict behavior for the serializer, as long as no binary is stored, current deserializers can restore the data as before.

Does the v3.5 proposal not allow this assumption?

Is the Basic Profile not obsoleted in the v3.5 proposal?

My hope is that compatibility is carefully considered when new extensions are added.

Contributor

kazuho commented Mar 14, 2013

@ugorji

I read the previous comments before I joined the thread.

Please forgive me if I sounded like I was criticizing you for not reading the discussion. I just wanted to show some pointers related to the definition of extension types from this very long issue.

I had previously read the proposal as defining the tag as an integer in msgpack format. However, on further reading, @frsyuki said it's a single byte. If an integer, I'd suggest we reserve [-32,0] range, and if a single byte, we reserve [0,31]. Either way is fine IMO.

Although @frsyuki has not documented in his last proposal the details of how the extended type tags should be numbered, my interpretation of his comment (#128 (comment)) is as follows:

  1. tags are identified at API level using integers
  2. the format spec. would use one byte to represent the tag
  3. tags >= 0x80 will be reserved, so that if anybody ever comes up in need of many tags, the format spec. can be extended to support multi-byte tags using variable-length encoding
  4. serializers / deserializers that do not know how to handle multi-byte tags will continue to work correctly in case 3, since the succeeding bytes of the tag will be the first bytes of the data when seen from the older version of the spec.

It is tricky. But I think the design is fine considering the fact that no one can tell at this moment how many extended types would actually get used. Regarding the issue, checking #128 (comment) (esp. the last paragraph) might help.

My response to his design can be found in the comment next to @frsyuki's (#128 (comment)). I have the same position as you do in the idea that it might be clever to reserve some IDs.

@Midar

What I meant here is that the decoder is missing knowledge about the type of this field, but the application knows about this field. Now instead of having to patch the decoder, the application could just use this knowledge.

Thank you for the explanation. If that is the case, it would mean that the application has 100% knowledge of the data. Regardless of the underlying type (whether it is an integral value or opaque octets), the application would be able to handle the data. In other words, the case does not illustrate a merit of using type annotations that can be applied to any data type instead of using type tags only against octets.

For the rest of the discussion (regarding how MessagePack with type extensions could possibly be used for game graphics), I think I had better stop commenting on the issue now that IMO we have mostly reached agreement on the issues regarding the topic (thank you for making the discussion precise).

Midar commented Mar 14, 2013

@kazuho

If that is the case, it would mean that the application has 100% knowledge of the data. Regardless of the underlying type (whether it is an integral value or opaque octets), the application would be able to handle the data. In other words, the case does not illustrate a merit of using type annotations that can be applied to any data type instead of using type tags only against octets.

It does. One deserializer might know dates and deserializes them into a date object. Another deserializer does not know it, but the application knows it's a date and can convert it. This way, two implementations in different languages can exchange the data just fine, even though one deserializer is lacking dates.

Contributor

ugorji commented Mar 14, 2013

@kazuho I didn't think you were criticizing me. Sorry if my response came out sounding snippy. I also see now that I missed your reason for sending me to re-read those comments. Apologies again.

  • tags are identified at API level using integers
  • the format spec. would use one byte to represent the tag
  • tags >= 0x80 will be reserved, so that if anybody ever comes up in need of many tags, the format spec. can be extended to support multi-byte tags using variable-length encoding
  • serializers / deserializers that do not know how to handle multi-byte tags will continue to work correctly in case 3, since the succeeding bytes of the tag will be the first bytes of the data when seen from the older version of the spec.

I just implemented the extension Proof of Concept for Go. From this recent experience, the fleshed-out idea as you wrote it makes extensions hard. What about this, if we want to open up an unlimited number of application extensions.

  • tags are signed integers (+/- fixnum or int8..64).
  • Spec reserves a fixnum range of tags (either positive or negative fixnum depending on how many we want to reserve).
  • serializers/deserializers will work same way now or in the future without spec having to rev again.

If that is the case, it would mean that the application has 100% knowledge of the data. Regardless of the underlying type (whether it is an integral value or opaque octets), the application would be able to handle the data. In other words, the case does not illustrate a merit of using type annotations that can be applied to any data type instead of using type tags only against octets.

It does. One deserializer might know dates and deserializes them into a date object. Another deserializer does not know it, but the application knows it's a date and can convert it. This way, two implementations in different languages can exchange the data just fine, even though one deserializer is lacking dates.

This is a really good point. Let me explain how the Go library works in the proof of concept to show why extensions solve this elegantly.

The Go deserializer works in two modes:

  • schema-less: where it decodes into a generic value, and makes a best effort based on the stream
  • schema: where it decodes into a specific typed object, and uses type information to coerce the stream. E.g. if you pass a signed int to be decoded from an unsigned value in the stream, we decode it, ensure it doesn't overflow, and set it.
    Within here, we treated time specially: when we saw it, we looked for an integer or integer array in the stream.

In the new mode, the extension support solves this problem more elegantly by letting applications configure that a tag should be decoded into a specific type, so custom code for handling things like time, appengine/datastore.Key, etc. goes out of the deserializer. In the Go library, there isn't a binary extension or a time extension. There's just a generic way to configure that tags are mapped to a specific type, and when decoding, we decode into that type when we see the tag. We then provide functions for en/decoding time and binary, and customers can "explicitly" use them when initializing their decoder. The usage looks something like this:

  // create and configure options
  dopts := msgpack.NewDecoderOptions()
  eopts := msgpack.NewEncoderOptions()

  // configure extensions, to enable Binary and Time support for tags 0 and 1
  // this says that, when you see tag 0 or 1 in the stream, assume a binary or a time and call these guys
  dopts.AddExt(reflect.TypeOf([]byte(nil)), 0, msgpack.DecodeBinaryExt)
  dopts.AddExt(reflect.TypeOf(time.Time{}), 1, msgpack.DecodeTimeExt)
  // this says that, when you see a byte array or a time to encode, call these guys
  eopts.AddExt(reflect.TypeOf([]byte(nil)), 0, msgpack.EncodeBinaryExt)
  eopts.AddExt(reflect.TypeOf(time.Time{}), 1, msgpack.EncodeTimeExt)

  // use the encoder and decoder with these options
  dec := msgpack.NewDecoder(r, dopts)
  err := dec.Decode(&v)

  enc := msgpack.NewEncoder(w, eopts)
  err = enc.Encode(v)

This way, applications can use either schema or schema-less mode and still get appropriate data decoded, instead of getting something opaque and doing a second walk-through to possibly coerce.

Contributor

kazuho commented Mar 14, 2013

@ugorji

I just implemented the extension Proof of Concept for Go. From this recent experience, the fleshed-out idea as you wrote it makes extensions hard.

It shouldn't be hard, since all you need to do is:

  • in the deserializer, read a single byte as an unsigned int and return the value as the tag
  • in the serializer, assert that the given tag (unsigned int) is smaller than 256, and then emit it as a single byte
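A minimal sketch of those two steps in Go (assumed function names; a real binding would fold this into its reader/writer):

  import (
      "fmt"
      "io"
  )

  // readTag reads the single tag byte and exposes it as an unsigned int.
  func readTag(r io.ByteReader) (uint, error) {
      b, err := r.ReadByte()
      if err != nil {
          return 0, err
      }
      return uint(b), nil
  }

  // writeTag asserts that the tag fits in one byte and emits it.
  func writeTag(w io.ByteWriter, tag uint) error {
      if tag > 0xff {
          return fmt.Errorf("extension tag %d does not fit in a single byte", tag)
      }
      return w.WriteByte(byte(tag))
  }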

The point of the discussion about multi-byte tags is that such cases must not be neglected, since we might need to make such a change in the future; but the design should be kept as simple as possible.

What about this, if we want to open up an unlimited number of application extensions.

  • tags are signed integers (+/- fixnum or int8..64).
  • Spec reserves a fixnum range of tags (either positive or negative fixnum depending on how many we want to reserve).
  • serializers/deserializers will work same way now or in the future without spec having to rev again.

It is simple by design, but IMO harder to implement. In the case of msgpack-c, you would need to copy some parts of the existing implementation that decode the tags, since it is hard to add a guard that limits which tags may appear (please see https://github.com/msgpack/msgpack-c/blob/e96e20ccfddf6ee6a61add394fb96dbc97e75fc4/msgpack/unpack_template.h#L187).

The difference between the two approaches is subtle (whether to define the extension tags as single bytes that are represented as integral values of a larger type at the API level, or as a partial spec. of MsgPack data types). If I were to implement an existing format specification I would not mind doing it either way.

But at the moment of defining the spec., I prefer using the former (single byte at format level & integral value at API level).

@Midar

If that is the case, it would mean that the application has 100% knowledge of the data. Regardless of the underlying type (whether it is an integral value or opaque octets), the application would be able to handle the data. In other words, the case does not illustrate a merit of using type annotations that can be applied to any data type instead of using type tags only against octets.

It does. One deserializer might know dates and deserializes them into a date object. Another deserializer does not know it, but the application knows it's a date and can convert it. This way, two implementations in different languages can exchange the data just fine, even though one deserializer is lacking dates.

There is no difference between using type annotations on any type vs. using type tags only applicable to octets.

In the case of the type annotations that you just explained, the application is using the knowledge of how the DateTime object is serialized (that it is serialized as an integer, with a certain epoch, with a certain precision). Without such knowledge it is impossible to apply calculations to the integral value that represents a DateTime.

And that is exactly the same in the other approach (that uses type tags only applicable to octets). If the application knows how DateTime is encoded (that it is serialized as an integral value of a certain length and endianness, with a certain epoch, with a certain precision), the application can apply calculations.

@ugorji

Thank you for providing an example of how a MessagePack binding can provide extension points at the API level. IMO there has been almost no discussion about how the bindings should support type extensions, and your example is very elegant.

If you could push the binding you implemented I think it would help all of us in understanding the required changes more precisely. How does it sound? Thank you in advance.

Contributor

ugorji commented Mar 14, 2013

@kazuho

When I said it makes writing extensions hard, I was talking just about the multi-byte support, i.e. responding to your statement below:

serializers / deserializers that do not know how to handle multi-byte tags will continue to work correctly in case 3, since the succeeding bytes of the tag will be the first bytes of the data when seen from the older version of the spec.

It's nice if extensions get a byte array that represents the extension, without having to expect potential garbage in front. Maybe I'm talking based on the implementation I have, and how else I could support it. I have functions that can decode a signed int or just read a byte from the stream, so I can easily call either one if the type descriptor matches an extension type.

My main point is that if we think we want to support multi-byte tags in the future, we should do it now. It wouldn't bloat msgpack (since folks can still use single-bytes), but it allows the functionality without having to rev the spec again. Libraries may have to adjust code to handle it (I just went through this exercise with the Go library).

That's why I proposed specifying tags as signed integers (unbounded), and possibly reserving (a subset of) the positive fixnum range.

Regarding pushing the binding I have out, I've hesitated because I changed the API to make it simpler and cleaner. I wanted to somehow notify users before making the push. Let me read up on git and forks/branches/merges and all that, and see if we can do it soon with a "private" fork/branch/something (say tomorrow) (my git ability is somewhat limited). If folks have any ideas on how best to do this, please let me know.

Contributor

ugorji commented Mar 14, 2013

Did the git refresher. Uploaded the code bits with support for extensions on a dev branch.

See https://github.com/ugorji/go-msgpack/tree/dev

Please let me know your thoughts.

Contributor

kazuho commented Mar 15, 2013

@ugorji

Thank you for the clarification.

It's nice if extensions get a byte array that represents the extension, without having to expect potential garbage in front.

Even if we take the "single-byte tagging format" approach, extensions need not take care of potential garbage. Serializers / deserializers that only support single-byte tagging can simply reject extensions to be registered above ID 127. And if we ever find out that 128 extension IDs (including binary (id=0)) were not enough, we can introduce variable-length-encoding at the MessagePack binding level, and let the binding strip the variable-length tag and pass only the data to the extension.
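A sketch of that registration guard in Go (hypothetical names, mirroring the limit described above):

  import "fmt"

  // registerExt rejects tag IDs that a single-byte binding reserves for the future.
  func registerExt(registry map[uint]func([]byte) (interface{}, error),
      tag uint, fn func([]byte) (interface{}, error)) error {
      if tag > 127 {
          return fmt.Errorf("tag %d: IDs above 127 are reserved for future multi-byte tagging", tag)
      }
      registry[tag] = fn
      return nil
  }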

My main point is that if we think we want to support multi-byte tags in the future, we should do it now. It wouldn't bloat msgpack (since folks can still use single-bytes), but it allows the functionality without having to rev the spec again. Libraries may have to adjust code to handle it (I just went through this exercise with the Go library).

I mostly agree. The reason I prefer introducing only single-byte tagging, even though I mostly agree with you, is that IMO it is unlikely we will ever need more than 128 tagging points.

I expect that most (if not all) of the extensions will be application-specific, meaning that the tags may overlap between different applications. DateTime might get defined as a global tag (i.e. an application-independent tag that every MessagePack binding should support) but I do not think there will be tens of such types that are required globally.

And even if we were to introduce support for multi-byte tags at this moment, I still think using a variable-length encoding separate from the current MessagePack tagging is the way to go in terms of simplicity.

Regarding pushing the binding I have out, I've hesitated because I changed the API to make it simpler and cleaner. I wanted to somehow notify users before making the push. Let me read up on git and forks/branches/merges and all that, and see if we can do it soon with a "private" fork/branch/something (say tomorrow) (my git ability is somewhat limited). If folks have any ideas on how best to do this, please let me know.

Thank you for considering publishing the code. If you could push the code under a branch named something like "experiments/msgpack-next-draft3.5+ugorji1" (or whatever) then IMO people would not get confused.

Contributor

ugorji commented Mar 15, 2013

Thanks @kazuho

P.S. See the go-msgpack dev branch: https://github.com/ugorji/go-msgpack/tree/dev

I get exactly where you are coming from. I generally agree with you, and am fine either way.

Taking it further, I think we should make a binding decision on it sooner than later, and not have to revisit it at another time. We really have 3 options:

  1. Option A: One byte (range 0-0xff). Reserve 0x00-0x1f. Applications free to use 0x20-0xff.
  2. Option B: Multi-Byte decoded as a signed integer. Reserve negative fixnum [-32,-1]. Keep all else available.
  3. Option C: Multi-Byte. Reserve Positive Fixnum [0,127]. Negative fixnum or signed available for applications.

My preference is Option A, or Option B. Both reserve 32 slots for spec-defined extensions (we currently only really have high global demand for 2). And customers have 224 or 128 slots for extension tags using single byte.

If we think multi-byte support may come later, we should handle it now (option B), so we don't have to rev again. It still gives us 32 reserved and gives customers 128 available with a single byte, while allowing customers access to many more in the future.

Yes, libraries will have to change to support it, but that's a one-time cost.

Contributor

kazuho commented Mar 15, 2013

@ugorji

Thank you for publishing the code. I am eager to look into it.

Regarding the tagging format, I think as follows.

If we are fine with single-byte tagging, then why shouldn't we reserve more of the range, so that we can introduce multi-byte tagging in case some applications come up with a demand for more extension slots? For example, how about the option below:

  1. Option D: One byte (range 0-0xFF). Reserve 0x00,0x40-0xFF. Applications may use 0x01-0x3F.

If you think having 63 slots for application-defined types is not enough, then IMO you should rally for introducing multi-byte tagging, since I think it is hard at this moment to make accurate predictions of how many types an application might need to define (what I am trying to say is that if we can reach a consensus that applications would generally need at most ten or twenty tags, then using the single-byte notation would be fine; if we think that some applications would require hundreds of extension types, we should introduce multi-byte tagging).
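Under Option D, the guard an implementation would apply is trivial; a sketch:

  // appTagAllowed reports whether an application may use the tag under Option D:
  // 0x00 and 0x40-0xff are reserved, applications get 0x01-0x3f.
  func appTagAllowed(tag byte) bool {
      return tag >= 0x01 && tag <= 0x3f
  }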

Contributor

ugorji commented Mar 15, 2013

@kazuho

I see where you are coming from with your previous proposal, i.e. 0x00-0x7f shared between applications and reserved, 0x80 to 0xff unused. If we need to extend later, we leverage the fact that the most significant bit was not used before.

I don't much like it, because it tries to be too clever, and punts on making a decision.

I don't know either how many slots will be needed by users; I don't have any experience or knowledge there really to make any kind of informed recommendation.

Given the options:

  1. Single Byte (Option A or D). Shared however we deem appropriate
  2. Signed Integer value (Option B or C). Using +/- fixnum or int8...64 in msgpack decoded format.

If we're not sure, we should just go with the Signed Integer option, since it gives us the best of both worlds, with the only cost being that the design of some serializers may have to change (a one-time cost). Most users will be fine with only one byte (since the fixnums give us 160 slots to share), but they reserve the option to use more if they want.

Contributor

kazuho commented Mar 18, 2013

@ugorji

I don't much like it, because it tries to be too clever, and punts on making a decision.

I disagree. I think that adding features that might never be used is a bad idea (esp. in a case where we can later extend the format without sacrificing compatibility).

And even if we agree to support multi-byte tagging, the representation of the tags should not use the MessagePack integer representations. Only one representation should be available for each tag. Adding another ambiguity now (in how the encoded representation should look) is not a good thing IMO, esp. since we might need to discuss canonical forms in the near future (see @frsyuki's latest proposal).

PS. So I would prefer Option D, or in the case of supporting multi-byte tags, IMO some kind of variable-length encoding (that is unambiguous in terms of canonicalization) should be used.

Contributor

ugorji commented Mar 18, 2013

@kazuho

Okay. I'm struggling to see why we wouldn't leverage the compact variable-length msgpack representation of integers for this, but maybe there's some wisdom in there that I'm missing.

Assuming there is, and Msgpack's philosophy is that the format does not "recursively" depend on itself, then Option D is a good solution and I'm fine with it.

@frsyuki time for an update to the spec?

Contributor

ugorji commented Mar 18, 2013

P.S. For Option D, can we make it so that applications use 0x40-0x7f instead, for starters?

Owner

frsyuki commented Mar 19, 2013

Let me explain how I use MessagePack. I just think examples should help us reach a better design.

distributed log analysis system

This system stores huge amounts of data so that users can run SQL queries on the data using Hadoop/Hive. It stores hundreds of TBs of data in MessagePack format.

The efficiency of reading data is much more important compared to that of writing data. Thus files should be compressed and should not have paddings. If it needs to update data, it recreates the whole file (meaning that files are immutable). This technique, named Log-Structured Merge-Tree, is widely known in the analytical database area.
It doesn't need to align values either, because it reads data from remote storage in streaming fashion (taking advantage of msgpack's streaming deserializer). Note that remote shared memory/SSD is faster than a local HDD. Reading from local disks is not always the ideal method.

Data itself is dynamically typed (a value has type information but a sequence of values doesn't have fixed type information) (taking advantage of msgpack's type system) because the schema of logs can change. MongoDB shows that dynamic typing in a DB is useful.
On the other hand, the query engine (SQL) is statically typed. But it's highly expected that users change the SQL schema, although it's impossible to physically convert all stored data because of its size. Thus the system uses the SQL schema only for reading data, to project the types of msgpack data onto the types of SQL. This technique is known as schema-on-read. I can't use Thrift or Protocol Buffers for this case because they need a schema to write data.

This system itself should not let deserializers validate strings, because it handles them programmatically (it may need to project strings into binaries).

The system is written in Java and Ruby (the system itself is written in Java and uses the above type-projection layer. Other tools written in Ruby just read data as-is without a schema).

distributed storage systems with HA (Kumofs, LS4)

One focuses on many small metadata entries for a startup company providing a photo-sharing web service, and the other focuses on a few large BLOBs like HD photos (I'm not engaged with these projects any more). I used MessagePack for the network protocol.

High availability is the most important feature of both. They can upgrade the software version without downtime by restarting nodes one by one (this technique is known as rolling upgrade).
The receivers use something like a schema (a programmable data-projection layer). A point here is that the data format and the schema could be inconsistent, because the sender could be old. The receiver always converts msgpack data into the internal data representation.
Exchanging software versions during protocol negotiation doesn't work well because it easily causes a combination explosion (note: rolling upgrades need both backward and forward compatibility).

The receivers should not care about the type information on strings or binaries if they represent the same byte sequence, because the type information might change. Instead, they need to convert data from msgpack into the internal data representation every time on read. The conversion layer built on top of msgpack takes care of the string/binary issue.

These systems are written in a mix of Ruby and C++.

log collection system (Fluentd)

This system collects logs from many servers (like syslogd). It uses JSON for the frontend interface and msgpack for the internal data representation. It's completely schema-less. One of the biggest users deploys this system on thousands of servers.

The system needs to handle broken data carefully because logs are often broken. But it should never reject broken logs, because applications send logs asynchronously. It needs to send or store the broken data somewhere so that users can manually handle it later.

This system needs to communicate with clients written in strong/weak-string code at the same time, because data sources include apps written in PHP, Ruby, Python, ObjC, etc., and also embedded/sensor devices whose software is written in C.

Distinguishing strings from binaries helps in this case if all clients take care of it appropriately. But this system still needs an option to extract strings/binaries without validation, as ambiguous data, because it has to handle broken data programmatically, whether clients send it intentionally or not.

A common issue is that receivers can't trust that the senders give the expected type information. Thus the receivers need an option to handle strings/binaries as ambiguous data.
I agree that many applications want senders to distinguish strings from binaries explicitly instead of pushing the matter to the receiver.

Owner

frsyuki commented Mar 19, 2013

@kazuho @ugorji

I agree with Option D.

In the case of supporting multi-byte tags, (a) variable-byte coding is one option, and (b) another option is using 0xff, 0xfe, 0xfd, … to introduce 2-byte, 3-byte, 4-byte, … type tags.

Contributor

ugorji commented Mar 19, 2013

@frsyuki @kazuho

Option D.

Let me try to spell out what Option D would look like (based on previous comments from @kazuho and, I think, @frsyuki).

Proposal:

  • We support one or two bytes for tags.
  • With one byte, values are 0x00 - 0x7f. 0x00 - 0x3f (64 slots) is reserved for spec extensions.
    Applications can use 0x40 - 0x7f (64 slots).
    In this mode, the most significant bit is always 0 (which says it's a one-byte tag).
  • With two bytes, values are 0x0000 - 0x7fff (about 32,760 more slots for applications).
    In this mode, the most significant bit is always 1.
    We flip that bit, and use the 2 bytes to get an int16 number which represents the tag.

Key takeaways:

  • The tag is effectively a non-negative int16 value.
  • The most significant bit (sign bit) tells us whether it's a one- or two-byte tag.
  • The range of tags is 0 - 63 for spec interoperable extensions, and 64 - 32,767 for private application extensions.
  • Applications understand that by using tags in the range 64 - 127, they get a more compact encoding of their extension tags.
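To make the wire behavior concrete, a Go sketch of reading such a tag (an assumed helper, not part of any library):

  import "io"

  // readExtTag reads a one- or two-byte extension tag per the proposal above:
  // if the MSB of the first byte is 0, that byte itself is the tag (0x00-0x7f);
  // otherwise the MSB is cleared and a second byte completes an int16 tag.
  func readExtTag(r io.ByteReader) (int16, error) {
      b0, err := r.ReadByte()
      if err != nil {
          return 0, err
      }
      if b0&0x80 == 0 {
          return int16(b0), nil
      }
      b1, err := r.ReadByte()
      if err != nil {
          return 0, err
      }
      return int16(b0&0x7f)<<8 | int16(b1), nil
  }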

Thoughts? Does this look like something we can agree on for now, so we have a proposal we can start testing out among libraries?

Contributor

kazuho commented Mar 22, 2013

@ugorji

Okay. I'm struggling to see why we wouldn't leverage the compact variable-length msgpack representation of integers for this, but maybe there's some wisdom in there that I'm missing.

Providing more than one way to encode a single piece of information is generally a bad idea. http://blogs.msdn.com/b/shawnste/archive/2005/03/24/utf8-security-and-whidbey-changes.aspx might be a good article on this; it says,

The general understanding with any Encoding is that if there are multiple forms of encoding a character, then the likelihood that insecure data can disrupt or find security problems with software increases.

There are security flaws other than UTF-8's that originate from ambiguous protocol design (or vague conformance to the protocol); to name a few, Internet Explorer's loose HTML decoding (how it handles backquotes) and HTTP response smuggling would be good examples.

That said, it is a trade-off. So if there is any good reason to use a loose definition (allowing multiple encoding styles for a single piece of information) we should do so. But in the case of the tags, I do not think there is a good enough reason. As I wrote in #128 (comment), it would be harder to implement than a variable-length encoding in some bindings.

@frsyuki @ugorji
I am fine with @frsyuki's #128 (comment)

Owner

frsyuki commented Mar 25, 2013

First, I think we don't have to define two-or-more-byte tags now, although we need to be prepared just in case.
Because if I were a developer of an application that needs to use hundreds of application-specific types, I'd build the type system within the payload of the Extension type, using only one type tag to say that the payload contains the application's type system.

Second, variable-byte coding is good for representing big integers. But in my opinion, it's more important for security and performance that deserializers can learn the number of following bytes by reading a small, fixed number of bytes.

Third, I think the Binary type is a special type which is more important than the other possible types. Some implementations, such as Java's, provide a method to peek at the next type. This method is necessary for schema processors (= the type-projection layer described at #128 (comment)) to convert a MessagePack type into a certain class without creating intermediate objects (Value objects).
Thus it's better to make it possible for deserializers to know whether the next object is of Binary type.

Fourth, applications want to assign numbers from 0.

Thus here I propose option E: https://gist.github.com/frsyuki/5235364

cabo commented Mar 25, 2013

A couple of quick comments to the most recent gist https://gist.github.com/frsyuki/5235364 while I'm still digesting it:

  • There seem to be conflicting statements about type tag -1: is this the binary type tag, or the initial type byte of the multi-byte type-tag extension?
  • The multi-byte type tag thing can be added at any time. Anyone could define tag type 53 to mean "use the next byte to define the actual tag type based on a list that is bound to the initial type byte 53", type 54 to do the same for another list etc. Defining now how to make multi-byte type tags just makes it easier for APIs to offer the information.
  • The profile idea is interesting, but probably more text is needed before I really can comment on that. Is there a relationship between profiles and "binary_extension=true/false"?

Midar commented Mar 25, 2013

@frsyuki Your new proposal looks good, though it needs some work on clarification IMHO. ExtType -1 is quite confusing. It took me some time to understand that this is the ExtType with the type set to -1 and thus just binary. Maybe we should define binary at the start as extension type -1 and then note that ExtType -1 exists as a shortcut. And what's even more confusing: at one point it says -1 / 0xFF is forbidden, but there's even a special type for -1!

Another thing that's a downgrade from your first spec is that there is no raw 8 type. Seeing as we still have a free code point, can we add that please? Strings shorter than 256 bytes but longer than 31 bytes are not too uncommon.

Contributor

kazuho commented Mar 26, 2013

I agree with @cabo (with some extra comments stated below).

Regarding the profiles, it would be great to have a profile that limits the keys of maps to strings (since many scripting languages have such a limitation on their built-in hash types).

@frsyuki
Anyway, is it necessary to include the discussion regarding profiles within the format specification? IMO it would be possible (and easier) to handle the issue separately.

Contributor

ugorji commented Apr 4, 2013

Following up on this.

I read the spec. It looks good. This is a re-iteration of some concerns raised by @Midar, @cabo, and @kazuho.

Like others mentioned, there's some inconsistency and confusion around the ext type -1 description. You mention that type -1 is reserved for a possible multi-byte ext type in the future. That is in conflict with -1 being the binary type. Can we just:

  • call them "ext 8 binary", etc.
  • state that negative values will be used for predefined extensions and future expansion to multi-byte types
  • define an order of assigning these predefined types, e.g. from -127, -126, ...

Can we remove discussion of Profiles from here? It gets confusing and starts us down a rabbit hole that will delay getting closure on the basics so we can move forward. For example, my Go library uses extensions to define types which cannot be encoded/decoded in a standard way by altering member variables, e.g. time.Time (which has all private member variables), library types which the user doesn't control and which require calling some non-standard API to modify the contents, etc. At a minimum, we can just state that encoders should support the ability to turn off writing the new types (String, Binary, extensions). And we can make discussion of Profiles separate from the v2.0 proposal.

Also, it would be nice to get some agreement and movement on this while we're all still engaged. It would be a shame if this ends up not going anywhere.

Contributor

ugorji commented Apr 17, 2013

@frsyuki PING

Owner

frsyuki commented Apr 19, 2013

@ugorji Thank you for the PING and sorry for being so late. Hopefully I can find time this weekend (or next week...).

Owner

frsyuki commented Apr 22, 2013

Finally. Thank you for waiting...

My major intention in v4 was to define the binary types as a part of the ext type so that we can easily keep existing serialized data and implementations compatible with the new spec. I mean that if the binary type is a part of the ext type, the change to the spec is only the addition of the ext type. Thus we can reinterpret existing implementations and serialized data as just using the "primitive profile" in the new spec. Nothing will be obsoleted with this idea.

However, as you commented, it makes the spec complicated. The reason why I created the "ext type=-1" formats is that it's better if deserializers can tell that the next object is of binary type by reading 1 byte. However, it makes it difficult to keep the binary type as a part of the ext type.

So, I changed my mind. The MessagePack spec should be as simple as possible, and it simply adds a binary type. It could be painful to upgrade, but the format will be simpler (I'm sorry I took so long to reach this idea). And I added the string 8 format again. Adding a new format to an existing type may also cause compatibility problems, because existing implementations can't read the value while new implementations will automatically use the format. But I agree that it's better to have it.

Here is v5: https://gist.github.com/frsyuki/5432559

  • I took the idea from v4 that ext type < 0 is reserved and >= 0 is for application-specific types. I think this is a good idea.
  • For the multi-byte type tag of the ext type, I removed the description from the main body and mentioned it in the "ext format family" section.
  • I removed the "binary_extension" description and added a description of "compatibility mode" instead, in the implementation guidelines.
  • I reduced the description of profiles and moved it to the "Future discussion" section.
  • v5 has fixext 1/2/4/8/16 instead of fixext 0/1/2/3/4. I thought 1/2/4/8/16 is better for most cases but I don't have a strong opinion about it for now.
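For reference, a Go sketch of how a decoder might dispatch on the fixext family (the code points shown are those that ended up in the released spec; treat them as an assumption when reading the v5 gist):

  // fixextDataLen maps a fixext first byte to its fixed payload length.
  // Each format is followed by a signed type byte and exactly that many data bytes.
  func fixextDataLen(b byte) (int, bool) {
      switch b {
      case 0xd4: // fixext 1
          return 1, true
      case 0xd5: // fixext 2
          return 2, true
      case 0xd6: // fixext 4
          return 4, true
      case 0xd7: // fixext 8
          return 8, true
      case 0xd8: // fixext 16
          return 16, true
      }
      return 0, false
  }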
Contributor

ugorji commented Apr 22, 2013

Thanks @frsyuki

Looks good. Can folks do a quick vote, with caveats?

  • My vote: __ Yay __
  • Caveat:
    • Consider taking out fixext 1 and adding fixext 32.
      This will give more symmetry with the other fixXXX types, providing support for up to 31/32 elements.
Contributor

ugorji commented May 30, 2013

Just wanted to share with the group what I have been working on.

I have just released Binc, a lightweight, compact, limitless, schema-free, precise, binary, high-performance, feature-rich, language-independent, multi-domain, extensible data interchange format for structured data.

Along with the public release of Binc, I have released a High Performance and Feature-Rich Idiomatic Go Library providing encode/decode support for different serialization formats, including msgpack and binc. It includes support for user-defined extensions, even if the format doesn't natively support them. It also includes Go net/rpc codecs for use in RPC communication. This is a replacement for the de-facto Go msgpack library which I wrote at https://github.com/ugorji/go-msgpack (NOW DEPRECATED).

Please feel free to participate either in the G+ discussion or here. I would love to get your thoughts.

Thanks.

Midar commented Jun 15, 2013

@frsyuki I had a short look at it and it looks good. Then again, I'm way too tired to do a real review ;).

@ugorji I think it's a really bad decision to create a new format now that MsgPack is developing in the way you wanted. Especially seeing that you added support for UTF-16 and UTF-32 is sad. This is a step backwards! There is absolutely no reason whatsoever why you would ever want UTF-32 in a binary format that wants to be compact. You waste at minimum 11 bits per codepoint‼ For characters not using all 21 bits of Unicode (read: in practice all), it's even more. And to make matters worse, you even have BE and LE.

Contributor

ugorji commented Jun 15, 2013

@Midar I appreciate your sentiment, but I think it's unfair. Our timelines were not aligned, progress was too slow for me, and I had to move on. Simple as that. I'd still be involved w/ msgpack as progress happens, but my project cannot wait on it, and I had to look at alternatives.

You should have responded within the last 2 months.

Midar commented Jun 15, 2013

Actually, I do not think it's unfair. IIRC, you were among those who complained that @cabo created BinaryPack, but now you did the same. I do understand the need to have something now, which is why I implemented BinaryPack for now and will switch back to MessagePack once these issues are sorted out. You could have used @frsyuki's latest proposal and used your own custom types for dates until those are standardized.

I'm not exactly sure why I should have responded within the last 2 months if it's @frsyuki who makes the decision. I thought I had already given enough input before.

Contributor

ugorji commented Jun 15, 2013

I was concerned because @cabo was pushing msgpack into the standardization committee while the community was actively engaged in incorporating some changes.

I created Binc because msgpack progress had stalled for about a month and a half, and it was affecting my project deliverables. So I moved on.

Moving on is never a bad decision. I clearly exercised a lot of patience, as the last set of comments shows. I can only wait on a stalled project for so long. If there had been another encoding format that gave me what I wanted, I would have taken it. Since none existed, I created one with a different philosophy and open-sourced it for comments and for use by others if they wish. It's not a big deal.

Midar commented Jun 15, 2013

Well, he was not pushing MsgPack as a standard, he was pushing BinaryPack. This is exactly the same as if you tried to make your Binc a standard. He did so because MsgPack stalled. So, how is Binc different from BinaryPack now? And @cabo was even open to adding time, so instead of creating something completely new, you could have tried to enhance BinaryPack. And as you already saw, the changes from BinaryPack went back into MessagePack and @frsyuki is open to them.

Contributor

ugorji commented Jun 16, 2013

@Midar,

My one and only response to @cabo pushing BinaryPack into a standard.
#129 (comment)

Per binarypack, and I quote from http://tools.ietf.org/html/draft-bormann-apparea-bpack-00
BinaryPack is a minor derivation of MessagePack that was developed by Eric Zhang for the binaryjs project. ... This draft tries to be faithful to the successful MessagePack format, with the exception of enabling the distinction between opaque binary byte strings and UTF-8 byte strings, as introduced in the binaryjs project.

Like I said earlier, 2 months in my project lifetime is a long time, and I can't wait around pinging folks for responses and getting nothing, both privately and on the public issue tracker. I had to move on.

In starting from scratch, I designed Binc around a different philosophy; it tries to be comprehensive as opposed to a lowest common denominator (as seen in JSON/msgpack/etc.).

We shouldn't hijack this thread to discuss Binc issues. Feel free to comment on the Binc site, and I'd be happy to follow up there at length.

Any updates on the proposal? Will there still be further revisions before they are implemented?

@frsyuki PING

Owner

frsyuki commented Jun 19, 2013

Next step is an extensive announcement. I want to make sure all msgpack developers at least know about the new spec, so that some of them can start implementation.
So, I'm now working on a renewal of the msgpack.org website. I expect that all msgpack users will notice something is changing. Preview: http://gyazo.com/211920d93923df49a279e8fd40d171ca
(the background color of the "Next" section is actually very difficult...)
I want to release it this month, hopefully.

Midar commented Jun 19, 2013

@frsyuki Does this mean https://gist.github.com/frsyuki/5432559 is safe to implement? Is this final? If so, I would go ahead to implement it.

Owner

frsyuki commented Jun 19, 2013

@Midar We can start implementation. I believe the spec will be final, because we discussed it very, very deeply.

Midar commented Jun 19, 2013

@frsyuki There's a bug: str 32 and array 32 are both 0xDB. Shouldn't array 32 be 0xDD?
Also, the type in ext is defined to be unsigned, even though you later state that type < 0 is reserved. Just change that to two's-complement signed?

Owner

frsyuki commented Jun 19, 2013

@Midar Oh, thank you! array 32 is 0xDD.
Which line defines the type in ext as unsigned? It should be signed.

Midar commented Jun 19, 2013

@frsyuki Never mind. I confused the type with how XXXXXXXX is defined. But XXXXXXXX is the length, so it's ok.
I implemented the new spec now, minus extensions. Maybe we should add something on how to handle unknown extensions? Like a container class that stores the type and the rest as data? I think this is important so that you can deserialize data and serialize it again without losing extensions.

Owner

frsyuki commented Jun 19, 2013

@Midar My idea is to use a container class to store an integer (type) and a byte array. In Ruby, it will be like:

class MessagePack::Extended
    def initialize(type, data)
      @type = type
      @data = data
    end

    attr_reader :type, :data

    def to_msgpack(packer)
      packer.buffer.write ...do serialize....
    end
end
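For comparison, the same container idea in Go (a hypothetical type, not taken from any existing binding):

  // Extended preserves an unknown extension so it can be re-serialized losslessly.
  type Extended struct {
      Type int8   // the ext type tag read from the stream
      Data []byte // the raw payload, kept opaque
  }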

Midar commented Jun 19, 2013

@frsyuki Yes, that's exactly what I meant. Ok, will go ahead, thanks!

Midar commented Jun 19, 2013

@frsyuki Can we maybe add that to the spec as well?
@cabo Will you bring this to the IETF?

Midar commented Jun 19, 2013

@frsyuki Another bug: fixext 1 does not have a data byte in the diagram

Owner

frsyuki commented Jun 19, 2013

@Midar ooh, fixed. thank you!

Midar commented Jun 20, 2013

@frsyuki

In a minor release, deserializers support the bin format family and str 8 format. The type of deserialized objects should be same with raw 16 (== str 16) or raw 32 (== str 32)

Shouldn't that also include ext so the parsers don't choke on it? The idea is that libraries update the deserializer to support the new format in a minor version, and then the next major version also includes a serializer that outputs the new format, right?

Btw, I implemented ext now. Is there any other implementation I can test it against? I tested it against myself and that worked, of course ;).

Here is a test file containing all ext types (I tested deserializing and serializing it again and compared that both are equal): https://webkeks.org/all_ext.msgpack

Midar commented Jun 26, 2013

@frsyuki Any ETA on when you will make the new spec official?

Anybody else: have you already implemented it and checked my test file?

@Midar These are the results when I unpack your test file.

Midar commented Jun 27, 2013

@ramonmaruko This looks good. It seems I accidentally chose type 6 for the 64K ext. I guess the last two were just cut from your paste?

For which language is that and is it already committed?

@Midar Yes, it looks better when viewed as raw. This is for msgpack-ruby, and it hasn't been merged yet. I'm waiting for @frsyuki to review it.

Midar commented Jul 9, 2013

@frsyuki Any idea when you will release the new spec?

Member

repeatedly commented Jul 11, 2013

In Japan, a msgpack hackathon will be held.

http://www.zusaar.com/event/881006

After that, D, Ruby, C++, Java, Scala, Erlang, and other implementations will support the new spec.

Hi. I'm currently implementing my own object serialisation protocol on top of MessagePack (old spec, in Java), and did not notice this new spec until now. I skimmed the spec and searched this discussion, but could not find an answer to the most important issue I see in using MessagePack efficiently; possibly I used the wrong search terms. My question is this:

Is there any way, using the new spec, to define unknown-size maps, arrays, or raws (now also strings)?

To put it simply, when you start turning something into bytes, say using Java's serialisation to serialise a "java.util.Random" instance (hidden internals prevent using MP directly), you do not know how many bytes it needs, but it's wasteful to allocate a temporary byte array just to copy it later into the MP stream and throw it away.

And for maps of unknown size, there is also an obvious example: let's say I want to encode my object kind of like in Protocol Buffers; I create a map of unknown size, I go over each "property", and if it is not default, I create a new map entry with the property ID and value. There is no way to know how big the map will be ahead of time, except by going twice over all properties.

I'm not sure how this could be done for raw/string, but the obvious way to implement it for maps/arrays is to use a marker value, and nil would be perfect for that (or maybe that one unused code?)

So, is this possible with the new spec?

Midar commented Jul 17, 2013

@skunkiferous Nope. Also, it would make parsing much more expensive, as a dynamic buffer has to grow. The overhead on the serializer for copying would be much lower than the overhead of parsing your proposed format. It's one copy on the serializer, but n copies on the parser! Because what realloc() does (and even in Java, at some point it comes down to this) is check if there's enough space in the buffer and, if not, allocate a new buffer and copy it. So, from one useless copy, you just went to n useless copies. That's hardly an improvement.

PS: Yes, you can reduce it to log(n) copies by wasting some space. Still, that's more than 1 copy.

Hi, when can we expect to see the proposed standard https://gist.github.com/frsyuki/5432559 implemented in the C/C++ msgpack tools?


Member

kuenishi commented Aug 17, 2013

I am closing this long thread because we updated the spec. If you want more discussion, please open a new pull request against the spec file. Yay and goodbye!! :trollface:

kuenishi closed this Aug 17, 2013

Midar commented Aug 17, 2013

I think closing this is too early, as the new spec is not on the homepage yet; there's only a link to this discussion and proposal v5.

Member

kuenishi commented Aug 17, 2013

The new spec is now on the homepage and in this repository. The "next" link will soon be updated to "New spec has been come and update yours!"
