
Discussions on the upcoming MessagePack spec that adds the string type to the protocol. #128

Closed
kiyoto opened this issue Feb 24, 2013 · 220 comments


kiyoto commented Feb 24, 2013

This issue continues the discussion that started in this issue, which has grown interminably long.

Here's to a fruitful, invigorating, productive thread! Hooray, chums!


kazuho commented Feb 25, 2013

@cabo

Thank you for responding and for explaining the possibilities, the things that might or might not happen at the IETF. It helps existing users a lot to understand what would happen in terms of compatibility once the specification is brought to the IETF.

From the tone of your writing, I can tell that you are trying to be as precise as possible rather than trying to mislead (I do not mean to offend you by saying this; I only mean that since some people do such things, it makes your words all the more trustworthy to me). I really appreciate it.

@frsyuki

It seems that even if we succeed in agreeing on how to introduce a string type to MessagePack, there will be continuing pressure to add more and more types.

The problem with the current MessagePack is that it does not define a solid way to extend the protocol without sacrificing backwards compatibility. Unless we resolve this issue now, it is likely that we will have another protocol update that breaks compatibility (i.e. old decoders refusing to work since they cannot decode a newly introduced type).

So I hereby propose to slightly modify @frsyuki's proposal: https://gist.github.com/frsyuki/5022569

Instead of adding a "binary" type, I would like to propose adding a blob with a type tag (hereafter referred to as the "Extended" type) that can be used to indicate the type of the payload.

What I basically mean is that once we succeed in defining a string type, we should start using a type-length-value design (http://en.wikipedia.org/wiki/Type-length-value) for any other types to be added. The merit of the type-length-value approach is that applications can preserve and copy data of unknown types. So even if we introduce new types in the future, we can guarantee that existing applications will continue to work without modification, handling data of unknown types as a pair of type id and opaque octets.

Changes to the format

The list below shows the mapping of the Extended types (please check the differences from https://gist.github.com/frsyuki/5022569).

0xc4-0xc9,0xd4,0xd5 FixExtended (0bytes - 7bytes extended type)
0xd6 Extended 8 (extended type)  // new
0xd7 Extended 16 (extended type)  // new
0xd8 Extended 32 (extended type)  // new

0xd9 string 8 (String type)  // new
0xda string 16 (String type)  // changed from raw 16
0xdb string 32 (String type)  // changed from raw 32

The definition adds FixExtended types for data up to 7 bytes in length. Data longer than 7 bytes goes into one of the Extended 8, 16, or 32 types, depending on its length.

Payload of each Extended type would start with a single byte (hereafter referred to as ExtendedType) that designates the type of the data, and the succeeding bytes would be the actual payload.

ExtendedType is defined as below.

0x00      - binary
0x01-0xef - reserved
0xf0-0xff - private extension

For now, the only defined ExtendedType is "binary". The uppermost 16 slots are set aside for private extensions so that people can play with new ideas. The rest are reserved for future use.
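To make the layout above concrete, here is a sketch of an encoder following the proposed byte assignments. The method name and the exact length semantics (payload length measured excluding the ExtendedType byte) are assumptions for illustration, not part of the proposal text.

```ruby
def encode_extended(ext_type, payload)
  body = [ext_type].pack("C") + payload
  len  = payload.bytesize
  header =
    case len
    when 0..5          then [0xc4 + len].pack("C")             # FixExtended
    when 6             then [0xd4].pack("C")                   # FixExtended
    when 7             then [0xd5].pack("C")                   # FixExtended
    when 8..0xff       then [0xd6, len].pack("C2")             # Extended 8
    when 0x100..0xffff then [0xd7].pack("C") + [len].pack("n") # Extended 16
    else                    [0xd8].pack("C") + [len].pack("N") # Extended 32
    end
  header + body
end

# ExtendedType 0x00 is "binary" in the proposal; a 3-byte payload
# gets the 0xc7 FixExtended header under these assumptions.
encoded = encode_extended(0x00, "\x01\x02\x03".b)
```

A decoder that does not know a given ExtendedType can still compute the total length from the header and preserve the body unchanged, which is the forward-compatibility property the proposal is after.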

For example, if we ever wanted to add a time_t type to MessagePack (using ExtendedType=0x31), the encoded data would look like:

0xc8 0x31 0x51 0x2a 0xd5 0xb0 ; Feb 25 03:08:32 2013 (time_t 0x512ad5b0)

Or if we ever wanted to introduce a string type using the Shift_JIS encoding (a legacy but still very common encoding in Japan, with some incompatibilities with Unicode in actual usage) as ExtendedType 0x32, it would look like:

0xd6 0x0a 0x32 0x82 0xb1 0x82 0xf1 0x82 0xc9 0x82 0xbf 0x82 0xcd ; 3-byte header, the next 10 bytes say "Hello" in Japanese

As I mentioned before, the introduction of these types (if they ever get introduced) would not break implementations that do not know or care about them; they can simply handle the data as a pair of unknown type id and octets.

As a footnote, an FAQ against this proposal might be why I defined the ExtendedType to be a single byte instead of a variable-length value. The reason is to keep the proposal as simple as possible (so that it can be accepted by as many people as possible). If we ever came close to running out of reserved types, we could start adding multibyte ExtendedTypes, and even then existing codecs would not break, since the succeeding bytes of a multibyte ExtendedType would be treated as part of the opaque binary octets that get preserved.


cabo commented Feb 25, 2013

@kazuho Thanks for the kind words.

I completely agree that some form of type-tagging is needed for long-term extensibility.
Please see the appendix in the draft I mentioned yesterday (http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre2.txt) for the approach I'm currently favoring.

Basically, I believe the text string/raw string dichotomy is too basic to leave it to a type-tagging scheme. Tagged types can then be built from what we have, including text strings and raw strings, but also ints etc. I also arrived at single-byte type tags, because I believe these should be introduced very sparingly.


frsyuki commented Feb 25, 2013

@kazuho It seems like a good idea.

I needed to think about user-defined types as well, as I mentioned in my first comment.


cabo commented Feb 25, 2013

Oh, and one experience we made in the IETF is that the code point range that you call "private extensions" should best be called "experimental values", so it is clear these aren't up for grabs. Otherwise Company X will come and claim 0xf0 for their purposes, do production deployments with that, and then another company will claim 0xf1, and after a short while all of the code points are gone, circumventing the tight control that should be exercised around handing out new code points.


cabo commented Feb 25, 2013

Another comment: never use up all code points. E.g., a future extension like the Half floating point I suggested would no longer be possible once all code points are used up. Nobody knows what we'll want to do in five years from now, so we always should have some free space for new requirements.


kazuho commented Feb 25, 2013

@cabo

Thank you for your positive comment.

I believe the text string/raw string dichotomy is too basic to leave it to a type-tagging scheme.

Yes, I agree with that. My proposal does not use type-tagging for strings; it is a variant of @frsyuki's proposal. Strings will be stored in the "raw" area of the current spec.

Another comment: never use up all code points. E.g., a future extension like the Half floating point I suggested would no longer be possible once all code points are used up. Nobody knows what we'll want to do in five years from now, so we always should have some free space for new requirements.

I disagree. If there is a requirement to store a large number of very short values as scalars (for example half-precision floating point numbers, as you mentioned), MessagePack would not be a good format no matter how you extended it, since you would need 3 bytes (1 byte of overhead) for every half-precision floating point number. The best way in general to handle such short values is to store them in typed arrays (e.g. HalfFloatArray), and for such purposes my proposal works very well.

And for data types that require more space than half-floats or the like, my proposal is space-efficient, since by using the remaining bytes it adds the FixExtended types to minimize the encoded size, while leaving a much greater possibility of adding new types than using the very few remaining tags.


kazuho commented Feb 25, 2013

@cabo

Oh, and one experience we made in the IETF is that the code point range that you call "private extensions" should best be called "experimental values", so it is clear these aren't up for grabs. Otherwise Company X will come and claim 0xf0 for their purposes, do production deployments with that, and then another company will claim 0xf1, and after a short while all of the code points are gone, circumventing the tight control that should be exercised around handing out new code points.

Thank you for the comment. I agree that it should be worded as such if my proposal ever gets updated or gets merged to somewhere else.


kazuho commented Feb 25, 2013

@frsyuki

Thank you for your comment. I am very glad to hear that.


frsyuki commented Feb 25, 2013

I created a second proposal (incomplete, though):
https://gist.github.com/frsyuki/5028082

I think everyone understands this, but to repeat: I don't think my proposals are mature (meaning I don't think they are ready to become an established standard yet). We likely have different ideas. For now, I don't intend to give final approval to my drafts, nor to drafts at the IETF. We cannot assume that the articles I have already posted will become the next msgpack.


cabo commented Feb 25, 2013

An example for how I would do tagging:

Without tagging, a UNIX time would use the uint32 type (uint64 from 2106):

    ce 51 2b 0a 01

If we reserve a tag nn for tagging date/times, we could define the combination of nn and uint32 to mean UNIX time.
It would then look like this:

    c1 nn ce 51 2b 0a 01

Receivers can always decode this as a uint32 -- the tagging adds the semantic information.

I think the combinability of tagging with the existing type system is better than limiting tagging to raw strings only; it will generally provide a more meaningful decoding.
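A sketch of the fallback behaviour described above: a receiver that does not understand the tag can skip the two tag bytes and decode the inner value as a plain uint32. The 0xc1 tag prefix and the tag number 0x01 are hypothetical values taken from the "c1 nn ce ..." example; the helper name is illustrative.

```ruby
TAG_PREFIX = 0xc1  # hypothetical header byte for "the next value is tagged"

def decode_maybe_tagged_uint32(bytes)
  tag = nil
  if bytes.getbyte(0) == TAG_PREFIX
    tag   = bytes.getbyte(1)                       # semantic info; skippable
    bytes = bytes.byteslice(2, bytes.bytesize - 2)
  end
  raise "expected msgpack uint32 (0xce)" unless bytes.getbyte(0) == 0xce
  [tag, bytes.byteslice(1, 4).unpack1("N")]        # big-endian uint32
end

tag, unix_time = decode_maybe_tagged_uint32("\xc1\x01\xce\x51\x2b\x0a\x01".b)
```

The same function accepts an untagged `ce ...` value and simply returns `nil` for the tag, which is the point of the scheme: the tag adds meaning without changing the structural decoding.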


frsyuki commented Feb 25, 2013

@cabo I see. It's also an interesting idea.


kazuho commented Feb 25, 2013

@cabo

I agree that tagging is a good idea, and time_t is indeed a good example where it works very well. But I think the approach you proposed has a drawback in terms of footprint (which is actually the reason I did not take that route in my proposal).

My assumption is that most of the types that would ever be added to MessagePack could not be represented as a combination of a tag and a single primitive that already exists in the specification (except for binaries).

On the other hand, with the tagging approach you mentioned, each tagged type takes two extra bytes; an overhead that is better avoided (since a small footprint is one of the reasons people choose MessagePack).

For example, consider adding a "Point" class that would store two 16-bit floating point numbers.

With my proposal, it would be encoded as

c8 NN AA AA BB BB ; NN for the type tag, AAAA and BBBB are the values

With @cabo's proposal, it would need to be encoded as

c1 NN d5 04 AA AA BB BB ; NN for the type, AAAA and BBBB for the values

As you can see, there is 33% overhead in memory footprint in this case. I think this overhead is a bigger problem than the merits that can be achieved by introducing tagging for general types, and thus I decided to take the approach I proposed.
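The footprint comparison above can be checked with placeholder bytes; the tag value 0x01 and the payload bytes are arbitrary, and only the lengths matter here.

```ruby
point_data = "\xAA\xAA\xBB\xBB".b  # two 16-bit values of a hypothetical Point

# @kazuho's layout: FixExtended header, ExtendedType byte, then data.
fix_extended = "\xc8\x01".b + point_data

# @cabo's layout: tag pair (c1 NN) plus the inner value's own header.
tagged = "\xc1\x01\xd5\x04".b + point_data

overhead = tagged.bytesize.fdiv(fix_extended.bytesize) - 1.0  # 8/6 - 1
```

Six bytes versus eight bytes gives exactly the one-third overhead cited in the comment.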


kazuho commented Feb 25, 2013

@cabo

My assumption is that most of the types that would ever be added to MessagePack would not be types that could be represented by using a set of tag and a single primitive that exist in the specification (except for binaries).

Sorry I did not show an example that explains my assumption.

As an example, consider adding a date type to MessagePack. Some languages use time_t, but others do not. ECMAScript uses a different format, which can only be represented as an IEEE 754 double-precision floating point number. I do not think there is a common answer to what the internal representation of a date type should be. There is also the problem of choosing the epoch if we use a number to represent dates, not to mention how we should handle timezones.

So if we were to add support for date types, I think the best way would be to use some kind of structure (of multiple values) as the internal representation instead of trying to use a set of type tag and some primitive number.

And if many of the implementations agree on what the internal representation should be, why wouldn't we just use that representation? If we were to transmit time_t values, I think using int64_t (or uint32_t) would be the right answer.

Maybe I confused you by using time_t in my example (I just thought using such a well-known and primitive type would make the format easier for others to understand), but this is how I think. Sorry for that.


frsyuki commented Feb 25, 2013

@cabo @kazuho

I think the potential advantage of @cabo's idea is that applications can know the partial meaning of a deserialized object (stored with the tag). But I think applications can do nothing with such objects except hold them in memory and/or write them somewhere else as-is.

I mean that if applications know the meaning of the objects (and want to deal with them), they know how to decode the objects from a byte array. The only issue is how to implement the decoder. MessagePack libraries can then provide utility APIs to implement them.

On the other hand, @kazuho's idea has an advantage in terms of serialized size.


cabo commented Feb 25, 2013

One of the design principles that has made MessagePack so successful is the separation of structure and semantics: you can always decode a msgpack instance without referring to a schema or IDL file.

I think there is a danger in the tagging discussion that we are leaving that path. Having the deserializer rely on information in the (extensible) tags to derive the internal structure of the tagged information comes dangerously close to that. At the end of this path lie ASN.1 PER and XML EXI, and the beauty of msgpack is that it sits at exactly the other end of the scale.

I used date/time as an example because so far I have heard only two proposals for data types to be added: date/time and UUID. I think UUIDs are best represented as a 128-bit binary object (in msgpack, this would be a raw string), so there is little need to add a type. Date/time, however, may benefit from being explicitly tagged.

Please contrast this with the other discussion around adding an IEEE 754 Half: that would be an existing type with well-known semantics (number), but a more compact representation. If (big if) we add that, the only reasonable place is right beside 0xca and 0xcb. This is exactly not what I had in mind when writing up the tagging proposal.

Re footprint: not having a tag in a binary string saves a byte. I believe that untagged use will be the most common one. (But optimization comes after getting the structure right.)


kazuho commented Feb 25, 2013

@cabo

Re footprint: not having a tag in a binary string saves a byte. I believe that untagged use will be the most common one. (But optimization comes after getting the structure right.)

I agree with your assumption that untagged binaries will be the most commonly used. But that does not mean that adding an extra byte to such usage makes the format inefficient in terms of footprint.

In general, I assume that untagged binary data will be fairly large. AFAIK the request for such a type comes from people wanting to store images. For such large objects, a difference of one byte is not an issue.

On the other hand, tagged types will be much smaller. The examples we have discussed so far (date/time, half-float, Point) are all small. Those small, tagged binary objects are the ones we should try to encode as compactly as possible, since they are what would bloat the size of serialized data proportionally (which is the metric we should look at when we talk about footprint).

And to repeat, my proposal has the virtue of ensuring backwards compatibility (i.e. adding new types would not cause existing decoders to fail).


cabo commented Feb 25, 2013

Most uses I have for binary strings are things like cryptographic hashes, MACs, IP and MAC addresses (yes, the other meaning of MAC) etc. These are quite small (but not as small as my average strings); a byte would still make a difference.

I don't think there is a way to add representation alternatives like Half in a forward-compatible way, so that would best be done now or never. For future backward and forward compatibility, I think always having tags for binary only and having tags for a wider set of data types are about equivalent.

So the remaining difference in footprint is
-- always spending a byte for binary, vs.
-- spending two bytes for a tag, only where a tag is desired.

I sure can live with both ways, but would prefer the smaller footprint of untagged binary and the ability to tag not only binary strings but also numbers and text strings.


Midar commented Feb 25, 2013

I have to say that I like @frsyuki's first proposal / BinaryPack1pre2 best. @kazuho's proposal / @frsyuki's second proposal would hurt the most common case, as it has no FixString. @cabo's proposal of adding tags seems like yet another case where layers are not separated well: whether something is a time_t or a uint32_t/uint64_t is not really important, as both are the same and decoded the same way. This is not about storing it, it's about how to use it, which is something that belongs in a schema IMHO. And that schema should be external, not embedded, as embedding only wastes space.

To bring in a completely different side to the discussion: I disagree with having an "extension type" at all. If a new type gets added, old parsers won't parse it. But that's OK! Why not just have versions? You could say "Generate for version 1.0" if you want to be backwards compatible, and you could say "Generate for 1.1" once enough parsers have been updated. This is how other formats work as well. Saying that we always need to be compatible with parsers that implement an old version of the protocol means we will be seriously limited. It means we could never add half-float. It means we could not add (u)int128, etc. We could only add new extensions encoded in an inefficient way. That would mean that only the types from the first protocol version are first-class citizens and all other types waste space. We would end up with an encoding no better than BSON in terms of space efficiency or clean design.

Therefore my plea is to have versioned protocols and break compatibility on purpose: old parsers don't need to be able to read data from new protocols, but new parsers need to be able to read data from old protocols.

If the MessagePack people go for the extensions, I hope at least @cabo will reconsider the tags so that we have at least one format that does the right thing.


kazuho commented Feb 25, 2013

@cabo

Thank you for providing real use-cases. Let's use SHA-1 (or HMAC-SHA1; 20 bytes), IPv4 and IPv6 addresses (4 bytes / 16 bytes), and MAC addresses (6 bytes) as examples to evaluate the efficiency of the approaches.

a) use the remaining type tags

  • SHA-1 - 21 bytes
  • IPv4 - 5 bytes
  • IPv6 - 17 bytes
  • MAC address - 7 bytes

Introducing these types would cause backwards incompatibility. We would sooner or later use up all the slots, and it would become impossible to add extensions.

b) using my proposal (tag on binary only)

  • SHA-1 - 23 bytes
  • IPv4 - 6 bytes
  • IPv6 - 19 bytes
  • MAC addresses - 8 bytes

Introducing these types would not cause backwards incompatibility. There would be no limit for adding new types.

c) using @cabo's proposal (make all types annotatable)

  • SHA-1 - 24 bytes
  • IPv4 - 7 bytes
  • IPv6 - 20 bytes
  • MAC addresses - 10 bytes

Introducing these types would not cause backwards incompatibility. There would be no limit for adding new types.

If we compare the approaches by comparing the sum of the bytes required, it would be,

a) 50 bytes
b) 56 bytes
c) 61 bytes
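The per-approach totals above are just the sums of the four example sizes, which can be checked directly:

```ruby
# Byte counts from the lists above: SHA-1, IPv4, IPv6, MAC address.
a = [21, 5, 17, 7]   # a) remaining type tags
b = [23, 6, 19, 8]   # b) tag on binary only
c = [24, 7, 20, 10]  # c) all types annotatable

totals = [a, b, c].map(&:sum)
overhead_b_over_a = (b.sum - a.sum).fdiv(a.sum)  # the "10%+ overhead" figure
```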

As you can see, if we take approach b, for 10%+ overhead we get infinite extension slots while guaranteeing backwards compatibility, compared to approach a. I think these examples illustrate that this is the way we should go.

I don't think there is a way to add representation alternatives like Half in a forward compatible way.

As I said before, I think using MessagePack as a format for storing half-floats is a bad idea. Even if you used the few remaining tags, there would still be 50% overhead. If such usage really matters, I think something like a 4/5 encoding (2 bits minimum for a type tag) would be a better approach. Besides, IMO the general use case for half-floats is to store many of them at once, and for that case we can introduce things like HalfFloatArray under either approach a or b, which would save space.

So the remaining difference in footprint is
-- always spending a byte for binary, vs.
-- spending two bytes for a tag, only where a tag is desired.

No, it is as follows, and the numbers above show the sizes for the use-cases you are interested in.

-- spend extra byte for binary below 8 bytes, save one byte for tagged types below 8 bytes, vs.
-- spend two extra bytes for tagged types


catwell commented Feb 25, 2013

@kazuho

0xc4-0xc9,0xd4,0xd5

Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.

Something else: while we're discussing changes to MessagePack we could add typed collections to the discussion.

For instance, starting with @kazuho 's point type proposal:

c8 NN AA AA BB BB

This has the following structure:

[tagged type header + length] [tag] [data]

If you want to store for instance a polygon as an array of points you will have to write:

[array header + length] [tagged type header + length] [tag] [data1] [tagged type header + length] [tag] [data2] ...

I think this use case (similarly typed collections) is frequent and the current MessagePack encoding for it is rather wasteful. It would be interesting if we could write something like this instead:

[typed array header + length] [tagged type header + length] [tag] [data1] [data2] ...

In that case for large collections it results in a 33% space gain.

Do you think this is something that could be added to MessagePack?
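The gain @catwell describes can be sized for an array of n hypothetical Points, using the 6-byte tagged encoding from earlier in the thread (1-byte header, 1-byte tag, 4 bytes data). The 3-byte array header (like msgpack's array 16) and the typed-array header layout are assumptions for illustration.

```ruby
def plain_array_bytes(n)
  3 + n * (1 + 1 + 4)  # array 16 header, then per element: header, tag, data
end

def typed_array_bytes(n)
  3 + 1 + 1 + n * 4    # one shared header and tag, then raw element data
end

n = 1000
gain = 1.0 - typed_array_bytes(n).fdiv(plain_array_bytes(n))
```

For large n the fixed headers amortize away and the gain approaches the one-third figure quoted above.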


cabo commented Feb 25, 2013

@kazuho I wouldn't normally tag the binaries, so the numbers would be:

  • SHA-1 - 22 bytes
  • IPv4 - 6 bytes
  • IPv6 - 18 bytes
  • MAC addresses - 8 bytes

@catwell Indeed, homogeneous arrays provide an opportunity for optimization.
Is that optimization needed?

All we should be trying to do here is get the structure right, and then make sure we don't waste bytes unnecessarily. But outright designing for optimization leads toward EXI and PER (or HDF5, or ...), not toward a better msgpack.


frsyuki commented Feb 25, 2013

I disagree about having an "extension type" as a whole

as @Midar says. Let me add another example of why allowing future extensions has disadvantages and could cause hesitation to adopt msgpack.

Suppose an application uses the Integer type of msgpack to represent times, and a receiver reads a time object as an Integer (or Raw), and this works. Then, if msgpack added a time type (regardless of its format; it could be an extension tag or a new header-byte assignment), the sender would start sending the same object using the newly added time type.

A) If an old deserializer maps the newly added time type to Integer (or Raw) (here I assume the old deserializer can still read the newly added value thanks to a predefined trick in the format, such as extension tags), the receiver still works, because the object the receiver receives is the same as the expected one. This is OK.

B) But if the old deserializer restores it as a byte array or a tuple of type tag and binary (or integer), it doesn't work.

Adding a data type could break working applications horribly. The reason adding a string (or binary) type still seems OK is that we don't have to consider case B for strings (or binaries).


Midar commented Feb 25, 2013

@frsyuki Agreed.

Another point is that the only type extensions that would make sense (because they are really new types and not schema embedded into data) are the ones that need extra support from the parser anyway, could never be parsed by an old parser, and would not return something meaningful if treated as binary. Examples would be (u)int128_t/(u)int256_t numbers (which are used by SSE/AVX) or half-floats etc. All of these need special parsing, and no extension type whatsoever would help an old parser. Because if an old parser did support it, the value would sometimes be a number (because it was small enough to fit into one of the existing types) and sometimes binary. What good would that be?!


frsyuki commented Feb 25, 2013

However, we don't have the problem I mentioned if applications don't use added types implicitly (because then this part doesn't happen: "the sender will send the same object using that newly added time type").

I mean that having extension tags (@kazuho's idea) and not adding types are compatible, as long as the extension tags are used only when applications explicitly specify the type (meaning that new serializers don't use the Extended type automatically).


kazuho commented Feb 25, 2013

@catwell

0xc4-0xc9,0xd4,0xd5

Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.

I totally agree that it is confusing. But there are no contiguous slots left any more.

Do you think this is something that could be added to MessagePack?

Yes!!! That's the entire reason I am proposing the extension to introduce tags. The reason it is taking so long to add a string type is that adding types breaks existing applications.

If we introduce the ability to add extended types, it will be much easier, since adding such types would not cause other applications (middleware) to collapse. For example, a middleware that routes MessagePack objects by looking at the "to" field would continue to work when you add new types.

So there would be much less rallying against adding new types; the parties interested in having such types (for example HalfFloatArray) can just register a type id for the types they need and share the implementation, instead of reinventing the types they all use on top of the binary type.

It is like how TCP/IP, XML, or ASN.1 works. The IP protocol has a "protocol number" that lets others invent new protocols without destroying the entire IP protocol. A recent example in this area is SCTP, which is trying to become a better alternative to TCP.

By defining such an extension point, we would help people share more code for encoding/decoding data, which actually is what MessagePack is all about.


kazuho commented Feb 25, 2013

@frsyuki

However, we don't have the problem I mentioned if applications don't use added types implicitly (because then this part doesn't happen: "the sender will send the same object using that newly added time type").

My idea behind the proposal is that libraries should never store data using the Extended type unless the programmer explicitly requests it.

As I explained using the TCP/IP example, it is a "separation of layers" problem. Extended types should always be used explicitly at the MessagePack codec level. Though, of course, people are allowed to use a wrapper (or combine wrappers) that handles the conversions to encode/decode the extended types, and some MessagePack implementations may allow developers to explicitly plug in such extensions.


frsyuki commented Feb 25, 2013

@catwell @kazuho

0xc4-0xc9,0xd4,0xd5
Couldn't we avoid things like this (non-continuous ranges for fix types)? It will make the definition of the format confusing IMO.
I totally agree that it is confusing. But there are no contiguous slots left any more.

I agree....
Another possible idea is to assign 0xd4 to "8-byte Extended" and 0xd5 to "16-byte Extended":

    0xc4-0xc9 FixExtended (0bytes - 5bytes Extended type)  // new
    0xd4 8bytes extended (8bytes Extended type)  // new
    0xd5 16bytes extended (16bytes Extended type)  // new
    0xd6 extended 8 (Extended type)  // new
    0xd7 extended 16 (Extended type)  // new
    0xd8 extended 32 (Extended type)  // new

The assumption here is that 8-byte and 16-byte binaries are used more often than 6- or 7-byte ones (this could be wrong).
This format is still confusing, though, and slightly more complex to implement in serializers.
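The "slightly complex" part can be seen in a sketch of header selection under this alternative layout; the `case` branch order matters, because the fixed 8- and 16-byte forms must be tried before the general ranges. The method name is illustrative.

```ruby
def extended_header_for(payload_len)
  case payload_len
  when 0..5      then 0xc4 + payload_len  # FixExtended, 0-5 bytes
  when 8         then 0xd4                # 8-byte Extended
  when 16        then 0xd5                # 16-byte Extended
  when 0..0xff   then 0xd6                # Extended 8 (6, 7, 9-15, 17-255)
  when 0..0xffff then 0xd7                # Extended 16
  else                0xd8                # Extended 32
  end
end
```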


kazuho commented Feb 25, 2013

@cabo

@kazuho I wouldn't normally tag the binaries, so the numbers would be:

  • SHA-1 - 23 bytes
  • IPv4 - 6 bytes
  • IPv6 - 19 bytes
  • MAC addresses - 8 bytes

Sorry, I misunderstood; I did not realize you would not tag the binaries. Would you mind explaining which spec the numbers were calculated against?

If it is https://gist.github.com/frsyuki/5022569, then I think the numbers would be:

  • SHA-1 - 22 bytes
  • IPv4 - 6 bytes
  • IPv6 - 18 bytes
  • MAC addresses - 8 bytes

and that would be 54 bytes in total. My proposal was 56 bytes, so the additional overhead is 3.7%... I think that is something we can afford in exchange for the possibility of extending types without sacrificing interoperability.


frsyuki commented Jun 19, 2013

@Midar My idea is to use a container class to store an integer (the type) and a byte array. In Ruby, it would look like:

class MessagePack::Extended
  def initialize(type, data)
    @type = type
    @data = data
  end

  attr_reader :type, :data

  def to_msgpack(packer)
    packer.buffer.write ...do serialize....
  end
end
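A self-contained, runnable variant of the container sketch above, leaving the elided serialization out and showing only the value-object part; the `==` method is an addition so applications can compare round-tripped objects.

```ruby
module MessagePack; end unless defined?(MessagePack)

class MessagePack::Extended
  attr_reader :type, :data

  def initialize(type, data)
    @type = type
    @data = data
  end

  # Value equality: same ExtendedType id and same opaque payload.
  def ==(other)
    other.is_a?(MessagePack::Extended) &&
      other.type == type && other.data == data
  end
end

# An application can hold an unknown-typed value and re-emit it as-is.
ext = MessagePack::Extended.new(0x31, "\x51\x2a\xd5\xb0".b)
```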


Midar commented Jun 19, 2013

@frsyuki Yes, that's exactly what I meant. Ok, will go ahead, thanks!


Midar commented Jun 19, 2013

@frsyuki Can we maybe add that to the spec as well?
@cabo Will you bring this to the IETF?


Midar commented Jun 19, 2013

@frsyuki Another bug: fixtext 1 does not have a data byte in the diagram


frsyuki commented Jun 19, 2013

@Midar ooh, fixed. thank you!


Midar commented Jun 20, 2013

@frsyuki

In a minor release, deserializers support the bin format family and str 8 format. The type of deserialized objects should be same with raw 16 (== str 16) or raw 32 (== str 32)

Shouldn't that also include ext, so parsers don't choke on it? The idea is that libraries update the deserializer to support the new format in a minor version, and then the next major version also includes a serializer that outputs the new format, right?

By the way, I have implemented ext now. Is there any other implementation I can test it against? I tested it against itself and that worked, of course ;).

Here is a test file containing all ext types (I tested deserializing and serializing it again and compared that both are equal): https://webkeks.org/all_ext.msgpack

@Midar

Midar commented Jun 26, 2013

@frsyuki Any ETA on when you will make the new spec official?

Anybody else: have you already implemented it and checked my test file?

@ramonmaruko

@Midar These are the results when I unpack your test file.

@Midar

Midar commented Jun 27, 2013

@ramonmaruko This looks good. It seems I accidentally chose type 6 for the 64K ext. I guess the last two were just cut from your paste?

For which language is that and is it already committed?

@ramonmaruko

@Midar Yes, it looks better when viewed as raw. This is for msgpack-ruby, and it hasn't been merged yet. I'm waiting for @frsyuki to review it.

@Midar

Midar commented Jul 9, 2013

@frsyuki Any idea when you will release the new spec?

@repeatedly
Member

A msgpack hackathon will be held in Japan:

http://www.zusaar.com/event/881006

After that, the D, Ruby, C++, Java, Scala, Erlang and other implementations will support the new spec.

@skunkiferous

Hi. I'm currently implementing my own object serialisation protocol on top of MessagePack (old spec, in Java), and did not notice this new spec until now. I skimmed the spec and searched this discussion, but could not find an answer to the most important issue I see with using MessagePack efficiently; possibly I used the wrong search terms. My question is this:

Is there any way, using the new spec, to define maps, arrays or raws (now also strings) of unknown size?

To put it simply: when you start turning something into bytes, say using Java's serialisation to serialise a java.util.Random instance (hidden internals prevent using MP directly), you do not know how many bytes it will need, but it is wasteful to allocate a temporary byte array just to copy it into the MP stream later and throw it away.

And for maps of unknown size there is also an obvious example: let's say I want to encode my objects somewhat like Protocol Buffers does. I create a map of unknown size, iterate over each property, and if it is not the default, I add a new map entry with the property ID and value. There is no way to know how big the map will be ahead of time, short of going over all properties twice.

I'm not sure how this could be done for raw/string, but the obvious way to implement it for maps/arrays is to use a marker value, and nil would be perfect for that (or maybe that one unused code?)

So, is this possible with the new spec?
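To make the "going twice over all properties" option concrete: filter and count the non-default entries first, then write the correctly sized map header, then the entries. A hypothetical Ruby sketch (`emit_map`, `fixint` and the inputs are mine, not from msgpack-ruby), restricted to positive fixints (0..127) so it stays self-contained:

```ruby
# Two-pass map emission under the fixed-size spec: pass 1 filters and
# counts, pass 2 writes the header and the entries.
def fixint(n)
  [n].pack("C")  # positive fixint: the value is the byte itself
end

def emit_map(props, defaults)
  entries = props.reject { |id, value| value == defaults[id] }
  n = entries.size
  out = if n < 16
          [0x80 | n].pack("C")   # fixmap
        elsif n < 0x10000
          [0xde, n].pack("Cn")   # map 16
        else
          [0xdf, n].pack("CN")   # map 32
        end
  entries.each { |id, value| out << fixint(id) << fixint(value) }
  out
end

# 3 bytes: fixmap with one pair, key 1, value 9 (property 2 kept its default)
emit_map({ 1 => 9, 2 => 0 }, { 1 => 0, 2 => 0 })
```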

@Midar

Midar commented Jul 17, 2013

@skunkiferous Nope. It would also make parsing much more expensive, because a dynamic buffer would have to grow. The overhead of the extra copy in the serializer is much lower than the parsing overhead of your proposed format: it is one copy in the serializer, but n copies in the parser! What realloc() does (and even in Java, at some point it comes down to this) is check whether there is enough space in the buffer and, if not, allocate a new buffer and copy everything over. So from one useless copy you have gone to n useless copies. That's hardly an improvement.

PS: Yes, you can reduce it to log(n) copies by wasting some space. Still, that's more than one copy.
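The log(n) point can be made concrete: with geometric (doubling) growth, the number of reallocate-and-copy events is logarithmic in the output size, while a fixed-step growth policy makes it linear. A small illustrative Ruby sketch (`growth_copies` is mine, not from any library):

```ruby
# Append n bytes one at a time and count how often the buffer must be
# reallocated (and thus copied) under a given growth policy.
def growth_copies(n)
  cap, size, copies = 1, 0, 0
  n.times do
    if size == cap          # buffer full: grow and copy
      cap = yield(cap)
      copies += 1
    end
    size += 1
  end
  copies
end

growth_copies(1024) { |c| c * 2 }   # doubling: 10 copies, O(log n)
growth_copies(1024) { |c| c + 64 }  # fixed step: 16 copies, O(n / step)
```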

@YordanPavlov

Hi, when can we expect to see the proposed standard https://gist.github.com/frsyuki/5432559 implemented in the C/C++ msgpack tools?

This was referenced Aug 17, 2013
@kuenishi
Member

I am closing this long thread because we have updated the spec. If you want further discussion, please open a new pull request against the spec file. Yay and goodbye!! :trollface:

@Midar

Midar commented Aug 17, 2013

I think closing this is too early, as the new spec is not on the homepage yet; there is only a link to this discussion and the v5 proposal.

@kuenishi
Member

The new spec is now on the homepage and in this repository. The "next" link will soon be updated to "The new spec is out, update yours!"
