Msgpack can't differentiate between raw binary data and text strings #121

Closed
rasky opened this Issue Nov 12, 2012 · 309 comments


rasky commented Nov 12, 2012

It looks like the msgpack spec does not differentiate between a raw binary data buffer and text strings. This causes some problems in all high-level language wrappers, because most high-level languages have different data types for text strings and binary buffers.

For instance, the Objective-C wrapper is currently broken: it tries to decode all raw bytes into high-level strings (through UTF-8 decoding), because using a text string (NSString) is the only way to populate an NSDictionary (map). But it breaks, because obviously some binary buffers cannot be decoded as UTF-8 strings.

The same happens with Python 2/3: when you serialize and deserialize a unicode string, you always get a binary string back, and this breaks simple code:

>>> a = { u"東京": True }
>>> mp = msgpack.dumps(a)
>>> b = msgpack.loads(mp)
>>> a == b
False
>>> b[u"東京"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: u'\u6771\u4eac'
>>> b
{'\xe6\x9d\xb1\xe4\xba\xac': True}

As you can see, when you deserialize, you get a different object which does not work (because internal text strings are not decoded from UTF-8).

Most wrappers have an option to apply automatic UTF-8 decoding to all raw bytes, but that is wrong because it applies to ALL raw bytes, while you might have a mixture of text strings and binary bytes within the same MessagePack message. That's not at all uncommon.
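
To make that concrete, here is a hedged sketch using the Python binding's encoding option as it appears elsewhere in this thread (the message fields are invented, and later versions of the binding changed this API): whichever way you set the option, one of the two kinds of raw data comes out wrong.

import msgpack

# Invented example message: one text field, one genuinely binary field.
msg = {u"title": u"\u6771\u4eac",             # "Tokyo"
       u"thumbnail": b"\x89PNG\r\n\x1a\n"}    # PNG signature bytes
packed = msgpack.dumps(msg, encoding='utf-8')  # both values become "raw" on the wire

# Without any decoding option, the text comes back as plain bytes (the problem above).
print(msgpack.loads(packed))

# With the blanket decoding option, EVERY raw is forced through UTF-8,
# so the binary thumbnail raises instead of coming back as bytes.
try:
    msgpack.loads(packed, encoding='utf-8')
except UnicodeDecodeError:
    print("blanket UTF-8 decoding chokes on the binary field")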

As I said, this problem can be found in almost all high-level messagepack bindings, because most high-level languages have different data types for text strings and binary buffers.

I think the only final solution for this problem is to enhance the msgpack spec to explicitly differentiate between text strings and binary buffers. Is this something that msgpack authors are willing to discuss?

I am willing to implement whatever solution you decide is best and submit a pull request.

Thanks!

Midar commented Dec 4, 2012

This is a serious problem and what's preventing me from implementing MessagePack in my ObjC framework. I have to know whether I should create a string object or a data object. Creating a string object for everything will fail if it is not UTF-8 and always creating a data object will be very impractical.

MessagePack is advertised as being compatible with JSON, providing only what JSON provides - does that mean raw data actually means "UTF-8 string" in the author's view of things?

chakrit commented Jan 14, 2013

First-class string support was proposed two years ago and the issue still hasn't been closed.

If msgpack really goes by the motto "It's like JSON", I think it needs to solve this and other related issues ASAP.

For the time being, though, I think going with UTF-8 and using some key convention to differentiate between binary blobs and strings might help.

e.g. append _data to every key that should be treated as binary, and otherwise decode as a UTF-8 string by default.
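
For illustration only, a rough sketch of that key-suffix convention in Python (the helper and the sample keys are made up; this is not part of any msgpack binding):

import msgpack

def unpack_with_suffix_convention(packed):
    # Invented helper: every raw comes back as bytes, and the key name
    # decides whether the value stays binary (keys ending in '_data')
    # or is decoded as UTF-8 text (everything else).
    out = {}
    for key, value in msgpack.loads(packed).items():
        key = key.decode('utf-8')
        if isinstance(value, bytes) and not key.endswith('_data'):
            value = value.decode('utf-8')
        out[key] = value
    return out

packed = msgpack.dumps({b'name': b'Tokyo', b'thumb_data': b'\x89PNG\r\n\x1a\n'})
print(unpack_with_suffix_convention(packed))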


EDIT: Found a comment related to this issue on StackOverflow: http://stackoverflow.com/questions/6355497/performant-entity-serialization-bson-vs-messagepack-vs-json#comment15798093_6357042

Generally, the raw bytes are assumed to be a string (usually utf-8), unless otherwise expected and agreed to on both sides of the channel. msgpack is used as a stream/serialization format... and less verbose than json.. though also less human readable.

So I take this to mean that if we need raw bytes on the wire, we should implement our own addition to the protocol.

Midar commented Jan 14, 2013

Appending _data or some convention like that means it's not possible to write a generic MsgPack implementation that can be used by applications. I need to know whether something is a string or binary data, because I need to handle the two differently. And I need to know that before I pass the data to the application, because the application will get the wrong object otherwise.

If this bug is well known for over 2 years and there is no intention to fix it, then I guess we should just move on and forget MsgPack.

mirabilos commented

Actually @Midar, JSON is not binary-safe, and all strings are UTF-16 there (with UTF-8 being a valid representation thereof).

No idea on msgpack though, only stumbled here because of a discussion about salt…

Midar commented Feb 5, 2013

@mirabilos Nobody is talking about JSON being binary-safe here. The problem is that while strings in JSON are UTF-8 (and UTF-16 internally), there is no specification on that in MsgPack whatsoever. It is simply impossible to know whether something is a string in UTF-8, a string in UTF-16, a string in ISO-8859-1, a string in KOI8-R or just some binary data. And that is the problem. This is completely different to binary-safety and has absolutely nothing to do with JSON.

Agreed, it is a problem which lends itself to ad-hoc workarounds. I've been using the Objective-C msgpack implementation to transfer mixed data between iOS devices and a server.
When "raw" data is detected, it tries to parse it as a UTF8 string first. The only solution I could think of was to patch msgpack-objectivec such that if the UTF8 parse produces a null result, then it simply returns that item as an array of bytes.
However, this heuristic will fail if the UTF8 parse just happens to produce a valid UTF8 string, or perhaps worse if parsing some binary data could cause "unspecified" behaviour.

chakrit commented Feb 20, 2013

Time for a new fork?

Owner

frsyuki commented Feb 20, 2013

OK. Sorry for being late.
As the initial designer of the MessagePack format, I think msgpack should not have string type.
I need to write a longer article, but let me describe some points so far:

  • data format should be isolated from programs
    • it depends on the application whether a sequence of bytes should be treated as a byte array or as a string.
    • the lifecycle of a data format is usually longer than that of programs:
      • example 1: stored data should stay consistent even as programs change
      • example 2: network protocols should stay compatible with old programs
    • thus data should not carry a string-type information bit, and applications should map a sequence of bytes to a string type only when it's necessary
  • successfully stored data must be read successfully
    • if the packer stores something as a string, it should validate the string before storing it, to guarantee this
    • implementing validation code is relatively hard and makes it difficult to port msgpack to other languages/architectures
    • data may not be trusted, so the unpacker should also support string validation, at least optionally
    • supporting multiple encodings makes it even harder
    • thus the msgpack library should not take on encoding validation, including a string-type bit
  • it isn't a problem in statically typed languages
    • because these languages need to specify the data type before handling the deserialized (=dynamically typed) data either way
    • see the type conversion mechanisms of the C++, Java and D implementations
    • users find the Java implementation's Value class (by @muga) useful, and it avoids the byte array/string problem completely
  • even with dynamically typed languages, some committers don't think this is causing problems
    • Python (@methane), Ruby (@frsyuki = me), Erlang (@kuenishi)
    • the Python implementation supports an option to return byte sequences as strings (byte arrays by default)
  • I think only JavaScript and Objective-C have problems
  • JavaScript historically doesn't have a byte array type; it needs special handling either way
  • I suggest the Objective-C/JavaScript implementations adopt the following solution (a rough Python transliteration is sketched after this list):
    • the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits from NSString
    • the object contains a validated UTF-8 string
    • if validation failed, it's nil or something else that tells us the validation failed
    • NSStringOrData#data returns the original byte array
  • supporting user-defined custom types is better than a string type
    • 0xc1 is considered to be reserved for the string type
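
For what it's worth, here is that "string-or-data" idea transliterated into Python purely for illustration (the class is invented; the real proposal above is about NSString in Objective-C): one object that is the raw bytes but also carries the UTF-8 decoding when validation succeeds.

class RawOrText(bytes):
    # Invented sketch: the unpacker would return this instead of plain bytes.
    # `.text` is the UTF-8 decoding if the bytes validate, else None;
    # the object itself is still the original byte sequence.
    def __new__(cls, data):
        obj = super(RawOrText, cls).__new__(cls, data)
        try:
            obj.text = data.decode('utf-8')
        except UnicodeDecodeError:
            obj.text = None
        return obj

print(RawOrText(b'\xe6\x9d\xb1\xe4\xba\xac').text)   # valid UTF-8 -> the decoded string
print(RawOrText(b'\x89PNG\r\n\x1a\n').text)          # not UTF-8 -> None
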
Owner

frsyuki commented Feb 20, 2013

It took time for me to form my opinion.
My conclusion is that "it's better to support user-defined custom types rather than adding a string type".

Member

muga commented Feb 20, 2013

Hi,

I'm the developer of msgpack-java. The above is a well-known (and complicated) problem.

@frsyuki

+1

In my opinion,

  1. the serialization core library should not implement character encoding.
  2. the serialization format should not include charset information.
  3. having a utility library on top of the core library is a good idea

If msgpack had a string type, the format and the library implementations would have to be more complicated, which would make keeping the format and libraries compatible difficult. It is genuinely hard to design a serialization format that accounts for every charset. If it had bugs, we would have to fix not only the format but also the libraries. That is a critical problem.

Business logic on the application side should handle character encoding. But having extension hook points in a msgpack library is a good idea, so that you can extend encoding handling using other libraries.

Member

methane commented Feb 20, 2013

-0.5 on adding a string type.

For example, JSON has Integer and Number types. An application should handle Number when it expects Integer.
If msgpack had a string type, an application would have to handle raw when expecting string, and handle string when expecting raw.
So, I feel an inter-language serialization format should have a minimal set of types.

chakrit commented Feb 20, 2013

I disagree completely.

UTF-8 and UTF-16 are very well-known standards that have been around for a very long time. All new implementations these days should support Unicode string encoding from day one. There shouldn't even be a question of which character encoding to use, especially when msgpack wants to be just "like JSON".

There are well-known UTF string encoding routines available on nearly every platform. It's not like every implementation has to roll its own character encoding routine from scratch; they can just use whatever's available on their platform of choice, and character encoders/decoders are available on most, if not all, platforms these days. In my opinion, implementing an encoder/decoder is a non-problem: don't reinvent any wheels.

Think of this as referencing another standard in your piece of work instead of having to specify the whole character encoding mechanism yourself.


String is a very fundamental data type required by most (if not all) applications these days. And let me repeat this: "It's like JSON." is printed in an H2 at the very top of the msgpack website, and yet your specification does not include something as simple as a String. Why?

Also, the problem exists whether or not msgpack specifies a string type. In my opinion, it is even worse not to specify the exact character encoding in your wire protocol.

Suppose you have two applications which both use msgpack, yet they can't communicate because the msgpack protocol itself does not specify how a string should be encoded, leaving room for incompatibility. If the msgpack spec would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8", this problem wouldn't have existed in the first place.


Let me suggest this:

  1. You should simply add a String data type. It is so fundamental that it should not be left out, especially when you are advertising msgpack as a faster/smaller JSON. I suggest you start with UTF-8 and/or UTF-16 as the encoding (and personally, I don't think there is any need to support more encodings than those two). If anyone needs absolute speed, they can still use the old raw-bytes type with their own encoding and their own acceptance of whatever incompatibilities might arise.
  2. If you insist on not having a String data type, then there should be better documentation and a "recommended practice" for handling strings and the encoding to use, because, as I've repeated, String is a very fundamental data type that should have been specified in the spec, and there are many platforms where both a String type and a plain Buffer (or byte[] array) type are in active use, such as JS/node.js and ObjC/iOS. Leaving this out just causes confusion between parties trying to implement the same protocol.

TL;DR --- I think this is simply a matter of properly documenting the "best practice" and what's expected of implementations, rather than throwing out a spec that defines only binary blobs and denies all string support out of fear of character encoding issues, with zero pointers on exactly how to implement strings should you need them (and you will definitely need them; what application does not use a string?).

Member

mzp commented Feb 20, 2013

Hi, I'm the developer of msgpack-ocaml. I disagree with adding a string type.

One of the benefits of msgpack is that it is multi-platform. So we should be careful about adding new types.

And a string type is not that attractive. Although a string type is a fundamental type in many languages, a UTF-8-encoded string type is not. For example, OCaml doesn't assume any encoding for its strings.

I don't have a strong opinion about a "recommended practice", but I think that it is each application's task, not msgpack's.

Owner

frsyuki commented Feb 20, 2013

@chakrit I don't think supporting UTF-8 encoding/decoding/validation is easy even if there are some well-known libraries. Remember that msgpack focuses on being cross-language. For example, I don't think Smalltalk supports FFI by default. In JavaScript for browsers, @uupaa had to implement IEEE 754 by hand, and similarly complex code would be needed again to support UTF-8 (or UTF-16):
https://github.com/msgpack/msgpack-javascript/blob/master/msgpack.js#L135

if the msgpack specs would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8"

I agree it's a good idea. I added a comment to the spec: http://wiki.msgpack.org/display/MSGPACK/Format+specification
At least Java and Ruby implementations (written by me) already use UTF-8 to serialize strings.

Regarding 1., JSON doesn't support a binary type. Do you mean msgpack should drop the Raw type so that it can say it's like JSON? I don't think so. The problem is that some users want to handle strings and binaries at the same time, and they want to tell the difference transparently. If we want to use msgpack as a replacement for JSON, users can assume all Raw objects are strings. Some msgpack libraries, such as the Python implementation, support a string-only mode (a nice feature, I think). I want to add that feature to msgpack-ruby v0.5.x as well.

Regarding 2., to be exact, it's a problem of the JS/node.js and ObjC/iOS implementations. I mean that String is not a fundamental type in some languages such as C, C++, Ruby (at least 1.8), Erlang, and Lua (significant languages, right?). In Python and Ruby 1.9, the difference between strings and binaries is unclear in terms of both implementation and culture. The MessagePack format itself doesn't define the mappings between msgpack's types and language types; implementations take on the role of projecting msgpack's types onto language-specific types (this is an essential concept of msgpack). Thus, as I mentioned above, the JS and ObjC implementations should document that specifically.

...But anyway, I agree that it's better if the msgpack documents mention the "best practice for handling strings in certain dynamically typed languages such as Objective-C or JavaScript".

So, TL;DR: the msgpack project lacks some important documents, such as: why msgpack doesn't have a string type, guidelines for implementations on how to handle strings, and the best practice for handling strings. // TODO FIXME

Midar commented Feb 20, 2013

I strongly disagree with the position not to add the most basic type: a string.

Let's assume MsgPack is Layer 1 and our protocol is Layer 2, encoded in MsgPack. So, when I want to decode MsgPack to objects (which is Layer 1, remember?), I also need to have knowledge about Layer 2 (because otherwise I can't know what it is)? Sorry, but this is completely retarded. This is like "In order to parse TCP, you need to parse the protocol that's wrapped inside TCP. So, if you want to parse TCP, you need to parse every protocol in existence like HTTP, XMPP, SMTP, IMAP, etc.".

Saying that UTF-8 is too complicated is basically admitting defeat. If you can't implement those 20 lines of C code required for de- and encoding UTF-8, you probably shouldn't write any code at all. Especially as almost all languages have already implemented UTF-8 and you can just use it.

The strangest thing is the reason: You're saying you don't want to have a string type out of fear of being not interoperable. Well, actually, you kill interoperability by not having a string type, as therefore it's not possible to parse Layer 1 in many languages as you don't know which encoding is used or if it even is a string. There is no way to have a look at the data without some kind of schema and thus looking at Layer 2, which you really shouldn't. This violates basic rules of software design!

The advantage of MsgPack over Protocol Buffers could have been that it does not need a schema. But with this decision, MsgPack has no advantage over Protocol Buffers. It's not portable and it needs a schema, both things you don't want from a general-purpose serialization format.

Saying that UTF-8 is a problem for interoperability is really the biggest nonsense I've heard so far. Almost all modern network protocols require UTF-8. XML requires UTF-8 and works on many more platforms and languages than MsgPack ever will. Requiring UTF-8 eliminates the pain of having to support multiple encodings. There's a reason the world moved to UTF-8…

Member

repeatedly commented Feb 20, 2013

Hello, I'm the author of msgpack-d.

I have never wanted a string type in my msgpack experience.
In D, string <-> byte conversion is not a problem, because the application has already normalized any invalid string before serialization.
In addition, in my RPC experience, having many serialization types is bad. It causes a lack of interoperability.

Probably, this issue is an IDL- or application-layer problem.

P.S.
If a string type is introduced, then supporting user-defined custom types would be good for me,
because this approach answers everyone who says "I want this type in msgpack!"

rasky commented Feb 20, 2013

@frsyuki @methane I am the original issue opener. I have posted a clear Python example that shows that msgpack is broken in Python, as a very simple data structure doesn't load back. So I can't see how you can think that it is not broken, in Python at the very least.

I know there is an option to decode raw bytes as strings (byte arrays being the default), and that's totally useless, because it applies to all of them.

Also, when you say "In Python and Ruby 1.9, the difference between strings and binaries is unclear in terms of both implementation and culture", I honestly don't know what you are referring to. The difference between strings and binaries is very clear in Python (and Ruby, and Java, and Objective-C, and MANY of the modern languages); there are tons of documentation on it, tons of material, tons of talks. I am surprised that you can think it is unclear.

I think @Midar nailed it. The problem is that, without a string type, MsgPack always needs a schema/IDL to be useful, because it cannot convert back to native data structures without a schema telling it how to. Vice versa, if you add a string type, it becomes possible (most of the time) to avoid a schema.

Owner

frsyuki commented Feb 20, 2013

I need to mention another problem with UTF-8 (and Unicode).

UTF-8 validation raises the NFD/NFC problem. For example, "\u00e9" (NFC) and "\u0065\u0301" (NFD) represent exactly the same character (you may know that Mac OS X uses NFD to represent file names, and this sometimes causes trouble with Linux, which usually uses NFC). If msgpack had a string type, should implementations normalize characters to NFC, or NFD?

UTF-8 has overlong encodings as well: 0x2F could also be written as 0xC0 0xAF. Should deserializers reject these bytes, or normalize them into another character?
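
For reference, this is how those two cases look from Python's standard library; whether a serializer or deserializer should do any of this is exactly the question being debated:

import unicodedata

# NFC vs NFD: two different code point sequences for the same character.
nfc = u'\u00e9'           # precomposed e-acute
nfd = u'\u0065\u0301'     # 'e' followed by a combining acute accent
print(nfc == nfd)                                   # False
print(unicodedata.normalize('NFC', nfd) == nfc)     # True

# Overlong encoding: 0xC0 0xAF is a forbidden way of spelling 0x2F ('/').
print(b'\x2f'.decode('utf-8'))    # '/'
try:
    b'\xc0\xaf'.decode('utf-8')
except UnicodeDecodeError:
    print('strict decoders reject the overlong form')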

Member

methane commented Feb 20, 2013

@rasky I agree with you that adding a string type helps Pythonistas.
But msgpack is an inter-language communication format.
We should be able to communicate with weakly typed languages like PHP or JavaScript.

If you want to serialize Python data types perfectly, you can use pickle instead.
It can round-trip datetime, tuple and many other types correctly.

rasky commented Feb 20, 2013

@methane I'm using msgpack specifically because it's an inter-language communication format. I communicate between Python and Objective-C, and the Objective-C msgpack library is totally broken because the string type is missing; in fact, the standard Objective-C object/dictionary construct must have strings as keys, so the msgpack Objective-C library tries to convert everything into strings, which breaks the transmission of binary data. If msgpack had a distinct string data type, the Objective-C library would know what to do.

@frsyuki First, I assume that all languages that implement native Unicode strings will have libraries to handle this either way. My take on this is that msgpack shouldn't do anything. You convert from Unicode into UTF-8 using the standard behavior of the language, and convert back again with the standard behavior. The problems you cite arise only if someone is trying to use UTF-8 as-is, so they will only arise in languages where Unicode is not implemented. I think that, if an implementer is going to communicate between a unicode-rich language and a unicode-poor language, it is up to the implementer himself to take care of these small details.

Midar commented Feb 20, 2013

@frsyuki None. That is not part of the serialization. Comparing strings is a completely different domain. You could convert it from UTF-8 to your preferred charset and compare it in that and lose internationalization - that's up to you. Or you could put Unicode in your raw binary and still have those problems. Completely up to you. You don't lose anything by having a type for UTF-8 strings. That's just the transfer encoding, you can recode it to whatever you want.

@methane Do you even hear what you're saying?

But msgpack is a inter language communication format.
We should communicate with weak typed languages like php or JavaScript.

So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-language, you should recognize that the only way to have that is to add support for a string type.

@rasky Actually, no, everything can be a key in a dictionary as long as it implements -[copy], -[hash], and -[isEqual:]. But who wants to use binary keys in some code? That would always be "Get the bytes from an NSString and create NSData and then pass that to objectForKey:". :)

Owner

frsyuki commented Feb 20, 2013

@Midar I couldn't quite catch what Layer 2 means... do you have some examples? I guess Layer 2 has two options:

1. Layer 2 also doesn't distinguish strings from byte arrays.
2. Layer 2 implements its own type system on top of msgpack's type system.

Have you implemented a UTF-8 validator (which would be required by serializers)? I don't think it fits into 20 lines of C code...

Midar commented Feb 20, 2013

@frsyuki Layer 2 is what you put inside MsgPack. A protocol that says "at this place I expect an array, a string, some bytes". Without that knowledge from a protocol that is completely apart from MsgPack, you can't parse MsgPack, and that's really broken.

Yes, I have implemented UTF-8 checking, encoding and decoding. It's easily possible in 20 lines each (decoding and encoding). Here's both, with a lot of wasted space that could easily be reduced:
https://webkeks.org/git?p=objfw.git;a=blob;f=src/OFString.m;h=cc873dab3d178abd0f4ed94546a5b0d74add8171;hb=HEAD#l77

rasky commented Feb 20, 2013

Can you please explain WHY you need UTF-8 validations?

In unicode rich languages, you will convert UTF-8 into Unicode, and validation is performed by the language itself (or its standard library). No code to write.

In unicode poor languages, there is no Unicode data type, so you leave UTF-8 as-is.

Why do you ever need to include a UTF-8 validator?
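
In code, the point is roughly this (a sketch, not tied to any particular msgpack API): in a language with a native Unicode type, the "validator" is the decoder you already have.

def bytes_to_text_or_none(raw):
    # The language's own UTF-8 codec does the validation;
    # there is no hand-written validator here.
    try:
        return raw.decode('utf-8')
    except UnicodeDecodeError:
        return None

print(bytes_to_text_or_none(b'\xe6\x9d\xb1\xe4\xba\xac'))   # the decoded string
print(bytes_to_text_or_none(b'\x89PNG\r\n\x1a\n'))          # None: not text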

Owner

frsyuki commented Feb 20, 2013

@Midar MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.

Please don't think one type system works perfectly for all languages. All implementations have to manage the inconsistency between language types and msgpack types.

The question is which causes more trouble: a) projecting strings and byte arrays onto the Raw type, or b) projecting the Raw type onto strings or byte arrays.
I understand that supporting UTF-8 has a lot of merit. Why do you think the trouble caused by having UTF-8 is more manageable than the trouble caused by not having it?

Owner

frsyuki commented Feb 20, 2013

@rasky I suggested a way to handle binary-or-string type in dynamically typed languages without schema:

  • I suggest the Objective-C/JavaScript implementations adopt the following solution:
    • the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits from NSString
    • the object contains a validated UTF-8 string
    • if validation failed, it's nil or something else that tells us the validation failed
    • NSStringOrData#data returns the original byte array
Owner

frsyuki commented Feb 20, 2013

@rasky > Can you please explain WHY you need UTF-8 validations?

Because:

  • successfully stored data must be read successfully

Imagine that an invalid UTF-8 string is stored on a disk together with the information "this is a UTF-8 string".

Midar commented Feb 20, 2013

@frsyuki

MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.

This is exactly what I'm saying, which is why I don't get why on one hand you are against a string type which is required for a lot of languages, but on the other hand praise interoperability - which you just destroyed by not having a string type!

Why do you think the troubles caused by having UTF-8 is manageable compared to not having UTF-8?

You still haven't shown us where exactly UTF-8 would cause trouble for MessagePack. What exactly makes UTF-8 harder for you? Again, if you care about internationalization as much as about interoperability, you can convert it to some other non-Unicode encoding. If you use a Unicode encoding, you have these "problems", as you call them, anyway.

Midar commented Feb 20, 2013

the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits from NSString
the object contains a validated UTF-8 string
if validation failed, it's nil or something else that tells us the validation failed
NSStringOrData#data returns the original byte array

Oh great, now I have to implement another string class (remember: NSString is just a class cluster. If I subclass it, I have no implementation!) just because you have never heard about separation of layers? Sorry, but no, just no. If it stays this way, I just won't implement MsgPack, and I'm sure many others won't either. Not because they don't like the idea, but simply because you made it impossible to parse it in a sane manner.

Owner

frsyuki commented Feb 20, 2013

@rasky > I think that, if an implementer is going to communicate between a unicode-rich language and an unicode-poor language, it is up to the implementer himself to take care of these small details.

My proposal is that msgpack doesn't support a string type, but does support user-defined types. That means an implementer can add a string type if he needs it himself.
Do you think this does not work?

Member

methane commented Feb 20, 2013

So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-lanuage, you should recognize that the only way to have that is to add support for a string type.

I'm sorry about my poor English.
What I want to say is that msgpack should be designed for many languages, not only for languages that distinguish strings and bytes.

Midar commented Feb 20, 2013

@frsyuki Yes, I think this does not work, as everybody will come up with his own string type, and there will be no interoperability. Please stop claiming that not implementing a string type improves interoperability, when it clearly does the exact opposite, as has been stated by many, and is actually an issue which prevents many from using MsgPack or taking it seriously.

@methane Yes, I agree. It should work with all languages. But for that, a string type is required. For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.

Owner

frsyuki commented Feb 20, 2013

@rasky For example, in Ruby (1.9), the following code returns a String object with UTF-8 encoding information:

require 'uri'
s = URI.unescape("%DE")
p s.encoding

This easily happens in many applications, including Rails. Is this a string, or binary? I think it depends on how the application handles this object.

Additionally, the following code returns the same kind of object:

require 'msgpack'
s = MessagePack.unpack("\xA1\xDE")
p s.encoding

Midar commented Feb 20, 2013

@frsyuki And exactly that is the problem. It depends on how the application handles it! There is no way to know that without knowledge from the Layer 2 protocol! Why do you insist on ignoring basic principles of software design?

Member

methane commented Feb 20, 2013

For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.

Then how should they serialize such binary data?
When I send a string from Python to PHP, PHP may send it back to Python as the binary type...

Midar commented Feb 20, 2013

@methane By having an optional parameter for how it should be treated when encoding, by wrapping it in some object, etc. There are many ways to overcome this in languages which don't make a difference. There is absolutely no way to overcome not having a string type in languages which do make a difference.

Owner

frsyuki commented Feb 20, 2013

@Midar Whether an object should be a byte array or a string depends on the application.
I said the lifecycle of applications (programs) is shorter than that of data, and data should be isolated from applications. Do you agree with that opinion?

Applications can change, but data should not have to change at the same time. An application may come to consider data a byte array which it considered a string before. But we can't change stored data, and we can't update all the old code on the same network at the same time.

chakrit commented Feb 20, 2013

@methane you are describing the exact problem that can be solved by adding a proper string type.

Python -> STR_XXX -> PHP -> BIN_XXX -> Python

Now Python knows it is getting some binary.

And the same python server can then do:

Python -> STR_XXX -> Node.js -> STR_XXX -> Python

Now Python knows it is getting a UTF8 string.

Now, imagine the above scenario without the String type.

Python -> BIN_XXX -> PHP -> BIN_XXX -> Python

Now Python does not know whether it is getting binary or a string (because it does not, and should not, need to know that the source language is PHP).

Python -> BIN_XXX -> Node.js -> BIN_XXX -> Python

Now Python does not know whether it is getting binary or a string (because it does not, and should not, need to know that the source language is node.js).

We have this problem, and there's no way to tell, exactly because you don't have a String type in msgpack!

Owner

frsyuki commented Feb 20, 2013

@Midar @chakrit > For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.
This doesn't work.

For example, suppose a server program requires that data be serialized as the string type. Another program written in PHP can't tell strings from the binary type. Let's say it sends data as the binary type. Then the PHP program can't send requests to the server.

chakrit commented Feb 20, 2013

@Midar Whether an object should be a byte array or string depends on applications.

Yes, agreed. But because you don't have a String type, participating applications now get confused.

PHP -> BIN_XXX -> Python (oh hai, is that a string or a binary? I'm just gonna make it a binary and show gibberish to my user then.)

You are totally missing the point here. It's interoperability between applications, not how a single application should be architected.

I said lifecycle of applications (programs) is shorter than data, and data should be isolated from applications. Do you agree with opinion?

Yes. But you go one step too far there, making everything harder by not providing a way to specify a string.

Effectively a premature optimization.

Applications could be changed. But data should not be changed at the same time

Yes.

An application may come to consider data a byte array which it considered a string before.

There's the problem. If you had a string type, then all applications could tell whether it was a string or a byte array before.

But we can't change stored data. We can't update the old code in the same network at the same time.

As per reasoning above, all the more why there should be a string data type.

chakrit commented Feb 20, 2013

For example, a server program requires that data should be serialized in string type.

ah ha.

Another program written in PHP can't tell strings from binary type.

PHP will be able to tell if there is a "string" marker in msgpack telling it that the blob is a string.

Again, you have this problem exactly because you don't have String in msgpack

Let's say it sends data in binary type.

And it could then tell other languages such as Python whether that "binary" that PHP can't differentiate is meant to be treated as a string or as a giant blob of data.

Then the PHP program can't send requests to the server.

If you have String in msgpack, PHP could send a binary blob and tell the server "please treat this blob as a String".

But because you don't have String in msgpack, this is a problem.

Owner

frsyuki commented Feb 20, 2013

@chakrit > PHP could send a binary blob and tell the server "please treat this blob as a String".
That means the receiver needs to decide how to handle the received data even if it carries string-type or byte-array-type information. In other words, the receiver knows how to handle the data; the sender doesn't (have to) know.

Plus, the receiver can't (shouldn't) trust the received data. Thus in any case, the receiver should validate the data type.

Midar commented Feb 20, 2013

@frsyuki

Whether an object should be a byte array or string depends on applications.

No, this does not depend on the application, this does depend on the protocol!

I said lifecycle of applications (programs) is shorter than data, and data should be isolated from applications. Do you agree with opinion?

I don't see what that has to do with a string type, except that with a string type, data is interoperable and you can still read it years later.

Applications may consider that the data is a byte array which was considered string before. But we can't change stored data. We can't update the old code in the same network at the same time.

Oh dear, please tell me you meant something else. Are you really just dumping your internal structure instead of having a sane protocol? If you dump your internal structure, there is no interoperability anyway. If you don't dump the internal structure, but have a well-designed format to store the data, you want to have a string type for interoperability.

For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type. Let's say it sends data in binary type. Then the PHP program can't send requests to the server.

I'm pretty sure PHP has different types for strings and binary, or at least it can have one. If not, it's still possible to have MsgPackString and MsgPackData classes in which you wrap your data so the serializer knows what it is.

Having too much information is never a problem; you can just discard it. But you can't recover information that just is not there!
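
A minimal sketch of that wrapper idea, written in Python only for illustration (the class names come from the comment above; nothing here is an existing msgpack API): the application marks its intent, and a serializer in a language that doesn't distinguish strings from bytes could use the marker to pick the wire type.

class MsgPackString(object):
    # Marks a value the application wants serialized as the (proposed) string type.
    def __init__(self, value):
        self.value = value

class MsgPackData(object):
    # Marks a value the application wants serialized as raw binary.
    def __init__(self, value):
        self.value = value

def wire_type_for(value):
    # Invented dispatch: decide which wire type a serializer would emit.
    if isinstance(value, MsgPackString):
        return ('str', value.value)
    if isinstance(value, MsgPackData):
        return ('bin', value.value)
    raise TypeError('wrap the value so the serializer knows the intent')

print(wire_type_for(MsgPackString('hello')))
print(wire_type_for(MsgPackData(b'\x89PNG\r\n\x1a\n')))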

Member

methane commented Feb 20, 2013

@chakrit
datetime and bytes are also fundamental types. JSON can't serialize them.
But we can still use them with JSON via conventions like "this string is a base64-encoded PNG" or "this string is an ISO 8601 datetime".

"These bytes are a UTF-8 encoded string" in msgpack is the same thing.

How many types to support is a format design decision.
Msgpack decided to be like JSON, but with bytes instead of strings.
I think BSON is the format you want: it supports bytes, string, datetime and others.
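
As a sketch, that JSON-style convention applied to msgpack as it stands (the field names and conventions are application-defined; nothing here is part of the format itself):

import base64
import datetime
import msgpack

png = b'\x89PNG\r\n\x1a\n'
doc = {
    'thumbnail_b64': base64.b64encode(png),                    # binary as base64 text
    'created_at': datetime.datetime(2013, 2, 20).isoformat(),  # datetime as ISO 8601 text
}
packed = msgpack.dumps(doc)

# The receiver applies the same conventions in reverse.
loaded = msgpack.loads(packed)
print(base64.b64decode(loaded['thumbnail_b64']) == png)        # True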

Owner

frsyuki commented Feb 20, 2013

  • can the receiver use a received string as-is? I think it can't, because:
    • the receiver needs to convert it into a byte array if it needs a byte array.
    • the receiver needs to confirm the data is a string if it needs a string.
  • can the receiver use a received byte array as-is? I think it can't, for the same reasons as above.

To be secure and interoperable, applications shouldn't care whether the received data carries binary-type or string-type information.

chakrit commented Feb 20, 2013

@methane "bytes instead of string" --> this is not the case because msgpack did not specify this exactly with the protocol. Leaving it up to possible misinterpretation by parties as illustrated by the OP question.

If that is what msgpack intended, it should have said "THIS IS MEANT TO BE USED FOR STRINGS AND STRINGS ONLY" in the spec ... not provide a raw bytes description and expect everyone to treat it correctly as string.

Because then the driver writer wouldn't be able to implement it.

With JSON we're okay, because the spec says that it does not handle DateTime and that every string should be treated as a string, and so we build our own workarounds with that in mind.

But for msgpack this is confusing and hard to implement correctly (as illustrated by the OP's problem), because msgpack doesn't specify a String data type and we have to roll our own anonymous version by piggybacking on the raw data type, while being (misleadingly) led to think that we can have both String and Buffer, since the msgpack spec allows it and doesn't say what to do if we need the fundamental data type that is the String.

This totally breaks interoperability, because the driver implementer can't provide a sane way to decide whether to treat a blob as data or as a string, as is the case with ObjC. Keep in mind that String is a very fundamental data type on most platforms.

If msgpack wants us to treat these binary blobs as data from the start, it should just say so in the spec.

But if msgpack wants to provide blobs as well as strings, then there should be a protocol-level mechanism to differentiate between the two. Leaving it up to interpretation is bad for a standard wire protocol.

Midar commented Feb 20, 2013

@frsyuki

can the receiver use a received string as-is?

Yes. If there is a string type, that's possible, unlike without one!

The receiver needs to convert it into a byte array if the receiver needs a byte array.

Wrong. If there is a string and a binary type, it just gets the right thing unless someone broke the protocol.

The receiver needs to confirm the data is a string if the receiver needs a string.

Yes, it can verify if it's actually a string, but you'd need to do the same for a number, so this is not a valid argument.

can the receiver use a received byte array as-is? I think it can't

Yes, it can. Same as above.

To be secure and interoperable, applications shouldn't care the received data has binary-type info or string-type info.

Wrong. To be secure and interoperable, the protocol should be well-defined (i.e. saying either string or binary) and reject violations of the protocol.

I'm really getting tired of talking to a wall. I assume you don't really read what you write, because you keep bringing up arguments for a string type, only to then say they are against a string type, even though they're clearly for one.

Maybe we have a communication problem here and you don't know what interoperable means? Interoperable means that a format can be read by different applications in different languages - something you try to prevent by not having a string type. Interoperable does not mean just dumping your internal state!

rasky commented Feb 20, 2013

@methane then why does your Python msgpack library accept Unicode strings as input? The answer is simple: because strings are a fundamental data type available in all major languages. By discarding the type information, you're irreparably losing information that can't be reconstructed.

saki7 commented Feb 20, 2013

@Midar says:

I'm really getting tired of talking to a wall. I assume you don't really read what you write, ...

This is not a polite statement. I think @frsyuki and the other collaborators who agree with @frsyuki are trying to understand your problem. But they still hold a firm opinion which is against yours.

I understand both @Midar and @frsyuki 's thoughts, but my opinion is as follows:

This is an application layer problem. The application must be aware of the encoding which it deals with, not the protocol.

Please note that this is purely my personal answer, and it carries no political or official meaning, since I don't belong to the MessagePack developer team.

@Midar's opinion is like: "We must assume anything we receive is definitely correct."
I disagree. It's not the data that decides; we do. Our application decides whether the data is correct or not (or whether it is in a certain format).

Owner

frsyuki commented Feb 20, 2013

@Midar Sorry, sometimes I couldn't understand what you meant. But I'm not kidding.

it just gets the right thing unless someone broke the protocol.
it can verify if it's actually a string, but you'd need to do the same for a number, so this is not a valid argument.

Anyone can break the protocol. I think not having a string type is better for managing the following two problems: 1) how to handle a broken protocol, and 2) how to prevent broken protocols.

If there were a string type, and an application stored data as a byte array, and the application later changed its mind and handled the data as a string (this often happens, right? applications change as the business changes), the data would be considered broken. But it still represents the same data. I think it should not be considered broken.

The receiver should validate all arguments. It should not assume that all senders think the byte sequence is a string. The receiver knows whether it wants to handle the data as a string or as a byte array, which means it can validate the type.

My opinion is that sane protocol handlers should not have to tell strings from byte arrays. The application should know whether byte arrays or strings are needed.
Thus protocols don't have to tell strings from byte arrays.

To be secure, the protocol should be well-defined (i.e. saying either string or binary) and reject violations of the protocol.

I meant that protocols often change even if they're well-defined. A receiver should reject protocol violations, but I think changing strings to/from byte arrays should not be considered a protocol change, because applications decide the difference; the data itself is the same.

Member

moriyoshi commented Feb 20, 2013

The confusion arises when two different kinds of octets, strings and binaries, occur in the same set of objects; that is the case where differentiation is necessary, and it isn't addressed by msgpack by design. Why don't we blame the HTTP spec for not specifying a means to handle non-ASCII strings within the request URI? Because what the URI represents depends entirely on the content, as with HTML, and how it's encoded is actually specified by the HTML specification. That is how the design decision goes.

ganwell referenced this issue in ellisonbg/zmqweb on Feb 20, 2013: "Serialization, integration tests and travis #2" (Closed)

rasky commented Feb 20, 2013

@frsyuki can you explain why a "sane protocol handler" should NOT tell strings from byte arrays, but should tell floats from byte arrays? Floats are a sequence of bytes in PC memory; why should msgpack care about them?

chakrit commented Feb 20, 2013

@frsyuki regarding "My opinion is that sane protocol handlers should not tell strings from byte arrays"

If you insist on that, please do update the spec to properly codify that opinion, and mark the Objective-C handler as broken because it auto-converts buffers to String without the application developer's consent, so that it's much clearer how everything should be implemented.


That aside, I still want String in msgpack, as I see no reason why the application developer should need to worry about this conversion process.

This should be the job of the protocol handler, but it cannot do that easily, since the required type information is missing and must still be provided by the application developer by means of a schema -- which IMO is an ugly solution at best.

Midar commented Feb 20, 2013

@saki7

This is not a polite statement.

Sorry, I'm getting really frustrated from repeating myself over and over again and only being responded to with ignorance of the problem, which is so serious that it is actually PREVENTING ME AND OTHERS FROM USING MSGPACK AT ALL!

@Midar 's opinion is like: "We must assume anything we receive is definitely correct."
I disagree. It's not the data who decides. We do. Our application decides whether the data is correct (or, is in a certain format) or not.

This is not correct. This is not something about verifying, this is something about EVEN BEING ABLE TO PARSE AND STORE IT in some languages. You still need to verify it. This is not even the topic! It's about whether something is a string or some binary data and thus should be decoded into a string or binary.

Anyway, with your argumentation, why do we even have a type for numbers? We could just store it as binary. It's up to the application to interpret it correctly! And while we're at it, why not go to the next level and only use binary, so we don't need MsgPack at all? That seems to be what you want.

And what kind of thing something is really is part of the protocol and not the application…

@frsyuki

Regarding 1, if there were a string type, and an application stored data as a byte array, and the application later changed its mind and handled the data as a string (this often happens, right? applications change as the business changes), the data would be considered broken. But it still represents the same data. I think it should not be considered broken.

The problem is that it output it as binary instead of string in the first place! Nothing like that would have happened if it had used the string type from the start!

The receiver should validate all arguments. It should not assume that all senders think the byte sequence is a string. The receiver knows whether it wants to handle the data as a string or as a byte array, which means it can validate the type.

Yes, it has to validate the type. But just because it has to validate the type DOES NOT MEAN THE TYPE HAS TO BE UNSPECIFIED. If you want that, why even use MsgPack? Then you don't need number, bool, etc. - just binary.

My opinion is that sane protocol handlers should not have to tell strings from byte arrays. The application should know whether byte arrays or strings are needed.
Thus protocols don't have to tell strings from byte arrays.

Which makes it totally impossible to parse it as just a single layer; instead you need a schema and thus knowledge about the inner layer. But why even talk about that anymore? It seems you clearly hate everything about good protocol or software design, otherwise you would not so fiercely defend an approach that breaks with decades of software and protocol design principles (principles that were the reason for the success of protocol stacks like TCP/IP).

Anyway, whatever. I give up. People who only know limited languages seem to be fine with it and are unwilling to interoperate with others. I'll just give up on MsgPack then. Good luck to @rasky, @chakrit and others who tried to talk some sense into people who never dealt with a language that does make a difference between strings and binary, but it seems there are a few people who only want to use it for unportable stuff like dumping internal state, and sadly, it seems the MsgPack author is among them, so for me personally, MsgPack is just useless and I'll move on to something more useful.

rasky commented Feb 20, 2013

@frsyuki if msgpack doesn't want to handle Unicode, then my request is that ALL msgpack bindings refuse to encode Unicode strings and force people to use custom encoding/decoding code. This way, application developers will be aware of the design choice.

This would cause an uproar, but I think it's exactly what you want. People would simply start using incompatible custom conventions for handling Unicode strings, and a big mess would arise. Or everybody would just agree on a single custom convention, thus making it "standard" for everybody but the msgpack development team. That would be fine as well, in my opinion.

Member

methane commented Feb 20, 2013

@rasky

@methane then why your Python msgpack library accepts Unicode strings in input?

My implementation packs tuples into msgpack arrays, and unpacks them into lists (since 0.3).
That is because I feel I can naturally map unicode and tuple to bytes and array.

saki7 commented Feb 20, 2013

@rasky says:

@frsyuki can you explain why a "sane protocol handler" should NOT tell strings from byte arrays, but it should tell floats from byte arrays? Floats are a sequence of bytes in the PC memory, why should msgpack care about them?

I think that is because there is only one thing which the float type represents. It is clearly stated in the standards. And one more important thing we must remember: IT IS A PRIMITIVE TYPE.
If the stored data for a float actually contained invalid bytes, we would just receive an invalid float value after the decoding process. I think there's no problem with that, because it's our application's fault that it didn't store valid data. And the mistake doesn't cause any serious problem.

If multiple data types for floating point existed, like the multiple encodings for string types, maybe there would be similar problems. But I still think the floating-point example is another story.

Midar commented Feb 20, 2013

@saki7

If multiple data types for floating point existed, like the multiple encodings for string types, maybe there would be similar problems. But I still think the floating-point example is another story.

Actually, there are different floating-point types. They differ in length (float, double, long double) and format (IEEE, VAX, etc.). So can we get rid of float, number, etc. now and just replace everything with binary? According to you and others, nobody needs to know the type at the protocol level anyway, as the application knows it. So why not get rid of all that bloat and just replace everything with binary? And while we're at it, an array is also just binary. So why not replace MsgPack with binary? That has to be what you guys dream about. It's just your argument followed through.

Owner

frsyuki commented Feb 20, 2013

@rasky > Floats are a sequence of bytes in the PC memory, why should msgpack care about them?
In terms of the type system, I don't have a strong opinion on why msgpack distinguishes floats from integers. But we can store integers in fewer bytes if the format knows a value is an integer.

What I care about is that the string "a" and the byte array "a" are exactly the same byte sequence, and applications should decide which one it is. The data should not describe which one it is itself.

rasky commented Feb 20, 2013

@frsyuki you are wrong. In all Unicode-rich languages, the string "a" and the byte array "a" have TOTALLY different representations in memory. In fact, the string "a" is a sequence of code points, not bytes, so the sentence "the string 'a' contains a byte sequence" has no meaning whatsoever; it is just wrong reasoning.

In Python 2.x, the Python interpreter implements it internally as a UTF-16 sequence, so it corresponds to the byte sequence 61 00 (on a little-endian platform).

Owner

frsyuki commented Feb 20, 2013

@chakrit > If you insist on that, please definitely do update the spec to properly codify that opinion and mark the objective-c handler as broken because it auto-converts buffers to String without the application developer's consent so it's much clearer on how everything should've been implemented.

Sorry, I don't know much about Objective-C. But it should provide a way to get at the original binary. What do you think, @chrishulbert?

rasky commented Feb 20, 2013

@methane then I think that, given this discussion, your Python binding is wrong and shouldn't convert Unicode into byte arrays. It just makes bugs happen. It should be removed and an exception raised instead. Since it's up to the application to handle Unicode (this is what @frsyuki says), please let Python application programmers handle it; don't have an automatic behavior that can be wrong.

The same applies to all bindings for languages that have a native Unicode data type. They should refuse to encode Unicode strings (since they would be losing important information) and let application programmers handle it. @frsyuki, do you agree with this solution?

chakrit commented Feb 20, 2013

@frsyuki Yes, I think it is very easy to provide binary by default, as IIRC that is already the default thing you get from the framework that handles internet connections.

saki7 commented Feb 20, 2013

@Midar wrote:

Actually, there are different floating point types. There's the difference length (float, double, long double) and format (IEEE, VAX, etc.). So can we get rid of float, number, etc. now and just replace everything with binary? According to you and others, nobody needs to know the type on the protocol anyway, as the application knows it. So why not get rid of all that bloat and just replace everything with binary? And while we're at it, an array is also just binary. So why not replace MsgPack with binary?

I know there are various floating-point representations in this world. What I actually wanted to say is that the data format which MessagePack handles is a single type. It's just a "floating point type". It is written in "Format specification - MessagePack" on Confluence.
Other float types which do not fit this specification must be converted to fit this format. That's left to each MessagePack language binding. And this is what we are actually talking about. It's a layer problem, too.

rasky commented Feb 20, 2013

@chakrit it's not that easy, because you want an NSMutableDictionary out of a msgpack map, and NSMutableDictionary only wants NSString keys. See this fork: https://github.com/nferruzzi/msgpack-objectivec/commits/

Midar commented Feb 20, 2013

@frsyuki No, it should not, because Foundation can handle Unicode and stores the string as Unicode. That representation is - like in every other language supporting Unicode - system dependent. That can be UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE…

Oh, and of course, if you say it's UTF-8, you can't use arbitrary binary, as that would be invalid UTF-8 and you won't get any object at all!

@saki7 Yes, there is a single type MessagePack handles. So why not have a single type for strings as well? Everything that is a different representation needs to be converted, just like for floats.

rasky commented Feb 20, 2013

@saki7 you are totally missing my point. My question is: WHY is there a floating-point type in the specification at all? @frsyuki says that it makes sense NOT to have a string type in the specification. So why should there be a floating-point type?

chakrit commented Feb 20, 2013

@rasky I see; then I think we supposedly need a custom NSDictionary implementation as well, since (even if msgpack has a string data type) it'd still be possible to get non-string keys in msgpack, right? So the handler needs to handle that regardless.

Midar commented Feb 20, 2013

@rasky @chakrit That is only half-true: NSDictionary takes any object as a key which implements -[hash], -[isEqual:] and -[copy]. All of those are true for NSData. But it is very impractical. You would need to create an NSString, get it into some buffer in some encoding, then create an NSData for that buffer and use that as the key. And that's almost as good as not having any MsgPack support ;).

saki7 commented Feb 20, 2013

@rasky wrote:

@saki7 you are totally missing my point. My question is: WHY is there a floating point type in the specification at all? @frsyuki says that it makes sense NOT to have a string type in the specification. So why should there be a floating point type?

I think that's quite a philosophical question. Isn't it just there for convenience or performance? Remember, it's a primitive type.

chakrit commented Feb 20, 2013

@saki7 I think we are all trying to tell @frsyuki that having a String type is a big convenience compared to the little performance/abstraction gain of leaving it out.

ganwell commented Feb 20, 2013

I'd like to go in another direction. If msgpack stays as it is, how can I encode this? base64, really?

msgpack.loads(msgpack.dumps(bytes(b'\xb9'), encoding='utf-8'), encoding='utf-8')

-> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte

I set utf-8 just because I expect the fewest failures with it.

  1. Strings are a totally ubiquitous fundamental datatype for almost all programs.
  2. re: Sadayuki "lifecycle of applications (programs) is shorter than data, and data should be isolated from applications"
    Yes, but strings are a fundamental datatype.
    The lifespan of UTF8 as a character encoding is probably longer than the lifespan of most data.
  3. Some languages don't intrinsically support UTF8. So what? It is very feasible to implement a UTF8->XYZ decoder in the msgpack implementation for that language.
    This problem will be the same either way, except worse without recognising strings as a fundamental datatype in the msgpack spec, because the user then has to do everything.
    And they currently need either a schema or to GUESS whether an array of bytes is a string or something else.
    Since people actually need to transmit strings, they will do it anyway if msgpack does not support it, but in an ad-hoc way which will make things more difficult and error-prone for everybody.
  4. re: Sadayuki: "UTF-8 has verbosity as well. 0x2F could be 0xC0 0xAF. Should deserializers reject these bytes? Or normalize into another character?"
    Why? If msgpack doesn't support UTF8, then people will use byte arrays to hold the same messages, with the same results (except a higher probability of errors, since everybody needs to write their own encoders/decoders).

Summary:
With the current design of msgpack, when we discover some "raw" bytes, we have NO information about what's in that data. We don't know if it is a UTF8 string, or even any type of string. Maybe it's a JPEG image.

So we have to either know the schema (and not needing a schema was supposed to be an important quality for msgpack, right?), or we have to GUESS what the type of data is (like the Objective-C parser hack I made).
This situation is quite unacceptable for such a fundamental datatype as strings, and could easily be solved. Any problems with the solution (e.g. with languages that have poor datatype support) are problems that we already have, so we have nothing to lose by fixing this.
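
For illustration, that GUESS amounts to something like the following Python 3 sketch. It is a heuristic only: binary data that happens to be valid UTF-8 would be mislabelled as text.

def guess(raw):
    # heuristic: assume anything that decodes cleanly as UTF-8 was meant as text
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw                                 # otherwise keep the bytes

assert guess(u"東京".encode("utf-8")) == u"東京"
assert isinstance(guess(b"\xff\xd8\xff"), bytes)   # JPEG header bytes stay binary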

saki7 commented Feb 20, 2013

@chakrit wrote:

@saki7 i think we are all trying to tell @frsyuki that having a String type is big convenience over the little performance/abstraction gain.

Having encoding information stored in the data does not result in better convenience, but it has a serious problem. There might be various systems / various language bindings dealing with the same data. Not every language has a strong/safe type system or a string type (with full encoding support).
Ultimately, the application must validate the data. Not the protocol.

rasky commented Feb 20, 2013

@saki7 let's say that MsgPack has a string type and you are using a Unicode-poor language. How is this a serious problem? You get the binary UTF-8 representation. Period. How is this a problem?

@saki7
How is that not already a problem?

If we distinguish, for example, UTF8 encoded strings in msgpack from byte arrays, then it's a SMALLER problem because at least they know how to interpret the incoming data.

ganwell commented Feb 20, 2013

Sorry my bad.

msgpack.loads(msgpack.dumps(bytes(b'\xb9')))

of course works.

chakrit commented Feb 20, 2013

@saki7 no, there is no need to have encoding information stored in the data.

If we all agree to treat the blob as UTF-8, msgpack only needs to add a single type marker to indicate a string blob. (0xXX something)

Problem solved.

No change or encoding information beyond this is needed inside the protocol. It is really that simple.
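
A minimal Python 3 sketch of that single-marker idea; the 0x01/0x02 markers are invented for illustration and are not part of any msgpack wire format:

def pack_value(value):
    if isinstance(value, str):          # text: encode as UTF-8, tag as string
        return b"\x01" + value.encode("utf-8")
    if isinstance(value, bytes):        # raw bytes: tag as binary, pass through unchanged
        return b"\x02" + value
    raise TypeError("only str and bytes in this sketch")

def unpack_value(data):
    marker, payload = data[0], data[1:]
    if marker == 0x01:
        return payload.decode("utf-8")  # decoder knows this is text
    if marker == 0x02:
        return payload                  # decoder keeps raw bytes as-is
    raise ValueError("unknown marker")

assert unpack_value(pack_value(u"東京")) == u"東京"
assert unpack_value(pack_value(b"\xb9")) == b"\xb9"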

Midar commented Feb 20, 2013

@saki7 Why would this require a strong type system? That's just nonsense. It's not like you add extra info for the encoding; you add a new type for strings and choose one encoding, like UTF-8, that is used on the wire. What you use internally is your decision.

saki7 commented Feb 20, 2013

Again, what if the data bytes were invalid even though MessagePack had an encoding bit? Ultimately, your application must be aware of the actual type stored in the data. Thus, supporting an encoding for strings does not provide further convenience.

But still, support for primitives such as floats must be there, for convenience or better performance, as I said.

Member

kuenishi commented Feb 20, 2013

Hi all, I'm the maintainer of msgpack-erlang. Erlang does not have a native string type. If a string type is added I can't maintain msgpack-erlang any more. Let alone damn Unicode.

I don't like this kind of "Hey, I need type X for the msgpack spec" where X = time, string, date, and anything you like. I like Sada's minimal design choice of types. The more types msgpack supports, the more msgpack loses language interoperability. To be honest, I want types for atom, tuple, pid (Erlang pid), BigInt, and function.
OCaml and Haskell guys might want polymorphic variants or algebraic data types. Why don't we stop arguing about NATIVE types and move up to application layer design, or hack msgpack-idl?

chakrit commented Feb 20, 2013

@saki7 That is beside the point. Applications must validate all data types, be it float, number, blob, whatever. This has nothing to do with having a String type or not.

I don't think anyone here is advocating that msgpack implements Unicode. It just needs to support carrying Unicode on the wire in a good way that the protocol handler can implement. Do you understand what I'm trying to say here?

Midar commented Feb 20, 2013

@saki7 You have to handle that right now, too. So where is the difference from now? The only difference is that right now you have to do it, whereas with a string type the library could do it for you.

@kuenishi There IS a difference between lists (which are used for strings) and binaries in Erlang. For binary, you could use <<>>; for strings, you could use lists. So no, it's not impossible for Erlang. But it is for many other languages the way it is right now.

Owner

frsyuki commented Feb 20, 2013

@Midar > Oh, and of course, if you say it's UTF-8, you can't use arbitrary binary, as that would be invalid UTF-8 and you won't get any object at all!

I think this is an important part, but I couldn't catch what you meant.

UTF-8 strings should be read by unpackers as UTF-8 strings even if they include invalid bytes. Then the application should decide whether it rejects the data or not, whether it normalizes the data or not, how to normalize the data, etc. I believe the handling depends on the application, and the msgpack library can't decide it.
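
For illustration, here is what those application-level choices look like in Python once the unpacker hands back the raw bytes of a string field (a sketch only, not part of any msgpack API):

raw = b"ab\xdfcd"   # contains 0xdf, which is invalid UTF-8 here

# reject: let the decode error propagate to the application
try:
    text = raw.decode("utf-8")                    # strict is the default
except UnicodeDecodeError as e:
    print("rejected:", e)

# normalize: substitute invalid bytes with U+FFFD
text = raw.decode("utf-8", errors="replace")      # 'ab\ufffdcd'

# pass through: keep the bytes and postpone the decision
blob = raw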

saki7 commented Feb 20, 2013

+1 for @kuenishi.

Determining which native types to support is, actually, a philosophical question. That's left to the MessagePack developers and @frsyuki.

chakrit commented Feb 20, 2013

@kuenishi You do not have to actually process the string.

If you are writing a user-facing application that must display a string coming from msgpack, then you need string encoding anyway. So you are good.

If you are not writing a user-facing application that must display a string, then you can simply store it as a binary blob. But at least the information that this particular blob should be treated as a String is still there, should the protocol handler need it (or any downstream consumer need to process the data further).

There is a difference there.

Midar commented Feb 20, 2013

@frsyuki

UTF-8 strings should be read by unpackers as UTF-8 strings even if they include invalid bytes.

Actually, no. Because most languages refuse to even store invalid Unicode. Most parse UTF-8 and convert it to an internal representation - which then can't store invalid UTF-8. And this would be a gigantic waste of space: even if there were a way to store invalid UTF-8 (e.g. by using codepoints that are above the 21 bits of Unicode), it would take 4x the space when the internal representation is UTF-32. And you talk about efficiency?!

Member

kuenishi commented Feb 20, 2013

@Midar the problem is not the difference between binaries and lists, but that the code can't tell a string from a list of integers. It doesn't matter if it is ASCII or Unicode.

Owner

frsyuki commented Feb 20, 2013

@rasky

please let Python application programmers handle it, don't have an automatic behavior that can be wrong.
The same applies for all language bindings for languages that have a native Unicode data type. They should refuse to encode Unicode strings (since they would be losing important information) and let application programmers handle it. @frsyuki do you agree on this solution?

It's an interesting idea. I think it's possible to provide an option to reject unicode strings.

Member

kuenishi commented Feb 20, 2013

@chakrit I'm sorry, I don't understand your point. I think whether it is user-facing or not doesn't matter.

Midar commented Feb 20, 2013

@kuenishi You could still do something like {data, <<foo>>} vs. ['f','o','o']. It usually would assume you want to create a string, unless you need the former. Same for deserializing. Erlang's pattern matching makes that very easy. But IIRC there was some difference between <<>> and [].

saki7 commented Feb 20, 2013

I'll refrain from making further comments, since I have described every reason for my opinion. Please do not argue against me, but instead think about what the better design is. I agree with @frsyuki's thoughts.

chakrit commented Feb 20, 2013

@kuenishi exactly.

If unicode does not have any meaning to your application, then you can just treat it as another Blob from your point of view.

Member

kuenishi commented Feb 20, 2013

@Midar writing pattern matches for every msgpack-flavoured term would make programming stupidly hard for users.

nurse commented Feb 20, 2013

As an i18n committer of Ruby, I think it should be an optional string (or encoding) annotation if MessagePack gets a string type.

First of all, a protocol MUST have error handling. A bad example is HTML4: it doesn't define error handling around parsing errors.

Now, some people complain that MessagePack doesn't have a string type and that this forces users to handle the string/binary distinction themselves. Even if that is by design and MessagePack only treats raw bytes, it is natural that people complain about it.

But will adding a string type solve it? There are some problems.

First, there are some languages that don't differentiate between raw binary data and text strings. Ruby 1.8, Perl without the utf8 flag, JavaScript, OCaml, C without wchar_t, PHP 5.2.0 or prior, and so on don't have the distinction, and they would need some schema-based translator for MessagePack with String.

Second, the difference between string and binary is sometimes ambiguous. For example HTTP logs: they are usually strings and you want to treat them as strings in MessagePack. But once someone attacks your servers, those logs may contain invalid bytes.

Third, a sender may send invalid UTF-8 strings. If it is a string type, that is strictly invalid data. But for archival it must be saved as is. This is a difficult problem in such a scheme.

Therefore MessagePack should work without a string type. But for the convenience of Unicode users it may have an annotation expressing that a binary shall be a string, for when the sender knows it should be treated as a string. If the receiver knows the annotation, it can treat the data as strings. If not, it still works with binaries. Conversely, if the sender doesn't know about the string type, a receiver may treat the data as a string or simply ignore the annotation. This also allows mixing MessagePack without String and with String.
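
A minimal Python sketch of that fallback behaviour, assuming a hypothetical "annotated" flag meaning "the sender marked this raw value as a string":

def decode_raw(payload, annotated):
    if annotated:
        try:
            return payload.decode("utf-8")   # receiver that understands the annotation
        except UnicodeDecodeError:
            return payload                   # archival case: keep invalid bytes as-is
    return payload                           # receiver that ignores the annotation

assert decode_raw(u"東京".encode("utf-8"), annotated=True) == u"東京"
assert decode_raw(b"\xff\xd8\xff", annotated=False) == b"\xff\xd8\xff"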

Midar commented Feb 20, 2013

@kuenishi Well, if you really want to differentiate, that is. Let me just ask you: do you differentiate between a number and a float? I don't think so, as Erlang does not. So why would you differentiate between binary and string if for Erlang they are the same, but not between number and float? For you, nothing would really change. You would still just not care whether it's binary or a string.

Owner

frsyuki commented Feb 20, 2013

@Midar

Actually, no. Because most languages refuse to even store invalid Unicode. Most parse UTF-8 and convert it to an internal representation - which then can't store invalid UTF-8.

Then I have to say that some languages store strings as is, without converting them into UTF-8 or UTF-16, and msgpack focuses on cross-language use. Anyway, if msgpack had a string type, serializers should validate strings before storing them.

Member

kuenishi commented Feb 20, 2013

@chakrit If the unicode type is forced into binary, then how can I forward or send back that object to another language? In Erlang a binary is encoded as binary; it can't be serialized as unicode without an annotation.

chakrit commented Feb 22, 2013

msgpack is not casual data format like JSON

[screenshot of the msgpack.org front-page tagline]

http://msgpack.org/

mattn commented Feb 22, 2013

but could not for technical reasons (encoding, size, speed)

JSON is trusted because it doesn't have low-layer things. I can't see from that image that MessagePack is a replacement for JSON.
Don't you see that as just PR?

chakrit commented Feb 22, 2013

If you ever wished to use JSON for convenience (storing an image with metadata) but could not for technical reasons (encoding, size, speed...), MessagePack is a perfect replacement.

mattn commented Feb 22, 2013

I think it's hype. It should be: "If it is hard for you to encode binary or deal with low-layer protocols, MessagePack will be a PERFECT REPLACEMENT FOR THE HARD WORK."

mattn commented Feb 22, 2013

JSON is often used for conversation between server and client. Because JSON is trusted, it does not have anything that makes it crash or become insecure (if you use JSON.parse or JSON.stringify).
Do you want to use msgpack to communicate with a stranger who doesn't know the structure of the data?
MessagePack handles binary protocols. If you want to communicate string or image data, you should design the data format using the raw field in an upper layer.

Why do you want to change the spec? Why don't you design that upper layer?

Member

methane commented Feb 22, 2013

JSON can't contain binary. In such cases, msgpack can be a perfect JSON replacement.
You can pack all strings as raw.
msgpack is just a container. How to use it is the application's responsibility.

I'm not against adding an optional hint to the msgpack spec.
There is demand for mixing binary and unicode in one message.
But JSON can't be used in such cases either.

chakrit commented Feb 22, 2013

@methane yeah, i hope we're going forward with the hint addition. right now it's hard to implement a generic handler correctly.

najeira commented Feb 22, 2013

I would vote for frsyuki's proposal.
That proposal allows a new reader to read old data with a compatibility mode.

cabo commented Feb 22, 2013

I think we have had a pretty good discussion so far, even if it may not look like that so much :-)

I have picked up @frsyuki's proposal, simplified it somewhat, and put it into a draft next version of the Internet-Draft. (Yes a draft draft.) Please see there for the technical content. The main reason I did the small change is that I'm not sure small binary values are frequent enough to merit complicating the short-string case, so I assigned all 32 code points to short strings. Binary values will then always use raw8. I think this is about as simple as it can get.

Enjoy at http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre1.txt

I plan to listen some more to the discussion here the next couple of days.
Due to the timing in the runup to the Orlando IETF, I will have to submit the final version of the Internet-Draft on Monday. So I would be happy if we could get a bit of closure on the string representation issue here before then.

The draft draft has an appendix laying out some additional work, probably to be tackled after Monday. I'm not quite sure about the best venue for discussing these. Of course, we could open/hijack msgpack issues for these as well.

Owner

frsyuki commented Feb 22, 2013

I'm thinking about the API design. This is a different problem from the format design, but it affects the format design (meaning we need to think about the API as well if we discuss the format further).
This design needs to handle the following problems (at least); a hypothetical sketch of these options follows the list:

  • How to implement serializer/deserializer:
    • A) in languages which clearly distinguish strings from binaries (e.g.: Objective-C, JavaScript)
    • B) in languages which don't distinguish strings from binaries (e.g.:
    • C) in languages which optionally distinguish strings from binaries (e.g.: Ruby)
    • These implementations need to take each other into account: "if B uses this implementation, how do we implement A?"
  • Error handling:
    • Whether serializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not
    • Whether deserializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not
    • Whether deserializer should normalize (replaces invalid bytes as UTF-8 in strings) strings or not
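
To make the error-handling questions concrete, a hypothetical API sketch in Python; the function and option names (pack_text, unpack_text, validate, on_error) are invented for illustration and do not belong to any existing msgpack binding:

def pack_text(s, validate=True):
    # serializer side: optionally reject text that cannot be encoded as valid UTF-8
    return s.encode("utf-8", errors="strict" if validate else "replace")

def unpack_text(data, on_error="raise"):
    # deserializer side: raise, normalize, or hand the raw bytes to the application
    if on_error == "raise":
        return data.decode("utf-8")
    if on_error == "replace":
        return data.decode("utf-8", errors="replace")
    if on_error == "bytes":
        try:
            return data.decode("utf-8")
        except UnicodeDecodeError:
            return data
    raise ValueError("unknown on_error policy: %r" % on_error)

assert unpack_text(b"ab\xdf", on_error="bytes") == b"ab\xdf"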

cabo commented Feb 22, 2013

A) in languages which clearly distinguish strings from binaries (e.g.: Objective-C, JavaScript)

This is pretty obvious, I think.

B) in languages which don't distinguish strings from binaries (e.g.:

Most of these are strongly typed, so you can find some way to put this information into the type system.

C) in languages which optionally distinguish strings from binaries (e.g.: Ruby)

(In Ruby, the distinction is not at all optional: binary is in encoding BINARY, text is UTF-8.)

Whether serializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not

I think an implementation should not send invalid data. Whether that means the serializer needs to validate or that it can rely on its callers to supply reasonable data is an implementation detail.

Whether deserializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not

Again, that depends on the expectations of the caller. If the callers are able to handle invalid data, give it to them.
If they would blow up, raise the exception in the msgpack deserializer.

Whether deserializer should normalize (replaces invalid bytes as UTF-8 in strings) strings or not

Never. You might want to offer a "hand me anything in raw form" version for debugging these situations, but "defensive programming" is a mistake. Just blow up.

(I read this assuming it was about errors in the UTF-8 encoding rules. I think we all agree there should be no Unicode normalization or normalization checking in the msgpack serializer/deserializer.)

cabo commented Feb 22, 2013

You might want to offer a "hand me anything in raw form" version

... and you want this in a msgpack API to achieve the level of backward compatibility that the recent proposals like @frsyuki's and 01pre1 provide.

Owner

frsyuki commented Feb 22, 2013

@cabo Let me add one thing regarding B:
In Ruby (at least), users can end up with an invalid byte sequence in a String object which carries UTF-8 encoding information. For example:

require 'uri'
s = URI.unescape("%df")
p s #=> "\xDF"
p s.encoding #=> #<Encoding:UTF-8>

This often happens in Ruby on Rails programs. This is just an example, but it does happen. So in Ruby programs, any String object could contain an invalid UTF-8 byte sequence.

cabo commented Feb 22, 2013

Arguably, that is a bug in URI.unescape. Of course, the whole issue of character encoding in URIs is muddy, so I won't blame the authors of that code. But clearly, a URI.unescape API should contain methods to handle the uncertainty created by real-world browsers. I don't think handling this issue is a concern of an unrelated piece of software like msgpack.

Owner

frsyuki commented Feb 22, 2013

@cabo I see what you mean. I think I need to change the sentences to be clear:

  • A) in languages which clearly distinguish strings from binaries and strings can contain only valid Unicode characters (e.g.: Objective-C, JavaScript, Python)
  • B) in languages which don't distinguish strings from binaries or do distinguish but strings can contain invalid Unicode characters (e.g.: Ruby, Perl, PHP, C++, Erlang)
Owner

frsyuki commented Feb 22, 2013

@cabo Please understand one thing: I do NOT want to have two similar specifications (BinaryPack and MessagePack). It's very confusing and no one will be happy. I'm prepared to extend the MessagePack spec if it's appropriate. So I would ask you NOT to finalize the RFC without consensus.
I think you have your own goal while I have my goal. But let's work to avoid the worst case.

I almost agree with adding a string type extension. But it means I'm making a compromise on my goal. I need to build new, consistent semantics with the string type extension, which was not expected originally.

Now I'm thinking about the API and the format which works well with that API.

@mattn

Do you want to use msgpack to communicate to someone stranger who doesn't know the structure of datas?
MessgePack treat binary protocols. If you want to communicate string or image data, you should do design the data format using raw field as upper layer.

Why you want change spec? Why you don't do design of upper layer?

No. Please read the information on the MessagePack site before assuming that everybody else is using it wrong:

http://wiki.msgpack.org/display/MSGPACK/Design+of+RPC

Because every MessagePack message contains the type information side-by-side, clients and servers don't need any schemas or interface definitions basically. This is handy for utilizing it both in dynamically typed and statically typed languages.

http://wiki.msgpack.org/display/MSGPACK/Overview

MessagePack is an efficient object serialization library, which are very compact and fast data format, with rich data structures compatible with JSON.

Note: rich data structures compatible with JSON. This means something. Otherwise, why have any types? Why not just have everything as "raw"? Why have MessagePack at all?

Owner

frsyuki commented Feb 22, 2013

I think we need to think about the documentation separately from the spec. If the documents are wrong, the documents should be fixed.

I originally created MessagePack to develop a distributed storage system (=backend program) in C++ and Ruby (1.8). In this case "like JSON" was true because I didn't have to tell strings from binaries. JSON has only strings while MessagePack has only raws. Thus there was no problem replacing all strings with raws, because these languages don't have to tell strings from binaries.

I created the website (and some others such as @kzk added some documents). I haven't used Python or JavaScript with MessagePack (I've used MessagePack in an Objective-C program, but I didn't have any problems because I used a schema to project msgpack types into Objective-C types). I need to change something. It could be the documents or the spec, but you shouldn't mix up these two separate problems.

@frsyuki
I'm not mixing up any problems; rather I'm pointing out that the use case that suits some people is not necessarily the use case that suits everyone. If we can find a solution that works for most people without causing too many new problems, that's IMO much more useful than joining the discussion only to say "you are using MessagePack wrong, just make a schema or else use BSON/etc".

JSON has only strings while MessagePack has only raws. Thus there was no problem replacing all strings with raws, because these languages don't have to tell strings from binaries.

That's great, but it's also completely against your stated goal of cross-platform, cross-language compatibility, because decoders have no idea what to do with the raw bytes which might be in any random format.
I think it's pretty obvious that strings are an important datatype which deserve better support than "dump the bytes and hope that the receiver knows (or can guess) what encoding you use".

cabo commented Feb 22, 2013

@frsyuki:

@cabo Please understand one thing: I do NOT want to have two similar specification (BinaryPack and MessagePack).

That is exactly why I came here to make sure we can find common ground. In the process of doing so, we need names for the various variants being discussed. So my current variant is called 1pre01.

I did 01pre1 because I think it does address all concerns raised here and is simpler than your previous proposal. I wrote it up because it is hard to discuss unless written up. I didn't submit it to the Internet-Drafts directory because I want to discuss it here first, so it's just on my personal web server. I need to finish discussing by Monday, though, and that is when I'll send a -01 to the Internet-drafts directory.

I continued calling the current spec "binarypack" because I didn't want to misappropriate the well-known and well-regarded "msgpack" label. If, in the course of defining this, we reach agreement, I'm much happier to use the "msgpack" name. Actually I called the most recent strawman BinaryPack1pre01, because while we are still in the process of nailing down things, it is good to have a name.

In the end I'd like to have a spec that both solves the problem well that I'm trying to solve and works well for the msgpack community. (If that is not possible, there will be a spec that solves the problem well, and I'll call it something else. But right now it seems a common spec is possible.)

kenn commented Feb 23, 2013

@cabo Even when we reach a solid agreement on a particular spec, I don't think bringing it to the IETF is a good idea, be it MessagePack or BinaryPack. At least not this early. Just because we agree in theory doesn't mean that the new spec has been proven to work flawlessly in the wild.

As the inventor of msgpack has clearly stated that he has no interest in taking the discussion over to a standards committee, we should respect his intent. Especially when such a move could easily be seen as a political tactic to change the game to your advantage.

To clarify, I'm not opposed to bringing it to the IETF in the future - it's just that now is not the right time. We'll just know when it's appropriate, when it's appropriate. That's what happened with JSON - it was there since 2002, but RFC 4627 was only established in 2006, well after it was already a de facto standard. And most importantly, Douglas Crockford did it willingly. We should wait for now and hopefully @frsyuki will become open to working with the IETF some day.

Contributor

kazuho commented Feb 23, 2013

+1 to @kenn

@cabo
Please do not get me wrong. I appreciate your efforts on adding a string type to MessagePack. And if @frsyuki had decided not to introduce string types to MessagePack, then it would be understandable that proposing a different specification through the IETF is a good way to promote such a format. But that does not seem to be the case any more.

My understanding is that the steering person of MessagePack is @frsyuki. Proposing it to the IETF would mean that there would be two steering persons / committees for a single specification.

Midar commented Feb 23, 2013

Just for the record, I really like @cabo's new proposal and implemented it :).

Could we get something like this into MessagePack? Ideally, I'd like to see BinaryPack get imported back into MessagePack. Then we'd have one format and that would even be a standard. That would really be the best case.

cabo commented Feb 23, 2013

@kenn, @kazuho: I actually don't think I have that much of a choice.

I need a spec for something like a binary JSON (misnomer, but close enough), in order to be able to place protocols such as SenML on top of that.

In developing that spec, I could ignore msgpack, of course. But I think it has shown great potential, and choosing a spec to start from also reduces the peril of "bikeshedding".

There are indeed some things missing from msgpack. I tackled the Text String issue first, because that is the most obvious gap. My current draft draft outlines a small number of other areas where an addition to msgpack could be considered necessary. But generally, I'm quite happy with msgpack. And I still hope we can do these other things while keeping any impact on backwards compatibility under control.

So, while going outright for a "fork" might be a useful strategy to disentangle things, I believe that doing this together will benefit both the IETF and the msgpack community. I read @frsyuki's last statement as some initial support for this approach. It is also simply the right thing for me to at least try — I don't just want to "steal" the spec.

Actually going for standardization will require the IETF to have change control. There is a danger that this could lead us away from the msgpack community. (Worse, people that want to distract from this effort might deliberately attempt to make this happen.) The role of the msgpack community and especially of @frsyuki will always be a bit delicate in this process. But the IETF is used to introducing established practice into standardization, and we know that the stewards of an existing specification brought into the IETF always have a special role and an important voice. @frsyuki can choose to actively exercise this role or stay in the background. Either way, I'm confident that we can manage this process in a way that is satisfactory to both ends.

It is not a given that the IETF will want to pursue standardization of this kind of format at all. (Again, people that want msgpack "to be left alone" might want to deliberately attempt to make the IETF process fail. But I'm trusting that this community is not of this kind.) I'm actually looking to your support to make a standard happen.

I think, in the end, msgpack will benefit from additional visibility, and from the technical scrutiny that an IETF process brings with it. Getting a standard done is a serious amount of work, though.

cabo commented Feb 23, 2013

@kenn: I am actually quite happy with the level of maturity that msgpack already has. Why do you say it is "early"?

I don't think the history of JSON is a good model for future work in this space. JSON just happened while a lot of people were still thinking XML had a solid grip of this space. So it was the right thing to do this a bit under cover. (Actually, although being written down in an RFC and having widespread consensus behind it, JSON isn't technically even an IETF standard yet; we will start the process for that in March! But it won't change in that process, we'll just get rid of the UTF-16 and UTF-32 blind alleys.)

The development of JSON was also special in that it essentially just showed how to use elements of an existing spec (ECMA 262) for its purpose. It had the advantage of never having to discuss the essence of that spec, just minor details such as whether comments should be included or not.

In the world of "binary JSONs" (sorry), from an existing standards point of view, we essentially have a green field. (Unless you want to start from ASN.1 BER. I hope you understand why I don't want to do this.) So we need a bit more active stewardship to converge on one spec.

msgpack has a lot of what is needed, and the wide implementation will help avoid extensive "bikeshedding". If I were to design a format from scratch, I'd do a few things somewhat different. But that is almost all on the level of bikeshedding. The only pain that msgpack causes me is that it already has spent almost all codepoints (I think 11 are left out of 256), maybe out of some exuberant confidence that there won't be any need for extensions. Reclaiming some of this space would cause considerable pain, so I'm happy that we found a solution for introducing Strings that just reinterprets some code points in a mostly benign way.

cabo commented Feb 23, 2013

To clarify what I'm trying to do here, I wrote up a first draft of my objectives.
Roughly in decreasing order of importance, they are:

  • Representing a reasonable set of basic data types and structures
    using binary encoding. "Reasonable" here is largely influenced by
    the capabilities of JSON, with the single addition of adding raw
    byte strings. The structures supported are limited to trees; no
    loops or lattice-style graphs.
  • Being implementable in a very small amount of code, thus being
    applicable to constrained nodes {{?I-D.ietf-lwig-terminology}}, even
    of class 1. (Complexity goal.) As a corollary: Being close to
    contemporary machine representations of data (e.g., not requiring
    binary-to-decimal conversion).
  • Being applicable to schema-less use. For schema-informed binary
    encoding, a number of approaches are already available in the IETF,
    including XDR {{?RFC4506}}. (However, schema-informed use of the
    present specification, such as for a marshalling scheme for an RPC
    IDL, is not at all excluded. Any IDL for this is out of scope for this
    specification.)
  • Being reasonably compact. "Reasonable" here is bounded by JSON as
    an upper bound, and by implementation complexity maintaining a lower
    bound. The use of general compression schemes violates both of the
    complexity goals.
  • Being reasonably frugal in CPU usage. (The other complexity goal.)
    This is relevant both for constrained nodes and for potential usage
    in high-volume applications.
  • Supporting a reasonable level of round-tripping with JSON, as long
    as the data represented are within the capabilities of JSON.
    Defining a unidirectional mapping towards JSON for all types of
    data.
Owner

frsyuki commented Feb 23, 2013

@cabo I think I understood most of what you mean. One question and one comment:

  • I couldn't understand this part: "thus being applicable to constrained nodes {{?I-D.ietf-lwig-terminology}}, even of class 1. (Complexity goal.)"
  • What's class 1? Could you try to explain using different words...?
  • Regarding "Supporting a reasonable level of round-tripping with JSON", there're some exceptions now:
    • json->msgpack conversion: the maximum length of arrays, maps and raws is limited to (2^32)-1 in msgpack
    • json->msgpack conversion: JSON represents numbers using decimal while msgpack uses floating points
    • msgpack->json conversion: msgpack can contain binaries
    • msgpack->json conversion: msgpack can use non-string/raw types as the keys of maps
    • msgpack->json conversion: msgpack can store only one primitive (non-map/array) value without map/array containers
    • I think none of them are real problems.

catwell commented Feb 23, 2013

Hello,

I'm joining this long thread a bit late, having read most of it. I wrote an implementation of MessagePack for Lua, which is a dynamic language of type B) (i.e. doesn't differentiate strings and raw bytes).

The interesting thing with Lua is that it also doesn't differentiate Arrays and Maps so we have already had this kind of implementation problems.

Decoding is not much of a problem in this case, although you lose a bit of information. We could add a way to decode that exposes additional type information but I have never needed it in practice. Encoding, on the other hand, is complicated because you have to decide on the right type to use.

The basic idea I ended up going with for Arrays / Maps (which both correspond to the type "table" in Lua) is that you have a specific function (in another language it could be something else, for instance an object) which takes an instance of an ambiguous type and returns how it should be encoded. In your implementation you provide a default version of that, but you allow users to override it if needed.

That works fine for tables because you can attach metadata to them, but you cannot do that for strings which are a much more basic datatype. You have to wrap them in a "more powerful" datatype. In Lua that would be either a table or a function.

This would be too complicated to explain to users so eventually the encoding library would have to abstract that, and the API would be like:

mylib.pack{
  my_string = mylib.string("this is a string"),
  my_binary_data = "this is raw bytes",
}

Note that even with the Array / Map issue, I have my implementation interoperating with lots of other languages such as Python, Ruby and C.

Actually, my implementation also supports an unofficial type for byte arrays, for interoperability with msgpack-js (see catwell/luajit-msgpack-pure#6). This makes the opposite assumption, that the native MessagePack raw type is used to store strings, but having read this thread I think what I did back then is actually a bad idea for a variety of reasons. I implemented it the way @creationix suggested because I didn't really care myself (since I don't use JS).

I think it would be a good idea if @creationix joined the discussion too since he has experience dealing with this on the JS side.

Owner

frsyuki commented Feb 23, 2013

@catwell Thank you for your comments. I'm really curious about how @creationix handles MessagePack data in JavaScript.

cabo commented Feb 23, 2013

@frsyuki: The class 1 terminology is defined in the referenced terminology document, http://tools.ietf.org/html/draft-ietf-lwig-terminology — essentially this is a device with about 100 KiB of code storage and about 10 KiB of RAM.
You want to be very frugal with code size on such a device.

Re the JSON round-tripping: there are some initial considerations written up in my current draft. I agree that this works quite well with msgpack. We can accommodate binaries on the JSON side by base64url-encoding them; this is how JOSE (http://tools.ietf.org/wg/jose/) handles the binary crypto information. On the msgpack side, I want to spare my little class 1 devices from having to work on these base64url strings, hence my interest in raw byte strings. Obviously, the JSON-to-msgpack direction of the round-tripping can only work in a schema-informed way if you want to turn the base64url strings from JSON back into real binary on the msgpack side.
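
A small Python sketch of the msgpack-to-JSON direction, under the assumption that binary values are base64url-encoded on the JSON side; going back from JSON to real binary needs schema knowledge, as noted above:

import base64
import json

def jsonable(value):
    # bytes become base64url text; containers are converted recursively
    if isinstance(value, bytes):
        return base64.urlsafe_b64encode(value).decode("ascii")
    if isinstance(value, dict):
        return {str(k): jsonable(v) for k, v in value.items()}
    if isinstance(value, list):
        return [jsonable(v) for v in value]
    return value

doc = {"name": u"東京", "thumbnail": b"\xff\xd8\xff\xe0"}   # text plus JPEG header bytes
print(json.dumps(jsonable(doc), ensure_ascii=False))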

cabo commented Feb 23, 2013

Re https://github.com/creationix/msgpack-js: @creationix simply defines new binary types (0xd8 = buffer16, 0xd9 = buffer32) — he did the differentiation just the other way around from how we have been doing it here. (He also has "undefined" as a fourth special value.)

@cabo for #121 (comment)

Please do consider omitting some things from JSON to make it a standard (and adding none).
For example, it would greatly help if \u0000 can be formally forbidden or at least deprecated, and I believe \uFFFE and \uFFFF are not valid either (actually I don’t think there’s a clear answer on whether they are allowed right now or not, and all implementations differ).

Considering JSON is intended as a portable data interchange format, disallowing them only makes sense. I don’t just think of C strings (as you can use a buffer API with (pointer, size) tuples in C) but also high-level languages that are constrained to C strings, and security implications.

For a bit more (and, sadly, a bit “incoherent”, as my English isn’t that good) rambling on that topic: https://www.mirbsd.org/permalinks/wlog-10_e20121201-tg.htm
Although everything else I write in there (like, suggesting only a few valid encodings; suggesting to always backslash-escape C1 control characters; suggest a nesting depth limit; suggest to always sort Object keys ASCIIbetically to not leak internal hashtable state after randomisation) are requests for implementors, not for the standard. I think I also make a good point about not using that JSON5 abomination. Everything I suggest to change in that post will still result in something that’s conforming to the ECMA-262 JSON.

(This is about strings obviously. \x00 is fine in binary octet representations. My point was about JSON, which doesn’t have them.)

cabo commented Feb 23, 2013

@mirabilos — if you want to influence JSON standardization, the best way is to subscribe at https://www.ietf.org/mailman/listinfo/json and make this point to the mailing list. Better, read what already has been said on the mailing list (the above has a link to the archives) and chime in. The IETF is open to all!

@cabo: Thanks, of a sort. I have so many projects I’m working on already, in addition to a dayjob, that I cannot follow any standardisation lists. For example, as a shell maintainer I should follow the Austin mailing list (POSIX), but it’s so high-volume I gave up after piling more than 1000 mails in less than a week…

Additionally, subscribing to a mailing list just to post something one-off is both effort for me and probably not liked by the people… but I’d be happy if you can forward my points.

Wow, what a long thread. I think I was able to read about 30% of it.

So first, let me share my experience implementing msgpack for interop between browser JavaScript, Node.js JavaScript and LuaJIT Lua. On all three platforms there are distinct types for strings and raw binary data. In JavaScript, the string type is UTF-16 Unicode. JSON, which is a subset of JavaScript, requires that strings are encoded as UTF-8 (which is the only sane Unicode serialization format IMHO). JSON encoders and decoders for JavaScript already have to convert between the UTF-8 encoding and the 16-bits-per-code-unit encoding used internally in the language.

Now, raw binary data is a new feature in JavaScript. In the browser it comes in the form of ArrayBuffer, which you can read using typed arrays or DataView instances. Also, object keys in JSON and JavaScript are restricted to Unicode strings. In Node.js, we created a binary type called "Buffer" before typed arrays were popular. It works somewhat like the browser's ArrayBuffer type, but with a different API for getting at the data.

Regardless of the differences between the two raw types, I am able to interoperate between them perfectly fine because their serialization format is just raw bytes. The problem arises when I want to msgpack-encode a string and a buffer and get a string and a buffer out on the other side. They are not the same type and have very different meanings. The main reason most people use msgpack over JSON in JavaScript is to have support for a binary data type. The other option is to base64-encode the binary data inside a Unicode string, and even then you need some out-of-band encoding tag to tell the consumer it should be base64-decoded. JS data tends to be schema-less and types should be introspectable on their own. Having only one kind of value for strings and buffers is a real problem for JavaScript.

Now, as far as Lua goes, it's not quite as bad. Lua strings don't specify an encoding. I use UTF-8 in all my code. In LuaJIT I use the FFI to create raw char* buffers for my raw type. When interfacing with my JavaScript code, I want my JS strings to come through as UTF-8 encoded Lua strings and I want my JavaScript buffers to come through as LuaJIT FFI char* arrays.

I implemented an extension to the msgpack protocol where strings use msgpack's raw type (because most code and JSON use strings), and added a new type that's meant to be raw/Buffer. In practice, this has worked out very well for me. I would love it if this addition made it into the official spec so more languages could interop.

As long as msgpack is supposed to be like JSON, but also support binary data, it should really have two distinct types for Unicode strings and raw binary data. Otherwise most of my JavaScript colleagues will use other formats, because they are not willing to add a schema to their protocols just to tell strings and buffers apart. The msgpack encoding and decoding layer should be standalone and not depend on user-provided schemas to know how to tell strings apart from raws. We can't just decode all msgpack raw data as buffers in JavaScript, because not all JavaScript runtimes even have a binary type. Also, it's very expensive to create buffers and then convert them back to strings later on based on a schema from a later layer.

Let me reiterate: dynamically typed languages require that all values hold their type information internally for primitive types. They never rely on externally declared types or schemas. Read up on how dynamic language runtimes are implemented. This is the main reason they use more memory than statically typed systems: every value needs its type tagged somehow. To tell the user of a dynamic language, where this philosophy is ingrained so deeply, that they have to annotate their data with types just to tell apart two primitive types is crazy. That's like saying we should merge null, booleans, and numbers into one type, and users should use a schema to know whether that 0 means null, false, or 0.

Also, in case you don't know already, here are my JS implementations with the differences from msgpack documented: https://github.com/creationix/msgpack-js https://github.com/creationix/msgpack-js-browser

kiyoto commented Feb 24, 2013

@cabo just curious:

"Due to the timing in the runup to the Orlando IETF, I will have to submit the final version of the Internet-Draft on Monday. So I would be happy if we could get a bit of a closure on the string representation issue here until then."

Is there any extrinsic reason you need to get this done this year? To me, getting a standard ratified seems more of a by-product of everyone agreeing on a "satisficing" design (which seems to be the case now, finally) rather than an end goal.

cabo commented Feb 24, 2013

@kiyoto: the standardization of Smart Object Networking (Internet of Things, IP Smart Objects, whatever you want to call it) is happening now. I believe adding a msgpack-like component to it would create an important basis for other standards to build on. And it also appears to be achievable in short time. Why wait? Wait for what?

Contributor

kazuho commented Feb 24, 2013

@kiyoto +1

@cabo

Please wait for consensus. Having the protocol specified tomorrow might be important for the areas you work in. But it's not the only use case of MessagePack.

We should try to create a single specification that all parties can agree on, and I do not think that would be possible within such a short period.

Having support from existing developers / users of the current MessagePack spec is IMO essential to promoting the new version of MessagePack with string support (or BinaryPack). But introducing a string type to MessagePack will hurt existing users no matter what, since it is actually an attempt to split a single type ("raw"), which is used for storing both strings and binaries, into two types. Either kind of data has to move somewhere else, so some incompatibility is inevitable.

So we should be cautious in making a final design. I think we are in a very delicate situation now as to whether we can reach consensus, and rushing to the IETF might hurt such efforts.

I imagine you are very frustrated; my understanding is that you proposed BinaryPack by yourself since none of the MessagePack developers seemed to be interested in adding string types. And after that they have started! I can understand that.

But for the greater good, I wish you would withdraw the BinaryPack proposal this year. I think we should concentrate on trying to gain support from as many existing and potential users of MessagePack as possible before declaring the protocol final. And after that, we should consider bringing the specification to the IETF if @frsyuki thinks that is a good way to spread the protocol.

You might lose some merit by not being able to refer to an IETF protocol for a year, but once we reach an agreement on the design, the power of existing developers / users and the name of MessagePack will help you (and all of us) in promoting applications using "MessagePack with string types".

cabo commented Feb 24, 2013

https://gist.github.com/frsyuki/5022569

Great! I have aligned my draft draft with this: http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre2.txt

(I'm still referring to this format by the monster name of "BinaryPack1pre2" because the msgpack spec hasn't officially changed yet. I'd love some advice on what to call this when I submit it tomorrow...)

@kazuho: The IETF is not going to turn this into an RFC tomorrow. I don't even think that our consensus processes are faster than yours... I can promise you this won't be an RFC in 2013. But it is important to have something written up now so we can build consensus on the general direction of going forward on the basis of a fully fleshed-out technical proposal (that is the whole point of an Internet-Draft).

It is not damaging to msgpack if the consensus process in this community is visible to the IETF and vice versa. We still have to decide whether to do our own thing or go with msgpack. I'm simply not in a position to withdraw my proposal. The only thing I could do is make it deliberately incompatible with msgpack. I don't think this community would benefit from that.

cabo commented Feb 24, 2013

Oh, one more comment:

I imagine you are very frustrated; my understanding is that you proposed BinaryPack by yourself since none of the MessagePack developers seemed to be interested in adding string types. And after that they have started! I can understand that.

Standardization can be much more frustrating than this... No, I'm not easily frustrated.

Actually, the history is that I needed a binary representation format, had been toying with msgpack for a while, but couldn't really use it because of the lack of string/binary differentiation. Then I ran into Eric Zhang's BinaryPack, and decided I should simply write this up. I didn't even know at the time that msgpack-js had also added this differentiation, in a different way... It's good we are starting to converge again.

rasky commented Feb 24, 2013

@frsyuki > https://gist.github.com/frsyuki/5022569

Why did you put Python 2 among the weak-string languages? It has had a Unicode type since 2.0, which is in wide usage, and anybody doing i18n programming with Python is using Unicode. The only issue (compared to 3.0) is that people tend to use the "str" type as an ASCII/UTF-8 string (much more than in Python 3). But given your definition, I still think Python 2 fits the strong-string languages.

I would also expect msgpack-python to correctly serialize Python 2 types so as to fix the original issue described in this thread. Your document seems to imply that for Python 2 (which you call weak-string) it's unrealistic to expect that to work; I disagree with this.

Midar commented Feb 24, 2013

@frsyuki +1 on that. As soon as the MsgPack specification is updated, I'll rename my implementation from a BinaryPack implementation to a MsgPack implementation :)

I'm really happy we could finally find a good solution. So all that discussion was not for nothing :)

catwell commented Feb 24, 2013

I like @frsyuki 's proposal, as long as the implementers are careful to implement the transition plan correctly and do have a backwards compatible mode.

We have a lot (millions) of archives out there on servers, but also on clients' mobile devices, and we have been using the raw type to store arbitrary binary data. That means we will have to use this backwards-compatible mode for a long time.

That being said, finally having an official string type is a good thing. I will implement the proposal in a few weeks if everybody looks happy with it.

Contributor

kazuho commented Feb 24, 2013

@frsyuki

Great work! Compared to the others, this proposal seems to have the lowest impact on existing applications.

@cabo

Thank you for responding, and thank you for explaining your situation and ideas in detail. Let me explain my situation and why I think proposing the spec without @frsyuki's support is a problem.

I work for one of the largest companies in the web industry of Japan. The company has long been using MessagePack on the server side, as a data format for the key-value stores and for server-to-server communications.

Recently, there has been a rising requirement from our client-side developers (I am one of them) for a JSON-like protocol that can efficiently store binary data. Thanks to the HTML5 specs, binary data is becoming familiar in web browsers, and as existing users of JSON, what we want is some kind of well-designed format that can store all the data types of JSON plus binary.

MessagePack with support for string types is an excellent choice for such a requirement, since our server-side engineers are already well experienced with the protocol, from developing with the libraries to debugging at the wire level.

But we would also face problems once the protocol spec. gets updated.

As I mentioned, we are already using the protocol, not only as a server-side protocol but for storing data as well. Most of our "raw" data are strings for sure, but there might be binaries (such as images) as well. It is very difficult to check: we have thousands of developers working on many applications, and the combination of MessagePack and key-value stores can be found in many of them.

Adding a string type to MessagePack can never be done without moving one kind of data to a new area (and I like @frsyuki's proposal very much because it does not move strings, which makes it easier for us to migrate).

But we still need to find the right way to implement the codecs that encode/decode MessagePack data, so that we can support both the legacy and the new format at the same time with minimum effort.

Although I am very optimistic esp. after looking at @frsyuki's proposal, I still need to convince my colleagues to support MessagePack with string types, or request a change if we find any problem.

And if I fail, my company would likely not use "MessagePack or similar protocols with string support" even as a client-side-only protocol, since it would be confusing for us to have two similar but different protocols. It would make debugging harder for us. We would likely go for BSON (though I do not think it's well designed) or something of the sort.

This is my personal situation, but I think many developers in the web industry think the same way.

And that is the reason why I think sending something to the IETF now, especially without the consent of @frsyuki, is a bad idea. IMO we are still in an early stage of designing the protocol and the API. Having two steering wheels for the enhancement at this moment decreases the possibility of our reaching a single protocol.

The only thing I could do is make it deliberately incompatible with msgpack.

If you are going to propose the protocol to IETF anyway, I would appreciate it if you could make it as different as possible from MessagePack so that it would never get considered as a "variant of MessagePack", which would likely cause confusion.

In fact, if BinaryPack gains enough familiarity that my company starts considering adopting the protocol, it would be better for us if the two protocols were not similar at all; it would help us distinguish the two at the wire level, and debugging would not become difficult.

Member

methane commented Feb 24, 2013

@rasky Many Pythonistas follow the best practice: use unicode for all strings.
But it's not a language spec.
ASCII-only bytes are compatible with unicode (e.g. b'id' == u'id'). Builtins use bytes as strings:

# Python 2: bytes is an alias for str, and identifier names are byte strings
def foo(): pass
assert type(foo.__name__) is bytes

So Python 2 is a weak-string language by the language spec and a strong-string language by practice.

Owner

frsyuki commented Feb 24, 2013

What I want to do from here regarding this string type issue is the following process:

  1. I'm about to propose an idea to change the spec
  2. Active authors of implementation projects implement the idea
  3. They release that implementation as an experimental release (could be an internal release for their use case)
  4. Active users try the implementation and validate how it works
  5. If the proposal needs fix, fix it. This fix may include changes of the proposed format
  6. Iterate 1 to 4 again until there is enough knowledge
  7. Release the already implemented experimental release officially and the proposal as the official release and official spec

My article (https://gist.github.com/frsyuki/5022569) is about to be the step 1.

This takes time. But anyway, active users can't use this release soon without validation for all projects as @catwell and @kazuo mentioned. I think this is the correct way to change a currently working spec.

I'm using MessagePack to provide a cloud-based service which stores data and runs queries on the data. My company has terabytes of customers' data in msgpack format. Changing the format of the data is almost impossible.

rasky commented Feb 24, 2013

@methane that's because in Python 2 it's not possible to define a symbol name using non-ASCII characters; the fact that it's using bytes internally is an implementation detail, as there is no difference in comparison, as you noticed. Now, there are many places where Python 2 doesn't have a clear string/unicode distinction at the API level, but still most Pythonistas know and expect unicode for strings.

My point is that, in @frsyuki's document, weak-string languages should behave in a way that I think is totally wrong for Python, because Python has had full Unicode support since 2.0. This is why I think @frsyuki is wrong, and Python 2.x should be moved among the strong-string languages, so that all uses of the msgpack string type are converted to Python unicode.

rasky closed this Feb 24, 2013

rasky reopened this Feb 24, 2013

cabo commented Feb 24, 2013

@kazuho: This is fascinating. Of course, I don't have any visibility into the decision processes in your organization, so it is indeed new information to me that widening the discussion to include the IETF could discourage adoption there. I don't want to sound harsh, but after thirty years of standardization I'm aware that there is sometimes collateral damage. I'll still try to minimize that, if possible at all. So if you have any proposal for me that is not equivalent to "stop doing your work", please send it to me, via e-mail.

Re your proposal of doing something deliberately incompatible: I actually thought about submitting a "msgpack done right", without any constraints from msgpack compatibility. Designing from scratch would probably yield an incrementally better encoding than what we have now, but it is unlikely that there would be a functional difference. I don't want to be guilty of http://xkcd.com/927/ — so I will consider this in earnest only if this community explicitly instructs me that compatibility with a future IETF specification is undesirable.

Midar commented Feb 24, 2013

@frsyuki For the record, I already implemented your proposal and "released" it (it's visible in Git), as it's equal to BinaryPack1pre2. So steps 1-3 are covered already :).

Owner

frsyuki commented Feb 24, 2013

@cabo

Consensus across active committers is necessary to run steps 2, 3, and 4 described above. Without that, it becomes difficult to verify whether the proposal works well in real production environments, and thus we, the msgpack core team, might offer a disappointing spec.

If another idea is proposed (such as adding a time type, timezone type, UUID, 128-bit int, 16-bit floats, bigint, decimal, regexp, SHA-1 hash, whatever; there are so many requests), then it needs the same cycle again. Active msgpack users (including me and the authors) need verified compatibility (of both code and data) at all times. The IETF might think msgpack is not yet defined and easy to change, but actually it is defined. I think you've understood, but again, we can't assume that changing or extending a currently working spec can be done correctly without implementations and verification in real production environments.

I think the msgpack spec as it currently stands should be kept fixed (consistent) as is. JSON is working very well. JSON doesn't have any other types. They're simple. I know msgpack/JSON don't satisfy all cases, but that is the application's business.

So, anyway, I think I need to ask you some questions:

  • What will happen next if you propose the draft draft?
  • If you do propose it, how can we later change the proposed draft with our own proposals? I mean:
    • We may be able to improve the draft after the verification process.
    • I think the draft needs to be based on the concept of msgpack, to guide users away from misuse
  • How can the authors/stakeholders prevent the working group from making changes to msgpack without the verification process in their production environments and the authors'/stakeholders' consensus?
    • And, who are you? How do you know about the IETF? (I just mean I need some more information to understand why you want to submit a proposal now)

Contributor

kazuho commented Feb 24, 2013

@cabo

@kazuho: This is fascinating. Of course, I don't have any visibility into the decision processes in your organization, so it is indeed new information to me that widening the discussion to include the IETF could discourage adoption there.

From your statement I understand that you do not know how MessagePack is actually being used (as I mentioned, my company's case is not something unique), and that fact makes me really scared about your aptitude as a submitter of an internet draft of MessagePack.

Do you think somebody without the knowledge of how the protocol is being used can make good decisions regarding the protocol? I do not think so.

Member

methane commented Feb 24, 2013

@rasky The reason Python 2 is listed as a weak-string language is that I told @frsyuki so.

The MessagePack update categorizes languages in which "bytes may represent a string" as "weak-string languages".
Having a unicode type doesn't mean a strong distinction between binary and string, because
bytes may be used for both binary data and strings.

I'm a big fan of Python, and no one intends to speak ill of Python 2.

Owner

frsyuki commented Feb 24, 2013

I actually agree with @kazuho:

fact makes me really scared about your aptitude as a submitter of an internet draft of MessagePack.
Do you think somebody without the knowledge of how the protocol is being used can make good decisions regarding the protocol? I do not think so.

I'm very scared by how easily the IETF might try to change the spec.

Member

kuenishi commented Feb 24, 2013

I have no idea what "Smart Object Networking" exactly means. If cabo-san is talking about small devices communicating via msgpack for power savings or something like that, why is a network communication protocol not shown as well? It seems like a traditional, authentic, and awful classic protocol like CORBA or ASN.1 might be enough for that.
A serialization protocol by itself has nothing to do with networking.

cabo commented Feb 24, 2013

@kazuho: I'm sorry, but that comment of yours was off the mark. People from my culture get very nasty when subjected to ad-hominem attacks, and rightly so. (I have been using RPC since before the term was first mentioned in a publication. I've seen dozens, if not hundreds, of marshalling formats. msgpack just happens to get a larger number of design decisions right than other ones. I very much understand "how MessagePack is being used". My comment was about the weird processes in your organization, where gathering additional support for msgpack would endanger its adoption there. Indeed, I'm not used to that kind of thinking, so that's where I expressed my surprise.)

But then, I'm not submitting MessagePack to the IETF. I'm submitting a specification that just happens to be compatible with msgpack, because msgpack is almost good enough for my requirements (which are documented in that specification). If @frsyuki wants to join me in this, he is more than welcome. I don't want to lead this at all. I just want it to happen, on a reasonable time frame, with a technically reasonable outcome.

@kazuho: I'm sorry that you need to be so protective of your turf. And I'm also sorry that I don't speak Japanese. I just can't stop my work because of that. But I'm happy to stop engaging this community if @frsyuki, the inventor and recognized steward of msgpack, instructs me to do so. The resulting confusion will not get smaller, though.

So far, I'm happy that my actions might have catalyzed, ever so slightly, the process that might now lead to the resolution of msgpack's string issue. If you look back at the start of this long thread, there were people leaving (or not joining) the MessagePack community, or doing random incompatible forks, because the issue wasn't being addressed. Yes, fostering needed evolution brings some disturbance to this place. But the only really quiet place is a grave.

cabo commented Feb 24, 2013

I have no idea what "Smart Object Networking" exactly means.

Sorry for not expanding this term. Try http://tools.ietf.org/html/rfc6574 for a gentle starter.
CORBA or ASN.1 are not useful in this space.

Yes, we have some protocols for message interaction, but data formats are also protocols. I already mentioned http://tools.ietf.org/html/draft-jennings-senml as one such protocol. This can be encoded as text XML, binary XML (EXI), and JSON. I don't want to get stuck with XML when I need binary. That's why I'm here.

Contributor

kazuho commented Feb 24, 2013

@cabo

I'm sorry if that sounded insulting, but I expect people responsible for standardizing an already-working protocol to be extremely sure of what they are doing. I hope that you are such a person as well.

In the case of MessagePack, not every part of the documentation is as clear as an RFC, so if the protocol were to be standardized, the vague parts of the current specification should be clarified according to how actual implementations work (or it would break compatibility), as @frsyuki has done in his newest proposal by defining how already-existing serializers / deserializers should work. To do so, you would need the help of @frsyuki and the community.

Or if anyone requests a clarification of the meaning of the spec through the IETF, how are you going to handle it? That kind of work can only be done by the community.

In other words, it would be dangerous to submit a proposal to the IETF unless it is done by a person who knows the protocol very well (including how it is used), or unless the submitter has a good relationship with the community - and the community is willing to pay the costs of standardization.

To me it seems that you are not the former. So is the latter the case? I'm afraid it is not. I am not the only one who is worried about bringing the proposal to the IETF at this moment.

It would be great if MessagePack became an RFC. But I think that should be achieved through the willing support of the community. Steamrolling a submission and then forcing the community to help brush up the specification is not a good way. And if such communication fails, then the specifications might fork.

Such a future isn't beneficial for anybody.

PS. as I described, it's not about language. It's about communication.

take-cheeze commented Feb 24, 2013

Hi.
I'm trying to implement a msgpack mruby gem since I heard one doesn't exist.
I too got stuck on the raw type, and happily I found this issue.
And I read your proposal ( https://gist.github.com/frsyuki/5022569 ) but I still get stuck.

Mostly because of this line:

don't have to validate a string on storing it.

I know it's impossible due to too many limitations, but it makes me uncomfortable, because I would have to say "sorry, maybe I broke your data" if I ignored data validation.
So I had a debate with myself about how to solve it.
I ended up expanding the two types (binary and string) into three types:

  1. binary type(new proposal's binary)
  2. maybe string or binary type(old raw)
  3. unicode string(new proposal's string)

For example, in C++ I would map
1 to std::vector<char>,
2 to std::string,
3 to std::u32string or std::basic_string<uint32_t> (that is, validated UTF-32, using a unicode iterator or other unicode libraries).
All types will be serialized to the same type as the input.

For another example, in mruby I would map all of 1, 2, and 3 to the standard mruby string,
and all three types will be serialized as the 2nd type.

All types are capable of being cast to each other, except for the 2nd-to-3rd conversion.
In the 2nd-to-3rd conversion, the data must be validated as a unicode string.
Though I don't have a firm idea about the default behavior when unpacking the 2nd type in a language with unicode support, i.e. whether to validate it or not.
I thought the 1st (binary) type is sometimes unnecessary, but it's more rational to have that kind of internal type.
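
A rough Python sketch of the three-type idea (the Raw class and its method are invented here for illustration; they are not part of any existing msgpack API):

class Raw(bytes):
    # Type 2, the old raw: bytes that *might* be UTF-8 text. The 2nd-to-3rd
    # conversion is an explicit, validating step.
    def as_text(self):
        return self.decode('utf-8')   # raises UnicodeDecodeError if invalid

# Type 1 (binary)         -> plain bytes
# Type 2 (old raw)        -> Raw
# Type 3 (unicode string) -> str

tokyo = Raw(b'\xe6\x9d\xb1\xe4\xba\xac')   # valid UTF-8 ("東京")
assert tokyo.as_text() == u'\u6771\u4eac'

blob = Raw(b'\xff\xfe\x00\x01')            # not valid UTF-8
try:
    blob.as_text()
except UnicodeDecodeError:
    pass                                    # stays raw; no data is broken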

Adding one more type will bear upon the type chart, so it needs some solution.
Adding fix/16/32 sizes is not a good idea (at least this time); I suggest adding a single type whose length is encoded as a variable-length quantity.

btw, when I saw this page I felt the msgpack community is obscure, even to someone who can speak/write Japanese.

cabo commented Feb 24, 2013

@kazuho Thank you, apology is appreciated.

Again, I don't want to do this. But before we discuss who should do what, let me try to clear up some potential misconceptions about the IETF process.

If (big if!) this ever becomes an IETF WG document, then the WG chair (likely not me; I'm chairing other WGs) will select a document editor. That person will follow the instructions of the WG. Anyone is welcome to be part of the WG, members of the msgpack community included. You don't have to do anything beyond subscribing to a mailing list. Contributions by WG members are generally weighted by their technical merit. If an issue comes up, technical input will be solicited from the WG and other experts. If @frsyuki doesn't want to take on any of the roles I have described, his input will certainly be very welcome, as will be everybody else's input based on technical merit. So I'm sure that the community input will be heard, just as I'm right now trying to hear it in coming up with my own little spec.

The MessagePack spec is trivial enough that I don't see big problems in arriving at an interoperable specification. Hey, you haven't even started properly writing it up yet, and you get some interoperability. For now, my only contribution is trying to do this work. But, yes, the existing community is important, and nobody wants to do something stupid that alienates existing users.

I'm not sure "steamrolling" is a good description of what I'm doing here. Progress with msgpack has been stalling badly. The inaction has been alienating some people, and other people have been forking the spec already a couple of times.
Of course, all those that didn't care about addressing the issues are now the remaining msgpack community.
That is not necessarily a very healthy situation. I'm saying this because from your insider position that may be hard to notice.

Member

kuenishi commented Feb 24, 2013

I don't want to get stuck with XML when I need binary. That's why I'm here.

@cabo This may be a reason why you chose MessagePack, but it cannot be the reason why you should write a draft and standardize a MessagePack-like serialization protocol. It does not look worth the time and cost for us (existing msgpack users, maintainers, and @frsyuki), because msgpack is implemented as free portable software, not hardware.

mirabilos commented Feb 24, 2013

take_cheeze dixit:

  1. maybe string or binary type(old raw)

A “maybe” doesn’t have a place in a specification.

Owner

frsyuki commented Feb 24, 2013

@cabo

You might have missed my comment above:

So, anyway, I think I need to ask you some questions:

I (and almost everyone here, I guess) didn't know the IETF process.
Is my understanding of an IETF WG correct?:

  • Subscribing to the mailing list and posting opinions is the only way to "propose" changes to the draft
  • The editor is the only person who can edit the draft
  • The editor will not be you or me; the editor will be a neutral person selected by the WG chair
  • The editor writes the draft by following the instructions of the WG

(If this ever becomes an IETF WG document.)

Owner

frsyuki commented Feb 24, 2013

I don't think the proposal (https://gist.github.com/frsyuki/5022569) is mature yet.
We'll see other comments, such as the one @take-cheeze posted.

Contributor

kazuho commented Feb 24, 2013

@cabo

Thank you for describing the process. It will help many of us understand it.

But whatever the process is, it still seems to me that you are trying to enforce the community to pay the cost of the standardization process, or else the protocol might fork into two, which would cause interoperability problems for sure.

And what I am afraid of is that the following statement of yours is wrong.

The MessagePack spec is trivial enough that I don't see big problems in arriving at an interoperable specification.

As many have described (and as is also documented in @frsyuki's proposal), adding a string type to MessagePack is a huge problem, since it introduces incompatibility. Under the current specification both strings and binary data are stored within the same "raw" type, but once we introduce a "string" type, we have to distinguish the two. In other words, one of the two kinds of data has to be moved somewhere else, and that introduces incompatibilities. And this has been the reason why a string type has not been introduced for so long.

So editing a draft of MessagePack is not an easy thing to do, if you care about interoperability. It cannot be done without the help of the community. So please reconsider instead of trying to enforce the community to pay the cost of standardization.

Of course it would be easy if you did not care about existing implementations. But my understanding is that the IETF does take care of those. Is my understanding correct?

cabo commented Feb 24, 2013

@frsyuki Thanks for the questions.

Being on the mailing list is indeed the only way to contribute to a WG process. Now we don't know yet what WG may be handling this, so if it helps that this be a low-volume, very focused mailing list, we may try to create one.

An Internet-Draft (I-D) can be in two stages: personal draft or WG draft. A personal draft is edited by whoever wrote it (that's why my bpack draft is called draft-bormann-...; Bormann is my last name). A WG draft is edited by one or more editors (e.g., draft-ietf-core-coap is a draft of the CoRE WG). That is usually, but not always, the person who wrote the initial personal draft -- this is at the discretion of the WG chair.

I'm sure if you want to be the editor of the msgpack draft this will be highly welcome, because this is your work. Remember, however, that such a position comes with the responsibility to carry out changes as the WG decided. It has happened in the past that an editor and the WG disagreed to the extent that the editor stopped editing or even made unwanted changes; in such cases a new editor is appointed (or, if the rough consensus we strive for is not achievable at all, the work is stopped). Many drafts have multiple editors. I wouldn't mind being an editor either, mainly because I think I'm pretty effective at the kind of editing work that remains to be done. We could both be editors; e.g., draft-ietf-core-coap has four editors. You could choose somebody else as a second editor (with the chair's approval). It doesn't hurt to have a pair of editors that represent different communities. Etc. (We are quite some ways from having to make the decision.)

@kazuho Whether the IETF cares about existing implementations depends on the specific work that needs to be done. E.g., OAuth 2 differs a lot from OAuth 1. But in that case, evolving the spec also was the (rough!) consensus of the people involved. If people from the msgpack community believe backwards compatibility is important, they should make that point, and I'm sure you will be heard. (Speaking just for myself: I personally don't have a requirement for backwards compatibility, but I'm also interested in maximizing interoperability.) I'm not sure how much impact the implementation considerations for backwards compatibility will need to have on the spec.

I'm sorry if I missed comments here; this github issue is a bit more active than I had planned for...

cabo commented Feb 24, 2013

@kuenishi — you act as if I created the problem that is being solved in this github issue. I didn't. This github issue is old, and it is actually a duplicate of github issue 13 that is even older. I just happened to write a specification that also needs a solution for the problem, and that picked up an earlier msgpack fork (binarypack) that was created because issue 13 was ignored. So I have no idea what cost I'm creating here.

take-cheeze commented Feb 24, 2013

@mirabilos

A “maybe” doesn’t have a place in a specification.

Sorry I need to use words more carefully.
I mean a type like

Languages which don't have types to distinguish strings and byte arrays (e.g.: PHP, C++, Erlang, OCaml)

in the background section of @frsyuki's proposal.

cabo commented Feb 24, 2013

Hmm, IIRC Erlang does have strings that are different from binary?
(And C++ indeed has vector<char> vs. string etc...)
Maybe you do want to handle text strings using the same type that you use for byte strings.
But that is not a property of the language.

Owner

frsyuki commented Feb 24, 2013

@cabo Thank you for your detailed description. Let me ask one more question:

I'm the founder of a startup company in Silicon Valley. Success of the company is obviously the primary goal I have to run toward first. It means I can't spend a lot of time creating an RFC standard for MessagePack.

So my question is: what kind of work is expected of the editor?

cabo commented Feb 24, 2013

@frsyuki As I said, I don't think there is much work to be done at this point. Your main job will be to be cognizant of all the issues and make sure the document is consistent. If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process.
E-Mail me for more details (my e-mail address is in the draft).

Owner

frsyuki commented Feb 24, 2013

@cabo Let me comment about C++:

C++'s std::vector<T> is used to represent an array whose elements are of type T. If T is an integer type, then it should be serialized using msgpack's Array type and Integer type.

Imagine that T is "char". It could be an array of very small integers. It could be a byte array. The serializer can't distinguish them. How about uint16_t? It could be a UTF-16 string.
C++ programmers are allowed to store byte arrays in std::string. Many programmers (including me) use it because it's easy to use; otherwise programmers need to allocate/free memory manually.

These languages are widely used. This is a fact. I wanted an intermediate data representation format which doesn't force extra work to put markers on these objects. MessagePack focuses on these problems. That is why adding a string type is such a difficult decision.
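
The same ambiguity is easy to show outside C++; a quick Python illustration (with arbitrary values) of why the serializer alone cannot pick the wire type:

# Three in-memory shapes of the same two octets, 0x48 0x69:
as_ints  = [72, 105]          # an array of small integers -> Array of Integers
as_bytes = bytes([72, 105])   # a byte buffer               -> raw (binary)
as_text  = 'Hi'               # the same octets read as text -> raw (string)

# Nothing in the values themselves says which wire type the author meant;
# only the declared type, or an out-of-band convention, can.
assert as_bytes == as_text.encode('ascii')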

Owner

frsyuki commented Feb 24, 2013

@cabo I'm sorry but let me be cautious...

What do you mean "at this point"?:

I don't think there is much work to be done at this point

I think this is why @kazuho says "enforce":

If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process.

Finding a dependable person who will kindly spend time on an open source project seems hard. But you need an answer by Monday, right?

Please don't get me wrong. I appreciate your guidance and suggestions on this project.

I think I have the following 3 options, but none of them seems like an excellent idea:

1. I become the editor (hard to find the time)
2. I try to find a dependable person to be the editor (takes time)
3. Just say "don't propose msgpack as a draft" (results in two confusingly similar specs)

cabo commented Feb 24, 2013

@frsyuki Well, the draft is written. You can have that. (Later, somebody has to do the small things still listed as TBD. With your help, I could do those — I have done a lot of those before.)

No, I don't need an answer by Monday. The only thing that I want to do by tomorrow is submit the next version of my draft, the -01 (you have seen the draft draft), because the -00 documents something that we both agree is no longer what we want, and I want to replace that. Then the two-week submission moratorium in front of the Orlando IETF starts.

If you aren't ready to take over at this time (and I would certainly understand that), I'll just do the -01 again under my name. We/you/whoever can submit the next version on Monday, March 11, or later. We also can let the -01 stand around unchanged for a while, while we all figure out what to do.

Again, there can be multiple people responsible for a draft, so there are multiple configurations possible beyond those you have listed.
And we aren't even close to having a WG that would need an editor appointed.
So we have plenty of time to find the right way to do this.

If possible, I would like to be able to discuss potential options with other IETFers in Orlando, in the week commencing March 11. So if we can discuss potential ways forward within the next two weeks from now, that would help a lot. But even that isn't strictly necessary. (It would help because there will be a JSON BOF in Orlando.)

I apologize if I have created the impression that the Monday deadline is the end of the world. It is just an Internet-Drafts deadline, which we like to have in the IETF so we can all read the documents before arriving at the meeting place.

kenn commented Feb 24, 2013

@frsyuki I suggest that you open a new issue with https://gist.github.com/frsyuki/5022569 so that we can focus on the details of the spec and start the debate afresh there. At this point we need a separate place for those who are only interested in the proposed new spec itself.

I think this issue has become messy, and the important people who should be reading this thread have stopped reading. We can leave this thread open so that we can continue to come back when we need to talk about non-spec matters.

cabo commented Feb 24, 2013

How about issue 13?...

Owner

frsyuki commented Feb 24, 2013

@cabo OK. I misunderstood about the Monday.

As the inventor and recognized steward of msgpack, I can NOT say, SO FAR, that you may use the name "MessagePack" as the name of your next draft (which will likely be submitted on Monday). You'll submit the -01 under your own name on Monday.

I think this is all I need to say for now.

I'll try to get advice from several people who have experience with standardization.

I think I need to get a good night's sleep before making any more decisions....

Owner

frsyuki commented Feb 24, 2013

Oh, I forgot to mention the reason: we have not been able to reach a consensus on this matter so far (meaning right now).

cabo commented Feb 24, 2013

@frsyuki: Thanks, I completely understand. This would indeed be too early.

I also subscribe to the view that a specification needs to be implemented and its ramifications understood before you really can have a solid consensus.

So, for now, have a good night's sleep!

cabo commented Feb 24, 2013

Re the C++ string interoperability issue: Let me just point out that WG21 at least appears to be aware of the problem (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html proposes a char8_t (typedefed from unsigned char) and a u8string built from basic_string<char8_t>). But that may be little help for now, at least until C++/TR2 comes out or it is supported by common libraries.

kiyoto commented Feb 24, 2013

Let's heed @kenn's advice. I just created a new issue #128

Also, this is the 300th comment! It's really time to start over. Imagine loading this page on an iPhone screen...

Midar commented Feb 24, 2013

@kiyoto It's still readable there, just did that today ;).

But yes, let's split this into several tickets and close this.

kiyoto commented Feb 24, 2013

@Midar
You have superior vision to mine (no pun intended). I did that too, got a mild headache, and decided to create a new ticket =)

Contributor

kazuho commented Feb 25, 2013

@cabo @frsyuki

Sorry, now that we have a new location to handle the issue, I have removed my last comment posted here and reposted as #128 (comment)

Owner

frsyuki commented Feb 25, 2013

Thank you. See #128 as well.

Member

tagomoris commented Feb 27, 2013

hey, can anyone close this issue?

rasky closed this Feb 27, 2013

oberstet referenced this issue in wamp-proto/wamp-proto Apr 20, 2013

Closed

Binary Payload Format #4

niemeyer commented Jun 4, 2013

My conclusion is that "it's better to support user-defined custom type rather than adding string type"

I'd be happy to have a string type, but custom types open a relevant can of worms that I'd like to stay away from. msgpack was a great format precisely because it was simple and tight.

For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type.

The irony is that there are 11 ways in which the number 1 could be sent across the wire. Some of the libraries are unable to make that distinction either. It seems people have done okay so far.
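
For the curious, the count does check out against the msgpack type chart; a quick sketch of the eleven encodings (byte values per the format spec, worth double-checking against whichever spec version you target):

# Eleven distinct msgpack encodings that all decode to the number 1.
encodings = {
    'positive fixint': b'\x01',
    'uint8':   b'\xcc\x01',
    'uint16':  b'\xcd\x00\x01',
    'uint32':  b'\xce\x00\x00\x00\x01',
    'uint64':  b'\xcf' + b'\x00' * 7 + b'\x01',
    'int8':    b'\xd0\x01',
    'int16':   b'\xd1\x00\x01',
    'int32':   b'\xd2\x00\x00\x00\x01',
    'int64':   b'\xd3' + b'\x00' * 7 + b'\x01',
    'float32': b'\xca\x3f\x80\x00\x00',
    'float64': b'\xcb\x3f\xf0\x00\x00\x00\x00\x00\x00',
}
assert len(encodings) == 11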

Gah, looking at the latest spec... why not just require IN THE SPEC that the "string" type be UTF-8 encoded, and keep the binary type if you want "RAW" whatever?

I use msgpack in PHP 5.6 without any problem (my PHP files are UTF-8 encoded).
Is this issue fixed?

Member

methane commented Aug 1, 2016

This issue is about the msgpack format, not the PHP implementation.
AFAIK, PHP doesn't have separate text and binary types, so there is no problem in the first place.
