
Msgpack can't differentiate between raw binary data and text strings #121

Closed
rasky opened this Issue · 307 comments
@rasky

It looks like the msgpack spec does not differentiate between a raw binary data buffer and text strings. This causes some problems in all high-level language wrappers, because most high-level languages have different data types for text strings and binary buffers.

For instance, the objective C wrapper is currently broken because it tries to decode all raw bytes into high-level strings (through UTF-8 decoding) because using a text string (NSString) is the only way to populate a NSDictionary (map). But it breaks because obviously some binary buffers cannot be decoded as UTF8-strings.

The same happens with Python 2/3: when you serialize and then deserialize a Unicode string, you always get a byte string back, which breaks simple code:

>>> a = { u"東京": True }
>>> mp = msgpack.dumps(a)
>>> b = msgpack.loads(mp)
>>> a == b
False
>>> b[u"東京"]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
KeyError: u'\u6771\u4eac'
>>> b
{'\xe6\x9d\xb1\xe4\xba\xac': True}

As you can see, when you deserialize, you get back a different object that does not behave the same (because the internal text strings are not decoded from UTF-8).

Most wrappers have an option to enable automatic UTF-8 decoding for all raw bytes, but that is wrong because it applies to ALL raw bytes, while you might have a mixture of text strings and binary buffers within the same MessagePack message. That is not at all uncommon.

As I said, this problem can be found in almost all high-level messagepack bindings, because most high-level languages have different data types for text strings and binary buffers.

I think the only final solution for this problem is to enhance the msgpack spec to explicitly differentiate between text strings and binary buffers. Is this something that msgpack authors are willing to discuss?
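To make the ambiguity concrete, here is a minimal sketch of the original spec's "raw" encoding (a toy `pack_raw`, not the real msgpack library): a fixraw header is just 0xa0 | length, with no bit to say whether the payload is text or binary.

```python
# A minimal sketch of the original msgpack "raw" encoding (not the real
# msgpack library): a fixraw header is 0xa0 | length for payloads up to
# 31 bytes, and the header carries no text-vs-binary flag.
def pack_raw(value):
    """Pack a short str or bytes value as an old-spec msgpack raw."""
    data = value.encode("utf-8") if isinstance(value, str) else value
    if len(data) > 31:
        raise ValueError("sketch only handles fixraw (<= 31 bytes)")
    return bytes([0xA0 | len(data)]) + data

# The text string and the binary buffer produce byte-identical output,
# so a decoder cannot tell which one was serialized:
text_packed = pack_raw("東京")           # a text string
blob_packed = pack_raw("東京".encode())  # the same bytes as a binary blob
```

Since `text_packed == blob_packed`, any decoding policy is a guess: the type information is lost on the wire.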

I am willing to implement whatever solution you decide is best and submit a pull request.

Thanks!

@Midar

This is a serious problem and what's preventing me from implementing MessagePack in my ObjC framework. I have to know whether I should create a string object or a data object. Creating a string object for everything will fail if it is not UTF-8 and always creating a data object will be very impractical.

MessagePack is advertised as compatible with JSON, providing only what JSON provides - does that mean raw data actually means "UTF-8 string" in the author's view of things?

@chakrit

First-class string support was proposed 2 years ago and the issue still hasn't been closed.

If msgpack really goes by the motto "It's like JSON", I think it needs to solve this and other related issues ASAP.

For the time being, though, I think going with UTF-8 and using some key convention to differentiate between binary blobs and strings might help.

e.g. append _data to every key that should be treated as binary, and otherwise decode values as UTF-8 strings by default.
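A decoder-side sketch of this hypothetical convention (the `_data` suffix and the `apply_key_convention` helper are illustrative, not part of any msgpack library):

```python
# Hypothetical decoder-side helper for the key-naming convention above:
# keys ending in "_data" keep raw bytes values, everything else is
# decoded as UTF-8 text. This is an application-level workaround, not
# part of msgpack itself.
def apply_key_convention(raw_map):
    """raw_map: dict mapping bytes keys to bytes values, as unpacked."""
    result = {}
    for key, value in raw_map.items():
        name = key.decode("utf-8")
        if name.endswith("_data"):
            result[name] = value                  # leave as binary
        else:
            result[name] = value.decode("utf-8")  # treat as text
    return result

decoded = apply_key_convention({
    b"title": "東京".encode("utf-8"),
    b"thumbnail_data": b"\x89PNG\r\n",
})
```

The obvious weakness is the one @Midar raises next: every application must agree on the convention, so no generic library can apply it.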


EDIT: Found a comment related to this issue on StackOverflow: http://stackoverflow.com/questions/6355497/performant-entity-serialization-bson-vs-messagepack-vs-json#comment15798093_6357042

Generally, the raw bytes are assumed to be a string (usually utf-8), unless otherwise expected and agreed to on both sides of the channel. msgpack is used as a stream/serialization format... and less verbose than json... though also less human readable.

So I take this to mean that if we need raw bytes on the wire, we should implement our own addition to the protocol.

@Midar

Appending _data or some convention like that means it's not possible to write a generic MsgPack implementation that can be used by any application. I need to know whether it's a string or binary data, because I need to handle the two differently. And I need to know that before I pass the data to the application, because otherwise the application will get the wrong object.

If this bug is well known for over 2 years and there is no intention to fix it, then I guess we should just move on and forget MsgPack.

@mirabilos

Actually @Midar JSON is not binary-safe and all strings are UTF-16 there (with UTF-8 being a valid representation thereof).

No idea on msgpack though, only stumbled here because of a discussion about salt…

@Midar

@mirabilos Nobody is talking about JSON being binary-safe here. The problem is that while strings in JSON are UTF-8 (and UTF-16 internally), there is no specification on that in MsgPack whatsoever. It is simply impossible to know whether something is a string in UTF-8, a string in UTF-16, a string in ISO-8859-1, a string in KOI8-R or just some binary data. And that is the problem. This is completely different to binary-safety and has absolutely nothing to do with JSON.

@DestyNova

Agreed, it is a problem which lends itself to ad-hoc workarounds. I've been using the Objective-C msgpack implementation to transfer mixed data between iOS devices and a server.
When "raw" data is detected, it tries to parse it as a UTF8 string first. The only solution I could think of was to patch msgpack-objectivec such that if the UTF8 parse produces a null result, then it simply returns that item as an array of bytes.
However, this heuristic will fail if binary data just happens to parse as a valid UTF-8 string, or perhaps worse, if parsing some binary data causes "unspecified" behaviour.
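The heuristic described above, and its failure mode, can be sketched in a few lines (the helper name is hypothetical):

```python
# Sketch of the try-UTF-8-first heuristic described above, and its
# failure mode: binary data can happen to be valid UTF-8, in which case
# the heuristic silently mislabels it as text.
def decode_raw_heuristically(data):
    """Return str if the bytes parse as UTF-8, otherwise the raw bytes."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return data

binary_result = decode_raw_heuristically(b"\xff\x00\xfe")  # stays bytes
lookalike = decode_raw_heuristically(b"\x41\x42\x43\x44")  # becomes "ABCD"
```

The second call shows the false positive: four arbitrary payload bytes that are all in the ASCII range decode "successfully" and come back as a string.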

@chakrit

Time for a new fork?

@frsyuki
Owner

OK. Sorry for being late.
As the initial designer of the MessagePack format, I think msgpack should not have a string type.
I need to write a longer article, but let me describe some points so far:

  • data format should be isolated from programs
    • it depends on the application whether a sequence of bytes is interpreted as a byte array or a string.
    • the lifecycle of a data format is usually longer than that of programs:
      • example 1: stored data should stay consistent even as programs change
      • example 2: network protocols should be compatible with old programs
    • thus the data should not carry a string-type information bit, and applications should map sequences of bytes to string types only when necessary
  • successfully stored data must be read successfully
    • if the packer stores data as a string, it should validate the string before storing it, to guarantee this
    • implementing validation code is relatively hard and makes it difficult to port msgpack to other languages/architectures
    • data may not be trusted, so the unpacker should also support string validation, at least optionally
    • supporting multiple encodings makes it even harder
    • thus the msgpack library should not deal with encoding validation, including a string type bit
  • it isn't a problem in statically typed languages
    • because these languages need to specify the data type before handling the deserialized (=dynamically typed) data either way
    • see the C++, Java, and D implementations and their type-conversion mechanisms
    • users find the Java implementation's Value class (by @muga) useful, and it avoids the byte array/string problem entirely
  • even with dynamically typed languages, some committers don't think it's causing problems
    • Python (@methane), Ruby (@frsyuki = me), Erlang (@kuenishi)
    • the Python implementation supports an option to return byte sequences as strings (byte arrays by default)
  • I think only JavaScript and Objective-C have problems
  • JavaScript historically doesn't have a byte array type; it needs special handling either way
  • I suggest the Objective-C/JavaScript implementations adopt the following solution:
    • the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits from NSString
    • the object contains a validated UTF-8 string
    • if the validation failed, it's nil or something that tells us the validation failed
    • NSStringOrData#data returns the original byte array
  • supporting user-defined custom types is better than a string type
    • 0xc1 is considered to be reserved for string type
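frsyuki's NSStringOrData idea can be sketched in Python (a hypothetical `StringOrData` class; the real proposal targets Objective-C's NSString, where subclassing is harder, as @Midar notes later):

```python
# A Python analogue (hypothetical) of the NSStringOrData idea: a str
# subclass that always keeps the original bytes and records whether
# UTF-8 validation succeeded.
class StringOrData(str):
    def __new__(cls, raw):
        try:
            text = raw.decode("utf-8")
            valid = True
        except UnicodeDecodeError:
            text = ""      # validation failed; the string view is unusable
            valid = False
        obj = super().__new__(cls, text)
        obj.raw = raw      # the original byte array, always available
        obj.is_valid_utf8 = valid
        return obj

ok = StringOrData("東京".encode("utf-8"))   # usable as a str
bad = StringOrData(b"\xff\xfe\x00")         # fall back to .raw
```

The application still has to check `is_valid_utf8` (or inspect `.raw`), so this shifts the string/binary decision to the caller rather than eliminating it.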
@frsyuki
Owner

It took time for me to build my opinion.
My conclusion is that "it's better to support user-defined custom type rather than adding string type"

@muga

Hi,

I'm the developer of msgpack-java. The above is a well-known (and complicated) problem.

@frsyuki

+1

In my opinion,
1) the serialization core library should not implement character encoding.
2) the serialization format should not include charset information.
3) having a utility library on top of the core library is a good idea

If msgpack had a string type, the format and the library implementations would have to be more complicated, which makes keeping the format and the libraries compatible difficult. It is really hard to design a serialization format that works for any charset. If it has bugs, we must fix not only the format but also the libraries. That is critical.

Business logic on the application side should handle character encoding. But having extension hook points in a msgpack library is a good idea, so that you can extend encoding handling using other libraries.

@methane

-0.5 to adding string type.

For example, JSON has no separate Integer type, only Number; an application expecting an Integer has to handle a Number.
If msgpack had a string type, an application expecting a string would have to handle raw, and one expecting raw would have to handle string.
So I feel an inter-language serialization format should have a minimal set of types.

@chakrit

I disagree completely.

UTF-8 and UTF-16 are very well-known standards that have been around for a very long time. All new implementations these days should support Unicode string encoding from day one. There shouldn't even be a question of which character encoding to use, especially when msgpack wants to be just "like JSON".

There are well-known UTF string encoding routines available on nearly every platform. It's not as if every implementation has to roll its own character encoding routine from zero; it can just use whatever's available on its platform of choice. And character encoders/decoders are available on most, if not all, platforms these days. In my opinion, implementing an encoder/decoder is a non-problem: don't re-invent any wheel.

Think of this as referencing another standard in your piece of work instead of having to specify every character encoding mechanism yourself.


String is a very fundamental data type required by most (if not all) applications these days. And let me repeat this: "It's like JSON." is printed in an H2 at the very top of the msgpack website, yet your specification does not include something as simple as a String. Why?

Also, the problem exists regardless of whether msgpack has a string specification. In my opinion, it is even worse not to specify the exact character encoding in your wire protocol.

Suppose you have two applications which both use msgpack, yet they can't communicate, because the msgpack protocol itself does not specify how a string should be encoded, leaving room for incompatibility. If the msgpack spec would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8", this problem wouldn't have existed from the start.


Let me suggest this:

  1. You should simply add a String data type. It is so fundamental that it should not be left out, especially when you are advertising msgpack as a faster/smaller JSON. I suggest you start with UTF-8 and/or UTF-16 as the encoding (and personally, I don't think there is any need to support more encodings than these two). If anyone needs absolute speed, they can still use the old raw-bytes type with their own encoding and their own acceptance of any incompatibilities that might arise.

  2. If you insist on not having a String data type, then there should be better documentation and a "recommended practice" for handling strings and the encoding to use. As I've repeated, String is a very fundamental data type that should have been specified in the spec, and there are many platforms where both a String and a plain Buffer (or byte[] array) data type are in active use, such as JS/node.js and ObjC/iOS. Leaving this out just causes confusion between parties trying to implement the same protocol.


TL;DR --- I think this is simply a matter of properly documenting the "best practice" or what's expected of an implementation, rather than throwing out a spec that defines only binary blobs and denies all string support for fear of character encoding issues, with zero pointers on how exactly to implement strings should you need them (and you definitely will need them; what application does not use a string?)

@mzp

Hi, I'm the developer of msgpack-ocaml. I disagree with adding a string type.

One of the benefits of msgpack is that it is multi-platform, so we should be careful about adding new types.

Moreover, a string type is not that attractive. Although a plain string type is fundamental in many languages, a UTF-8-encoded string type is not. For example, OCaml doesn't assume any encoding for its strings.

I don't have a strong opinion about a "recommended practice", but I think that it is each application's task, not msgpack's.

@frsyuki
Owner

@chakrit I don't think supporting UTF-8 encoding/decoding/validation is easy even if there are some well-known libraries. Remember that msgpack focuses on cross-language use. For example, I don't think Smalltalk supports FFI by default. In JavaScript for browsers, @uupaa implemented IEEE 754, and similarly complex code would be needed again to support UTF-8 (or UTF-16):
https://github.com/msgpack/msgpack-javascript/blob/master/msgpack.js#L135

if the msgpack specs would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8"

I agree it's a good idea. I added a comment to the spec: http://wiki.msgpack.org/display/MSGPACK/Format+specification
At least Java and Ruby implementations (written by me) already use UTF-8 to serialize strings.

Regarding 1., JSON doesn't support a binary type. Do you mean msgpack should not support the Raw type, in order to be like JSON? I don't think so. The problem is that some users want to handle strings and binaries at the same time and want to tell the difference transparently. If we use msgpack only as a replacement for JSON, users can assume all Raw objects are strings. Some msgpack libraries, such as the Python implementation, support a string-only mode (a nice feature, I think). I want to add the feature to msgpack-ruby v0.5.x as well.

Regarding 2., to be exact, it's a problem of the JS/node.js and ObjC/iOS implementations. I mean that String is not a fundamental type in some languages such as C, C++, Ruby (at least 1.8), Erlang, and Lua (significant languages, right?). In Python and Ruby 1.9, the difference between strings and binaries is unclear in terms of both the implementations and cultural aspects. The MessagePack format itself doesn't define mappings between msgpack's types and language types; implementations take the role of projecting msgpack's types onto language-specific types (this is an essential concept of msgpack). Thus, as I mentioned above, the JS and ObjC implementations should document this specifically.

....But anyway, I agree that it's better if the msgpack documents mention the "best practice for handling strings in certain dynamically typed languages such as Objective-C or JavaScript."

So, TL;DR... the msgpack project lacks some important documents, such as: why msgpack doesn't have a string type, guidelines for implementations on how to handle strings, and the best practice for handling strings. // TODO FIXME

@Midar

I strongly disagree with the position not to add the most basic type: a string.

Let's assume MsgPack is Layer 1 and our protocol is Layer 2, encoded in MsgPack. So, when I want to decode MsgPack to objects (which is Layer 1, remember?), I also need to have knowledge about Layer 2 (because otherwise I can't know what it is)? Sorry, but this is completely retarded. This is like "In order to parse TCP, you need to parse the protocol that's wrapped inside TCP. So, if you want to parse TCP, you need to parse every protocol in existence like HTTP, XMPP, SMTP, IMAP, etc.".

Saying that UTF-8 is too complicated is basically admitting defeat. If you can't implement those 20 lines of C code required for de- and encoding UTF-8, you probably shouldn't write any code at all. Especially as almost all languages have already implemented UTF-8 and you can just use it.

The strangest thing is the reason: You're saying you don't want to have a string type out of fear of being not interoperable. Well, actually, you kill interoperability by not having a string type, as therefore it's not possible to parse Layer 1 in many languages as you don't know which encoding is used or if it even is a string. There is no way to have a look at the data without some kind of schema and thus looking at Layer 2, which you really shouldn't. This violates basic rules of software design!

The advantage of MsgPack to Protocol Buffers could have been that it does not need a schema. But with this decision, MsgPack has no advantage over Protocol Buffers. It's not portable and it needs a schema, both two things you don't want from a general purpose serialization format.

Saying that UTF-8 is a problem for interoperability is really the biggest nonsense I've heard so far. Almost all modern network protocols require UTF-8. XML requires UTF-8 support and works on many more platforms and languages than MsgPack ever will. Requiring UTF-8 eliminates the pain of having to support multiple encodings. There's a reason the world moved to UTF-8…

@repeatedly

Hello, I'm the author of msgpack-d.

I have never wanted a string type in my msgpack experience.
In D, string <-> byte conversion is no problem, because the application has already normalized any invalid strings before serialization.
In addition, in my RPC experience, having many serialization types is bad: it causes a lack of interoperability.

Probably, this issue is an IDL- or application-layer problem.

P.S.
Rather than introducing a string type, supporting user-defined custom types would be good for me, because that approach answers anyone who says "I want this type in msgpack!"

@rasky

@frsyuki @methane I am the original issue opener. I have posted a clear Python example showing that msgpack is completely broken in Python: a very simple data structure doesn't load back. So I can't see how you can think that it is not broken in Python, at the very least.

I know there is an option to return byte array by default, and that's totally useless, because it applies to all of them.

Also, when you say "In Python and Ruby 1.9, the difference of strings and binaries is unclear in terms of both implementations and cultural aspects", I don't know what you are referring to. The difference between strings and binaries is very clear in Python (and Ruby, and Java, and Objective-C, and MANY modern languages); there are tons of documentation, material, and talks on it. I am surprised that you can think it is unclear.

I think @Midar nailed it. The problem is that, without a string type, MsgPack always needs a schema/IDL to be useful, because it cannot convert back to native data structures without a schema telling it how to. Vice versa, if you add a string type, it becomes possible (most of the time) to avoid a schema.

@frsyuki
Owner

I needed to mention another problem about UTF-8 (and unicode).

UTF-8 validation includes the NFD/NFC problem. For example, "\u00e9" (NFC) and "\u0065\u0301" (NFD) represent exactly the same character (you may know that Mac OS X uses NFD to represent file names, which sometimes causes trouble with Linux, which usually uses NFC). If msgpack had a string type, should implementations normalize characters to NFC, or NFD?

UTF-8 also has overlong encodings: 0x2F could be written as 0xC0 0xAF. Should deserializers reject these bytes? Or normalize them into the intended character?
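Both issues are easy to demonstrate with Python's standard library; a quick sketch using only `unicodedata` and the built-in strict UTF-8 decoder:

```python
import unicodedata

# The NFC/NFD issue: "é" has two Unicode representations that are
# canonically equivalent but byte-for-byte different.
nfc = "\u00e9"        # é as one precomposed code point (NFC)
nfd = "\u0065\u0301"  # e followed by a combining acute accent (NFD)
same_after_normalization = unicodedata.normalize("NFC", nfd) == nfc

# The overlong-encoding issue: 0xC0 0xAF is an overlong (invalid) way
# to encode "/" (0x2F); a strict UTF-8 decoder must reject it.
try:
    b"\xc0\xaf".decode("utf-8")
    overlong_rejected = False
except UnicodeDecodeError:
    overlong_rejected = True
```

So a spec with a string type would indeed have to say whether serializers normalize (and to which form) and whether deserializers reject overlong forms, which is the design burden frsyuki is pointing at.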

@methane

@rasky I agree with you that adding a string type helps Pythonistas.
But msgpack is an inter-language communication format.
We have to communicate with weakly typed languages like PHP or JavaScript.

If you want to serialize Python data types perfectly, you can use pickle instead.
It can serialize and restore datetime, tuple, and many other types correctly.

@rasky

@methane I'm using msgpack specifically because it's an inter-language communication format. I communicate between Python and Objective-C, and the Objective-C msgpack library is totally broken because the string type is missing: the standard Objective-C dictionary construct must have strings as keys, so the msgpack Objective-C library tries to convert everything into strings, which breaks the transmission of binary data. If msgpack had a distinct string data type, the Objective-C library would know what to do.

@frsyuki First, I assume that all languages that implement native Unicode strings have libraries to handle this either way. My take is that msgpack shouldn't do anything: you convert from Unicode into UTF-8 using the standard behavior of the language, and convert back again the same way. The problems you cite arise only if someone tries to use UTF-8 as-is, so they will arise in languages where Unicode is not implemented. I think that, if an implementer is going to communicate between a Unicode-rich language and a Unicode-poor language, it is up to the implementer himself to take care of these small details.

@Midar

@frsyuki None. That is not part of the serialization. Comparing strings is a completely different domain. You could convert it from UTF-8 to your preferred charset and compare it in that and lose internationalization - that's up to you. Or you could put Unicode in your raw binary and still have those problems. Completely up to you. You don't lose anything by having a type for UTF-8 strings. That's just the transfer encoding, you can recode it to whatever you want.

@methane Do you even hear what you're saying?

But msgpack is a inter language communication format.
We should communicate with weak typed languages like php or JavaScript.

So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-language, you should recognize that the only way to have that is to add support for a string type.

@rasky Actually, no, everything can be a key in a dictionary as long as it implements -copy, -hash, and -isEqual:. But who wants to use binary keys in their code? That would always be "Get the bytes from an NSString, create an NSData, and then pass that to objectForKey:". :)

@frsyuki
Owner

@Midar I couldn't catch what Layer 2 means... do you have some examples? I guess Layer 2 has 2 options:

1. Layer 2 also doesn't distinguish strings from byte arrays.
2. Layer 2 implements its own type system on top of msgpack's type system.

Have you implemented a UTF-8 validator (which would be required by serializers)? I don't think it fits into 20 lines of C code...

@Midar

@frsyuki Layer 2 is what you put inside MsgPack: a protocol that says "at this place I expect an array, a string, some bytes". Without that knowledge from a protocol that is completely separate from MsgPack, you can't parse MsgPack, and that's really broken.

Yes, I have implemented UTF-8 checking, encoding and decoding. It's easily possible in 20 lines each (decoding and encoding). Here's both, with a lot of wasted space that could easily be reduced:
https://webkeks.org/git?p=objfw.git;a=blob;f=src/OFString.m;h=cc873dab3d178abd0f4ed94546a5b0d74add8171;hb=HEAD#l77
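For comparison, here is a sketch of a strict UTF-8 validator in Python rather than C (the ranges follow the standard UTF-8 well-formedness rules; this is an illustration, not @Midar's actual ObjFW code):

```python
def is_valid_utf8(data):
    """Strict UTF-8 validator: rejects overlong forms, UTF-16
    surrogates, and code points above U+10FFFF, like a strict decoder."""
    i, n = 0, len(data)
    while i < n:
        b = data[i]
        if b < 0x80:                        # ASCII
            i += 1
            continue
        if 0xC2 <= b <= 0xDF:               # 2-byte sequence
            need, lo, hi = 1, 0x80, 0xBF
        elif b == 0xE0:                     # reject overlong 3-byte forms
            need, lo, hi = 2, 0xA0, 0xBF
        elif b == 0xED:                     # reject UTF-16 surrogates
            need, lo, hi = 2, 0x80, 0x9F
        elif 0xE1 <= b <= 0xEF:             # other 3-byte sequences
            need, lo, hi = 2, 0x80, 0xBF
        elif b == 0xF0:                     # reject overlong 4-byte forms
            need, lo, hi = 3, 0x90, 0xBF
        elif 0xF1 <= b <= 0xF3:             # other 4-byte sequences
            need, lo, hi = 3, 0x80, 0xBF
        elif b == 0xF4:                     # reject > U+10FFFF
            need, lo, hi = 3, 0x80, 0x8F
        else:                               # 0x80-0xC1, 0xF5-0xFF invalid
            return False
        if i + need >= n:                   # truncated sequence
            return False
        if not lo <= data[i + 1] <= hi:     # constrained first trail byte
            return False
        for j in range(2, need + 1):        # remaining trail bytes
            if not 0x80 <= data[i + j] <= 0xBF:
                return False
        i += need + 1
    return True
```

It is compact, but note how much of it is the special-case table frsyuki worries about (overlongs, surrogates, range caps), which is exactly the part that is easy to get wrong across many ports.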

@rasky

Can you please explain WHY you need UTF-8 validations?

In unicode rich languages, you will convert UTF-8 into Unicode, and validation is performed by the language itself (or its standard library). No code to write.

In unicode poor languages, there is no Unicode data type, so you leave UTF-8 as-is.

Why do you ever need to include a UTF-8 validator?

@frsyuki
Owner

@Midar MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.

Please don't think one type system works perfectly for all languages. All implementations have to manage the inconsistency between language types and msgpack types.

The problem is which causes more trouble: a) projecting strings and byte arrays onto the Raw type, or b) projecting the Raw type onto strings or byte arrays.
I understand supporting UTF-8 has lots of merits. Why do you think the troubles caused by having UTF-8 are more manageable than those caused by not having it?

@frsyuki
Owner

@rasky I suggested a way to handle a binary-or-string type in dynamically typed languages without a schema:

  • I suggest Objective-C/JavaScript implementations to have following solution:
    • unpacker deserializes byte sequence as an object of NSStringOrData class which inherits NSString
    • the object contains a validated UTF-8 string
    • if the validation failed, it's nil or something we can tell that the validation failed
    • NSStringOrData#data returns the original byte array
@frsyuki
Owner

@rasky > Can you please explain WHY you need UTF-8 validations?

Because:

  • successfully stored data must be read successfully

Imagine that an invalid UTF-8 string is stored on disk with the information "this is a UTF-8 string".

@Midar

@frsyuki

MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.

This is exactly what I'm saying, which is why I don't get why on the one hand you are against a string type, which is required for a lot of languages, while on the other hand you praise interoperability - which you just destroyed by not having a string type!

Why do you think the troubles caused by having UTF-8 is manageable compared to not having UTF-8?

You still haven't shown us where exactly UTF-8 would cause trouble for MessagePack. What exactly makes UTF-8 harder for you? Again, if you care about internationalization as much as about interoperability, you can convert it to some other non-Unicode encoding. If you use a Unicode encoding, you have these "problems", as you call them, anyway.

@Midar

unpacker deserializes byte sequence as an object of NSStringOrData class which inherits NSString
the object contains a validated UTF-8 string
if the validation failed, it's nil or something we can tell that the validation failed
NSStringOrData#data returns the original byte array

Oh great, now I have to implement another string class (remember: NSString is just a class cluster. If I subclass it, I have no implementation!) just because you have never heard of separation of layers? Sorry, but no, just no. If it stays this way, I just won't implement MsgPack, and I'm sure many others won't either. Not because they don't like the idea, but simply because you made it impossible to parse in a sane manner.

@frsyuki
Owner

@rasky > I think that, if an implementer is going to communicate between a unicode-rich language and an unicode-poor language, it is up to the implementer himself to take care of these small details.

My proposal is that msgpack doesn't support a string type, but does support user-defined types. That means an implementer can add a string type if he needs it.
Do you think this would not work?

@methane

So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-language, you should recognize that the only way to have that is to add support for a string type.

I'm sorry about my poor English.
What I want to say is that msgpack should be designed for many languages, not only for languages that distinguish strings from bytes.

@Midar

@frsyuki Yes, I think this does not work, as everybody will come up with his own string type, and there will be no interoperability. Please stop claiming that not implementing a string type improves interoperability, when it clearly does the exact opposite, as has been stated by many; it is actually the issue that prevents many from using MsgPack or taking it seriously.

@methane Yes, I agree. It should work with all languages. But for that, a string type is required. For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.

@frsyuki
Owner

@rasky For example, in Ruby (1.9), the following code returns a String object with UTF-8 encoding information:

require 'uri'
s = URI.unescape("%DE")
p s.encoding

This easily happens in many applications, including Rails. Is this a string, or binary? I think it depends on how applications handle this object.

Additionally, the following code returns the same kind of object:

require 'msgpack'
s = MessagePack.unpack("\xA1\xDE")
p s.encoding
@Midar

@frsyuki And exactly that is the problem. It depends on how the application handles it! There is no way to know that without knowledge of the Layer 2 protocol! Why do you insist on ignoring basic principles of software design?

@methane

For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.

Then how should they serialize such binary data?
When I send a string from Python to PHP, PHP may send it back to Python as the binary type...

@Midar

@methane By having an optional parameter how it should be treated in encoding, by wrapping it into some object, etc. There are many ways to overcome this in languages which don't make a difference. There is absolutely no way to overcome not having a string type in languages which do make a difference.

@frsyuki
Owner

@Midar Whether an object should be a byte array or a string depends on the application.
I said the lifecycle of applications (programs) is shorter than that of data, and data should be isolated from applications. Do you agree with that opinion?

Applications can change, but the data should not have to change at the same time. An application may come to treat as a byte array data that it considered a string before. But we can't change stored data, and we can't update all the old code on the same network at the same time.

@chakrit

@methane you are describing the exact problem that can be solved by adding a proper string type.

Python -> STR_XXX -> PHP -> BIN_XXX -> Python

Now Python knows it is getting some binary.

And the same python server can then do:

Python -> STR_XXX -> Node.js -> STR_XXX -> Python

Now Python knows it is getting a UTF8 string.

Now, imagine the above scenario without the String type.

Python -> BIN_XXX -> PHP -> BIN_XXX -> Python

Now Python does not know whether it is getting binary or a string (because it does not, and should not, need to know that the source language is PHP)

Python -> BIN_XXX -> Node.js -> BIN_XXX -> Python

Now Python does not know whether it is getting binary or a string (because it does not, and should not, need to know that the source language is node.js)

We have this problem, and there's no way to tell exactly, because you don't have the String type in msgpack!
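The round-trips above can be sketched with explicit type tags. The marker values below are illustrative (0xD9 and 0xC4 happen to be the str8/bin8 markers that a later msgpack spec revision adopted, but here they are just tags), and the helpers are hypothetical:

```python
# Sketch of chakrit's scenario with distinct type markers: the receiver
# can recover the sender's intent without knowing the source language.
STR, BIN = 0xD9, 0xC4  # illustrative marker values

def tag(value):
    """Tag a value as (marker, payload bytes) before sending."""
    if isinstance(value, str):
        return (STR, value.encode("utf-8"))
    return (BIN, bytes(value))

def untag(tagged):
    """Reconstruct the sender's type from the marker on receipt."""
    marker, payload = tagged
    return payload.decode("utf-8") if marker == STR else payload

# Python -> STR -> (any language) -> STR -> Python round-trips as text,
roundtripped = untag(tag("東京"))
# and a binary blob stays binary:
blob = untag(tag(b"\x00\xff"))
```

With only a single raw tag, `untag` would have to guess, which is exactly the ambiguity the diagrams above describe.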

@frsyuki
Owner

@Midar @chakrit > For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.
This doesn't work.

For example, suppose a server program requires that some data be serialized as the string type. Another program, written in PHP, can't tell strings from the binary type. Let's say it sends the data as the binary type. Then the PHP program can't send requests to the server.

@chakrit

@Midar Whether an object should be a byte array or string depends on applications.

Yes, agreed. But because you don't have a String type, participating applications get confused.

PHP -> BIN_XXX -> Python (oh hai, is that a string or a binary? I'm just gonna make it a binary and show gibberish to my user then.)

You are totally missing the point here. It's interoperability between applications, not how a single application should be architected.

I said lifecycle of applications (programs) is shorter than data, and data should be isolated from applications. Do you agree with this opinion?

Yes. But you are one step too liberal there making everything harder by not providing a way to specify a string.

Effectively a premature optimization.

Applications could be changed. But data should not be changed at the same time

Yes.

Applications may consider that the data is a byte array which was considered string before.

There's the problem. If you had a string type, then all applications could tell whether it was a string or a byte array all along.

But we can't change stored data. We can't update the old code in the same network at the same time.

As per reasoning above, all the more why there should be a string data type.

@chakrit

For example, a server program requires that data should be serialized in string type.

ah ha.

Another program written in PHP can't tell strings from binary type.

PHP will be able to tell if there is a "string" marker in msgpack telling it that the blob is a string.

Again, you have this problem exactly because you don't have String in msgpack

Let's say it sends data in binary type.

And it could then talk with other languages such as Python about whether that "binary" that PHP can't differentiate is meant to be treated as a string or a giant blob of data.

Then the PHP program can't send requests to the server.

If you have String in msgpack, PHP could send a binary blob and tell the server "please treat this blob as a String".

But because you don't have String in msgpack. This is then a problem.

@frsyuki
Owner

@chakrit > PHP could send a binary blob and tell the server "please treat this blob as a String".
It means the receiver needs to decide how to handle the received data even if it has string type information or byte array type information. In other words, the receiver knows how to handle the data. The sender doesn't (have to) know.

Plus, the receiver can't (shouldn't) trust the received data. Thus in any case, the receiver should validate the data type.

@Midar

@frsyuki

Whether an object should be a byte array or string depends on applications.

No, this does not depend on the application, this does depend on the protocol!

I said lifecycle of applications (programs) is shorter than data, and data should be isolated from applications. Do you agree with this opinion?

I don't see what that has to do with a string type, except that with a string type, data is interoperable and you can still read it years later.

Applications may consider that the data is a byte array which was considered string before. But we can't change stored data. We can't update the old code in the same network at the same time.

Oh dear, please tell me you meant something else. Are you really just dumping your internal structure instead of having a sane protocol? If you dump your internal structure, there is no interoperability anyway. If you don't dump the internal structure, but have a well-designed format to store the data, you want to have a string type for interoperability.

For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type. Let's say it sends data in binary type. Then the PHP program can't send requests to the server.

I'm pretty sure PHP has a different type for strings and binary, at least it can have one. If not, it's still possible to have a class MsgPackString and MsgPackData in which you can wrap your data so the serializer knows what it is.

Having too much information is never a problem; you can just discard it. But you can't recover information that just isn't there!

@methane

@chakrit
datetime and bytes are also fundamental types. JSON can't serialize them.
But we can use them in JSON by convention: "this string is a base64-encoded PNG", "this string is an ISO 8601 datetime."

"These bytes are a UTF-8 encoded string" in msgpack is the same thing.

How many types to support is a format design decision.
Msgpack decided to be like JSON, but with bytes instead of strings.
I think BSON is the format you want. It supports bytes, string, datetime and others.
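The JSON convention described here can be sketched with only the standard library; the field names and the sample values are illustrative, not from any real schema:

```python
import base64
import json
from datetime import datetime, timezone

# Application-level convention, as with JSON: the *schema*, not the wire
# format, says which fields carry base64 bytes or ISO 8601 datetimes.
payload = {
    "png": base64.b64encode(b"\x89PNG\r\n").decode("ascii"),
    "ts": datetime(2013, 1, 1, tzinfo=timezone.utc).isoformat(),
}
wire = json.dumps(payload)

decoded = json.loads(wire)
# Recovering the original values requires knowing the convention.
assert base64.b64decode(decoded["png"]) == b"\x89PNG\r\n"
assert decoded["ts"] == "2013-01-01T00:00:00+00:00"
```

The trade-off debated in this thread is exactly whether such conventions belong in the application (as here) or in the format itself.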

@frsyuki
Owner
  • can the receiver use a received string as-is? I think it can't. because:
    • The receiver needs to convert it into a byte array if the receiver needs a byte array.
    • The receiver needs to confirm the data is a string if the receiver needs a string.
  • can the receiver use a received byte array as-is? I think it can't. because: same with ^

To be secure and interoperable, applications shouldn't care whether the received data has binary-type info or string-type info.

@chakrit

@methane "bytes instead of string" --> this is not the case, because msgpack did not specify this exactly in the protocol, leaving it open to misinterpretation by parties, as illustrated by the OP's question.

If that is what msgpack intended, it should have said "THIS IS MEANT TO BE USED FOR STRINGS AND STRINGS ONLY" in the spec, not provide a raw-bytes description and expect everyone to treat it correctly as a string.

Because then the driver writer wouldn't be able to implement it.

With JSON we're okay because the spec says that it does not handle DateTime and that every string should be treated as a string, and thus we build our own workaround with that in mind.

But for msgpack this is confusing and hard to implement correctly (as illustrated by the OP's problem), because msgpack doesn't specify a String data type. We have to roll our own anonymous version by piggybacking on the raw data type, while being tricked into thinking we can have both (String and Buffer), since the msgpack spec allows it but doesn't specify what to do when we need the fundamental data type that is the String.

This totally breaks interoperability, because the driver implementer can't provide a sane way to decide whether to treat a blob as data or as a string, as is the case with ObjC. Keep in mind that String is a very fundamental data type on most platforms.

If msgpack wants us to treat these binary blobs as data from the start, it should just say so in the spec.

But if msgpack wants to provide blob as well as a string, then there should be a protocol-built mechanism to differentiate between the two. Leaving it up to interpretation is bad for a standard wire protocol.

@Midar

@frsyuki

can the receiver use a received string as-is?

Yes. If there is a string type, that's possible, unlike without one!

The receiver needs to convert it into a byte array if the receiver needs a byte array.

Wrong. If there is a string and a binary type, it just gets the right thing unless someone broke the protocol.

The receiver needs to confirm the data is a string if the receiver needs a string.

Yes, it can verify if it's actually a string, but you'd need to do the same for a number, so this is not a valid argument.

can the receiver use a received byte array as-is? I think it can't

Yes, it can. Same as above.

To be secure and interoperable, applications shouldn't care whether the received data has binary-type info or string-type info.

Wrong. To be secure and interoperable, the protocol should be well-defined (i.e. saying either string or binary) and reject violations of the protocol.

I'm really getting tired of talking to a wall. I assume you don't really read what you write, because you keep bringing up arguments for a string type, only to then say they are against a string type, even though they're clearly for it.

Maybe we have a communication problem here and you don't know what interoperable means? Interoperable means that a format can be read by different applications in different languages - something you try to prevent by not having a string type. Interoperable does not mean just dumping your internal state!

@rasky

@methane then why does your Python msgpack library accept Unicode strings as input? The answer is simple: because strings are a fundamental data type available in all major languages. By discarding information on its type, you're irreparably losing information that can't be reconstructed.

@saki7

@Midar says:

I'm really getting tired of talking to a wall. I assume you don't really read what you write, ...

This is not a polite statement. I think @frsyuki and other collaborators who agree with @frsyuki are trying to understand your problem. But they still have their solid opinion which is against yours.

I understand both @Midar and @frsyuki 's thoughts, but my opinion is as follows:

This is an application layer problem. The application must be aware of the encoding which it deals with, not the protocol.

Please note that, this is my fully personal answer, and it does not include any political or arbitrary meaning, since I don't belong to the MessagePack developer team.

@Midar 's opinion is like: "We must assume anything we receive is definitely correct."
I disagree. It's not the data that decides. We do. Our application decides whether the data is correct or not (or is in a certain format).

@frsyuki
Owner

@Midar Sorry, sometimes I couldn't understand what you meant. But I'm not kidding.

it just gets the right thing unless someone broke the protocol.
it can verify if it's actually a string, but you'd need to do the same for a number, so this is not a valid argument.

Anyone can break the protocol. I think not having string type is better to manage following two problems: 1) how to handle the broken protocol. 2) how to prevent broken protocols.

If there were a string type, and an application stored data as a byte array, and the application later changed its mind to handle the data as a string (this often happens, right? applications change as the business changes), the data would be considered broken. But it still represents the same data. I think it should not be considered broken.

The receiver should validate all arguments. It should not assume that all senders think the byte sequence is a string. The receiver knows it wants to handle the data as string, or byte array. It means it can validate the type.

My opinion is that sane protocol handlers should not tell strings from byte arrays. The applications should know whether byte arrays or strings are needed.
Thus protocols don't have to tell strings from byte arrays.

To be secure, the protocol should be well-defined (i.e. saying either string or binary) and reject violations of the protocol.

I meant that protocols often change even if they're well-defined. A handler should reject invalid protocols, but I think changing strings to/from byte arrays should not be considered a protocol change, because applications decide the difference. The data itself is the same.

@moriyoshi

The confusion may arise when two different kinds of octets, strings and binaries, occur in the same set of objects; that would be the case where differentiation is necessary, and it isn't addressed by msgpack by design. Why don't we blame the HTTP spec for not specifying a means to handle non-ASCII strings within the request URI? Because what it represents totally depends on the content, as with HTML, and how it's encoded is actually specified by the HTML specification. That is how a design decision goes.

@ganwell referenced this issue in ellisonbg/zmqweb: "Serialization, integration tests and travis" #2 (Closed)

@rasky

@frsyuki can you explain why a "sane protocol handler" should NOT tell strings from byte arrays, but it should tell floats from byte arrays? Floats are a sequence of bytes in the PC memory, why should msgpack care about them?

@chakrit

@frsyuki regarding "My opinion is that sane protocol handlers should not tell strings from byte arrays"

If you insist on that, please definitely do update the spec to properly codify that opinion and mark the objective-c handler as broken because it auto-converts buffers to String without the application developer's consent so it's much clearer on how everything should've been implemented.


That aside, I still want String in msgpack as I see no point why the application developer should need to worry about this conversion process.

This should be a job of the protocol handler but which it will not be able to do easily since the required type information is missing and must still be provided by the application developer by means of a schema -- which IMO is an ugly solution at best.

@Midar

@saki7

This is not a polite statement.

Sorry, I'm getting really frustrated from repeating myself over and over again and only being responded to with ignorance of a problem so serious that it is actually PREVENTING ME AND OTHERS FROM USING MSGPACK AT ALL!

@Midar 's opinion is like: "We must assume anything we receive is definitely correct."
I disagree. It's not the data that decides. We do. Our application decides whether the data is correct (or is in a certain format) or not.

This is not correct. This is not something about verifying, this is something about EVEN BEING ABLE TO PARSE AND STORE IT in some languages. You still need to verify it. This is not even the topic! It's about whether something is a string or some binary data and thus should be decoded into a string or binary.

Anyway, with your argumentation, why do we even have a type for numbers? We could just store it as binary. It's up to the application to interpret it correctly! And while we're at it, why not go to the next level and only use binary, so we don't need MsgPack at all? That seems to be what you want.

And what kind something is really belongs to the protocol, not the application…

@frsyuki

Regarding 1, if there were a string type, and an application stored data as a byte array, and the application later changed its mind to handle the data as a string (this often happens, right? applications change as the business changes), the data would be considered broken. But it still represents the same data. I think it should not be considered broken.

The problem is that it outputted it as binary instead of string in the first place! Nothing like that would have happened if it would have used the string type from the start!

The receiver should validate all arguments. It should not assume that all senders think the byte sequence is a string. The receiver knows it wants to handle the data as string, or byte array. It means it can validate the type.

Yes, it has to validate the type. But just because it has to validate the type DOES NOT MEAN THE TYPE HAS TO BE UNSPECIFIED. If you want that, why even use MsgPack? Then you don't need number, bool, etc., just binary.

My opinion is that sane protocol handlers should not tell strings from byte arrays. The applications should know whether byte arrays or strings are needed.
Thus protocols don't have to tell strings from byte arrays.

Which totally makes it impossible to parse it just as a single layer, but instead you need a schema and thus knowledge about the inner layer. But why even talk about that anymore? It seems you clearly hate everything about good protocol or software design, otherwise you would not defend a way that breaks with tens of years of software design and protocol design principles (and was the reason for the success of protocol stacks like TCP/IP) so fiercely.

Anyway, whatever. I give up. People who only know limited languages seem to be fine with it and are unwilling to interoperate with others. I'll just give up on MsgPack then. Good luck to @rasky, @chakrit and others who tried to talk some sense into people who never dealt with a language that does make a difference between strings and binary, but it seems there are a few people who only want to use it for unportable stuff like dumping internal state, and sadly, it seems the MsgPack author is among them, so for me personally, MsgPack is just useless and I'll move on to something more useful.

@rasky

@frsyuki if msgpack doesn't want to handle Unicode, then my request is that ALL msgpack bindings refuse to encode Unicode strings, and force people to use custom encoding/decoding code. This way, application developers will be aware of the design choice.

This would cause a rage, but I think it's exactly what you want. People will simply start using incompatible custom encodings for handling Unicode strings, and a big mess will arise. Or everybody will just agree on a single custom encoding, thus making it "standard" for everybody but the msgpack development team. That would be fine as well, in my opinion.

@methane

@rasky

@methane then why does your Python msgpack library accept Unicode strings as input?

My implementation packs a tuple into a msgpack array, and unpacks it into a list (from 0.3).
It is because I feel I can naturally map unicode and tuple to bytes and array.

@saki7

@rasky says:

@frsyuki can you explain why a "sane protocol handler" should NOT tell strings from byte arrays, but it should tell floats from byte arrays? Floats are a sequence of bytes in the PC memory, why should msgpack care about them?

I think that is because there is only one thing which the float type represents. It is clearly stated in the standards. And one more important thing we must remember: IT IS A PRIMITIVE TYPE.
When the stored data for the float type actually contains invalid bytes, we just receive an invalid float value after decoding. I think there's no problem with that, because it's our application's fault that it didn't store valid data. And the mistake doesn't cause any serious problem.

If multiple data types for floating point existed, like encodings for string types, maybe there would be similar problems. But I still think the floating-point example is another story.

@Midar

@saki7

If multiple data types for floating point existed, like encodings for string types, maybe there would be similar problems. But I still think the floating-point example is another story.

Actually, there are different floating point types. There's the difference in length (float, double, long double) and format (IEEE, VAX, etc.). So can we get rid of float, number, etc. now and just replace everything with binary? According to you and others, nobody needs to know the type on the protocol anyway, as the application knows it. So why not get rid of all that bloat and just replace everything with binary? And while we're at it, an array is also just binary. So why not replace MsgPack with binary? That has to be what you guys dream about. It's just your argument followed through.

@frsyuki
Owner

@rasky > Floats are a sequence of bytes in the PC memory, why should msgpack care about them?
In terms of the type system, I don't have strong opinions on why msgpack tells floats from integers. But we can store integers in fewer bytes if the serializer knows it's an integer.

What I care about is that the string "a" and the byte array "a" are exactly the same byte sequence, and applications should decide which one it is. The data should not describe what it is.

@rasky

@frsyuki you are wrong. In all Unicode-rich languages, the string "a" and the byte array "a" have TOTALLY different representations in memory. In fact, the string "a" is a sequence of codepoints, not bytes, so the sentence "the string 'a' contains a byte sequence" has no meaning whatsoever; it is just wrong reasoning.

In Python 2.x (narrow builds), the interpreter stores it internally as a UCS-2 (UTF-16-like) sequence, so it corresponds to the byte sequence 61 00 on a little-endian platform.
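The point that the in-memory bytes of a string depend entirely on the chosen encoding can be checked with the standard library (Python 3 shown here):

```python
# One codepoint, U+0061; its byte representation depends on the encoding.
s = "a"

assert s.encode("utf-8") == b"a"                   # 1 byte
assert s.encode("utf-16-le") == b"a\x00"           # 2 bytes: 61 00
assert s.encode("utf-32-le") == b"a\x00\x00\x00"   # 4 bytes

# A byte array b"a" is just bytes; equality with one particular encoding
# of the string is a property of that encoding, not an identity.
assert b"a" == s.encode("utf-8")
assert b"a" != s.encode("utf-16-le")
```

So "the string 'a' and the byte array 'a' are the same bytes" holds only under one specific encoding choice.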

@frsyuki
Owner

@chakrit > If you insist on that, please definitely do update the spec to properly codify that opinion and mark the objective-c handler as broken because it auto-converts buffers to String without the application developer's consent so it's much clearer on how everything should've been implemented.

Sorry, I don't know much about Objective-C. But it should provide a way to take out the original binary. What do you think, @chrishulbert?

@rasky

@methane then I think that, given this discussion, your Python binding is wrong, and shouldn't convert Unicode into byte arrays. It just makes bugs happen. It should be removed, and an exception raised. Since it's up to the application to handle Unicode (this is what @frsyuki says), then please let Python application programmers handle it, don't have an automatic behavior that can be wrong.

The same applies for all language bindings for languages that have a native Unicode data type. They should refuse to encode Unicode strings (since they would be losing important information) and let application programmers handle it. @frsyuki do you agree on this solution?

@chakrit

@frsyuki Yes, I think it is very easy to provide binary by default, as IIRC that is the default thing you get from the framework that handles internet connections already.

@saki7

@Midar wrote:

Actually, there are different floating point types. There's the difference in length (float, double, long double) and format (IEEE, VAX, etc.). So can we get rid of float, number, etc. now and just replace everything with binary? According to you and others, nobody needs to know the type on the protocol anyway, as the application knows it. So why not get rid of all that bloat and just replace everything with binary? And while we're at it, an array is also just binary. So why not replace MsgPack with binary?

I know there are various floating-point representations in the world. What I actually wanted to say is that the data format which MessagePack handles is a single type. It's just a "floating point type". It is written in Format specification - MessagePack - Confluence.
And other float types which do not fit this specification must be converted to fit this format. That's left to each MessagePack language binding. And this is what we are actually talking about. It's a layer problem, too.

@rasky

@chakrit it's not that easy because you want a NSMutableDictionary out of a msgpack map, and NSMutableDictionary only wants NSString as key. See this fork: https://github.com/nferruzzi/msgpack-objectivec/commits/

@Midar

@frsyuki No, it should not, because Foundation can handle Unicode and stores the string as Unicode. That representation is - like in every other language supporting Unicode - system dependent. That can be UTF-8, UTF-16BE, UTF-16LE, UTF-32BE, UTF-32LE…

Oh, and of course, if you say it's UTF-8, you can't use arbitrary binary, as that would be invalid UTF-8 and you won't get any object at all!

@saki7 Yes, there is a single type MessagePack handles. So why not have a single type for strings as well? Everything that is a different representation needs to be converted, just like for floats.

@rasky

@saki7 you are totally missing my point. My question is: WHY is there a floating point type in the specification, at all? @frsyuki says that it makes sense NOT to have a string type in the specification. So why there should be a floating point type?

@chakrit

@rasky I see, then I think we supposedly need a custom NSDictionary implementation as well, since (even if msgpack had a string data type) it'd still be possible to get non-string keys in msgpack, right? So the handler needs to handle that regardless.

@Midar

@rasky @chakrit That is only half-true; NSDictionary takes any object as key which implements -[hash], -[isEqual:] and -[copy]. All those are true for NSData. But it is very impractical. You would need to create an NSString, get it into some buffer in some encoding, then create an NSData for that buffer and use that as a key. And that's almost as good as not having any MsgPack support ;).

@saki7

@rasky wrote:

@saki7 you are totally missing my point. My question is: WHY is there a floating point type in the specification, at all? @frsyuki says that it makes sense NOT to have a string type in the specification. So why there should be a floating point type?

I think that's quite a philosophical question. Isn't it just there for convenience or performance? Remember, it's a primitive type.

@chakrit

@saki7 i think we are all trying to tell @frsyuki that having a String type is a big convenience over the little performance/abstraction gain.

@ganwell

I'd like to go in another direction. If msgpack stays as it is, how can I encode this? base64, really?

msgpack.loads(msgpack.dumps(bytes(b'\xb9'), encoding='utf-8'), encoding='utf-8')

-> UnicodeDecodeError: 'utf-8' codec can't decode byte 0xb9 in position 0: invalid start byte

I set utf-8 just because I can expect the least failures.

@DestyNova
  1. Strings are a totally ubiquitous fundamental datatype for almost all programs.

  2. re: Sadayuki "lifecycle of applications (programs) is shorter than data, and data should be isolated from applications"
    Yes, but strings are a fundamental datatype.
    The lifespan of UTF8 as a character encoding is probably longer than the lifespan of most data.

  3. Some languages don't intrinsically support UTF8. So what? It is very feasible to implement a UTF8->XYZ decoder in the msgpack implementation for that language.
    This problem will be the same either way, except worse without recognising strings as a fundamental datatype in msgpack spec, because the user then has to do everything.
    And they currently need either a schema or to GUESS whether an array of bytes is a string or something else.
    Since people actually need to transmit strings, they will do it anyway if msgpack does not support it, but in an ad-hoc way which will make things more difficult and error-prone for everybody.

  4. re: Sadayuki: "UTF-8 has verbosity as well. 0x2F could be 0xC0 0xAF. Should deserializers reject these bytes? Or normalize into another character?"
    Why? If msgpack doesn't support UTF8, then people will use byte arrays to hold the same messages, with the same results (except more probability of errors since everybody needs to write encoders/decoders).

Summary:
With the current design of msgpack, when we discover some "raw" bytes, we have NO information about what's in that data. We don't know if it is a UTF8 string, or even any type of string. Maybe it's a JPEG image.

So we have to either know the schema (and not needing a schema was supposed to be an important quality for msgpack, right?), or we have to GUESS what the type of data is (like the Objective-C parser hack I made).
This situation is quite unacceptable for such a fundamental datatype as strings, and could easily be solved. Any problems with the solution (e.g. with languages that have poor datatype support) are problems that we already have, so we have nothing to lose by fixing this.
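The "guess" strategy mentioned above (similar in spirit to the Objective-C parser hack) can be sketched in a few lines; `guess_decode` is a hypothetical helper written for illustration, not part of any msgpack library:

```python
def guess_decode(raw: bytes):
    """Heuristic: assume raw bytes are UTF-8 text, fall back to bytes."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw

# Works when the binary happens to be invalid UTF-8...
assert guess_decode("東京".encode("utf-8")) == "東京"   # meant as text: ok
assert guess_decode(b"\x89PNG\r\n") == b"\x89PNG\r\n"  # invalid UTF-8: bytes

# ...but binary data that happens to be valid UTF-8 is silently
# misclassified as text. That is exactly the guessing problem.
assert guess_decode(b"abc") == "abc"
```

The last assertion is the failure mode: the receiver returns a string for a payload the sender meant as bytes, and no amount of validation can detect it.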

@saki7

@chakrit wrote:

@saki7 i think we are all trying to tell @frsyuki that having a String type is a big convenience over the little performance/abstraction gain.

Having encoding information stored in the data does not result in better convenience; it has a serious problem. There might be various systems and various language bindings dealing with a single piece of data. Not every language has a strong/safe type system or a string type (with full encoding support).
Ultimately, the application must validate the data. Not the protocol.

@rasky

@saki7 let's say that MsgPack has a string type, and you are using a Unicode-poor language. How is this a serious problem? You get the binary UTF-8 representation. Period. How is this a problem?

@DestyNova

@saki7
How is that not already a problem?

If we distinguish, for example, UTF8 encoded strings in msgpack from byte arrays, then it's a SMALLER problem because at least they know how to interpret the incoming data.

@ganwell

Sorry my bad.

msgpack.loads(msgpack.dumps(bytes(b'\xb9')))

of course works.

@chakrit

@saki7 no, there is no need to have an encoding information stored on the data.

If we all agree to treat the blob as UTF-8, msgpack adds a single type marker to indicate a string blob. (0xXX something)

Problem is solved.

No change or encoding information beyond this is needed inside the protocol. It is really that simple.
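For context, a single extra type marker is in fact the direction the revised (2013) msgpack spec eventually took: the old raw family became str (fixstr 0xa0-0xbf, str 8/16/32 at 0xd9-0xdb) and a separate bin 8/16/32 family (0xc4-0xc6) was added. A sketch of classifying a value by its leading type byte under that revision:

```python
def classify(first_byte: int) -> str:
    """Classify the str/bin families of the 2013 msgpack spec revision."""
    if 0xA0 <= first_byte <= 0xBF:
        return "str (fixstr)"
    if first_byte in (0xD9, 0xDA, 0xDB):
        return "str (str 8/16/32)"
    if first_byte in (0xC4, 0xC5, 0xC6):
        return "bin (bin 8/16/32)"
    return "other"

assert classify(0xA1) == "str (fixstr)"       # e.g. b"\xa1a" packs "a"
assert classify(0xC4) == "bin (bin 8/16/32)"  # e.g. b"\xc4\x01a" packs b"a"
```

With the marker on the wire, the receiver can distinguish the two families without a schema, which is the whole point being argued here.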

@Midar

@saki7 Why would this require a strong type system? That's just nonsense. It's not like you add extra info for encoding, you add a new type for a string and choose one encoding like UTF-8 that is used on the wire. What you use internally is your decision.

@saki7

Again, what if the data bytes were invalid even though MessagePack had an encoding bit? Ultimately, your application must be aware of the actual type stored in the data. Thus, supporting an encoding for strings does not provide further convenience.

But still, support for primitives such as floats must be there, for convenience or better performance, as I said.

@kuenishi
Owner

Hi all, I'm the maintainer of msgpack-erlang. Erlang does not have a native string type. If a string type is added, I can't maintain msgpack-erlang any more. Let alone damn Unicodes.

I don't like this kind of "Hey, I need type X for msgpack specs" where X = time, string, date, or anything you like. I like Sada's minimal design choice of types. The more types msgpack supports, the more language interoperability msgpack loses. To be honest, I'd want types for atom, tuple, pid (Erlang pid), BigInt, and function.
OCaml and Haskell guys might want polymorphic variants or algebraic data types. Why don't we stop arguing about NATIVE types and move up to application-layer design, or hack on msgpack-idl?

@chakrit

@saki7 That is beside the point. Applications must validate all data types, be it float, number, blob, whatever. This has nothing to do with having a String or not.

I don't think anyone here is advocating that msgpack implements unicode. It just needs to support having unicode on the wire in a good way that the protocol handler can implement. Do you understand what I'm trying to say here?

@Midar

@saki7 You have to handle that right now, too. So how is this different from now? The only difference is that right now you have to do it yourself, whereas with a string type the library could do it for you.

@kuenishi There IS a difference between lists (which are used for strings) and binary in Erlang. For binary, you could use <<>>, for strings, you could use lists. So no, it's not impossible for Erlang. But it is for many other languages the way it is right now.

@frsyuki
Owner

@Midar > Oh, and of course, if you say it's UTF-8, you can't use arbitrary binary, as that would be invalid UTF-8 and you won't get any object at all!

I think this is an important part but I couldn't catch what you meant.

UTF-8 strings should be read by unpackers as a UTF-8 string even if they include invalid bytes. Then the application should decide whether to reject the data or not, whether to normalize it or not, how to normalize it, etc. I believe the handling depends on the application, and the msgpack library can't decide it.
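The policies listed here (reject, normalize, or defer the decision) map directly onto standard codec error handling; a stdlib-only Python sketch with an illustrative payload:

```python
raw = b"abc\xb9def"  # raw bytes containing an invalid UTF-8 byte (0xb9)

# Policy 1: reject the data outright.
try:
    raw.decode("utf-8")  # strict mode raises on the invalid byte
    rejected = False
except UnicodeDecodeError:
    rejected = True
assert rejected

# Policy 2: normalize, replacing invalid bytes with U+FFFD.
assert raw.decode("utf-8", errors="replace") == "abc\ufffddef"

# Policy 3: keep the bytes as-is and let a later layer decide.
assert isinstance(raw, bytes)
```

Which policy is right is indeed application-specific; the dispute in this thread is only about whether the wire format should say that the payload was *meant* to be text at all.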

@saki7

+1 for @kuenishi.

Determining which native types to support, actually is, a philosophy issue. That's left for MessagePack developers and @frsyuki.

@chakrit

@kuenishi You do not have to actually process the string.

If you are writing a user-facing application that must display a string coming from msgpack, then you need to handle string encoding anyway. So you are good.

If you are not writing a user-facing application that must display a string, then you can simply store it as a binary blob. But at least the information that this very blob should be treated as a String is still there, should the protocol handler need it (or any downstream consumer need to process the data further).

There is a difference there.

@Midar

@frsyuki

UTF-8 strings should be read by unpackers even if it includes invalid bytes as a UTF-8 string.

Actually, no. Because most languages refuse to even store invalid Unicode. Most parse UTF-8 and convert it to an internal representation, which then can't store invalid UTF-8. And this would be a gigantic waste of space: even if there were a way to store invalid UTF-8 (e.g. by using codepoints above the 21 bits of Unicode), it would take 4x the space when the internal representation is UTF-32. And you talk about efficiency?!

@kuenishi
Owner

@Midar the problem is not the difference between binaries and lists, but that the code can't tell strings from lists of integers. It doesn't matter whether it is ascii or unicode.

@frsyuki
Owner

@rasky

please let Python application programmers handle it, don't have an automatic behavior that can be wrong.
The same applies for all language bindings for languages that have a native Unicode data type. They should refuse to encode Unicode strings (since they would be losing important information) and let application programmers handle it. @frsyuki do you agree on this solution?

It's interesting idea. I think it's a possible idea to provide an option to reject unicode strings.

@kuenishi
Owner

@chakrit I'm sorry, I don't understand your point. I think whether it is user-facing or not does not matter.

@Midar

@kuenishi You could still do something like {data, <<foo>>} vs. ['f','o','o']. You would usually assume a string is wanted, unless you need the former. Same for deserializing. Erlang's pattern matching makes that very easy. But IIRC there was some difference between <<>> and [].

@saki7

I refrain from making further comments, since I have described every reason for my opinion. Please do not argue against me, but instead think about what the better design is. I agree to @frsyuki's thoughts.

@chakrit

@kuenishi exactly.

If unicode does not have any meaning to your application, then you can just treat it as another Blob from your point of view.

@kuenishi
Owner

@Midar writing a pattern match for every msgpack-flavoured term will make programming stupidly hard for users.

@nurse

As an i18n committer of Ruby: if MessagePack gets a string type, it should be an optional string (or encoding) annotation.

First of all, a protocol MUST have error handling. A bad example is HTML4, which doesn't define error handling for parsing errors.

Now, some people complain that MessagePack doesn't have a string type and that this forces users to handle the string/binary distinction themselves. Even if that is by design and MessagePack only treats raw bytes, it is natural for people to complain about it.

But adding a string type will solve it? There are some problems.

First, some languages don't differentiate between raw binary data and text strings. Ruby 1.8, Perl without the utf8 flag, JavaScript, OCaml, C without wchar_t, PHP 5.2.0 or earlier, and so on don't have the distinction, so they would need some schema-based translator for a MessagePack with String.

Second, the difference between string and binary is sometimes ambiguous. For example HTTP logs: they are usually strings and you want to treat them as strings in MessagePack. But once someone attacks your servers, those logs may contain invalid bytes.

Third, a sender may send invalid UTF-8 strings. As a string type that is simply invalid data, but for archival it must be saved as is. This is a difficult problem in such a schema.

Therefore MessagePack should work without a string type. But for the convenience of Unicode users it may have an annotation expressing that a binary shall be treated as a string, set when the sender knows it should be. If the receiver understands the annotation, it can treat the data as a string; if not, it still works with binaries. Likewise, if a sender doesn't know the string type, a receiver may treat the data as a string or simply ignore the distinction. This also allows mixing MessagePack with and without String.

@Midar

@kuenishi Well, if you really want to differentiate, that is. Let me just ask you: do you differentiate between an integer and a float? I don't think so, as Erlang does not. So why would you differentiate between binary and string if for Erlang they are the same, but not between integer and float? For you, nothing would really change. You would still just not care whether it's binary or a string.

@frsyuki
Owner

@Midar

Actually, no. Because most languages refuse to even store invalid Unicode. Most parse UTF-8 and convert it to an internal representation - which then can't store invalid UTF-8.

Then I have to say some languages store strings as is without converting them into UTF-8 or UTF-16, and msgpack focuses on cross-language. Anyway, if msgpack had string type, serialisers should validate strings before storing.

@kuenishi
Owner

@chakrit If the unicode type is forced to be binary, then how can you forward or send that object back to another language? In Erlang a binary is encoded as binary; it can't be serialized as unicode without an annotation.

@kuenishi
Owner

+1 to @nurse

@chakrit

@kuenishi If you need to forward or send back that object to another language, all you need to do is make sure to preserve the bit in msgpack that says the blob is a string (supposing msgpack has it); you can leave everything else in the string as is.

You just forward the binary, coded as binary, exactly as you received it. I don't quite get what you mean there.

@jj1bdx

+1 to @nurse

@saki7

+1 for @nurse.

@kzk

@Midar @chakrit @rasky Why don't you guys just use BSON? By sacrificing size and performance, you'll have more types like datetime (we've had the same discussions about that) and UTF-8 string. > http://bsonspec.org/#/specification I'm just curious why you guys have spent so much energy trying to add a string type to MessagePack rather than just switching to BSON? Do you still need better performance or smaller size than BSON?

As far as I know, MessagePack is originally designed for the serialization format for RPC (Remote Procedure Call), and the data interoperability is the first priority by nature. The project founder @frsyuki's ultimate goal of MessagePack project is, designing and implementing the interoperable data exchange format across any languages. Adding string type seems to prevent this PRIMARY project goal.

BTW, having user-defined custom types could broaden the MessagePack's use case and adoption, that would be ideal.

@saki7

+1 for user-defined custom types in future; it would be convenient for advanced usages.

@chakrit

@kzk because, barring this request, I like msgpack a lot. It will have much much more potential for what I see as a very little change.

And BSON is a long way from here, as you've said. And that's exactly why I'm arguing here.

@kuenishi
Owner

@Midar as you might not know Erlang types well: it does not have a type named number(). There are just integer and float (the latter is double precision, so Erlang can't preserve other float formats). If you're talking about example cases, what about imaginary numbers, or real numbers with more precision than double? We can't support all types of numbers.

@kuenishi
Owner

Adding notation like MUST, MAY, OPTIONAL from RFC 2119 may help this problem.

@kuenishi
Owner

It may be good to declare that C1 byte is for extension with precise byte length, rather than leaving it as reserved.

@methane

@methane then I think that, given this discussion, your Python binding is wrong, and shouldn't convert Unicode into byte arrays. It just makes bugs happen. It should be removed, and an exception raised. Since it's up to the application to handle Unicode (this is what @frsyuki says), then please let Python application programmers handle it, don't have an automatic behavior that can be wrong.

You're right. But there are tradeoffs.

As you know, Python's json serializes a tuple into a JSON array, which is deserialized back into a Python list.
Bijection is not a strict requirement for inter-language serialization. Practicality beats purity.
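The tuple/list asymmetry mentioned above is easy to demonstrate with the standard library alone:

```python
import json

data = {"point": (1, 2)}          # a tuple on the way in
restored = json.loads(json.dumps(data))

# JSON has no tuple type, so the value comes back as a list.
assert restored["point"] == [1, 2]
assert restored != data           # bijection is lost, yet json stays practical
```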

In this case, I'll make a decision about changing default behavior after this thread is closed.

@gfx

Hello, I am a maintainer of Perl's Data::MessagePack.

I agree with @nurse. That is, string type as an annotation.

Because Perl does differentiate between text strings and binary data, a string type is required, and it will become a reason for Perl users to use MessagePack instead of JSON. Even if it is just an annotation, it's better than nothing.

@Midar

The project founder @frsyuki's ultimate goal of MessagePack project is, designing and implementing the interoperable data exchange format across any languages. Adding string type seems to prevent this PRIMARY project goal.

This is so untrue. Actually, NOT having a string type is preventing this goal. How could having a string type prevent it? Nobody even tried to explain it because this is just bullsh*t.

@kuenishi I do know about Erlang types, and there is not even double. That is exactly what I meant. MsgPack has int, float, etc.; Erlang only has bignum. So, how do you handle those? If you don't care whether it's an int or a float, then why suddenly care whether it's binary or a list?

I hate how everybody here's a hypocrite and wants to have something different for float and int, but not for string and binary, even though the former is waaaaay less important, as both are numbers, whereas string and binary are a string and just anything.

@DestyNova

@kzk "Why don't you guys just use BSON?"

Because I've already implemented a client-server architecture in Java (desktop), Java (Android) and Objective-C (iOS) which sends numeric, binary AND string data over HTTP requests using MessagePack, which I presumed was up to the job of serialising basic datatypes without needing a schema.

Have a look at the msgpack.org front page, right at the top:
"It's like JSON." ... and in the first paragraph, it talks about exchanging strings. Now it turns out that you CANNOT EXCHANGE STRINGS. You can actually only exchange byte arrays, and the receiver has to either guess or know in advance what the format is.

Only when writing the Objective-C client did I discover the problem of ambiguity surrounding strings, which the Obj-C implementation handles incorrectly by treating all binary data as UTF8-encoded strings. I made a quick patch to have it return an array of bytes if this parsing fails, but of course that is a bad solution.

However, changing to BSON or some other binary encoding format at this stage would be a lot of work, so of course I would rather not. Instead, I'll live with the hack for now. But of course I think supporting strings would be better for most people, and not really worse for anybody, as long as the implementations for problematic languages are good.

I don't understand the arguments about why having strings in MessagePack would be a problem for languages like Erlang.
How would it be worse than how we encode strings in MessagePack at the moment, which is basically an ad-hoc non-portable mess which requires a schema or encoding guessing?

@moriyoshi

I'm somewhat convinced that adding a UTF-8 string type to msgpack wouldn't actually cause more problems: we have to handle strings with some care anyway if we send every string as a raw byte sequence, and as long as the goal of msgpack is to be a convenient, non-strict protocol, what actually determines interoperability is how the applications are implemented. That said, it sounds like the point is outside the scope of what msgpack is supposed to guarantee.

@kuenishi
Owner

@Midar No, Erlang has unsigned 8bit int, signed 32bit int, IEEE double float, bignums. See http://www.erlang.org/doc/apps/erts/erl_ext_dist.html

@frsyuki
Owner

I think this is a good opportunity to discuss further.

  1. How do you guys think about Time type? "doesn't handle date or datetime"
  2. Any thoughts on this proposal submitted by someone to IETF?: http://www.ietf.org/mail-archive/web/json/current/msg00003.html
  3. Any thoughts on this old article? "Updates on the MessagePack Project"
@najeira

I think a string type is convenient and useful.
But some languages do not have a string type.
Does anybody have ideas for those languages?

@frsyuki
Owner

I think adding user-defined custom type specs is a good idea.
If you have some specific format spec (plus guidelines for implementations), please propose it.

Actually, I also want to add user-defined custom types to the Ruby implementation to optionally support Time and string types, for cases which focus on transparent communication between Ruby programs rather than interoperable data exchange (as @kzk mentioned).

Problems will be:

  • spec: How to store strings efficiently?
  • spec: Should it support multiple encodings?
  • spec: Which encoding should it support?
  • guideline: How to handle format errors?
  • guideline: How to implement validation at the serializers?
  • guideline: How to keep backward compatibility?
  • guideline: How to keep forward compatibility?
@cabo

Wow, this is fun.

In October, without knowing that this github issue would be opened, I wrote a spec for a msgpack variant that solves the real problem that a large number of commenters here have, me included.

http://tools.ietf.org/html/draft-bormann-apparea-bpack

Enjoy.

(If you ever want to add custom types, don't fail to consider the recent problems the Rails people have had with type references in YAML. But this is a completely different issue from having a binary string, which just needs to be done.)

@frsyuki
Owner

I think it should preserve backward compatibility, as @nurse commented,
because there are already many working implementations. It should be possible to mix them even if some part of a system uses a custom user-defined type.

Therefore MessagePack should work without a string type. But for the convenience of Unicode users it may have an annotation expressing that a binary shall be treated as a string, set when the sender knows it should be. If the receiver understands the annotation, it can treat the data as a string; if not, it still works with binaries. Likewise, if a sender doesn't know the string type, a receiver may treat the data as a string or simply ignore the distinction. This also allows mixing MessagePack with and without String.

@muga

@Midar wrote:

This is so untrue. Actually, NOT having a string type is preventing this goal. How could having a string type prevent it?

When implementing an RPC protocol, the most important thing is its performance. A string type is never used when implementing an RPC protocol because of the performance penalty. But if MessagePack users want it, we should think about user-defined custom types.

+1 for @nurse, @frsyuki, @kzk

@cabo

Maybe I should add that we are looking at this space in the IETF. If we actually choose to standardize something, IETF will need change control, and change is likely. In other words, complete backwards compatibility is somewhat unlikely. I'd prefer to shoot for the best standard for a wider audience (i.e., not just focused at RPC, but for just about everything JSON is used for today). If that is not possible within the msgpack community, it seems that forking is the best choice. In any case, it is likely that the changes will be small enough that any single implementation will be able to do both msgpack and the new format. (And we certainly won't forget that msgpack is where the new format started.)

@chakrit

@frsyuki i think that still raises the same issue unless you specifically mark one user-defined type as "the string type" (effectively adding a spec-approved string type)

@kazuho

@Midar Won't using a BOM solve the problem? IMO encoders / decoders that need to mark data as unicode strings could use BOM to mark them.

@nurse

@kazuho Raw bytes may contain \xFE\xFF or something similar as their first bytes, and a Unicode string may contain ZWNBSP as its first character. In other words, a BOM requires consensus before communication.

@cabo

@kazuho No, you don't want to send a three-byte BOM with every single string. Those three bytes also do occur in binary data. (Also, using BOMs is incorrect usage in UTF-8 anyway.)

Fortunately, the world has converged on UTF-8, so we don't need any other string encodings. We just need a bit distinguishing binary byte strings and UTF-8 strings. http://tools.ietf.org/html/draft-bormann-apparea-bpack shows one way to do this. (Of course, there are infinitely many other ways — I stole this one from Eric Zhang's binaryjs project, which is already forking msgpack's format.)

@kazuho

@nurse @cabo Thank you for the responses. Sorry, I got confused about whether this was a problem of character encoding (when it was actually binary vs. text).

@nurse

FYI, BOM is useful only when the first byte of the data is already known; XML's first byte is always "<".

@kazuho

@nurse kind of off topic but I was thinking of 0xEF 0xBB 0xBF, which is (as @cabo mentioned) incorrect but IMO a convenient way to mark strings as UTF-8.
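The ambiguity @nurse and @cabo point out is easy to demonstrate; the byte values are facts, but the framing is purely illustrative:

```python
BOM = b"\xef\xbb\xbf"

# A string deliberately marked with a UTF-8 "BOM"...
marked_text = BOM + "東京".encode("utf-8")

# ...and arbitrary binary data that happens to begin with the same bytes.
binary_blob = b"\xef\xbb\xbf\x01\x02\x03"

# A receiver sees identical prefixes, so the BOM alone can't tell them apart.
assert marked_text[:3] == binary_blob[:3] == BOM
```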

@Midar

@kuenishi Hm, last time I looked at Erlang, it said everything is a bignum. It might be stored differently internally for performance reasons, but you never got to see that back then.

@frsyuki Thanks for pointing out BinaryPack, that looks like something I want to implement. Especially if there's a chance of it being specified by the IETF, that would be a huge win!

@cabo Thanks for writing up BinaryPack and submitting it to the IETF! Really appreciated!

@kazuho Well, it would work as a hack, but it would pretty much defeat the advantage of MsgPack over BSON or JSON of being short. Every key would essentially carry those 3 bytes.

@nurse

@chakrit

i think that still raises the same issue unless you specifically mark one user-defined type as "the string type" (effectively adding a spec-approved string type)

Personally, I think user-defined types should be disabled by default for a while and used only for the optional string type. That would take care of what you are worried about.

@rasky

@frsyuki >I think adding user-defined custom type specs is good idea.

If you have some specific format spec (plus guidelines for implementations), please propose it.

I don't understand; if we agree on one user-defined custom type spec for unicode, and we lobby all main bindings to implement it, wouldn't that be the same as adding it to the official spec?

This is my last take: please reconsider your position. I think it's a cultural thing here: Japanese people are against it, while others are in favor. Please think about it one more night, and remember that the IETF proposal already includes unicode strings, and if you Google "msgpack unicode" you will find endless reports of people having problems with this issue. It's a global issue with msgpack, widely acknowledged.

I'll wait for your answer tomorrow. Otherwise, I will see if we can draft a plan B with user-defined custom types.

@cabo

@rasky Indeed a cultural thing, most likely. The uptake of UTF-8 has been delayed by about a decade in Japan compared to the rest of the world. The fact that there is only one type of text strings left worth talking about, i.e., UTF-8, maybe hasn't sunk in Japan as much as it has here. So text strings may seem more exotic and varied to a Japanese developer than they are to most of us. This might also explain the (to me pretty much unfathomable) idea of treating text strings as a "custom type", when they are about as fundamental as it gets.

@Midar

@rasky I think no matter what the outcome of this is, it would be better to work on the IETF draft so we have a real standard. This would be even more of a standard than JSON. So if this fails, don't waste time on a plan B; an IETF-approved serialization format is the future :).

@cabo

Summarizing the discussion here so far, let's lay some of the other red herrings to rest:

  • "we might need UTF-16". No, you only need UTF-8. UTF-16 is used by JavaScript and Java internally, but there is never a good reason to send this over the wire.

  • "we might need NFC and NFD". No, you only need NFC. Some systems may be stuck with NFD in some of their components (OSX HFS+, duuh), but that needs to be fixed at the interface layers to those components. NFD has long been laid to rest. RFC 5198 is the basis for any meaningful interchange here.

  • "we might need Unicode validation or normalization". Well, if you need it, you need it, but there is no relationship to the question whether text strings are identified on the wire or not. Identifying bytes as text or binary on the wire isn't going to change that need at all.

  • "you can fix it in the IDL". Now, yes, that's the point! You'll have to fix it on one of the ends, in the receiver or in the sender. It is much better to fix it on the sender side than on the receiver side: This allows more information about the actual semantics to be present on the wire. The trend in most relevant languages is to have different types for binary strings (e.g., BINARY in Ruby) and text strings (UTF-8 in Ruby), so it is also the easiest thing to do. It is easier to make fish sticks out of an aquarium than the other way around.

  • "the receiver can try to validate the byte sequence as UTF-8 and treat it as a text string if that works and as a binary string otherwise". You've got to be kidding, no further comment.

  • "the receiver shouldn't have to care about whether the byte sequence is a text string or binary goop". Welcome to the implementation languages of the current era: They do care. So you can either reduce msgpack to a solution for IDL/schema-based applications (which is all what the opposers here seem to care about) or you can put information on the wire that allows one to use current implementation languages such as Objective-C, Python, Ruby, JavaScript, or Java.

@nurse

@cabo
It may be a fact that Japanese developers are careful about strings, but there are reasons.

First, most Japanese people are polytheists.

Second, Japanese long had pains around strings.
For example, NFC sometimes breaks information:

  • http://www.mediawiki.org/wiki/Unicode_normalization_considerations
  • http://blogs.adobe.com/CCJKType/2012/03/cjk-compatibility-ideographs.html

Moreover, consider file names. Imagine the following situation:

  • you make a file and carelessly name it with accented alphabets on OS X
  • you store the file to some storage like git, NFS, samba and so on
  • you send the filename by some means, normalized with NFC
  • you receive the message on Windows or Linux and try to open the file by that filename... you can't find it

HTML5 is also annoyed by such traps: for example https://www.w3.org/Bugs/Public/show_bug.cgi?id=14526 (the binarypack draft refers to Net-Unicode; RFC 5198 says strings SHOULD be normalized by NFC). There are many other problems like CJK Ambiguous Width, the Yen sign problem, the Wave dash problem, Emoji round-trip, Compatibility Ideographs, and so on. Some of them, especially CJK Ambiguous Width and the Yen sign problem, still annoy us, and personally I think they will never be resolved. If you are a masochist, study their details and you will greatly enjoy it.
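Two of the normalization hazards @nurse lists can be reproduced with Python's unicodedata module (a sketch; U+FA19 is one of the CJK compatibility ideographs discussed in the linked Adobe article):

```python
import unicodedata

# NFC folds the CJK compatibility ideograph U+FA19 into U+795E,
# silently losing which code point was originally stored.
assert unicodedata.normalize("NFC", "\ufa19") == "\u795e"

# OS X HFS+ stores decomposed (NFD) filenames: "é" becomes "e" + U+0301.
# Re-composing to NFC yields bytes that no longer match the stored name.
nfd_name = unicodedata.normalize("NFD", "caf\u00e9")
assert nfd_name == "cafe\u0301"
assert unicodedata.normalize("NFC", nfd_name) == "caf\u00e9"
```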

Third, this is not a Japanese thing: the people who commented here are simply familiar with binary data.
"Whether it is string or binary" is sometimes ambiguous. For example, are HTTP header values binary or strings? If you think they are strings, you'll fail to capture attack logs from crackers. If you think they are binary, you must walk the binary/string border like http://lucumr.pocoo.org/2010/5/25/wsgi-on-python-3/

If you have never been annoyed by Unicode and think Unicode is simple, that is a happy thing and you should thank your God.
But real Unicode is dirty and complex, because the real world is dirty and complex.
If Unicode were simple, why is the Unicode Standard so large? http://www.unicode.org/versions/Unicode6.2.0/
(Anyway, Unicode is a very good thing, because it is the character/writing/printing/internationalization framework where all people can collaborate.)

@cabo

@nurse: Unicode is complex because real world writing systems are complex, indeed. But none of this is taken away or added to by refusing to identify text strings as text strings on the wire.

If your logging system is spewing binary goop, please do identify it as binary goop when sending it over the wire.

It is interesting to see your examples of bugs caused by local software not taking care of properly converting local variants of NFD to NFC. Well, if something hurts, maybe stop doing it? Just send NFC always, and you won't have that particular problem any more. I'm sure the aberrant behavior of Firefox and Opera cited in that bug has been fixed since.

I'm fully aware that mistakes have been made in defining Unicode, and I feel your pain about the information loss in Unicode normalization (some of the interesting cases are not limited to Japanese characters; I agree the UTC has been a bit insensitive to the issues here). But again, msgpack doesn't really care how exactly you normalize your Unicode at the application layer, and that's why RFC 5198 cops out on that SHOULD — just go ahead and normalize a bit more carefully then, but stay close to what NFC is about (the C part more than the N part). The msgpack code should never normalize, that is an application layer issue. Again, you would have had the problem described in that bug independent of whether the filename was identified as a text string in msgpack or not.

I'm sorry about all this mess, but the world is moving to UTF-8 to get rid of a much, much larger mess, and you can get with the program or stay outside.

@methane

@cabo
+1 on "unicode text is sequence of unicode codepoint encoded in UTF-8".
+0 on "unicode text is shortest form UTF-8".
-1 on "unicode text is NFC normalized UTF-8".

JSON spec doesn't require NFC.
If spec requires NFC, all encoders and decoders become slow and big.
If your application needs NFC, please normalize in application layer. Not in wire format layer.

@methane

There are two suggestion for adding unicode type.

1. Use 0xc1 as a utf-8 hint.

Treat 0xc1 as a "the next raw is a UTF-8 encoded string" hint.
Languages that don't have a unicode type can simply skip the hint.

2. Use 0xc1 as a header for type hint.

[0xc1] [type code] [value]: the decoder may convert the value based on the type code.
If the decoder doesn't support the received type code, it returns the value as is.
Some type codes would be official: utf-8, and maybe posix time.
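A minimal sketch of proposal 1, assuming a hypothetical 0xc1 hint byte in front of an ordinary fixraw. This is not part of the actual msgpack spec, and the helper names are made up:

```python
def pack_hinted_utf8(s):
    """Pack a short str as [0xc1][fixraw] (hypothetical encoding)."""
    data = s.encode("utf-8")
    assert len(data) < 32  # fixraw holds up to 31 bytes: 0xa0 | length
    return b"\xc1" + bytes([0xa0 | len(data)]) + data

def unpack_maybe_hinted(buf):
    """Decode one value, treating 0xc1 as 'the next raw is UTF-8'."""
    if buf[0] == 0xc1:
        length = buf[1] & 0x1f
        return buf[2:2 + length].decode("utf-8")
    length = buf[0] & 0x1f
    return buf[1:1 + length]  # plain fixraw: return bytes untouched

# Hinted raws come back as text; unhinted raws stay binary.
assert unpack_maybe_hinted(pack_hinted_utf8("東京")) == "東京"
assert unpack_maybe_hinted(bytes([0xa0 | 3]) + b"abc") == b"abc"
```

A decoder for a language without a string type could simply skip the 0xc1 byte and fall through to the plain-raw branch, which is the backward-compatibility property the proposal is after.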

@methane

And I think we need a recommendation for an easy-to-use API for adding the hint in languages that don't have such types.

To interoperate with languages that don't have unicode strings, just adding a unicode string type to
msgpack may produce sad results.

For example, this Python code:

msg = msgpack.unpackb(packed_bytes)
msg['name'] = msg['name'].decode('utf8')

may become this:

msg = msgpack.unpackb(packed_bytes)
if isinstance(msg['name'], bytes):
    msg['name'] = msg['name'].decode('utf8')

I believe no one loves the latter code.

@kazuho

@cabo @rasky

To me it seems the different positions do not come from cultural differences but from how the MessagePack format is being used.

My understanding is that many (maybe most) of the developers opposing the idea of introducing a string type are using the format for sending / storing / analyzing logs, especially HTTP logs.

As some have pointed out, HTTP URIs do not specify the character set to be used. Instead it is encoded at the application level (as an example, google search queries look like "q=keyword&ie=utf-8").

Even if a string type gets introduced, they would still need to decode the information into a proper character encoding; in other words, introducing a string type does not ease their problem. OTOH, having two types that look alike (string vs binary) would make the problem more complicated, since it is hard for ordinary developers to determine which type should be used.

Considering the facts, I think opposing to the idea does have ground.

So a proposal that takes care of their concerns would make things easier IMO.

@kiyoto

@rasky

I am just a passive user of MessagePack, so I really take no stand in this debate, but one of your comments caught my attention:

Quoting you:

"This is my last take: please reconsider your position. I think it's a cultural thing here; Japanese people are against it, while others are in favor."

This comment, unless better explicated, is merit-less if not offensive (just in case you would erroneously surmise based on my name: I am Japanese American, not Japanese).

True, most people on this thread who are against adding a string type to MessagePack are Japanese (or of Japanese descent). The other attribute they share is that they are committers and/or long-time users of MessagePack and have mulled over this issue for the last several years. Fine, one can claim that they may have fallen prey to groupthink from working on MessagePack together for so long (not that I endorse this view). But I don't see how being Japanese figures into the current discussion.

I can certainly see your frustration, but calling it quits and imputing the opposition's view to this vague idea of "cultural difference" is disappointing. If you are tired of arguing with them, that's fine. But I hope you stick to a technically sound, civilized discourse.

@kazuho

When we look at MessagePack as a framework for building RPC protocols (the other common use case than handling the logs), the pros / cons of introducing a string type would vary depending on how the RPC protocol is designed.

For RPCs with predefined schema, using binary would be just fine, since the type information can be obtained from the schema. Introducing a string type might cause performance degradation in such cases (if the protocol spec. requires validation or canonicalization).

For RPC without predefined schema, introducing a string type (i.e. introspection that could distinguish string vs. binary) would be beneficial. IMO the "Objective-C" issue falls in this area.

Personally, I think introducing an optional flag indicating a "charset" (binary, utf-8, etc.) on the binary type would be a modest approach that all parties could agree on.

@cabo

@methane: Good point.

The current version of the binarypack spec does not mandate NFC.
It simply defers to RFC 5198, which is the relevant spec here.

In the end the specific form of Unicode to be used is an application
decision; my comment about NFC here is mostly about saying that
sending decomposed stuff (like what you get from OSX HFS+) is not OK,
not about requiring full normalization.

If your application needs NFC, please normalize in application layer. Not in wire format layer.

Exactly.

binarypack code itself should never be concerned with normalization;
the only discussion here can be about what should be expected from
application code shipping text using binarypack.

I believe that RFC 5198 made the right decision to say NFC SHOULD be
used on the net. In IETF parlance "SHOULD" is a "MUST" with
exceptions where the mandate doesn't make sense. While we are waiting
for the normalization bugs to be fixed (SC2/WG2/N4246 and all that), I
would indeed understand if applications weren't actually normalizing
down to NFC. But, again, this battle will not be fought in this spec.

In the next version of the spec, I'll add a short appendix explaining some of this.

Re the idea with the 0xc1 encoding: In JSON-related usage, most small
objects will be keys, and these will be text. So adding a byte to
each of these is suboptimal. There also is no advantage over the
solution in the current binarypack spec, as adding a hinting prefix
like 0xc1 will break backwards compatibility exactly as much as the
change done in binarypack. (By the way, I'd love to see some
probability distribution functions for raw-byte sizes in current
msgpack deployments. Has anyone collected something like this?)

@kazuho: Re Logs: If treating these as binary is important, I indeed
wouldn't mark them as text. So fluentd may not benefit from this
change; I agree. (I personally actually don't care about RPC that
much; I'm interested in Web usage of msgpack. But I don't see a
conflict between the two usages at all.)

@chakrit

So the opposers here who don't want a String type do not want to agree on UTF-8 because it is broken at least for Japanese, am I correct?

Then may I propose the next best alternative, which I think is to have a String type with an encoding specifier: a number referencing the old code page system.

Those who wish to roll their own custom string format, can still do so with the raw bytes encoding method just as before.

Let's say, make C1 the marker for the string type (not the vague definition of "user-defined type" which is too open to misinterpretation and bad for interoperability) and also have two bytes following C1 to denote the exact code page to use.

So it will be something like this:

C1 FD E9 00 04 FF FF FF FF

This would mean that this chunk is a String (C1) in Unicode (FD E9) with a length of four bytes (00 04), followed by the data (FF FF FF FF).

Another example might be:

C1 03 A4 00 04 FF FF FF FF

This would mean that this is also a String (C1) with Shift-JIS (03 A4 == codepage 932) and it has four bytes of length (00 04) with the data (FF FF FF FF).

As I said, if you think having a codepage is still not enough to represent all of hiragana/katakana/symbols/numbers, then you can still use the raw bytes method which you have now. But if anyone needs to use a String they can do so without having to implement Unicode, and if anyone wants to use Unicode, they can still do so by specifying the right codepage.

How does this sound?
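To make the proposed framing concrete, here is a minimal sketch of a decoder for such a tagged-string chunk (all names are hypothetical; the codepage table covers only the two examples above, 0xFDE9 == 65001 == "Unicode"/UTF-8 and 0x03A4 == 932 == Shift-JIS):

```python
import struct

# Hypothetical codepage table, matching the two examples in the proposal.
CODEPAGES = {0xFDE9: "utf-8", 0x03A4: "shift_jis"}

def decode_tagged_string(buf: bytes) -> str:
    # Proposed layout: C1 <codepage: 2 bytes BE> <length: 2 bytes BE> <data>
    marker, codepage, length = struct.unpack(">BHH", buf[:5])
    if marker != 0xC1:
        raise ValueError("not a tagged string chunk")
    return buf[5:5 + length].decode(CODEPAGES[codepage])

# C1, codepage 65001 (UTF-8), length 6, then the UTF-8 bytes of "東京"
chunk = bytes([0xC1, 0xFD, 0xE9, 0x00, 0x06]) + "東京".encode("utf-8")
print(decode_tagged_string(chunk))  # 東京
```

One visible cost of this scheme is the fixed 5-byte header per string, which comes up in the size discussion later in the thread.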


Reference:

@repeatedly

@rasky

Some developers use MessagePack to communicate across languages, e.g. via TCP, msgpack-rpc, zero-rpc, etc.
In this case, MessagePack is a data container and a binary type fits such situations (@kazuho already mentioned this point).
In addition, a msgpack-xxx developer is often a msgpack-rpc-xxx developer.

But when using MessagePack instead of JSON for internal data structures, developers who use a language with Unicode support may want a string type.
BinaryPack, and BSON too, seem aimed at these cases.

It is not a cultural thing. Probably, the difference comes from MessagePack usage and experience.

@najeira

Please do not break compatibility.

BinaryPack and the IETF draft have a compatibility problem with msgpack at FixRaw. They cannot load current msgpack-ed data that includes FixRaw of 16-31 bytes.

I think methane's suggestion "Use 0xc1 as a utf-8 hint" is a good idea for keeping compatibility.
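A rough sketch of how the "0xc1 as a utf-8 hint" idea could work at the decoder level (illustrative only; only FixRaw payloads are handled here, and the function name is made up):

```python
# Sketch: a 0xC1 byte immediately before a raw item marks its payload as
# UTF-8 text; without the hint, the raw item stays as bytes.
def read_item(buf: bytes, pos: int = 0):
    is_text = buf[pos] == 0xC1
    if is_text:
        pos += 1                                  # consume the hint byte
    head = buf[pos]
    assert 0xA0 <= head <= 0xBF, "only FixRaw handled in this sketch"
    n = head & 0x1F                               # FixRaw length
    payload = buf[pos + 1:pos + 1 + n]
    return payload.decode("utf-8") if is_text else payload

print(read_item(bytes([0xA2, 0x68, 0x69])))        # b'hi'  (plain raw)
print(read_item(bytes([0xC1, 0xA2, 0x68, 0x69])))  # hi  (hinted -> str)
```

As discussed later in the thread, this keeps old data readable by new decoders, while old decoders would reject the reserved 0xc1 byte in new data.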

@6502

The main problem seems to me that there is no such thing as "a string" for computers (yet). There are UTF-8 strings, ASCII strings, ISO-8859-1 strings, UTF-16 strings with explicit or implicit byte order, and so on...

Just having binary dumps in an unspecified format, however, is not a solution; it simply means closing your eyes and hoping the problem doesn't exist. That in my opinion is rarely a good idea.

Encoding hell is terrible and we waste quite a big number of hours fighting with it (as probably does anyone dealing with software used in different countries). I've seen things that you wouldn't believe... like virtual machines crashing because they were handed a USB drive with a badly encoded filename (not a program crashing or even a BSOD... the actual virtual machine crashed... the equivalent of a physical computer exploding).

A protocol that just says there is a binary chunk that may possibly be a string is simply less powerful and requires a schema.

Adding an encoding specifier would be a politically correct solution, but heavy on everyone, because everyone would have to support multiple encodings; it also leaves the door open for subtle incompatibilities.

Just specifying a single universal encoding for the protocol seems to me the best technical solution because it minimizes the impact, and I add that UTF-8 would IMO be the best candidate. If your language doesn't handle all Unicode points, you can decide what the best compromise is and only implement that.

@rasky

@kiyoto > This comment, unless better explicated, is merit-less if not offensive (just in case you would erroneously surmise based on my name: I am Japanese American, not Japanese).

There was absolutely no intention to offend anyone. I'm deeply respectful of different cultures (including the Japanese one). But I don't think programming design is a lingua franca all over the world, and it might well be influenced by cultural / national diversities. I would not be surprised if most Japanese agreed on a design principle that Westerners find debatable. For instance, the fact that UTF-8 penetration has been much slower in Japan than in Europe might very well influence any decision on this issue.

And I'm not frustrated. I've been working with open source far too long to get frustrated by a maintainer I don't agree with. I just wanted to make this thread converge one way or another, and to get a final pronouncement from the main maintainer, so that I can decide how I can proceed to implement my software.

@rasky

@kazuho are you suggesting that, since there is one use case (storing logs, maybe HTTP logs) that doesn't require the string format, then it should not be added to MsgPack? Storing HTTP log messages probably doesn't require the float type either, but it's part of MsgPack. And I don't see the documentation mentioning that MsgPack's main usage and focus is storing HTTP logs. So I don't see a point in your reasoning.

There are probably many use cases that don't require a string type, and many others that do. Since strings are so ubiquitous in programming languages, we argue that it's important to add a string type for those use cases that need it, instead of asking the programmer to go through layers of workarounds.

@kazuho

@rasky

are you suggesting that, since there is one use case (storing logs, maybe HTTP logs) that doesn't require the string format, then it should not be added to MsgPack? Storing HTTP log messages probably doesn't require the float type either, but it's part of MsgPack. And I don't see the documentation mentioning that MsgPack's main usage and focus is storing HTTP logs. So I don't see a point in your reasoning.

No. I said that one of the major use cases of MessagePack is storing logs, and within such a use case problems may arise by introducing a string type.

If you are interested in reaching an agreement to introduce a string type (I assume you are), you had better try to understand how others use the format and what their concerns are, instead of saying things like "how could I have known it, it's not in the documentation."

@rasky

@kazuho I still fail to see your point. Are you saying that adding a string type can create confusion for users who 1) don't know the difference between strings and bytes very well, and 2) store HTTP logs?

I think that, if anything, this is a support/documentation issue, not a design issue. If storing HTTP logs is such an important use case for MsgPack, then by all means add a documentation page about it, and you can mention that it makes sense to store them as bytes and not strings for multiple reasons.

@kazuho

@rasky

I still fail to see your point. Are you saying that adding a string type can create confusion for users who 1) don't know the difference between strings and bytes very well, and 2) store HTTP logs?

Yes. Developers often erroneously handle bytes as strings, or vice versa. It's a common mistake. And the sad fact is that such errors cannot be caught during development, but only later, when some software detects an invalid character sequence. Not introducing a string type is a cautious approach to the problem (I understand that it may seem too cautious to some).

The other problem is that the use of a string type would be slower than using a binary type, because a string stream needs to be validated; as @frsyuki pointed out, "successfully stored data must be read successfully."

Others and I have pointed out that such validation is an unnecessary burden for RPC wire formats with an external schema (which is also a common use case of MessagePack).

BTW, my understanding is that some (or many, if not most) of the people here seem to think that @nurse 's proposal is a good idea (the approach would not cause the two problems I pointed out): to introduce a hint that marks a binary type as a string (well, at the API level; binary notations could be different).

@methane has proposed using 0xC1 as the hint, which @cabo considers too much of an overhead in terms of memory consumption.

@rasky What do you think about the proposal? Do you think introducing a string type as a subtype of binary is a good idea? If yes, what do you think about the proposed encoding?

@chakrit

Developers often erroneously handle strings as bytes, or vice versa. It's a common mistake.

May I ask, what kind of developers? That is definitely not the case where I come from.

Plus, wouldn't adding a String type help with this case instead of hurting? I think my experience is exactly the opposite on this point.

Also, may I ask what your main development platform / environment is? This way I can understand your motivation better.

The other problem is that the use of a string type would be slower than using a binary type, because a string stream needs to be validated; as @frsyuki pointed out, "successfully stored data must be read successfully."

Do you imply that by not having a String type, and storing Strings as raw bytes then you do not need to validate it and it will be much faster?

A string must be validated whether it is transported as binary or as a string, and regardless of whether we have a string marker or not.

No offense, but I do not think that this is even a valid argument to start with, as the same argument has been made numerous times in other comments.

Data must be validated whether it is a string or binary, but by not having a String type msgpack has explicitly made this much, much harder.

Others and I have pointed out that such validation is unnecessary for RPC wire formats with an external schema.

I think the point was that it is impossible to come up with a good schema that handles strings without resorting to adding custom extensions to the msgpack specification, because you do not provide a sane way to differentiate between strings and raw bytes. Thus every time we need strings we need to "hack" the spec to meet our needs, which goes against interoperability.

Now every piece of software that wants to use msgpack will have its own separate way to transport strings, and thus the application developer will need to take care of the translation between these different pieces of software. If there were a string type and an indicated marker to process it, this could have been done at the protocol-handler level, solving the problem once and for all for everyone instead of again and again for each implementer.


@6502

Just specifying a single universal encoding for the protocol to me seems the best technical solution because minimizes the impact and I add that utf8 would me IMO the best candidate. If your language doesn't handle all unicode points you can decide what is the best compromise and only implement that.

I agree completely with this. But it seems the opposers here have a lot of problems with Unicode/UTF-8, or simply are not in an environment where it is in widespread use. Wonders.

@DestyNova

@kazuho

Developers often erroneously handle bytes as strings, or vice versa. It's a common mistake. And the sad fact is that such errors cannot be caught during development, but only later, when some software detects an invalid character sequence. Not introducing a string type is a cautious approach to the problem (I understand that it may seem too cautious to some).

Are you saying that because some MessagePack users get confused between strings and bytes, MessagePack should not have a string type at all?
That seems extreme, and frankly a little selfish.

Do you think people already make this type of programming error sometimes? If so, what can really be lost by adding a string type? What type of bugs do you think will become common that are not common now?
Certainly, many people have something to gain, since sending strings is a common use case (i.e. JSON-like usage). I really think the vast, vast majority of MessagePack users would benefit from it, and a small number of users (e.g. sending HTTP logs or RPC stuff) would be in (at least) the same position they are now.

@kazuho

@chakrit

Developers often erroneously handle strings as bytes, or vice versa. It's a common mistake.

May I ask, what kind of developers? That is definitely not the case where I come from.

A classical mistake in this area is handling binary data as ASCIIZ string (like http://www.informit.com/articles/article.aspx?p=430402&seqNum=3, for example).

Another example would be data encoded twice or not encoded at all, where encoding it once would be correct.

IMO the problem is mostly due to the lack of types in some interpreted programming languages (or how people fail to type the data correctly), but we are living in a world where those languages are often used.

Plus, wouldn't adding a String type help with this case instead of hurting? I think my experience is exactly the opposite on this point.

Yes in general, but sometimes no.

In a large-scale asynchronous messaging system, corrupt data sneaks in no matter how clearly the protocol is defined.

Consider e-mail. Sometimes you receive mails containing a corrupt character sequence. Should the mail be discarded by the MUA? Or should an error be raised? But what is the reliable way to raise the error, and what is the guaranteed way to notify the sender that the message could not be handled?

In the case of e-mail, the general approach is to display the mail as much as possible, replacing invalid sequences with substitution characters. Many MUAs also provide a method to look at the raw data.

This is an example of a large scale messaging system with character encoding support, and how it deals with corrupt data.

When designing an asynchronous messaging middleware, there are two approaches to tackling the problem of corrupt data. One is to add an API to access the raw data. The other is to not deal with the problem within the middleware, and let the application layer handle it. Substitution, as I explained in the e-mail example, is not a good way for messages that are handled by machines.

The other problem is that the use of a string type would be slower than using a binary type, because a string stream needs to be validated; as @frsyuki pointed out, "successfully stored data must be read successfully."
Do you imply that by not having a String type, and storing strings as raw bytes, you do not need to validate them and it will be much faster?

No. What I mean is that the current MessagePack spec lets the developer choose whether or not to validate the data. By introducing a string type, we might need to add validation to some implementations of the protocol, which would be an enforced overhead (or else, we would start seeing many corrupt string streams).

Anyways, please let me make my position clear. I am not against introducing a string type. I am saying that by taking such concerns into consideration it would be easier to reach the goal of introducing a string type; and as I mentioned before, there is already a proposal by @nurse and @methane that covers such concerns. (cc: @DestyNova)

@frsyuki
Owner

@najeira says:

BinaryPack and the IETF draft have a compatibility problem with msgpack at FixRaw. They cannot load current msgpack-ed data that includes FixRaw of 16-31 bytes.

I agree. Backward compatibility is essential.

  • I don't agree with string type implementations which change the current format spec
  • @nurse suggested a way to keep backward compatibility.
  • @methane suggested a spec plan
  • BinaryPack (by @cabo) is NOT compatible with MessagePack
    • BinaryPack assigns 0xb0-0xbf to the string type. This region is currently assigned to the Raw format. It means data stored/sent by msgpack can't be read/received by the BinaryPack library, and data stored/sent by BinaryPack can't be read/received by the msgpack library.
    • I don't agree to changing (as opposed to adding to) the data format while a counterproposal is on offer
@frsyuki
Owner

@kazuho says:

I am not against introducing a string type. I am saying that by taking such concerns into consideration it would be easier to reach the goal of introducing a string type; and as I mentioned before, there is already a proposal by @nurse and @methane that covers such concerns. (cc: @DestyNova)

It's the same with me. My opinion is:

  • It's ok to add a string type as an optional feature, but it should be optional
    • I don't agree to adding a big change to the spec without confirming its benefits and traps in industry using real implementations

The optional feature should take care of backward compatibility, such as the one suggested by @methane.

@frsyuki
Owner

@kazuho says:

Developers often erroneously handle bytes as strings, or vice versa. It's a common mistake. And the sad fact is that such errors cannot be caught during development, but only later, when some software detects an invalid character sequence. Not introducing a string type is a cautious approach to the problem (I understand that it may seem too cautious to some).

I almost agree. I don't think it's a mistake. For example, msgpack for C++ doesn't have a standard way to tell strings from binaries. And I do not care whether the data is a string or binary in C++ programs. (Is this field a string? Or binary? I don't care. It's just a std::string.)

IMO the problem is mostly due to the lack of types in some interpreted programming languages (and how they are used), but we are living in a world where those languages are often used.

I agree. PHP.

@cabo

BinaryPack (by @cabo) is NOT compatible with MessagePack

No proposal is "compatible" with msgpack 1.0.

There are two directions of compatibility:

  • Old data to new servers (forward compatibility) and
  • new data to old servers (backward compatibility).

Both are a good thing, of course.

There is no way to have backward compatibility; msgpack just wasn't designed for that.
So, given that, why do you think that forward compatibility is so important that it trumps all other design considerations?

I tried to make clear why I think that 0xc1 prefixing is a non-starter.

Here is one simple benchmark, a slightly simplified piece of JSON out of draft-jennings-senml-10.txt:

{"e":[{"n":"80063","v":23}]}

This is 28 bytes in JSON. 16 bytes in binarypack (or in msgpack 1.0, with strings as raw bytes).

  • binarypack => "81 b1 65 91 82 b1 6e b5 38 30 30 36 33 b1 76 17"
  • msgpack 1.0 => "81 a1 65 91 82 a1 6e a5 38 30 30 36 33 a1 76 17"

With the simple hint version of 0xc1, 20 bytes.

  • 0xc1 hint => "81 C1 a1 65 91 82 C1 a1 6e C1 a5 38 30 30 36 33 C1 a1 76 17"

With the tagging proposal, 32 bytes, more than JSON.

  • 0xc1 tagged => "81 C1 FD E9 00 01 65 91 82 C1 FD E9 00 01 6e C1 FD E9 00 05 38 30 30 36 33 C1 FD E9 00 01 76 17"

(I have no idea why I should be sending lots of bytes that reiterate for each single string what everyone already knows: Text is encoded in UTF-8. I also don't think there should be a hard limit of 64KiB for strings.)
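The 16-byte msgpack 1.0 figure above can be reproduced with a tiny hand-rolled packer. This is a sketch covering only the format codes this particular example needs (fixmap, fixarray, FixRaw, positive fixint), not a full codec:

```python
# Minimal msgpack 1.0 encoder for the benchmark value above.
def pack(obj) -> bytes:
    if isinstance(obj, dict):
        out = bytes([0x80 | len(obj)])             # fixmap, <= 15 entries
        for k, v in obj.items():
            out += pack(k) + pack(v)
        return out
    if isinstance(obj, list):
        return bytes([0x90 | len(obj)]) + b"".join(pack(x) for x in obj)
    if isinstance(obj, str):
        raw = obj.encode("utf-8")
        return bytes([0xA0 | len(raw)]) + raw       # FixRaw, <= 31 bytes
    if isinstance(obj, int) and 0 <= obj < 128:
        return bytes([obj])                         # positive fixint
    raise TypeError(obj)

data = pack({"e": [{"n": "80063", "v": 23}]})
print(len(data), data.hex(" "))
# 16 81 a1 65 91 82 a1 6e a5 38 30 30 36 33 a1 76 17
```

Prefixing each of the four strings with a 0xc1 hint byte adds 4 bytes, giving the 20-byte figure quoted above.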

There are some other things that people have asked for binarypack.
One thing is being able to send large strings in a chunked form.

Whatever we do, we most likely won't have forward compatibility in a
2.0, whether that is still called msgpack or needs to be called
binarypack.

@frsyuki
Owner

@cabo says:

Unicode is complex because real world writing systems are complex

the world is moving to UTF-8 to get rid of a much, much larger mess, and you can get with the program or stay outside.

I agree.

But none of this is taken away or added to by refusing to identify text strings as text strings on the wire.

I don't agree. A data schema should stay consistent even though the real world is complex and changing; it is applications that should change. My opinion is that applications should handle string types.

@DestyNova says:

Are you saying that because some MessagePack users get confused between strings and bytes, Messagepack should not have a string type at all?

I think applications are always confused about strings, and between strings and bytes. Data should not have a string type at all, to stay loosely coupled from applications. I don't want data to include marks which distinguish strings from binaries. Thus MessagePack (which is a data representation format) should not have a string type by default.

Please don't get me wrong. I don't say applications should not deal with strings. Applications can project msgpack's Raw type into a string type. I meant applications should always project Raw types into string types to keep the data loosely coupled from applications, even though it's not convenient.

@jodastephen

For me, the key point the authors of MsgPack need to accept is that the claim "It's like JSON. but fast and small" is misleading and inaccurate. JSON supports "string", see the specification, but MsgPack does not. This makes the claim nonsense. Please change the web page now!

Beyond that, I hope the authors understand that many, if not most, people disagree with them. I saw MsgPack years ago, but because it did not have a string type, I put it in the "stupid project" pile. I'm sure many others have done the same.

Strings are a fundamental datatype, and UTF-8 is the well-defined unique way to store/send them. From the looks of it, we need a new spec, inspired by MsgPack, and hopefully ratified by the IETF.

@frsyuki
Owner

@cabo sorry, I couldn't understand what you meant.

binarypack => "81 **b1** 65 91 82 b1 6e b5 38 30 30 36 33 b1 76 17"
msgpack 1.0 => "81 **a1** 65 91 82 a1 6e a5 38 30 30 36 33 a1 76 17"

Old programs can't read binarypack data, right? You mean binarypack is not backward compatible.

Then, what happens with this data? (16 bytes Raw in msgpack, 16 bytes string in binarypack):

{"a"=>"aaaaaaaaaaaaaaaa"}

binarypack => "81 b1 61 d8 00 10 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61"
msgpack    => "81 a1 61 b0 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61 61"

New programs can't read msgpack data. BinaryPack is not forward compatible either.
How wrong is this?

@kuenishi
Owner

Guys. Be happy. Here's a reference implementation. Simple is best.

https://github.com/msgpack/msgpack-erlang/blob/string/src/msgpack.erl#L226

@kuenishi
Owner

And, I don't like proposing an RFC to the IETF. Totally useless and a waste of time. MessagePack is NOT a network protocol. MessagePack-RPC IS the network protocol. Is anyone using that and passionate about writing a reference implementation of servers and clients in multiple languages?

@repeatedly

@jodastephen

JSON supports "string", see the specification

Yes. And what do you think about the handling of broken strings?
AFAIK, the JSON spec doesn't define such things.
How do you handle the parsing error and recover a broken string?

Collecting the use cases is the needed next step.

@cabo

@frsyuki: sorry for having been too terse.

I meant that 0xc1 is forward-, but not backward compatible. So we have broken compatibility already (msgpack does not have extension points).

Binarypack -00 is indeed neither backward nor forward compatible. I think that is not much worse than just being forward compatible. I also believe there is potential for putting in some more "protocols 101" things, e.g. controlled extension points, that will make any attempt to maintain forward compatibility futile. I'd rather do this right once than piecemeal.

It may be a bit disappointing that evolution of msgpack might involve breaking forward compatibility. Believe me, I've been through this kind of disappointment often enough throughout a third of a century of protocol design; each time I thought I had done it right.

(@kuenishi: Many of us want to use msgpack outside of IDL-style RPC. Some of us don't even believe in RPC as a concept.)
(@repeatedly: A JSON document is by definition UTF-8 (or -16 or -32, which IETF is about to fix). So it can't be "broken UTF-8" by definition. Handling of broken JSON is an interesting thing by itself, of course.)

@kuenishi
Owner

@cabo So the IETF is the wrong place to promote msgpack, isn't it? It's originally for network things. You should go to ISO or ITU, I think.

@cabo

Oh, and I'm not at all wedded to the idea of stealing the 16 code points for short strings from short byte sequences.
It's just what Eric Zhang had done, and it sounded reasonable enough. Unfortunately, msgpack 1.0 does not have enough code points left to just add a reasonable short string encoding, and even if it had, this would consume too many free code points, leaving little possibility for future extension. (If I hadn't had Eric's implementation already, I'd steal from the fixnums instead — 0..127 is probably more than what is needed — and I'd leave at least 32 more code points reserved.)

(@kuenishi: ISO and ITU are irrelevant here. W3C might be a venue, but I doubt they are interested. Note that JSON is an IETF document, so there is precedent. BTW, I don't want to "promote msgpack" either, I want to solve a problem, and msgpack seemed to be 95 % of the solution. IETF would need to have change control (and thus the power to break compatibility) in any case, if they want to pick it up.)

@kuenishi
Owner

Moreover, we don't need such standardization and have NO TIME to invest in that. Who's gonna be happy with standardization? De facto is enough.

@cabo what problem are you solving? This storm of flaming? Strings? Then going to the IETF will definitely solve nothing. Just the place changes. IETF WGs are a place for people who want to interconnect each other's systems. I have no idea why the JSON people are doing standardisation.

@frsyuki
Owner

@cabo > (msgpack does not have extension points).

I agree. I think it's better to have extension points.
I think @methane suggested this. I mean, adding extension points is not compatible, but it is easy to implement. Once there are extension points, adding a string type is compatible.

@cabo: What do you think of this format? I think this is still better than the current BinaryPack-00 in terms of both size and compatibility:

0xa0-0xaf FixString (0-15 bytes raw type with hint)   // changed
0xb0-0xbf FixRaw    (16-31 bytes raw type)

0xd6 string 8       (16-255 bytes raw type with hint)  // new
0xd7 string 16
0xd8 string 32

0xd9 raw 8          (0-15 bytes, 32-255 bytes byte array)  // new
0xda raw 16
0xdb raw 32

The point is that FixRaw can't store byte arrays of 0-15 bytes; those use raw 16 or raw 8. This is compatible with current msgpack (I used verbosity left in the msgpack format).
If a user doesn't want to tell strings from binaries, they can handle all strings as binaries.

New implementations can read old data. They can't tell strings from binaries (because the data doesn't include that information), but that's the same as the current situation. So it's forward compatible.
As long as the new implementations don't use raw 8 or the string types (= the new features), it's backward compatible as well, because old implementations just assume 0xa0-0xaf is a byte array. But strictly speaking, this is not backward compatible.

Regarding efficiency, an assumption here is that byte arrays are usually longer than 15 bytes while many strings are shorter than 16 bytes. On the other hand, raw 8 will reduce size:

raws 0-15 bytes: +1 byte
raws 32-255 bytes: -1 byte

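The header-selection rule of this proposal, and the size deltas it implies, can be sketched as follows (code points from the proposal above; the raw 16/raw 32 and string 16/string 32 branches are omitted for brevity):

```python
# Sketch of header selection under the proposed layout.
# msgpack 1.0 for comparison: FixRaw (1-byte header) covers 0-31 bytes,
# raw 16 (3-byte header) covers anything longer.
def raw_header(n: int) -> bytes:
    if 16 <= n <= 31:
        return bytes([0xB0 | (n - 16)])  # FixRaw: 1-byte header
    if n <= 255:
        return bytes([0xD9, n])          # raw 8: 2-byte header
    raise NotImplementedError("raw 16 / raw 32 omitted from this sketch")

def string_header(n: int) -> bytes:
    if n <= 15:
        return bytes([0xA0 | n])         # FixString: 1-byte header
    if n <= 255:
        return bytes([0xD6, n])          # string 8: 2-byte header
    raise NotImplementedError("string 16 / string 32 omitted from this sketch")

print(len(raw_header(8)))    # 2: +1 byte vs. msgpack 1.0 FixRaw for 0-15 byte raws
print(len(raw_header(100)))  # 2: -1 byte vs. msgpack 1.0 raw 16 for 32-255 byte raws
```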
@kuenishi
Owner

@frsyuki I think FixString for small strings won't improve performance much, because Unicode strings tend to be longer than FixRaw and processing Unicode is heavy. I think 4 bytes with a 32-bit length just after C1 is enough.

@cabo

@frsyuki this is a bit weird, but indeed workable. I think adding string8/raw8 is a great idea as the range of the short string/raw is diminished by the split. Having a longer representation for raw 0-15 than for 16-31 may raise cognitive dissonance, though.

If we could generate a probability distribution function for sizes from a real live system, this could be used to dispel such criticism. Do you know any system we could tap here? We really just need a histogram of raw sizes.

@repeatedly

Hmm... I implemented the prototype of frsyuki's proposal in D.

https://github.com/msgpack/msgpack-d/tree/string-support

If we need the test of size or speed, I will try it.
But I think other approaches are still alive. So we need more discussion about the better format.

@frsyuki
Owner

@kuenishi I assumed most "short" strings are ASCII. The size of characters in the ASCII range is the same in UTF-8.

@cabo I thought it's beautiful:-) Cognitive dissonance is still better than forward incompatibility. There're already many MessagePack implementations.

@frsyuki
Owner

I just said the format I suggested should be better than BinaryPack-00. Discussion is still open.

@Midar

@kazuho

A classical mistake in this area is handling binary data as ASCIIZ string (like http://www.informit.com/articles/article.aspx?p=430402&seqNum=3, for example).

What you linked is a description of a buffer overflow. That has nothing, absolutely nothing, to do with strings vs. binary.

@frsyuki

(Is this field a string? or binary? I don't care. It's just a std::string)

Thank you so much for this! With that, you demonstrated how much we need a string type, because you just created something broken! std::string goes horribly wrong if there's a \0 in it! Correct would be a std::vector - which is binary. So, you see, even C++ does make a significant difference between string and binary - it's a completely different type!

@cabo
I started an implementation of BinaryPack, my decoding is already done. Do you have some test files that test all possible types? That would be really helpful.

@frsyuki
Owner

@Midar your understanding of std::string is wrong. std::string stores a sequence of bytes and its length.

@Midar

@frsyuki Actually, yours is. std::string is meant for strings that are compatible with C strings and are often just passed to functions that take C strings. If a 0-byte is inside and it's passed to a function taking a C string, things go horribly wrong. This has already caused a lot of problems in the wild and led to remote code execution, etc.! std::string is not meant for binary. It's meant for strings. If you actually use std::string to store arbitrary binary, not only are you a bad programmer, but you also don't care about security in the least, as this almost always goes wrong at some point.

@Midar

@nobu-k Please read section 17.3.2.1.3.1 of ISO/IEC 14882:2003. Also, please read my rationale why this is a very bad idea even if your std::string implementation accepts \0. It has caused serious disasters in the wild already!

@kazuho

@Midar

A classical mistake in this area is handling binary data as ASCIIZ string (like http://www.informit.com/articles/article.aspx?p=430402&seqNum=3, for example).
What you linked is a description of a buffer overflow. That has nothing, absolutely nothing, to do with strings vs. binary.

Do you really think "buffer overflow" is the only problem you'd face when binary data is treated as an ASCIIZ string? Don't you know of other problems that arise under such mishandling? Frankly speaking, I wonder if you are even trying to understand what others are saying.

@Midar

@kazuho The link you gave talked about gets(), not about ASCIIZ.

I guess what you mean is using string functions like strcmp instead of memcmp? And how would NOT having a string type help with that? But this is not what your link talks about!

Frankly speaking, I wonder if you are even trying to understand what others are saying.

That's exactly what I'm thinking about you…

@frsyuki
Owner

@Midar I think you're mixing this up with the problems of C strings.
I meant that C++ doesn't have appropriate standard classes to tell strings from binaries. This is a fact. The question is not how horrible std::string is, or whether adding a string type to the C++ standard would be better.

Please understand that some languages in fact don't have distinct string or binary types. And:

Developers often erroneously handle bytes as strings, or vice versa. It's a common mistake. And the sad fact is that such errors cannot be caught during development, but only later, when some software detects an invalid character sequence. Not introducing a string type is a cautious approach to the problem (I understand that it may seem too cautious to some).

@kazuho

@Midar

I guess what you mean is using string functions like strcmp instead of memcmp? And how would NOT having a string type help with that? But this is not what your link talks about!

Misuse of strcmp or memcmp against binary data does not cause "buffer overflow" vulnerabilities (note that "buffer overflow" generally only refers to write overflows; anyways, it's part of the problem caused by the mishandling of the two formats). It's the misuse of the copy functions that causes "buffer overrun" vulnerabilities.

To me it seems that you are too excited. Please relax and try to understand what others say.

@nobu-k

@Midar As @frsyuki mentioned, you're confused by the spec. It's talking about null-terminated byte strings, not std::string. I understand what you're saying and know that it's sometimes very dangerous to handle std::string containing '\0'.

@Midar

@frsyuki
Yes, theoretically, you can store \0 in std::string. But this is a very bad idea, as in C++, NTBS is often used, and thus std::string is very often converted to NTBS, at which point things fail horribly.

@kazuho

Misuse of strcmp or memcmp against binary data does not cause "buffer overflow" vulnerabilities (the latter may cause "buffer overrun"; note that the difference between the two in my context is that one is a read problem and the other is a write problem; anyway, it's part of the problem caused by the mishandling of the two formats). It's the misuse of the copy functions that causes "buffer overrun" vulnerabilities.

I never said that, but your link was talking about buffer overflows. And you were talking about ASCIIZ, and I assume you meant things like strcmp vs. memcmp.

To me it seems that you are too excited. Please relax and try to understand what others say.

For fk's sake, can you *please* read your own link? Please?

Is really nobody reading what he writes?! I'm really tired of talking to a wall :(.

@nobu-k
Yes, theoretically, you can. But the point is that you shouldn't, as it's too dangerous. NTBS is used in so many places.

@kazuho

@Midar

Whatever you say, a "buffer overflow" (or "buffer overrun") vulnerability is only the tip of the iceberg of bugs that occur when binary data is erroneously handled as a string. There are non-vulnerability bugs such as partial copies caused by the mishandling.

And I have pointed out other problems as well, such as double encoding and no encoding, not to mention the performance problem with RPC with a schema.

Please do try to understand what others are saying instead of trying to overwhelm them. Thank you.

@methane

@Midar Why should we discuss ASCIIZ?
AFAIK, NUL is a valid Unicode character.
Do you propose prohibiting the NUL character in the string type?

@Midar

@methane Why do you ask me that? @kazuho brought it up.
Anyway, Unicode does forbid 0, so this would even solve @kazuho's problem. So I don't get why he's against string…

@methane
>>> u"\u0000".encode('utf-8')
'\x00'

Then, let's stop discussing ASCIIZ.

@Midar

@methane Oh, nice find! So it seems Python does not care whether it's valid Unicode, only whether it's valid UTF-8.

@DestyNova

@kazuho

Whatever you say, a "buffer overflow" (or "buffer overrun") vulnerability is only the tip of the iceberg of bugs that occur when binary data is erroneously handled as a string. There are non-vulnerability bugs such as partial copies caused by the mishandling.

Having no string type in MessagePack will not improve this situation.
In fact it will make things worse, as users will be forced to perform more manual conversion between bytes and strings when they just wanted to transmit strings, and these manual conversions will sometimes be wrong. Don't you agree?

@Midar

Just for the record, in case someone is interested, I implemented BinaryPack (only reading so far):
https://github.com/Midar/objfw/blob/master/src/OFDataArray%2BBinaryPackValue.m

Here's a test file I wrote which utilizes all types in case someone else wants to check his implementation:
https://webkeks.org/test.binarypack

After parsing, it should look something like this:

[2013-02-21 17:07:30.452 t(75500)] {
    array32 = (
        a,
        b,
        c,
        d
    );
    tiny = 15;
    shortarray = (
        xyz
    );
    table16 = {
        w = d;
        y = v;
        x = a;
        z = c;
    };
    array16 = (
        0,
        1,
        2,
        3
    );
    double = 5.75;
    uint16 = 260;
    stiny = -30;
    bin16 = <01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef>;
    uint32 = 305419896;
    int8 = -1;
    uint64 = 1311768467463790320;
    tinybin = <01 23 45 67 89 ab cd ef>;
    string16 = Hallo Unicode wörld!;
    true = YES;
    float = 5.53125;
    bin32 = <01 23 45 67 89 ab cd ef 01 23 45 67 89 ab cd ef>;
    string32 = ユニコードこんにちは;
    false = NO;
    int16 = -1;
    nil = <null>;
    int32 = -1;
    uint8 = 64;
    int64 = -1;
    tinytable = {
        b = 1;
        c = 2;
        a = 0;
    };
}

Please note that the order of tables is not kept due to randomization for security reasons.

@kazuho

@methane I have not checked the Unicode spec. recently but AFAIK ECMA262 has always permitted NUL within strings and that's how the web browsers have been implemented (see http://labs.cybozu.co.jp/blog/kazuho/archives/2006/11/js_string_literal2.php).

@DestyNova
I pointed out the mishandling of ASCIIZ and binary data just to show an example of how binary data and strings are misused.
And I agree that introducing a string type is a good idea (please see my previous comments). I am just pointing out what the concerns are (by the opponents), and the fact that a proposal has already been made that takes care of the worries.

I hope we can reach a conclusion to introduce some kind of a string type even if some of us do not understand others' problems.

@DestyNova

@kazuho
Apologies, I misinterpreted your discussion of these possible problems as an indicator that you were against the idea. This is probably because so many people seemed to start this discussion by immediately dismissing the whole question of strings as "application layer responsibility", which is honestly quite frustrating!
You are right that we should try to understand the issue and current proposals properly.

@methane

As a user of msgpack and the maintainer of msgpack-python, I think we shouldn't change the msgpack spec dramatically.
I believe all of the core msgpack maintainers agree on that. Changing msgpack to binarypack is not an option. OK?

I understand that in some languages, including Python, Obj-C, Ruby, and Perl, it sometimes matters to preserve the unicode string type (or 'utf-8' encoding information).
I also understand that some languages, including PHP (the terrible world dominator), treat strings
as just encoded bytes.

So, I propose again about adding UTF-8 "hint" to raw bytes.

A \xC1 byte preceding a raw means the raw is a UTF-8 encoded string.
Msgpack unpackers may just skip this byte, or use it to recover the unicode type (or encoding information).

Any thoughts on this proposal?
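To make the proposal concrete, here is a minimal Python sketch of how an unpacker might treat the proposed \xC1 hint. This is hypothetical code, not msgpack API: the function name is made up, and only FixRaw (0xa0-0xbf) is handled to keep the sketch short.

```python
# Hypothetical sketch of the 0xC1-hint proposal: the byte 0xC1
# (currently unused in msgpack) before a raw marks it as UTF-8 text.
# Only FixRaw (0xa0-0xbf) is handled here to keep the sketch short.

def unpack_raw(data, decode_hint=True):
    """Decode a single (possibly hinted) FixRaw object from `data`."""
    pos = 0
    is_utf8 = False
    if data[pos] == 0xC1:          # the proposed hint byte
        is_utf8 = True
        pos += 1                   # old decoders would simply skip it
    head = data[pos]
    assert 0xA0 <= head <= 0xBF, "only FixRaw supported in this sketch"
    length = head & 0x1F
    body = data[pos + 1 : pos + 1 + length]
    if is_utf8 and decode_hint:
        return body.decode("utf-8")   # recover the text type
    return bytes(body)                # plain binary

# "hello" as a hinted FixRaw: C1 A5 68 65 6C 6C 6F
print(unpack_raw(bytes([0xC1, 0xA5]) + b"hello"))   # returns the str 'hello'
print(unpack_raw(bytes([0xA5]) + b"hello"))         # returns the bytes b'hello'
```

A decoder for a language without a string type would pass `decode_hint=False` and get plain bytes back, which is the "just skip the byte" behavior the proposal describes.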

@kuenishi
Owner
@Midar

-1: as every key is usually a string, this wastes huge amounts of bytes. How is that better than having a new type? It breaks compatibility just like a new type would. I really don't see the point of this.

@methane

@Midar Then, how about my proposal v1.1:

A \xC1 byte preceding a map means all keys of the map are UTF-8.

@Midar

@methane That would make it impossible to have binary and string keys in a map.

Can you please elaborate on why you want to add an extra byte if it breaks compatibility anyway? If you're going to break compatibility anyway, why not just add a new type?

@methane

@Midar My proposal v1.1 is an addition to the original proposal. Mixing binary keys and string keys is a very rare case.
In such rare cases, adding the hint only to strings is good enough.

Yes, my proposal is somewhat backward incompatible, but with minimal impact.
Just skipping \xC1 makes all current implementations compatible with the new format.

@methane methane referenced this issue from a commit in msgpack/msgpack-php
@methane methane Skip `\xC1` to accept new msgpack format proposal.
`\xC1` is a hint for UTF-8 strings. But PHP doesn't have a unicode type, and strings
don't have encoding information.
Just skipping this hint may be enough.
e821972
@methane

For example, msgpack-php may be able to accept the new format I proposed with a 2-line patch:
msgpack/msgpack-php@e821972

@najeira

@frsyuki wrote:

0xa0-0xaf FixString (0bytes - 15bytes raw type with hint) // changed
0xb0-0xbf FixRaw (16bytes - 31bytes raw type)

A new msgpack reader can not read data that the current msgpack packed with a short FixRaw.

That is too big an impact on users who store data with the current msgpack.
Of course, they can ignore the new version or migrate their data.

"Old readers can NOT read new data" is more acceptable to me
than "New readers can NOT read old data".

But I know that frsyuki's proposal makes the packed data small.

@methane's proposal achieves "New readers can read old data",
because it doesn't change the current types.

@najeira

Adding string types without changing FixRaw like this:

0xd6-string 8
0xd7-string 16
0xd8-string 32

This proposal has more overhead than frsyuki's proposal.

FixString:
"hello" => 0xa5 0x68 0x65 0x6c 0x6c 0x6f

String 8:
"hello" => 0xd6 0x05 0x68 0x65 0x6c 0x6c 0x6f

0xC1 hint:
"hello" => 0xc1 0xa5 0x68 0x65 0x6c 0x6c 0x6f

When packing 32-byte to 255-byte strings,
String 8 is one byte smaller than the 0xC1 hint.

Which one do you prefer, guys? or do you have any other idea?
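The byte sequences listed above can be checked with a few lines of Python. Note that the 0xD6 "String 8" tag and the 0xC1 hint are proposal values from this thread, not part of the current msgpack spec; only FixRaw/FixString (0xa0 | length) exists today.

```python
# Sketch comparing the wire bytes of the three candidate encodings of
# "hello" discussed above. 0xD6 (String 8) and 0xC1 (hint) are
# proposal values, not part of the current msgpack spec.

s = b"hello"

fixstring = bytes([0xA0 | len(s)]) + s          # 0xa5 'hello'
string8   = bytes([0xD6, len(s)]) + s           # 0xd6 0x05 'hello'
c1_hint   = bytes([0xC1, 0xA0 | len(s)]) + s    # 0xc1 0xa5 'hello'

for name, enc in [("FixString", fixstring), ("String 8", string8),
                  ("0xC1 hint", c1_hint)]:
    print(f"{name:10} {len(enc)} bytes: {enc.hex(' ')}")
```

For a 5-byte string, FixString costs 6 bytes on the wire while both String 8 and the 0xC1 hint cost 7, matching the comparison above.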

@Midar

@methane And why would it be harder to handle the new type as binary if the language does not make the distinction? That's no more work than skipping that one byte. You could still parse old data, and it would be shorter. So where's the advantage of your solution over a new type?

@kiyoto

@rasky
Understood =) I just think we all need to be very mindful when discussing something like cultural/social groups.

"And I'm not frustrated. I've been working with open source far too long to get frustrated by a maintainer I don't agree with. I just wanted to make this thread converge one way or another, and to get a final pronouncement from the main maintainer, so that I can decide how I can proceed to implement my software."

That was some presumptuousness on my part. You have my apologies. By the way, looking at how many comments have flooded in between our comments, I don't think the "final pronouncement" is anywhere in sight ;-p

@kazuho

@najeira @frsyuki
If you are going to introduce new tags to mark strings, it would be worthwhile to consider adding a specialized map type that only allows strings as keys. Many languages with predefined map types (and JSON) only permit strings as keys. By introducing such a restricted variant of the map type you would be able to squeeze out a couple of bits per key, which is small one by one, but would be a substantial gain as a whole.

@Midar

For what it's worth, I also implemented generating BinaryPack:
Midar/objfw@4d4fbe6

So, there is a BinaryPack implementation now with which people can play around and can be used to test BinaryPack against the various suggestions here. This also proves that implementing BinaryPack in a C-based language is easily possible in a single day.

If anybody is interested in a C(++) binding for that because they don't know Objective-C and want to play around with it, I'll happily write one (no Mac required, this is using ObjFW instead of Apple's Foundation, so it runs everywhere).

@kenn

Sorry for being a noob, but can someone explain why we don't want to simply introduce new types (FixString, String 16 and String 32), given that we have now come to agree that a UTF-8 hint is a good thing?

It doesn't seem like a big deal to me, considering that languages that don't care about strings can just treat those new types as FixRaw, Raw 16, and Raw 32 respectively. Am I wrong?

I think the spec should remain clean and succinct, and that's exactly why I like msgpack. Symmetry between Raw and String would help us application developers choose the right types without running into unexpected traps.

@cabo:

If we could generate a probability distribution function for sizes from a real live system, this could be used to dispel such criticism. Do you know any system we could tap here? We really just need a histogram of raw sizes.

As you may be aware, that depends on the use cases. But collectively, it is empirically apparent that Zipf's law and a geometric distribution are observed. I like the idea of giving shorter word sizes to shorter values. Let's keep our FixRaw as it is, so that it can continue to handle 0-15 bytes. :)

@methane

OK, There are some proposals now:

proposal           impact   efficiency
0. no change       zero     best
1. c1 is hint      minimum  not bad
2. c1 is string32  minimum  bad
3. @najeira's      medium   good
4. @frsyuki's      medium?  best
5. binarypack      fatal    good

notes on impact

Minimum means ~10 lines of code can make old decoders compatible with the new format.
Tests for that are also ~50 lines of code.

Medium means ~30 lines of code can make old decoders compatible with the new format.
Tests for that may be ~150 lines of code.
@frsyuki's proposal uses some areas currently assigned to raw in the spec.
But those areas are non-minimal representations of raw.
If no one uses them, it's backward compatible.

Fatal means a backward-incompatible data format. It is not an option for us.

notes on efficiency

Bad means worse than JSON on small strings.
Not bad means equal to or better than JSON in most cases.
Good means efficient like the current raw format.

@Midar

@methane Thanks for that overview. If it were a vote, I'd vote for 4. This would be even preferable to BinaryPack due to Raw8/String8. But then again, BinaryPack could become an RFC.

So how about going with 4 and trying to get that an RFC instead?

Otherwise, if 4 gets approved, I'll happily switch from BinaryPack to MessagePack - the difference is so little that this is almost no work.

@kenn

Ah, I just realized that I was dumb indeed - FixString as I described was impossible. Now I understand why @frsyuki suggested that. Sorry for confusion.

@methane

Sorry for being a noob, but can someone explain why we don't want to simply introduce new types (FixString, String 16 and String 32), given that we now come to agree that it's a good thing to have a UTF-8 hint?

I'm not against adding string types.

The con of this proposal is the impact on implementations and tests, especially for decoders that just decode raw for strings.

Adding one type requires adding at least two test cases (shortest and longest).
The possibility of introducing bugs may increase. Adding three types triples the cost.

Additionally, adding three types consumes a lot of the reserved area (especially FixString).
But @frsyuki's proposal minimizes this impact.

@methane

An additional pro of hinting is that it clearly means "just hinted raw".
No one can claim "Packers should produce valid shortest-form (or NFC) UTF-8" or
"Unpackers should validate that it's canonical UTF-8".
This is one reason why I like my proposal.

But again, I'm not against @frsyuki's and @najeira's proposals.
0. +0, 1. +1, 2. -1, 3. +0.5, 4. +0.5, 5. -100

@kenn

Thanks for the wrap-up, @methane. Now I know where I was wrong: a JSON string has a minimum of 3 bytes (e.g. "a"), so String / Raw can be better than JSON in every case. I'd vote for 4.
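The size claim above can be checked in two lines of Python (the msgpack bytes are built by hand here rather than with a msgpack library):

```python
# A minimal JSON string costs 3 bytes ('"a"'), while a msgpack
# FixRaw/FixString costs length + 1 (tag byte 0xa0 | length).
import json

json_len = len(json.dumps("a"))              # '"a"' -> 3 bytes
msgpack_len = len(bytes([0xA0 | 1]) + b"a")  # 0xa1 'a' -> 2 bytes
print(json_len, msgpack_len)
```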

@kuenishi kuenishi referenced this issue in msgpack/msgpack-erlang
Closed

Encode/Decode proplists #9

@repeatedly

@methane thanks for the summary. I just caught up on the new comments.

Adding a string type mainly helps some dynamically-typed languages in raw-and-string mixed cases (not the JSON case).
So the next step is implementation-level discussion and feedback to the spec.

@mattn

msgpack is not a casual data format like JSON, YAML, or other text protocol formats. I think you shouldn't treat both as the same category of format. Most text protocol formats are trusted by users as SAFE. I'd guess whoever decodes msgpack should be whoever encoded it. Text protocol formats don't contain hardware-specific values, e.g. data lengths, flag values, etc.
So I feel this discussion is not purposeful for us. A string type shouldn't be handled in low-layer protocols.

@chakrit

msgpack is not casual data format like JSON

[Screenshot of msgpack.org]

http://msgpack.org/

@mattn

but could not for technical reasons (encoding, size, speed)

JSON is trusted because it doesn't have low-layer things. From that image, I can't see that MessagePack is a replacement for JSON.
Can you see it as anything but PR?

@chakrit

If you ever wished to use JSON for convenience (storing an image with metadata) but could not for technical reasons (encoding, size, speed...), MessagePack is a perfect replacement.

@mattn

I think it's hype. It should be: `If it is hard for you to encode binary or handle low-layer protocols, MessagePack will be a PERFECT REPLACEMENT FOR THE HARD WORK.`

@mattn

JSON is often used for conversation between server and client. Because JSON is trusted, it does not contain anything that can cause a crash or be insecure (if you use JSON.parse or JSON.stringify).
Do you want to use msgpack to communicate with a stranger who doesn't know the structure of the data?
MessagePack handles binary protocols. If you want to communicate string or image data, you should design the data format using the raw field at an upper layer.

Why do you want to change the spec? Why don't you design the upper layer?

@methane

JSON can't contain binary. In such cases, msgpack can be a perfect JSON replacement.
You can pack all strings in raw.
msgpack is just a container. How to use it is the application's responsibility.

I'm not against adding an optional hint to the msgpack spec.
There are demands for mixing binary and unicode in one message.
But JSON can't be used in such cases either.

@chakrit

@methane yeah, I hope we're going forward with the hint addition. Right now it's hard to implement a generic handler correctly.

@najeira

I would vote for frsyuki's proposal.
That proposal allows a new reader to read old data with a compatibility mode.

@cabo

I think we have had a pretty good discussion so far, even if it may not look like that so much :-)

I have picked up @frsyuki's proposal, simplified it somewhat, and put it into a draft next version of the Internet-Draft. (Yes a draft draft.) Please see there for the technical content. The main reason I did the small change is that I'm not sure small binary values are frequent enough to merit complicating the short-string case, so I assigned all 32 code points to short strings. Binary values will then always use raw8. I think this is about as simple as it can get.

Enjoy at http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre1.txt

I plan to listen some more to the discussion here the next couple of days.
Due to the timing in the runup to the Orlando IETF, I will have to submit the final version of the Internet-Draft on Monday. So I would be happy if we could get a bit of a closure on the string representation issue here until then.

The draft draft has an appendix laying out some additional work, probably to be tackled after Monday. I'm not quite sure about the best venue for discussing these. Of course, we could open/hijack msgpack issues for these as well.

@frsyuki
Owner

I'm thinking about the API design. This is a different problem from the format design, but it affects the format design (meaning we need to think about the API as well if we discuss the format further).
This design needs to handle the following problems (at least):

  • How to implement serializer/deserializer:
    • A) in languages which clearly distinguish strings from binaries (e.g.: Objective-C, JavaScript)
    • B) in languages which don't distinguish strings from binaries (e.g.:
    • C) in languages which optionally distinguish strings from binaries (e.g.: Ruby)
    • These implementations need to consider each other: "if B uses this implementation, how do we implement A?"
  • Error handling:
    • Whether serializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not
    • Whether deserializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not
    • Whether deserializer should normalize (replaces invalid bytes as UTF-8 in strings) strings or not
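The deserializer-side choices in the list above can be illustrated with Python's built-in UTF-8 error handlers. This is a sketch, not a proposed msgpack API; the function and the policy names are hypothetical.

```python
# Sketch of the three deserializer error-handling policies discussed
# above, using Python's built-in UTF-8 error handlers. The policy names
# ("validate", "normalize", "raw") are hypothetical, not msgpack API.

def deserialize_string(data: bytes, policy: str):
    if policy == "validate":      # raise on invalid bytes
        return data.decode("utf-8")             # UnicodeDecodeError on bad input
    if policy == "normalize":     # replace invalid bytes with U+FFFD
        return data.decode("utf-8", "replace")
    return data                   # "raw": hand the bytes to the caller

raw = b"abc\xdf"   # 0xdf starts a 2-byte sequence with no continuation byte

print(deserialize_string(raw, "normalize"))
print(deserialize_string(raw, "raw"))
try:
    deserialize_string(raw, "validate")
except UnicodeDecodeError as e:
    print("validate policy raised:", e.reason)
```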
@cabo

A) in languages which clearly distinguish strings from binaries (e.g.: Objective-C, JavaScript)

This is pretty obvious, I think.

B) in languages which don't distinguish strings from binaries (e.g.:

Most of these are strongly typed, so you can find some way to put this information into the typing system.

C) in languages which optionally distinguish strings from binaries (e.g.: Ruby)

(In Ruby, the distinction is not at all optional: binary is in encoding BINARY, text is UTF-8.)

Whether serializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not

I think an implementation should not send invalid data. Whether that means the serializer needs to validate or that it can rely on its callers to supply reasonable data is an implementation detail.

Whether deserializer should validate (raise errors if the string includes invalid bytes as UTF-8) strings or not

Again, that depends on the expectations of the caller. If the callers are able to handle invalid data, give it to them.
If they would blow up, raise the exception in the msgpack deserializer.

Whether deserializer should normalize (replaces invalid bytes as UTF-8 in strings) strings or not

Never. You might want to offer a "hand me anything in raw form" version for debugging these situations, but "defensive programming" is a mistake. Just blow up.

(I read this assuming this was about errors in the UTF-8 encoding rules. I think we all agree there should be no Unicode normalization or normalization checking in the msgpack serializer/deserializer.)

@cabo

You might want to offer a "hand me anything in raw form" version

... and you want this in a msgpack API to achieve the level of backward compatibility that the recent proposals like @frsyuki's and 01pre1 provide.

@frsyuki
Owner

@cabo Let me add a complement regarding B:
In Ruby (at least), a String object that has UTF-8 encoding information can still contain an invalid byte sequence. For example:

require 'uri'
s = URI.unescape("%df")
p s #=> "\xDF"
p s.encoding #=> #<Encoding:UTF-8>

This often happens in Ruby on Rails programs. This is just one example, but it happens. So in Ruby programs, any String object could contain an invalid UTF-8 byte sequence.
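For comparison, here is the same "%df" case in a category-A language. In Python, urllib hands back a bytes object, and decoding it as UTF-8 fails, so the invalid sequence can never end up inside a str (a small illustration, not msgpack code):

```python
# The same "%df" byte from the Ruby example above: Python's urllib
# returns bytes, and strict UTF-8 decoding rejects the lone 0xdf byte,
# so an invalid sequence cannot reach the str type.
from urllib.parse import unquote_to_bytes

b = unquote_to_bytes("%df")
print(b)            # b'\xdf'
try:
    b.decode("utf-8")
except UnicodeDecodeError:
    print("invalid UTF-8 byte sequence")
```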

@cabo

Arguably, that is a bug in URI.unescape. Of course, the whole issue of character encoding in URIs is muddy, so I won't blame the authors of that code. But clearly, a URI.unescape API should contain methods to handle the uncertainty created by real-world browsers. I don't think handling this issue is a concern of an unrelated piece of software like msgpack.

@frsyuki
Owner

@cabo I see what you mean. I think I need to change the sentences to be clearer:

  • A) in languages which clearly distinguish strings from binaries and strings can contain only valid Unicode characters (e.g.: Objective-C, JavaScript, Python)
  • B) in languages which don't distinguish strings from binaries or do distinguish but strings can contain invalid Unicode characters (e.g.: Ruby, Perl, PHP, C++, Erlang)
@frsyuki
Owner

@cabo Please understand one thing: I do NOT want to have two similar specifications (BinaryPack and MessagePack). That would be very confusing and no one would be happy. I'm prepared to extend the MessagePack spec if it's appropriate. So I would ask you NOT to fix the RFC without consensus.
I think you have your own goal while I have my goal. But let's work to avoid the worst case.

I'm almost in agreement with adding the string type extension. But it means making a compromise on my goal. I need to build new consistent semantics with the string type extension, which was not expected originally.

Now I'm thinking about the API and the format which works well with the API.

@DestyNova

@mattn

Do you want to use msgpack to communicate with a stranger who doesn't know the structure of the data?
MessagePack handles binary protocols. If you want to communicate string or image data, you should design the data format using the raw field at an upper layer.

Why do you want to change the spec? Why don't you design the upper layer?

No. Please read the information on the MessagePack site before assuming that everybody else is using it wrong:

http://wiki.msgpack.org/display/MSGPACK/Design+of+RPC

Because every MessagePack message contains the type information side-by-side, clients and servers don't need any schemas or interface definitions basically. This is handy for utilizing it both in dynamically typed and statically typed languages.

http://wiki.msgpack.org/display/MSGPACK/Overview

MessagePack is an efficient object serialization library, which are very compact and fast data format, with rich data structures compatible with JSON.

Note: rich data structures compatible with JSON. This means something. Otherwise, why have any types? Why not just have everything as "raw"? Why have MessagePack at all?

@frsyuki
Owner

I think we need to consider the documents separately from the spec. If the documents are wrong, the documents should be fixed.

I originally created MessagePack to develop a distributed storage system (=backend program) in C++ and Ruby (1.8). In this case "like JSON" was true, because I didn't have to tell strings from binaries. JSON has only strings while MessagePack has only raws. Thus there was no problem replacing all strings with raws, because these languages don't have to tell strings from binaries.

I created the website (and some others such as @kzk added some documents). I haven't used Python or JavaScript with MessagePack (I've used MessagePack in an Objective-C program, but I didn't have any problems because I used a schema to project msgpack types into Objective-C types). I need to change something. It could be the documents or the spec, but you shouldn't mix up these two separate problems.

@DestyNova

@frsyuki
I'm not mixing up any problems; rather I'm pointing out that the use case that suits some people is not necessarily the use case that suits everyone. If we can find a solution that works for most people without causing too many new problems, that's IMO much more useful than joining the discussion only to say "you are using MessagePack wrong, just make a schema or else use BSON/etc".

JSON has only strings while MessagePack has only raws. Thus there're no problems to replace all strings with raws because these languages don't have to tell strings from binaries.

That's great, but it's also completely against your stated goal of cross-platform, cross-language compatibility, because decoders have no idea what to do with the raw bytes which might be in any random format.
I think it's pretty obvious that strings are an important datatype which deserve better support than "dump the bytes and hope that the receiver knows (or can guess) what encoding you use".

@cabo

@frsyuki:

@cabo Please understand one thing: I do NOT want to have two similar specification (BinaryPack and MessagePack).

That is exactly why I came here to make sure we can find common ground. In the process of doing so, we need names for the various variants being discussed. So my current variant is called 01pre1.

I did 01pre1 because I think it does address all concerns raised here and is simpler than your previous proposal. I wrote it up because it is hard to discuss unless written up. I didn't submit it to the Internet-Drafts directory because I want to discuss it here first, so it's just on my personal web server. I need to finish discussing by Monday, though, and that is when I'll send a -01 to the Internet-drafts directory.

I continued calling the current spec "binarypack" because I didn't want to misappropriate the well-known and well-regarded "msgpack" label. If, in the course of defining this, we reach agreement, I'm much happier to use the "msgpack" name. Actually I called the most recent strawman BinaryPack01pre1, because while we are still in the process of nailing things down, it is good to have a name.

In the end I'd like to have a spec that both solves the problem well that I'm trying to solve and works well for the msgpack community. (If that is not possible, there will be a spec that solves the problem well, and I'll call it something else. But right now it seems a common spec is possible.)

@kenn

@cabo Even when we reach a solid agreement on a particular spec, I don't think bringing it to the IETF is a good idea, be it MessagePack or BinaryPack. At least not this early. Just because we agree in theory doesn't mean the new spec has been proven to work flawlessly in the wild.

As the inventor of msgpack has clearly stated that he has no interest in taking the discussion over to a standards committee, we should respect his intent. Especially when such a move could easily be seen as a political tactic to change the game to your advantage.

To clarify, I'm not opposed to bringing it to the IETF in the future; it's just that now is not the right time. We'll know when it's appropriate. That's what happened with JSON: it was there since 2002, but RFC 4627 was established in 2006, well after it was already a de facto standard. And most importantly, Douglas Crockford did it willingly. We should wait for now and hopefully @frsyuki will become open to working with the IETF some day.

@kazuho

+1 to @kenn

@cabo
Please do not get me wrong. I appreciate your efforts on working for adding a string type to MessagePack. And if @frsyuki decides not to introduce string types to MessagePack, then it is understandable that proposing a different specification through IETF is a good way to promote such a format. But it does not seem to be the case any more.

My understanding is that the steering person of MessagePack is @frsyuki. Proposing it to the IETF would mean that there would be two steering persons / committees for a single specification.

@Midar

Just for the record, I really like @cabo's new proposal and implemented it :).

Could we get something like this into MessagePack? Ideally, I'd like to see BinaryPack get imported back into MessagePack. Then we'd have one format and that would even be a standard. That would really be the best case.

@cabo

@kenn, @kazuho: I actually don't think I have that much of a choice.

I need a spec for something like a binary JSON (misnomer, but close enough), in order to be able to place protocols such as SenML on top of that.

In developing that spec, I could ignore msgpack, of course. But I think it has shown great potential, and choosing a spec to start from also reduces the peril of "bikeshedding".

There are indeed some things missing from msgpack. I tackled the Text String issue first, because that is the most obvious gap. My current draft draft outlines a small number of other areas where an addition to msgpack could be considered necessary. But generally, I'm quite happy with msgpack. And I still hope we can do these other things while keeping any impact on backwards compatibility under control.

So, while going outright for a "fork" might be a useful strategy to disentangle things, I believe that doing this together will benefit both the IETF and the msgpack community. I read @frsyuki's last statement as some initial support for this approach. It is also simply the right thing for me to at least try — I don't just want to "steal" the spec.

Actually going for standardization will require the IETF to have change control. There is a danger that this could lead us away from the msgpack community. (Worse, people that want to distract from this effort might deliberately attempt to make this happen.) The role of the msgpack community and especially of @frsyuki will always be a bit delicate in this process. But the IETF is used to introducing established practice into standardization, and we know that the stewards of an existing specification brought into the IETF always have a special role and an important voice. @frsyuki can choose to actively exercise this role or stay in the background. Either way, I'm confident that we can manage this process in a way that is satisfactory to both ends.

It is not a given that the IETF will want to pursue standardization of this kind of format at all. (Again, people that want msgpack "to be left alone" might want to deliberately attempt to make the IETF process fail. But I'm trusting that this community is not of this kind.) I'm actually looking to your support to make a standard happen.

I think, in the end, msgpack will benefit from additional visibility, and from the technical scrutiny that an IETF process brings with it. Getting a standard done is a serious amount of work, though.

@cabo

@kenn: I am actually quite happy with the level of maturity that msgpack already has. Why do you say it is "early"?

I don't think the history of JSON is a good model for future work in this space. JSON just happened while a lot of people were still thinking XML had a solid grip on this space. So it was the right thing to do this a bit under cover. (Actually, although being written down in an RFC and having widespread consensus behind it, JSON isn't technically even an IETF standard yet; we will start the process for that in March! But it won't change in that process, we'll just get rid of the UTF-16 and UTF-32 blind alleys.)

The development of JSON was also special in that it essentially just showed how to use elements of an existing spec (ECMA 262) for its purpose. It had the advantage of never having to discuss the essence of that spec, just minor details such as whether comments should be included or not.

In the world of "binary JSONs" (sorry), from an existing standards point of view, we essentially have a green field. (Unless you want to start from ASN.1 BER. I hope you understand why I don't want to do this.) So we need a bit more active stewardship to converge on one spec.

msgpack has a lot of what is needed, and the wide implementation will help avoid extensive "bikeshedding". If I were to design a format from scratch, I'd do a few things somewhat differently. But that is almost all on the level of bikeshedding. The only pain that msgpack causes me is that it has already spent almost all codepoints (I think 11 are left out of 256), maybe out of some exuberant confidence that there won't be any need for extensions. Reclaiming some of this space would cause considerable pain, so I'm happy that we found a solution for introducing Strings that just reinterprets some code points in a mostly benign way.

@cabo

To clarify what I'm trying to do here, I wrote up a first draft of my objectives.

Roughly in decreasing order of importance, they are:

  • Representing a reasonable set of basic data types and structures
    using binary encoding. "Reasonable" here is largely influenced by
    the capabilities of JSON, with the single addition of raw
    byte strings. The structures supported are limited to trees; no
    loops or lattice-style graphs.

  • Being implementable in a very small amount of code, thus being
    applicable to constrained nodes {{?I-D.ietf-lwig-terminology}}, even
    of class 1. (Complexity goal.) As a corollary: Being close to
    contemporary machine representations of data (e.g., not requiring
    binary-to-decimal conversion).

  • Being applicable to schema-less use. For schema-informed binary
    encoding, a number of approaches are already available in the IETF,
    including XDR {{?RFC4506}}. (However, schema-informed use of the
    present specification, such as for a marshalling scheme for an RPC
    IDL, is not at all excluded. Any IDL for this is out of scope for this
    specification.)

  • Being reasonably compact. "Reasonable" here is bounded by JSON as
    an upper bound, and by implementation complexity maintaining a lower
    bound. The use of general compression schemes violates both of the
    complexity goals.

  • Being reasonably frugal in CPU usage. (The other complexity goal.)
    This is relevant both for constrained nodes and for potential usage
    in high-volume applications.

  • Supporting a reasonable level of round-tripping with JSON, as long
    as the data represented are within the capabilities of JSON.
    Defining a unidirectional mapping towards JSON for all types of
    data.

@frsyuki
Owner

@cabo I think I mostly understood what you mean. 1 question and 1 comment:

  • I couldn't understand this part: "thus being applicable to constrained nodes {{?I-D.ietf-lwig-terminology}}, even of class 1. (Complexity goal.)"
    • What's class 1? Could you try to explain using different words...?
  • Regarding "Supporting a reasonable level of round-tripping with JSON", there are some exceptions now:
    • json->msgpack conversion: the maximum length of arrays, maps and raws is limited to (2^32)-1 in msgpack
    • json->msgpack conversion: JSON represents numbers using decimal while msgpack uses floating points
    • msgpack->json conversion: msgpack can contain binaries
    • msgpack->json conversion: msgpack can use non-string/raw types as the keys of maps
    • msgpack->json conversion: msgpack can store only one primitive (non-map/array) value without map/array containers
    • I think none of them are real problems.
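The floating-point exception in the list above is easy to demonstrate without msgpack at all: JSON carries numbers as decimal text, while msgpack stores binary floating point, so decimal values don't always survive the trip exactly. A quick illustration in plain Python:

```python
import json

# JSON numbers are decimal text; msgpack stores IEEE-754 binary floats.
# 0.1 and 0.2 have no exact binary representation, so arithmetic on the
# parsed values differs slightly from the decimal arithmetic a reader
# would expect.
a = json.loads("0.1")
b = json.loads("0.2")
assert a + b != 0.3                  # binary rounding error
assert abs((a + b) - 0.3) < 1e-15    # but the error is tiny
```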
@catwell

Hello,

I'm joining this long thread a bit late, having read most of it. I wrote an implementation of MessagePack for Lua, which is a dynamic language of type B) (i.e. doesn't differentiate strings and raw bytes).

The interesting thing with Lua is that it also doesn't differentiate Arrays and Maps so we have already had this kind of implementation problems.

Decoding is not much of a problem in this case, although you lose a bit of information. We could add a way to decode that exposes additional type information but I have never needed it in practice. Encoding, on the other hand, is complicated because you have to decide on the right type to use.

The basic idea I ended up going with for Arrays / Maps (which both correspond to the type "table" in Lua) is that you have a specific function (in another language it could be something else, for instance an object) which takes an instance of an ambiguous type and returns how it should be encoded. In your implementation you provide a default version of that, but you allow users to override it if needed.

That works fine for tables because you can attach metadata to them, but you cannot do that for strings which are a much more basic datatype. You have to wrap them in a "more powerful" datatype. In Lua that would be either a table or a function.

This would be too complicated to explain to users so eventually the encoding library would have to abstract that, and the API would be like:

mylib.pack{
  my_string = mylib.string("this is a string"),
  my_binary_data = "this is raw bytes",
}

Note that even with the Array / Map issue, I have made my implementation interoperate with lots of other languages such as Python, Ruby and C.

Actually my implementation also supports an unofficial type for byte arrays for interoperability with msgpack-js (see catwell/luajit-msgpack-pure#6). This makes the opposite assumption, that the native MessagePack raw type is used to store strings, but having read this thread I think what I did back then is actually a bad idea for a variety of reasons. I implemented it the way @creationix suggested because I didn't really care myself (since I don't use JS).

I think it would be a good idea if @creationix joined the discussion too since he has experience dealing with this on the JS side.

@frsyuki
Owner

@catwell Thank you for your comments. I'm really curious about how @creationix handles MessagePack data in JavaScript.

@cabo

@frsyuki: The class 1 terminology is defined in the referenced terminology document, http://tools.ietf.org/html/draft-ietf-lwig-terminology — essentially this is a device with about 100 KiB of code storage and about 10 KiB of RAM.
You want to be very frugal with code size on such a device.

Re the JSON roundtripping: There are some initial considerations written up in my current draft. I agree that this works quite well with msgpack. We can accommodate binaries on the JSON side by base64url-encoding them, this is how JOSE (http://tools.ietf.org/wg/jose/) runs with the binary crypto information. On the msgpack side, I want to spare my little class1 devices from having to work on these base64url strings, hence my interest in raw byte strings. Obviously, the direction of the roundtripping from JSON to msgpack can only work in a schema-informed way if you want to turn the base64url strings from JSON back into real binary on the msgpack side.
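The base64url detour can be sketched in a few lines of Python; the helper names below are hypothetical, not part of any msgpack or JOSE library:

```python
import base64
import json

def binary_to_json_value(data):
    # base64url without padding, as JOSE does for binary crypto material.
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")

def json_value_to_binary(text):
    # Restore the stripped padding before decoding. Note that nothing in
    # the JSON document says this string *is* base64url: recovering the
    # raw bytes requires out-of-band (schema) knowledge, which is exactly
    # the limitation described above.
    return base64.urlsafe_b64decode(text + "=" * (-len(text) % 4))

blob = b"\x00\xff\x10raw"
doc = json.dumps({"payload": binary_to_json_value(blob)})
assert json_value_to_binary(json.loads(doc)["payload"]) == blob
```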

@cabo

Re https://github.com/creationix/msgpack-js: @creationix simply defines new binary types (0xd8 = buffer16, 0xd9 = buffer32). He did the differentiation just the other way around from what we have been doing here. (He also has "undefined" as a fourth special value.)
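Assuming the 16/32 suffixes denote the width of a big-endian length field, as in msgpack's own raw16/raw32 types, an encoder for these unofficial buffer types might look like the following sketch (an interpretation, not msgpack-js's actual code):

```python
import struct

BUFFER16 = 0xd8  # unofficial msgpack-js type, uint16 length
BUFFER32 = 0xd9  # unofficial msgpack-js type, uint32 length

def encode_buffer(data):
    # Tag byte, big-endian length, then the raw payload.
    if len(data) < 1 << 16:
        return struct.pack(">BH", BUFFER16, len(data)) + data
    return struct.pack(">BI", BUFFER32, len(data)) + data

# 3 payload bytes fit in a uint16 length, so the buffer16 form is used.
assert encode_buffer(b"\x01\x02\x03") == b"\xd8\x00\x03\x01\x02\x03"
```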

@mirabilos

@cabo for #121 (comment)

Please do consider omitting some things from JSON to make it a standard (and adding none).
For example, it would greatly help if \u0000 can be formally forbidden or at least deprecated, and I believe \uFFFE and \uFFFF are not valid either (actually I don’t think there’s a clear answer on whether they are allowed right now or not, and all implementations differ).

Considering JSON is intended as a portable data interchange format, disallowing them only makes sense. I don’t just think of C strings (as you can use a buffer API with (pointer, size) tuples in C) but also high-level languages that are constrained to C strings, and security implications.

For a bit more (and, sadly, a bit “incoherent”, as my English isn’t that good) rambling on that topic: https://www.mirbsd.org/permalinks/wlog-10_e20121201-tg.htm
Although everything else I write in there (like, suggesting only a few valid encodings; suggesting to always backslash-escape C1 control characters; suggest a nesting depth limit; suggest to always sort Object keys ASCIIbetically to not leak internal hashtable state after randomisation) are requests for implementors, not for the standard. I think I also make a good point about not using that JSON5 abomination. Everything I suggest to change in that post will still result in something that’s conforming to the ECMA-262 JSON.

@mirabilos

(This is about strings obviously. \x00 is fine in binary octet representations. My point was about JSON, which doesn’t have them.)

@cabo

@mirabilos — if you want to influence JSON standardization, the best way is to subscribe at https://www.ietf.org/mailman/listinfo/json and make this point to the mailing list. Better, read what already has been said on the mailing list (the above has a link to the archives) and chime in. The IETF is open to all!

@mirabilos

@cabo: Thanks, of a sort. I have so many projects I’m working on already, in addition to a dayjob, that I cannot follow any standardisation lists. For example, as a shell maintainer I should follow the Austin mailing list (POSIX), but it’s so high-volume I gave up after piling more than 1000 mails in less than a week…

Additionally, subscribing to a mailing list just to post something one-off is both effort for me and probably not liked by the people… but I’d be happy if you can forward my points.

@creationix

Wow, what a long thread. I think I was able to read about 30% of it.

So first, let me share my experience implementing msgpack for interop between browser JavaScript, Node.js JavaScript and LuaJIT Lua. In all three platforms there are distinct types for strings and raw binary data. In JavaScript, the string type is UTF-16 unicode. JSON, which is a subset of JavaScript, requires that strings are encoded as UTF-8 (which is the only sane unicode serialization format IMHO). JSON encoders and decoders for JavaScript already have to convert between the UTF-8 encoding and the 16-bit code unit encoding used internally in the language.

Now raw binary data is a new feature to JavaScript. In the browser it comes in the form of ArrayBuffer which you can read using typed arrays or DataView instances. Also object keys in JSON and JavaScript are restricted to unicode strings. In Node.JS, we created a binary format before typed arrays were popular called "Buffer". It works somewhat like the browser's ArrayBuffer type but with a different API for getting at the data.

Regardless of the differences between the two raw types, I am able to interoperate perfectly fine between them because their serialization format is just raw bytes. The problem arises when I want to msgpack-encode a string and a buffer and get a string and a buffer out on the other side. They are not the same type and have very different meanings. The main reason most people use msgpack over JSON in JavaScript is to have support for a binary data type. The other option is to base64-encode the binary data inside a unicode string, and even then you need some out-of-band encoding tag to tell the consumer it should be base64-decoded. JS data tends to be schema-less and types should be introspectable on their own. Having only one kind of value for strings and buffers is a real problem for JavaScript.

Now as for Lua, it's not quite as bad. Lua strings don't specify an encoding. I use UTF-8 in all my code. In LuaJIT I use the FFI to create raw char* buffers for my raw type. When interfacing with my JavaScript code I want my JS strings to come through as UTF-8 encoded Lua strings and I want my JavaScript buffers to come through as LuaJIT FFI char* arrays.

I implemented an extension to the msgpack protocol where strings are msgpack's raw type (because most code and JSON use strings), but also added a new type that's meant to be raw/Buffer. In practice, this has worked out very well for me. I would love it if this addition made it into the official spec so more languages could interop.

As long as msgpack is supposed to be like JSON, but also support binary data, it should really have two distinct types for unicode strings and raw binary data. Otherwise most of my JavaScript colleagues will use other formats, because they are not willing to add a schema to their protocols just to tell strings and buffers apart. The msgpack encoding and decoding layer should be standalone and not depend on user-provided schemas to know how to tell strings apart from raw. We can't just decode all msgpack raw data as buffers in JavaScript because not all JavaScript runtimes even have a binary type. Also, it's very expensive to create buffers and then convert them back to strings later on based on a schema from a later layer.

Let me reiterate: dynamically typed languages require that all values hold their type data internally for primitive types. They never rely on externally declared types or schemas. Read up on how dynamic language runtimes are implemented. This is the main reason they use more memory than statically typed systems: every value needs its type tagged somehow. To tell the user of a dynamic language, where this philosophy is ingrained so deeply, that they have to annotate their data with types just to tell apart two primitive types is crazy. That's like saying we should merge null, booleans, and numbers into one type, and they should use a schema to know if that 0 means null, false, or 0.
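The tagging described here is visible in any dynamic language; in Python, for instance, every primitive value carries its runtime type, so a decoder can tell values apart without any external schema:

```python
# Every value knows its own type at runtime; no schema is needed to tell
# null, booleans, numbers, text, and raw bytes apart.
values = [None, False, 0, 0.0, "text", b"raw bytes"]
tags = {type(v).__name__ for v in values}
assert tags == {"NoneType", "bool", "int", "float", "str", "bytes"}
```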

@creationix

Also, in case you don't know already, here are my JS implementations with the differences from msgpack documented: https://github.com/creationix/msgpack-js https://github.com/creationix/msgpack-js-browser

@kiyoto

@cabo just curious:

"Due to the timing in the runup to the Orlando IETF, I will have to submit the final version of the Internet-Draft on Monday. So I would be happy if we could get a bit of a closure on the string representation issue here until then."

Is there any extrinsic reason you need to get this done this year? To me, getting a standard ratified seems more of a by-product of everyone agreeing on a "satisficing" design (which seems to be the case now, finally) rather than an end goal.

@cabo

@kiyoto: the standardization of Smart Object Networking (Internet of Things, IP Smart Objects, whatever you want to call it) is happening now. I believe adding a msgpack-like component to it would create an important basis for other standards to build on. And it also appears to be achievable in short time. Why wait? Wait for what?

@kazuho

@kiyoto +1

@cabo

Please wait for consensus. Having the protocol specified tomorrow might be important for the areas in which you work. But it's not the only use case of MessagePack.

We should try to create a single specification that all parties can agree on, and I do not think it would be possible within such a short period.

Having support from existing developers / users of the current MessagePack spec is IMO essential to promoting the new version of MessagePack with string support (or BinaryPack). But introducing a string type to MessagePack will hurt existing users no matter what, since it is actually an attempt to split a single type ("raw"), which is used for storing both strings and binaries, into two types. Either type of data has to move somewhere else, so some incompatibility is inevitable.

So we should be cautious on making a final design. I think we are in a very delicate situation now whether we can reach consensus, and rushing to IETF might hurt such efforts.

I imagine you are very frustrated; my understanding is that you proposed BinaryPack by yourself since none of the MessagePack developers seemed to be interested in adding string types. And after that they have started! I can understand that.

But for the greater good, I wish you would withdraw the BinaryPack proposal this year. I think we should concentrate on trying to gain support from as many existing and potential users of MessagePack as possible, before declaring the protocol final. And after that, we should consider bringing the specification to the IETF if @frsyuki thinks that is a good way to spread the protocol.

You might lose some merit by not being able to refer to an IETF protocol for a year, but once we reach an agreement on the design, the power of the existing developers / users and the name of MessagePack will help you (and all of us) in promoting applications using "MessagePack with string types".

@cabo

https://gist.github.com/frsyuki/5022569

Great! I have aligned my draft draft with this: http://www.tzi.de/~cabo/draft-bormann-apparea-bpack-01pre2.txt

(I'm still referring to this format by the monster name of "BinaryPack1pre2" because the msgpack spec hasn't officially changed yet. I'd love some advice how to call this when I submit this tomorrow...)

@kazuho: The IETF is not going to turn this into an RFC tomorrow. I don't even think that our consensus processes are faster than yours... I can promise you this won't be an RFC in 2013. But it is important to have something written up now so we can build consensus on the general direction of going forward on the basis of a fully fleshed out technical proposal (that is the whole point of an Internet-Draft).

It is not damaging msgpack if the consensus process in this community is visible to the IETF and vice versa. We still have to decide whether to do our own thing or go with msgpack. I'm simply not in a position to withdraw my proposal. The only thing I could do is making it deliberately incompatible with msgpack. I don't think this community would benefit from that.

@cabo

Oh, one more comment:

I imagine you are very frustrated; my understanding is that you proposed BinaryPack by yourself since none of the MessagePack developers seemed to be interested in adding string types. And after that they have started! I can understand that.

Standardization can be much more frustrating than this... No, I'm not easily frustrated.

Actually, the history is that I needed a binary representation format, had been toying with msgpack for a while, but couldn't really use it because of the lack of string/binary differentiation. Then I ran into Eric Zhang's BinaryPack, and decided I should simply write this up. I didn't even know at the time that msgpack-js had also added this differentiation, in a different way... It's good we are starting to converge again.

@rasky

@frsyuki > https://gist.github.com/frsyuki/5022569

Why did you put Python 2 among the weak-string languages? It has had a Unicode type since 2.0, which is in wide usage, and anybody doing i18n programming with Python is using Unicode. The only issue (compared to 3.0) is that people tend to use the "str" type as an ASCII/UTF-8 string (much more than in Python 3). But given your definition, I still think Python 2 fits the strong-string languages.

I would also expect msgpack-python to correctly serialize Python 2 types so as to fix the original issue described in this document. Your document seems to imply that, for Python 2 (which you call weak-string), it's unrealistic to expect that it would work; I disagree with this.

@Midar

@frsyuki +1 on that. As soon as the MsgPack specification is updated, I'll rename my implementation from a BinaryPack implementation to a MsgPack implementation :)

I'm really happy we could finally find a good solution. Thus all that discussion was not for nothing :)

@catwell

I like @frsyuki 's proposal, as long as the implementers are careful to implement the transition plan correctly and do have a backwards compatible mode.

We have a lot (millions) of archives out there on servers but also on mobile devices of clients, and we have been using the raw type to store arbitrary binary data. That means we will have to use this backwards compatible mode for a long time.

That being said finally having an official string type is a good thing. I will implement the proposal in a few weeks if everybody looks happy with it.

@kazuho

@frsyuki

Great work! Compared to the others, the proposal seems to have the lowest impact on existing applications.

@cabo

Thank you for responding, and thank you for explaining your situation and ideas in detail. Let me explain my situation and why I think proposing the spec without @frsyuki's support is a bad idea.

I work for one of the largest companies in the web industry of Japan. The company has long been using MessagePack on the server side, as a data format for the key-value stores and for server-to-server communications.

Recently, there has been rising demand from our client-side developers (I am one of them) for a JSON-like protocol that can efficiently store binary data. Thanks to the HTML5 specs, binary data is becoming familiar on web browsers, and as existing users of JSON, what we want is some kind of well-designed format that can store all the data types of JSON plus binary.

MessagePack with support for string types is an excellent choice for such a requirement, since our server-side engineers are already well-experienced with the protocol, from developing with the libraries to debugging at the wire level.

But we would also face problems once the protocol spec. gets updated.

As I mentioned, we are already using the protocol, not only as a server-side protocol but for storing data as well. Most of our "raw" data are strings for sure, but there might be binaries (such as images) as well. It is very difficult to check; we have thousands of developers working on many applications, and the combination of MessagePack and key-value stores can be found in many of them.

Adding a string type to MessagePack can never be done without moving some of the data to a new area (and I like @frsyuki's proposal very much in that it does not move strings, which makes it easier for us to migrate).

But we still need to find out the right way to implement the codecs that encode/decode MessagePack data, so that we could support both the legacy and the new format at the same time with minimum effort.

Although I am very optimistic esp. after looking at @frsyuki's proposal, I still need to convince my colleagues to support MessagePack with string types, or request a change if we find any problem.

And if I fail, my company would likely not use "MessagePack or similar protocols with string support" even as a client-side-only protocol, since it would be confusing for us to have two similar but different protocols. It would make things harder to debug. We would likely go for BSON (though I do not think it's well-designed) or something of the sort.

This is my personal situation, but I think many developers in the web industry think the same way.

And that is the reason why I think sending something to the IETF now, especially without the consent of @frsyuki, is a bad idea. IMO we are still in an early stage of designing the protocol and the API. Having two steering wheels for the enhancement at this moment decreases the possibility of our reaching a single protocol.

The only thing I could do is making it deliberately incompatible with msgpack.

If you are going to propose the protocol to IETF anyway, I would appreciate it if you could make it as different as possible from MessagePack so that it would never get considered as a "variant of MessagePack", which would likely cause confusion.

In fact, if BinaryPack gains familiarity to an extent that my company starts considering adopting the protocol, it would be better for us if the two protocols were not similar at all; it would help us distinguish the two at the wire level and debugging would not become difficult.

@methane

@rasky Many Pythonistas follow the best practice: use unicode for all strings.
But it's not part of the language spec.
ASCII-only bytes compare equal to unicode (e.g. b'id' == u'id'), and builtins use bytes as strings:

def foo(): pass
assert type(foo.__name__) is bytes  # in Python 2, str is bytes

So Python 2 is a weak-string language by language spec and a strong-string language by practice.

@frsyuki
Owner

What I want to do from this time regarding this string type issue is following process:

  1. I'm about to propose an idea to change the spec
  2. Active authors of implementation projects implement the idea
  3. They release that implementation as an experimental release (could be an internal release for their use case)
  4. Active users try the implementation and validate how it works
  5. If the proposal needs a fix, fix it. This fix may include changes to the proposed format
  6. Iterate 1 to 4 again until there is enough knowledge
  7. Release the already implemented experimental release officially and the proposal as the official release and official spec

My article (https://gist.github.com/frsyuki/5022569) is about to be the step 1.

This takes time. But anyway, active users can't use this release soon without validation for all projects, as @catwell and @kazuho mentioned. I think this is the correct way to change a currently working spec.

I'm using MessagePack to provide a cloud-based service which stores data and runs queries on the data. My company has terabytes of customers' data in msgpack format. Changing the format of the data is almost impossible.

@rasky

@methane that's because in Python 2 it's not possible to define a symbol name using non-ASCII characters; the fact that it's using bytes internally is an implementation detail, as there is no difference in comparison, as you noticed. Now, there are many places where Python 2 doesn't have a clear string/unicode distinction at the API level, but still most Pythonistas know and expect unicode for strings.

My point is that, in @frsyuki's document, weak-string languages should behave in a way that I think is totally wrong for Python, because Python has had full Unicode support since 2.0. This is why I think @frsyuki is wrong, and Python 2.x should be moved among the strong-string languages, so that all uses of the msgpack string type are converted to Python unicode.
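Under the strong-string treatment argued for here, a Python decoder would decode raw data tagged as a string back into unicode, making the round trip from the opening example work. A minimal sketch of the policy (not msgpack-python's actual code):

```python
# On the wire msgpack carries the UTF-8 bytes; a strong-string decoder
# turns them back into a unicode string, so dict keys compare equal
# again after a round trip.
original = {u"東京": True}
wire_key = u"東京".encode("utf-8")        # what the raw type stores
decoded = {wire_key.decode("utf-8"): True}
assert decoded == original
assert decoded[u"東京"] is True
```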

@rasky rasky closed this
@rasky rasky reopened this
@cabo

@kazuho: This is fascinating. Of course, I don't have any visibility into the decision processes in your organization, so it is indeed new information to me that widening the discussion to include the IETF could discourage adoption there. I don't want to sound harsh, but after thirty years of standardization I'm aware that there is sometimes collateral damage. I'll still try to minimize that, if possible at all. So if you have any proposal for me that is not equivalent to "stop doing your work", please send it to me, via e-mail.

Re your proposal of doing something deliberately incompatible: I actually thought about submitting a "msgpack done right", without any constraints by msgpack compatibility. Designing from scratch would probably yield an incrementally, slightly better encoding than what we have now. But it is unlikely that there would be a functional difference. I don't want to be guilty of http://xkcd.com/927/ — so I will consider this only in earnest if this community explicitly instructs me that compatibility with a future IETF specification is undesirable.

@Midar

@frsyuki For the record, I already implemented your proposal and "released" it (it's visible in Git), as it's equal to BinaryPack1pre2. So steps 1 to 3 are covered already :).

@frsyuki
Owner

@cabo

Consensus across the active committers is necessary to run steps 2, 3 and 4 that I described above. Without it, it becomes difficult to verify whether the proposal works well in real production environments, and thus the msgpack core team may end up offering a disappointing spec.

If another idea is proposed (such as adding a time type, timezone type, uuid, 128-bit int, 16-bit floats, bigint, decimal, regexp, sha1 hash, whatever; there are so many requests), then it needs the same cycle again. Active msgpack users (including me and the authors) need verified compatibility (of both code and data) at all times. The IETF might think msgpack is not yet defined and easy to change, but actually it is defined. I think you've understood, but again, we can't assume that changing or extending a currently working spec is done correctly without implementations and verification in real production environments.

I think the currently proposed msgpack should stay fixed (consistent) as is. JSON is working very well. JSON doesn't have any other types. They're simple. I know msgpack/json don't satisfy all cases. But that is the application's business.

So, anyway, I think I need to ask you some questions:

  • What will happen next if you propose the draft draft?
  • If you propose it, how can we change the proposed draft later? I mean:
    • We may be able to improve the draft after the verification process.
    • I think the draft needs to be based on the concept of msgpack, to guide users away from misuse
  • How can the authors/stakeholders prevent the working group from adding changes to msgpack without the verification process in their production environments and their/the stakeholders' consensus?
    • And, who are you? Why do you know about the IETF? (I just mean I need some more information to understand why you desire to submit a proposal now)
@kazuho

@cabo

@kazuho: This is fascinating. Of course, I don't have any visibility into the decision processes in your organization, so it is indeed new information to me that widening the discussion to include the IETF could discourage adoption there.

From your statement I understand that you do not know how MessagePack is actually being used (as I mentioned, my company's case is not something unique), and that fact makes me really scared about your aptitude as a submitter of an internet draft of MessagePack.

Do you think somebody without the knowledge of how the protocol is being used can make good decisions regarding the protocol? I do not think so.

@methane

@rasky The reason Python 2 is listed as a weak-string language is that I told @frsyuki so.

The MessagePack update categorizes languages where "bytes may represent strings" as "weak-string languages". Having a unicode type doesn't imply a strong distinction between binary and string, because bytes may be used for both binary and string data.

I'm a big fan of Python, and I don't intend to speak ill of Python 2.

@frsyuki
Owner

I actually agree with @kazuho:

fact makes me really scared about your aptitude as a submitter of an internet draft of MessagePack.
Do you think somebody without the knowledge of how the protocol is being used can make good decisions regarding the protocol? I do not think so.

I'm very scared by how easily the IETF tries to change the spec.

@kuenishi
Owner

I have no idea what "Smart Object Networking" exactly means. If cabo-san is talking about some small devices communicating via msgpack for power savings or so, why is the network communication protocol not shown as well? It looks like a traditional, authentic and awful classic protocol like CORBA or ASN.1 might be enough for that.
A serialization protocol itself has nothing to do with networking.

@cabo

@kazuho: I'm sorry, but that comment of yours was off the mark. People from my culture get very nasty from ad-hominem attacks, and rightly so. (I have been using RPC since before the term was first mentioned in a publication. I've seen dozens, if not hundreds of marshalling formats. msgpack just happens to get a larger number of design decisions right than other ones. I very much understand "how MessagePack is being used". My comment was about the weird processes in your organization where gathering additional support for msgpack would endanger its adoption in your organization. Indeed, I'm not used to that kind of thinking, so that's where I expressed my surprise.)

But then, I'm not submitting MessagePack to the IETF. I'm submitting a specification that just happens to be compatible with msgpack, because msgpack is almost good enough for my requirements (which are documented in that specification). If @frsyuki wants to join me in this, he is more than welcome. I don't want to lead this at all. I just want it to happen, on a reasonable time frame, with a technically reasonable outcome.

@kazuho: I'm sorry that you need to be so protective of your turf. And I'm also sorry that I don't speak Japanese. I just can't stop my work because of that. But I'm happy to stop engaging this community if @frsyuki, the inventor and recognized steward of msgpack, instructs me to do so. The resulting confusion will not get smaller, though.

So far, I'm happy that my actions might have catalyzed the process ever so slightly that might now lead to the resolution of msgpack's string issue. If you look back at the start of this long thread, there were people leaving (or not joining) the MessagePack community, or doing random incompatible forks, because the issue wasn't being addressed. Yes, fostering needed evolution brings some disturbance to this place. But the only really quiet place is a grave.

@cabo

I have no idea what "Smart Object Networking" exactly means.

Sorry for not expanding this term. Try http://tools.ietf.org/html/rfc6574 for a gentle starter.

CORBA or ASN.1 are not useful in this space.

Yes, we have some protocols for message interaction, but data formats are also protocols. I already mentioned http://tools.ietf.org/html/draft-jennings-senml as one such protocol. This can be encoded as text XML, binary XML (EXI), and JSON. I don't want to get stuck with XML when I need binary. That's why I'm here.

@kazuho

@cabo

I'm sorry if it sounded insulting, but I expect people responsible for standardizing an already-working protocol to be extremely sure of what they are doing. I hope that you are such a person as well.

In the case of MessagePack, not every part of the documentation is as clear as an RFC, so if the protocol were to be standardized, the vague parts of the current specification should be clarified according to how actual implementations work (or it would break compatibility), as @frsyuki has done in his newest proposal by defining how already-existing serializers / deserializers should work. To do so, you would need the help of @frsyuki and the community.

And if anyone requests a clarification of the meaning of the spec through the IETF, how are you going to handle it? That kind of work can only be done by the community.

In other words, it would be dangerous to submit a proposal to the IETF unless it is done by a person who knows the protocol very well (including how it is used), or unless the submitter has a good relationship with the community and the community is willing to pay the costs of standardization.

To me it seems that you are not the former. So is the latter the case? I'm afraid it is not. I am not the only one who is worried about bringing the proposal to the IETF at this moment.

It would be great if MessagePack became an RFC. But I think that should be achieved through the willing support of the community. Steamrolling a submission and then forcing the community to help brush up the specification is not a good way. And if such communication fails, then the specifications might fork.

Such a future isn't beneficial for anybody.

PS. as I described, it's not about language. It's about communication.

@take-cheeze

Hi.
I'm trying to implement a msgpack mruby gem, since I heard one doesn't exist.
I too got stuck on the raw section, and happily I found this issue.
I read your proposal (https://gist.github.com/frsyuki/5022569), but I'm still stuck.

Mostly from the line

don't have to validate a string on storing it.

I know validation is impossible due to too many limitations, but it makes me uncomfortable, because I would have to say "sorry, maybe I broke your data" if I ignored data validation.
So I debated with myself about how to solve it.
I ended up expanding the two types (binary and string) into three types:

  1. binary type (the new proposal's binary)
  2. maybe-string-or-binary type (the old raw)
  3. unicode string (the new proposal's string)

For example, in C++ I would map
1 to std::vector<char>,
2 to std::string,
3 to std::u32string or a std::basic_string specialization holding validated UTF-32 (using a Unicode iterator or other Unicode libraries).
Each type would be serialized back to the same type it was input as.

As another example, in mruby I would map all of 1, 2, and 3 to the standard mruby string,
and all three types would be serialized as the 2nd type.

All types can be cast to each other, except that the 2nd-to-3rd conversion must validate the data as a Unicode string.
I don't have a firm idea about the default behavior when unpacking the 2nd type in a language with Unicode support: whether to validate it or not.
I thought the 1st (binary) type might sometimes be unnecessary, but it's more rational to have that kind of internal type.

Adding one more type will affect the type chart, so that needs some solution.
Adding fix/16/32 size variants is not a good idea (at least this time); I suggest adding a single type whose length is encoded as a variable-length quantity.
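To make the three-type idea concrete, here is a minimal Python sketch (not from the thread; the `Raw` class name is hypothetical) of how the old raw type could defer the string/binary decision until the conversion point:

```python
class Raw(bytes):
    """Old-style msgpack raw: may hold text or binary; defer the decision."""

    def as_string(self):
        # 2nd -> 3rd conversion: must validate as UTF-8; may fail.
        return self.decode("utf-8")

    def as_binary(self):
        # 2nd -> 1st conversion: always succeeds.
        return bytes(self)


tokyo = Raw("東京".encode("utf-8"))
assert tokyo.as_string() == "東京"        # valid UTF-8: promotes to a string

png_magic = Raw(b"\x89PNG")
assert png_magic.as_binary() == b"\x89PNG"
# png_magic.as_string() would raise UnicodeDecodeError: not valid UTF-8
```

A language without a separate string type could map all three wire types to one such class and only validate on the 2nd-to-3rd conversion, as suggested for mruby above.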

btw, when I saw this page, I felt the msgpack community is hard to follow even though I can speak/write Japanese.

@cabo

@kazuho Thank you, apology is appreciated.

Again, I don't want to do this. But before we discuss who should do what, let me try to clear up some potential misconceptions about the IETF process.

If (big if!) this ever becomes an IETF WG document, then the WG chair (likely not me; I'm chairing other WGs) will select a document editor. That person will follow the instructions of the WG. Anyone is welcome to be part of the WG, members of the msgpack community included. You don't have to do anything beyond subscribing to a mailing list. Contributions by WG members are generally weighted by their technical merit. If an issue comes up, technical input will be solicited from the WG and other experts. If @frsyuki doesn't want to take on any of the roles I have described, his input will certainly be very welcome, as will be everybody else's input based on technical merit. So I'm sure that the community input will be heard, just as I'm right now trying to hear it in coming up with my own little spec.

The MessagePack spec is trivial enough that I don't see big problems in arriving at an interoperable specification. Hey, you haven't even started properly writing it up yet, and you get some interoperability. For now, my only contribution is trying to do this work. But, yes, the existing community is important, and nobody wants to do something stupid that alienates existing users.

I'm not sure "steamrolling" is a good description of what I'm doing here. Progress with msgpack has been stalling badly. The inaction has been alienating some people, and other people have been forking the spec already a couple of times.
Of course, all those that didn't care about addressing the issues are now the remaining msgpack community.
That is not necessarily a very healthy situation. I'm saying this because from your insider position that may be hard to notice.

@kuenishi
Owner

I don't want to get stuck with XML when I need binary. That's why I'm here.

@cabo This may be a reason why you chose MessagePack, but it cannot be the reason why you should write a draft and standardize a MessagePack-like serialization protocol. It does not look worth the time and cost for us (existing msgpack users, maintainers, and @frsyuki), because msgpack is implemented as free portable software, not hardware.

@mirabilos
@frsyuki
Owner

@cabo

You might have missed my comment above:

So, anyway, I think I need to ask you some questions:

I (and almost everyone here, I guess) didn't know the IETF process.
Is my understanding of an IETF WG correct?:

  • Subscribing to the mailing list and posting opinions is the only way to "propose" changes to the draft
  • The editor is the only person who can edit the draft
  • The editor will not be you or me. The editor will be a neutral person selected by the WG chair
  • The editor writes the draft by following the instructions of the WG

(If this ever becomes an IETF WG document.)

@frsyuki
Owner

I don't think the proposal (https://gist.github.com/frsyuki/5022569) is mature yet.
We'll wait for other comments, such as the one @take-cheeze posted.

@kazuho

@cabo

Thank you for describing the process. It will help many of us understand it.

But whatever the process is, it still seems to me that you are trying to force the community to pay the cost of the standardization process, or else the protocol might fork in two, which would certainly cause interoperability problems.

And I am afraid that the following statement of yours is wrong.

The MessagePack spec is trivial enough that I don't see big problems in arriving at an interoperable specification.

As many have described (and as is also documented in @frsyuki's proposal), adding a string type to MessagePack is a huge problem, since it introduces incompatibility. Under the current specification both strings and binary data are stored in the same "raw" type, but once we introduce a "string" type, we need to distinguish the two. In other words, one kind of data has to be moved somewhere else, and that introduces incompatibilities. This has been the reason why a string type has not been introduced for so long.

So editing a draft of MessagePack is not an easy thing to do if you care about interoperability. It cannot be done without the help of the community. So please reconsider, instead of trying to force the community to pay the cost of standardization.

Of course it would be easy if you did not care about existing implementations. But my understanding is that the IETF does take care of those. Is my understanding correct?

@cabo

@frsyuki Thanks for the questions.

Being on the mailing list is indeed the only way to contribute to a WG process. Now we don't know yet what WG may be handling this, so if it helps that this be a low-volume, very focused mailing list, we may try to create one.

An Internet-Draft (I-D) can be in two stages: personal draft or WG draft. A personal draft is edited by whoever wrote it (that's why my bpack draft is called draft-bormann-...; Bormann is my last name). A WG draft is edited by one or more editors (e.g., draft-ietf-core-coap is a draft of the CoRE WG). That is usually, but not always, the person who wrote the initial personal draft -- this is at the discretion of the WG chair.

I'm sure if you want to be the editor of the msgpack draft, that will be highly welcome, because this is your work. Remember, however, that such a position comes with the responsibility to carry out changes as the WG decides. It has happened in the past that an editor and the WG disagreed to the extent that the editor stopped editing or even made unwanted changes; in such cases a new editor is appointed (or, if the rough consensus we strive for is not achievable at all, the work is stopped). Many drafts have multiple editors. I wouldn't mind being an editor either, mainly because I think I'm pretty effective at the kind of editing work that remains to be done. We could both be editors; e.g., draft-ietf-core-coap has four editors. You could choose somebody else as a second editor (with the chair's approval). It doesn't hurt to have a pair of editors that represent different communities. Etc. (We are quite some ways from having to make the decision.)

@kazuho Whether the IETF cares about existing implementations depends on the specific work that needs to be done. E.g., OAuth 2 differs a lot from OAuth 1. But in that case, evolving the spec also was the (rough!) consensus of the people involved. If people from the msgpack community believe backwards compatibility is important, they should make that point, and I'm sure you will be heard. (Speaking just for myself: I personally don't have a requirement for backwards compatibility, but I'm also interested in maximizing interoperability.) I'm not sure how much impact the implementation considerations for backwards compatibility will need to have on the spec.

I'm sorry if I missed comments here, this github issue is a bit more active than I had planned for...

@cabo

@kuenishi — you act as if I created the problem that is being solved in this github issue. I didn't. This github issue is old, and it is actually a duplicate of github issue 13 that is even older. I just happened to write a specification that also needs a solution for the problem, and that picked up an earlier msgpack fork (binarypack) that was created because issue 13 was ignored. So I have no idea what cost I'm creating here.

@take-cheeze

@mirabilos

A “maybe” doesn’t have a place in a specification.

Sorry, I need to use my words more carefully.
I mean a type for the languages described as:

Languages which don't have types to distinguish strings and byte arrays (e.g.: PHP, C++, Erlang, OCaml)

in the background section of @frsyuki's proposal.

@cabo

Hmm, IIRC Erlang does have strings that are different from binary?
(And C++ indeed has vector vs. string etc...)
Maybe you do want to handle text strings using the same type that you use for byte strings.
But that is not a property of the language.

@frsyuki
Owner

@cabo Thank you for your detailed description. Let me ask one more question:

I'm the founder of a startup company in Silicon Valley. Success of the company is obviously the primary goal I have to run toward first. It means I can't spend lots of time creating an RFC standard for MessagePack.

Thus my question is: what kind of work is expected of the editor?

@cabo

@frsyuki As I said, I don't think there is much work to be done at this point. Your main job will be to be cognizant of all the issues and make sure the document is consistent. If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process.
E-Mail me for more details (my e-mail address is in the draft).

@frsyuki
Owner

@cabo Let me comment about C++:

C++'s std::vector<T> is used to represent an array whose elements are of type T. If T is an integer type, then it should be serialized using msgpack's Array type and Integer type.

Imagine that T is "char". It could be an array of very small integers. It could be a byte array. The serializer can't distinguish them. How about uint16_t? It could be a UTF-16 string.
C++ programmers are allowed to store byte arrays in std::string. Many programmers (including me) use it because it's easy to use; otherwise they need to allocate/free memory manually.

These languages are widely used. This is a fact. I wanted an intermediate data representation format that doesn't force extra work to set markers on these objects. MessagePack focuses on these problems. That is why adding a string type is a difficult decision.
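The ambiguity described here can be mirrored in Python terms as well; this is just an illustrative sketch, not code from the thread:

```python
# The same four small values can be read as an array of integers
# or as a byte buffer; the information content is identical.
values = [72, 105, 33, 0]
buffer = bytes(values)

assert list(buffer) == values

# Without a type hint from the programmer, a serializer cannot know
# whether to emit a msgpack Array of Integers or a single raw value,
# which is exactly why the raw type avoided forcing that choice.
```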

@frsyuki
Owner

@cabo I'm sorry but let me be cautious...

What do you mean "at this point"?:

I don't think there is much work to be done at this point

I think this is why @kazuho says "enforce":

If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process.

Finding a dependable person who will kindly spend time on an open source project seems hard. But you need an answer by Monday, right?

Please don't get me wrong. I appreciate your guidance and suggestions on this project.

I think I have the following 3 options, but none of them seems like an excellent idea:

1. I'll be the editor (hard to spend the time)
2. I try to find a dependable person to be the editor (takes time)
3. Just say "don't propose msgpack as a draft" (results in two confusing specs)
@cabo

@frsyuki Well, the draft is written. You can have that. (Later, somebody has to do the small things still listed as TBD. With your help, I could do those — I have done a lot of those before.)

No, I don't need an answer by Monday. The only thing that I want to do by tomorrow is submit the next version of my draft, the -01 (you have seen the draft draft), because the -00 documents something that we both agree is no longer what we want, and I want to replace that. Then the two-week submission moratorium in front of the Orlando IETF starts.

If you aren't ready to take over at this time (and I would certainly understand that), I'll just do the -01 again under my name. We/you/whoever can submit the next version on Monday, March 11, or later. We also can let the -01 stand around unchanged for a while, while we all figure out what to do.

Again, there can be multiple people responsible for a draft, so there are multiple configurations possible beyond those you have listed.
And we aren't even close to having a WG that would need an editor appointed.

So we have plenty of time to find the right way to do this.

If possible, I would like to be able to discuss potential options with other IETFers in Orlando, in the week commencing March 11. So if we can discuss potential ways forward within the next two weeks from now, that would help a lot. But even that isn't strictly necessary. (It would help because there will be a JSON BOF in Orlando.)

I apologize if I have created the impression that the Monday deadline is the end of the world. It is just an Internet-Drafts deadline, which we like to have in the IETF so we can all read the documents before arriving at the meeting place.

@kenn

@frsyuki I suggest that you open a new issue with https://gist.github.com/frsyuki/5022569 so that we can focus on the details of the spec and start the debate afresh there. At this point we need a separate place for those who are only interested in the proposed new spec itself.

I think this issue has been messed up and important people who should be reading this thread stopped reading. We can leave this thread open so that we can continue to come back when we need to talk about non-spec matters.

@cabo

How about issue 13?...

@frsyuki
Owner

@cabo OK. I misunderstood about the Monday deadline.

As the inventor and recognized steward of msgpack, I can NOT say, SO FAR, that you may use the name "MessagePack" as the name of your next draft (which will likely be submitted Monday). You'll submit the -01 under your own name on Monday.

I think this is all I need to say now.

I'll try to get advice from several people who have experience with standardization.

I think I need to get a good night's sleep so I can keep making decisions correctly....

@frsyuki
Owner

Oh, I forgot to mention the reason: we haven't been able to reach a consensus on this matter so far (meaning as of right now).

@cabo

@frsyuki: Thanks, I completely understand. This would indeed be too early.

I also subscribe to the view that a specification needs to be implemented and its ramifications understood before you really can have a solid consensus.

So, for now, have a good night's sleep!

@cabo

Re the C++ string interoperability issue: let me just point out that WG21 at least appears to be aware of the problem (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html proposes a char8_t (typedefed from unsigned char) and a u8string built from basic_string<char8_t>). But that may be of little help for now, at least until C++/TR2 comes out or it is supported by common libraries.

@kiyoto

Let's heed @kenn's advice. I just created a new issue #128

Also, this is the 300th comment! It's really time to start over. Imagine loading this page on an iPhone screen...

@Midar

@kiyoto It's still readable there, just did that today ;).

But yes, let's split this into several tickets and close this.

@kiyoto

@Midar
You have better vision than I do (no pun intended). I did that too, got a mild headache, and decided to create a new ticket =)

@kazuho

@cabo @frsyuki

Sorry, now that we have a new location to handle the issue, I have removed my last comment posted here and reposted as #128 (comment)

@frsyuki
Owner

Thank you. See #128 as well.

@tagomoris

hey, can anyone close this issue?

@rasky rasky closed this
@oberstet oberstet referenced this issue in tavendo/WAMP
Closed

Binary Payload Format #4

@niemeyer

My conclusion is that "it's better to support user-defined custom type rather than adding string type"

I'd be happy to have a string type, but custom types open a relevant can of worms that I'd like to stay away from. msgpack has been a great format precisely because it is simple and tight.

For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type.

The irony is that there are 11 ways in which the number 1 can be sent across the wire, and some of the libraries are unable to draw that distinction either. It seems people have done okay so far.

@tracker1

Gah, looking at the latest spec... why not just make the "string" type expected to be UTF-8 encoded IN THE SPEC, and keep the binary type if you want "RAW" whatever?
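That is essentially the direction the new proposal takes: a UTF-8 string type alongside a separate binary type. As an illustration only, here is a tiny hand-rolled packer/unpacker for short values using distinct type bytes in the style of the proposed str/bin split (fixstr tagged 0xa0|len, a bin type tagged 0xc4 with a one-byte length); this is a sketch of the idea, not a conforming implementation:

```python
def pack_short(value):
    """Encode a short str or bytes with distinct type bytes (sketch)."""
    if isinstance(value, str):
        data = value.encode("utf-8")
        assert len(data) < 32              # fixstr holds up to 31 bytes
        return bytes([0xA0 | len(data)]) + data
    if isinstance(value, bytes):
        assert len(value) < 256            # one-byte length field
        return bytes([0xC4, len(value)]) + value
    raise TypeError(type(value))

def unpack_short(buf):
    """Decode one value; strings come back as str, binary as bytes."""
    tag = buf[0]
    if 0xA0 <= tag <= 0xBF:                # fixstr: decode as UTF-8 text
        return buf[1:1 + (tag & 0x1F)].decode("utf-8")
    if tag == 0xC4:                        # bin: return raw bytes untouched
        return buf[2:2 + buf[1]]
    raise ValueError(hex(tag))

# The roundtrip from the top of this issue now preserves the type:
assert unpack_short(pack_short("東京")) == "東京"
assert unpack_short(pack_short(b"\x00\xff")) == b"\x00\xff"
```

With the two wire types separated, a decoder no longer has to guess whether a raw value should be UTF-8 decoded, which is the failure shown in the Python example at the top of this issue.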

@grinich grinich referenced this issue from a commit
Commit has since been removed from the repository and is no longer available.