Msgpack can't differentiate between raw binary data and text strings #121
Midar
Dec 4, 2012
This is a serious problem and what's preventing me from implementing MessagePack in my ObjC framework. I have to know whether I should create a string object or a data object. Creating a string object for everything will fail if it is not UTF-8 and always creating a data object will be very impractical.
MessagePack is advertised as compatible with JSON and as providing only what JSON provides - does that mean raw data actually means "UTF-8 string" in the author's view of things?
chakrit
Jan 14, 2013
First-class string support was proposed 2 years ago and the issue still hasn't been closed.
If msgpack really goes by the motto "It's like JSON", I think it needs to solve this and other related issues ASAP.
For the time being, though, I think going with UTF-8 and using some key convention to differentiate between binary blobs and strings might help.
E.g. append _data to every key that should be treated as binary, and otherwise decode as a UTF-8 string by default.
EDIT: Found a comment related to this issue on StackOverflow: http://stackoverflow.com/questions/6355497/performant-entity-serialization-bson-vs-messagepack-vs-json#comment15798093_6357042
Generally, the raw bytes are assumed to be a string (usually utf-8), unless otherwise expected and agreed to on both sides of the channel. msgpack is used as a stream/serialization format... and less verbose that json.. though also less human readable.
So I take this to mean that if we need raw bytes on the wire, we should implement our own addition to the protocol.
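The key-suffix convention described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the unpacker is assumed to hand back a flat map whose keys and values are all raw bytes, and `BINARY_SUFFIX` and `decode_raw_values` are hypothetical names, not part of any msgpack library.

```python
# Hypothetical convention from the comment above: keys ending in "_data"
# carry binary blobs; every other raw value is decoded as UTF-8 text.
BINARY_SUFFIX = "_data"


def decode_raw_values(raw_map):
    """Map raw byte keys/values to str keys and str-or-bytes values."""
    decoded = {}
    for raw_key, raw_value in raw_map.items():
        key = raw_key.decode("utf-8")
        if key.endswith(BINARY_SUFFIX):
            decoded[key] = raw_value                  # keep the blob as bytes
        else:
            decoded[key] = raw_value.decode("utf-8")  # treat as text
    return decoded


# What an unpacker without a string type might return (assumed shape).
unpacked = {b"name": b"caf\xc3\xa9", b"avatar_data": b"\x89PNG\r\n\x1a\n"}
result = decode_raw_values(unpacked)
```

Note that both sides of the channel have to agree on the suffix, and the decoding step needs to see the key, so it only works for values stored directly in maps.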
Midar
Jan 14, 2013
Appending _data or some convention like that means it's not possible to write a generic MsgPack implementation that can be used by any application. I need to know whether it's a string or binary data, as that means I need to handle it differently. And I need to know that before I pass the data to the application, because the application will get the wrong object otherwise.
If this bug is well known for over 2 years and there is no intention to fix it, then I guess we should just move on and forget MsgPack.
mirabilos
Feb 5, 2013
Actually @Midar JSON is not binary-safe and all strings are UTF-16 there (with UTF-8 being a valid representation thereof).
No idea on msgpack though, only stumbled here because of a discussion about salt…
Midar
Feb 5, 2013
@mirabilos Nobody is talking about JSON being binary-safe here. The problem is that while strings in JSON are UTF-8 (and UTF-16 internally), there is no specification on that in MsgPack whatsoever. It is simply impossible to know whether something is a string in UTF-8, a string in UTF-16, a string in ISO-8859-1, a string in KOI8-R or just some binary data. And that is the problem. This is completely different to binary-safety and has absolutely nothing to do with JSON.
DestyNova
Feb 19, 2013
Agreed, it is a problem which lends itself to ad-hoc workarounds. I've been using the Objective-C msgpack implementation to transfer mixed data between iOS devices and a server.
When "raw" data is detected, it tries to parse it as a UTF8 string first. The only solution I could think of was to patch msgpack-objectivec such that if the UTF8 parse produces a null result, then it simply returns that item as an array of bytes.
However, this heuristic will fail if the UTF8 parse just happens to produce a valid UTF8 string, or perhaps worse if parsing some binary data could cause "unspecified" behaviour.
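The decode-then-fall-back heuristic, and the failure case mentioned above, can be reproduced with a short Python sketch. `guess_raw` is a hypothetical helper written for this illustration; it is not the code from msgpack-objectivec.

```python
def guess_raw(raw):
    """Return text if the bytes are valid UTF-8, otherwise the bytes unchanged."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw


# Bytes that are not valid UTF-8 correctly stay binary.
as_binary = guess_raw(b"\xff\xfe\x00\x01")

# Binary data that happens to be valid UTF-8 is misclassified as text.
misclassified = guess_raw(b"\x00\x01\x02\x03")
```

The second call is the problem case: the sender meant four raw bytes, but the receiver gets a string object.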
chakrit
Feb 20, 2013
Time for a new fork?
frsyuki
Feb 20, 2013
Member
OK. Sorry for being late.
As the initial designer of the MessagePack format, I think msgpack should not have a string type.
I need to write a longer article, but let me describe some points so far:
- data format should be isolated from programs
  - it depends on the application whether a sequence of bytes is treated as a byte array or a string.
  - the lifecycle of a data format is usually longer than that of programs:
    - example1: stored data should stay consistent even when programs change
    - example2: network protocols should be compatible with old programs
  - thus data should not have a string type information bit, and applications should map a sequence of bytes into a string type only when it's necessary
- successfully stored data must be read successfully
  - to guarantee that, the packer should validate strings before storing them if it stores them as strings
  - implementing validation code is relatively hard and makes it difficult to port msgpack to other languages/architectures
  - data may not be trusted. thus the unpacker should also support string validation, at least optionally
  - supporting multiple encodings makes it even harder
  - thus the msgpack library should not deal with encoding validation, including a string type bit
- it isn't a problem in statically typed languages
  - because these languages need to specify the data type before handling the deserialized (=dynamically typed) data either way
  - see the C++, Java and D implementations, which have a type conversion mechanism
  - users think the Java implementation's Value class (by @muga) is useful, and it doesn't have the byte array/string problem at all
- even with dynamically typed languages, some committers don't think it's causing problems
  - I think only JavaScript and Objective-C have problems
  - JavaScript historically doesn't have a byte array type. it needs special handling either way
- I suggest that Objective-C/JavaScript implementations adopt the following solution:
  - the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits NSString
  - the object contains a validated UTF-8 string
  - if the validation failed, it's nil or something that tells us the validation failed
  - NSStringOrData#data returns the original byte array
- supporting user-defined custom types is better than a string type
- 0xc1 is considered to be reserved for a string type
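The NSStringOrData idea can be illustrated roughly in Python. This is a sketch only: `StringOrData` is a hypothetical class, and it uses composition (a `text` attribute that is `None` on failure) rather than inheriting from the string type as the suggestion above describes for NSString.

```python
class StringOrData:
    """Holds raw bytes plus a validated UTF-8 view of them, if one exists."""

    def __init__(self, raw):
        self.data = bytes(raw)          # the original byte sequence, always kept
        try:
            self.text = self.data.decode("utf-8")
        except UnicodeDecodeError:
            self.text = None            # validation failed: not UTF-8 text

    @property
    def is_text(self):
        return self.text is not None


text_like = StringOrData(b"caf\xc3\xa9")
blob = StringOrData(b"\xff\x00")
```

The application still has to decide which view it wants; the wrapper only defers that decision instead of making it in the unpacker.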
frsyuki
Feb 20, 2013
Member
It took time for me to build my opinion.
My conclusion is that "it's better to support user-defined custom type rather than adding string type"
muga
Feb 20, 2013
Member
Hi,
I'm a developer of msgpack-java. The above is a well-known (and complicated) problem.
+1
In my opinion,
- the serialization core library should not implement character encoding.
- the serialization format should not include charset information.
- having a utility library on top of the core library is a good idea.
If msgpack had a string type, the format and the library implementations would become complicated, which makes it harder to keep the format and the libraries compatible. It is really difficult to design a serialization format that works for any charset, and if it had bugs, we would have to fix not only the format but also the libraries. That is a critical issue.
Business logic on the application side should handle character encoding. But having extension hook points in a msgpack library is a good idea, so that you can extend encoding handling using other libraries.
methane
Feb 20, 2013
Member
-0.5 to adding string type.
For example, JSON has Integer and Number types. An application has to handle Number when expecting Integer.
If msgpack had a string type, an application would have to handle raw when expecting string and handle string when expecting raw.
So I feel an inter-language serialization format should have a minimal set of types.
chakrit
Feb 20, 2013
I disagree completely.
UTF-8 and UTF-16 are very well-known standards that have been around for a very long time. All new implementations these days should support Unicode string encoding from day one. There shouldn't even be a question of which character encoding to use, especially when msgpack wants to be just "like JSON".
There are well-known UTF string encoding routines available on nearly every platform. It's not like every implementation has to roll another character encoding routine from zero; they could just use whatever's available on their platform of choice. In my opinion, implementing an encoder/decoder is a non-problem: don't reinvent the wheel.
Think of this as referencing another standard in your piece of work instead of having to specify all character encoding mechanism yourself.
String is a very fundamental data type required by most (if not all) applications these days. And let me repeat this: "It's like JSON." is printed in an H2 at the very top of the msgpack website, and your specification does not include the simplest thing that is a String. Why?
Also, the problem exists regardless of whether or not msgpack has a string specification. In my opinion, it is even worse to not specify the exact character encoding in your wire protocol.
Suppose you have two applications which both use msgpack, yet they can't communicate because the msgpack protocol itself does not specify how a string should be encoded, leaving room for incompatibility, whereas if the msgpack specs would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8", this problem wouldn't have existed from the start.
Let me suggest this:
1. You should simply add a String data type. It is so fundamental that it should not be left out, especially when you are advertising msgpack as a faster/smaller JSON. I suggest you start with UTF-8 and/or UTF-16 as the encoding (and personally, I don't think there is any need to support more encodings than these two). If anyone needs absolute speed, they can still use the old raw-bytes type with their own encoding and accept any incompatibilities that might arise.
2. If you insist on not having a String data type, then there should be better documentation and "recommended practice" with regard to handling strings and the encoding to use because, as I've repeated, String is a very fundamental data type that should've been specified in the spec, and there are many platforms where both String and a normal Buffer (or byte[] array) data type are in active use, such as JS/node.js and ObjC/iOS. Leaving this out just causes confusion between parties trying to implement the same protocol.
TL;DR --- I think this is simply a matter of properly documenting the "best practice" or what's expected of an implementation, rather than just throwing out a spec that defines only binary blobs and denies all string support for fear of character encoding issues, with zero pointers on how exactly to implement strings should you need them (and you will definitely need them; what application does not use a string?)
mzp
Feb 20, 2013
Member
Hi, I'm a developer of msgpack-ocaml. I disagree with adding a string type.
One of the benefits of msgpack is that it is multi-platform, so we should be careful about adding a new type.
And a string type is not that attractive. Although a string type is fundamental in many languages, a UTF-8-encoded string type is much less so. For example, OCaml doesn't assume any encoding for strings.
I don't have a strong opinion about "recommended practice", but I think that it is each application's task, not msgpack's.
frsyuki
Feb 20, 2013
Member
@chakrit I don't think supporting UTF-8 encoding/decoding/validating is easy even if there're some well-known libraries. Remember msgpack focuses on cross-language. For example, I don't think smalltalk supports FFI by default. In JavaScript for browsers, @uupaa implemented IEEE754 and this complex code will be needed again to support UTF-8 (or UTF-16):
https://github.com/msgpack/msgpack-javascript/blob/master/msgpack.js#L135
if the msgpack specs would just say "here, use this if you need a string, and don't forget to encode it in proper UTF-8"
I agree it's a good idea. I added a comment on the spec: http://wiki.msgpack.org/display/MSGPACK/Format+specification
At least Java and Ruby implementations (written by me) already use UTF-8 to serialize strings.
Regarding 1., JSON doesn't support a binary type. I don't think you do, but do you mean msgpack should drop the Raw type in order to be "like JSON"? The problem is that some users want to handle strings and binaries at the same time and want to tell the difference transparently. If we want to use msgpack as a replacement for JSON, users can assume all Raw objects are strings. Some msgpack libraries, such as the Python implementation, support a string-only mode (this is a nice feature, I think). I want to add the feature to msgpack-ruby v0.5.x as well.
Regarding 2., to be exact, it's a problem of the JS/node.js and ObjC/iOS implementations. I mean that String is not a fundamental type in some languages such as C, C++, Ruby (at least 1.8), Erlang, and Lua (actually significant languages...right?). In Python and Ruby 1.9, the difference between strings and binaries is unclear in terms of both implementation and culture. The MessagePack format itself doesn't consider the mappings between msgpack's types and language types. Implementations take the role of projecting msgpack's types onto language-specific types (this is an essential concept of msgpack). Thus, as I mentioned above, the JS and ObjC implementations should document that specifically.
....But anyway, I agree that the msgpack documents should mention the "best practice to handle strings in certain dynamically-typed languages such as Objective-C or JavaScript."
So, TL;DR...the msgpack project lacks some important documents such as: why msgpack doesn't have a string type, guidelines for implementations on how to handle strings, and the best practice for handling strings. // TODO FIXME
Midar
Feb 20, 2013
I strongly disagree with the position not to add the most basic type: a string.
Let's assume MsgPack is Layer 1 and our protocol is Layer 2, encoded in MsgPack. So, when I want to decode MsgPack to objects (which is Layer 1, remember?), I also need to have knowledge about Layer 2 (because otherwise I can't know what it is)? Sorry, but this is completely retarded. This is like "In order to parse TCP, you need to parse the protocol that's wrapped inside TCP. So, if you want to parse TCP, you need to parse every protocol in existence like HTTP, XMPP, SMTP, IMAP, etc.".
Saying that UTF-8 is too complicated is basically admitting defeat. If you can't implement those 20 lines of C code required for de- and encoding UTF-8, you probably shouldn't write any code at all. Especially as almost all languages have already implemented UTF-8 and you can just use it.
The strangest thing is the reason: You're saying you don't want to have a string type out of fear of being not interoperable. Well, actually, you kill interoperability by not having a string type, as therefore it's not possible to parse Layer 1 in many languages as you don't know which encoding is used or if it even is a string. There is no way to have a look at the data without some kind of schema and thus looking at Layer 2, which you really shouldn't. This violates basic rules of software design!
The advantage of MsgPack over Protocol Buffers could have been that it does not need a schema. But with this decision, MsgPack has no advantage over Protocol Buffers. It's not portable and it needs a schema, two things you don't want from a general purpose serialization format.
Saying that UTF-8 is a problem for interoperability is really the biggest nonsense I've heard so far. Almost all modern network protocols require UTF-8. XML requires UTF-8 and works on many more platforms and languages than MsgPack ever will. Requiring UTF-8 eliminates the pain of having to support multiple encodings. There's a reason the world moved to UTF-8…
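For readers who want to judge the size of the encoding side for themselves, here is a minimal Python sketch of a hand-written UTF-8 encoder, checked against the built-in codec. `encode_code_point` and `encode_utf8` are hypothetical names; the sketch covers encoding only, not the decode-side validation (overlong forms, truncated sequences) that is also debated in this thread.

```python
def encode_code_point(cp):
    """Encode one Unicode code point as UTF-8 bytes."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value: %r" % (cp,))
    if cp < 0x80:                       # 1 byte: 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx followed by 3 x 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def encode_utf8(text):
    """Encode a whole string by concatenating the per-code-point encodings."""
    return b"".join(encode_code_point(ord(ch)) for ch in text)


sample = "h\u00e9llo \u20ac \U0001F600"
matches_builtin = encode_utf8(sample) == sample.encode("utf-8")
```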
repeatedly
Feb 20, 2013
Member
Hello, I'm an author of msgpack-d.
I have never wanted string type in my msgpack experience.
In D, string <-> byte conversion has no problem because the application has already normalized the invalid string before serialization.
In addition, in my RPC experience, having many serialization types is bad: it causes a lack of interoperability.
This issue is probably an IDL or application-layer problem.
P.S.
If introducing the string type, then supporting user-defined custom types is good for me.
Because this approach resolves that someone says "I want this type in msgpack!"
rasky
Feb 20, 2013
@frsyuki @methane I am the original issue opener. I have posted a clear Python example that shows that msgpack is completely broken in Python, as a very simple data structure doesn't load back. So I can't see how you can think that it is not broken in Python at the very least.
I know there is an option to return byte arrays by default, and that's totally useless, because it applies to all of them.
Also, when you say "In Python and Ruby 1.9, the difference of strings and binaries is unclear in terms of both implementations and cultural aspects", I really don't know what you are referring to. The difference between strings and binaries is very clear in Python (and Ruby, and Java, and Objective C and MANY of the modern languages); there are tons of documentation on it, tons of material, tons of talks. I am surprised that you can think that it is unclear.
I think @Midar nailed it. The problem is that, without a string type, MsgPack always needs a schema/IDL to be useful, because it cannot convert back to native data structures without a schema telling it how to. Vice versa, if you add a string type, it becomes possible (most of the time) to avoid a schema.
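The schema argument can be made concrete with a toy tagged format in Python. Everything here is hypothetical: `TAG_STR`, `TAG_BIN`, and the 4-byte length prefix are invented for the illustration and are not MessagePack's wire format.

```python
TAG_STR = 0x01   # hypothetical tag: payload is UTF-8 text
TAG_BIN = 0x02   # hypothetical tag: payload is opaque bytes


def pack(value):
    """Serialize a str or bytes value as tag + 4-byte length + payload."""
    if isinstance(value, str):
        tag, payload = TAG_STR, value.encode("utf-8")
    elif isinstance(value, bytes):
        tag, payload = TAG_BIN, value
    else:
        raise TypeError("only str and bytes are supported in this sketch")
    return bytes([tag]) + len(payload).to_bytes(4, "big") + payload


def unpack(buf):
    """Recover the native type from the tag alone, with no external schema."""
    tag = buf[0]
    length = int.from_bytes(buf[1:5], "big")
    payload = buf[5:5 + length]
    if tag == TAG_STR:
        return payload.decode("utf-8")
    if tag == TAG_BIN:
        return payload
    raise ValueError("unknown tag: %r" % (tag,))


round_tripped = [unpack(pack(v)) for v in ["abc", b"abc"]]
```

With a single raw tag, both values would pack to the same payload, and the receiver would need outside knowledge to tell them apart.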
frsyuki
Feb 20, 2013
Member
I need to mention another problem with UTF-8 (and Unicode).
UTF-8 validation includes the NFD/NFC problem. For example, "\u00e9" (NFC) and "\u0065\u0301" (NFD) represent exactly the same character (you may know that Mac OS X uses NFD to represent file names, and it sometimes causes trouble with Linux, which usually uses NFC). If msgpack had a string type, should implementations normalize characters to NFC, or NFD?
UTF-8 has redundancy as well: 0x2F could also be encoded as the overlong sequence 0xC0 0xAF. Should deserializers reject these bytes? Or normalize them into another character?
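Both points can be checked with Python's standard library. This is a sketch of what one runtime does, not a statement about what a msgpack implementation should do; `is_strict_utf8` is a hypothetical helper.

```python
import unicodedata

nfc = "\u00e9"          # e with acute accent as one precomposed code point
nfd = "\u0065\u0301"    # "e" followed by a combining acute accent

# The two spellings compare unequal as strings and as UTF-8 bytes...
differ_as_text = nfc != nfd
differ_as_bytes = nfc.encode("utf-8") != nfd.encode("utf-8")

# ...but become identical once both are normalized to the same form.
equal_after_nfc = unicodedata.normalize("NFC", nfd) == nfc
equal_after_nfd = unicodedata.normalize("NFD", nfc) == nfd


def is_strict_utf8(raw):
    """True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Python's decoder rejects the overlong two-byte form of "/" (0x2F).
overlong_accepted = is_strict_utf8(b"\xc0\xaf")
plain_slash_accepted = is_strict_utf8(b"/")
```

In this runtime the overlong form is rejected outright, while normalization is left to the caller; other languages' libraries may behave differently.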
methane
Feb 20, 2013
Member
@rasky I agree with you that adding a string type helps Pythonistas.
But msgpack is an inter-language communication format.
We should be able to communicate with weakly typed languages like PHP or JavaScript.
If you want to serialize Python data types perfectly, you can use pickle instead.
It can serialize and restore datetime, tuple and many other types correctly.
rasky
Feb 20, 2013
@methane I'm using msgpack specifically because it's an inter-language communication format. I communicate between Python and Objective C, and the Objective C msgpack library is totally broken because the string type is missing; in fact, the standard Objective C object/dictionary construct must have strings as keys, and thus the msgpack Objective C library tries to convert everything into strings, thus breaking the transmission of binary data. If msgpack had a separate string data type, the Objective C library would know what to do.
@frsyuki First, I assume that all languages that implement native Unicode strings will have libraries to handle this either way. My take on this is that msgpack shouldn't do anything. You convert from Unicode into UTF-8 using the standard behavior of the language, and convert back again with the standard behavior. The problems you cite arise only if someone is trying to use UTF-8 as-is, so they will arise in languages where Unicode is not implemented. I think that, if an implementer is going to communicate between a Unicode-rich language and a Unicode-poor language, it is up to the implementer himself to take care of these small details.
Midar
Feb 20, 2013
@frsyuki None. That is not part of the serialization. Comparing strings is a completely different domain. You could convert it from UTF-8 to your preferred charset and compare it in that and lose internationalization - that's up to you. Or you could put Unicode in your raw binary and still have those problems. Completely up to you. You don't lose anything by having a type for UTF-8 strings. That's just the transfer encoding, you can recode it to whatever you want.
@methane Do you even hear what you're saying?
> But msgpack is an inter-language communication format.
> We should communicate with weakly typed languages like PHP or JavaScript.
So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-language, you should recognize that the only way to have that is to add support for a string type.
@rasky Actually, no, everything can be a key in a dictionary as long as it implements -[copy], -[hash], and -[isEqual:]. But who wants to use binary keys in some code? That would always be "Get the bytes from an NSString and create NSData and then pass that to objectForKey:". :)
frsyuki
Feb 20, 2013
Member
@Midar I couldn't catch what Layer 2 means... do you have some examples? I guess Layer 2 has 2 options:
1. Layer 2 also doesn't tell strings from byte arrays.
2. Layer 2 implements its own type system on top of msgpack's type system.
Have you implemented a UTF-8 validator (which will be required by serializers)? I don't think it fits into 20 lines of C code...
Midar
Feb 20, 2013
@frsyuki Layer 2 is what you put inside MsgPack. A protocol that says "at this place I expect an array, a string, some bytes". Without that knowledge from a protocol that is completely apart from MsgPack, you can't parse MsgPack, and that's really broken.
Yes, I have implemented UTF-8 checking, encoding and decoding. It's easily possible in 20 lines each (decoding and encoding). Here are both, with a lot of wasted space that could easily be reduced:
https://webkeks.org/git?p=objfw.git;a=blob;f=src/OFString.m;h=cc873dab3d178abd0f4ed94546a5b0d74add8171;hb=HEAD#l77
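To give a feel for the size of such a check, here is a rough Python sketch of a standalone UTF-8 validator. It is not the ObjFW code linked above; it is a fresh illustration based on the well-formed byte ranges of RFC 3629 (no overlong forms, no surrogates, nothing above U+10FFFF), and the function name `is_valid_utf8` is an assumption of this example.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Check well-formedness of UTF-8 using the RFC 3629 byte ranges."""
    i, n = 0, len(data)
    while i < n:
        b0 = data[i]
        if b0 < 0x80:                      # ASCII, single byte
            i += 1
            continue
        if 0xC2 <= b0 <= 0xDF:             # 2-byte sequence (0xC0/0xC1 are overlong)
            need, low, high = 1, 0x80, 0xBF
        elif 0xE0 <= b0 <= 0xEF:           # 3-byte sequence
            need = 2
            low = 0xA0 if b0 == 0xE0 else 0x80    # rejects overlong forms
            high = 0x9F if b0 == 0xED else 0xBF   # rejects surrogates
        elif 0xF0 <= b0 <= 0xF4:           # 4-byte sequence
            need = 3
            low = 0x90 if b0 == 0xF0 else 0x80    # rejects overlong forms
            high = 0x8F if b0 == 0xF4 else 0xBF   # rejects > U+10FFFF
        else:                              # stray continuation or invalid lead byte
            return False
        if i + need >= n:                  # sequence is truncated
            return False
        if not (low <= data[i + 1] <= high):
            return False
        for k in range(2, need + 1):       # remaining continuation bytes
            if not (0x80 <= data[i + k] <= 0xBF):
                return False
        i += need + 1
    return True


print(is_valid_utf8("h\u00e9llo".encode("utf-8")))
print(is_valid_utf8(b"\xc0\xaf"))
```

The check comes to roughly two dozen lines of logic, which is in the same ballpark as the estimate in the comment, although a C version would of course look different.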
rasky
Feb 20, 2013
Can you please explain WHY you need UTF-8 validations?
In unicode rich languages, you will convert UTF-8 into Unicode, and validation is performed by the language itself (or its standard library). No code to write.
In unicode poor languages, there is no Unicode data type, so you leave UTF-8 as-is.
Why do you ever need to include a UTF-8 validator?
frsyuki
Feb 20, 2013
Member
@Midar MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.
Please don't think one type system works perfectly for all languages. All implementations have to manage the inconsistency between language types and msgpack types.
The question is which causes more trouble: a) projecting strings and byte arrays into the Raw type, or b) projecting the Raw type into strings or byte arrays.
I understand supporting UTF-8 has lots of merits. Why do you think the troubles caused by having UTF-8 are manageable compared to not having UTF-8?
frsyuki
Feb 20, 2013
Member
@rasky I suggested a way to handle the binary-or-string type in dynamically typed languages without a schema:
- I suggest that Objective-C/JavaScript implementations adopt the following solution:
  - the unpacker deserializes a byte sequence as an object of an NSStringOrData class which inherits from NSString
    - the object contains a validated UTF-8 string
    - if the validation failed, it's nil or something from which we can tell that the validation failed
  - NSStringOrData#data returns the original byte array
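The idea in this proposal can be sketched in Python as a string subclass that also keeps the original bytes. This is only an analogy: the class name `StringOrData` and its attributes are hypothetical, and, as the reply further down points out, subclassing NSString in Objective-C is a different matter from subclassing `str` in Python.

```python
class StringOrData(str):
    """Hypothetical 'string or data' value: behaves like a str when the
    bytes are valid UTF-8, and always exposes the original bytes."""

    def __new__(cls, raw: bytes):
        try:
            text = raw.decode("utf-8")
            is_text = True
        except UnicodeDecodeError:
            text = ""            # stand-in for "nil": no valid string available
            is_text = False
        self = super().__new__(cls, text)
        self.data = bytes(raw)   # the original byte array, untouched
        self.is_text = is_text   # whether UTF-8 validation succeeded
        return self


name = StringOrData(b"caf\xc3\xa9")
blob = StringOrData(b"\xde\xad\xbe\xef")

print(name, name.is_text, name.data)
print(repr(str(blob)), blob.is_text, blob.data)
```

Note that the receiver still has to decide which view to use; the wrapper postpones the string-versus-binary decision rather than answering it.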
frsyuki
Feb 20, 2013
Member
@rasky > Can you please explain WHY you need UTF-8 validations?
Because:
- successfully stored data must be read successfully
Imagine that an invalid UTF-8 string is stored on a disk with information "this is a UTF-8 string"
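The concern here can be illustrated with a toy record store in Python. Everything in this sketch is hypothetical (the `"str"` tag, `write_record`, `read_record`); it only shows that if a writer labels bytes as a string without validating them, a strict reader fails later, whereas validating on write moves the failure to the writer.

```python
def write_record(tag: str, payload: bytes, validate: bool) -> tuple:
    """Store a (tag, payload) pair; optionally validate 'str' payloads."""
    if validate and tag == "str":
        payload.decode("utf-8")   # raises UnicodeDecodeError on invalid UTF-8
    return (tag, payload)


def read_record(record: tuple):
    """Strict reader: 'str' records must decode as UTF-8."""
    tag, payload = record
    if tag == "str":
        return payload.decode("utf-8")
    return payload


def raises_decode_error(func, *args) -> bool:
    """Helper: report whether calling func(*args) raises UnicodeDecodeError."""
    try:
        func(*args)
    except UnicodeDecodeError:
        return True
    return False


bad = b"\xde"                                  # not valid UTF-8 on its own
stored = write_record("str", bad, False)       # writer skips validation

print(raises_decode_error(read_record, stored))             # failure at read time
print(raises_decode_error(write_record, "str", bad, True))  # failure at write time
```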
Midar
Feb 20, 2013
> MessagePack is for all of weakly-typed, strongly-typed, dynamically-typed and statically-typed languages.
This is exactly what I'm saying, which is why I don't get why on one hand you are against a string type which is required for a lot of languages, but on the other hand praise interoperability - which you just destroyed by not having a string type!
> Why do you think the troubles caused by having UTF-8 are manageable compared to not having UTF-8?
You still failed to show us where exactly UTF-8 should cause trouble for MessagePack. What exactly makes UTF-8 harder for you? Again, if you care about internationalization as much as about interoperability, you can convert it to some other non-Unicode encoding. If you use a Unicode encoding, you have these "problems", as you call them, anyway.
Midar
Feb 20, 2013
> unpacker deserializes byte sequence as an object of NSStringOrData class which inherits NSString
> the object contains a validated UTF-8 string
> if the validation failed, it's nil or something we can tell that the validation failed
> NSStringOrData#data returns the original byte array
Oh great, now I have to implement another string class (remember: NSString is just a class cluster. If I subclass it, I have no implementation!) just because you have never heard about separation of layers? Sorry, but no, just no. If it stays this way, I just won't implement MsgPack, and I'm sure many others won't either. Not because they don't like the idea, but simply because you made it impossible to parse it in a sane manner.
frsyuki
Feb 20, 2013
Member
@rasky > I think that, if an implementer is going to communicate between a unicode-rich language and an unicode-poor language, it is up to the implementer himself to take care of these small details.
My proposal is that msgpack doesn't support string types but msgpack supports user-defined types. It means implementer can add string type if implementer himself needs it.
Do you think this does not work?
methane
Feb 20, 2013
Member
> So, in the first sentence, you say it should be inter-language. And in the second you say it should only be for weakly typed languages? If you say you want it inter-language, you should recognize that the only way to have that is to add support for a string type.
I'm sorry about my poor English.
What I want to say is that msgpack should be designed for many languages, not only for languages that distinguish strings from bytes.
Midar
Feb 20, 2013
@frsyuki Yes, I think this does not work, as everybody will come up with his own string type, and there will be no interoperability. Please stop claiming that not implementing a string type improves interoperability, when it clearly does the exact opposite, as has been stated by many, and is actually an issue which prevents many from using MsgPack or taking it seriously.
@methane Yes, I agree. It should work with all languages. But for that, a string type is required. For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.
frsyuki
Feb 20, 2013
Member
@rasky For example, in Ruby (1.9), the following code returns a String object with UTF-8 encoding information:
require 'uri'
s = URI.unescape("%DE")
p s.encoding
This easily happens in many applications including Rails. Is this a string, or binary? I think it depends on how applications handle this object.
Additionally, the following code returns the same object as well:
require 'msgpack'
s = MessagePack.unpack("\xA1\xDE")
p s.encoding
Midar
Feb 20, 2013
@frsyuki And exactly that is the problem. It depends on how the application handles it! There is no way to know that without knowledge from the Layer 2 protocol! Why do you insist on ignoring basic principles of software design?
methane
Feb 20, 2013
Member
> For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.
Then how should they serialize such binary data?
When I send a string from Python to PHP, PHP may send it back to Python as the binary type...
Midar
Feb 20, 2013
@methane By having an optional parameter how it should be treated in encoding, by wrapping it into some object, etc. There are many ways to overcome this in languages which don't make a difference. There is absolutely no way to overcome not having a string type in languages which do make a difference.
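The two workarounds named in this comment (an optional parameter and a wrapper object) might look roughly like the following Python sketch. It models a language with a single byte-string type by passing `bytes` everywhere. The `Binary` class, the `as_binary` parameter, and the one-letter `S`/`B` tags are all hypothetical and are not msgpack's wire format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Binary:
    """Hypothetical wrapper marking a byte string as binary data."""
    payload: bytes


def pack_value(value, as_binary: bool = False) -> bytes:
    """Toy encoder: one tag byte (b'S' or b'B') followed by the payload.

    Plain byte strings are treated as text by default; the caller opts in
    to the binary tag either per call or by wrapping the value.
    """
    if isinstance(value, Binary):
        return b"B" + value.payload
    if as_binary:
        return b"B" + value
    return b"S" + value


print(pack_value(b"hello"))                      # default: tagged as string
print(pack_value(b"\x00\x01", as_binary=True))   # optional parameter
print(pack_value(Binary(b"\x00\x01")))           # wrapper object
```

Either way, the burden of choosing falls on the sender's application code, which is where the knowledge about the data actually lives.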
frsyuki
Feb 20, 2013
Member
@Midar Whether an object should be a byte array or string depends on applications.
I said that the lifecycle of applications (programs) is shorter than that of data, and that data should be isolated from applications. Do you agree with this opinion?
Applications could be changed. But data should not be changed at the same time. Applications may consider that the data is a byte array which was considered string before. But we can't change stored data. We can't update the old code in the same network at the same time.
chakrit
Feb 20, 2013
@methane you are describing the exact problem that can be solved by adding a proper string type.
Python -> STR_XXX -> PHP -> BIN_XXX -> Python
Now Python knows it is getting some binary.
And the same Python server can then do:
Python -> STR_XXX -> Node.js -> STR_XXX -> Python
Now Python knows it is getting a UTF-8 string.
Now, imagine the above scenario without the String type.
Python -> BIN_XXX -> PHP -> BIN_XXX -> Python
Now Python does not know whether it is getting a binary or a string (because it does not and should not need to know that the source language is PHP).
Python -> BIN_XXX -> Node.js -> BIN_XXX -> Python
Now Python does not know whether it is getting a binary or a string (because it does not and should not need to know that the source language is Node.js).
We have this problem, and there's no way to tell, precisely because you don't have the String type in msgpack!
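The round trips above can be mimicked with a toy tagged encoding in Python. This is a sketch under stated assumptions: the tags `S`, `B` and `R` are invented for this example (they stand for string, binary, and the single old-style raw type) and do not correspond to real msgpack type bytes.

```python
def encode(value, with_string_type: bool) -> bytes:
    """Toy encoder: one tag byte followed by the payload bytes."""
    if isinstance(value, str):
        tag = b"S" if with_string_type else b"R"
        return tag + value.encode("utf-8")
    tag = b"B" if with_string_type else b"R"
    return tag + bytes(value)


def decode(wire: bytes):
    """Toy decoder: only the 'S' tag tells us the payload is text."""
    tag, payload = wire[:1], wire[1:]
    if tag == b"S":
        return payload.decode("utf-8")
    return payload           # 'B' and 'R' both come back as bytes


text, blob = "h\u00e9llo", b"\x00\xff"

# With a string type, both values come back as the type they were sent as.
print(type(decode(encode(text, True))), type(decode(encode(blob, True))))

# With only a raw type, the text comes back as bytes and the receiver
# cannot tell it apart from real binary data.
print(type(decode(encode(text, False))), type(decode(encode(blob, False))))
```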
frsyuki
Feb 20, 2013
Member
@Midar @chakrit > For languages which don't care about whether something is a string or a binary, nothing will change - they can just interpret a string as binary.
This doesn't work.
For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type. Let's say it sends data in binary type. Then the PHP program can't send requests to the server.
kuenishi
Feb 24, 2013
Member
> I don't want to get stuck with XML when I need binary.
@cabo This may be a reason why you chose MessagePack, but it cannot be the reason to write a draft and standardize a MessagePack-like serialization protocol. It does not look worth the time and cost to us (existing msgpack users, maintainers and @frsyuki), because msgpack is implemented as free portable software, not hardware.
mirabilos
Feb 24, 2013
take_cheeze dixit:
- maybe string or binary type(old raw)
A “maybe” doesn’t have a place in a specification.
frsyuki
Feb 24, 2013
Member
You might have missed my comment above:
So, anyway, I think I need to ask you some questions:
I (and almost everyone here, I guess) didn't know the IETF process.
Is my understanding of the IETF WG process correct?
- Subscribing the mailing list and posting opinions is the only way to "propose" changes of the draft
- The editor is the only person who can edit the draft
- The editor will not be you or me. The editor will be a neutral person who is selected by the WG chair
- The editor writes a draft by following on the instructions of the WG
(If this ever becomes an IETF WG document.)
frsyuki
Feb 24, 2013
Member
I don't think the proposal (https://gist.github.com/frsyuki/5022569) is mature yet.
We'll see other comments as @take-cheeze commented.
kazuho
Feb 24, 2013
Contributor
Thank you for describing the process. It will help many of us understand it.
But whatever the process is, it still seems to me that you are trying to force the community to pay the cost of the standardization process, or else the protocol might fork into two, which would cause interoperability problems for sure.
And what I am afraid of is that your following statement is wrong.
> The MessagePack spec is trivial enough that I don't see big problems in arriving at an interoperable specification.
As many have described (and as is also documented in @frsyuki's proposal), adding a string type to MessagePack is a huge problem since it introduces incompatibility. Under the current specification both strings and binary data are stored within the same "raw" type, but once we introduce a "string" type, we have to distinguish the two. In other words, one kind of data or the other has to be moved somewhere else, and that introduces incompatibilities. And this has been the reason why a string type has not been introduced for so long.
So editing a draft of MessagePack is not an easy thing to do, if you care about interoperability. It cannot be done without the help of the community. So please reconsider instead of trying to force the community to pay the cost of standardization.
Of course it would be easy if you did not care about existing implementations. But my understanding is that the IETF does take care of those. Is my understanding correct?
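The compatibility concern can be made concrete with a small Python sketch. The "old" decoder below handles only the raw type of the specification discussed in this thread (fix raw, 0xA0 to 0xBF, and raw 16, 0xDA). The "new" encoder emits a separate binary type whose type byte, 0xC4 here, is an assumption chosen purely for illustration; the thread does not settle on one. The function names are hypothetical.

```python
def old_decode_raw(wire: bytes) -> bytes:
    """Decoder that only knows the single 'raw' type."""
    head = wire[0]
    if 0xA0 <= head <= 0xBF:              # fix raw: low 5 bits hold the length
        length = head & 0x1F
        return wire[1:1 + length]
    if head == 0xDA:                      # raw 16: 2-byte big-endian length
        length = int.from_bytes(wire[1:3], "big")
        return wire[3:3 + length]
    raise ValueError(f"unknown type byte 0x{head:02x}")


HYPOTHETICAL_BIN = 0xC4   # assumed type byte for a separate binary type


def new_encode_binary(payload: bytes) -> bytes:
    """Encoder for a hypothetical separate binary type (1-byte length)."""
    if len(payload) > 0xFF:
        raise ValueError("payload too long for this sketch")
    return bytes([HYPOTHETICAL_BIN, len(payload)]) + payload


print(old_decode_raw(b"\xa1\xde"))        # old data still decodes

try:
    old_decode_raw(new_encode_binary(b"\xde"))
except ValueError as exc:
    print("old decoder rejects new data:", exc)
```

Whichever kind of data keeps the old type bytes, deployed decoders will not understand the other kind until they are updated, which is the cost being discussed here.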
cabo
Feb 24, 2013
@frsyuki Thanks for the questions.
Being on the mailing list is indeed the only way to contribute to a WG process. Now we don't know yet what WG may be handling this, so if it helps that this be a low-volume, very focused mailing list, we may try to create one.
An Internet-Draft (I-D) can be in two stages: personal draft or WG draft. A personal draft is edited by whoever wrote it (that's why my bpack draft is called draft-bormann-..., Bormann is my last name). A WG draft is edited by one or more editors (e.g., draft-ietf-core-coap is a draft of the CoRE WG). That is usually, but not always, the person who wrote the initial personal draft; this is at the discretion of the WG chair.
I'm sure if you want to be the editor of the msgpack draft this will be highly welcome, because this is your work. Remember, however, that such a position comes with the responsibility to carry out changes as the WG decided. It has happened in the past that an editor and the WG disagreed to the extent that the editor stopped editing or even made unwanted changes; in such cases a new editor is appointed (or, if the rough consensus we strive for is not achievable at all, the work is stopped). Many drafts have multiple editors. I wouldn't mind being an editor either, mainly because I think I'm pretty effective at the kind of editing work that remains to be done. We could both be editors; e.g., draft-ietf-core-coap has four editors. You could choose somebody else as a second editor (with the chair's approval). It doesn't hurt to have a pair of editors that represent different communities. Etc. (We are quite some ways from having to make the decision.)
@kazuho Whether the IETF cares about existing implementations depends on the specific work that needs to be done. E.g. OAuth 2 differs a lot from OAuth 1. But in that case, evolving the spec also was the (rough!) consensus of the people involved. If people from the msgpack community believe backwards compatibility is important, they should make that point, and I'm sure you will be heard. (Speaking just for myself: I personally don't have a requirement for backwards compatibility, but I'm also interested in maximizing interoperability.) I'm not sure how much impact the implementation considerations for backwards compatibility will need to have on the spec.
I'm sorry if I missed comments here, this github issue is a bit more active than I had planned for...
cabo
Feb 24, 2013
@kuenishi — you act as if I created the problem that is being solved in this github issue. I didn't. This github issue is old, and it is actually a duplicate of github issue 13 that is even older. I just happened to write a specification that also needs a solution for the problem, and that picked up an earlier msgpack fork (binarypack) that was created because issue 13 was ignored. So I have no idea what cost I'm creating here.
take-cheeze
Feb 24, 2013
"A “maybe” doesn’t have a place in a specification."
Sorry, I need to use words more carefully.
I mean a type like
"Languages which don't have types to distinguish strings and byte arrays (e.g.: PHP, C++, Erlang, OCaml)"
in the background section of @frsyuki's proposal.
cabo
Feb 24, 2013
Hmm, IIRC Erlang does have strings that are different from binary?
(And C++ indeed has vector vs. string etc...)
Maybe you do want to handle text strings using the same type that you use for byte strings.
But that is not a property of the language.
frsyuki
Feb 24, 2013
Member
@cabo Thank you for your detailed description. Let me ask one more question:
I'm a founder of a startup company in Silicon Valley. The success of the company is obviously the primary goal that I have to pursue first. It means I can't spend a lot of time creating an RFC standard for MessagePack.
So my question is: what kind of work is expected of the editor?
cabo
Feb 24, 2013
@frsyuki As I said, I don't think there is much work to be done at this point. Your main job will be to be cognizant of all the issues and make sure the document is consistent. If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process.
E-Mail me for more details (my e-mail address is in the draft).
frsyuki
Feb 24, 2013
Member
@cabo Let me comment about C++:
C++'s std::vector<T> represents an array whose elements have type T. If T is an integer type, then it should be serialized using msgpack's Array type and Integer type.
Imagine that the type is "char". It could be an array of very small integers. It could be a byte array. The serializer can't distinguish. How about uint16_t? It could be a UTF-16 string.
C++ programmers are allowed to store byte arrays using std::string. Many programmers (including me) use it because it's easy to use. Otherwise programmers need to allocate and free memory manually.
These languages are widely used. This is a fact. I wanted an intermediate data representation format that doesn't impose the extra work of setting markers on these objects. MessagePack focuses on these problems. Thus adding a string type is a difficult decision.
frsyuki
Feb 24, 2013
Member
@cabo I'm sorry, but let me be cautious...
What do you mean by "at this point"?
"I don't think there is much work to be done at this point"
I think this is why @kazuho says "enforce":
"If you absolutely need to minimize your work, you probably want to have a second person on the document to handle tedious things like IANA considerations, editorial comments, somebody who knows the IETF process."
Finding a dependable person who will kindly spend time on an open source project seems hard. But you need an answer by Monday, right?
Please don't get me wrong. I appreciate your guidance and suggestions on this project.
I think I have the following 3 options, but none of them seems like an excellent idea:
1. I'll be the editor (hard to find the time)
2. I try to find a dependable person to be an editor (takes time)
3. Just say "don't propose msgpack as a draft" (results in two confusing specs)
cabo
Feb 24, 2013
@frsyuki Well, the draft is written. You can have that. (Later, somebody has to do the small things still listed as TBD. With your help, I could do those — I have done a lot of those before.)
No, I don't need an answer by Monday. The only thing that I want to do by tomorrow is submit the next version of my draft, the -01 (you have seen the draft draft), because the -00 documents something that we both agree is no longer what we want, and I want to replace that. Then the two-week submission moratorium in front of the Orlando IETF starts.
If you aren't ready to take over at this time (and I would certainly understand that), I'll just do the -01 again under my name. We/you/whoever can submit the next version on Monday, March 11, or later. We also can let the -01 stand around unchanged for a while, while we all figure out what to do.
Again, there can be multiple people responsible for a draft, so there are multiple configurations possible beyond those you have listed.
And we aren't even close to having a WG that would need an editor appointed.
So we have plenty of time to find the right way to do this.
If possible, I would like to be able to discuss potential options with other IETFers in Orlando, in the week commencing March 11. So if we can discuss potential ways forward within the next two weeks from now, that would help a lot. But even that isn't strictly necessary. (It would help because there will be a JSON BOF in Orlando.)
I apologize if I have created the impression that the Monday deadline is the end of the world. It is just an Internet-Drafts deadline, which we like to have in the IETF so we can all read the documents before arriving at the meeting place.
kenn
Feb 24, 2013
@frsyuki I suggest that you open a new issue with https://gist.github.com/frsyuki/5022569 so that we can focus on the details of the spec and start the debate afresh there. At this point we need a separate place for those who are only interested in the proposed new spec itself.
I think this issue has been messed up and important people who should be reading this thread stopped reading. We can leave this thread open so that we can continue to come back when we need to talk about non-spec matters.
cabo
Feb 24, 2013
How about issue 13?...
frsyuki
Feb 24, 2013
Member
@cabo OK. I misunderstood about the Monday.
As the inventor and recognized steward of msgpack, I can NOT say SO FAR that you can use the name "MessagePack" as the name of your next draft (which is likely to be submitted Monday). You'll submit the -01 under your name on Monday.
I think this is all I need to say now.
I'll try to get advice from several people who have experience with standardization.
I think I need a good sleep before I can make any more decisions correctly....
frsyuki
Feb 24, 2013
Member
Oh, I forgot to mention the reason: We couldn't reach a consensus on this matter so far (meaning right now).
cabo
Feb 24, 2013
@frsyuki: Thanks, I completely understand. This would indeed be too early.
I also subscribe to the view that a specification needs to be implemented and its ramifications understood before you really can have a solid consensus.
So, for now, have a good night's sleep!
cabo
Feb 24, 2013
Re the C++ string interoperability issue: Let me just point out that WG21 at least appears to be aware of the problem (http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3398.html proposes a char8_t (typedefed from unsigned char) and a u8string built from basic_string<char8_t>). But that may be little help for now, at least until C++/TR2 comes out or it is supported by common libraries.
kiyoto
referenced this issue
Feb 24, 2013
Closed
Discussions on the upcoming MessagePack spec that adds the string type to the protocol. #128
kiyoto
commented
Feb 24, 2013
Midar
Feb 24, 2013
@kiyoto It's still readable there, just did that today ;).
But yes, let's split this into several tickets and close this.
kiyoto
Feb 24, 2013
@Midar
You have better vision than I do (no pun intended). I did that too, got a mild headache, and decided to create a new ticket =)
kazuho
Feb 25, 2013
Contributor
Sorry, now that we have a new location to handle the issue, I have removed my last comment posted here and reposted as #128 (comment)
Thank you. See #128 as well.
cabo
referenced this issue
Feb 27, 2013
Closed
MessagePack should be developed in an open process #129
hey, can anyone close this issue?
rasky
closed this
Feb 27, 2013
niemeyer
Jun 4, 2013
"My conclusion is that 'it's better to support user-defined custom type rather than adding string type'"
I'd be happy to have a string type, but custom types open a relevant can of worms that I'd like to stay away from. msgpack was a great format precisely because it was simple and tight.
"For example, a server program requires that data should be serialized in string type. Another program written in PHP can't tell strings from binary type."
The irony is that there are 11 ways in which the number 1 could be sent across the wire. Some of the libraries are unable to make that distinction as well. Seems like people did okay so far.
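One way to arrive at a count of 11 is sketched below. The comment does not spell out how it counts, so this is an assumption: it takes the positive fixint form, the four unsigned and four signed integer widths, and the two float formats, and builds each encoding of the value 1 with Python's standard `struct` module. `ENCODINGS_OF_ONE` is a name made up for the illustration.

```python
import struct

# Assumed enumeration: every msgpack number format that can carry the value 1.
ENCODINGS_OF_ONE = {
    "positive fixint": b"\x01",
    "uint 8": b"\xcc" + struct.pack(">B", 1),
    "uint 16": b"\xcd" + struct.pack(">H", 1),
    "uint 32": b"\xce" + struct.pack(">I", 1),
    "uint 64": b"\xcf" + struct.pack(">Q", 1),
    "int 8": b"\xd0" + struct.pack(">b", 1),
    "int 16": b"\xd1" + struct.pack(">h", 1),
    "int 32": b"\xd2" + struct.pack(">i", 1),
    "int 64": b"\xd3" + struct.pack(">q", 1),
    "float 32": b"\xca" + struct.pack(">f", 1.0),
    "float 64": b"\xcb" + struct.pack(">d", 1.0),
}

# Print each format name next to its bytes on the wire.
for name, wire in ENCODINGS_OF_ONE.items():
    print(f"{name:16} {wire.hex()}")
```

A decoder that maps all of these to the same host-language number loses the distinction in the same way a decoder without a string type loses the text/binary distinction.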
tracker1
Aug 26, 2013
Gah, looking at the latest spec.. why not just make the "string" type expect to be UTF-8 encoded IN THE SPEC? and keep the binary type if you want "RAW" whatever?
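A rough sketch of what that split looks like on the wire, assuming the type bytes of the updated msgpack spec (fixstr `0xa0`-`0xbf`, str 8/16/32 at `0xd9`/`0xda`/`0xdb`, bin 8/16/32 at `0xc4`/`0xc5`/`0xc6`). It is hand-written with the standard library rather than taken from any msgpack binding, and `pack_str`, `pack_bin` and `unpack_one` are hypothetical names.

```python
import struct


def pack_str(text: str) -> bytes:
    """Encode text in the str family; the payload is always UTF-8."""
    data = text.encode("utf-8")
    n = len(data)
    if n < 32:
        return bytes([0xA0 | n]) + data  # fixstr
    if n < 2**8:
        return b"\xd9" + struct.pack(">B", n) + data  # str 8
    if n < 2**16:
        return b"\xda" + struct.pack(">H", n) + data  # str 16
    return b"\xdb" + struct.pack(">I", n) + data  # str 32


def pack_bin(data: bytes) -> bytes:
    """Encode raw bytes in the bin family; no encoding is implied."""
    n = len(data)
    if n < 2**8:
        return b"\xc4" + struct.pack(">B", n) + data  # bin 8
    if n < 2**16:
        return b"\xc5" + struct.pack(">H", n) + data  # bin 16
    return b"\xc6" + struct.pack(">I", n) + data  # bin 32


# type byte -> (width of the length field in bytes, payload is text?)
_WIDE = {0xD9: (1, True), 0xDA: (2, True), 0xDB: (4, True),
         0xC4: (1, False), 0xC5: (2, False), 0xC6: (4, False)}


def unpack_one(buf: bytes):
    """Return str for the str family and bytes for the bin family."""
    head = buf[0]
    if 0xA0 <= head <= 0xBF:
        start, n, is_text = 1, head & 0x1F, True
    elif head in _WIDE:
        width, is_text = _WIDE[head]
        start = 1 + width
        n = int.from_bytes(buf[1:start], "big")
    else:
        raise ValueError(f"not a str or bin value: 0x{head:02x}")
    payload = buf[start:start + n]
    return payload.decode("utf-8") if is_text else payload


print(repr(unpack_one(pack_str("héllo"))))
print(repr(unpack_one(pack_bin(b"\xff\xfe"))))
```

With the type carried in the first byte, the decoder no longer needs a global "decode everything as UTF-8" switch.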
tarruda
referenced this issue
Sep 1, 2014
Merged
[RFC] Update to the experimental msgpack v5 branch and other msgpack-rpc improvements #1130
terrylinooo
Jul 23, 2016
I use msgpack in PHP 5.6 without any problem (my PHP files are UTF-8 encoded).
Is this issue fixed?
methane
Aug 1, 2016
Member
This issue is about msgpack format, not PHP implementation.
AFAIK, PHP doesn't have separate text and binary types, so there was no problem there in the first place.
rasky commented Nov 12, 2012
It looks like the msgpack spec does not differentiate between a raw binary data buffer and text strings. This causes some problems in all high-level language wrappers, because most high-level languages have different data types for text strings and binary buffers.
For instance, the Objective-C wrapper is currently broken because it tries to decode all raw bytes into high-level strings (through UTF-8 decoding), since using a text string (NSString) is the only way to populate an NSDictionary (map). But it breaks because obviously some binary buffers cannot be decoded as UTF-8 strings.
The same happens with Python 2/3: when you serialize and deserialize a unicode string, you always get a binary string back, and this breaks simple code:
As you can see, when you deserialize, you get a different object which does not work (because internal text strings are not decoded from UTF-8).
Most wrappers have an option to specify automatic UTF-8 decoding for all raw bytes, but that is wrong because it will apply to ALL raw bytes, while you might have a mixture of text strings and binary bytes within the same messagepack. It's not at all uncommon.
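A minimal sketch of the effect described above, not the code from the original report. It hand-rolls the old raw family (fixraw `0xa0`-`0xbf`, raw 16 at `0xda`, raw 32 at `0xdb`) with the standard library instead of calling a real msgpack binding; `pack_raw`, `pack_value` and `unpack_raw` are hypothetical helper names.

```python
import struct


def pack_raw(payload: bytes) -> bytes:
    """Encode bytes with the old msgpack raw family (fixraw / raw 16 / raw 32)."""
    n = len(payload)
    if n < 32:
        return bytes([0xA0 | n]) + payload  # fixraw: 101xxxxx
    if n < 2**16:
        return b"\xda" + struct.pack(">H", n) + payload  # raw 16
    return b"\xdb" + struct.pack(">I", n) + payload  # raw 32


def pack_value(value) -> bytes:
    """Text and binary both collapse into the single raw type."""
    if isinstance(value, str):
        return pack_raw(value.encode("utf-8"))
    if isinstance(value, (bytes, bytearray)):
        return pack_raw(bytes(value))
    raise TypeError(f"unsupported type: {type(value).__name__}")


def unpack_raw(buf: bytes) -> bytes:
    """Decode one raw value; the wire format cannot say whether it was text."""
    head = buf[0]
    if 0xA0 <= head <= 0xBF:
        start, n = 1, head & 0x1F
    elif head == 0xDA:
        start, n = 3, struct.unpack(">H", buf[1:3])[0]
    elif head == 0xDB:
        start, n = 5, struct.unpack(">I", buf[1:5])[0]
    else:
        raise ValueError(f"not a raw value: 0x{head:02x}")
    return buf[start:start + n]


text = "héllo"           # a text string
blob = b"\xff\xfe\x00"   # binary data that is not valid UTF-8

# Both inputs use the same wire type, so the str comes back as bytes.
roundtrip = unpack_raw(pack_value(text))
print(type(roundtrip).__name__, roundtrip == text)

# A global "decode every raw value as UTF-8" switch repairs the text
# but fails on the binary value in the same message.
try:
    unpack_raw(pack_value(blob)).decode("utf-8")
except UnicodeDecodeError as exc:
    print("blob is not text:", exc.reason)
```

The decoder has only two choices here, return bytes for everything or decode everything, and neither is right for a message that mixes text and binary values.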
As I said, this problem can be found in almost all high-level messagepack bindings, because most high-level languages have different data types for text strings and binary buffers.
I think the only final solution for this problem is to enhance the msgpack spec to explicitly differentiate between text strings and binary buffers. Is this something that msgpack authors are willing to discuss?
I am willing to implement whatever solution you decide is best and submit a pull request.
Thanks!