New optional crc type at 0xc6 #114

Closed
wants to merge 6 commits into from

Conversation

@rurban (Author) commented Jun 29, 2012

Use the empty reserved msgpack 0xc6 type for a crc32, stored as a uint32 at the end of the buffer.
See msgpack-perl for the added add_crc() pack method.

Optional CRC tag for basic integrity checking and data-corruption detection.
If pack provides a crc checksum, unpack should check the result against the given
crc and report an error on mismatch.
The crc tag must be the last tag in the buffer.
The crc checksum does not include itself or its type tag.
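As a sketch of the proposed trailer: the Python below is a stand-in for the Perl add_crc() method, operating on raw bytes rather than a real msgpack buffer, and it assumes big-endian byte order for the uint32 (matching msgpack's other multi-byte fields; the proposal does not spell this out).

```python
import struct
import zlib

CRC_TAG = 0xC6  # reserved msgpack byte proposed for the optional CRC trailer

def add_crc(packed: bytes) -> bytes:
    """Append the trailer: the 0xc6 tag followed by crc32 as a uint32.

    The checksum covers the payload only, not the tag or itself.
    """
    return packed + struct.pack(">BI", CRC_TAG, zlib.crc32(packed) & 0xFFFFFFFF)

def check_crc(buf: bytes) -> bytes:
    """If buf ends with a CRC trailer, verify it and return the payload.

    Buffers without the trailer are returned unchanged: the tag is optional.
    """
    if len(buf) >= 5 and buf[-5] == CRC_TAG:
        payload = buf[:-5]
        (crc,) = struct.unpack(">I", buf[-4:])
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch: buffer corrupted or half-written")
        return payload
    return buf
```

Note that buf[-5] can coincidentally be 0xC6 inside a legitimate object, which is part of the ambiguity debated later in the thread.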

@gfx (Member) commented Jun 30, 2012

+1

@qehgt commented Jun 30, 2012

Why should it be implemented on the library level?
Do you really think that "basic security" can be implemented by adding CRC?

@rurban (Author) commented Jun 30, 2012

On Sat, Jun 30, 2012 at 10:17 AM, qehgt wrote:

Why should it be implemented on the library level?

Because the parser needs to assert that it's the only optional element after the last valid parsed element.

Do you really think that "basic security" can be implemented by adding CRC?

Of course not. It's an added optional value for fast data-corruption detection.
Also: we do poll reads and do not know if the writer has already finished.
We now check the fifth-last byte for 0xc6, and only then do we try to parse;
otherwise we assume the writer is still busy writing.
In our previous implementation we sometimes tried to parse half-written buffers,
which was too slow. With this scheme we are ~11x faster and a bit safer.
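The polling check described above amounts to the following sketch; buffer_looks_complete is an illustrative name, not the author's actual code:

```python
CRC_TAG = 0xC6  # the proposed trailer tag byte

def buffer_looks_complete(buf: bytes) -> bool:
    # The writer appends the 5-byte trailer (0xc6 tag + uint32 crc) last,
    # so until the fifth-last byte is 0xc6 the reader assumes the writer
    # is still busy and skips parsing until the next poll.
    return len(buf) >= 5 and buf[-5] == CRC_TAG
```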

Security would be implemented by gpg signing.

Reini Urban
http://cpanel.net/ http://www.perl-compiler.org/

@bdraco commented Jun 30, 2012

Having this would be fantastic. You also can't count on being able to write a temp file and rename() it into place. In those circumstances it's way too easy to introduce seemingly random bugs caused by half-written data.

It is about 5% slower than the inlined macro version though
@frsyuki (Member) commented Jul 3, 2012

Please understand that we have to be careful about adding new specifications…

  • I'm opposed to adding that format specification.
  • It's OK to add the feature to the Perl implementation as an original extension.
  • I think it's better to add a compression feature such as zlib or snappy.
  • It's a good idea to define 0xC6 as a "never used" byte so that applications can use it as a header for metadata.

Reason 1) Adding extra bytes after serialized objects doesn't work with streaming deserialization. For example, when you have received a complete serialized object via a pipe (or socket), there is no way to know whether more bytes will come. So you have to wait until at least one more byte arrives to decide whether to check a CRC, but that could block for a long time. Even if it's optional, there is no reasonable way to handle it optionally.

Reason 2) With the MessagePack format, you can handle an array as a sequence of objects by skipping the header bytes of the array. This technique is important for handling a big array which will not fit in memory. In this case, CRC bytes after the array cause a problem because the deserializer will compare them with the CRC of the last element of the array.
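The header-skipping technique can be illustrated with a deliberately tiny sketch that handles only fixarrays of positive fixints; a real deserializer dispatches on every msgpack type byte to find element boundaries. It shows why trailing CRC bytes would be misattributed to the last element of the array.

```python
def iter_fixarray_elements(buf: bytes):
    """Skip a fixarray header (0x90 | N) and yield each element.

    Only positive-fixint elements (0x00-0x7f) are handled, so the sketch
    stays free of a full msgpack parser. After skipping the one-byte
    header, each element is treated as a standalone object.
    """
    if buf[0] & 0xF0 != 0x90:
        raise ValueError("not a fixarray")
    n = buf[0] & 0x0F
    for i in range(1, 1 + n):
        if buf[i] > 0x7F:
            raise ValueError("only positive fixints handled in this sketch")
        yield buf[i]  # each positive fixint is its own one-byte object
```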

Alternative plan) Some compression algorithms provide error-detection functionality as well as a message-framing specification.

@rurban (Author) commented Jul 3, 2012

@frsyuki:

I don't care which byte is used for the CRC uint32; several are available (= reserved). I just took one that looked nice.

Ad Reason 1) Adding extra bytes after serialized objects doesn't work with streaming deserialization. For example, when you have received a complete serialized object via a pipe (or socket), there is no way to know whether more bytes will come. So you have to wait until at least one more byte arrives to decide whether to check a CRC, but that could block for a long time. Even if it's optional, there is no reasonable way to handle it optionally.

The streaming deserializer can ignore an extra CRC tag and just stop after receiving one object, or it can use the msgpack parser, which reads the CRC tag that comes after the first object. The method for knowing whether more bytes will come is up to you and your choice of parser. There will be no additional waiting after a complete object has been received.

Ad Reason 2) With the MessagePack format, you can handle an array as a sequence of objects by skipping the header bytes of the array. This technique is important for handling a big array which will not fit in memory. In this case, CRC bytes after the array cause a problem because the deserializer will compare them with the CRC of the last element of the array.

If the writer decides to split big objects into several smaller ones, the writer can decide whether to add a crc. I see no technical problem with or without CRC, because the logic is at the application level, not in the library. The library just optionally checks one buffer CRC if there is a CRC tag at the end. BTW: the reader must also know about such a split to splice the pieces back together.

zlib and snappy would be a nice extension for long strings or for many strings, but the overhead for typical small data is too big. I haven't checked at which string size compression starts to make sense; something around 2-5K, I assume. Numbers are already compactly encoded.

@frsyuki (Member) commented Jul 3, 2012

You mean that application developers can't always enable CRC handling, and that they have to consider whether stored data (or data transferred from another node) includes a CRC or not, right?

If my understanding is correct, the CRC handling will be a flag of unpacker class (or an option of unpack function) like this:

my $mp = Data::MessagePack->new();
$mp->enable_0xc6_crc_check = true;
$mp->unpack($dat);  # check 0xc6 follows or not

So far, I think these features, which require extra application-specific knowledge, don't make sense as a serialization format specification. We need more concrete use cases.

But it's a good idea to add this to the Perl implementation because it just works in practice (the decision is up to @gfx, though).

@frsyuki (Member) commented Jul 3, 2012

zlib and snappy would be a nice extension for long strings or for many strings, but the overhead for typical small data is too big. I haven't checked at which string size compression starts to make sense.

Right. So you need a fast and compact message-framing protocol that supports error detection and protocol recovery.
I once wrote a similar program as a proof of concept in Ruby (logpack.tar.gz) which uses a [size:uint32] [object:msgpack] format. And I know another company that "escaped" 0xC6 in serialized data to 0xC6 0xC6 and inserted 0xC6 0xC5 after each object as a border between objects. ...There are many ways but no "best practices" so far.

So, just as an idea, the following format might be enough:

[object:msgpack] 0xC6
  • (all applications) assume 0xC6 is a "never used" byte and just ignore it
    • adding a $skipped_bytes = $unpacker->try_skip_after_N_bytes_after_0xc6(0) function would be a good idea in this case
  • (some applications) assume a valid object is not an integer
    • most corrupt bytes are recognized as integers in the msgpack format
  • (some applications) detect data corruption by checking that the successfully deserialized object is not an integer and that 0xC6 follows
  • (some applications) recover the protocol by scanning for a valid object byte by byte

This format can detect and recover from application-level problems. To detect data corruption on disk or in the network, use whole-stream compression and/or SSL.
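The logpack-style [size:uint32] [object:msgpack] framing mentioned above can be sketched as follows; plain byte strings stand in for serialized msgpack objects, and the big-endian size prefix is an assumption.

```python
import struct

def frame(payload: bytes) -> bytes:
    # [size:uint32][object]: the size prefix gives the reader an
    # unambiguous message boundary without scanning for sentinel bytes.
    return struct.pack(">I", len(payload)) + payload

def deframe(stream: bytes):
    # Yield each complete payload; stop at a partial trailing frame,
    # which a poll-based reader would simply retry on the next read.
    off = 0
    while off + 4 <= len(stream):
        (size,) = struct.unpack_from(">I", stream, off)
        if off + 4 + size > len(stream):
            break  # writer still busy: frame incomplete
        yield stream[off + 4:off + 4 + size]
        off += 4 + size
```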

@rurban (Author) commented Jul 3, 2012

On Tue, Jul 3, 2012 at 1:37 PM, FURUHASHI Sadayuki wrote:

You mean that application developers can't always enable CRC handling and they have to consider stored data (or data transferred from another node) include CRC or not, right?

Yes. The Perl function is called add_crc($packed); it destructively appends the crc tag and the uint32 to the end of $packed.

You can use (pseudo code)
$packed = pack($data);
or use
$packed = add_crc(pack($data));

If my understanding is correct, the CRC handling will be a flag of unpacker class (or an option of unpack function) like this:

my $mp = Data::MessagePack->new();
$mp->enable_0xc6_crc_check = true;
$mp->unpack($dat);  # check 0xc6 follows or not

No flag for unpack, only a function for pack.
The unpacker either sees a crc tag in the stream after the data or it doesn't.
If so, it checks the crc and errors out with a new error message; if not, it does nothing.

Please check out my implementation.

So far, I think these features, which require extra application-specific knowledge, don't make sense as a serialization format specification. We need more concrete use cases.

The pack-with-crc function must be added by the languages that want it.
The unpacker uses the C library.

But it's a good idea to add this to the Perl implementation because it just works in practice (the decision is up to @gfx, though).

Goro likes it.
The remaining problem is general unpack support in the msgpack C library.
I do not really want to add the add_crc function to all languages myself (Ruby, Python, C#, ...).
But since it is optional, the library should not care too much about the languages, and they can start implementing sooner or later.
Cross-language support is also given by having the unpacker understand this new tag.


@frsyuki (Member) commented Jul 3, 2012

The unpacker sees a crc tag in the stream after the data or not.
If so, it checks the crc and errors out with a new error message.

Sorry for the confusion; I meant that CRC handling should be a flag, enabled only by users who can expect CRC bytes after a deserialized object, because streaming unpackers can't always block to check whether a crc tag exists. And it may cause a problem when handling arrays. Your code always assumes there is at least one extra byte after a deserialized object and that the following CRC belongs to the last deserialized object, but it can't assume that (at the same time, your code seems to have a SEGV problem).

And as I mentioned, I'm opposed to adding to the format specification a feature that doesn't work with streaming deserializers and requires application-specific knowledge.

Unless it's defined in the format specification, each maintainer should be able to decide whether to add this feature to their implementation. I understand its interoperability is important, but I don't want to force them to implement imperfect features. But msgpack_unpack_func in msgpack/unpack_template.h is still shared by several implementations. @gfx can copy that file and merge your code; it's up to him.

Anyway... I want him (and all other maintainers) to copy the file and be independent.

@rurban (Author) commented Jul 3, 2012

Sorry, but as I already explained, both of your counterarguments are wrong.

  1. It works fine with streaming deserializers: there is no need to wait for more data once the parser is finished with the data. The crc is optional.
  2. No application-specific knowledge is required at all. Either add a crc at the application level or don't; unpack is transparent.

I'm fine with the private implementation, as that is how we are using it.
msgpack can add this feature if they want to, and other languages can add an add_crc packer function if they want to. It's up to them.

@frsyuki (Member) commented Jul 3, 2012

I couldn't understand why it works with streaming deserializers... What happens in the following situation?

  1. node A sent a serialized object and 0xC6+CRC via TCP
  2. node B received the serialized object (TCP doesn't guarantee message boundaries)
  3. node B decided there are no CRC bytes after the object because it hadn't received them; node B doesn't expect CRC bytes to arrive several milliseconds later because it has no knowledge of them
  4. node B received the 0xC6 byte
  5. the unpacker raises an exception?

Anyway, I agree to adding this to the Perl implementation, and I entrust @gfx with that matter.

Reini Urban added 4 commits July 23, 2012 11:12
If not CRC_INLINED use the zlib crc32() with word stepper and a
crc lookup table.
With zlib the crc overhead is only 2% for small buffers.
Previously it was 10%.

with -lz:
$ pb benchmark/serialize.pl; pb benchmark/deserialize.pl
-- serialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
      json:  1 wallclock secs ( 1.06 usr +  0.03 sys =  1.09 CPU) @ 136436.70/s (n=148716)
        mp:  1 wallclock secs ( 1.02 usr +  0.01 sys =  1.03 CPU) @ 185578.64/s (n=191146)
    mp_crc:  0 wallclock secs ( 1.06 usr +  0.00 sys =  1.06 CPU) @ 162293.40/s (n=172031)
  storable:  1 wallclock secs ( 1.05 usr +  0.00 sys =  1.05 CPU) @ 91021.90/s (n=95573)
             Rate storable     json   mp_crc       mp
storable  91022/s       --     -33%     -44%     -51%
json     136437/s      50%       --     -16%     -26%
mp_crc   162293/s      78%      19%       --     -13%
mp       185579/s     104%      36%      14%       --
-- deserialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
      json:  1 wallclock secs ( 1.03 usr +  0.01 sys =  1.04 CPU) @ 97302.88/s (n=101195)
        mp:  1 wallclock secs ( 1.12 usr +  0.01 sys =  1.13 CPU) @ 126866.37/s (n=143359)
    mp_crc:  1 wallclock secs ( 1.06 usr +  0.00 sys =  1.06 CPU) @ 124841.51/s (n=132332)
  storable:  1 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 80387.85/s (n=86015)
             Rate storable     json   mp_crc       mp
storable  80388/s       --     -17%     -36%     -37%
json      97303/s      21%       --     -22%     -23%
mp_crc   124842/s      55%      28%       --      -2%
mp       126866/s      58%      30%       2%       --
Do not use USE_SMAZ in crc branch
@teruokun commented

Admittedly I haven't looked at your implementation, but why not have the style be CRC -> object instead of object -> CRC? Then the processing works like this:

  1. node A sends 0xC6+CRC, then the object the CRC is for, via TCP
  2. node B receives the 0xC6 byte and the checksum, and recognizes that the next msgpack object sent in this session must match the checksum

Then, if you like, you could specify a flag to the unpacker that raises an exception if the checksum is not present for top-level objects in the stream, so that you never accept messages without a checksum.
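A sketch of this CRC-first layout, including the strict flag suggested above; all names here (pack_with_leading_crc, require_crc) are illustrative, not an existing API, and plain bytes stand in for a serialized msgpack object.

```python
import struct
import zlib

CRC_TAG = 0xC6

def pack_with_leading_crc(payload: bytes) -> bytes:
    # CRC first, object second: the reader knows up front that a
    # checksum applies, so it never has to wait for trailing bytes.
    return struct.pack(">BI", CRC_TAG, zlib.crc32(payload) & 0xFFFFFFFF) + payload

def unpack_with_leading_crc(buf: bytes, require_crc: bool = False) -> bytes:
    if len(buf) >= 5 and buf[0] == CRC_TAG:
        (crc,) = struct.unpack_from(">I", buf, 1)
        payload = buf[5:]
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("checksum mismatch")
        return payload
    if require_crc:
        # the suggested strict mode: never accept unchecksummed messages
        raise ValueError("checksum required but not present")
    return buf
```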

@kuenishi (Member) commented

We added a new ext type in the new spec. See #128 for the discussion.
