New optional crc type at 0xc6 #114

Closed
wants to merge 6 commits into from

Conversation

@rurban (Author) commented Jun 29, 2012

Use the empty reserved msgpack 0xc6 type for a crc32, stored as a uint32 at the end of the buffer.
See msgpack-perl for the added add_crc() pack method.

Optional CRC tag for basic integrity checking and data-corruption detection.
If pack provides a crc checksum, unpack should check the result against the given
crc and report an error on mismatch.
The crc tag must be the last tag in the buffer.
The crc checksum does not include itself or its type tag.
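As a sketch of the proposed trailer: the Python below is a stand-in for the Perl add_crc() method, operating on raw bytes rather than a real msgpack buffer, and it assumes big-endian byte order for the uint32 (matching msgpack's other multi-byte fields; the proposal does not spell this out).

```python
import struct
import zlib

CRC_TAG = 0xC6  # reserved msgpack byte proposed for the optional CRC trailer

def add_crc(packed: bytes) -> bytes:
    """Append the trailer: the 0xc6 tag followed by crc32 as a uint32.

    The checksum covers the payload only, not the tag or itself.
    """
    return packed + struct.pack(">BI", CRC_TAG, zlib.crc32(packed) & 0xFFFFFFFF)

def check_crc(buf: bytes) -> bytes:
    """If buf ends with a CRC trailer, verify it and return the payload.

    Buffers without the trailer are returned unchanged: the tag is optional.
    """
    if len(buf) >= 5 and buf[-5] == CRC_TAG:
        payload = buf[:-5]
        (crc,) = struct.unpack(">I", buf[-4:])
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch: buffer corrupted or half-written")
        return payload
    return buf
```

Note that buf[-5] can coincidentally be 0xC6 inside a legitimate object, which is part of the ambiguity debated later in the thread.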

@gfx (Member) commented Jun 30, 2012

+1

@qehgt commented Jun 30, 2012

Why should it be implemented on the library level?
Do you really think that "basic security" can be implemented by adding CRC?

@rurban (Author) commented Jun 30, 2012

On Sat, Jun 30, 2012 at 10:17 AM, qehgt wrote:

Why should it be implemented on the library level?

Because the parser needs to assert that it's the only optional element after the last valid parsed element.

Do you really think that "basic security" can be implemented by adding CRC?

Of course not. It's an added optional value for fast data-corruption detection.
Also: we do poll reads and do not know if the writer has already finished.
We now check the fifth-last byte for 0xc6, and only then do we try to parse;
otherwise we assume the writer is still busy writing.
In our previous implementation we sometimes tried to parse half-written buffers,
which was too slow. With this scheme we are ~11x faster and a bit safer.
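The polling check described above amounts to the following sketch; buffer_looks_complete is an illustrative name, not the author's actual code:

```python
CRC_TAG = 0xC6  # the proposed trailer tag byte

def buffer_looks_complete(buf: bytes) -> bool:
    # The writer appends the 5-byte trailer (0xc6 tag + uint32 crc) last,
    # so until the fifth-last byte is 0xc6 the reader assumes the writer
    # is still busy and skips parsing until the next poll.
    return len(buf) >= 5 and buf[-5] == CRC_TAG
```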

Security would be implemented by gpg signing.

Reini Urban
http://cpanel.net/ http://www.perl-compiler.org/

@bdraco commented Jun 30, 2012

Having this would be fantastic. You also can't count on being able to write a temp file and rename() it into place. In those circumstances it's way too easy to introduce seemingly random bugs caused by half-written data.

It is about 5% slower than the inlined macro version though
@frsyuki (Member) commented Jul 3, 2012

Please understand that we have to be careful about adding new specifications…

  • I'm opposed to adding that format specification.
  • It's OK to add the feature to the Perl implementation as an original extension.
  • I think it's better to add a compression feature such as zlib or snappy.
  • It's a good idea to define 0xC6 as a "never used" byte so that applications can use it as a header for metadata.

Reason 1) Adding extra bytes after serialized objects doesn't work with streaming deserialization. For example, when you have received a complete serialized object via a pipe (or socket), there is no way to know whether more bytes will come. So you have to wait until at least one more byte arrives to decide whether to check a CRC, but that could block for a long time. Even if it's optional, there is no reasonable way to handle it optionally.

Reason 2) With the MessagePack format, you can handle an array as a sequence of objects by skipping the header bytes of the array. This technique is important for handling a big array which will not fit in memory. In this case, CRC bytes after the array cause a problem because the deserializer will compare them with the CRC of the last element of the array.
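The header-skipping technique can be illustrated with a deliberately tiny sketch that handles only fixarrays of positive fixints; a real deserializer dispatches on every msgpack type byte to find element boundaries. It shows why trailing CRC bytes would be misattributed to the last element of the array.

```python
def iter_fixarray_elements(buf: bytes):
    """Skip a fixarray header (0x90 | N) and yield each element.

    Only positive-fixint elements (0x00-0x7f) are handled, so the sketch
    stays free of a full msgpack parser. After skipping the one-byte
    header, each element is treated as a standalone object.
    """
    if buf[0] & 0xF0 != 0x90:
        raise ValueError("not a fixarray")
    n = buf[0] & 0x0F
    for i in range(1, 1 + n):
        if buf[i] > 0x7F:
            raise ValueError("only positive fixints handled in this sketch")
        yield buf[i]  # each positive fixint is its own one-byte object
```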

Alternative plan) Some compression algorithms provide error-detection functionality as well as a message-framing specification.

@rurban (Author) commented Jul 3, 2012

@frsyuki:

I don't care which byte is used for the CRC uint32; several are available (= reserved). I just took one that looked nice.

Ad Reason 1) Adding extra bytes after serialized objects doesn't work with streaming deserialization. For example, when you have received a complete serialized object via a pipe (or socket), there is no way to know whether more bytes will come. So you have to wait until at least one more byte arrives to decide whether to check a CRC, but that could block for a long time. Even if it's optional, there is no reasonable way to handle it optionally.

The streaming deserializer can ignore an extra CRC tag and just stop after receiving one object, or it can use the msgpack parser, which reads the CRC tag that comes after the first object. The method for knowing whether more bytes will come is up to you and your choice of parser. There will be no additional waiting after a complete object has been received.

Ad Reason 2) With the MessagePack format, you can handle an array as a sequence of objects by skipping the header bytes of the array. This technique is important for handling a big array which will not fit in memory. In this case, CRC bytes after the array cause a problem because the deserializer will compare them with the CRC of the last element of the array.

If the writer decides to split big objects into several smaller ones, the writer can decide whether to add a crc. I see no technical problem with or without CRC, because the logic is at the application level, not in the library. The library just optionally checks one buffer CRC if there is a CRC tag at the end. BTW: the reader must also know about such a split to splice the pieces back together.

zlib and snappy would be a nice extension for long strings or for many strings, but the overhead for typical small data is too big. I haven't checked at which string size compression starts to make sense; something around 2-5K, I assume. Numbers are already compactly encoded.

@frsyuki (Member) commented Jul 3, 2012

You mean that application developers can't always enable CRC handling, and that they have to consider whether stored data (or data transferred from another node) includes a CRC or not, right?

If my understanding is correct, the CRC handling will be a flag of unpacker class (or an option of unpack function) like this:

my $mp = Data::MessagePack->new();
$mp->enable_0xc6_crc_check = true;
$mp->unpack($dat);  # check 0xc6 follows or not

So far, I think these features, which require extra application-specific knowledge, don't make sense as a serialization format specification. We need more concrete use cases.

But it's a good idea to add this to the Perl implementation because it just works in practice (the decision is up to @gfx, though).

@frsyuki (Member) commented Jul 3, 2012

zlib and snappy would be a nice extension for long strings or for many strings, but the overhead for typical small data is too big. I haven't checked at which string size compression starts to make sense.

Right. So you need a fast and compact message-framing protocol that supports error detection and protocol recovery.
I once wrote a similar program as a proof of concept in Ruby (logpack.tar.gz) which uses a [size:uint32] [object:msgpack] format. And I know another company that "escaped" 0xC6 in serialized data to 0xC6 0xC6 and inserted 0xC6 0xC5 after each object as a border between objects. ...There are many ways but no "best practices" so far.

So, just as an idea, the following format might be enough:

[object:msgpack] 0xC6
  • (all applications) assume 0xC6 is a "never used" byte and just ignore it
    • adding a $skipped_bytes = $unpacker->try_skip_after_N_bytes_after_0xc6(0) function would be a good idea in this case
  • (some applications) assume a valid object is not an integer
    • most corrupt bytes are recognized as integers in the msgpack format
  • (some applications) detect data corruption by checking that the successfully deserialized object is not an integer and that 0xC6 follows
  • (some applications) recover the protocol by scanning for a valid object byte by byte

This format can detect and recover from application-level problems. To detect data corruption on disk or in the network, use whole-stream compression and/or SSL.
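The logpack-style [size:uint32] [object:msgpack] framing mentioned above can be sketched as follows; plain byte strings stand in for serialized msgpack objects, and the big-endian size prefix is an assumption.

```python
import struct

def frame(payload: bytes) -> bytes:
    # [size:uint32][object]: the size prefix gives the reader an
    # unambiguous message boundary without scanning for sentinel bytes.
    return struct.pack(">I", len(payload)) + payload

def deframe(stream: bytes):
    # Yield each complete payload; stop at a partial trailing frame,
    # which a poll-based reader would simply retry on the next read.
    off = 0
    while off + 4 <= len(stream):
        (size,) = struct.unpack_from(">I", stream, off)
        if off + 4 + size > len(stream):
            break  # writer still busy: frame incomplete
        yield stream[off + 4:off + 4 + size]
        off += 4 + size
```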

@rurban (Author) commented Jul 3, 2012

On Tue, Jul 3, 2012 at 1:37 PM, FURUHASHI Sadayuki wrote:

You mean that application developers can't always enable CRC handling and they have to consider stored data (or data transferred from another node) include CRC or not, right?

Yes. The Perl function is called add_crc($packed); it destructively appends the crc tag and the uint32 to the end of $packed.

You can use (pseudo code)
$packed = pack($data);
or use
$packed = add_crc(pack($data));

If my understanding is correct, the CRC handling will be a flag of unpacker class (or an option of unpack function) like this:

my $mp = Data::MessagePack->new();
$mp->enable_0xc6_crc_check = true;
$mp->unpack($dat);  # check 0xc6 follows or not

No flag for unpack, only a function for pack.
The unpacker either sees a crc tag in the stream after the data or it doesn't.
If so, it checks the crc and errors out with a new error message; if not, it does nothing.

Please check out my implementation.

So far, I think these features, which require extra application-specific knowledge, don't make sense as a serialization format specification. We need more concrete use cases.

The pack-with-crc function must be added by the languages that want it.
The unpacker uses the C library.

But it's a good idea to add this to the Perl implementation because it just works in practice (the decision is up to @gfx, though).

Goro likes it.
The remaining problem is general unpack support in the msgpack C library.
I do not really want to add the add_crc function to all languages myself (Ruby, Python, C#, ...).
But since it is optional, the library should not care too much about the languages, and they can start implementing sooner or later.
Cross-language support is also given by having the unpacker understand this new tag.


@frsyuki (Member) commented Jul 3, 2012

The unpacker sees a crc tag in the stream after the data or not.
If so, it checks the crc and errors out with a new error message.

Sorry for the confusion; I meant that CRC handling should be a flag, enabled only by users who can expect CRC bytes after a deserialized object, because streaming unpackers can't always block to check whether a crc tag exists. And it may cause a problem when handling arrays. Your code always assumes there is at least one extra byte after a deserialized object and that the following CRC belongs to the last deserialized object, but it can't assume that (at the same time, your code seems to have a SEGV problem).

And as I mentioned, I'm opposed to adding to the format specification a feature that doesn't work with streaming deserializers and requires application-specific knowledge.

Unless it's defined in the format specification, each maintainer should be able to decide whether to add this feature to their implementation. I understand its interoperability is important, but I don't want to force them to implement imperfect features. But msgpack_unpack_func in msgpack/unpack_template.h is still shared by several implementations. @gfx can copy that file and merge your code; it's up to him.

Anyway... I want him (and all other maintainers) to copy the file and be independent.

@rurban (Author) commented Jul 3, 2012

Sorry, but as I already explained, both of your counterarguments are wrong.

  1. It works fine with streaming deserializers: there is no need to wait for more data once the parser is finished with the data. The crc is optional.
  2. No application-specific knowledge is required at all. Either add a crc at the application level or don't; unpack is transparent.

I'm fine with the private implementation, as that is how we are using it.
msgpack can add this feature if they want to, and other languages can add an add_crc packer function if they want to. It's up to them.

@frsyuki (Member) commented Jul 3, 2012

I couldn't understand why it works with streaming deserializers... What happens in the following situation?

  1. node A sent a serialized object and 0xC6+CRC via TCP
  2. node B received the serialized object (TCP doesn't guarantee message boundaries)
  3. node B decided there are no CRC bytes after the object because it hadn't received them; node B doesn't expect CRC bytes to arrive several milliseconds later because it has no knowledge of them
  4. node B received the 0xC6 byte
  5. the unpacker raises an exception?

Anyway, I agree to adding this to the Perl implementation, and I entrust @gfx with that matter.

Reini Urban added 4 commits July 23, 2012 11:12
If not CRC_INLINED use the zlib crc32() with word stepper and a
crc lookup table.
With zlib the crc overhead is only 2% for small buffers.
Previously it was 10%.

with -lz:
$ pb benchmark/serialize.pl; pb benchmark/deserialize.pl
-- serialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
      json:  1 wallclock secs ( 1.06 usr +  0.03 sys =  1.09 CPU) @ 136436.70/s (n=148716)
        mp:  1 wallclock secs ( 1.02 usr +  0.01 sys =  1.03 CPU) @ 185578.64/s (n=191146)
    mp_crc:  0 wallclock secs ( 1.06 usr +  0.00 sys =  1.06 CPU) @ 162293.40/s (n=172031)
  storable:  1 wallclock secs ( 1.05 usr +  0.00 sys =  1.05 CPU) @ 91021.90/s (n=95573)
             Rate storable     json   mp_crc       mp
storable  91022/s       --     -33%     -44%     -51%
json     136437/s      50%       --     -16%     -26%
mp_crc   162293/s      78%      19%       --     -13%
mp       185579/s     104%      36%      14%       --
-- deserialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
      json:  1 wallclock secs ( 1.03 usr +  0.01 sys =  1.04 CPU) @ 97302.88/s (n=101195)
        mp:  1 wallclock secs ( 1.12 usr +  0.01 sys =  1.13 CPU) @ 126866.37/s (n=143359)
    mp_crc:  1 wallclock secs ( 1.06 usr +  0.00 sys =  1.06 CPU) @ 124841.51/s (n=132332)
  storable:  1 wallclock secs ( 1.07 usr +  0.00 sys =  1.07 CPU) @ 80387.85/s (n=86015)
             Rate storable     json   mp_crc       mp
storable  80388/s       --     -17%     -36%     -37%
json      97303/s      21%       --     -22%     -23%
mp_crc   124842/s      55%      28%       --      -2%
mp       126866/s      58%      30%       2%       --
Do not use USE_SMAZ in crc branch
@teruokun commented

Admittedly I haven't looked at your implementation, but why not have the style be CRC -> object instead of object -> CRC? Then the processing works like this:

  1. node A sends 0xC6+CRC, then the object the CRC is for, via TCP
  2. node B receives the 0xC6 byte and the checksum, and recognizes that the next msgpack object sent in this session must match the checksum

Then, if you like, you could specify a flag to the unpacker that raises an exception if the checksum is not present for top-level objects in the stream, so that you never accept messages without a checksum.
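A sketch of this CRC-first layout, including the strict flag suggested above; all names here (pack_with_leading_crc, require_crc) are illustrative, not an existing API, and plain bytes stand in for a serialized msgpack object.

```python
import struct
import zlib

CRC_TAG = 0xC6

def pack_with_leading_crc(payload: bytes) -> bytes:
    # CRC first, object second: the reader knows up front that a
    # checksum applies, so it never has to wait for trailing bytes.
    return struct.pack(">BI", CRC_TAG, zlib.crc32(payload) & 0xFFFFFFFF) + payload

def unpack_with_leading_crc(buf: bytes, require_crc: bool = False) -> bytes:
    if len(buf) >= 5 and buf[0] == CRC_TAG:
        (crc,) = struct.unpack_from(">I", buf, 1)
        payload = buf[5:]
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("checksum mismatch")
        return payload
    if require_crc:
        # the suggested strict mode: never accept unchecksummed messages
        raise ValueError("checksum required but not present")
    return buf
```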

@kuenishi (Member) commented

We added a new ext type in the new spec. See #128 for the discussion.
