New optional crc type at 0xc6 #114
Conversation
Use the empty reserved msgpack 0xc6 type for a CRC-32, stored as a uint32 at the end of the buffer. See msgpack-perl for the added add_crc() pack method. This is an optional CRC tag for basic security and data-corruption detection. If pack provides a CRC checksum, unpack should check the result against the given CRC and report an error on mismatch. The CRC tag must be the last tag in the buffer. The CRC checksum covers neither itself nor its type tag.
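The proposal can be sketched as follows. This is a minimal illustration, not the actual msgpack-perl code; the big-endian uint32 encoding of the CRC is an assumption consistent with msgpack's byte-order conventions, and the names `add_crc`/`check_crc` are only modeled on the Perl method name:

```python
import struct
import zlib

CRC_TAG = 0xC6  # reserved msgpack type byte proposed for the CRC tag

def add_crc(packed: bytes) -> bytes:
    """Append a CRC tag: 0xc6 followed by a big-endian uint32 CRC-32
    of the preceding buffer. Per the proposal, the checksum covers
    neither itself nor its type tag."""
    crc = zlib.crc32(packed) & 0xFFFFFFFF
    return packed + struct.pack(">BI", CRC_TAG, crc)

def check_crc(buf: bytes) -> bytes:
    """If the buffer ends with a CRC tag, verify it and return the
    payload without the trailing tag; raise ValueError on mismatch."""
    if len(buf) >= 5 and buf[-5] == CRC_TAG:
        payload = buf[:-5]
        (crc,) = struct.unpack(">I", buf[-4:])
        if zlib.crc32(payload) & 0xFFFFFFFF != crc:
            raise ValueError("CRC mismatch: buffer corrupted")
        return payload
    return buf  # no CRC tag present; nothing to verify
```

Note that a reader using this sketch assumes the byte at offset -5 can be classified; the ambiguity of that assumption for streamed data is exactly what the discussion below turns on.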
+1
Why should it be implemented at the library level?
On Sat, Jun 30, 2012 at 10:17 AM, qehgt wrote:
Because the parser needs to assert that it's the only optional element.
Of course not. It's an added optional value for fast data corruption detection. Security would be implemented by gpg signing. (Reini Urban)
Having this would be fantastic. You also can't always count on being able to write a temp file and rename() it into place. In those circumstances it's way too easy to introduce seemingly random bugs caused by half-written data.
It is about 5% slower than the inlined macro version, though.
Please understand that we have to be careful about adding new specifications…
Reason 1) Adding extra bytes after serialized objects doesn't work with streaming deserialization. For example, when you receive a complete serialized object via a pipe (or socket), there is no way to know whether more bytes will come. You have to wait until at least one more byte arrives to decide whether to check a CRC, and that wait could block for a long time. Even if it's optional, there is no reasonable way to handle it optionally.

Reason 2) With the MessagePack format, you can handle an array as a sequence of objects by skipping the array's header bytes. This technique is important for handling a big array that will not fit in memory. In this case, CRC bytes after the array cause a problem, because the deserializer will compare them with the CRC of the last element of the array.

Alternative plan) Some compression algorithms provide error-detection functionality as well as a message-framing specification.
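The array-skipping technique behind Reason 2 can be sketched like this. It is an illustration only, using the standard msgpack array16 header (0xdc plus a big-endian uint16 element count) and single-byte positive-fixint elements to keep the element parser trivial:

```python
import struct

def array16_header(n: int) -> bytes:
    # msgpack array16: 0xdc followed by big-endian uint16 element count
    return struct.pack(">BH", 0xDC, n)

def stream_elements(buf: bytes):
    """Skip the array16 header and yield the element bytes one by one.
    Elements here are single-byte positive fixints for simplicity; a
    real deserializer would parse arbitrary objects at each position."""
    tag, count = struct.unpack(">BH", buf[:3])
    assert tag == 0xDC, "expected array16 header"
    pos = 3
    for _ in range(count):
        yield buf[pos:pos + 1]  # fixint elements are one byte each
        pos += 1

buf = array16_header(3) + bytes([1, 2, 3])  # [1, 2, 3] as msgpack
elements = list(stream_elements(buf))
```

A trailing CRC tag appended after `buf` would be encountered right after the last yielded element, which is why the objection says the deserializer would mistake it for a checksum of that last element.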
I don't care which byte is used for the CRC uint32; there are several available (= reserved). I just took one that looked nice.

Ad Reason 1) The streaming deserializer can ignore an extra CRC tag and just stop after receiving one object, or it can use the msgpack parser that reads the CRC tag coming after the first object. The method for knowing whether more bytes will come is up to you, i.e. which parser you choose. There will be no additional waiting after a complete object has already been received.

Ad Reason 2) If the writer decides to split big objects into several smaller ones, the writer can also decide whether to add a CRC. I see no technical problem with or without the CRC, because the logic is at the application level, not in the library. The library just optionally checks one buffer CRC if there is a CRC tag at the end. BTW: the reader must also know about such a split to splice the pieces back together.

zlib and snappy would be a nice extension for long strings or for lots of strings, but the overhead for typical small data is too big. I haven't checked at which string size compression would start to make sense.
You mean that application developers can't always enable CRC handling, and they have to consider whether stored data (or data transferred from another node) includes a CRC or not, right? If my understanding is correct, the CRC handling will be a flag on the unpacker class (or an option of the unpack function), like this:
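The code block that originally followed this comment is not preserved in the transcript. A plausible reconstruction of the flag-based API being described might look like the following; the names `Unpacker`, `check_crc`, and the fixstr-only stand-in parser are all hypothetical, not from any actual msgpack implementation:

```python
import struct
import zlib

def parse_fixstr(buf: bytes) -> str:
    # Minimal stand-in parser: handles only msgpack fixstr (0xa0-0xbf).
    n = buf[0] - 0xA0
    return buf[1:1 + n].decode("utf-8")

class Unpacker:
    """Hypothetical flag-based API: the caller opts into CRC checking
    only when the producer is known to append the 0xc6 CRC tag."""

    def __init__(self, check_crc: bool = False):
        self.check_crc = check_crc

    def unpack(self, buf: bytes) -> str:
        if self.check_crc:
            payload, tag_crc = buf[:-5], buf[-5:]
            if tag_crc[0] != 0xC6:
                raise ValueError("expected trailing CRC tag")
            (crc,) = struct.unpack(">I", tag_crc[1:])
            if zlib.crc32(payload) & 0xFFFFFFFF != crc:
                raise ValueError("CRC mismatch")
            buf = payload
        return parse_fixstr(buf)
```

The point of the flag is that only callers with out-of-band knowledge of the producer enable it; without the flag, trailing bytes are left untouched.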
So far, I think features that require extra application-specific knowledge don't make sense as part of a serialization format specification. We need more concrete use cases. But it's a good idea to add this to the perl implementation, because it just works practically (the decision is up to @gfx, though).
Right. So you need a fast and compact message-framing protocol which supports error detection and protocol recovery. Just as an idea, the following format might be enough:
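The concrete format sketch that followed here did not survive the transcript. One plausible shape of such a framing protocol, with field sizes and the resynchronization marker chosen purely for illustration, is a per-frame header carrying a length and a CRC:

```python
import struct
import zlib

MAGIC = 0x92  # hypothetical frame marker used to resynchronize after corruption

def frame(payload: bytes) -> bytes:
    # Layout (assumed): [marker:1][length:4 BE][crc32 of payload:4 BE][payload]
    header = struct.pack(">BII", MAGIC, len(payload),
                         zlib.crc32(payload) & 0xFFFFFFFF)
    return header + payload

def unframe(buf: bytes) -> bytes:
    marker, length, crc = struct.unpack(">BII", buf[:9])
    if marker != MAGIC:
        raise ValueError("not at a frame boundary")
    payload = buf[9:9 + length]
    if zlib.crc32(payload) & 0xFFFFFFFF != crc:
        # Recovery: a reader could scan forward for the next marker byte.
        raise ValueError("corrupted frame")
    return payload
```

Because the CRC lives in a header before the payload, the reader never has to look past a complete object to decide whether a checksum exists, which avoids the streaming objection raised earlier.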
This format can detect and recover from application-level problems. To detect data corruption on disk or in the network, use whole-stream compression and/or SSL.
On Tue, Jul 3, 2012 at 1:37 PM, FURUHASHI Sadayuki wrote:
Yes. The perl function is called add_crc($packed), and it adds the CRC to the packed buffer. You can use (pseudo code):
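The pseudo code itself is missing from the transcript. Based on the surrounding description (writer-side opt-in via add_crc, no unpack flag), the usage presumably looked something like this hypothetical sketch, with Python standing in for the Perl API and a fixstr-only stand-in for the real packer:

```python
import struct
import zlib

def pack(obj: str) -> bytes:
    # Stand-in packer, fixstr only (the Perl code would call
    # Data::MessagePack->pack here).
    data = obj.encode("utf-8")
    return bytes([0xA0 + len(data)]) + data

def add_crc(packed: bytes) -> bytes:
    # Append the proposed 0xc6 tag plus a big-endian uint32 CRC-32
    # of the packed bytes (layout assumed, as above).
    return packed + struct.pack(">BI", 0xC6,
                                zlib.crc32(packed) & 0xFFFFFFFF)

# Writer opts in; the reader's unpack needs no flag in this design.
buffer = add_crc(pack("foo"))
```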
There is no flag for unpack, only a function for pack. Please check out my implementation.
The pack-with-CRC function must be added by each language implementation that wants to support it.
Goro likes it, this new tag. (Reini Urban)
Sorry for the confusion, I meant "CRC handling should be a flag" which is enabled only by users who can expect CRC bytes after a deserialized object, because streaming unpackers can't always block to check whether a CRC tag exists, and it may cause a problem when handling arrays. Your code always assumes there is at least one extra byte after a deserialized object and that the following CRC is for the last deserialized object, but it can't assume that (at the same time, your code seems to have a SEGV problem). And as I mentioned, so far I'm opposed to adding a feature to the format specification that doesn't work with streaming deserializers and requires application-specific knowledge. Unless it's defined in the format specification, each maintainer should be able to decide whether to add this feature to their implementation. I understand that interoperability is important, but I don't want to force them to implement imperfect features. Anyway... I want him (and all other maintainers) to copy the file to be independent.
Sorry, but as I already explained, both of your counter-arguments are wrong.
And 2. no application-specific knowledge is required at all. Either add a CRC at the application level or don't; unpack is transparent. I'm fine with the private implementation, as we are using it like this.
I couldn't understand why it works with streaming deserializers... What happens in the following situation?:
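The situation that originally followed is not preserved. Presumably it was the core ambiguity for a streaming reader: after one complete object, the next byte might be a CRC tag for that object, or it might be the start of the next object (or a CRC for the whole stream). A hypothetical reconstruction in bytes:

```python
import struct
import zlib

obj1 = b"\xa3foo"   # fixstr "foo"
obj2 = b"\xa3bar"   # fixstr "bar"

def crc_tag(data: bytes) -> bytes:
    # Proposed trailing tag: 0xc6 + big-endian uint32 CRC-32 (layout assumed)
    return struct.pack(">BI", 0xC6, zlib.crc32(data) & 0xFFFFFFFF)

# Writer A checksums each object; writer B checksums the whole stream.
stream_a = obj1 + crc_tag(obj1) + obj2 + crc_tag(obj2)
stream_b = obj1 + obj2 + crc_tag(obj1 + obj2)

# After reading obj1 (4 bytes), the reader must classify the next byte:
# in stream_a it is 0xc6 (a CRC for obj1), in stream_b it is 0xa3 (the
# start of obj2). Distinguishing the cases requires blocking for at
# least one more byte, which is the streaming objection.
next_a, next_b = stream_a[4], stream_b[4]
```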
Anyway, I agree to adding this to the perl implementation, and I entrust @gfx with that matter.
If not CRC_INLINED, use the zlib crc32() with word stepper and a CRC lookup table. With zlib the CRC overhead is only 2% for small buffers; previously it was 10%.

with -lz: $ pb benchmark/serialize.pl; pb benchmark/deserialize.pl

-- serialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
json: 1 wallclock secs ( 1.06 usr + 0.03 sys = 1.09 CPU) @ 136436.70/s (n=148716)
mp: 1 wallclock secs ( 1.02 usr + 0.01 sys = 1.03 CPU) @ 185578.64/s (n=191146)
mp_crc: 0 wallclock secs ( 1.06 usr + 0.00 sys = 1.06 CPU) @ 162293.40/s (n=172031)
storable: 1 wallclock secs ( 1.05 usr + 0.00 sys = 1.05 CPU) @ 91021.90/s (n=95573)

              Rate  storable  json  mp_crc    mp
storable   91022/s        --  -33%    -44%  -51%
json      136437/s       50%    --    -16%  -26%
mp_crc    162293/s       78%   19%      --  -13%
mp        185579/s      104%   36%     14%    --

-- deserialize
JSON::XS: 2.32
Data::MessagePack: 0.46_01
Storable: 2.35
Benchmark: running json, mp, mp_crc, storable for at least 1 CPU seconds...
json: 1 wallclock secs ( 1.03 usr + 0.01 sys = 1.04 CPU) @ 97302.88/s (n=101195)
mp: 1 wallclock secs ( 1.12 usr + 0.01 sys = 1.13 CPU) @ 126866.37/s (n=143359)
mp_crc: 1 wallclock secs ( 1.06 usr + 0.00 sys = 1.06 CPU) @ 124841.51/s (n=132332)
storable: 1 wallclock secs ( 1.07 usr + 0.00 sys = 1.07 CPU) @ 80387.85/s (n=86015)

              Rate  storable  json  mp_crc    mp
storable   80388/s        --  -17%    -36%  -37%
json       97303/s       21%    --    -22%  -23%
mp_crc    124842/s       55%   28%      --   -2%
mp        126866/s       58%   30%      2%    --
Do not use USE_SMAZ in crc branch
Admittedly I haven't looked at your implementation, but why not have the style be CRC -> object instead of object -> CRC? Then the processing works like this:
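The processing steps referenced here are missing from the transcript. A plausible sketch of the CRC-first ordering, under the assumed layout [0xc6][crc32 BE][object] (names hypothetical):

```python
import struct
import zlib

def pack_with_leading_crc(obj_bytes: bytes) -> bytes:
    # CRC -> object ordering: the reader sees 0xc6 first and knows a
    # checksummed object follows, so no lookahead is needed after the
    # object completes.
    return struct.pack(">BI", 0xC6,
                       zlib.crc32(obj_bytes) & 0xFFFFFFFF) + obj_bytes

def read_object(buf: bytes) -> bytes:
    if buf[0] == 0xC6:
        (crc,) = struct.unpack(">I", buf[1:5])
        obj = buf[5:]
        if zlib.crc32(obj) & 0xFFFFFFFF != crc:
            raise ValueError("checksum failed")
        return obj
    return buf  # unchecksummed object; a strict mode could reject this
```

In a true byte stream the reader would parse the object first and verify the remembered CRC afterwards, but the classification decision is made up front, before the object, rather than by peeking past it.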
Then, if you like, you could specify a flag to the unpacker that raises an exception if the checksum is not present for top-level objects in the stream, so that you never accept messages without a checksum.
We added a new ext type in the new spec. See #128 for discussion.