Marc edited this page Jul 26, 2018 · 11 revisions


The Cereal serialization format

The serialization format used by Cereal is simple: every value is stored as raw big-endian data, preceded by a single byte that identifies its data type.
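As a rough illustration of what "raw big-endian data" means, the sketch below appends an unsigned integer to a byte buffer with the most significant byte first. The helper name `writeBigEndian` is made up for this example and is not part of Cereal's API; the type byte that precedes the value is left out.

```cpp
#include <cstddef>
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical helper (not Cereal's API): append `value` to `out`,
// most significant byte first, i.e. in big-endian order.
template <typename T>
void writeBigEndian(std::vector<std::uint8_t>& out, T value)
{
    static_assert(std::is_unsigned<T>::value, "use an unsigned integer type");

    // Walk from the highest byte down to the lowest one.
    for (std::size_t i = sizeof(T); i-- > 0; )
        out.push_back(static_cast<std::uint8_t>(value >> (8 * i)));
}
```

With this helper, the 32-bit value 0x12345678 ends up in the buffer as the four bytes 0x12 0x34 0x56 0x78, regardless of the byte order of the host machine.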

std::string values are stored slightly differently: first comes a short (2 bytes) holding the number of characters in the string, followed by the characters themselves as single-byte ASCII, one after the other. The string Database is therefore encoded as 0x00 0x08 0x44 0x61 0x74 0x61 0x62 0x61 0x73 0x65 (length 8, then D, a, t, a, b, a, s, e).
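The string layout can be sketched in a few lines. `encodeString` is a hypothetical name chosen for this example, not a function from Cereal, and the sketch assumes the string fits in the 2-byte length.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper (not Cereal's API): encode a string as a 2-byte
// big-endian length followed by its characters, one byte each.
// Assumes s.size() fits in 16 bits; longer strings are not handled.
std::vector<std::uint8_t> encodeString(const std::string& s)
{
    std::vector<std::uint8_t> out;
    const auto length = static_cast<std::uint16_t>(s.size());

    out.push_back(static_cast<std::uint8_t>(length >> 8));   // high byte
    out.push_back(static_cast<std::uint8_t>(length & 0xFF)); // low byte
    out.insert(out.end(), s.begin(), s.end());                // characters
    return out;
}
```

Calling `encodeString("Database")` produces the ten bytes listed above: the length 0x00 0x08, then the eight ASCII characters.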

In Cereal, headers contain databases, databases contain objects, and objects contain fields and arrays. This hierarchy is how the data is organized.

Ownership follows the same hierarchy: deleting a header deletes its databases, deleting a database deletes its objects, and deleting an object deletes its fields and arrays. To free everything, use delete header; if you are using headers, or delete database; if you are using a standalone database without a header.

With that in place, we can look at each serialization unit:

Fields

Fields start with a byte that identifies them as a field (value 0x09). Next comes a short holding the length of the name, followed by the name in ASCII.
Finally, a byte indicates the data type, followed by the data itself (1 to 8 bytes, depending on the data type).
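Putting the pieces together, a field holding a 32-bit integer could be encoded roughly as follows. This is a sketch, not Cereal's implementation: `encodeInt32Field` is a made-up name, and the type ID `kTypeInt32` is a placeholder value, since the actual data type IDs are not listed on this page.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

constexpr std::uint8_t kFieldTag  = 0x09; // field marker, as described above
constexpr std::uint8_t kTypeInt32 = 0x04; // ASSUMPTION: placeholder type ID

// Hypothetical helper: tag, name length, name, data type, then the value.
Bytes encodeInt32Field(const std::string& name, std::int32_t value)
{
    Bytes out;
    out.push_back(kFieldTag);

    // Name: 2-byte big-endian length followed by the ASCII characters.
    const auto len = static_cast<std::uint16_t>(name.size());
    out.push_back(static_cast<std::uint8_t>(len >> 8));
    out.push_back(static_cast<std::uint8_t>(len & 0xFF));
    out.insert(out.end(), name.begin(), name.end());

    // Data type byte, then the 4 data bytes in big-endian order.
    out.push_back(kTypeInt32);
    const auto raw = static_cast<std::uint32_t>(value);
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>(raw >> shift));
    return out;
}
```

Under these assumptions, a field named id with the value 1 takes ten bytes: 0x09, 0x00 0x02, the two name characters, the type byte, and 0x00 0x00 0x00 0x01.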

Arrays

Arrays also start with an identifying byte (value 0x0A), for compatibility with fields and objects. After that comes a short holding the length of the name, followed by the bytes of the name.
Next comes the byte indicating the data type of the array's items, followed by four bytes holding the item count. The raw bytes of the items follow.

Use sizeof(data type) * count to compute the number of bytes the array's items occupy.
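The size calculation can be written down directly. The function names below are hypothetical, and the sketch assumes fixed-width item types (such as std::int32_t) so that sizeof on the host matches the serialized item size.

```cpp
#include <cstddef>
#include <cstdint>

// Bytes occupied by the items alone: sizeof(data type) * count.
template <typename T>
constexpr std::size_t arrayPayloadSize(std::uint32_t count)
{
    return sizeof(T) * static_cast<std::size_t>(count);
}

// Bytes occupied by the whole serialized array, following the layout above:
// 1 (tag) + 2 (name length) + name + 1 (data type) + 4 (item count) + items.
template <typename T>
constexpr std::size_t arrayTotalSize(std::size_t nameLength, std::uint32_t count)
{
    return 1 + 2 + nameLength + 1 + 4 + arrayPayloadSize<T>(count);
}
```

For example, ten 32-bit integers occupy 40 bytes of item data, and an array of three 16-bit values with a four-character name occupies 18 bytes in total.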

Objects

Again, objects start with a byte that identifies them as an object (value 0x0B). Next comes a string with the name of the object, followed by a short holding the object's field count. The fields follow, one after the other, each laid out as described above. Finally comes another short (the array count), followed by the arrays.
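The object layout can be sketched as a function that wraps already-encoded fields and arrays. `encodeObject` and `putShort` are hypothetical names for this example, not Cereal's API, and the sketch assumes the name length and both counts fit in 16 bits.

```cpp
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Append a 2-byte big-endian value.
void putShort(Bytes& out, std::uint16_t v)
{
    out.push_back(static_cast<std::uint8_t>(v >> 8));
    out.push_back(static_cast<std::uint8_t>(v & 0xFF));
}

// Hypothetical helper: `fields` and `arrays` hold units that are already
// serialized in the layouts described in the previous sections.
Bytes encodeObject(const std::string& name,
                   const std::vector<Bytes>& fields,
                   const std::vector<Bytes>& arrays)
{
    Bytes out;
    out.push_back(0x0B);                                   // object tag
    putShort(out, static_cast<std::uint16_t>(name.size())); // name length
    out.insert(out.end(), name.begin(), name.end());        // name

    putShort(out, static_cast<std::uint16_t>(fields.size())); // field count
    for (const Bytes& f : fields)
        out.insert(out.end(), f.begin(), f.end());

    putShort(out, static_cast<std::uint16_t>(arrays.size())); // array count
    for (const Bytes& a : arrays)
        out.insert(out.end(), a.begin(), a.end());
    return out;
}
```

An object with no fields and no arrays still carries both counts, so an empty object with a one-character name takes eight bytes.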

Databases

Databases start with a short (two bytes) describing the version. Currently, there are two versions for databases: 1.0 and 2.0, and they are represented as 0x0100 and 0x0200 respectively.

Next comes the string containing the name of the database, followed by four bytes holding the size of the database. This is why a database cannot be larger than four gigabytes: a four-byte unsigned value tops out at 2^32 - 1 bytes, just under 4 GiB.
Finally comes the object count as a short, followed by all the objects, laid out as described above.
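Reading works the same way in reverse. The sketch below parses only the leading part of a database (version, name, size, object count) and stops before the objects. The struct and function names are invented for this example, and the size is returned as stored, without assuming what exactly it covers.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container for the values that precede the objects.
struct DatabasePrefix {
    std::uint16_t version;     // e.g. 0x0100 or 0x0200
    std::string   name;
    std::uint32_t size;        // stored size of the database
    std::uint16_t objectCount;
};

// Read `n` bytes (n <= 4) at `pos` as a big-endian value and advance `pos`.
std::uint32_t readBE(const std::vector<std::uint8_t>& in, std::size_t& pos, std::size_t n)
{
    if (pos > in.size() || n > in.size() - pos)
        throw std::out_of_range("truncated database");
    std::uint32_t v = 0;
    for (std::size_t i = 0; i < n; ++i)
        v = (v << 8) | in[pos++];
    return v;
}

DatabasePrefix readDatabasePrefix(const std::vector<std::uint8_t>& in)
{
    std::size_t pos = 0;
    DatabasePrefix db;
    db.version = static_cast<std::uint16_t>(readBE(in, pos, 2));

    const std::size_t nameLen = readBE(in, pos, 2);
    if (nameLen > in.size() - pos)
        throw std::out_of_range("truncated database name");
    db.name.assign(in.begin() + pos, in.begin() + pos + nameLen);
    pos += nameLen;

    db.size = readBE(in, pos, 4);
    db.objectCount = static_cast<std::uint16_t>(readBE(in, pos, 2));
    return db;
}
```

Bounds are checked before every read, so truncated input raises std::out_of_range instead of reading past the end of the buffer.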

Headers

Headers start with a magic value, that is, a value used to check whether the data is a header. It may change in future releases.
After the magic value (the short 0x524D) comes the database count as a single byte, so a header can store up to 255 databases. Next comes an array of unsigned ints holding the offsets at which the databases start, followed by the databases themselves, laid out as described above.
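A reader can use the magic value to reject data that is not a header before touching anything else. The sketch below does that and then reads the offsets. It assumes each offset is stored as a plain 4-byte big-endian unsigned int directly after the count byte; that detail, like the function name `readHeaderOffsets`, is an assumption made for this example.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr std::uint16_t kHeaderMagic = 0x524D; // magic value described above

// Hypothetical helper: returns false if `in` does not start with the magic
// value or is too short; otherwise fills `offsets` with one entry per database.
bool readHeaderOffsets(const std::vector<std::uint8_t>& in,
                       std::vector<std::uint32_t>& offsets)
{
    if (in.size() < 3)
        return false;

    const auto magic = static_cast<std::uint16_t>((in[0] << 8) | in[1]);
    if (magic != kHeaderMagic)
        return false;

    const std::size_t count = in[2]; // one byte, so at most 255 databases
    if (in.size() - 3 < count * 4)   // ASSUMPTION: 4 bytes per offset
        return false;

    offsets.clear();
    for (std::size_t i = 0; i < count; ++i) {
        const std::size_t p = 3 + i * 4;
        offsets.push_back((static_cast<std::uint32_t>(in[p])     << 24) |
                          (static_cast<std::uint32_t>(in[p + 1]) << 16) |
                          (static_cast<std::uint32_t>(in[p + 2]) <<  8) |
                           static_cast<std::uint32_t>(in[p + 3]));
    }
    return true;
}
```

Each returned offset can then be used as the starting position for reading one database.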