What was the design reason for a struct streaming format, opposed to streaming data per column? #443

skinkie · 2020-04-19T22:40:46Z

I have been aware of NDS for a while, but not of their open source projects, so I am positively surprised by them. First of all, I am not here to troll or be a zealot; I just want to ask a question and get informed. It seems that zserio, which serialises similarly to protobuf, works like any traditional database: as a row store, and so it does not take into account the ability to compress data better by looking at it per column. As an example:

struct Employee
{
    uint8   age;
    string  name;
    uint16  salary;
    Role    role;
};

Would become:

Fixed width types:
uint8[] age;
uint16[] salary;
Role[] role;

Variable width types:
uint32[] name_index; (or delta encoded)
string names;
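
To make the transformation above concrete, here is a minimal Python sketch of turning a list of rows into the column layout. It is only an illustration and is not generated by or part of zserio: the names `EmployeeColumns`, `to_columns` and `name_at` are hypothetical, `role` is stored as a plain integer because the `Role` definition is not shown, and `name_index` is assumed to hold the start offset of each name in one concatenated UTF-8 buffer.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Employee:
    age: int
    name: str
    salary: int
    role: int  # assumption: stands in for Role, whose definition is not shown


@dataclass
class EmployeeColumns:
    age: List[int]
    salary: List[int]
    role: List[int]
    name_index: List[int]  # start offset of each name inside `names`
    names: bytes           # all names concatenated as UTF-8


def to_columns(rows: List[Employee]) -> EmployeeColumns:
    """Regroup row-wise records into one list per field."""
    cols = EmployeeColumns([], [], [], [], b"")
    blob = bytearray()
    for e in rows:
        cols.age.append(e.age)
        cols.salary.append(e.salary)
        cols.role.append(e.role)
        cols.name_index.append(len(blob))  # where this name starts
        blob += e.name.encode("utf-8")
    cols.names = bytes(blob)
    return cols


def name_at(cols: EmployeeColumns, i: int) -> str:
    """Recover the i-th name from the offsets and the shared buffer."""
    start = cols.name_index[i]
    if i + 1 < len(cols.name_index):
        end = cols.name_index[i + 1]
    else:
        end = len(cols.names)
    return cols.names[start:end].decode("utf-8")
```

Storing offsets rather than lengths keeps lookup of a single name cheap; delta encoding the offsets, as mentioned above, would trade that for smaller values.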

Is there a technical reason to distribute the data per row, as opposed to per type (column)?

Maybe this answer could be added to the FAQ.

Answered by fklebert

mikir · 2020-04-20T08:50:09Z

Maintainer

Thanks a lot for your interest in zserio and for the question.

To answer your question, we first need to clarify what 'row' means. Do you mean that if you put Employee into an array

struct EmployeeArray
{
    Employee employees[];
};

and use zserio to serialize it, you will get binary data distributed per row? That is, the binary data will start with the employees[0] data, followed by the employees[1] data, and so on, rather than employees.age[0], employees.age[1], ..., employees.name[0], employees.name[1], ...?

Or do you mean SQL tables?

sql_table EmployeeTable
{
    Employee employee;
};
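
To illustrate the two array orderings described above (per element versus per field), here is a rough Python sketch. It does not reproduce zserio's actual encoding: the sample data, the function names and the use of `struct` with big-endian, unpadded fields are assumptions chosen only to show the byte order, and `name` and `role` are left out to keep it short.

```python
import struct

# Hypothetical sample data: (age, salary) pairs
employees = [(30, 5000), (41, 6200), (25, 4100)]


def pack_per_element(items):
    """employees[0], employees[1], ...: fields of one element are adjacent."""
    return b"".join(struct.pack(">BH", age, salary) for age, salary in items)


def pack_per_field(items):
    """age[0], age[1], ..., salary[0], salary[1], ...: one run per field."""
    ages = struct.pack(f">{len(items)}B", *(age for age, _ in items))
    salaries = struct.pack(f">{len(items)}H", *(sal for _, sal in items))
    return ages + salaries


row_bytes = pack_per_element(employees)
col_bytes = pack_per_field(employees)
print(row_bytes.hex(" "))
print(col_bytes.hex(" "))
```

Both buffers contain the same bytes in a different order, so the uncompressed size is identical; any gain from the per-field order comes from what a compressor or reader can do with the grouping.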

fklebert · 2020-04-20T09:17:27Z

Maintainer

A column-store concept is certainly appealing, since it has a couple of benefits, as you already mentioned: compression, read performance for fixed-width arrays, and others.

One of the advertised benefits of zserio is its "zero serialization overhead", which means that we do not impose a wire format. Such a wire-format description would be needed in order to write the schema in a readable, struct-like form while storing the data optimally in columns or other structures.

zserio gives you the opportunity to write your own schema however you like. So you can simply convert your struct Employee (which you would later store in an array) into a column store like

struct EmployeeList
{
  uint16 entries;
  uint8 age[entries];
  uint16 salary[entries];
...
};

So zserio actually allows both design paradigms: column store and row store. It is basically up to you to implement them in your schema. Of course, your application will have to deal with a little bit of overhead with the column-store approach, since there will be no generated class Employee in the end and you will have to assemble such objects on your own.
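
As a rough idea of the application-side overhead mentioned above, the following Python sketch reassembles a single employee from the columns by hand. The classes here are plain hand-written stand-ins and not the code that zserio generates; `employee_at` is a hypothetical helper, and only the two fields spelled out in the schema snippet are modelled.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EmployeeList:
    """Stand-in for the column-style schema: one list per field."""
    age: List[int]
    salary: List[int]

    @property
    def entries(self) -> int:
        # In the schema this is an explicit uint16 field; here it is derived.
        return len(self.age)


@dataclass
class Employee:
    """Row view that the application has to define itself."""
    age: int
    salary: int


def employee_at(lst: EmployeeList, i: int) -> Employee:
    """Gather the i-th value of every column into one object."""
    if not 0 <= i < lst.entries:
        raise IndexError(i)
    return Employee(age=lst.age[i], salary=lst.salary[i])
```

For example, `employee_at(EmployeeList([30, 41], [5000, 6200]), 1)` gathers the second entry of each column into one `Employee`.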

But I agree that we may want to update the FAQs in that respect in the future.

skinkie · 2020-04-20T13:10:29Z

Author

@mikir I hope that @fklebert has made it a bit clearer. Row-oriented file formats take a C struct as a database row and transfer it somewhat compacted, as long as the varchar case is handled well. But some properties, such as efficient random access to individual attributes, are lost. If a file format were somehow column-aware, for example when whole blocks of information are transferred, it could be much more efficient to group the data per attribute (column) as opposed to per object (row/struct). Once you want to do a real-time exchange of individual properties, the column format would fall back to single-element columns (hence: a row).
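
The random-access point can be sketched in Python as follows. The byte layouts are made up for the illustration and are not zserio's: each row is assumed to be age (1 byte), name length (1 byte), UTF-8 name, salary (2 bytes, big-endian), with names shorter than 256 bytes, and all function names are hypothetical.

```python
import struct


def encode_rows(rows):
    """Row layout: age, name length, name bytes, salary, repeated per row."""
    out = bytearray()
    for age, name, salary in rows:
        raw = name.encode("utf-8")
        out += struct.pack(">BB", age, len(raw)) + raw + struct.pack(">H", salary)
    return bytes(out)


def salary_from_rows(buf, i):
    """Variable-width names force a walk over all preceding rows."""
    pos = 0
    for _ in range(i):
        name_len = buf[pos + 1]
        pos += 2 + name_len + 2  # skip age, length byte, name, salary
    name_len = buf[pos + 1]
    return struct.unpack_from(">H", buf, pos + 2 + name_len)[0]


def encode_salary_column(rows):
    """Column layout: all salaries back to back, 2 bytes each."""
    return struct.pack(f">{len(rows)}H", *(salary for _, _, salary in rows))


def salary_from_column(col, i):
    """Fixed-width column: the offset is computed directly."""
    return struct.unpack_from(">H", col, 2 * i)[0]
```

With the row layout the cost of reading one salary grows with its index, while the fixed-width column needs a single offset computation. For a one-element exchange both layouts degenerate to the same thing.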
