Huge data folder size overhead in multiple overwrite workload #21

Closed
nsentinel opened this issue Apr 10, 2017 · 7 comments

@nsentinel commented Apr 10, 2017

We use DBreeze 1.083.2017.0302 in our product with a custom serializer that turns data into byte arrays, and we manually split the data across a few tables (external logic, 6 tables right now). All tables consist of key (int) / byte[] (custom serializer output) pairs.

The workload can be:

  1. Insert a new "batch" of rows (rare).
  2. Remove a "batch" of rows and insert new ones with different keys (very often).
  3. Overwrite all or part of the rows with new data (varies, but can be very often).
  • Write size can vary significantly, but a rough average is about 5-15 KB per row in a write "batch".
  • Each write/insert/update request is wrapped in an individual transaction per row, so these are not real batches, just logically grouped operations (see the sketch at the end of this comment).
  • All data in this workload goes to one of the "manual" tables.
  • The Technical_SetTable_OverwriteIsNotAllowed option is not set and has its default value.

We noticed that the DBreeze data folder is currently around 7 GB, but the stored data itself only amounts to about 75 MB (2,466,476 rows).

Are we missing some option or configuration for our workload, or do we have to implement some kind of "VACUUM" command ourselves?
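
For reference, here is a minimal sketch of our write pattern (table name, engine path, and batch contents are illustrative, not our production code):

```csharp
using System;
using DBreeze;

class WorkloadSketch
{
    static void Main()
    {
        using (var engine = new DBreezeEngine(@"C:\data\dbreeze"))
        {
            // Rows are only logically grouped into a "batch";
            // each row is committed in its own transaction.
            foreach (var (key, payload) in LoadBatch())
            {
                using (var tran = engine.GetTransaction())
                {
                    tran.Insert<int, byte[]>("t1", key, payload);
                    tran.Commit();
                }
            }
        }
    }

    // Stand-in for our external logic producing key / serialized-value pairs.
    static (int Key, byte[] Payload)[] LoadBatch() =>
        new[] { (1, new byte[5000]), (2, new byte[12000]) };
}
```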

@hhblaze (Owner) commented Apr 10, 2017

Hi.
Repack your data into an empty table to get back to the initial size (select from one table and insert into another, copying at the byte[] level); that will be your "vacuum" and will restore the table size (a sketch is below).
Then try leaving "Technical_SetTable_OverwriteIsNotAllowed" disabled (just don't call it) and sort your batch ascending by key in memory before inserting (have a look at the integrated sorter, tran.RandomKeySorter; see the documentation).
If the speed of the batch insert is not satisfactory, then try to use "Technical_SetTable_OverwriteIsNotAllowed" after the batch is sorted in memory.
If the size is still too big, then the algorithm must be rethought.
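
A minimal sketch of such a repack (the table names "t_old" and "t_new" and the engine path are placeholders; error handling omitted):

```csharp
using DBreeze;

// Copy all rows from the bloated table into a fresh one at the byte[] level.
// Afterwards, switch your application logic over to the new table.
using (var engine = new DBreezeEngine(@"C:\data\dbreeze"))
using (var tran = engine.GetTransaction())
{
    // Reserve both tables for use inside this transaction.
    tran.SynchronizeTables("t_old", "t_new");

    foreach (var row in tran.SelectForward<int, byte[]>("t_old"))
        tran.Insert<int, byte[]>("t_new", row.Key, row.Value);

    tran.Commit();
}
```

Since the copy runs in ascending key order into an empty table, calling tran.Technical_SetTable_OverwriteIsNotAllowed("t_new") before the inserts should also be safe and can speed the copy up.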

@nsentinel (Author) commented Apr 10, 2017

Hi,
thanks for the quick response!

A couple of notes:

  • We have not used Technical_SetTable_OverwriteIsNotAllowed from the beginning (actually, I only read about it today while digging through the documentation, so this option has always been in its default state).
  • Keys are always in ascending order. We use DBreeze as a complementary DB engine to SQL Server, so our keys are the values of the ID column of the corresponding SQL table, which is an identity/auto-increment (1,1) field, so keys always grow in ascending order.

Can you briefly explain (or point to the source code or docs) how the engine decides when to overwrite in place versus remove and insert, and how removal is implemented when other records follow the deleted row (an update in the middle)?

@nsentinel (Author)

Do I understand correctly that there are no additional actions for a removal in the middle (for performance reasons), so if we need such behavior we have to implement it ourselves, i.e. read the tail, remove it, and store the update together with the unmodified parts (sketched below)? That would be similar to the logic of RandomKeySorter that you pointed to.
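
To make the idea concrete, here is a rough sketch of what I mean (assuming engine, updatedKey, and newPayload are already defined; whether this actually reclaims space depends on the engine internals):

```csharp
using System.Collections.Generic;
using System.Linq;
using DBreeze;

// Manual "update in the middle": read the tail that follows the updated key,
// remove it, then re-insert the update plus the tail in ascending key order.
using (var tran = engine.GetTransaction())
{
    tran.SynchronizeTables("t1");

    var tail = tran.SelectForwardStartFrom<int, byte[]>("t1", updatedKey, false)
                   .Select(r => new KeyValuePair<int, byte[]>(r.Key, r.Value))
                   .ToList();

    foreach (var kv in tail)
        tran.RemoveKey<int>("t1", kv.Key);

    tran.Insert<int, byte[]>("t1", updatedKey, newPayload);

    foreach (var kv in tail)
        tran.Insert<int, byte[]>("t1", kv.Key, kv.Value);

    tran.Commit();
}
```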

@hhblaze (Owner) commented Apr 10, 2017

Maybe you can try to write a small program emulating the behavior of your insert/insert/update logic that demonstrates the performance or size problem. It is very difficult for me to discuss this without such examples.
If you want, we can continue the discussion in our native language.

@nsentinel (Author)

Thank you for the answer!

I dug into the source code a bit and can say that I was probably right in my assumptions. I think we can close this for now. We will rethink our save logic, and if the issue arises again we can reopen it.

@hhblaze (Owner) commented Apr 10, 2017

Remove in DBreeze is only a "logical" operation: the data stays physically inside the file, and only the search keys are rewritten.

@nsentinel (Author)

Thanks for the clarification.
