Huge data folder size overhead in multiple overwrite workload #21
Hi.
Hi, a couple of notes:
Can you briefly explain (or point to the source code or docs) how the engine decides when to overwrite a record in place versus remove and re-insert it, and how removal is implemented when other records follow the deleted row (an update in the middle)?
Do I understand correctly that there are no additional actions for removing in the middle (for performance reasons), so if we need such behavior we have to implement it ourselves, i.e. read the tail, remove it, and store the update together with the unmodified parts? Similar to the logic of RandomKeySorter, as you pointed out.
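A minimal sketch of what that manual approach could look like, using DBreeze's byte[] API as described in this issue. The table name, key, patch offset, and the GetPatch helper are all hypothetical, and whether a same-length value is actually rewritten in place is an assumption, not confirmed DBreeze behavior:

```csharp
// engine is an already opened DBreezeEngine instance.
using (var tran = engine.GetTransaction())
{
    var row = tran.Select<int, byte[]>("t1", 42);
    if (row.Exists)
    {
        byte[] value = row.Value;
        byte[] patch = GetPatch(); // hypothetical: the changed middle part
        // Overwrite bytes starting at offset 100 with the patch, keeping
        // the unmodified head and tail so the total length stays the same
        // (assumption: an unchanged length allows an in-place overwrite).
        Buffer.BlockCopy(patch, 0, value, 100, patch.Length);
        tran.Insert<int, byte[]>("t1", 42, value);
    }
    tran.Commit();
}
```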
Maybe you can try to write a small program emulating your insert/insert/update logic that demonstrates the performance or size problem. It is very difficult to discuss this without such an example.
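A small repro along these lines might look as follows. This is only a sketch with a hypothetical folder path, table name, and payload sizes; it repeatedly overwrites the same int keys with byte[] values of varying length:

```csharp
using System;
using DBreeze;

class OverwriteRepro
{
    static void Main()
    {
        // Hypothetical folder; DBreeze stores its files in this directory.
        using (var engine = new DBreezeEngine(@"D:\DBreezeRepro"))
        {
            var rnd = new Random(42);
            // Many passes over the same key range: every pass overwrites
            // all rows, emulating the insert/insert/update workload.
            for (int pass = 0; pass < 100; pass++)
            {
                using (var tran = engine.GetTransaction())
                {
                    for (int key = 0; key < 10000; key++)
                    {
                        // Varying payload length means the new value often
                        // cannot reuse the old slot on disk.
                        var payload = new byte[rnd.Next(100, 2000)];
                        tran.Insert<int, byte[]>("t1", key, payload);
                    }
                    tran.Commit();
                }
            }
            // Compare the folder size on disk against rows * average
            // payload to quantify the overhead after each run.
        }
    }
}
```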
Thank you for the answer! I dug into the source code a bit and can say that my assumptions are probably right. I think we can close this for now. We will rethink our save logic, and if the issue arises again we can reopen it.
Remove in DBreeze is only a "logical" operation: the data stays physically inside the file; only the search keys are rewritten.
Thanks for the clarification.
We use DBreeze 1.083.2017.0302 in our product with a custom serializer that turns data into a byte array, and we manually split the data across a few tables (external logic, 6 tables right now). All tables consist of key (int) and value (byte[], the custom serializer's output) pairs.
The workload consists of multiple overwrites of the same rows (insert/insert/update). We noticed that the DBreeze folder size is currently around 7 GB, while the stored data itself amounts to only about 75 MB (2,466,476 rows).
Are we missing some option or configuration for this workload, or do we have to implement some kind of "VACUUM" command ourselves?
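For context, if a compaction step turns out to be necessary, a manual "VACUUM" could be sketched as copying the live rows into a fresh engine and swapping folders on disk afterwards. The paths and table name below are hypothetical (the real setup would loop over all six tables), and the swap must happen while no transactions are open; this is an assumption-laden sketch, not a built-in DBreeze command:

```csharp
using DBreeze;

// Copy only the live rows into a fresh engine, then swap folders on disk.
using (var source = new DBreezeEngine(@"D:\Db"))
using (var compacted = new DBreezeEngine(@"D:\DbCompacted"))
{
    using (var read = source.GetTransaction())
    using (var write = compacted.GetTransaction())
    {
        // SelectForward enumerates the currently reachable rows, so
        // removed/superseded data is left behind in the old files.
        foreach (var row in read.SelectForward<int, byte[]>("t1"))
            write.Insert<int, byte[]>("t1", row.Key, row.Value);
        write.Commit();
    }
}
// After both engines are disposed, replace D:\Db with D:\DbCompacted.
```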