Large database size #75

Open
Lupino opened this issue May 25, 2018 · 6 comments

Comments

@Lupino

Lupino commented May 25, 2018

lmj@Menjun:data$ du -sh *
 12G 	data
5.0M	dump.db
 22G	index
8.0K	meta

I use haskey in my project: https://github.com/Lupino/haskell-periodic/blob/master/periodic-server/src/Periodic/Server/Scheduler.hs#L119

After running for three days, haskey created files larger than 30GB, but the real data is only 5M across 57028 items.

Please help me with how to control the file size of haskey.

My schema is:

type JobTree = Tree JobName Job
type ProcTree = Tree ByteString Job

data Schema = Schema
  { _schemaJobTrees  :: Tree FuncName JobTree
  , _schemaProcTrees :: Tree FuncName ProcTree
  } deriving (Generic, Show, Typeable)

instance Binary Schema
instance Value Schema
instance Root Schema
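
For context, the imports such a schema would typically need look roughly like the sketch below; the module names are an assumption based on haskey's documented layout, and JobName, Job and FuncName are project types defined elsewhere in periodic-server:

{-# LANGUAGE DeriveGeneric #-}

-- Assumed imports for the schema above (module names per haskey's
-- documentation; JobName, Job and FuncName come from the project itself).
import Data.Binary           (Binary)
import Data.ByteString       (ByteString)
import Data.Typeable         (Typeable)
import GHC.Generics          (Generic)

import Data.BTree.Impure     (Tree)
import Data.BTree.Primitives (Value)
import Database.Haskey.Alloc.Concurrent (Root)
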
@hverr
Member

hverr commented May 28, 2018

This should not be happening, I'll take a look!

@Lupino
Author

Lupino commented May 29, 2018

I made a new test:

  • First I pushed 200000 filenames (every filename is less than 256 bytes) into haskey
  • Second I read, write, and delete more than 2 times every second (see the sketch below)
  • I recorded the file sizes at every step
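
Roughly, the workload at the tree level looks like the sketch below (the transaction and file-store plumbing is omitted; the Data.BTree.Impure names and signatures are assumed from how insert/insertMany/delete are used elsewhere in this thread, and B.lookup is likewise only an assumed name):

-- Sketch of the test workload, assuming Data.BTree.Impure is imported
-- qualified as B and that ByteString has the required Key/Value instances.
import Control.Monad (foldM)
import Data.ByteString (ByteString)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

-- Step 1: push all filenames (here keyed and valued by the name itself).
pushNames :: AllocM n => [ByteString] -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
pushNames names tree = foldM (\t name -> B.insert name name t) tree names

-- Step 2: one read/write/delete cycle for a single filename.
cycleOnce :: (AllocM n, AllocReaderM n) => ByteString -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
cycleOnce name tree = do
    _     <- B.lookup name tree        -- read (assumed lookup function)
    tree' <- B.insert name name tree   -- write
    B.delete name tree'                -- delete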

2018-05-29 17:32:22

total records: 193521

counting the record size took 53.719s

file sizes:

$ du -sh *
320M	data
128M	index
8.0K	meta
20M	dump.db

2018-05-29 18:04:57

total records: 184738

counting the record size took 4.045s

file sizes:

$ du -sh *
512M	data
 20M	dump.db
 20M	dump1.db
144M	index
8.0K	meta

2018-05-30 09:00:49

total records: 132516

counting the record size took 5.798s

file sizes:

$ du -sh *
1.6G	data
 20M	dump.db
 20M	dump1.db
 15M	dump2.db
208M	index
8.0K	meta

2018-05-31 17:14:24

total records: 19718

counting the record size took 0.484s

file sizes:

$ du -sh *
4.0G	data
 20M	dump.db
 20M	dump1.db
 15M	dump2.db
2.1M	dump3.db
1.2G	index
8.0K	meta

@hverr
Member

hverr commented Jun 2, 2018

I wasn't able to find it yet, but here are some remarks:

  • You are using nested trees (advanced but certainly possible). Do you ever:

    1. directly delete an element in your schemaProcTrees, e.g. updateProcTree (B.delete funcName)?
    2. do the same for schemaJobTrees?

    (A sketch of both deletion patterns follows after this list.)
  • Do you have a minimal example I can run that automatically generates input data and performs the insertions/deletions?
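
For clarity, the two deletion patterns meant above look roughly like the sketch below; deleteWholeProcTree and deleteProcEntry are hypothetical names for illustration only, and the Data.BTree.Impure signatures and Key/Value instances are assumptions:

-- Sketch of both deletion patterns, assuming Data.BTree.Impure is imported
-- qualified as B; updateProcTree is the project's own helper and not shown.
import Data.ByteString (ByteString)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import qualified Data.BTree.Impure as B

-- (1) Drop the whole nested ProcTree stored under a function name:
deleteWholeProcTree :: AllocM n => FuncName -> Schema -> n Schema
deleteWholeProcTree funcName schema = do
    trees' <- B.delete funcName (_schemaProcTrees schema)
    return schema { _schemaProcTrees = trees' }

-- (2) Delete a single entry inside the nested tree of one function name,
--     writing the updated sub-tree back into the outer tree:
deleteProcEntry :: (AllocM n, AllocReaderM n) => FuncName -> ByteString -> Schema -> n Schema
deleteProcEntry funcName key schema = do
    mTree <- B.lookup funcName (_schemaProcTrees schema)
    case mTree of
        Nothing   -> return schema
        Just tree -> do
            tree'  <- B.delete key tree
            trees' <- B.insert funcName tree' (_schemaProcTrees schema)
            return schema { _schemaProcTrees = trees' }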

@Lupino
Author

Lupino commented Jun 3, 2018

This is my minimal example

https://github.com/Lupino/test-haskey

hverr changed the title from "how to alloc file size?" to "Large database size" on Jul 26, 2018
@hverr
Member

hverr commented Jul 26, 2018

Sorry for the delay, busy times.

I've done some experimenting:

  1. Stage 1: Inserting: When inserting all ~200000 files (~20MB) using insert in lexicographical order (one filename per commit), the resulting number of leaf pages (the pages actually containing data) is ~80000, i.e. on average 2.5 key-value pairs (2x100 bytes) per page. Each page is 4kB, so the total resulting data file (containing all leaf pages) is 80k pages * 4kB/page = 320MB. The low amount of data per leaf page (2.5 key-value pairs/page, where it should really be around 10-15 key-value pairs/page) explains the explosion of the database size.

    This low ratio is caused by inefficient splitting of the underlying B+-tree nodes, which in turn is caused by inserting the file names in lexicographical order and using insert instead of insertMany. Tip: shuffle the file names before inserting them, and insert them in batches using insertMany (experiment with the batch size); a sketch of this batching follows after the conclusion below.

    When only shuffling the file names before inserting them (still using insert), the final data file size is 200MB, with an average of 3.7 key-value pairs/page (still not optimal).

  2. Stage 2: Removing: Removing the entries frees all the pages that were used to store the data. The database needs to keep track of all free pages so it can reallocate them later, which explains the further increase of the database size when removing all entries.

Conclusion: Haskey does not allocate pages optimally; to fix this, haskey would have to do some cleanup/garbage collection between operations.
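
A sketch of the shuffle-and-batch tip from stage 1 (insertShuffled and the batch size of 1000 are illustrative only; it assumes insertMany accepts a Map of key-value pairs, that the name list has already been shuffled, e.g. with shuffleM from the random-shuffle package, and that each batch would get its own commit, which is omitted here):

-- Sketch of "shuffle the file names, then insert them in batches with
-- insertMany". Assumes Data.BTree.Impure is imported qualified as B and
-- that ByteString has the required Key/Value instances.
import Control.Monad (foldM)
import Data.ByteString (ByteString)
import qualified Data.Map as M

import Data.BTree.Alloc (AllocM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

insertShuffled :: AllocM n => [ByteString] -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
insertShuffled shuffledNames tree = foldM (flip B.insertMany) tree batches
  where
    batchSize = 1000  -- arbitrary; experiment with this
    batches   = map (M.fromList . map (\name -> (name, name)))
                    (chunksOf batchSize shuffledNames)

    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t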

yairchu added a commit to lamdu/lamdu that referenced this issue Sep 30, 2018
See haskell-haskey/haskey#75
which recommends using `insertMany` over `insert`.
@drchaos

drchaos commented Jan 31, 2020

I've run into the same issue.
I've built a queue with persistent file storage. The key for the B+-tree is the current timestamp; I insert entries one by one, then get the minimum key and remove it. No batch operations.
Is there a workaround to do a clean-up from time to time?
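
A sketch of the queue pattern described above, at the tree level only (transaction plumbing omitted); push and popMin are illustrative names, and it assumes Data.BTree.Impure exposes insert/delete/toList with the obvious signatures and that toList returns entries in ascending key order:

-- Sketch of a timestamp-keyed queue as described: insert one entry at a time,
-- then take and remove the entry with the smallest key. Taking the head of
-- toList is only a stand-in for a proper minimum lookup.
import Data.ByteString (ByteString)
import Data.Int (Int64)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

push :: AllocM n => Int64 -> ByteString -> Tree Int64 ByteString -> n (Tree Int64 ByteString)
push ts payload = B.insert ts payload

popMin :: (AllocM n, AllocReaderM n) => Tree Int64 ByteString -> n (Maybe ByteString, Tree Int64 ByteString)
popMin tree = do
    entries <- B.toList tree
    case entries of
        []         -> return (Nothing, tree)
        (k, v) : _ -> do
            tree' <- B.delete k tree
            return (Just v, tree')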
