Large database size #75

Open
Lupino opened this issue May 25, 2018 · 6 comments

Comments

@Lupino

Lupino commented May 25, 2018

lmj@Menjun:data$ du -sh *
 12G 	data
5.0M	dump.db
 22G	index
8.0K	meta

I use haskey in my project: https://github.com/Lupino/haskell-periodic/blob/master/periodic-server/src/Periodic/Server/Scheduler.hs#L119

After running for three days, haskey created files larger than 30GB, but the real data is only 5M across 57028 items.

Please help me with how to control the file size of haskey.

My schema is:

type JobTree = Tree JobName Job
type ProcTree = Tree ByteString Job

data Schema = Schema
  { _schemaJobTrees  :: Tree FuncName JobTree
  , _schemaProcTrees :: Tree FuncName ProcTree
  } deriving (Generic, Show, Typeable)

instance Binary Schema
instance Value Schema
instance Root Schema
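
For context, the imports such a schema would typically need look roughly like the sketch below; the module names are an assumption based on haskey's documented layout, and JobName, Job and FuncName are project types defined elsewhere in periodic-server:

{-# LANGUAGE DeriveGeneric #-}

-- Assumed imports for the schema above (module names per haskey's
-- documentation; JobName, Job and FuncName come from the project itself).
import Data.Binary           (Binary)
import Data.ByteString       (ByteString)
import Data.Typeable         (Typeable)
import GHC.Generics          (Generic)

import Data.BTree.Impure     (Tree)
import Data.BTree.Primitives (Value)
import Database.Haskey.Alloc.Concurrent (Root)
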
@hverr
Member

hverr commented May 28, 2018

This should not be happening, I'll take a look!

@Lupino
Author

Lupino commented May 29, 2018

I made a new test:

  • First I pushed 200000 filenames (every filename is less than 256 bytes) into haskey
  • Second I read, write, and delete more than 2 times every second (see the sketch below)
  • I recorded the file sizes at every step
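
Roughly, the workload at the tree level looks like the sketch below (the transaction and file-store plumbing is omitted; the Data.BTree.Impure names and signatures are assumed from how insert/insertMany/delete are used elsewhere in this thread, and B.lookup is likewise only an assumed name):

-- Sketch of the test workload, assuming Data.BTree.Impure is imported
-- qualified as B and that ByteString has the required Key/Value instances.
import Control.Monad (foldM)
import Data.ByteString (ByteString)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

-- Step 1: push all filenames (here keyed and valued by the name itself).
pushNames :: AllocM n => [ByteString] -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
pushNames names tree = foldM (\t name -> B.insert name name t) tree names

-- Step 2: one read/write/delete cycle for a single filename.
cycleOnce :: (AllocM n, AllocReaderM n) => ByteString -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
cycleOnce name tree = do
    _     <- B.lookup name tree        -- read (assumed lookup function)
    tree' <- B.insert name name tree   -- write
    B.delete name tree'                -- delete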

2018-05-29 17:32:22

total records: 193521

counting the record size took 53.719s

file sizes:

$ du -sh *
320M	data
128M	index
8.0K	meta
20M	dump.db

2018-05-29 18:04:57

total records: 184738

counting the record size took 4.045s

file sizes:

$ du -sh *
512M	data
 20M	dump.db
 20M	dump1.db
144M	index
8.0K	meta

2018-05-30 09:00:49

total records: 132516

counting the record size took 5.798s

file sizes:

$ du -sh *
1.6G	data
 20M	dump.db
 20M	dump1.db
 15M	dump2.db
208M	index
8.0K	meta

2018-05-31 17:14:24

total records: 19718

counting the record size took 0.484s

file sizes:

$ du -sh *
4.0G	data
 20M	dump.db
 20M	dump1.db
 15M	dump2.db
2.1M	dump3.db
1.2G	index
8.0K	meta

@hverr
Member

hverr commented Jun 2, 2018

I wasn't able to find it yet, but here are some remarks:

  • You are using nested trees (advanced but certainly possible). Do you ever:

    1. directly delete an element in your schemaProcTrees, e.g. updateProcTree (B.delete funcName)?
    2. do the same for schemaJobTrees?

    (A sketch of both deletion patterns follows after this list.)
  • Do you have a minimal example I can run that automatically generates input data and performs the insertions/deletions?
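
For clarity, the two deletion patterns meant above look roughly like the sketch below; deleteWholeProcTree and deleteProcEntry are hypothetical names for illustration only, and the Data.BTree.Impure signatures and Key/Value instances are assumptions:

-- Sketch of both deletion patterns, assuming Data.BTree.Impure is imported
-- qualified as B; updateProcTree is the project's own helper and not shown.
import Data.ByteString (ByteString)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import qualified Data.BTree.Impure as B

-- (1) Drop the whole nested ProcTree stored under a function name:
deleteWholeProcTree :: AllocM n => FuncName -> Schema -> n Schema
deleteWholeProcTree funcName schema = do
    trees' <- B.delete funcName (_schemaProcTrees schema)
    return schema { _schemaProcTrees = trees' }

-- (2) Delete a single entry inside the nested tree of one function name,
--     writing the updated sub-tree back into the outer tree:
deleteProcEntry :: (AllocM n, AllocReaderM n) => FuncName -> ByteString -> Schema -> n Schema
deleteProcEntry funcName key schema = do
    mTree <- B.lookup funcName (_schemaProcTrees schema)
    case mTree of
        Nothing   -> return schema
        Just tree -> do
            tree'  <- B.delete key tree
            trees' <- B.insert funcName tree' (_schemaProcTrees schema)
            return schema { _schemaProcTrees = trees' }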

@Lupino
Author

Lupino commented Jun 3, 2018

This is my minimal example

https://github.com/Lupino/test-haskey

hverr changed the title from "how to alloc file size?" to "Large database size" on Jul 26, 2018
@hverr
Member

hverr commented Jul 26, 2018

Sorry for the delay, busy times.

I've done some experimenting:

  1. Stage 1: Inserting: When inserting all ~200000 files (~20MB) using insert in lexicographical order (one filename per commit), the resulting number of leaf pages (the pages actually containing data) is ~80000, i.e. on average 2.5 key-value pairs (2x100 bytes) per page. Each page is 4kB, so the total resulting data file (containing all leaf pages) is 80k pages * 4kB/page = 320MB. The low amount of data per leaf page (2.5 key-value pairs/page, where it should really be around 10-15 key-value pairs/page) explains the explosion of the database size.

    This low ratio is caused by inefficient splitting of the underlying B+-tree nodes, which in turn is caused by inserting the file names in lexicographical order and using insert instead of insertMany. Tip: shuffle the file names before inserting them, and insert them in batches using insertMany (experiment with the batch size); a sketch of this batching follows after the conclusion below.

    When only shuffling the file names before inserting them (still using insert), the final data file size is 200MB, with an average of 3.7 key-value pairs/page (still not optimal).

  2. Stage 2: Removing: Removing the entries frees all the pages that were used to store the data. The database needs to keep track of all free pages so it can reallocate them later, which explains the further increase of the database size when removing all entries.

Conclusion: Haskey does not allocate pages optimally; to fix this, haskey would have to do some cleanup/garbage collection between operations.
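
A sketch of the shuffle-and-batch tip from stage 1 (insertShuffled and the batch size of 1000 are illustrative only; it assumes insertMany accepts a Map of key-value pairs, that the name list has already been shuffled, e.g. with shuffleM from the random-shuffle package, and that each batch would get its own commit, which is omitted here):

-- Sketch of "shuffle the file names, then insert them in batches with
-- insertMany". Assumes Data.BTree.Impure is imported qualified as B and
-- that ByteString has the required Key/Value instances.
import Control.Monad (foldM)
import Data.ByteString (ByteString)
import qualified Data.Map as M

import Data.BTree.Alloc (AllocM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

insertShuffled :: AllocM n => [ByteString] -> Tree ByteString ByteString -> n (Tree ByteString ByteString)
insertShuffled shuffledNames tree = foldM (flip B.insertMany) tree batches
  where
    batchSize = 1000  -- arbitrary; experiment with this
    batches   = map (M.fromList . map (\name -> (name, name)))
                    (chunksOf batchSize shuffledNames)

    chunksOf _ [] = []
    chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t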

yairchu added a commit to lamdu/lamdu that referenced this issue Sep 30, 2018
See haskell-haskey/haskey#75
which recommends using `insertMany` over `insert`.
@drchaos

drchaos commented Jan 31, 2020

I've run into the same issue.
I've built a queue with persistent file storage. The key for the B+-tree is the current timestamp; I insert entries one by one, then get the minimum key and remove it. No batch operations.
Is there a workaround to do a clean-up from time to time?
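
A sketch of the queue pattern described above, at the tree level only (transaction plumbing omitted); push and popMin are illustrative names, and it assumes Data.BTree.Impure exposes insert/delete/toList with the obvious signatures and that toList returns entries in ascending key order:

-- Sketch of a timestamp-keyed queue as described: insert one entry at a time,
-- then take and remove the entry with the smallest key. Taking the head of
-- toList is only a stand-in for a proper minimum lookup.
import Data.ByteString (ByteString)
import Data.Int (Int64)

import Data.BTree.Alloc (AllocM, AllocReaderM)
import Data.BTree.Impure (Tree)
import qualified Data.BTree.Impure as B

push :: AllocM n => Int64 -> ByteString -> Tree Int64 ByteString -> n (Tree Int64 ByteString)
push ts payload = B.insert ts payload

popMin :: (AllocM n, AllocReaderM n) => Tree Int64 ByteString -> n (Maybe ByteString, Tree Int64 ByteString)
popMin tree = do
    entries <- B.toList tree
    case entries of
        []         -> return (Nothing, tree)
        (k, v) : _ -> do
            tree' <- B.delete k tree
            return (Just v, tree')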
