-
Notifications
You must be signed in to change notification settings - Fork 1.5k
Reduce memory usage of bulkloader #4529
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce memory usage of bulkloader #4529
Conversation
… key into a batch. That causes processing paused for a long time and the memory usage grows rapaidly. This commit fix it.
dgraph/cmd/bulk/reduce.go
Outdated
| @@ -214,6 +239,8 @@ func (r *reducer) encodeAndWrite( | |||
| listSize = 0 | |||
| } | |||
| } | |||
| kvb.finish = true | |||
| listSize += r.toList(nil, list, kvb) | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ineffectual assignment to listSize (from ineffassign)
dgraph/cmd/bulk/reduce.go
Outdated
| @@ -60,10 +67,18 @@ func (r *reducer) run() error { | |||
| go func(shardId int, db *badger.DB) { | |||
| defer thr.Done(nil) | |||
|
|
|||
| //Use a chan as mapEntries pool for mapIterator to read mapoutput. Fixed size is ok because mapEntries will | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line is 110 characters (from lll)
| @@ -159,8 +166,7 @@ func newMapIterator(filename string) *mapIterator { | |||
| return &mapIterator{fd: fd, reader: bufio.NewReaderSize(gzReader, 16<<10)} | |||
| } | |||
|
|
|||
| func (r *reducer) encodeAndWrite( | |||
| writer *badger.StreamWriter, entryCh chan []*pb.MapEntry, closer *y.Closer) { | |||
| func (r *reducer) encodeAndWrite(writer *badger.StreamWriter, entryCh chan []*pb.MapEntry, closer *y.Closer) { | |||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
line is 110 characters (from lll)
1 similar comment
|
I have opened another pr for this. I really need to study how to use git and github. Close. |
I used bulkloader to initial my cluster, but It was killed due to OOM. The dataset only has one predicate: dgraph.type, so I set reducer to 1, but the reduce stage still occupied 100G memory.
I located the problem at the reduce.go source file: all mapEntries with the same key must in one batch to produce postinglist. And in my dataset, postinglists can be very large. That leads to a long time to produce a batch and a very large batch size. So I make a little change to improve bulkloader.
The idea is that use a struct named kvBuilder to store inbuilding kv, So a batch doesn't have to include all all mapEntries with the same key, therefore saves memory from holding massive duplicate keys. This small change brings two benefits: first, process won't be block by a large batch, therefore speeds up bulkloader; second, memory usage will be reduced.
By this method, memory usage of the reduce stage cut down to 7G, and achieved 1.33x speedup.
This change is