
Reduce memory usage of bulkloader #4529


Conversation


@xiangzhao632 xiangzhao632 commented Jan 9, 2020

I used the bulkloader to initialize my cluster, but it was killed due to OOM. The dataset has only one predicate, dgraph.type, so I set the number of reducers to 1, but the reduce stage still occupied 100G of memory.
I traced the problem to the reduce.go source file: all mapEntries with the same key must be in one batch to produce a posting list, and in my dataset posting lists can be very large. That leads to a long time to produce a batch and a very large batch size, so I made a small change to improve the bulkloader.
The idea is to use a struct named kvBuilder to store the KV currently being built, so a batch no longer has to include all mapEntries with the same key; this saves the memory spent holding massive numbers of duplicate keys. This small change brings two benefits: first, the process won't be blocked by a large batch, which speeds up the bulkloader; second, memory usage is reduced.
With this method, memory usage of the reduce stage dropped to 7G, and we achieved a 1.33x speedup.
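The streaming idea described above can be sketched as follows. All names here (`MapEntry`, `kvBuilder`, `buildAll`) are illustrative assumptions for this sketch, not the PR's exact code in reduce.go:

```go
package main

import "fmt"

// MapEntry stands in for pb.MapEntry: a key plus one posting value.
type MapEntry struct {
	Key   string
	Value uint64
}

// kvBuilder (hypothetical sketch) holds the posting list currently being
// built, so a batch boundary no longer has to line up with a key boundary
// and all duplicates of a key need not sit in memory at once.
type kvBuilder struct {
	key      string
	postings []uint64
}

// add feeds one entry to the builder, flushing the previous key's posting
// list when the key changes.
func (b *kvBuilder) add(e MapEntry, flush func(key string, postings []uint64)) {
	if b.key != e.Key && len(b.postings) > 0 {
		flush(b.key, b.postings)
		b.postings = nil
	}
	b.key = e.Key
	b.postings = append(b.postings, e.Value)
}

// buildAll runs several batches through one builder and returns "key:count"
// summaries of the flushed posting lists.
func buildAll(batches [][]MapEntry) []string {
	var out []string
	flush := func(k string, ps []uint64) {
		out = append(out, fmt.Sprintf("%s:%d", k, len(ps)))
	}
	b := &kvBuilder{}
	for _, batch := range batches {
		for _, e := range batch {
			b.add(e, flush)
		}
	}
	if len(b.postings) > 0 {
		flush(b.key, b.postings) // final flush, analogous to the PR's finish flag
	}
	return out
}

func main() {
	// Key "a" is split across two batches; the builder carries its partial
	// posting list over the boundary instead of requiring one giant batch.
	batches := [][]MapEntry{
		{{"a", 1}, {"a", 2}},
		{{"a", 3}, {"b", 4}},
	}
	fmt.Println(buildAll(batches)) // [a:3 b:1]
}
```

Because the builder persists across batches, batch size can stay small and fixed even when a single key has millions of entries, which is where the memory saving comes from.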



… key into a batch. That caused processing to pause for a long time while memory usage grew rapidly. This commit fixes it.
@xiangzhao632 xiangzhao632 requested a review from a team January 9, 2020 10:45
@@ -214,6 +239,8 @@ func (r *reducer) encodeAndWrite(
listSize = 0
}
}
kvb.finish = true
listSize += r.toList(nil, list, kvb)


ineffectual assignment to listSize (from ineffassign)

@@ -60,10 +67,18 @@ func (r *reducer) run() error {
go func(shardId int, db *badger.DB) {
defer thr.Done(nil)

//Use a chan as mapEntries pool for mapIterator to read mapoutput. Fixed size is ok because mapEntries will


line is 110 characters (from lll)

@xiangzhao632 xiangzhao632 removed the request for review from a team January 9, 2020 10:54
@xiangzhao632 xiangzhao632 changed the title The PR improve memory usage of bulkloader Reduce memory usage of bulkloader Jan 10, 2020
@mangalaman93 mangalaman93 self-assigned this Jan 11, 2020
@@ -159,8 +166,7 @@ func newMapIterator(filename string) *mapIterator {
return &mapIterator{fd: fd, reader: bufio.NewReaderSize(gzReader, 16<<10)}
}

func (r *reducer) encodeAndWrite(
writer *badger.StreamWriter, entryCh chan []*pb.MapEntry, closer *y.Closer) {
func (r *reducer) encodeAndWrite(writer *badger.StreamWriter, entryCh chan []*pb.MapEntry, closer *y.Closer) {


line is 110 characters (from lll)

@mangalaman93 mangalaman93 removed their assignment Jan 23, 2020
@CLAassistant

CLA assistant check
All committers have signed the CLA.


@xiangzhao632
Author

I have opened another PR for this. I really need to study how to use git and GitHub. Closing.

@xiangzhao632 xiangzhao632 deleted the xiangzhao020/bulk_loader_memory_improve branch May 28, 2020 17:16
4 participants