Improve throughput of bulk loader with distributed loading #2628

@danielmai

Description

Experience Report

The Dgraph bulk loader is the fastest way to load data into Dgraph, at close to 1M edges/sec. This currently satisfies most users, but for extremely large data sets on the order of terabytes, it takes days if not weeks to finish bulk loading the entire data set.

What you wanted to do

Complete a bulk load of a multi-terabyte RDF triples data set in a timely manner.

What you actually did

Ran the bulk loader on a multi-terabyte RDF triples data set on an i3.metal AWS instance with 14 TB of SSD space.

Why that wasn't great, with examples

The bulk loader job did not finish on the i3.metal instance: disk space ran out during the mapping phase.

What could be improved

Since the bulk loader's mapping phase can't be completed on a single machine for data sets this large, a distributed map-reduce bulk loader would at the very least make the job possible, while also increasing throughput to cut the wait time from weeks to days or hours.

Metadata


    Labels

    area/bulk-loader: Issues related to bulk loading.
    area/performance: Performance related issues.
    kind/enhancement: Something could be better.
    popular
    priority/P2: Somehow important but would not block a release.
    status/accepted: We accept to investigate/work on it.
