How to import 4 billion RDF triples to Dgraph quickly #1323

@flyfatasy

Description

Recently we have been doing some research on contact maps, and we have a graph consisting of almost 2 billion contacts, so obviously we need a graph database. Dgraph is really impressive and really cool. Following the documentation, we arranged our data into RDF using Spark (a sketch of the export job follows the example triples below). Now we have about 4.2 billion RDF triples.

The schema looks like this:
mutation {
  schema {
    thisphone: string @index(hash) .
    contact: uid .
    contact_of: uid .
  }
}

The RDF looks like this:
<contact_p104008111111> <contact_of> <contact_p113761083758> (name="sam", ots=1452908610, lts=1501758356, status=1) .
<contact_p104008111111> <contact_of> <contact_p113810888226> (name="frank", ots=1453119360, lts=1500729904, status=1) .
<contact_p104008111111> <contact_of> <contact_p113811659687> (name="tony", ots=1444992764, lts=1498013559, status=1) .
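
For reference, the Spark job that emits these lines is roughly the following sketch. The input path and the column names (src, dst, name, ots, lts, status) are hypothetical stand-ins for our actual contact table, not the real schema.

import org.apache.spark.sql.SparkSession

object RdfExport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("rdf-export").getOrCreate()
    import spark.implicits._

    // Assumed input: one row per contact edge.
    val contacts = spark.read.parquet("/data/contacts")

    val rdf = contacts.map { row =>
      val src    = row.getAs[String]("src")
      val dst    = row.getAs[String]("dst")
      val name   = row.getAs[String]("name")
      val ots    = row.getAs[Long]("ots")
      val lts    = row.getAs[Long]("lts")
      val status = row.getAs[Int]("status")
      // One N-Quad per edge, with the edge attributes as facets.
      s"""<contact_$src> <contact_of> <contact_$dst> (name="$name", ots=$ots, lts=$lts, status=$status) ."""
    }

    // One text part-file per partition, ready for loading.
    rdf.write.text("/data/rdf-out")
  }
}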

We are running Dgraph on a single machine with 64 GB of memory, a 3.5 TB SSD (RAID 5, though), and 40 cores.
The problem is that the import speed converges to about 20,000 RDF/s after several minutes. That is not terribly slow, but against 4.2 billion triples it still adds up: 4.2 billion ÷ 20,000/s ≈ 210,000 s, i.e. roughly 2.5 to 3 days of pure load time. So: can we generate the SST files and value logs (vlogs) on Spark and then simply copy them into the p directory? We would also be glad to hear about other ways to accelerate the import. Thanks.
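
Our current import loop looks roughly like the sketch below: it batches N-Quads and POSTs each batch as a set mutation to Dgraph's HTTP endpoint. The endpoint URL, batch size, and input-file argument are illustrative assumptions, not our exact production setup.

import java.net.{HttpURLConnection, URL}
import scala.io.Source

object BatchImport {
  // POST one batch of N-Quads as a single set mutation.
  def postMutation(rdfBatch: Seq[String]): Unit = {
    val body = s"mutation { set { ${rdfBatch.mkString("\n")} } }"
    val conn = new URL("http://localhost:8080/query")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setDoOutput(true)
    conn.getOutputStream.write(body.getBytes("UTF-8"))
    conn.getResponseCode // block until the server replies
    conn.disconnect()
  }

  def main(args: Array[String]): Unit = {
    // Stream the RDF file and send it in batches of 1000 lines.
    Source.fromFile(args(0)).getLines().grouped(1000).foreach(postMutation)
  }
}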
