Transactions and Batch updates #5608

@RJKeevil

Experience Report

What you wanted to do

I am very happy with Dgraph's bulk load and query performance. However, I am struggling to get good insert/update speeds in a high-volume streaming application.

What you actually did

I have 16 Go workers that take records from a Kafka bus, do some processing on the data, and update a section of the graph (one Kafka record produces many nodes). I got a huge (100x) performance improvement by batching records into bundles of 500 records (~3000 nodes).
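A minimal sketch of the batching step, for illustration only. The `chunk` helper and the integer stand-in records are hypothetical; in the real workers each batch would be turned into a single mutation.

```go
package main

import "fmt"

// chunk splits records into consecutive batches of at most size elements.
// The last batch may be shorter. Batches share the input's backing array.
// (Hypothetical helper, not part of any Dgraph client.)
func chunk[T any](records []T, size int) [][]T {
	if size <= 0 {
		return nil
	}
	batches := make([][]T, 0, (len(records)+size-1)/size)
	for start := 0; start < len(records); start += size {
		end := start + size
		if end > len(records) {
			end = len(records)
		}
		batches = append(batches, records[start:end])
	}
	return batches
}

func main() {
	records := make([]int, 1200) // stand-in for processed Kafka records
	for i, b := range chunk(records, 500) {
		// In the real worker, each batch would be sent as one mutation.
		fmt.Printf("batch %d: %d records\n", i, len(b))
	}
}
```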

Why that wasn't great, with examples

For new nodes (inserts) this works great. However, one constraint is that my nodes must be unique in the graph based on an external identifier, i.e. when the record already exists in the graph I must reuse the existing Uid and perform an update instead. Fetching the Uid is performant, but the update operation is killing performance because of how transactions behave. With batches of ~3000 nodes, I am almost guaranteed to be updating a node that another of the 16 workers is updating at the same time, so I get transaction errors and may need to retry the entire batch multiple times.
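The retry behaviour described above looks roughly like the sketch below. It is a simplified illustration, not my production code: `errAborted` and `retryOnAbort` are hypothetical names, and `errAborted` stands in for the conflict error returned by the client (dgo exposes a sentinel `ErrAborted` for this case, as far as I know), so that the sketch runs without a cluster.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errAborted is a hypothetical stand-in for the client's
// "transaction aborted, please retry" error.
var errAborted = errors.New("transaction aborted")

// retryOnAbort calls commit until it succeeds, returns a non-conflict
// error, or maxAttempts is reached. The wait doubles after each conflict.
// It returns the number of attempts made and the last error.
func retryOnAbort(maxAttempts int, backoff time.Duration, commit func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = commit()
		if err == nil || !errors.Is(err, errAborted) {
			return attempt, err
		}
		if attempt < maxAttempts {
			time.Sleep(backoff)
			backoff *= 2
		}
	}
	return maxAttempts, err
}

func main() {
	calls := 0
	// Simulated batch commit: conflicts twice, then succeeds.
	commit := func() error {
		calls++
		if calls < 3 {
			return errAborted
		}
		return nil
	}
	attempts, err := retryOnAbort(5, 10*time.Millisecond, commit)
	fmt.Println(attempts, err)
}
```

Every retry resends the whole batch, which is why a single conflicting node makes a 3000-node batch so expensive.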

I mostly add links to the node in the update operation, so these changes are purely additive and shouldn't conflict (i.e. one thread should be able to link A to B while a separate thread links A to C). However, I think this currently does cause a transaction failure.

I also occasionally update scalar properties in these updates, and it makes sense that those currently fail. In this case I would like a "best effort" mode that picks one write at random or by last transaction timestamp. I don't need perfect consistency here.

Another idea would be to fail only the nodes that have the problem, so I don't have to retry the whole batch (but I think this is harder because of the way transactions work).
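A client-side approximation of that idea is to split a conflicting batch in half and commit each half separately, so that only the records involved in a conflict end up being retried. The sketch below assumes the batch does not need to be atomic as a whole; `commitSplitting` and `errAborted` are hypothetical names, and the commit function is simulated.

```go
package main

import (
	"errors"
	"fmt"
)

// errAborted is a hypothetical stand-in for the client's conflict error.
var errAborted = errors.New("transaction aborted")

// commitSplitting commits batch in one transaction. On a conflict it splits
// the batch in half and recurses. It returns the records that still
// conflicted when committed on their own; other errors stop the recursion.
func commitSplitting[T any](batch []T, commit func([]T) error) ([]T, error) {
	if len(batch) == 0 {
		return nil, nil
	}
	err := commit(batch)
	if err == nil {
		return nil, nil
	}
	if !errors.Is(err, errAborted) {
		return nil, err
	}
	if len(batch) == 1 {
		return batch, nil
	}
	mid := len(batch) / 2
	left, err := commitSplitting(batch[:mid], commit)
	if err != nil {
		return nil, err
	}
	right, err := commitSplitting(batch[mid:], commit)
	if err != nil {
		return left, err
	}
	// Copy into a fresh slice so the caller's batch is never overwritten.
	failed := make([]T, 0, len(left)+len(right))
	failed = append(failed, left...)
	failed = append(failed, right...)
	return failed, nil
}

func main() {
	conflicting := map[int]bool{3: true, 7: true} // simulated hot nodes
	commit := func(b []int) error {
		for _, id := range b {
			if conflicting[id] {
				return errAborted
			}
		}
		return nil
	}
	failed, err := commitSplitting([]int{1, 2, 3, 4, 5, 6, 7, 8}, commit)
	fmt.Println(failed, err)
}
```

This trades extra round trips for smaller retries, and in practice a record that conflicted inside a large batch may well succeed once it is committed alone.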

Note: due to the structure of the records, it is impossible for me to route all updates for a given node to the same worker in my code, so I cannot avoid simultaneous updates that way.

Any external references to support your case
