
[Feature request] Add Graph Deep Learning Capabilities #4608

@marvin-hansen

Description

Experience Report

Currently, I am building a unified graph that converges data, compute, and machine learning. On the data side, I use Postgres and a graph database. I am still working on making Dgraph work, but that's a very different story.
On the compute and machine learning side, everything integrates through web services. On the integration layer, all data and web services get queried, accessed, and mutated through a master GraphQL layer.

It works, but when I recently integrated another machine learning use case, I ended up sending a bunch of data back and forth, and that made me start thinking about a better way.

What you wanted to do

I have run into a similar situation a few times before: I end up moving data around because the processing cannot be done where the data is, or the data isn't where the processing happens. Either way, this is stupid, and Hadoop really isn't the answer either.

What you actually did

Eventually, I did what everyone else would have done: load the data, send it to the ML service,
process it, and store the results back in the datastore. Simple.

Why that wasn't great, with examples

There are so many problems:

  1. No data / compute locality
  2. Unnecessary traffic
  3. Lousy latency
  4. Real-time use cases get harder with every additional network hop
  5. Processing larger datasets, well, isn't fun, precisely because of the data loading it implies.

What is the main idea?

One of the most intriguing properties of Dgraph is the data locality that naturally follows from predicate sharding. When predicates P1 ... Pn reside on node X, then all queries against the predicates on node X are, by definition, independent of all other predicates on all other nodes and can thus be executed in parallel. Ok, we all know that.

However, it also follows that any algorithm operating on predicates is parallelizable by default and can therefore be moved to the node where the data is located, traverse the corresponding sub-graph there, and compute its results in place. That is data parallelism in its purest form.
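
To make the point concrete, here is a minimal, purely illustrative sketch in plain Python (not Dgraph internals; the shard layout, predicate names, and computation are hypothetical). Because no shard needs any other shard's data, the same per-predicate computation can run on every shard at once:

```python
# Illustrative only: pretend each dict is the edge set of one predicate,
# living on one node of the cluster. The per-predicate computation needs
# nothing from any other shard, so all shards can be processed in parallel.
from concurrent.futures import ProcessPoolExecutor

def out_degree_histogram(shard):
    """Toy per-shard computation over (subject, object) edges of a single predicate."""
    hist = {}
    for subject, _obj in shard["edges"]:
        hist[subject] = hist.get(subject, 0) + 1
    return shard["predicate"], hist

# Hypothetical shards standing in for predicates P1 ... Pn on different nodes.
shards = [
    {"predicate": "follows", "edges": [("u1", "u2"), ("u1", "u3"), ("u2", "u3")]},
    {"predicate": "likes",   "edges": [("u2", "u9")]},
]

if __name__ == "__main__":
    with ProcessPoolExecutor() as pool:
        for predicate, hist in pool.map(out_degree_histogram, shards):
            print(predicate, hist)
```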

What would be a truly great way of doing this?

The best way I can think of would be to process the data where it is stored by adding a third kind of node to Dgraph that hosts both data and ML algorithms. Given the distributed nature of the cluster, aggregated real-time result streams might even be possible.

A separate kind of node is needed to avoid adding load to the rest of the system, so that normal queries remain unaffected performance-wise. The alpha nodes would keep serving regular queries and mutations as they do today and dispatch ML workloads to the ML nodes. The ML instances could also be pinned to high-spec or GPU machines. With the current work on the GraphQL endpoint, an ML algorithm could then be exposed and parametrized through a custom resolver.
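
As a rough illustration of the resolver-facing side (hedged: the endpoint path, parameter names, and algorithm are all hypothetical, and this is plain Flask rather than anything Dgraph ships today), an ML node could expose an algorithm over HTTP so that a custom GraphQL resolver forwards only parameters, never the sub-graph itself:

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

def run_community_prediction(start_uid: str, stop_uid: str) -> dict:
    # Placeholder for the actual model. In the proposed design, this would read
    # the locally stored predicates on the ML node instead of receiving the
    # sub-graph over the network.
    return {"startUid": start_uid, "stopUid": stop_uid, "communities": []}

@app.route("/ml/community", methods=["POST"])
def community():
    params = request.get_json(force=True)
    result = run_community_prediction(params["startUid"], params["stopUid"])
    return jsonify(result)

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```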

More details & considerations

This idea is based on my current practice of integrating ML services into the unified graph: the custom resolver only takes a start and stop ID as parameters, queries all the required data, and passes it to the ML server. This works in practice, but with the caveat that there is a limit to how much data you can query and send over the network before something slows down.
It makes a lot of sense as long as the sub-graph remains reasonably small, but it no longer works on a large graph. At some point, the network isn't your friend anymore.
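
For reference, here is a hedged sketch of roughly what that current practice looks like (the service URL, predicate name, and query shape are hypothetical): a resolver-style function receives a start and stop UID, pulls the sub-graph out of Dgraph with pydgraph, and ships the whole thing to the external ML service.

```python
import json

import pydgraph
import requests

ML_SERVICE_URL = "http://ml-service:8080/ml/community"  # hypothetical endpoint

# Hypothetical DQL query: pull a bounded sub-graph starting from one UID.
SUBGRAPH_QUERY = """
query subgraph($startId: string) {
  subgraph(func: uid($startId)) @recurse(depth: 3) {
    uid
    connects_to
  }
}
"""

def run_ml_on_subgraph(start_id: str, stop_id: str) -> dict:
    stub = pydgraph.DgraphClientStub("localhost:9080")
    client = pydgraph.DgraphClient(stub)
    txn = client.txn(read_only=True)
    try:
        res = txn.query(SUBGRAPH_QUERY, variables={"$startId": start_id})
        subgraph = json.loads(res.json)
    finally:
        txn.discard()
        stub.close()

    # The whole sub-graph crosses the network twice: once out of Dgraph,
    # once into the ML service -- exactly the traffic the proposal would avoid.
    payload = {"startUid": start_id, "stopUid": stop_id, "subgraph": subgraph}
    resp = requests.post(ML_SERVICE_URL, json=payload, timeout=60)
    return resp.json()
```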

The number of algorithms suitable for a large graph is relatively low, but those that exist solve really important problems, among them community prediction, graph structure classification, and predicting shifts in structural (im)balance.

Specifically, the algorithms in DGL are a terrific contender for these tasks because they can handle arbitrary graph sizes through batch loading and message passing, and DGL already supports multi-processing and distributed training out of the box. Obviously, loading a large graph out of a distributed Dgraph cluster, shuffling the data over to a DGL cluster, just to distribute it again for processing, sounds as stupid as it actually is. And in many ways, that is how it's done in practice today.
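
To give an idea of what such an algorithm looks like on the DGL side, here is a minimal, self-contained GraphSAGE forward pass on a toy graph. This is a sketch following the public DGL tutorials, not anything Dgraph-specific; the graph, feature sizes, and class count are made up. In the proposal, this kind of message-passing model is what would run on the ML nodes against locally stored shards.

```python
import dgl
import torch
import torch.nn as nn
from dgl.nn import SAGEConv

# Toy graph: a 4-node cycle with random 16-dimensional node features.
src = torch.tensor([0, 1, 2, 3])
dst = torch.tensor([1, 2, 3, 0])
g = dgl.graph((src, dst))
g.ndata["feat"] = torch.randn(g.num_nodes(), 16)

class SAGE(nn.Module):
    """Two rounds of GraphSAGE message passing, then a class score per node."""

    def __init__(self, in_feats: int, hidden_feats: int, n_classes: int):
        super().__init__()
        self.conv1 = SAGEConv(in_feats, hidden_feats, "mean")
        self.conv2 = SAGEConv(hidden_feats, n_classes, "mean")

    def forward(self, graph, feat):
        h = torch.relu(self.conv1(graph, feat))
        return self.conv2(graph, h)

model = SAGE(in_feats=16, hidden_feats=32, n_classes=2)
logits = model(g, g.ndata["feat"])  # shape: (num_nodes, n_classes)
print(logits.shape)
```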

I do not believe that this kind of data shuffling between distributed data storage and distributed data processing is sustainable, and the leading database vendors have known this for some time.

Oracle absolutely nailed its in-database machine learning precisely because, at some point, it is easier to move the headquarters than to move data out of the data warehouse. Even the folks at Neo4j got this simple message and started adding ML capabilities.

You simply cannot load and transfer humongous amounts of data anymore, so you have to process them where they are, and that is exactly why in-database machine learning will only grow in importance.

Doing so, however, remains a prevalent pain point with no truly great solution for graph data.

Any external references to support your case

https://docs.dgl.ai/api/python/graph_store.html

https://docs.dgl.ai/tutorials/basics/1_first.html

https://www.microsoft.com/en-us/research/wp-content/uploads/2009/10/Fourth_Paradigm.pdf
