readwrite module for interfacing with RDF #855

Closed
wants to merge 24 commits into from

@pedros pedros commented Mar 13, 2013

Read and write graphs in RDF format. Requires rdflib. This module adds 8 new functions:

It can load arbitrary RDF graphs, or RGML-namespaced graphs. RGML is the RDF Graph Modeling Language, an RDF ontology for generic graphs.

Arbitrary RDF graphs can be represented in two ways. The first follows the RDF specification and uses multiple directed labeled graphs; a term that occurs as a subject or object and, later on, as a predicate is reified multiple times. Since this is not optimal for connectivity analyses, the module also recognizes that RDF graphs are, in fact, hypergraphs, and can represent them as bipartite graphs where each node in one partition has 3 edges pointing to the subject, predicate, and object nodes in the other partition.

For more information, see Hayes, J. (2004). A graph model for RDF. Technische Universität Darmstadt
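A minimal sketch of the bipartite view described above, using only the standard library. It is not the module's code: the function name `triples_to_bipartite`, the `("stmt", index)` node naming, and the `ex:` terms are made up for illustration. Each triple becomes one statement node with three role-labeled edges, so a term used as a predicate in one triple and as a subject in another stays a single node.

```python
def triples_to_bipartite(triples):
    """Return (statement_nodes, term_nodes, edges) for (s, p, o) triples.

    Hypothetical helper: each triple becomes one statement node with
    exactly three outgoing edges, labeled by role, to the term nodes for
    its subject, predicate, and object.
    """
    statement_nodes = []
    term_nodes = set()
    edges = []  # (statement_node, term, role)
    for index, (s, p, o) in enumerate(triples):
        statement = ("stmt", index)  # made-up naming scheme for statement nodes
        statement_nodes.append(statement)
        for role, term in (("subject", s), ("predicate", p), ("object", o)):
            term_nodes.add(term)
            edges.append((statement, term, role))
    return statement_nodes, term_nodes, edges


# "ex:knows" is a predicate in the first triple and a subject in the second.
example_triples = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:knows", "rdfs:subPropertyOf", "ex:relatedTo"),
]
statements, terms, edges = triples_to_bipartite(example_triples)
```

If networkx is available, the edge list could be loaded with something like `G.add_edges_from((u, v, {"role": r}) for u, v, r in edges)` on a `networkx.DiGraph`.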

All serialization formats supported by rdflib are supported, currently:

TODO: implement building from generic and rgml graphs
+    # I think we don't need to consider disconnected nodes, since that
+    # is impossible to represent in straight RDF. That is, RDF has a
+    # concept of disconnected triples, but not terms, which are
+    # properly analogous to nodes in networkx
- Fix error in to_rdfgraph() dispatching mechanism
  due to iterators being consumed and later reused
- reduced cyclomatic complexity on {to,from}_rgmlgraph below 10
- improved pep8 compliance
- tools used: pep8, pyflakes, pylint, pygenie, pymetrics
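The iterator problem mentioned in the commit message above can be illustrated generically. This is not the PR's `to_rdfgraph()` code, which is not shown here; the function names and the `rgml:` prefix check are hypothetical. The point is only that a one-shot iterator inspected during dispatch is partly consumed by the time it is reused.

```python
def looks_like_rgml(triples):
    """Hypothetical check: does any predicate start with the 'rgml:' prefix?"""
    return any(p.startswith("rgml:") for _, p, _ in triples)


def dispatch_buggy(triples):
    # If `triples` is a one-shot iterator, the check consumes part of it...
    kind = "rgml" if looks_like_rgml(triples) else "generic"
    # ...so only the leftover items are seen here.
    return kind, list(triples)


def dispatch_fixed(triples):
    triples = list(triples)  # materialize once, then reuse safely
    kind = "rgml" if looks_like_rgml(triples) else "generic"
    return kind, triples


data = [("a", "ex:p", "b"), ("b", "rgml:edge", "c"), ("c", "ex:q", "d")]
buggy = dispatch_buggy(iter(data))  # loses the triples read by the check
fixed = dispatch_fixed(iter(data))  # keeps all three triples
```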
@ghost ghost assigned hagberg Jul 19, 2013
Make Python3 compatible
Skip failing tests
Skip tests completely if no rdflib
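One way to skip a whole test class when an optional dependency is missing, sketched with the standard library's `unittest`. The PR's actual test code is not shown here, so the class name and the test body are assumptions.

```python
import importlib.util
import unittest

# True only when rdflib can be found in this environment.
HAS_RDFLIB = importlib.util.find_spec("rdflib") is not None


@unittest.skipUnless(HAS_RDFLIB, "rdflib is not installed")
class TestRDFReadWrite(unittest.TestCase):  # hypothetical test class name
    def test_import(self):
        import rdflib  # imported lazily so collection works without rdflib

        self.assertTrue(hasattr(rdflib, "Graph"))
```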
@hagberg
Member

hagberg commented Dec 16, 2013

Pull request at pedros#1 with some suggested changes.

@westurner

In addition to the changes from pedros#1,

@hagberg hagberg modified the milestones: networkx-2.1, networkx-2.0 Feb 2, 2016
@hagberg hagberg modified the milestones: networkx-2.1, networkx-future Nov 25, 2017
Base automatically changed from master to main March 4, 2021 18:20
@MridulS
Member

MridulS commented Feb 2, 2023

@pedros Thanks for the contribution! And sorry for the delay in review 😅

As of today it looks like rdflib maintains connectors to networkx https://github.com/RDFLib/rdflib/blob/9625ed0b432c9085e2d9dda1fd8acf707b9022ab/rdflib/extras/external_graph_libs.py#L72 so we don't need to add them here :)

The link to RGML seems broken; I think https://www.cs.rpi.edu/~puninj/rgml.html is the current one. From previous experience, maintaining markup languages in networkx becomes a maintenance burden, so we are trying to avoid adding new readwrite modules, especially ones that don't seem to have robust support outside of networkx already. Let me know what you think. Thanks again!

@rossbar
Contributor

rossbar commented Feb 3, 2023

As of today it looks like rdflib maintains connectors to networkx so we don't need to add them here :)

I agree - there's not a lot to gain by duplicating format-conversion functionality, and the RDF library seems like a more natural place for these to live. I will go ahead and close this - thanks all for the proposal & discussion!

@rossbar rossbar closed this Feb 3, 2023
@westurner

westurner commented Feb 4, 2023

@MridulS

def rdflib_to_networkx_digraph(
    graph,
    calc_weights=True,
    edge_attrs=lambda s, p, o: {"triples": [(s, p, o)]},
    **kwds,
):
    ...  # body omitted; signature only

def rdflib_to_networkx_multidigraph(
    graph, edge_attrs=lambda s, p, o: {"key": p}, **kwds
):
    ...  # body omitted; signature only
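A rough, stdlib-only sketch of what the default `edge_attrs` lambdas in these signatures suggest. It does not call rdflib, the helper names are made up, and the real implementation may differ. The digraph variant collapses all triples between the same subject and object into one edge, collecting them under `"triples"` and counting them in `"weight"`; the multidigraph variant keeps one edge per triple, keyed by predicate.

```python
def triples_to_digraph_edges(
    triples,
    calc_weights=True,
    edge_attrs=lambda s, p, o: {"triples": [(s, p, o)]},
):
    """Hypothetical helper: one attribute dict per (subject, object) pair."""
    edges = {}
    for s, p, o in triples:
        data = edges.get((s, o))
        if data is None:
            # First triple for this pair: create the edge attributes.
            data = edge_attrs(s, p, o)
            if calc_weights:
                data["weight"] = 1
            edges[(s, o)] = data
        else:
            # Repeated pair: bump the weight and collect the extra triple.
            if calc_weights:
                data["weight"] += 1
            if "triples" in data:
                data["triples"].extend(edge_attrs(s, p, o)["triples"])
    return edges


def triples_to_multidigraph_edges(
    triples, edge_attrs=lambda s, p, o: {"key": p}
):
    """Hypothetical helper: keep parallel edges, one entry per triple."""
    return [(s, o, edge_attrs(s, p, o)) for s, p, o in triples]


sample = [("a", "knows", "b"), ("a", "likes", "b"), ("b", "knows", "c")]
collapsed = triples_to_digraph_edges(sample)
parallel = triples_to_multidigraph_edges(sample)
```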
  • To copy from (edit) networkx to rdflib:

@pedros @rossbar Would this rdflib read/write code be best as a third-party module?

https://github.com/networkx/networkx/blob/main/setup.py:

(Edit)

@MridulS
Member

MridulS commented Feb 4, 2023

@westurner The plugin bits are currently set up to work for backend computation plugins, not readwrite modules. But this is something that can indeed be thought about more :)

For readwrite to arrow/Parquet, I think we can have a readwrite inside networkx too! (just my opinion) Arrow is a robust data format outside of networkx and if there is an efficient way of reading/writing into that I think that's a plus.

Now if someone comes up and implements algorithms on top of arrow data structures for graphs, that would be great :D. We would be able to directly latch into that as a backend.

@westurner

westurner commented Feb 4, 2023

RDF support would be worthwhile as a core "read write plugin" or as a third-party adapter with its own integration tests that depend upon rdflib import IMHO.

These have C/C++-based tests_require dependencies, too:

Now if someone comes up and implements algorithms on top of arrow data structures for graphs, that would be great :D. We would be able to directly latch into that as a backend.

https://github.com/rapidsai/cugraph/blob/branch-23.02/readme_pages/algorithms.md

https://github.com/rapidsai/cugraph#apache-arrow-on-gpu-- :

Data scientists familiar with Python will quickly pick up how cuGraph integrates with the Pandas-like API of cuDF. Likewise, users familiar with NetworkX will quickly recognize the NetworkX-like API provided in cuGraph, with the goal to allow existing code to be ported with minimal effort into RAPIDS. To simplify integration, cuGraph also supports data found in Pandas DataFrame, NetworkX Graph Objects and several other formats.

While the high-level cugraph python API provides an easy-to-use and familiar interface for data scientists that's consistent with other RAPIDS libraries in their workflow, some use cases require access to lower-level graph theory concepts. For these users, we provide an additional Python API called pylibcugraph, intended for applications that require a tighter integration with cuGraph at the Python layer with fewer dependencies. Users familiar with C/C++/CUDA and graph structures can access libcugraph and libcugraph_c for low level integration outside of python.
[...]

Apache Arrow on GPU

The GPU version of Apache Arrow is a common API that enables efficient interchange of tabular data between processes running on the GPU. End-to-end computation on the GPU avoids unnecessary copying and converting of data off the GPU, reducing compute time and cost for high-performance analytics common in artificial intelligence workloads. As the name implies, cuDF uses the Apache Arrow columnar data format on the GPU. Currently, a subset of the features in Apache Arrow are supported

https://www.phoronix.com/news/Intel-oneAPI-2023 :

Very interesting as part of this is Intel-owned Codeplay Software releasing oneAPI plug-ins for NVIDIA and AMD GPUs. This allows SYCL and oneAPI use atop NVIDIA's proprietary driver stack as well as AMD with ROCm.

@MridulS
Member

MridulS commented Feb 4, 2023

RDF support would be worthwhile as a core "read write plugin" or as a third-party adapter with its own integration tests that depend upon rdflib import IMHO.

rdflib already has all the support for conversion b/w networkx and rdf, not sure what else we can/should add.

about arrow on GPU

Well, we will soon support cugraph as a backend for networkx, so that's good news.

But Arrow is still a columnar memory layout, which doesn't really work that well with graph algorithms, so having Arrow support for the graph data structure itself is not that straightforward. Yes, it can work as a dumping ground for graph data, but it's not something we can write code on top of (which is the more interesting thing to me).
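A rough illustration of the layout point, in plain Python rather than any Arrow API (the column names and helper functions are made up). With an edge list stored as two parallel columns, finding a node's neighbors means scanning the columns, whereas an adjacency structure answers the same question with a single lookup after a one-time conversion.

```python
from collections import defaultdict

# Columnar edge list: two parallel columns, as a tabular format would store it.
sources = ["a", "a", "b", "c"]
targets = ["b", "c", "c", "a"]


def neighbors_columnar(node):
    """Scan both full columns to find the out-neighbors of `node`."""
    return [t for s, t in zip(sources, targets) if s == node]


def build_adjacency(source_column, target_column):
    """Convert the columns once into a dict-of-lists adjacency structure."""
    adjacency = defaultdict(list)
    for s, t in zip(source_column, target_column):
        adjacency[s].append(t)
    return adjacency


adjacency = build_adjacency(sources, targets)
scan_result = neighbors_columnar("a")  # touches every row
lookup_result = adjacency["a"]         # direct lookup per node
```

This is only a sketch of the trade-off, not a benchmark.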

@westurner

https://arrow.apache.org/powered_by/ Ctrl-F "graph" doesn't appear to list e.g. CuGraph, which is built on Apache Arrow.

Does pyarrow already support SparseTensors?
https://arrow.apache.org/docs/cpp/api/tensor.html#sparse-tensors https://arrow.apache.org/docs/format/Other.html#sparse-tensor

rdflib already has all the support for conversion b/w networkx and rdf, not sure what else we can/should add.

  • To copy from rdflib to networkx:
  • To copy from (edit) networkx to rdflib

@westurner

westurner commented Feb 4, 2023

FWIW, rdflib-hdt also includes support for RDF HDT (Header, Dictionary, Triples), which IIUC this PR would make easier to read from and write to? https://en.wikipedia.org/wiki/HDT_(data_format)
