Define from/to RDF algorithms in terms of standard RDF #125

lanthaler opened this Issue May 22, 2012 · 11 comments

4 participants

JSON-LD Public Repositories member

Feedback from Richard Cyganiak:

I'd prefer if the algorithms were defined in terms of standard RDF
terminology (RDF graph, triple, IRI, etc.) rather than API interfaces
that use quite different terminology (array of Statements, Statement,
NamedNode, etc.)

@gkellogg gkellogg was assigned May 23, 2012
JSON-LD Public Repositories member

Richard, do you have an alternative to RDF triple? You've said you don't like Statement, but we need a concept that includes the notion of a graph name. For the time being, I'll continue to use Statement, with an issue marker noting that the name should be updated once RDF Concepts has an appropriate term. I note that the TriG grammar uses Graph Statements, which is pretty close.

We could probably change fromRDF() to take either an RDF Graph or RDF Dataset. toRDF() is more problematic, as it yields statements. There's also a potential issue of how it should deal with an empty named graph. Changing this to do a single callback with either an RDF Graph or RDF Dataset is possible, but moves us quite a ways away from the RDF API.

For the time being, I'll annotate the spec with issue markers.

We currently use the IDL Statement[] to represent the Dataset. We can certainly change the method signature to use a RDF Dataset type, which can be described as a


A “triple” is the thing in the data model; a “statement” is the meaning of a triple, the thing that you make when asserting the triple. So the term has an existing meaning in RDF that disagrees with its use in JSON-LD.

We don't have an alternative to “triple” that also includes “quads”. An obvious possibility would be “tuple”, as this encompasses triples and quads, but it's not a very pretty term. Personally I don't think it's likely that this would be defined in RDF Concepts, because the graph/triple model is so central to so many specs already that we'd rather talk about collections of graphs than about quads. But this is all still a bit up in the air.

Would it be possible at all to redefine the API to use something closer to RDF datasets? Having a Graph interface, and then Dataset with defaultGraph and multiple namedGraphs?
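As a rough illustration of the shape being asked about, here is a minimal Python sketch of a Graph type and a Dataset type with one default graph and any number of named graphs. The class names, the `graph()` helper, and the use of plain strings for IRIs and literals are all assumptions made for this example; none of it is taken from the JSON-LD API or the RDF API drafts.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# A triple is (subject, predicate, object). Plain strings stand in for
# IRIs, blank nodes and literals in this sketch.
Triple = Tuple[str, str, str]


@dataclass
class Graph:
    """An RDF graph: a set of triples."""
    triples: Set[Triple] = field(default_factory=set)

    def add(self, subject: str, predicate: str, obj: str) -> None:
        self.triples.add((subject, predicate, obj))


@dataclass
class Dataset:
    """An RDF dataset: one default graph plus zero or more named graphs."""
    default_graph: Graph = field(default_factory=Graph)
    named_graphs: Dict[str, Graph] = field(default_factory=dict)

    def graph(self, name: Optional[str] = None) -> Graph:
        # Hypothetical helper: no name means the default graph; a named
        # graph is created on first use.
        if name is None:
            return self.default_graph
        return self.named_graphs.setdefault(name, Graph())


dataset = Dataset()
dataset.graph().add(
    "http://example.org/alice", "http://xmlns.com/foaf/0.1/name", '"Alice"'
)
dataset.graph("http://example.org/g1").add(
    "http://example.org/bob", "http://xmlns.com/foaf/0.1/name", '"Bob"'
)
```

With a structure like this, a caller that only wants an RDF graph can read `dataset.default_graph` and ignore the rest.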

I think it's ok for toRDF() to yield an RDF dataset. Probably a note that says, “if you only want an RDF graph then get the default graph of the dataset” is sufficient. A warning though, there has been some pretty strong opposition to the idea of having “normal” RDF languages also capable of transmitting named graphs, because that's potentially dangerous when loading such a file into a graph store.

I wouldn't worry much about empty named graphs. There is precedent in SPARQL Update, where implementations are allowed to ignore empty named graphs. I think we will end up saying that one shouldn't consider the distinction between an empty named graph and an absent named graph as significant because some formats/systems don't maintain them.
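One way to read the advice about empty named graphs is that comparisons should ignore them. The Python sketch below assumes named graphs are held in a dict from graph name to a set of triples; the function names are hypothetical and only illustrate the idea.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]
NamedGraphs = Dict[str, Set[Triple]]


def drop_empty_graphs(named_graphs: NamedGraphs) -> NamedGraphs:
    """Return a copy without named graphs that contain no triples."""
    return {name: triples for name, triples in named_graphs.items() if triples}


def same_named_graphs(a: NamedGraphs, b: NamedGraphs) -> bool:
    """Compare named graphs, treating an empty graph like an absent one."""
    return drop_empty_graphs(a) == drop_empty_graphs(b)


with_empty: NamedGraphs = {
    "http://example.org/g1": {("http://example.org/s", "http://example.org/p", '"o"')},
    "http://example.org/empty": set(),
}
without_empty: NamedGraphs = {
    "http://example.org/g1": {("http://example.org/s", "http://example.org/p", '"o"')},
}
```

Here `with_empty` and `without_empty` differ as plain dicts, but `same_named_graphs` treats them as equivalent.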

@gkellogg gkellogg added a commit that referenced this issue May 23, 2012
@gkellogg gkellogg Change Statement to Quad (at least temporarily). Distinguish IDL references from term references.

Add definitions for RDFGraph and RDFDataset (not used presently).
Rename NamedNode to IRI, LiteralNode to Literal and provide references back to RDF Concepts.
This relates to issue #125.
JSON-LD Public Repositories member

@cygri, you might want to monitor the raw version of the API spec to see if we're narrowing the issues:

JSON-LD Public Repositories member

Said this in IRC, re-posting it here at @gkellogg's request:

Statement can't be changed to triple because it includes the graph name - it's a Quad.

We don't want to return the entire RDFGraph or RDFDataset in the callback, for memory reasons... your resulting graph could be multiple hundreds of megabytes in size. We need this stuff to be stream-processable, so it needs to operate more like a SAX-based processor than a DOM-based one.

We also shouldn't generate an RDFGraph or an RDFDataset because those imply that you're only processing one named graph at a time or you're doing it serially - we can't make either assumption.

We also wanted to do one statement at a time to perhaps give the developer some control over when they wanted to stop processing. For example, if the callback returns false - stop processing... but we haven't really discussed that yet.
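The one-quad-at-a-time idea, including the "return false to stop" behavior that has not been agreed on yet, could look roughly like the Python sketch below. `Quad`, `emit_quads`, and the rule that only an explicit `False` halts processing are assumptions for illustration, not the toRDF() signature from the spec.

```python
from typing import Callable, Iterable, Iterator, List, NamedTuple, Optional


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # None means the default graph


def emit_quads(
    quads: Iterable[Quad], callback: Callable[[Quad], Optional[bool]]
) -> int:
    """Hand quads to the callback one at a time.

    Processing stops as soon as the callback returns False. A callback
    that returns None (no return statement) keeps going. Returns the
    number of quads delivered.
    """
    delivered = 0
    for quad in quads:
        delivered += 1
        if callback(quad) is False:
            break
    return delivered


def generate() -> Iterator[Quad]:
    # A lazy source: quads are produced on demand, so the full result
    # never has to sit in memory at once.
    for i in range(1_000_000):
        yield Quad(f"http://example.org/s{i}", "http://example.org/p", f'"{i}"')


seen: List[Quad] = []


def first_three(quad: Quad) -> bool:
    seen.append(quad)
    return len(seen) < 3  # False on the third quad, which stops processing


delivered = emit_quads(generate(), first_three)
```

Because the source is a generator, stopping after three quads means the remaining ones are never produced.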

JSON-LD Public Repositories member

Just for clarification, RDFDataset would imply a default graph and zero or more named graphs. If Statement/Quad/Triple is used only in that context, it wouldn't need the name component.

For the sake of simplicity, I'll remove the RDFGraph/RDFDataset discussion for the time being, and we'll settle on Quad as the callback from toRDF(). If you come up with a better name for Quad, we can change to align with that.

@gkellogg gkellogg added a commit that referenced this issue May 24, 2012
@gkellogg gkellogg Remove RDFGraph and RDFDataset definitions, and cleanup some remaining references to "statement" with "quad".

This addresses issue #125

Ok, I understand the streaming issue and why the IDL needs to have an interface for individual triples/quads.

But I think this strengthens the case for divorcing the RDF algorithms from the IDL, and expressing them directly in RDF Concepts terms. Streaming is an implementation detail and shouldn't matter for how the to-/from-RDF algorithms are specified. Most JSON-LD generators that are connected to a native RDF system probably won't be streaming, and many won't use a JSON-LD API implementation at all.

JSON-LD Public Repositories member

Streaming isn't an implementation detail when it comes to the IDL - or rather, replace the phrase "streaming design" with "asynchronous design" and you will quickly see that doing the parsing in a synchronous fashion is a pretty bad idea (especially for large documents, or browser environments) - you don't want the UI to lock and you don't want thousands of threads waiting on a large number of documents to finish processing. That is - all synchronous systems can employ asynchronous sub-systems using callbacks. However, an asynchronous system will never be able to simulate a synchronous system because there isn't an easy way to say "wait until my subsystem has completed processing" without creating ugly spinlock hacks.

We are targeting languages like JavaScript, Python, PHP and Ruby first... languages like C, C++, C# and Java come second as far as how the API should be designed.

The RDF algorithms re-use the IDL because the terminology /should/ be aligned between the two. Since we have a note in the spec that the document should be aligned with RDF Concepts, and since Gregg has already made a number of edits (a28f820) to bring the algorithm more in line with the current RDF Concepts document, I'm inclined to close this bug as we have done all of the concrete changes that we can at this time.

Gregg is working on a different way of explaining the to/from RDF conversion, but I don't think that should be a blocker for the FPWD. @cygri, how would you like to proceed? We can close this bug stating that we have aligned with RDF Concepts as much as possible right now, and hope to do more in the future. If you disagree with the current wording in the spec, please cite specific examples and hopefully do that in a different bug so that we have something a bit more concrete to work with.

JSON-LD Public Repositories member

The only place the algorithms vary from RDF Concepts terms is in the "generate a Quad" statements. Quad is, however, defined in terms of RDF Concepts (although the hyperlink still seems to be going to the IDL definition, which I'll fix). @cygri had mentioned that RDF Concepts may have something to say about quads, so it seems like a reasonable placeholder. Otherwise, the language would be a bit more clunky: "Generate a Triple in either the default graph, or a named graph, if graph name is not null..." repeated several times. "Generate a Quad" seems more succinct.
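To make the equivalence between the two phrasings concrete, the Python sketch below sorts quads into a default graph and named graphs: a quad with no graph name is a triple in the default graph, and any other quad is a triple in the graph it names. `Quad` and `group_quads` are hypothetical names used only for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Optional, Set, Tuple

Triple = Tuple[str, str, str]


class Quad(NamedTuple):
    subject: str
    predicate: str
    obj: str
    graph: Optional[str] = None  # None means the default graph


def group_quads(quads: Iterable[Quad]) -> Tuple[Set[Triple], Dict[str, Set[Triple]]]:
    """Split quads into (default graph, named graphs keyed by graph name)."""
    default_graph: Set[Triple] = set()
    named_graphs: Dict[str, Set[Triple]] = defaultdict(set)
    for quad in quads:
        triple = (quad.subject, quad.predicate, quad.obj)
        if quad.graph is None:
            default_graph.add(triple)
        else:
            named_graphs[quad.graph].add(triple)
    return default_graph, dict(named_graphs)


quads = [
    Quad("http://example.org/a", "http://example.org/p", '"1"'),
    Quad("http://example.org/b", "http://example.org/p", '"2"', "http://example.org/g1"),
    Quad("http://example.org/c", "http://example.org/p", '"3"', "http://example.org/g1"),
]
default_graph, named_graphs = group_quads(quads)
```

After grouping, the graph name is carried by the dict key, so the triples inside each graph no longer need a fourth component.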

In the next couple of days, I'll add informative sections that walk through the conversion process to make it clear to a more casual reader what the details of the algorithms mean.

JSON-LD Public Repositories member

@gkellogg @cygri can we close this issue, or should we use something other than "quad"?


Editorial comment on

Replace this sentence:

See [RDF-CONCEPTS] definition for RDF triple, which most closely aligns to Quad.

with:


An RDF Quad is an RDF triple [RDF-CONCEPTS] with an optional fourth element, the graph name, being a Node.

@lanthaler lanthaler added a commit that referenced this issue Nov 8, 2012
@lanthaler lanthaler Clarified what a Quad is
This was suggested by @cygri, thanks.

This addresses #125.
@gkellogg gkellogg closed this Nov 13, 2012
JSON-LD Public Repositories member

Update to RDF in 6d8e825. From RDF should also be re-written to use datasets instead of quads.
