
Define from/to RDF algorithms in terms of standard RDF #125

Closed
lanthaler opened this Issue May 22, 2012 · 11 comments

Markus Lanthaler
Collaborator

Feedback from Richard Cyganiak:

I'd prefer if the algorithms were defined in terms of standard RDF
terminology (RDF graph, triple, IRI, etc.) rather than API interfaces
that use quite different terminology (array of Statements, Statement,
NamedNode, etc.)

Gregg Kellogg
Owner

Richard, do you have an alternative to RDF triple? You've said you don't like Statement, but we need a concept that includes the notion of a graph name. For the time being, I'll continue to use Statement, with an issue marker noting that the name should be updated once RDF Concepts has an appropriate term. I note that the TriG graph uses Graph Statements, which is pretty close.

We could probably change fromRDF() to take either an RDF Graph or RDF Dataset. toRDF() is more problematic, as it yields statements. There's also a potential issue of how it should deal with an empty named graph. Changing this to do a single callback with either an RDF Graph or RDF Dataset is possible, but moves us quite a ways away from the RDF API.

For the time being, I'll annotate the spec with issue markers.

We currently use the IDL Statement[] to represent the Dataset. We can certainly change the method signature to use a RDF Dataset type, which can be described as a

Richard Cyganiak
Collaborator
cygri commented May 23, 2012

A “triple” is the thing in the data model; a “statement” is the meaning of a triple, the thing that you make when asserting the triple. So “statement” has an existing meaning in RDF that disagrees with its use in JSON-LD.

We don't have an alternative to “triple” that also includes “quads”. An obvious possibility would be “tuple”, as this encompasses triples and quads, but it's not a very pretty term. Personally I don't think it's likely that this would be defined in RDF Concepts, because the graph/triple model is so central to so many specs already that we'd rather talk about collections of graphs than about quads. But this is all still a bit up in the air.

Would it be possible at all to redefine the API to use something closer to RDF datasets? Having a Graph interface, and then Dataset with defaultGraph and multiple namedGraphs?

I think it's ok for toRDF() to yield an RDF dataset. Probably a note that says, “if you only want an RDF graph then get the default graph of the dataset” is sufficient. A warning though, there has been some pretty strong opposition to the idea of having “normal” RDF languages also capable of transmitting named graphs, because that's potentially dangerous when loading such a file into a graph store.

I wouldn't worry much about empty named graphs. There is precedent in SPARQL Update, where implementations are allowed to ignore empty named graphs. I think we will end up saying that one shouldn't consider the distinction between an empty named graph and an absent named graph as significant because some formats/systems don't maintain them.
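
As a rough illustration of the dataset shape suggested above, here is a minimal Python sketch. The names `Graph`, `Dataset`, `default_graph`, `named_graphs` and `non_empty_named_graphs` are hypothetical and are not taken from the JSON-LD API or RDF Concepts; plain strings stand in for RDF terms. It also shows one way to treat an empty named graph the same as an absent one.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Set, Tuple

# A triple is (subject, predicate, object); strings stand in for RDF terms.
Triple = Tuple[str, str, str]


@dataclass
class Graph:
    """Hypothetical RDF graph: a set of triples."""
    triples: Set[Triple] = field(default_factory=set)


@dataclass
class Dataset:
    """Hypothetical RDF dataset: one default graph plus zero or more named graphs."""
    default_graph: Graph = field(default_factory=Graph)
    named_graphs: Dict[str, Graph] = field(default_factory=dict)

    def add(self, triple: Triple, graph_name: Optional[str] = None) -> None:
        # No graph name means the triple belongs to the default graph.
        if graph_name is None:
            self.default_graph.triples.add(triple)
        else:
            self.named_graphs.setdefault(graph_name, Graph()).triples.add(triple)

    def non_empty_named_graphs(self) -> Dict[str, Graph]:
        # Treat an empty named graph as if it were absent.
        return {name: g for name, g in self.named_graphs.items() if g.triples}


ds = Dataset()
ds.add(("http://example.org/a", "http://example.org/p", "http://example.org/b"))
ds.add(("http://example.org/c", "http://example.org/p", "http://example.org/d"),
       graph_name="http://example.org/g1")
ds.named_graphs["http://example.org/empty"] = Graph()

# A consumer that only wants an RDF graph reads the default graph.
graph_only = ds.default_graph
```

In this sketch a caller that cannot store named graphs simply ignores `named_graphs`, which matches the "get the default graph of the dataset" note proposed above.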

Gregg Kellogg gkellogg referenced this issue from a commit May 23, 2012
Gregg Kellogg Change Statement to Quad (at least temporarily). Distinguish IDL references from term references.

Add definitions for RDFGraph and RDFDataset (not used presently).
Rename NamedNode to IRI, LiteralNode to Literal and provide references back to RDF Concepts.
This relates to issue #125.
9a09fb2
Gregg Kellogg
Owner

@cygri, you might want to monitor the raw version of the API spec to see if we're narrowing the issues: http://json-ld.org/spec/latest/json-ld-api/

Manu Sporny
Owner
msporny commented May 23, 2012

Said this in IRC, re-posting it here at @gkellogg's request:

Statement can't be changed to triple because it includes the graph name - it's a Quad.

We don't want to return the entire RDFGraph or RDFDataset in the callback, for memory reasons... your resulting graph could be multiple hundreds of megabytes in size. We need this stuff to be stream-process-able, so it needs to operate more like a SAX-based processor than a DOM-based one.

We also shouldn't generate an RDFGraph or an RDFDataset because those imply that you're only processing one named graph at a time or you're doing it serially - we can't make either assumption.

We also wanted to do one statement at a time to perhaps give the developer some control over when they wanted to stop processing. For example, if the callback returns false - stop processing... but we haven't really discussed that yet.
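
The one-quad-at-a-time callback with an early stop could look roughly like the Python sketch below. This is only an illustration of the idea floated in this comment, not the API in the spec: `emit_quads`, `take_two` and the tuple layout of a quad are assumptions made for the example.

```python
from typing import Callable, Iterable, Optional, Tuple

# (subject, predicate, object, graph name); None means the default graph.
Quad = Tuple[str, str, str, Optional[str]]


def emit_quads(quads: Iterable[Quad],
               callback: Callable[[Quad], Optional[bool]]) -> int:
    """Hand quads to the callback one at a time, SAX-style.

    Processing stops as soon as the callback returns False.
    Returns the number of quads that were delivered.
    """
    delivered = 0
    for quad in quads:
        delivered += 1
        if callback(quad) is False:
            break
    return delivered


seen = []


def take_two(quad: Quad) -> bool:
    # Keep the quad and ask for more until two have been collected.
    seen.append(quad)
    return len(seen) < 2


# A lazy source: quads are produced only when the processor asks for them.
source = (
    ("http://example.org/s", "http://example.org/p", f"http://example.org/o{i}", None)
    for i in range(1000)
)
count = emit_quads(source, take_two)
```

Because the source is a generator, the remaining 998 quads are never materialized once the callback asks to stop, so memory use stays independent of the document size.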

Gregg Kellogg
Owner

Just for clarification, RDFDataset would imply a default graph and zero or more named graphs. If Statement/Quad/Triple is used only in that context, it wouldn't need the name component.

For the sake of simplicity, I'll remove the RDFGraph/RDFDataset discussion for the time being, and we'll settle on Quad as the callback from toRDF(). If you come up with a better name for Quad, we can change to align with that.

Gregg Kellogg gkellogg referenced this issue from a commit May 23, 2012
Gregg Kellogg Remove RDFGraph and RDFDataset definitions, and cleanup some remaining references to "statement" with "quad".

This addresses issue #125
8edb1ae
Richard Cyganiak
Collaborator
cygri commented May 23, 2012

Ok, I understand the streaming issue and why the IDL needs to have an interface for individual triples/quads.

But I think this strengthens the case for divorcing the RDF algorithms from the IDL, and expressing them directly in RDF Concepts terms. Streaming is an implementation detail and shouldn't matter for how the to-/from-RDF algorithms are specified. Most JSON-LD generators that are connected to a native RDF system probably won't be streaming, and many won't use a JSON-LD API implementation at all.
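
To make the relationship between the two views concrete, the following Python sketch folds a stream of quads into a dataset-like mapping and flattens it back. The function names and the dictionary representation (graph name to set of triples, with `None` as the key of the default graph) are assumptions made for this example, not definitions from the spec.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Optional, Set, Tuple

Triple = Tuple[str, str, str]
Quad = Tuple[str, str, str, Optional[str]]
DatasetMap = Dict[Optional[str], Set[Triple]]


def quads_to_dataset(quads: Iterable[Quad]) -> DatasetMap:
    """Group quads by graph name; the key None holds the default graph."""
    dataset: DatasetMap = defaultdict(set)
    dataset[None]  # touching the key makes sure the default graph always exists
    for s, p, o, g in quads:
        dataset[g].add((s, p, o))
    return dict(dataset)


def dataset_to_quads(dataset: DatasetMap) -> Iterator[Quad]:
    """Flatten a dataset back into quads, one at a time."""
    for graph_name, triples in dataset.items():
        for s, p, o in sorted(triples):
            yield (s, p, o, graph_name)


quads = [
    ("http://example.org/a", "http://example.org/p", "http://example.org/b", None),
    ("http://example.org/c", "http://example.org/p", "http://example.org/d",
     "http://example.org/g1"),
]
dataset = quads_to_dataset(quads)
round_trip = set(dataset_to_quads(dataset))
```

One consequence visible in the sketch: a graph with no triples produces no quads, so an empty named graph cannot survive the trip through the quad view.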

Manu Sporny
Owner

Streaming isn't an implementation detail when it comes to the IDL - or rather, replace the phrase "streaming design" with "asynchronous design" and you will quickly see that doing the parsing in a synchronous fashion is a pretty bad idea (especially for large documents, or browser environments) - you don't want the UI to lock and you don't want thousands of threads waiting on a large number of documents to finish processing. That is - all synchronous systems can employ asynchronous sub-systems using callbacks. However, an asynchronous system will never be able to simulate a synchronous system because there isn't an easy way to say "wait until my subsystem has completed processing" without creating ugly spinlock hacks.

We are targeting languages like JavaScript, Python, PHP and Ruby first... languages like C, C++, C# and Java come second as far as how the API should be designed.

The RDF algorithms re-use the IDL because the terminology /should/ be aligned between the two. Since we have a note in the spec that the document should be aligned with RDF Concepts, and since Gregg has already made a number of edits (a28f820) to bring the algorithm more in line with the current RDF Concepts document, I'm inclined to close this bug as we have done all of the concrete changes that we can at this time.

Gregg is working on a different way of explaining the to/from RDF conversion, but I don't think that should be a blocker for the FPWD. @cygri, how would you like to proceed? We can close this bug stating that we have aligned with RDF Concepts as much as possible right now, and hope to do more in the future. If you disagree with the current wording in the spec, please cite specific examples and hopefully do that in a different bug so that we have something a bit more concrete to work with.

Gregg Kellogg
Owner

The only place the algorithms vary from RDF Concepts terms is in the "generate a Quad" statements. Quad is, however, defined in terms of RDF Concepts (although the hyperlink still seems to be going to the IDL definition, which I'll fix). @cygri had mentioned that RDF Concepts may have something to say about quads, so it seems like a reasonable placeholder. Otherwise, the language would be a bit more clunky: "Generate a Triple in either the default graph, or a named graph, if graph name is not null..." repeated several times. "Generate a Quad" seems more succinct.

In the next couple of days, I'll add informative sections that walk through the conversion process to make it clear to a more casual reader what the details of the algorithms mean.

Markus Lanthaler
Collaborator

@gkellogg @cygri can we close this issue, or should we use something other than "quad"?

Richard Cyganiak
Collaborator

Editorial comment on http://json-ld.org/spec/latest/json-ld-api/#quad

Replace this sentence:

See [RDF-CONCEPTS] definition for RDF triple, which most closely aligns to Quad.

with:

An RDF Quad is an RDF triple [RDF-CONCEPTS] with an optional fourth element, the graph name, being a Node.
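
Read as a data structure, the proposed definition amounts to a triple plus an optional graph name. A minimal Python sketch follows; the class and field names are hypothetical, strings stand in for RDF terms, and `None` is used here to mean "no graph name", i.e. the default graph.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Quad:
    """An RDF triple with an optional fourth element, the graph name."""
    subject: str
    predicate: str
    obj: str
    graph_name: Optional[str] = None  # None: the triple is in the default graph

    def triple(self) -> Tuple[str, str, str]:
        # Dropping the graph name leaves a plain RDF triple.
        return (self.subject, self.predicate, self.obj)

    def in_default_graph(self) -> bool:
        return self.graph_name is None


q_default = Quad("http://example.org/a", "http://example.org/p", "http://example.org/b")
q_named = Quad("http://example.org/a", "http://example.org/p", "http://example.org/b",
               graph_name="http://example.org/g1")
```

The two quads above carry the same triple and differ only in the graph they belong to, which is the distinction the "generate a Quad" wording relies on.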

Markus Lanthaler lanthaler referenced this issue from a commit November 08, 2012
Markus Lanthaler Clarified what a Quad is
This was suggested by @cygri, thanks.

This addresses #125.
846616d
Gregg Kellogg gkellogg closed this November 12, 2012
Gregg Kellogg
Owner

Updated To RDF in 6d8e825. From RDF should also be rewritten to use datasets instead of quads.
