Primitive Triples and Primitive Nodes

Paul Houle edited this page Aug 9, 2013 · 4 revisions

The PrimitiveTriple has been a part of Infovore since the very beginning.

Because invalid triples are endemic in large RDF data sets, there's a need for a data type that can represent both valid and invalid triples.

A primitive triple is simply three strings representing a subject, a predicate, and an object. An N-Triples file can be parsed with a lightweight parser that would, for instance, convert

blah blah blah.

to PrimitiveTriple("blah","blah","blah")
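As a rough illustration of such a lightweight parser, the following Java sketch splits a line on whitespace into three strings and performs no validation at all. The names SketchTriple and LightweightParser are hypothetical and are not Infovore's actual classes; the real PrimitiveTriple and its parser may differ.

```java
import java.util.Optional;

// Hypothetical stand-in for a primitive triple: three unvalidated strings.
final class SketchTriple {
    public final String subject;
    public final String predicate;
    public final String object;

    SketchTriple(String subject, String predicate, String object) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
    }
}

// Hypothetical lightweight parser; it does not check that nodes are valid RDF.
final class LightweightParser {
    static Optional<SketchTriple> parseLine(String line) {
        String trimmed = line.trim();
        // Skip blank lines and comment lines.
        if (trimmed.isEmpty() || trimmed.startsWith("#")) {
            return Optional.empty();
        }
        // Drop the terminating "." if there is one.
        if (trimmed.endsWith(".")) {
            trimmed = trimmed.substring(0, trimmed.length() - 1).trim();
        }
        // Split on the first two runs of whitespace only, so that an object
        // containing spaces (such as a quoted literal) stays in one piece.
        String[] parts = trimmed.split("\\s+", 3);
        if (parts.length < 3) {
            return Optional.empty();
        }
        return Optional.of(new SketchTriple(parts[0], parts[1], parts[2]));
    }
}
```

Under these assumptions, `LightweightParser.parseLine("blah blah blah.")` yields a triple whose subject, predicate, and object are each the string `blah`, whether or not those strings are valid RDF nodes.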

By itself, the PrimitiveTriple is not very intelligent. For instance, it doesn't expand a prefixed name such as rdf:type to the full URI. This could be a big problem, but once RDF data has been validated and normalized by Parallel Super Eyeball 3 (PSE3), all nodes are in a canonical and complete form.

Primitive Triples are used in the early processing stages of questionable data, and they can also be used to rapidly add RDF capability to tools like Pig, Hive, and Cascading.

The existence of the Primitive Triple also implies the existence of Primitive Nodes and Primitive Quads. The strings contained inside a Primitive Triple are Primitive Nodes. In valid data, these look like

<http://example.com/example>
1772
"bakemono"@ja
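Because a Primitive Node is only a string, any code that needs to know what kind of node it holds has to inspect the text. The Java sketch below shows one superficial way to do that, based only on the three shapes listed above. The class PrimitiveNodeKind and its categories are hypothetical, and the checks are far weaker than real RDF validation.

```java
// Hypothetical helper that guesses the kind of a primitive node from its shape.
final class PrimitiveNodeKind {
    enum Kind { URI, LITERAL, INTEGER, OTHER }

    static Kind classify(String node) {
        // A URI reference is wrapped in angle brackets.
        if (node.length() >= 2 && node.startsWith("<") && node.endsWith(">")) {
            return Kind.URI;
        }
        // A quoted literal starts with a double quote; any language tag or
        // datatype that follows the closing quote is not examined here.
        if (node.startsWith("\"")) {
            return Kind.LITERAL;
        }
        // A bare run of digits, optionally signed, is treated as an integer.
        if (node.matches("[+-]?\\d+")) {
            return Kind.INTEGER;
        }
        // Anything else (prefixed names, garbage) is left unclassified.
        return Kind.OTHER;
    }
}
```

A string that falls into OTHER is not necessarily invalid; it may simply be in a form, such as a prefixed name, that has not yet been normalized.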

A Primitive Quad is a data structure analogous to the Primitive Triple that would, hypothetically, represent an RDF quad. It hasn't been implemented yet because so far there has been enough to do with triples, but once there is interest in working with quads, it will be created.

Note that Primitive Triples of various levels of quality can be found throughout the Infovore system. For instance, when Freebase data is being pre-processed by the FreebaseRDFTool, the file is parsed into Primitive Triples that contain nodes like

ns:m.0bzy1

When that file is processed, the prefix is expanded to produce a node that can stand on its own

<http://rdf.freebase.com/ns/m.0bzy1>
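The expansion step can be sketched in Java as a simple string rewrite. This is only an illustration under stated assumptions: the class PrefixExpander and its prefix table are hypothetical and are not taken from the FreebaseRDFTool source. The ns: mapping follows the example above, and the rdf: mapping is the standard RDF namespace.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper that rewrites prefixed names into full URI references.
final class PrefixExpander {
    private static final Map<String, String> PREFIXES = new LinkedHashMap<>();
    static {
        PREFIXES.put("ns", "http://rdf.freebase.com/ns/");
        PREFIXES.put("rdf", "http://www.w3.org/1999/02/22-rdf-syntax-ns#");
    }

    static String expand(String node) {
        for (Map.Entry<String, String> e : PREFIXES.entrySet()) {
            String lead = e.getKey() + ":";
            // Only nodes that begin with a known prefix are rewritten, so
            // quoted literals and <...> URI references pass through untouched.
            if (node.startsWith(lead)) {
                return "<" + e.getValue() + node.substring(lead.length()) + ">";
            }
        }
        // Unknown shapes are returned as-is; no validation is attempted.
        return node;
    }
}
```

With this table, `PrefixExpander.expand("ns:m.0bzy1")` returns `<http://rdf.freebase.com/ns/m.0bzy1>`. Like the tool described here, the sketch has only a superficial view of the syntax, so an invalid node simply passes through unchanged.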

The FreebaseRDFTool has only a superficial understanding of RDF syntax, so the output still contains invalid nodes. The PSE3 process works on the output of the FreebaseRDFTool, converting valid PrimitiveTriples to Jena triples, which are then serialized to N-Triples by Jena's RIOT serializer.

Thus, the flexibility of PrimitiveTriples comes with some risk: the same code can produce very different results depending on the quality of the input files. PrimitiveTriples are adequate for many tasks, but you need to be aware of what is in the input file.