Skip to content
paulhoule edited this page Dec 26, 2012 · 11 revisions

Millipede

Overview

Millipede is a Map/Reduce framework designed for processing RDF and Freebase quad data on a single multi-threaded workstation or server. Unlike other RDF tools, it works in a streaming mode that requires that we only hold a small fraction of the triples in memory. Unlike Hadoop, it is simple to operate, compatible with both Windows and Unix, and is optimized for handling data sets around one billion triples.

Concepts

It's a general policy in millipede that exceptions be thrown upward as high as is correct, so most methods of most interfaces throw Exception.

Sink A Sink is a consumer for push-mode processing. A Sink may write to a file (see CodecSink or NTriplesSink, collect triples in a Jena model (see JenaModelSink) or process the stream and pass results to one or more output sinks (see ProgressReportingSink or FilterSink). In this way, "Map" operations can be implemented

GroupingSink A GroupingSink implements an abstract class to implement a "Reduce" operation. Assuming that the stream passing through it is ordered by group key, a GroupingSink receives notifications when a group begins and ends, so it can process a group of facts together. The AssemblerSegment is an example of a GroupingSink that gathers all triples sharing a common subject into an in-memory Jena model and performs SPARQL operations on it.

Source A Source implements pull mode processing. For practical reasons, most processing pipelines start with a Source and use methods from the Plumbing class to push data into a push pipeline. Even so, pull-based pipelines can be assembled with Sources such as TransformingSource and OrderVerificationFilter. OrderedMergeSource shuffles together sorted data from multiple sources and StoredValueSource

MultiFile MultiFile implements MultiSource. Unlike a Source, a MultiSource implements push-mode processing, because some of our data "sources" such as Jena's N-Triples parser, work only in a push mode. A MultiFile consists of a number of files in a directory with names like

triples0000.nt.gz
triples0001.nt.gz
...
triples1023.nt.gz

Facts are assigned to files on the basis of a PartitionFunction supplied when the MultiFile object is initialized. Many subclasses of MultiFile exist such as LineMultiFile and TripleMultiFile. In addition to the pushBin() method inherited from MultiSource, MultiFiles implement a createSink() method that creates Sink objects that deposit facts into separate files.

PushMultiSource because certain operations (such as OrderedMergeSource are difficult to do in push mode, PullMultiSource provides an alternate API for pull-mode processing. This is implemented by LineMultiFile for instance.

Millipede A Millipede implements createSegment(int segmentNumber) where the segmentNumber is the number of the file that is being processed. Each "segment" is a Sink. Typically you'll write a Millipede and a segment implementation that go together, such as Sort and SortSegment.

Accumulator An Accumulator is Sink that implements a getResult() method that returns a result calculated on the complete stream.

Runner A Runner spawns multiple threads to stream data from a MultiSource into Millipede segments in parallel. If a segment implements Accumulator the Runner will return all of the results returned as a List.

Partitioner Given a PartitionFunction, the Partitioner breaks up a single file into a MultiFile suitable for parallel processing.

Examples

A number of processing pipelines based on millipede exist in the hydroxide framework.

PartitionQuadDumpApp is the first step in processing a Freebase quad file and combines the Partitioner with a simple pipeline to filter and reverse certain quads.

SortQuadDumpApp uses the Sort millipede to order the quad dump on subject.

ExtractKeyRecordsApp is a simple Map operation that extracts a simple type of quad in an efficient format.

AssemblerApp is a fairly complex application that implements a reduce operation with processing steps described in the SPARQL language.

Clone this wiki locally