-
Notifications
You must be signed in to change notification settings - Fork 25
Using GoldenOrb
GoldenOrb is a powerful framework that allows for a multitude of possibilities when it comes to large scale graph processing. In order to use it to solve problems, a user must implement classes that extend framework classes to analyze graph. We recommended that you read Google's Pregel paper to understand graph algorithms that follow a distributed bulk synchronous parallel (BSP) model.
The first thing to do is to ask yourself what kind of graph problem you want to solve. What are the nodes? What are the node values? What messages are being passed in between? What is my ultimate analytical goal?
In this tutorial, we'll use the maximum value example from the Pregel paper. In this example, the ultimate goal of the computations is to find the maximum value in a graph. The concept here is that each vertex (a node) in the graph propagates its own value in a message to other vertices, and with each step each vertex looks at and compares its own value to the received values from other vertices. Once a vertex finds that there's another value out there that's greater than its own, it propagates messages once more and votes to halt, thereby turning off that vertex.
Let's return to GoldenOrb knowing what we need to define within the framework. The key building blocks that GoldenOrb provides are Vertex, VertexBuilder, VertexWriter, and OrbRunner. The first three classes are concerned with how we define what a vertex is and what data we'll be working with in the graph. The last class, OrbRunner, is used to configure and start a Job. Let's go over the specifics of each class and what you'll need to define.
A Vertex encompasses an ID, value and connecting edges, and is the representation of a single node in a graph.
- vertexID - the unique ID of the vertex, usually a String
- vertexValue - the data value of the vertex, in a max value implementation this would be an Integer
- edges - a collection of edges that define how the vertex is connected to other vertices
- compute - The method by which each vertex processes data. In a max value problem, the vertex compares the values received from other vertices in a message object with its own value. If it finds that there is a value greater than its own, it sends a message with that value out to its neighboring vertices.
A minor but important note with implementing Vertex: the framework defines some fields in Vertex with generic classes, so the vertexValue class, edge class, and message class that each vertex uses must be specified. These classes must extend the Writable interface from Hadoop. The framework supplies some already implemented basic message types that wrap some primitive data types, but vertices must be implemented by the user. See MaximumValueVertex.java
The VertexBuilder class is a utility class meant to do exactly as its name says, build vertices. A VertexBuilder should be extended to implement the buildVertex method, which essentially takes its parameters and constructs a new vertex object. See MaximumValueVertexReader.java
The VertexWriter class is used to create a relevant OrbContext for a vertex object. See MaximumValueVertexWriter.java
OrbRunner is the class that kickstarts every Job. The framework provides an OrbRunner class that must be extended on a per-Job basis to run. The main function of the OrbRunner class is to parse arguments when starting the Job from the command line, to initialize and connect to ZooKeeper, and to initialize and set the proper OrbConfiguration settings.
Properties in conf/orb-default.xml
and conf/orb-site.xml
can be overriden at the command line. GoldenOrb also supports copying the jar file containing your job to all of the running OrbTracker nodes via the property goldenOrb.orb.localFilesToDistribute
used as follows:
java -cp conf/.:org.goldenorb.core-0.1.1-SNAPSHOT.jar:lib/*:yourjar.jar org.algorithms.YourAlgorithm goldenOrb.orb.localFilesToDistribute=/home/user/yourjar.jar mapred.input.dir=/data/in/ mapred.output.dir=/data/out/ goldenOrb.orb.requestedPartitions=3 goldenOrb.orb.reservedPartitions=0 goldenOrb.orb.classpaths=/home/user/yourjar.jar
In some cases, it might be necessary to write your own Message and Writable types, like in a shortest path graph problem. The framework easily allows you to do so by properly extending Message and Writable.
See Python scripts