# Simple Graph Example

This example is adapted from the Apache Spark GraphX documentation. It shows the basic structure of the `Graph` class.

In [None]:
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD


A graph consists of vertices (sometimes called nodes) and edges.

Edges are connections between vertices. Each edge can carry an attribute, such as a numeric value or a word. Usually, the word names a relationship between the vertices. For example, "A likes B" means that vertices A and B are connected by an edge $e_{AB}=$ "likes".
In Apache Spark, graphs are *directed*, which means that the relationship from A to B can differ from the relationship from B to A.
For example, "A likes B" and "B dislikes A" can both hold in the same graph (although that is very sad for A).
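
To make the direction concrete, here is a minimal sketch using GraphX's `Edge` case class. The vertex IDs `1L` and `2L` are hypothetical stand-ins for A and B, chosen only for illustration.

```scala
import org.apache.spark.graphx.Edge

// Hypothetical IDs: 1L stands for A, 2L stands for B.
val aLikesB = Edge(1L, 2L, "likes")       // edge from A to B
val bDislikesA = Edge(2L, 1L, "dislikes") // edge from B to A

// Source and destination are separate fields, so direction is part of the edge.
println(s"${aLikesB.srcId} -> ${aLikesB.dstId}: ${aLikesB.attr}")
println(s"${bDislikesA.srcId} -> ${bDislikesA.dstId}: ${bDislikesA.attr}")
```

Both edges can be placed in the same edge list; neither one implies or overrides the other.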

To create a graph, we create a list of vertices and a list of edges (connections). Once we have these, and have decided how to handle edges that refer to vertices missing from the vertex list, we can create the graph.

In [4]:
// Create an RDD for the vertices
val users: RDD[(VertexId, (String, String))] =
  sc.parallelize(Array((3L, ("rxin", "student")), (7L, ("jgonzal", "postdoc")),
                       (5L, ("franklin", "prof")), (2L, ("istoica", "prof")),
                       (4L, ("peter", "student"))))
// Create an RDD for edges
val relationships: RDD[Edge[String]] =
  sc.parallelize(Array(Edge(3L, 7L, "collab"),    Edge(5L, 3L, "advisor"),
                       Edge(2L, 5L, "colleague"), Edge(5L, 7L, "pi"),
                       Edge(4L, 0L, "student"),   Edge(5L, 0L, "colleague")))
// Define a default user in case a relationship refers to a missing user
val defaultUser = ("John Doe", "Missing")

// Build the initial Graph
val graph = Graph(users, relationships, defaultUser)

users = ParallelCollectionRDD[26] at parallelize at <console>:40
relationships = ParallelCollectionRDD[27] at parallelize at <console>:45
defaultUser = (John Doe,Missing)
graph = org.apache.spark.graphx.impl.GraphImpl@5cb896


org.apache.spark.graphx.impl.GraphImpl@5cb896
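
One way to check the consistency of the inputs is to look for vertex IDs that appear in an edge but not in the vertex list. The sketch below is only an illustration; it assumes the `users` and `relationships` RDDs defined in the cell above, and the names `endpointIds` and `missingIds` are introduced here.

```scala
import org.apache.spark.graphx.VertexId

// Every vertex ID mentioned by an edge, as source or destination.
val endpointIds = relationships.flatMap(e => Seq(e.srcId, e.dstId)).distinct()

// IDs that appear in an edge but have no entry in the vertex RDD.
val missingIds: Array[VertexId] = endpointIds.subtract(users.keys).collect()

println(missingIds.mkString("Missing vertex IDs: ", ", ", ""))
```

For the data above, the only missing ID is 0, which is the vertex that `Graph(...)` fills in with `defaultUser`.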

To make things explicit, let's inspect the graph's `vertices` and see what the structure looks like. Each vertex holds an ID (a `VertexId`) and an attribute, which here is a pair of labels: a name and a role.

In [6]:
graph.vertices.collect.foreach(println)

(0,(John Doe,Missing))
(2,(istoica,prof))
(3,(rxin,student))
(4,(peter,student))
(5,(franklin,prof))
(7,(jgonzal,postdoc))
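
Because each vertex is a `(VertexId, (String, String))` pair, the labels can be reached with ordinary pattern matching. The following sketch assumes the `graph` built above; `postdocCount` and `profNames` are names introduced for this example.

```scala
// Count the vertices whose second label (the role) is "postdoc".
val postdocCount = graph.vertices.filter {
  case (id, (name, pos)) => pos == "postdoc"
}.count

// Collect the first label (the name) of every vertex whose role is "prof".
val profNames = graph.vertices
  .filter { case (_, (_, pos)) => pos == "prof" }
  .map { case (_, (name, _)) => name }
  .collect()
  .sorted

println(s"postdocs: $postdocCount, profs: ${profNames.mkString(", ")}")
```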


The `edges` member returns only the IDs of the connected vertices, along with the relationship. Notice that the two connections to vertex 0 are dangling edges: no vertex 0 was defined, so the graph assigned it the default user. This is a bit of error handling built into the graph structure.

In [8]:
graph.edges.collect.foreach(println)

Edge(3,7,collab)
Edge(5,3,advisor)
Edge(2,5,colleague)
Edge(5,7,pi)
Edge(4,0,student)
Edge(5,0,colleague)
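
The dangling edges can also be located programmatically instead of by eye. This is a sketch under the assumption that `graph` and `defaultUser` are the values defined earlier; `placeholderIds` and `danglingEdges` are names made up for this example.

```scala
// IDs of vertices that were filled in with the default attribute.
val placeholderIds = graph.vertices
  .filter { case (_, attr) => attr == defaultUser }
  .map(_._1)
  .collect()
  .toSet

// Edges that touch one of those placeholder vertices.
val danglingEdges = graph.edges
  .filter(e => placeholderIds.contains(e.srcId) || placeholderIds.contains(e.dstId))
  .collect()

danglingEdges.foreach(println)
```

Note that this approach would also flag a real user whose attribute happens to equal `defaultUser`, so the default should be a value that cannot occur in the data.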


The `triplets` member returns the edges again, but each one is annotated with the attributes of its source and destination vertices (`srcAttr` and `dstAttr`). Now we can recover the vertex labels if we want to see them.

In [9]:
// Notice that there is a user 0 (for which we have no information) connected to users
// 4 (peter) and 5 (franklin).
graph.triplets.map(
  triplet => triplet.srcAttr._1 + " is the " + triplet.attr + 
  " of " + triplet.dstAttr._1
).collect.foreach(println(_))

rxin is the collab of jgonzal
franklin is the advisor of rxin
istoica is the colleague of franklin
franklin is the pi of jgonzal
peter is the student of John Doe
franklin is the colleague of John Doe
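
A triplet exposes the vertex IDs and the edge attribute alongside both vertex attributes, so all of them can be combined in one pass. The sketch below assumes the `graph` from above; the output format and the name `described` are arbitrary choices for illustration.

```scala
// Each triplet carries srcId, dstId, attr, srcAttr, and dstAttr.
val described = graph.triplets.map { t =>
  s"${t.srcAttr._1} (${t.srcAttr._2}, id ${t.srcId}) --${t.attr}--> " +
  s"${t.dstAttr._1} (${t.dstAttr._2}, id ${t.dstId})"
}.collect()

described.foreach(println)
```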


We can add our own error handling here by using the `subgraph` method, which is the graph equivalent of `filter`.

In [1]:
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph(vpred = (id, attr) => attr._2 != "Missing")
// The valid subgraph will disconnect users 4 and 5 by removing user 0
validGraph.vertices.collect.foreach(println(_))
validGraph.triplets.map(
  triplet => 
  triplet.srcAttr._1 + " is the " + triplet.attr + 
  " of " + triplet.dstAttr._1
).collect.foreach(println(_))

(2,(istoica,prof))
(3,(rxin,student))
(4,(peter,student))
(5,(franklin,prof))
(7,(jgonzal,postdoc))
rxin is the collab of jgonzal
franklin is the advisor of rxin
istoica is the colleague of franklin
franklin is the pi of jgonzal
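
Besides the vertex predicate `vpred`, `subgraph` also accepts an edge predicate `epred` that is evaluated on each triplet. The following sketch assumes the `graph` from above; the choice to drop "colleague" edges, and the names `noColleagues` and `cleaned`, are purely illustrative.

```scala
// Keep every vertex, but drop edges labelled "colleague".
val noColleagues = graph.subgraph(epred = t => t.attr != "colleague")
noColleagues.edges.collect.foreach(println)

// Both predicates can be combined: drop "colleague" edges and the placeholder vertex.
val cleaned = graph.subgraph(
  epred = t => t.attr != "colleague",
  vpred = (id, attr) => attr._2 != "Missing"
)
println(s"${cleaned.vertices.count} vertices, ${cleaned.edges.count} edges")
```

An edge survives only if it passes `epred` and both of its endpoints pass `vpred`, which is why removing a vertex also removes the edges attached to it.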


