In [3]:
import org.apache.spark._
import org.apache.spark.graphx._
// To make some of the examples work we will also need RDD
import org.apache.spark.rdd.RDD
import org.apache.spark.graphx.{Edge, Graph}

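### 5.1 Properties of a Graph
#### 1. Graph representation
(1) Each vertex is a key-value pair: the key is a 64-bit `Long` (the `VertexId`), and the value is the user-defined vertex attribute  
(2) Each `Edge` has a source vertex ID (`srcId`) and a destination vertex ID (`dstId`), plus an edge attribute (`attr`)  
(3) **Graph[VD,ED]** is the generic graph type: VD is the vertex attribute type, and ED is the edge attribute type  
(4) In some cases vertices need attributes of several different types; this is done by having those types extend a common base type :   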
    ```scala
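    // Common base type shared by all vertex attribute types
    class VertexProperty()
    case class UserProperty(name: String) extends VertexProperty
    case class ProductProperty(name: String, price: Double) extends VertexProperty
    // A graph whose vertices may hold either kind of property
    var g: Graph[VertexProperty, String] = null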
    ```
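A minimal sketch of how such a mixed-type graph might be built and queried. The sample vertices, edges and the names `mixedVertices`, `purchases` and `shop` are made up for illustration, and `sc` is assumed to be the notebook's SparkContext.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

// Same hierarchy as above: one base type, two concrete attribute types.
class VertexProperty()
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty

// Hypothetical sample data: two users and one product.
val mixedVertices: RDD[(VertexId, VertexProperty)] = sc.parallelize(Seq(
  (1L, UserProperty("alice")),
  (2L, UserProperty("bob")),
  (10L, ProductProperty("book", 12.5))))

val purchases: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(1L, 10L, "bought"),
  Edge(2L, 10L, "viewed")))

val shop: Graph[VertexProperty, String] = Graph(mixedVertices, purchases)

// Pattern matching recovers the concrete type of each vertex attribute.
val userCount = shop.vertices.filter {
  case (_, _: UserProperty) => true
  case _                    => false
}.count()
```

Because every vertex attribute is typed as `VertexProperty`, code that needs the concrete fields has to match on the subtype, as the filter does here.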
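#### 2. Graphs are immutable
(1) Like an RDD, a Graph is immutable: any operation that changes the values or the structure of a Graph produces a brand-new Graph;   
(2) However, the parts of the original Graph that were not modified can be reused by the new Graph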
   
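#### 3. Logical representation of a Graph
Logically, a Graph is a pair of RDD collections (in the source code, both VertexRDD and EdgeRDD extend the RDD class), as shown below : 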
<img src="img/graphres.png" width="65%">

#### 4. Creating a Graph   
The cell below builds a Graph from 2 RDDs; more ways of creating a Graph are introduced later

In [4]:
import org.apache.spark.graphx.Graph
// rdd for vertex
val users:RDD[(VertexId,(String,String))] = sc.parallelize(Array((3L, ("rxin", "student")), 
                                                                 (7L, ("jgonzal", "postdoc")),
                                                                 (5L, ("franklin", "prof")), 
                                                                 (2L, ("istoica", "prof"))))
// rdd for edge
val relationships:RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "collab"),
                                                           Edge(5L, 3L, "advisor"),
                                                           Edge(2L, 5L, "colleague"), 
                                                           Edge(5L, 7L, "pi")))

val defaultUser:(String,String) = ("Anonymous","Missing")
val g:Graph[(String,String),String] = Graph(users,relationships,defaultUser)

users = ParallelCollectionRDD[0] at parallelize at <console>:37
relationships = ParallelCollectionRDD[1] at parallelize at <console>:40
defaultUser = (Anonymous,Missing)
g = org.apache.spark.graphx.impl.GraphImpl@4e301178


org.apache.spark.graphx.impl.GraphImpl@4e301178

In [7]:
// VertexRDD[VD] extends RDD[(VertexId, VD)], so ordinary RDD operations apply
val cnt1 = g.vertices.filter {
  case (id, (name, position)) => position == "postdoc"
}.count

// EdgeRDD[ED] extends RDD[Edge[ED]]
val cnt2 = g.edges.filter(e => e.srcId > e.dstId).count

cnt1 = 1
cnt2 = 1


1

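#### 5. The triplet view of a Graph
(1) Besides viewing a graph as an `RDD[(VertexId, VD)]` plus an `RDD[Edge[ED]]`, Graph also offers a triplet view, in which each element consists of:   
&nbsp;&nbsp;&nbsp;&nbsp;(1) (srcId,srcAttr),  
&nbsp;&nbsp;&nbsp;&nbsp;(2) (dstId,dstAttr),  
&nbsp;&nbsp;&nbsp;&nbsp;(3) attr    
(2) The triplet view can be thought of as the result of a join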
```sql
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
FROM edges AS e
LEFT JOIN vertices AS src ON e.srcId = src.id
LEFT JOIN vertices AS dst ON e.dstId = dst.id
```

In [8]:
/**
 * class EdgeTriplet[VD, ED]{
 *    override def toString: String = ((srcId, srcAttr), (dstId, dstAttr), attr).toString()
 * } 
 */
g.triplets.collect().foreach(println)

((3,(rxin,student)),(7,(jgonzal,postdoc)),collab)
((5,(franklin,prof)),(3,(rxin,student)),advisor)
((2,(istoica,prof)),(5,(franklin,prof)),colleague)
((5,(franklin,prof)),(7,(jgonzal,postdoc)),pi)
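
Since each `EdgeTriplet` carries `srcAttr`, `attr` and `dstAttr` together, a single `map` over `triplets` is enough to describe every relationship. The sketch below rebuilds a reduced version of the graph above under the illustrative names `people`, `links` and `social`; `sc` is assumed to be the notebook's SparkContext.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val people: RDD[(VertexId, (String, String))] = sc.parallelize(Seq(
  (3L, ("rxin", "student")),
  (5L, ("franklin", "prof")),
  (7L, ("jgonzal", "postdoc"))))
val links: RDD[Edge[String]] = sc.parallelize(Seq(
  Edge(5L, 3L, "advisor"),
  Edge(5L, 7L, "pi")))
val social = Graph(people, links)

// One sentence per triplet: <source name> is the <edge attr> of <destination name>.
val facts: Array[String] = social.triplets
  .map(t => s"${t.srcAttr._1} is the ${t.attr} of ${t.dstAttr._1}")
  .collect()
  .sorted   // sort only to make the printed order deterministic

facts.foreach(println)
```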


### 5.2 Operators
#### 1. map operators
Every map operation produces a new Graph, but the new Graph reuses part of the data of the Graph it was mapped from
```scala
class Graph[VD, ED] {
    def mapVertices[VD2: ClassTag](map: (VertexId, VD) => VD2): Graph[VD2, ED]
    def mapEdges[ED2: ClassTag](map: Edge[ED] => ED2): Graph[VD, ED2]
    def mapTriplets[ED2: ClassTag](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
}
```
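
A short sketch of the three map operators on a tiny graph. The graph `base` and the derived names are hypothetical, and `sc` is assumed to be the notebook's SparkContext.

```scala
import org.apache.spark.graphx.{Edge, Graph}

// Tiny illustrative graph: every vertex gets attribute 1, edges carry a weight.
val base: Graph[Int, Int] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, 10), Edge(2L, 3L, 20))), 1)

// mapVertices: the new vertex attribute is computed from the id and the old attribute.
val labelled: Graph[String, Int] = base.mapVertices((id, _) => s"v$id")

// mapEdges: only the edge itself is visible to the function.
val doubled: Graph[Int, Int] = base.mapEdges(e => e.attr * 2)

// mapTriplets: the new edge attribute may depend on both endpoint attributes.
val described: Graph[String, String] =
  labelled.mapTriplets(t => s"${t.srcAttr}->${t.dstAttr}")

// `base` itself is unchanged by all three calls.
val originalWeights = base.edges.map(_.attr).collect().sorted
```

Each call returns a new graph with a possibly different attribute type, while `base` keeps its original edge weights.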

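#### 2. Structural operators
(1) reverse: Graph[VD, ED]   
Swaps the src and dst of every edge; because neither the edge attributes nor the number of edges change, it can be implemented very efficiently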
```scala
def reverse: Graph[VD, ED]
```
  
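(2) subgraph: keeps only the vertices that satisfy `vpred`, and the edges that satisfy `epred` and whose two endpoints both satisfy `vpred`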
```scala
def subgraph(
      epred: EdgeTriplet[VD, ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
```
(3) mask: builds a subgraph that keeps only the vertices and edges that also appear in the other graph
```scala
def mask[VD2: ClassTag, ED2: ClassTag](other: Graph[VD2, ED2]): Graph[VD, ED]
```

In [11]:
import org.apache.spark.graphx.Graph
// rdd for vertex
val users:RDD[(VertexId,(String,String))] = sc.parallelize(Array((3L, ("rxin", "student")), 
                                                                 (7L, ("jgonzal", "postdoc")),
                                                                 (5L, ("franklin", "prof")), 
                                                                 (2L, ("istoica", "prof"))))
// rdd for edge
val relationships:RDD[Edge[String]] = sc.parallelize(Array(Edge(3L, 7L, "collab"),
                                                           Edge(5L, 3L, "advisor"),
                                                           Edge(2L, 5L, "colleague"), 
                                                           Edge(5L, 7L, "pi")))

val defaultUser:(String,String) = ("Anonymous","Missing")
val g:Graph[(String,String),String] = Graph(users,relationships,defaultUser)
val validGraph = g.subgraph(vpred = (vertexId,attr)=> attr._2 != "prof")
validGraph.triplets.collect().foreach(println)

((3,(rxin,student)),(7,(jgonzal,postdoc)),collab)


users = ParallelCollectionRDD[77] at parallelize at <console>:46
relationships = ParallelCollectionRDD[78] at parallelize at <console>:51
defaultUser = (Anonymous,Missing)
g = org.apache.spark.graphx.impl.GraphImpl@5d140851
validGraph = org.apache.spark.graphx.impl.GraphImpl@1806e871


org.apache.spark.graphx.impl.GraphImpl@1806e871
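
`reverse` and `mask` can be tried in the same way. The sketch below uses a made-up four-vertex chain; the names `chain`, `scored` and `valid` are illustrative, and `sc` is assumed to be the notebook's SparkContext. Both graphs passed to `mask` are derived from the same parent graph here, which mirrors the usage shown in the GraphX programming guide (compute on the full graph, then restrict the result to a valid subgraph).

```scala
import org.apache.spark.graphx.{Edge, Graph}

val chain: Graph[Int, String] = Graph.fromEdges(
  sc.parallelize(Seq(Edge(1L, 2L, "a"), Edge(2L, 3L, "b"), Edge(3L, 4L, "c"))), 0)

// reverse: every edge (src, dst) becomes (dst, src); attributes are kept.
val reversedPairs = chain.reverse.edges
  .map(e => (e.srcId, e.dstId)).collect().sorted

// Some result computed on the full graph (here just a relabelling of vertices).
val scored: Graph[Long, String] = chain.mapVertices((id, _) => id * 10)

// A restriction of the same graph: only vertices 1, 2 and 3 are valid.
val valid = chain.subgraph(vpred = (id, _) => id <= 3L)

// mask: keep the attributes of `scored`, but only where `valid` has a vertex or edge.
val masked: Graph[Long, String] = scored.mask(valid)
val maskedVertices = masked.vertices.collect().sorted
val maskedEdges = masked.edges.map(e => (e.srcId, e.dstId, e.attr)).collect().sorted
```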

#### 3. join operators
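
GraphX offers two join operators for merging an external RDD into the vertex attributes: `joinVertices` and `outerJoinVertices`. The sketch below shows both on a made-up graph; `follows` and `nicknames` are illustrative names, and `sc` is assumed to be the notebook's SparkContext.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

val follows: Graph[String, Int] = Graph(
  sc.parallelize(Seq((1L, "a"), (2L, "b"), (3L, "c"))),
  sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(1L, 3L, 1))))

// Hypothetical side table that covers only some of the vertices.
val nicknames: RDD[(VertexId, String)] = sc.parallelize(Seq((1L, "ace")))

// joinVertices: vertices without a match keep their old attribute; the type stays VD.
val renamed: Graph[String, Int] =
  follows.joinVertices(nicknames)((id, name, nick) => s"$name/$nick")

// outerJoinVertices: every vertex is visited, the match arrives as an Option,
// and the attribute type may change (String becomes (String, Int) here).
val withOutDegree: Graph[(String, Int), Int] =
  follows.outerJoinVertices(follows.outDegrees) { (id, name, deg) =>
    (name, deg.getOrElse(0))
  }

val renamedVertices = renamed.vertices.collect().sortBy(_._1)
val degreeVertices = withOutDegree.vertices.collect().sortBy(_._1)
```

`outDegrees` only contains vertices that have at least one outgoing edge, which is why the `Option` and its `getOrElse(0)` default are needed.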

### 5.3 Pregel
#### Preface
1. GraphX contains many internal optimizations; for details see the [GraphX paper](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-gonzalez.pdf)

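#### I. Pregel API
1. Three kinds of user-defined function
    1. **Vertex Program**:   
       The Vertex Program runs on every vertex. Its inputs are
        1. the incoming messages (in GraphX, already combined into a single message by the Merge Message Program)
        2. the vertex attribute (state)
        3. the vertexId  
       Its output is the new state of the vertex (the new attr)
    2. **Send Message Program**:  
       Runs on the edges that need it; its input is the triplet view **EdgeTriplet**, and its output is the messages to send (in GraphX an iterator, which may be empty)
    3. **Merge Message Program**:    
       Merges 2 messages addressed to the same vertex into one and outputs the combined message  
       Messages are kv pairs: (vertexId as key, vertex message as value)
2. Three parameters
    1. Initial message: sent to every vertex and used in the first iteration  
    2. Max Iteration: the maximum number of iterations    
    3. Edge Direction: filters the edges on which the Send Message Program runs; for example, with `EdgeDirection.Out` it runs only on the out-edges of vertices that received a message in the previous round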
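
The three functions and three parameters above map directly onto `Graph.pregel`. The sketch below follows the single-source shortest path example from the GraphX programming guide; the graph `roads`, its weights and the choice of source vertex are made up for illustration, and `sc` is assumed to be the notebook's SparkContext.

```scala
import org.apache.spark.graphx.{Edge, EdgeDirection, Graph, VertexId}

// Hypothetical weighted graph; the edge attribute is the edge length.
val roads: Graph[Int, Double] = Graph.fromEdges(
  sc.parallelize(Seq(
    Edge(1L, 2L, 1.0), Edge(2L, 3L, 2.0), Edge(1L, 3L, 5.0), Edge(3L, 4L, 1.0))), 0)

val sourceId: VertexId = 1L

// Vertex state = best known distance from the source.
val init: Graph[Double, Double] = roads.mapVertices((id, _) =>
  if (id == sourceId) 0.0 else Double.PositiveInfinity)

val sssp = init.pregel(
    Double.PositiveInfinity,              // initial message
    maxIterations = 10,                   // max iteration
    activeDirection = EdgeDirection.Out)( // edge direction
  // Vertex Program: keep the smaller of the current state and the incoming message.
  (id, dist, msg) => math.min(dist, msg),
  // Send Message Program: offer a shorter path to the destination, if there is one.
  t => {
    if (t.srcAttr + t.attr < t.dstAttr) Iterator((t.dstId, t.srcAttr + t.attr))
    else Iterator.empty
  },
  // Merge Message Program: combine two messages addressed to the same vertex.
  (a, b) => math.min(a, b))

val distances = sssp.vertices.collect().sortBy(_._1)
distances.foreach(println)
```

Iteration stops as soon as no messages are sent, so `maxIterations` only acts as an upper bound here.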
    
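#### II. Pregel performance tuning
1. Partition the VertexRDD manually   
    GraphX only partitions the EdgeRDD, so the VertexRDD has to be partitioned by hand; empirically, performance is better when the VertexRDD and the EdgeRDD have the same number of partitions
2. Set up checkpointing    
    GraphX algorithms are iterative, and every iteration makes the lineage of the VertexRDD and EdgeRDD behind the graph longer, so caching is needed to avoid recomputing the RDD chain in each iteration. Caching does not change one fact, though: the list of object references from child RDDs to parent RDDs keeps growing. To cut the RDD lineage, checkpoint once every few iterations.  
3. The cell below is a (simulated) iterative graph-update algorithm with checkpointing :    
As in Pregel, the graph is persisted to memory on every iteration and checkpointed at a fixed interval;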

In [30]:
sc.setLogLevel("WARN")
import org.apache.spark.storage.StorageLevel

def fun() = {
    sc.setCheckpointDir("/tmp/test")
    var updateCount = 0
    val interval = 10

    def update(data:Graph[Int,Int]):Unit = {
      data.persist()  // persist on every iteration
      updateCount += 1
      if(updateCount%interval == 0)   // checkpoint once every `interval` iterations
        data.checkpoint()
    }

    var g = Graph.fromEdges(sc.parallelize(Array(Edge(1L,3L,1),
      Edge(2L,4L,1),
      Edge(3L,4L,1))),1)

    g.persist()
    println(g.vertices.count())

    for(i <- 1 to 20){
      println(s"Iteration $i")
      val newGraph = g.mapVertices((vid,vattr) => (vattr*i)/17)
      g = g.outerJoinVertices(newGraph.vertices)({(vid,vAttr,newAttr) => newAttr.getOrElse(-99)})
      update(g)
      println(g.vertices.count)
    }

    g.triplets.collect.foreach(println)
}
fun()

4
Iteration 1
4
Iteration 2
4
Iteration 3
4
Iteration 4
4
Iteration 5
4
Iteration 6
4
Iteration 7
4
Iteration 8
4
Iteration 9
4
Iteration 10
4
Iteration 11
4
Iteration 12
4
Iteration 13
4
Iteration 14
4
Iteration 15
4
Iteration 16
4
Iteration 17
4
Iteration 18
4
Iteration 19
4
Iteration 20
4


fun: ()Unit


((1,0),(3,0),1)
((2,0),(4,0),1)
((3,0),(4,0),1)
