### $\S$ Start using GraphFrames
Below, we create a GraphFrame and run the PageRank algorithm on it.

In [1]:
val myspark = spark
import myspark.implicits._

myspark = org.apache.spark.sql.SparkSession@420355a


org.apache.spark.sql.SparkSession@420355a

In [2]:
// Vertex DataFrame; the "id" column uniquely identifies each vertex
val v = Seq(("a", "Alice", 34),
            ("b", "Bob", 36),
            ("c", "Charlie", 30)
           ).toDF("id", "name", "age")
// Edge DataFrame; must contain "src" and "dst" columns
val e = Seq(("a", "b", "friend"),
            ("b", "c", "follow"),
            ("c", "b", "follow")
           ).toDF("src", "dst", "relationship")
// Create the GraphFrame
import org.graphframes.GraphFrame
val g = GraphFrame(v,e)

v = [id: string, name: string ... 1 more field]
e = [src: string, dst: string ... 1 more field]
g = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])


GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

In [3]:
// Query: Get in-degree of each vertex.
g.inDegrees.show

+---+--------+
| id|inDegree|
+---+--------+
|  c|       1|
|  b|       2|
+---+--------+



In [4]:
// Query: Count the number of "follow" connections in the graph.
g.edges.filter("relationship='follow'").count()

2

In [5]:
// Run PageRank algorithm, and show results.
val res = g.pageRank.resetProbability(0.01).maxIter(20).run()
res.vertices.select("id","pagerank").show

+---+------------------+
| id|          pagerank|
+---+------------------+
|  a|              0.01|
|  b|1.0905890109440908|
|  c|1.8994109890559092|
+---+------------------+



res = GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])


GraphFrame(v:[id: string, name: string ... 2 more fields], e:[src: string, dst: string ... 2 more fields])
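
To make the PageRank cell above less of a black box, here is a plain-Scala sketch of one common form of the iteration, run on the same 3-vertex graph. It assumes the unnormalized update `rank(v) = reset + (1 - reset) * sum(rank(u) / outDegree(u))` over incoming edges, with every rank starting at 1.0. The names `prVertices`, `prEdges`, and `step` are made up for illustration, and this is not the GraphFrames implementation, so the values it prints need not match the output above digit for digit.

```scala
// Plain-Scala sketch of a PageRank iteration (no Spark, not the library code).
// Assumption: unnormalized update with all ranks initialized to 1.0.
val prVertices = Seq("a", "b", "c")
val prEdges = Seq("a" -> "b", "b" -> "c", "c" -> "b")
val resetProbability = 0.01
val maxIter = 20

// Number of outgoing edges per source vertex.
val outDegree: Map[String, Int] =
  prEdges.groupBy(_._1).map { case (src, es) => src -> es.size }

// One iteration: every vertex collects rank(u) / outDegree(u) from its in-neighbours.
def step(ranks: Map[String, Double]): Map[String, Double] =
  prVertices.map { v =>
    val incoming = prEdges.collect {
      case (src, dst) if dst == v => ranks(src) / outDegree(src)
    }.sum
    v -> (resetProbability + (1 - resetProbability) * incoming)
  }.toMap

val ranks = (1 to maxIter).foldLeft(prVertices.map(_ -> 1.0).toMap)((r, _) => step(r))
ranks.toSeq.sortBy(_._1).foreach { case (id, r) => println(s"$id -> $r") }
```

Vertex a has no incoming edges, so under this update its rank is exactly the reset probability, which agrees with the 0.01 shown for a in the table above.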

### $\S$ GraphFrames User Guide

#### 2.1 Creating GraphFrames
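* Vertex DataFrame: must contain an `id` column that uniquely identifies each vertex.
* Edge DataFrame: must contain `src` (source vertex ID of the edge) and `dst` (destination vertex ID of the edge) columns.
* Both DataFrames can contain arbitrary additional columns.  
  A GraphFrame can also be built from the edge DataFrame alone; the vertex DataFrame is then inferred from the `src` and `dst` columns.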
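
As a rough illustration of what "inferred from `src` and `dst`" means, the plain-Scala sketch below collects the distinct endpoint IDs of an edge list. `Edge` and `inferVertices` are hypothetical names used only here; in the Scala API the real entry point is, to my knowledge, `GraphFrame.fromEdges(e)`, and its internals may differ from this sketch.

```scala
// Plain-Scala sketch (no Spark): derive the vertex IDs from an edge list.
// Edge and inferVertices are hypothetical names for illustration only.
final case class Edge(src: String, dst: String, relationship: String)

// Every ID that appears as a source or destination becomes a vertex.
def inferVertices(edges: Seq[Edge]): Seq[String] =
  edges.flatMap(e => Seq(e.src, e.dst)).distinct.sorted

val sampleEdges = Seq(
  Edge("a", "b", "friend"),
  Edge("b", "c", "follow"),
  Edge("c", "b", "follow")
)

val inferredIds = inferVertices(sampleEdges)
println(inferredIds.mkString(", ")) // a, b, c
```

Vertices inferred this way carry only an ID; attributes such as `name` and `age` are available only when a vertex DataFrame is supplied explicitly.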

In [6]:
import org.graphframes.GraphFrame
// Vertex DataFrame
val v = Seq(
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60)
 ).toDF("id", "name", "age")
// Edge DataFrame
val e = Seq(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
 ).toDF("src", "dst", "relationship")
// Create a GraphFrame
val g = GraphFrame(v, e)

v = [id: string, name: string ... 1 more field]
e = [src: string, dst: string ... 1 more field]
g = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])


GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

* The graph above ships with the GraphFrames package as an example:

In [7]:
import org.graphframes.{examples,GraphFrame}
val g: GraphFrame = examples.Graphs.friends

g = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])


GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])

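#### 2.2 Basic queries
* A GraphFrame consists of a vertex DataFrame and an edge DataFrame, available through its `vertices` and `edges` fields.
* `g.inDegrees` also returns a DataFrame.
* `vertices` and `edges` are queried with the regular DataFrame API.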

In [8]:
import org.graphframes.{examples,GraphFrame}
val g: GraphFrame = examples.Graphs.friends  // get example graph

// Display the vertex and edge DataFrames
g.vertices.show()
// +--+-------+---+
// |id|   name|age|
// +--+-------+---+
// | a|  Alice| 34|
// | b|    Bob| 36|
// | c|Charlie| 30|
// | d|  David| 29|
// | e| Esther| 32|
// | f|  Fanny| 36|
// | g|  Gabby| 60|
// +--+-------+---+

g.edges.show()
// +---+---+------------+
// |src|dst|relationship|
// +---+---+------------+
// |  a|  b|      friend|
// |  b|  c|      follow|
// |  c|  b|      follow|
// |  f|  c|      follow|
// |  e|  f|      follow|
// |  e|  d|      friend|
// |  d|  a|      friend|
// |  a|  e|      friend|
// +---+---+------------+

// import Spark SQL package
import org.apache.spark.sql.DataFrame

// Get a DataFrame with columns "id" and "inDegree" (in-degree)
val vertexInDegrees: DataFrame = g.inDegrees
vertexInDegrees.show

// Find the youngest user's age in the graph.
// This queries the vertex DataFrame.
g.vertices.groupBy().min("age").show()

// Count the number of "follows" in the graph.
// This queries the edge DataFrame.
val numFollows = g.edges.filter("relationship = 'follow'").count()

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
|  d|  David| 29|
|  e| Esther| 32|
|  f|  Fanny| 36|
|  g|  Gabby| 60|
+---+-------+---+

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
|  f|  c|      follow|
|  e|  f|      follow|
|  e|  d|      friend|
|  d|  a|      friend|
|  a|  e|      friend|
+---+---+------------+

+---+--------+
| id|inDegree|
+---+--------+
|  f|       1|
|  e|       1|
|  d|       1|
|  c|       2|
|  b|       2|
|  a|       1|
+---+--------+

+--------+
|min(age)|
+--------+
|      29|
+--------+



g = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
vertexInDegrees = [id: string, inDegree: int]
numFollows = 4


4

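#### 2.3 Motif finding
1. Motif finding in GraphFrames is expressed in a DSL (Domain-Specific Language). For example, `graph.find("(a)-[e]->(b); (b)-[e2]->(a)")` searches for pairs of vertices a, b that are connected by edges in both directions. The result is returned as a DataFrame with columns named a, b, e, and e2.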


In [9]:
import org.graphframes.{examples,GraphFrame}
val g: GraphFrame = examples.Graphs.friends  // get example graph
g.vertices.show()
g.edges.show()

val motifs = g.find("(a)-[e]->(b); (b)-[e2]->(a)")
motifs.show()

// More complex queries can be expressed by applying filters.
motifs.filter("b.age>35").show

+---+-------+---+
| id|   name|age|
+---+-------+---+
|  a|  Alice| 34|
|  b|    Bob| 36|
|  c|Charlie| 30|
|  d|  David| 29|
|  e| Esther| 32|
|  f|  Fanny| 36|
|  g|  Gabby| 60|
+---+-------+---+

+---+---+------------+
|src|dst|relationship|
+---+---+------------+
|  a|  b|      friend|
|  b|  c|      follow|
|  c|  b|      follow|
|  f|  c|      follow|
|  e|  f|      follow|
|  e|  d|      friend|
|  d|  a|      friend|
|  a|  e|      friend|
+---+---+------------+

+----------------+--------------+----------------+--------------+
|               a|             e|               b|            e2|
+----------------+--------------+----------------+--------------+
|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|
|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|
+----------------+--------------+----------------+--------------+

+----------------+--------------+------------+--------------+
|               a|             e|           b|            e2|
+--

g = GraphFrame(v:[id: string, name: string ... 1 more field], e:[src: string, dst: string ... 1 more field])
motifs = [a: struct<id: string, name: string ... 1 more field>, e: struct<src: string, dst: string ... 1 more field> ... 2 more fields]


[a: struct<id: string, name: string ... 1 more field>, e: struct<src: string, dst: string ... 1 more field> ... 2 more fields]
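
To see what the bidirectional motif matches without involving Spark, the following plain-Scala sketch pairs up edges by hand. `FriendEdge`, `friendEdges`, and `mutualPairs` are hypothetical names; the data is copied from the edge table printed above, and the nested loop is only an illustration, not how GraphFrames evaluates motifs.

```scala
// Plain-Scala sketch (no Spark) of the motif "(a)-[e]->(b); (b)-[e2]->(a)":
// ordered vertex pairs (a, b) joined by one edge in each direction.
final case class FriendEdge(src: String, dst: String, relationship: String)

// Edge list copied from the example "friends" graph.
val friendEdges = Seq(
  FriendEdge("a", "b", "friend"),
  FriendEdge("b", "c", "follow"),
  FriendEdge("c", "b", "follow"),
  FriendEdge("f", "c", "follow"),
  FriendEdge("e", "f", "follow"),
  FriendEdge("e", "d", "friend"),
  FriendEdge("d", "a", "friend"),
  FriendEdge("a", "e", "friend")
)

// Keep (e, e2) combinations where e2 runs back along e.
val mutualPairs = for {
  e  <- friendEdges
  e2 <- friendEdges
  if e.dst == e2.src && e2.dst == e.src
} yield (e.src, e.dst)

println(mutualPairs) // List((b,c), (c,b))
```

Each mutual pair shows up twice, once per orientation, which is the same behaviour as the two rows in the `motifs.show()` output above.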

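2. Most of the graph operations in the examples so far are stateless, like the simple motif above. Now suppose we want to find chains of 4 vertices and then keep only those chains that satisfy a condition defined by state carried along the path:
    * Initialize the state on the path
    * Update the state at vertex a
    * Update the state at vertex b
    * Do the same for c and d
    * If the final state matches some condition, the chain passes the filter

The example below finds chains of 4 vertices in which at least 2 of the 3 edges have the relationship "friend".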

In [10]:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, when,lit}

// Find chains of 4 vertices.
val chain4 = g.find("(a)-[ab]->(b);(b)-[bc]->(c);(c)-[cd]->(d)")

// Query on sequence, with state (cnt)
//  (a) Define method for updating state given the next element of the motif.
def sumFriends(cnt:Column,relationshipCol:Column):Column = {
    when(relationshipCol==="friend",cnt+1).otherwise(cnt)
}

//  (b) Use sequence operation to apply method to sequence of elements in motif.
//      In this case, the elements are the 3 edges.

val condition = { Seq("ab", "bc", "cd")
  //  lit(java.lang.Object literal): Creates a Column of literal value.
  .foldLeft(lit(0))((cnt, edgeName) => sumFriends(cnt, col(edgeName)("relationship"))) }

//  (c) Apply filter to DataFrame.
val chainWith2Friends2 = chain4.where(condition >= 2)


chain4 = [a: struct<id: string, name: string ... 1 more field>, ab: struct<src: string, dst: string ... 1 more field> ... 5 more fields]
condition = CASE WHEN (cd[relationship] = friend) THEN (CASE WHEN (bc[relationship] = friend) THEN (CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END + 1) ELSE CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END END + 1) ELSE CASE WHEN (bc[relationship] = friend) THEN (CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END + 1) ELSE CASE WHEN (ab[relationshi...


sumFriends: (cnt: org.apache.spark.sql.Column, relationshipCol: org.apache.spark.sql.Column)org.apache.spark.sql.Column


CASE WHEN (cd[relationship] = friend) THEN (CASE WHEN (bc[relationship] = friend) THEN (CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END + 1) ELSE CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END END + 1) ELSE CASE WHEN (bc[relationship] = friend) THEN (CASE WHEN (ab[relationship] = friend) THEN (0 + 1) ELSE 0 END + 1) ELSE CASE WHEN (ab[relationshi...
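
The nested `CASE WHEN` expression printed above is simply the result of folding `sumFriends` over the three edge names. A plain-Scala sketch of the same fold on ordinary strings may make this easier to follow; `countFriends`, `chainEDAB`, and `chainABCB` are hypothetical names, and the relationship labels are read off the example graph's edge table.

```scala
// Plain-Scala sketch of the stateful fold: count "friend" edges along one chain.
// The real query builds a Spark Column expression; this version works on plain strings.
def countFriends(relationships: Seq[String]): Int =
  relationships.foldLeft(0)((cnt, rel) => if (rel == "friend") cnt + 1 else cnt)

// Relationship labels of the edges ab, bc, cd for two chains in the example graph.
val chainEDAB = Seq("friend", "friend", "friend") // e -> d -> a -> b
val chainABCB = Seq("friend", "follow", "follow") // a -> b -> c -> b

println(countFriends(chainEDAB) >= 2) // true: chain passes the filter
println(countFriends(chainABCB) >= 2) // false: chain is dropped
```

The counter plays the role of the state: it starts at 0, is updated once per edge, and the final value is compared against the threshold.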

In [11]:
chain4.show
chainWith2Friends2.show()

+----------------+--------------+----------------+--------------+----------------+--------------+----------------+
|               a|            ab|               b|            bc|               c|            cd|               d|
+----------------+--------------+----------------+--------------+----------------+--------------+----------------+
|  [a, Alice, 34]|[a, b, friend]|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|
|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|
|  [f, Fanny, 36]|[f, c, follow]|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|[b, c, follow]|[c, Charlie, 30]|
| [e, Esther, 32]|[e, f, follow]|  [f, Fanny, 36]|[f, c, follow]|[c, Charlie, 30]|[c, b, follow]|    [b, Bob, 36]|
| [e, Esther, 32]|[e, d, friend]|  [d, David, 29]|[d, a, friend]|  [a, Alice, 34