<img src="img/logocs.jpeg" width="200" align="left">
<img src="img/logops.jpg" width="200" align="right">

# <center>Introduction to Resilient Distributed Property Graph</center>

<img src="http://spark.apache.org/docs/latest/img/graphx_logo.png" width=300/>
#### Family Name: 
#### First Name: 


## Exploring GraphX
### Apache Spark's API for graph-parallel processing

The purpose of this lab is to use the GraphX library to build a simple directed multigraph in Scala and to explore some methods of the Graph class.

First, we need to import the following libraries:

- org.apache.spark._ 
- org.apache.spark.graphx._
- org.apache.spark.rdd.RDD 

In [1]:
import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

Intitializing Scala interpreter ...

Spark Web UI available at http://mbp-de-nacera:4040
SparkContext available as 'sc' (version = 3.0.1, master = local[*], app id = local-1615882137275)
SparkSession available as 'spark'


import org.apache.spark._
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD


Next, we create the vertices and edges of our graph as <code>facebook_vertices</code> and <code>facebook_edges</code>, using <code>Array</code> variables.

In [2]:
val facebook_vertices = Array((1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")), (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page")))
val facebook_edges = Array(Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"), Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))


facebook_vertices: Array[(Long, (String, String))] = Array((1,(Billy Bill,Person)), (2,(Jacob Johnson,Person)), (3,(Andrew Smith,Person)), (4,(Iron Man Fan Page,Page)), (5,(Captain America Fan Page,Page)))
facebook_edges: Array[org.apache.spark.graphx.Edge[String]] = Array(Edge(1,2,Friends), Edge(1,3,Friends), Edge(2,4,Follower), Edge(2,5,Follower), Edge(3,5,Follower))


### A summary list of Graph class operators

In [None]:
class Graph[VD, ED] {
  // Information about the Graph ===================================================================
  val numEdges: Long
  val numVertices: Long
  val inDegrees: VertexRDD[Int]
  val outDegrees: VertexRDD[Int]
  val degrees: VertexRDD[Int]
  // Views of the graph as collections =============================================================
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  val triplets: RDD[EdgeTriplet[VD, ED]]
  // Functions for caching graphs ==================================================================
  def persist(newLevel: StorageLevel = StorageLevel.MEMORY_ONLY): Graph[VD, ED]
  def cache(): Graph[VD, ED]
  def unpersistVertices(blocking: Boolean = false): Graph[VD, ED]
  // Change the partitioning heuristic  ============================================================
  def partitionBy(partitionStrategy: PartitionStrategy): Graph[VD, ED]
  // Transform vertex and edge attributes ==========================================================
  def mapVertices[VD2](map: (VertexId, VD) => VD2): Graph[VD2, ED]
  def mapEdges[ED2](map: Edge[ED] => ED2): Graph[VD, ED2]
  def mapEdges[ED2](map: (PartitionID, Iterator[Edge[ED]]) => Iterator[ED2]): Graph[VD, ED2]
  def mapTriplets[ED2](map: EdgeTriplet[VD, ED] => ED2): Graph[VD, ED2]
  def mapTriplets[ED2](map: (PartitionID, Iterator[EdgeTriplet[VD, ED]]) => Iterator[ED2])
    : Graph[VD, ED2]
  // Modify the graph structure ====================================================================
  def reverse: Graph[VD, ED]
  def subgraph(
      epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
      vpred: (VertexId, VD) => Boolean = ((v, d) => true))
    : Graph[VD, ED]
  def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
  def groupEdges(merge: (ED, ED) => ED): Graph[VD, ED]
  // Join RDDs with the graph ======================================================================
  def joinVertices[U](table: RDD[(VertexId, U)])(mapFunc: (VertexId, VD, U) => VD): Graph[VD, ED]
  def outerJoinVertices[U, VD2](other: RDD[(VertexId, U)])
      (mapFunc: (VertexId, VD, Option[U]) => VD2)
    : Graph[VD2, ED]
  // Aggregate information about adjacent triplets =================================================
  def collectNeighborIds(edgeDirection: EdgeDirection): VertexRDD[Array[VertexId]]
  def collectNeighbors(edgeDirection: EdgeDirection): VertexRDD[Array[(VertexId, VD)]]
  def aggregateMessages[Msg: ClassTag](
      sendMsg: EdgeContext[VD, ED, Msg] => Unit,
      mergeMsg: (Msg, Msg) => Msg,
      tripletFields: TripletFields = TripletFields.All)
    : VertexRDD[Msg]
  // Iterative graph-parallel computation ==========================================================
  def pregel[A](initialMsg: A, maxIterations: Int, activeDirection: EdgeDirection)(
      vprog: (VertexId, VD, A) => VD,
      sendMsg: EdgeTriplet[VD, ED] => Iterator[(VertexId, A)],
      mergeMsg: (A, A) => A)
    : Graph[VD, ED]
  // Basic graph algorithms ========================================================================
  def pageRank(tol: Double, resetProb: Double = 0.15): Graph[Double, Double]
  def connectedComponents(): Graph[VertexId, ED]
  def triangleCount(): Graph[Int, ED]
  def stronglyConnectedComponents(numIter: Int): Graph[VertexId, ED]
}

### Question 1:
Now we need to create the Graph object. Create the vertex RDD <code>facebook_RDD_vertices</code> and the edge RDD <code>facebook_RDD_edges</code> so that the Graph object can be built. Define a <code>default_user</code> vertex attribute, which is assigned by default to any vertex that is referenced by an edge but missing from the vertex RDD.

As a reminder, we have a SparkContext called <code>sc</code>. What happens when <code>sc</code> is used?

In [3]:
//To DO

//val myFacebookGraph = Graph(facebook_RDD_vertices, facebook_RDD_edges,default_user)
val facebook_RDD_vertices = sc.parallelize(facebook_vertices)
val facebook_RDD_edges = sc.parallelize(facebook_edges)
val default_user = ("Missing", "Person")
val myFacebookGraph = Graph(facebook_RDD_vertices, facebook_RDD_edges,default_user)

facebook_RDD_vertices: org.apache.spark.rdd.RDD[(Long, (String, String))] = ParallelCollectionRDD[0] at parallelize at <console>:38
facebook_RDD_edges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[1] at parallelize at <console>:39
default_user: (String, String) = (Missing,Person)
myFacebookGraph: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@495078e1


Here's a visual representation to show what the graph should look like:

<img src = "img/rhkiopM.png">
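To make the role of the default vertex attribute concrete, here is a minimal sketch. The names `sparkCtx`, `demoVertices`, `demoEdges`, `demoDefault` and `demoGraph`, as well as vertex id 6, are hypothetical and only serve the illustration. It assumes Spark with GraphX is on the classpath; in the notebook, `SparkContext.getOrCreate` simply returns the existing `sc`.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuses the notebook's SparkContext if one exists, otherwise starts a local one.
val sparkCtx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("default-vertex-sketch"))

// Only vertices 1 and 2 are declared explicitly.
val demoVertices = sparkCtx.parallelize(Seq(
  (1L, ("Billy Bill", "Person")),
  (2L, ("Jacob Johnson", "Person"))))

// The second edge points to vertex 6, which is NOT declared above (hypothetical id).
val demoEdges = sparkCtx.parallelize(Seq(
  Edge(1L, 2L, "Friends"),
  Edge(2L, 6L, "Follower")))

val demoDefault = ("Missing", "Person")
val demoGraph = Graph(demoVertices, demoEdges, demoDefault)

// Vertex 6 is created automatically and carries the default attribute.
val demoAttrs = demoGraph.vertices.collect().toMap
println(demoAttrs(6L))
```

In this sketch the graph ends up with three vertices even though only two were declared, because the edge to vertex 6 forces the vertex to exist.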

### Question 2:
Now, get information about the graph and its vertices, and look at the different views of the graph using the vertices, edges and triplets methods. Compute the maximum and the minimum out-degrees and in-degrees.

In [5]:
//To Do
//vertices
myFacebookGraph.vertices.foreach(print)
//edges
print('\n')
myFacebookGraph.edges.foreach(print)
print('\n')
//triplets
myFacebookGraph.triplets.foreach(print)

// Define a reduce operation to compute the highest and lowest degree vertex
def min(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 < b._2) a else b
}

def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}

//minimum in degrees
val minInDegree: (VertexId, Int) = myFacebookGraph.inDegrees.reduce(min)
print(minInDegree)


//maximum in degrees
val maxInDegree: (VertexId, Int)  = myFacebookGraph.inDegrees.reduce(max)
print(maxInDegree)

(3,(Andrew Smith,Person))(2,(Jacob Johnson,Person))(1,(Billy Bill,Person))(5,(Captain America Fan Page,Page))(4,(Iron Man Fan Page,Page))
Edge(1,2,Friends)Edge(1,3,Friends)Edge(2,4,Follower)Edge(2,5,Follower)Edge(3,5,Follower)
((1,(Billy Bill,Person)),(2,(Jacob Johnson,Person)),Friends)((2,(Jacob Johnson,Person)),(5,(Captain America Fan Page,Page)),Follower)((3,(Andrew Smith,Person)),(5,(Captain America Fan Page,Page)),Follower)((1,(Billy Bill,Person)),(3,(Andrew Smith,Person)),Friends)((2,(Jacob Johnson,Person)),(4,(Iron Man Fan Page,Page)),Follower)(3,1)(5,2)

min: (a: (org.apache.spark.graphx.VertexId, Int), b: (org.apache.spark.graphx.VertexId, Int))(org.apache.spark.graphx.VertexId, Int)
max: (a: (org.apache.spark.graphx.VertexId, Int), b: (org.apache.spark.graphx.VertexId, Int))(org.apache.spark.graphx.VertexId, Int)
minInDegree: (org.apache.spark.graphx.VertexId, Int) = (3,1)
maxInDegree: (org.apache.spark.graphx.VertexId, Int) = (5,2)
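Note that `inDegrees` and `outDegrees` only list vertices that have at least one incoming (respectively outgoing) edge, so a minimum taken directly over them never reports a degree of 0. The sketch below is one possible way to include such vertices and to cover out-degrees as well; it rebuilds the same small graph so that it can run on its own, and the names `sparkCtx`, `fbGraph`, `inDeg`, `outDeg`, `minByDegree` and `maxByDegree` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph, VertexId}

// Reuses the notebook's SparkContext if one exists, otherwise starts a local one.
val sparkCtx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("degree-sketch"))

val fbGraph = Graph(
  sparkCtx.parallelize(Seq(
    (1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")),
    (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")),
    (5L, ("Captain America Fan Page", "Page")))),
  sparkCtx.parallelize(Seq(
    Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"),
    Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))),
  ("Missing", "Person"))

// Left-join the degree views onto ALL vertices; a missing entry means degree 0.
val inDeg = fbGraph.vertices.leftJoin(fbGraph.inDegrees)((_, _, d) => d.getOrElse(0))
val outDeg = fbGraph.vertices.leftJoin(fbGraph.outDegrees)((_, _, d) => d.getOrElse(0))

def minByDegree(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
  if (a._2 <= b._2) a else b

def maxByDegree(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) =
  if (a._2 >= b._2) a else b

val minIn = inDeg.reduce(minByDegree)
val maxIn = inDeg.reduce(maxByDegree)
val minOut = outDeg.reduce(minByDegree)
val maxOut = outDeg.reduce(maxByDegree)

println(s"in-degree:  min = $minIn, max = $maxIn")
println(s"out-degree: min = $minOut, max = $maxOut")
```

With this approach, vertex 1 (Billy Bill), which has no incoming edge, is reported with in-degree 0. When several vertices share the same extreme degree, which of them `reduce` returns is not guaranteed.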


### Question 3:

Use the filter function to find the persons who follow the "Captain America Fan Page".

In [14]:
//To Do
//get ids for captain america nodes
val captainIds = myFacebookGraph.vertices.filter{case (id, (name, pos)) => name == "Captain America Fan Page"}.
                    map{case (id, (_, _)) => id}.collect()

//select the edges whose destination is the Captain America page
val captainEdges = myFacebookGraph.edges.filter { case Edge(src, dst, prop) => captainIds.contains(dst)}

//collect the source ids of those edges, then print the matching vertices
val captainFollowers = captainEdges.map{case Edge(src, dst, prop) => src}.collect()
val captainVertices = myFacebookGraph.vertices.filter{case (id, (_, _)) => captainFollowers.contains(id)}
captainVertices.foreach(println)  

//Another way using join
val idForFanPage = myFacebookGraph.vertices.filter(x => x._2._1 == "Captain America Fan Page").map(x => x._1).collect()(0)
val personIds = myFacebookGraph.edges.filter(x => x.attr == "Follower" && x.dstId == idForFanPage).map(x => (x.srcId, x))
val followingUsers = personIds.join(myFacebookGraph.vertices).map(x => x._2._2._1).collect().mkString(",")
println(" \n Another way : Who follow Captain America Fan Page")
println(followingUsers)

(3,(Andrew Smith,Person))
(2,(Jacob Johnson,Person))
 
 Another way : Who follow Captain America Fan Page
Jacob Johnson,Andrew Smith


captainIds: Array[org.apache.spark.graphx.VertexId] = Array(5)
captainEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = MapPartitionsRDD[129] at filter at <console>:47
captainFollowers: Array[org.apache.spark.graphx.VertexId] = Array(2, 3)
captainVertices: org.apache.spark.graphx.VertexRDD[(String, String)] = VertexRDDImpl[132] at RDD at VertexRDD.scala:57
idForFanPage: org.apache.spark.graphx.VertexId = 5
personIds: org.apache.spark.rdd.RDD[(org.apache.spark.graphx.VertexId, org.apache.spark.graphx.Edge[String])] = MapPartitionsRDD[137] at map at <console>:56
followingUsers: String = Jacob Johnson,Andrew Smith


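However, there is an easier way: filter directly on the triplets view, where each triplet already carries the edge attribute together with the source and destination vertex attributes.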

In [16]:
//Using triplets
val followingUsers = myFacebookGraph.triplets.
  filter(x => x.dstAttr._1 == "Captain America Fan Page").
  filter(x => x.attr == "Follower" && x.srcAttr._2 == "Person").
  map(x => x.srcAttr._1).
  collect().mkString(", ")
println(followingUsers + " follow Captain America Fan Page")

Jacob Johnson, Andrew Smith follow Captain America Fan Page


followingUsers: String = Jacob Johnson, Andrew Smith
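The same question can also be answered for every page at once with `aggregateMessages`, which appears in the operator summary above. The following is a sketch rather than the expected lab answer: it rebuilds the same graph to stay runnable on its own, and the names `sparkCtx`, `socialGraph`, `followersByPage` and `followerMap` are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuses the notebook's SparkContext if one exists, otherwise starts a local one.
val sparkCtx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("aggregate-messages-sketch"))

val socialGraph = Graph(
  sparkCtx.parallelize(Seq(
    (1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")),
    (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")),
    (5L, ("Captain America Fan Page", "Page")))),
  sparkCtx.parallelize(Seq(
    Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"),
    Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))),
  ("Missing", "Person"))

// Each "Follower" edge sends the follower's name to its destination vertex;
// names arriving at the same vertex are concatenated into one list.
val followersByPage = socialGraph.aggregateMessages[List[String]](
  edgeCtx => if (edgeCtx.attr == "Follower") edgeCtx.sendToDst(List(edgeCtx.srcAttr._1)),
  (left, right) => left ++ right)

// Only vertices that received at least one message appear in the result.
val followerMap = followersByPage.collect().toMap
followerMap.foreach { case (pageId, names) => println(s"$pageId <- ${names.mkString(", ")}") }
```

The order of names inside each list depends on how the messages are merged, so it should not be relied upon.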


### Question 4:
Transform vertex and edge attributes using the mapVertices, mapEdges or mapTriplets methods. For instance, convert the edge attributes to friendof and followerof, and add user or page popularity to the graph (popularity could be defined as the number of friends or followers).

In [17]:

//convert the vertex labels to lower case
val newGraph = myFacebookGraph.mapVertices((id, attr) => (attr._1.toLowerCase(), attr._2))

// newGraph.vertices.foreach(println)

//convert edge attributes to friendof, followerof, etc. 
val newGraph2 = myFacebookGraph.mapEdges(e => e.attr + " of")

// newGraph2.edges.foreach(println)

//do the same operation using triplets
val newGraph3 = myFacebookGraph.mapTriplets(triplet => triplet.attr + " of")
newGraph3.edges.foreach(println)

print('\n')

myFacebookGraph.edges.foreach(println)

Edge(2,4,Follower of)
Edge(2,5,Follower of)
Edge(3,5,Follower of)
Edge(1,2,Friends of)
Edge(1,3,Friends of)

Edge(1,3,Friends)
Edge(1,2,Friends)
Edge(2,5,Follower)
Edge(3,5,Follower)
Edge(2,4,Follower)


newGraph: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@10a134d2
newGraph2: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@21fd664c
newGraph3: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@15c73f0a
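The popularity part of the question can be sketched as follows. This is only one possible reading: popularity is assumed here to be the in-degree of a vertex (the number of edges pointing at it), and the names `sparkCtx`, `baseGraph`, `renamedGraph`, `popularityGraph` and `popularity` are hypothetical. The graph is rebuilt so that the block runs on its own.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuses the notebook's SparkContext if one exists, otherwise starts a local one.
val sparkCtx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("popularity-sketch"))

val baseGraph = Graph(
  sparkCtx.parallelize(Seq(
    (1L, ("Billy Bill", "Person")), (2L, ("Jacob Johnson", "Person")),
    (3L, ("Andrew Smith", "Person")), (4L, ("Iron Man Fan Page", "Page")),
    (5L, ("Captain America Fan Page", "Page")))),
  sparkCtx.parallelize(Seq(
    Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"),
    Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))),
  ("Missing", "Person"))

// Rename the edge labels; unknown labels are left unchanged.
val renamedGraph = baseGraph.mapEdges(e => e.attr match {
  case "Friends"  => "friendof"
  case "Follower" => "followerof"
  case other      => other
})

// Assumption: popularity = in-degree. Vertices without incoming edges get 0.
val popularityGraph = renamedGraph.outerJoinVertices(renamedGraph.inDegrees)(
  (_, attr, deg) => (attr._1, attr._2, deg.getOrElse(0)))

val popularity = popularityGraph.vertices.collect().toMap
popularity.toSeq.sortBy(_._1).foreach(println)
```

`outerJoinVertices` is used instead of `joinVertices` because the vertex attribute type changes from a pair to a triple, and because vertices missing from `inDegrees` still need a value.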


### Question 5:
Modify the graph structure using join methods. Create another graph to be merged with the above graph.

In [19]:

//define another graph
val facebook_rdd_vertices2 = sc.parallelize(Array((1L, ("Andrea", "Person")), (2L, ("Tamara", "Person")), (3L, ("Ledia", "Person")), (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page")), (10L, ("Thor Fan page", "Page"))))
val facebook_rdd_edges2 = sc.parallelize(Array(Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"), Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"), Edge(3L, 10L, "Follower"), Edge(4L, 10L, "Follower")))
val myFacebookGraph2 = Graph(facebook_rdd_vertices2, facebook_rdd_edges2,("Akash", "Person"))

//we could change names based on myFacebookGraph2
val joinedGraph = myFacebookGraph.joinVertices(myFacebookGraph2.vertices)(
  (id, property, new_property) => new_property)

print("Changed Graph")
joinedGraph.vertices.foreach(println)

                                            
//merge myFacebookGraph2 and joinedGraph
val mergedGraph = Graph(
    joinedGraph.vertices.union(myFacebookGraph2.vertices),
    joinedGraph.edges.union(myFacebookGraph2.edges)
)

print("Merged Graph")
mergedGraph.vertices.foreach(println)                                           

Changed Graph(4,(Iron Man Fan Page,Page))
(3,(Ledia,Person))
(1,(Andrea,Person))
(5,(Captain America Fan Page,Page))
(2,(Tamara,Person))
Merged Graph(1,(Andrea,Person))
(5,(Captain America Fan Page,Page))
(3,(Ledia,Person))
(10,(Thor Fan page,Page))
(2,(Tamara,Person))
(4,(Iron Man Fan Page,Page))


facebook_rdd_vertices2: org.apache.spark.rdd.RDD[(Long, (String, String))] = ParallelCollectionRDD[195] at parallelize at <console>:41
facebook_rdd_edges2: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[String]] = ParallelCollectionRDD[196] at parallelize at <console>:42
myFacebookGraph2: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@61d4e50b
joinedGraph: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@182499b3
mergedGraph: org.apache.spark.graphx.Graph[(String, String),String] = org.apache.spark.graphx.impl.GraphImpl@2de8014b
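One point worth checking after such a merge: GraphX graphs are multigraphs, so a plain `union` of the two edge RDDs keeps both copies of every edge the graphs have in common, while duplicate vertex ids are collapsed when the graph is built. The sketch below compares a naive merge with a deduplicated one. It rebuilds two small graphs equivalent to `joinedGraph` and `myFacebookGraph2` so that it runs on its own; all names (`sparkCtx`, `sharedVertices`, `sharedEdges`, `firstGraph`, `secondGraph`, `naiveMerge`, `dedupedMerge`, `fallback`) are hypothetical.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}

// Reuses the notebook's SparkContext if one exists, otherwise starts a local one.
val sparkCtx = SparkContext.getOrCreate(
  new SparkConf().setMaster("local[*]").setAppName("merge-sketch"))

val sharedVertices = Seq(
  (1L, ("Andrea", "Person")), (2L, ("Tamara", "Person")), (3L, ("Ledia", "Person")),
  (4L, ("Iron Man Fan Page", "Page")), (5L, ("Captain America Fan Page", "Page")))
val sharedEdges = Seq(
  Edge(1L, 2L, "Friends"), Edge(1L, 3L, "Friends"), Edge(2L, 4L, "Follower"),
  Edge(2L, 5L, "Follower"), Edge(3L, 5L, "Follower"))
val fallback = ("Unknown", "Person")

// First graph: 5 vertices, 5 edges. Second graph: one extra page and two extra edges.
val firstGraph = Graph(
  sparkCtx.parallelize(sharedVertices), sparkCtx.parallelize(sharedEdges), fallback)
val secondGraph = Graph(
  sparkCtx.parallelize(sharedVertices :+ ((10L, ("Thor Fan page", "Page")))),
  sparkCtx.parallelize(sharedEdges ++ Seq(Edge(3L, 10L, "Follower"), Edge(4L, 10L, "Follower"))),
  fallback)

// Naive merge: shared edges appear twice.
val naiveMerge = Graph(
  firstGraph.vertices.union(secondGraph.vertices),
  firstGraph.edges.union(secondGraph.edges))

// Deduplicated merge: identical (id, attribute) pairs and identical edges are kept once.
val dedupedMerge = Graph(
  firstGraph.vertices.union(secondGraph.vertices).distinct(),
  firstGraph.edges.union(secondGraph.edges).distinct())

println(s"naive:   ${naiveMerge.numVertices} vertices, ${naiveMerge.numEdges} edges")
println(s"deduped: ${dedupedMerge.numVertices} vertices, ${dedupedMerge.numEdges} edges")
```

Here `distinct()` is enough because the shared vertices carry identical attributes in both graphs. If the attributes differed, an explicit rule for choosing between them would be needed, since the graph constructor picks one of the duplicates arbitrarily.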
