<a href="https://colab.research.google.com/github/jalorenzo/SparkNotebookColab/blob/master/BDF_11_Graph_processing.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 00 - Configuration of Apache Spark on Colaboratory


### Installing Java, Spark, and Findspark


---


This code installs Apache Spark 3.2.3, Java 8, and [Findspark](https://github.com/minrk/findspark), a library that makes it easy for Python to find Spark.

In [None]:
import os

os.environ["SPARK_VERSION"] = "spark-3.2.3"
!apt-get update
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget  https://archive.apache.org/dist/spark/$SPARK_VERSION/$SPARK_VERSION-bin-hadoop2.7.tgz
!tar xf $SPARK_VERSION-bin-hadoop2.7.tgz
!echo $SPARK_VERSION-bin-hadoop2.7.tgz
!pip install -q findspark

In [None]:
!wget https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar

### Set Environment Variables
Set the locations where Spark and Java are installed.

In [None]:
import os
os.environ["JAVA_HOME"] = "/usr/lib/jvm/java-8-openjdk-amd64"
os.environ["SPARK_HOME"] = "/content/spark/"
os.environ["DRIVE_DATA"] = "/content/gdrive/My Drive/Enseignement/2023-2024/ING3/HPDA/BigDataFrameworks/data/"

# -f: do not fail on the first run, when the symlink does not exist yet
!rm -f /content/spark
!ln -s /content/$SPARK_VERSION-bin-hadoop2.7 /content/spark

!mv graphframes-0.8.2-spark3.2-s_2.12.jar /content/spark/jars/

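# Each `!` command runs in its own subshell, so `!export` would not persist.
# SPARK_HOME is already set through os.environ above; extend PATH the same way.
os.environ["PATH"] += ":/content/spark/bin:/content/spark/sbin"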

!echo $SPARK_HOME
!env |grep  "DRIVE_DATA"

!ls -l /content/
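
Before starting Spark, it can help to confirm that the variables set above really point to existing directories. The following is a minimal sketch; the helper name `missing_dirs` is hypothetical and not part of Spark or Findspark.

```python
import os

def missing_dirs(var_names, environ=os.environ):
    """Return the variables that are unset or do not point to an existing directory.

    Hypothetical helper, for illustration only.
    """
    missing = []
    for name in var_names:
        path = environ.get(name)
        if not path or not os.path.isdir(path):
            missing.append(name)
    return missing

# An empty list means both locations were found
print(missing_dirs(["JAVA_HOME", "SPARK_HOME"]))
```

If the list is not empty, re-run the installation cells before calling `findspark.init()`.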

### Start a SparkSession
This will start a local Spark session.

In [None]:
!python -V

import findspark
findspark.init()

from pyspark import SparkContext
sc = SparkContext.getOrCreate()

sc.addPyFile('/content/spark/jars/graphframes-0.8.2-spark3.2-s_2.12.jar')

# Example: shows the PySpark version
print("PySpark version {0}".format(sc.version))

# Example: parallelise an array and show the 2 first elements
sc.parallelize([2, 3, 4, 5, 6]).cache().take(2)

In [None]:
from pyspark.sql import SparkSession
# We create a SparkSession object (or we retrieve it if it is already created)
spark = SparkSession \
.builder \
.appName("My application") \
.config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12") \
.master("local[4]") \
.getOrCreate()
# We get the SparkContext
sc = spark.sparkContext

In [None]:
# Mount Google Drive
from google.colab import drive
drive.mount('/content/gdrive')


---


# 11 - Graph processing

## GraphX: Graph processing with RDDs

Parallel graph programming using Spark

- Main abstraction: [*Graph*](http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.Graph)
    -   Directed multigraph with properties attached to vertices and edges
    -   It extends the RDD abstraction
- It includes graph constructors, basic operators (*reverse*, *subgraph*…) and graph algorithms (*PageRank*, *Triangle Counting*…)
- Only available in Scala.

Documentation: [spark.apache.org/docs/latest/graphx-programming-guide.html](http://spark.apache.org/docs/latest/graphx-programming-guide.html)

API: https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.package
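
To make the data model more concrete, here is a minimal plain-Python sketch of a directed multigraph with properties on vertices and edges. It is not the GraphX API (which is Scala-only): the toy data and the helper functions `reverse` and `subgraph` are illustrative assumptions that only mimic the idea of the operators named above.

```python
from collections import Counter

# Vertices: id -> property dict.
# Edges: (src, dst, property) tuples kept in a list, so several edges may
# connect the same pair of vertices (that is what makes it a multigraph).
vertices = {
    "a": {"name": "Alice"},
    "b": {"name": "Bob"},
    "c": {"name": "Charlie"},
}
edges = [
    ("a", "b", "friend"),
    ("a", "b", "follow"),  # parallel edge between the same two vertices
    ("b", "c", "follow"),
]

def reverse(edges):
    """Flip the direction of every edge, keeping its property."""
    return [(dst, src, prop) for src, dst, prop in edges]

def subgraph(vertices, edges, vertex_pred):
    """Keep the vertices that satisfy vertex_pred and the edges between them."""
    kept = {vid: props for vid, props in vertices.items() if vertex_pred(vid, props)}
    kept_edges = [(s, d, p) for s, d, p in edges if s in kept and d in kept]
    return kept, kept_edges

# In-degree: number of edges arriving at each vertex
in_degrees = Counter(dst for _, dst, _ in edges)
print(in_degrees)

# Drop Charlie and every edge that touches him
print(subgraph(vertices, edges, lambda vid, props: props["name"] != "Charlie"))
```

GraphX stores the same two collections (vertices and edges) as distributed RDDs, so the operators run in parallel across the cluster.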

## Graphs in GraphX
<img src="http://persoal.citius.usc.es/tf.pena/TCDM/figs/grapxgraph.png" alt="Graph in GraphX" style="width: 50px;"/>
(Source: M.S. Malak, R. East "Spark GraphX in action", Manning, 2016)

### Example of a simple graph
<img src="http://persoal.citius.usc.es/tf.pena/TCDM/figs/simpsonsgraph.png" alt="Graph of the Simpsons" style="width: 600px;"/>
(Source: P. Zečević, M. Bonaći "Spark in action", Manning, 2017)

## GraphFrames: Graph processing with DataFrames

In Python we can use [*GraphFrames*](https://graphframes.github.io/graphframes/docs/_site/quick-start.html), which wraps GraphX algorithms under the DataFrames API and provides a Python interface.

- Support for multiple languages is in the works
    - For now, it is available for Scala and Python
- Not yet integrated into Spark
    - Available as an external package (https://spark-packages.org/package/graphframes/graphframes)

More information:
- Project web: https://graphframes.github.io/graphframes/docs/_site/
- Python API : https://graphframes.github.io/graphframes/docs/_site/api/python/index.html


### Graphs using pyspark and GraphFrames

In [None]:
# The following example shows how to create a GraphFrame, query it, and run the PageRank algorithm.
# Source: https://graphframes.github.io/graphframes/docs/_site/quick-start.html

from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

from graphframes import *

# Create a Vertex DataFrame with unique ID column "id"
v = sqlContext.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = sqlContext.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
# (printed explicitly: a bare expression in the middle of a cell is not displayed)
print(g.edges.filter("relationship = 'follow'").count())

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()
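
To make what a PageRank run computes more concrete, here is a minimal plain-Python sketch of the textbook power-iteration formulation, applied to the same three-vertex graph. It is not the GraphFrames implementation: the function `pagerank` and its normalisation (ranks sum to 1) are assumptions made for illustration, so the values need not match the GraphFrames output, whose implementation and normalisation may differ.

```python
def pagerank(vertices, edges, reset_probability=0.15, max_iter=20):
    """Textbook PageRank by power iteration; the ranks always sum to 1."""
    n = len(vertices)
    out_links = {v: [] for v in vertices}
    for src, dst in edges:
        out_links[src].append(dst)

    # Start from a uniform distribution
    ranks = {v: 1.0 / n for v in vertices}
    for _ in range(max_iter):
        # Every vertex receives the "random jump" share first
        new_ranks = {v: reset_probability / n for v in vertices}
        for v, targets in out_links.items():
            # A vertex without outgoing edges spreads its rank over all vertices
            receivers = targets if targets else vertices
            share = (1 - reset_probability) * ranks[v] / len(receivers)
            for t in receivers:
                new_ranks[t] += share
        ranks = new_ranks
    return ranks

# Same vertices and edges as the GraphFrame above (edge properties omitted)
vertices = ["a", "b", "c"]
edges = [("a", "b"), ("b", "c"), ("c", "b")]

ranks = pagerank(vertices, edges, reset_probability=0.01, max_iter=20)
for v in sorted(ranks, key=ranks.get, reverse=True):
    print(v, round(ranks[v], 4))
```

Vertex `a` has no incoming edges, so it only ever receives the random-jump share and ends up with the lowest rank. With such a small `reset_probability`, the ranks of `b` and `c` converge slowly, so their values still depend noticeably on `max_iter`.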

# Exercises

## Exercise 11.1:

A long time ago in a galaxy far, far away, the characters of the Star Wars franchise interacted with each other in an endless series of films. An ancient Jedi order, called the *Data Guardians of the Galaxy* (not affiliated with Marvel's namesake :-), recorded all those interactions and saved them in a digital file so that they could be studied by future generations. This file was originally called (you guessed it) `sw.txt`, and you will find it in the `/data` directory.

Using PySpark, perform the following operations and answer the following questions:

1. Load the `$DRIVE_DATA/sw.txt` file. Note that, despite its `.txt` extension, it is a JSON file.
2. Using this information, create a graph of interactions between the Star Wars characters.
3. How many different characters are there?
4. How many interactions are there?
5. Who is the central character in Star Wars (the one who interacts in the most scenes)?
6. Who is the character with the highest 'rank' in Star Wars (use the PageRank algorithm)?