# 1 Getting started with GraphX

## 1.1 Introduction to GraphX

**GraphX is a component in Spark for graphs and graph-parallel computation**. At a high level, GraphX extends the Spark RDD by introducing a new Graph abstraction: **a directed multigraph with properties attached to each vertex and edge**. To support graph computation, GraphX exposes a set of fundamental operators (e.g., `subgraph`, `joinVertices`, and `aggregateMessages`) as well as an optimized variant of the [Pregel](https://spark.apache.org/docs/latest/graphx-programming-guide.html#pregel) API. In addition, GraphX includes a growing collection of graph algorithms and builders to simplify graph analytics tasks.

The official documentation can be found [here](https://spark.apache.org/docs/latest/graphx-programming-guide.html#getting-started).
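
To make the phrase "directed multigraph with properties" concrete, here is a minimal pure-Python sketch. It is not GraphX code: the `PropertyMultigraph` class and the sample vertices are hypothetical and only illustrate the three ideas (direction, parallel edges, attached properties).

```python
from dataclasses import dataclass, field


@dataclass
class PropertyMultigraph:
    """Toy directed multigraph with properties on vertices and edges."""

    vertices: dict = field(default_factory=dict)  # vertex id -> properties
    edges: list = field(default_factory=list)     # (src, dst, properties)

    def add_vertex(self, vid, **props):
        self.vertices[vid] = props

    def add_edge(self, src, dst, **props):
        # A list (not a set) keeps parallel edges between the same pair.
        self.edges.append((src, dst, props))

    def out_edges(self, vid):
        # Direction matters: only edges that start at `vid` are returned.
        return [edge for edge in self.edges if edge[0] == vid]


graph = PropertyMultigraph()
graph.add_vertex("a", name="Alice")
graph.add_vertex("b", name="Bob")
graph.add_edge("a", "b", type="friend")
graph.add_edge("a", "b", type="follows")  # second edge between the same pair

print(len(graph.out_edges("a")))  # 2
print(len(graph.out_edges("b")))  # 0
```

Storing edges in a list is the design choice that makes this a multigraph: two edges with the same source and destination can coexist, each with its own properties.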


## 1.2 Installation
GraphX itself ships with Spark, so there is nothing to install, but it only exposes an RDD-based API for Scala and Java. This notebook uses **GraphFrames**, a separate package that provides a DataFrame-based graph API (usable from Python) and is not included in Spark by default.
We therefore need to add the `graphframes` library, which lets us build a graph from DataFrames.

The official GraphFrames user guide can be found [here](https://graphframes.github.io/graphframes/docs/_site/user-guide.html).

The jar file can be downloaded [here](http://spark-packages.org/package/graphframes/graphframes).

For the Scala API, setup is simple: download the jar file and add it to the Spark context, or let the Spark session download it automatically with the following setting:
`.config('spark.jars.packages','graphframes:graphframes:0.8.2-spark3.2-s_2.12')`.
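
The package coordinate appears to encode three versions: the GraphFrames release, the Spark line it was built for, and the Scala version. That reading is an assumption based on the string itself and on the resolution log further down. The helper below is hypothetical (it is not part of Spark or GraphFrames) and only assembles the string; check the spark-packages page to see which combinations are actually published.

```python
def graphframes_coordinate(graphframes_version, spark_version, scala_version):
    """Build a value for spark.jars.packages (hypothetical helper).

    Assumes the naming pattern seen in this notebook:
    graphframes:graphframes:<release>-spark<spark line>-s_<scala version>
    """
    return (
        "graphframes:graphframes:"
        f"{graphframes_version}-spark{spark_version}-s_{scala_version}"
    )


coordinate = graphframes_coordinate("0.8.2", "3.2", "2.12")
print(coordinate)  # graphframes:graphframes:0.8.2-spark3.2-s_2.12
```

The resulting string can be passed as the value of `spark.jars.packages` when building the session.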

For the Python API, follow the same steps, then install the Python wrapper in your virtual environment.

```shell
pip install graphframes
```

> The PyPI release of this package is somewhat outdated. Since it is only a thin wrapper, I have not encountered any compatibility issues so far.

## 1.3 A simple example

In [1]:
from pyspark.sql import SparkSession
from graphframes import GraphFrame
import os

In [2]:
local=True
if local:
    spark = SparkSession.builder\
        .master("local[4]")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.executor.memory", "4g")\
        .config('spark.jars.packages','graphframes:graphframes:0.8.2-spark3.2-s_2.12') \
        .getOrCreate()
else:
    spark = SparkSession.builder\
        .master("k8s://https://kubernetes.default.svc:443")\
        .appName("RepartitionAndCoalesce")\
        .config("spark.kubernetes.container.image", "inseefrlab/jupyter-datascience:master")\
        .config("spark.kubernetes.authenticate.driver.serviceAccountName", os.environ['KUBERNETES_SERVICE_ACCOUNT'])\
        .config('spark.jars.packages','graphframes:graphframes:0.8.2-spark3.2-s_2.12') \
        .config("spark.executor.instances", "4")\
        .config("spark.executor.memory","2g")\
        .config("spark.kubernetes.namespace", os.environ['KUBERNETES_NAMESPACE'])\
        .getOrCreate()

# enable eager evaluation so DataFrames render as tables in the notebook
spark.conf.set("spark.sql.repl.eagerEval.enabled",True)

23/07/07 13:36:43 WARN Utils: Your hostname, pengfei-Virtual-Machine resolves to a loopback address: 127.0.1.1; using 10.50.2.80 instead (on interface eth0)
23/07/07 13:36:43 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
:: loading settings :: url = jar:file:/home/pengfei/opt/spark-3.3.0/jars/ivy-2.5.0.jar!/org/apache/ivy/core/settings/ivysettings.xml


Ivy Default Cache set to: /home/pengfei/.ivy2/cache
The jars for the packages stored in: /home/pengfei/.ivy2/jars
graphframes#graphframes added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-737adafe-8024-4dc5-b8a7-9a37155c045b;1.0
	confs: [default]
	found graphframes#graphframes;0.8.2-spark3.2-s_2.12 in spark-packages
	found org.slf4j#slf4j-api;1.7.16 in central
:: resolution report :: resolve 295ms :: artifacts dl 7ms
	:: modules in use:
	graphframes#graphframes;0.8.2-spark3.2-s_2.12 from spark-packages in [default]
	org.slf4j#slf4j-api;1.7.16 from central in [default]
	---------------------------------------------------------------------
	|                  |            modules            ||   artifacts   |
	|       conf       | number| search|dwnlded|evicted|| number|dwnlded|
	---------------------------------------------------------------------
	|      default     |   2   |   0   |   0   |   0   ||   2   |   0   |
	-------------------------------

23/07/07 13:36:45 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).


In [3]:
vertices = spark.createDataFrame([('1', 'Carter', 'Derrick', 50),
                                  ('2', 'May', 'Derrick', 26),
                                  ('3', 'Mills', 'Jeff', 80),
                                  ('4', 'Hood', 'Robert', 65),
                                  ('5', 'Banks', 'Mike', 93),
                                  ('98', 'Berg', 'Tim', 28),
                                  ('99', 'Page', 'Allan', 16)],
                                 ['id', 'name', 'firstname', 'age'])

edges = spark.createDataFrame([('1', '2', 'friend'),
                               ('2', '1', 'friend'),
                               ('3', '1', 'friend'),
                               ('1', '3', 'friend'),
                               ('2', '3', 'follows'),
                               ('3', '4', 'friend'),
                               ('4', '3', 'friend'),
                               ('5', '3', 'friend'),
                               ('3', '5', 'friend'),
                               ('4', '5', 'follows'),
                               ('98', '99', 'friend'),
                               ('99', '98', 'friend')],
                              ['src', 'dst', 'type'])

In [4]:
g = GraphFrame(vertices, edges)
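
GraphFrames expects the vertex DataFrame to have an `id` column and the edge DataFrame to have `src` and `dst` columns, which is why the data above uses those names. Before building a graph it can help to check that every edge endpoint refers to a known vertex. The sketch below does this in plain Python on the same ids and edges; `dangling_endpoints` is a hypothetical helper, not a GraphFrames function.

```python
vertex_ids = ['1', '2', '3', '4', '5', '98', '99']

edge_rows = [('1', '2', 'friend'), ('2', '1', 'friend'),
             ('3', '1', 'friend'), ('1', '3', 'friend'),
             ('2', '3', 'follows'), ('3', '4', 'friend'),
             ('4', '3', 'friend'), ('5', '3', 'friend'),
             ('3', '5', 'friend'), ('4', '5', 'follows'),
             ('98', '99', 'friend'), ('99', '98', 'friend')]


def dangling_endpoints(vertex_ids, edge_rows):
    """Return edge endpoints that do not match any vertex id."""
    known_ids = set(vertex_ids)
    endpoints = {v for src, dst, _ in edge_rows for v in (src, dst)}
    return sorted(endpoints - known_ids)


print(dangling_endpoints(vertex_ids, edge_rows))  # []

# An edge pointing at an unknown vertex '42' is reported.
print(dangling_endpoints(vertex_ids, edge_rows + [('1', '42', 'friend')]))  # ['42']
```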



In [5]:
# show the vertices (nodes)
g.vertices.show()

+---+------+---------+---+
| id|  name|firstname|age|
+---+------+---------+---+
|  1|Carter|  Derrick| 50|
|  2|   May|  Derrick| 26|
|  3| Mills|     Jeff| 80|
|  4|  Hood|   Robert| 65|
|  5| Banks|     Mike| 93|
| 98|  Berg|      Tim| 28|
| 99|  Page|    Allan| 16|
+---+------+---------+---+

In [6]:
# show the edges (relations between nodes)
g.edges.show()

+---+---+-------+
|src|dst|   type|
+---+---+-------+
|  1|  2| friend|
|  2|  1| friend|
|  3|  1| friend|
|  1|  3| friend|
|  2|  3|follows|
|  3|  4| friend|
|  4|  3| friend|
|  5|  3| friend|
|  3|  5| friend|
|  4|  5|follows|
| 98| 99| friend|
| 99| 98| friend|
+---+---+-------+



In [7]:
# count the edges (incoming + outgoing) attached to each vertex
g.degrees.show()



+---+------+
| id|degree|
+---+------+
|  3|     7|
|  1|     4|
|  2|     3|
|  4|     3|
|  5|     3|
| 98|     2|
| 99|     2|
+---+------+
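
The degree of a vertex counts every edge that touches it, incoming or outgoing. As a cross-check of the table above, the same numbers can be recomputed in plain Python from the edge list. This is only a sketch for verification; GraphFrames itself also exposes `inDegrees` and `outDegrees` for the two directions.

```python
from collections import Counter

edge_rows = [('1', '2', 'friend'), ('2', '1', 'friend'),
             ('3', '1', 'friend'), ('1', '3', 'friend'),
             ('2', '3', 'follows'), ('3', '4', 'friend'),
             ('4', '3', 'friend'), ('5', '3', 'friend'),
             ('3', '5', 'friend'), ('4', '5', 'follows'),
             ('98', '99', 'friend'), ('99', '98', 'friend')]

# Count how often each vertex appears as a source and as a destination.
out_degree = Counter(src for src, _, _ in edge_rows)
in_degree = Counter(dst for _, dst, _ in edge_rows)

# Total degree is the sum of both directions.
degree = out_degree + in_degree

print(degree['3'])                      # 7
print(in_degree['3'], out_degree['3'])  # 4 3
```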



The GraphFrame we just created is a **directed graph**, and can be visualized as follows:
![graphx_graph_exp1.webp](../../../../images/graphx_graph_exp1.webp)

