<a href="https://colab.research.google.com/github/momo54/large_scale_data_management/blob/main/GraphFramesPageRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RDF and GraphFrames

GraphFrames is an additional package for graph processing in Spark. It is an alternative to GraphX that, unlike GraphX, is also available from Python.


To launch PySpark with GraphFrames from a terminal:

```
pyspark --packages graphframes:graphframes:0.8.1-spark3.0-s_2.12
```

Valid configurations are listed at:
```
https://spark-packages.org/package/graphframes/graphframes
```
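
The coordinates used in this notebook appear to follow the pattern `graphframes:graphframes:<GraphFrames version>-spark<Spark version>-s_<Scala version>`. A minimal sketch, assuming that pattern holds for the release you need (the helper name `graphframes_coordinate` is hypothetical, and the result should still be checked against the package page):

```python
def graphframes_coordinate(gf_version: str, spark_version: str, scala_version: str) -> str:
    """Build a Spark package coordinate for GraphFrames (hypothetical helper)."""
    return f"graphframes:graphframes:{gf_version}-spark{spark_version}-s_{scala_version}"

# Reproduces the coordinate passed to `pyspark --packages` above.
coordinate = graphframes_coordinate("0.8.1", "3.0", "2.12")
print(coordinate)  # graphframes:graphframes:0.8.1-spark3.0-s_2.12
```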


In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!pip install graphframes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [14]:
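# Download the GraphFrames jar and copy it into PySpark's jars directory
# (adjust the Python version in the target path to match your environment)
!wget -nc https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar
!cp graphframes-0.8.2-spark3.2-s_2.12.jar /usr/local/lib/python3.6/dist-packages/jars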

File ‘graphframes-0.8.2-spark3.2-s_2.12.jar’ already there; not retrieving.



In [13]:
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.appName("Basics").getOrCreate()


In [5]:
spark = SparkSession.builder \
  .master("local[*]") \
  .config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12") \
  .getOrCreate()

In [12]:
# checking that everything works...

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame
from graphframes import GraphFrame
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
print(g.edges.filter("relationship = 'follow'").count())

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()


+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       1|
+---+--------+

2
+---+------------------+
| id|          pagerank|
+---+------------------+
|  c|1.8994109890559092|
|  b|1.0905890109440908|
|  a|              0.01|
+---+------------------+
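
To see where a value such as `0.01` for vertex `a` comes from, here is a minimal pure-Python sketch of a PageRank-style update on the same three edges. It assumes the unnormalized formulation rank = resetProbability + (1 - resetProbability) × (sum of incoming contributions), with synchronous updates and an initial rank of 1.0. Those are assumptions about the underlying implementation, so the values for `b` and `c` are not guaranteed to match the output above. The function name `pagerank_sketch` is hypothetical.

```python
from collections import defaultdict

def pagerank_sketch(edges, reset_probability=0.01, max_iter=20):
    """Toy synchronous PageRank on (src, dst) pairs; not the GraphFrames code."""
    vertices = sorted({v for edge in edges for v in edge})
    out_degree = defaultdict(int)
    for src, _ in edges:
        out_degree[src] += 1

    ranks = {v: 1.0 for v in vertices}  # assumed initial rank
    for _ in range(max_iter):
        incoming = defaultdict(float)
        # Each vertex splits its current rank evenly among its out-edges.
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # Vertices with no incoming edge keep only the reset probability.
        ranks = {
            v: reset_probability + (1 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks

ranks = pagerank_sketch([("a", "b"), ("b", "c"), ("c", "b")])
for vertex, rank in ranks.items():
    print(vertex, rank)
```

In this sketch, vertex `a` has no incoming edges, so its rank collapses to the reset probability, which is consistent with the `0.01` shown above.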



In [19]:
e.printSchema()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- relationship: string (nullable = true)



In [15]:
# using a more realistic RDF graph (only triples)

!wget -nc -q https://raw.githubusercontent.com/momo54/large_scale_data_management/main/small_page_links.nt

In [27]:
from pyspark.sql.types import StructType, StringType

# Column order of a triple: subject, predicate, object
schema = StructType() \
  .add("src", StringType(), True) \
  .add("relationship", StringType(), True) \
  .add("dst", StringType(), True)

# Read triples only; handling quads would require reification.
# Note: these input files are not the small_page_links.nt file downloaded above.
edges = spark.read.format("csv") \
  .options(delimiter=" ") \
  .schema(schema) \
  .load(["multi.txt0.txt", "catalog.txt0.txt"])

edges.take(1)

# Generate vertices from edges: every distinct subject or object becomes a vertex id
vertices = edges.select('src') \
  .union(edges.select('dst')) \
  .withColumnRenamed('src', 'id') \
  .distinct()

vertices.take(1)

graph = GraphFrame(vertices, edges)

# Query: Get in-degree of each vertex.
graph.inDegrees.show()

+-------------+--------+
|           id|inDegree|
+-------------+--------+
| string_30778|       5|
| string_31272|       7|
| string_30882|       4|
|integer_32572|       7|
| string_31579|       7|
|integer_32509|       4|
|integer_32238|       5|
|integer_32591|       3|
|integer_32313|       4|
|   date_33185|       2|
|    Topic_308|       2|
|   User_27610|       1|
|     City_606|       1|
|   User_26491|       1|
|   User_28534|       3|
|   User_29145|       3|
|   User_25927|       2|
|    Topic_204|       1|
|Website_16323|       4|
|   User_27484|       4|
+-------------+--------+
only showing top 20 rows
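
Splitting each line on a single space works only when no term contains a space; an N-Triples literal such as `"hello world"` would be cut across columns. A minimal sketch of a more tolerant line parser, assuming one triple per line terminated by ` .` and ignoring the finer points of the N-Triples grammar (the regular expression and the function name `parse_triple` are illustrative, not part of Spark or GraphFrames):

```python
import re

# Subject and predicate: runs of non-space characters.
# Object: everything up to the final "." that ends the line.
TRIPLE_RE = re.compile(r'^\s*(\S+)\s+(\S+)\s+(.+?)\s*\.\s*$')

def parse_triple(line):
    """Return (src, relationship, dst) for one line, or None if it does not match."""
    match = TRIPLE_RE.match(line)
    if match is None:
        return None
    return match.groups()

sample = '<http://example.org/a> <http://example.org/label> "hello world" .'
print(parse_triple(sample))
```

Such a function could be applied to the lines of a file read as plain text before building the edge DataFrame.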

