<a href="https://colab.research.google.com/github/momo54/large_scale_data_management/blob/main/GraphFramesPageRank.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# RDF and GraphFrames

GraphFrames is an add-on package for graph processing in Spark. It is an alternative to GraphX and, unlike GraphX, is available in Python.


To try it from a terminal, launch PySpark with the GraphFrames package and run the example graph:

```
# Shell: start PySpark with the GraphFrames package
pyspark --packages graphframes:graphframes:0.8.2-spark3.2-s_2.12

# Then, at the PySpark prompt:
from graphframes.examples import Graphs
g = Graphs(sqlContext).friends()  # Get example graph

# Display the vertex and edge DataFrames
g.vertices.show()
```

Valid package versions (GraphFrames, Spark, and Scala combinations) are listed at:
```
https://spark-packages.org/package/graphframes/graphframes
```
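
The package coordinate passed to `--packages` encodes three versions (GraphFrames, Spark, Scala), and the JAR downloaded later in this notebook uses the same release string. As a minimal sketch, the two helpers below build both strings from their parts. `graphframes_coordinate` and `graphframes_jar_url` are hypothetical names, not part of any library, and the URL layout is inferred only from the single download link used in this notebook, so check other versions against the package page.

```python
def graphframes_release(gf_version: str, spark_version: str, scala_version: str) -> str:
    """Release string shared by the coordinate and the JAR name (hypothetical helper)."""
    return f"{gf_version}-spark{spark_version}-s_{scala_version}"


def graphframes_coordinate(gf_version: str, spark_version: str, scala_version: str) -> str:
    """Coordinate in the group:artifact:version form used by --packages."""
    release = graphframes_release(gf_version, spark_version, scala_version)
    return f"graphframes:graphframes:{release}"


def graphframes_jar_url(gf_version: str, spark_version: str, scala_version: str) -> str:
    """JAR URL; the layout is an assumption inferred from this notebook's wget line."""
    release = graphframes_release(gf_version, spark_version, scala_version)
    return (
        "https://repos.spark-packages.org/graphframes/graphframes/"
        f"{release}/graphframes-{release}.jar"
    )


coordinate = graphframes_coordinate("0.8.2", "3.2", "2.12")
jar_url = graphframes_jar_url("0.8.2", "3.2", "2.12")
print(coordinate)
print(jar_url)
```

Keeping the three versions in one place makes it harder to pair a JAR with a Spark or Scala version it was not built for.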


In [1]:
!pip install pyspark

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [76]:
# apparently this should not be done!
#!pip install graphframes

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/


In [2]:
!wget -nc https://repos.spark-packages.org/graphframes/graphframes/0.8.2-spark3.2-s_2.12/graphframes-0.8.2-spark3.2-s_2.12.jar
!cp graphframes-0.8.2-spark3.2-s_2.12.jar /usr/local/lib/python3.7/dist-packages/pyspark/jars/
!ls /usr/local/lib/python3.7/dist-packages/pyspark/jars/graph*

File ‘graphframes-0.8.2-spark3.2-s_2.12.jar’ already there; not retrieving.

cp: cannot create regular file '/usr/local/lib/python3.6/dist-packages/pyspark/jars/': No such file or directory
/usr/local/lib/python3.7/dist-packages/pyspark/jars/graphframes-0.8.2-spark3.2-s_2.12.jar


In [3]:
from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.config("spark.jars.packages", "graphframes:graphframes:0.8.2-spark3.2-s_2.12").getOrCreate()  


In [5]:
# checking that everything works...

# Create a Vertex DataFrame with unique ID column "id"
v = spark.createDataFrame([
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
], ["id", "name", "age"])
# Create an Edge DataFrame with "src" and "dst" columns
e = spark.createDataFrame([
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
], ["src", "dst", "relationship"])
# Create a GraphFrame (GraphFrame was already imported with the Spark session)
g = GraphFrame(v, e)

# Query: Get in-degree of each vertex.
g.inDegrees.show()

# Query: Count the number of "follow" connections in the graph.
print(g.edges.filter("relationship = 'follow'").count())

# Run PageRank algorithm, and show results.
results = g.pageRank(resetProbability=0.01, maxIter=20)
results.vertices.select("id", "pagerank").show()


  "DataFrame.sql_ctx is an internal property, and will be removed "


+---+--------+
| id|inDegree|
+---+--------+
|  b|       2|
|  c|       1|
+---+--------+

2
+---+------------------+
| id|          pagerank|
+---+------------------+
|  c|1.8994109890559092|
|  b|1.0905890109440908|
|  a|              0.01|
+---+------------------+
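
To make the role of `resetProbability` concrete, here is a plain-Python sketch of a fixed-iteration PageRank. It assumes a GraphX-style, unnormalized update: every rank starts at 1.0, and each round sets a vertex's rank to `resetProbability + (1 - resetProbability) * sum(rank(u) / outDegree(u))` over its in-neighbours `u`. That rule is an assumption about what runs underneath, and `pagerank_sketch` is a hypothetical helper, not a GraphFrames API. On the three-vertex graph above it gives values close to the table.

```python
def pagerank_sketch(vertices, edges, reset_probability=0.15, max_iter=20):
    """Fixed-iteration, unnormalized PageRank on (src, dst) pairs (illustrative only)."""
    out_degree = {v: 0 for v in vertices}
    for src, _dst in edges:
        out_degree[src] += 1

    # Assumption: every vertex starts with rank 1.0.
    ranks = {v: 1.0 for v in vertices}
    for _ in range(max_iter):
        # Each vertex sends rank / out_degree along each outgoing edge.
        incoming = {v: 0.0 for v in vertices}
        for src, dst in edges:
            incoming[dst] += ranks[src] / out_degree[src]
        # A vertex with no incoming edge ends up with reset_probability only.
        ranks = {
            v: reset_probability + (1.0 - reset_probability) * incoming[v]
            for v in vertices
        }
    return ranks


ranks = pagerank_sketch(
    ["a", "b", "c"],
    [("a", "b"), ("b", "c"), ("c", "b")],
    reset_probability=0.01,
    max_iter=20,
)
for vertex, rank in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(vertex, rank)
```

Under this rule, with such a small `resetProbability`, vertices `b` and `c` hand almost all of their rank to each other every round, so after 20 iterations the values are still oscillating rather than settled. A larger `maxIter` or a larger `resetProbability` would bring them closer together.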



In [6]:
e.printSchema()

root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- relationship: string (nullable = true)



In [9]:
# using a more realistic RDF graph (only triples)

#!wget -nc -q https://raw.githubusercontent.com/momo54/large_scale_data_management/main/small_page_links.nt
!wget -nc https://raw.githubusercontent.com/momo54/large_scale_data_management/main/watdiv-100k.nt

--2022-10-31 10:47:39--  https://raw.githubusercontent.com/momo54/large_scale_data_management/main/watdiv-100k.nt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15098125 (14M) [text/plain]
Saving to: ‘watdiv-100k.nt’


2022-10-31 10:47:39 (172 MB/s) - ‘watdiv-100k.nt’ saved [15098125/15098125]



In [17]:
from pyspark.sql.types import StructType, StringType

# Column order follows the triple layout: subject, predicate, object
schema = StructType() \
  .add("src", StringType(), True) \
  .add("relationship", StringType(), True) \
  .add("dst", StringType(), True)

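# Read the N-Triples file as tab-separated triples (subject, predicate, object).
# Handling quads would require reification :-/
# Note: each line ends with " .", which this reader leaves attached to dst.
edges=spark.read.format("csv") \
  .options(delimiter="\t") \
  .schema(schema) \
  .load(["watdiv-100k.nt"])
#  .load(["multi.txt0.txt","catalog.txt0.txt"])

# Derive the vertex set from the edges: every distinct src or dst value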
vertices=edges.select('src') \
  .union(edges.select('dst')) \
  .distinct() \
  .withColumnRenamed('src', 'id')

print(vertices.take(1))

graph = GraphFrame(vertices, edges)


vertices.show(10)

edges.select("relationship").distinct().show(100,truncate=200)
edges.filter("relationship='<http://purl.org/goodrelations/includes>'").show(10)

[Row(id='<http://db.uwaterloo.ca/~galuc/wsdbm/User0>')]
+--------------------+
|                  id|
+--------------------+
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
|<http://db.uwater...|
+--------------------+
only showing top 10 rows

+---------------------------------------------------+
|                                       relationship|
+---------------------------------------------------+
|                        <http://schema.org/expires>|
| <http://db.uwaterloo.ca/~galuc/wsdbm/purchaseDate>|
|                <http://schema.org/aggregateRating>|
|  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>|
|                   <http://schema.org/contactPoint>|
|   <http://db.uwaterloo.ca/~galuc/wsdbm/subscribes>|
|                       <http://schema.org/employee>|
|                       <http://schema.org/language>|
| 

In [18]:
subgraph=graph.filterEdges("relationship='<http://purl.org/goodrelations/includes>'").dropIsolatedVertices()
subgraph.triplets.show(truncate=200)



+------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|                                             src|                                                                                                                                          edge|                                                 dst|
+------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| {<http://db.uwaterloo.ca/~galuc/wsdbm/Offer73>}| {<http://db.uwaterloo.ca/~galuc/wsdbm/Offer73>, <http://purl.org/goodrelations/includes>, <http://db.uwaterloo.ca/~galuc/wsdbm/Product105> .}|{<http://db.uwaterloo.ca/~galuc/wsdbm/Product105> .}|
|{<http://db
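
`filterEdges` keeps the edges that satisfy a condition, and `dropIsolatedVertices` then removes every vertex that no remaining edge touches. The plain-Python sketch below shows that idea on a toy edge list; the data and the helper `filter_edges_drop_isolated` are made up for illustration and are not GraphFrames code.

```python
def filter_edges_drop_isolated(vertices, edges, predicate):
    """Keep edges matching predicate, then keep only vertices those edges touch.

    Edges are (src, relationship, dst) tuples, in the same order as the schema above.
    """
    kept_edges = [edge for edge in edges if predicate(edge)]
    touched = {edge[0] for edge in kept_edges} | {edge[2] for edge in kept_edges}
    kept_vertices = [v for v in vertices if v in touched]
    return kept_vertices, kept_edges


toy_vertices = ["Retailer1", "Offer1", "Product1", "User1"]
toy_edges = [
    ("Retailer1", "offers", "Offer1"),
    ("Offer1", "includes", "Product1"),
    ("User1", "likes", "Product1"),
]

sub_vertices, sub_edges = filter_edges_drop_isolated(
    toy_vertices, toy_edges, lambda edge: edge[1] == "includes"
)
print(sub_vertices)
print(sub_edges)
```

Only the endpoints of the `includes` edge survive; `Retailer1` and `User1` are dropped because none of their edges pass the filter.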

In [19]:
offers = graph.find("(s)-[p]->(o)")\
  .filter("p.relationship='<http://purl.org/goodrelations/offers>'") 
offers.show(200,truncate=200)

+--------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
|                                                 s|                                                                                                                                           p|                                                 o|
+--------------------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------+--------------------------------------------------+
| {<http://db.uwaterloo.ca/~galuc/wsdbm/Retailer1>}| {<http://db.uwaterloo.ca/~galuc/wsdbm/Retailer1>, <http://purl.org/goodrelations/offers>, <http://db.uwaterloo.ca/~galuc/wsdbm/Offer367> .}|{<http://db.uwaterloo.ca/~galuc/wsdbm/Offer367> .}|
| {<http://db.uwater

In [20]:
includes = graph.find("(s)-[p]->(o)")\
  .filter("p.relationship='<http://purl.org/goodrelations/includes>'") 
includes.show(200,truncate=200)

+------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
|                                               s|                                                                                                                                             p|                                                   o|
+------------------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------+
| {<http://db.uwaterloo.ca/~galuc/wsdbm/Offer73>}| {<http://db.uwaterloo.ca/~galuc/wsdbm/Offer73>, <http://purl.org/goodrelations/includes>, <http://db.uwaterloo.ca/~galuc/wsdbm/Product105> .}|{<http://db.uwaterloo.ca/~galuc/wsdbm/Product105> .}|
|{<http://db

In [21]:
chain4 = graph.find("(a)-[offers]->(b);(b)-[includes]->(c)")\
  .filter("offers.relationship='<http://purl.org/goodrelations/offers>'") \
  .filter("includes.relationship='<http://purl.org/goodrelations/includes>'")
chain4.show(200)

+---+------+---+--------+---+
|  a|offers|  b|includes|  c|
+---+------+---+--------+---+
+---+------+---+--------+---+
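
The empty result is probably a parsing artifact rather than a property of the data. In the tables above every `dst` value still carries the N-Triples terminator (a trailing ` .`), so an offer appearing as an object never compares equal to the same offer appearing as a subject, and the two-hop motif has nothing to join on. The plain-Python sketch below reproduces the effect on two made-up triples. `parse_triple` and `two_hop` are hypothetical helpers, and the `ex:` identifiers are invented.

```python
import re


def parse_triple(line, strip_terminator=True):
    """Split one tab-separated N-Triples line into (src, relationship, dst)."""
    src, relationship, dst = line.rstrip("\n").split("\t")
    if strip_terminator:
        # Remove the statement terminator " ." left at the end of the object.
        dst = re.sub(r"\s*\.\s*$", "", dst)
    return src, relationship, dst


def two_hop(edges, first, second):
    """Find a -first-> b -second-> c chains by joining the edge list with itself."""
    return [
        (a, b, c)
        for a, p, b in edges
        if p == first
        for b2, q, c in edges
        if q == second and b2 == b
    ]


lines = [
    "<ex:Retailer1>\t<ex:offers>\t<ex:Offer1> .",
    "<ex:Offer1>\t<ex:includes>\t<ex:Product1> .",
]

raw_edges = [parse_triple(line, strip_terminator=False) for line in lines]
clean_edges = [parse_triple(line) for line in lines]

raw_chains = two_hop(raw_edges, "<ex:offers>", "<ex:includes>")
clean_chains = two_hop(clean_edges, "<ex:offers>", "<ex:includes>")
print(raw_chains)    # no match: "<ex:Offer1> ." differs from "<ex:Offer1>"
print(clean_chains)  # one chain once the terminator is stripped
```

In the notebook, the same cleanup could be applied to the `dst` column before deriving `vertices`, for example with `regexp_replace` from `pyspark.sql.functions`. That step is a suggestion and has not been run here.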

