

**Lab 06**: Writing data in Neo4j

**Date**: September 23rd 2025

**Student Name**: Jose Angel Leon Perez

**Professor**: Pablo Camarillo Ramirez

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on storage solutions with Neo4j") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-3a805096-62c0-4b7f-9504-7ae1fb561ea1;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

Dataset: Select a dataset from any public repository (e.g. Kaggle or Network Repository) that contains relationships. In this section of the Notebook, you need to describe the nodes and edges of the graph (one column representing the source and another column representing the destination of each relationship).

Dataset
Description

The friends_table.csv dataset contains relationships between users, each identified by a numeric ID.
Each row represents a friendship between two nodes (users):

Friend 1	Friend 2
1	555
1	921
1	213
2	752
3	705
...	...

Nodes (nodes_df) → the unique user IDs.

Edges (edges_df) → the (Friend 1, Friend 2) relationships between them.
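
To make the node definition concrete, here is a minimal plain-Python sketch of the same idea, using only the five sample rows shown in the table above. The helper name `unique_nodes` is hypothetical and is not part of the lab code; the PySpark cell later in the notebook does the equivalent with `union` and `distinct`.

```python
# Sample rows copied from the table above: (Friend 1, Friend 2).
sample_edges = [(1, 555), (1, 921), (1, 213), (2, 752), (3, 705)]


def unique_nodes(edges):
    """Return the sorted IDs that appear as either endpoint of an edge."""
    ids = set()
    for source, target in edges:
        ids.add(source)
        ids.add(target)
    return sorted(ids)


print(unique_nodes(sample_edges))
# → [1, 2, 3, 213, 555, 705, 752, 921]
```

Five edges yield eight nodes here because node 1 is the source of three of them.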

Data Ingestion: This section should contain a code cell using PySpark to read the DataFrame.

In [6]:
df = spark.read.csv("/opt/spark/work-dir/friends_table.csv", header=True, inferSchema=True)

print("Dataset preview:")
df.show(10)

Dataset preview:
+--------+--------+
|Friend 1|Friend 2|
+--------+--------+
|       1|     555|
|       1|     921|
|       1|     213|
|       1|     184|
|       1|     242|
|       1|     402|
|       1|     399|
|       1|      84|
|       1|      55|
|       2|     752|
+--------+--------+
only showing top 10 rows


Transformations: This section should contain a code cell using PySpark transformations to generate (at least) two DataFrames: one for nodes and one for edges.

In [7]:
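from pyspark.sql.functions import col

# Rename the CSV columns to the source/target names used for the edge list.
edges_df = df.withColumnRenamed("Friend 1", "source") \
             .withColumnRenamed("Friend 2", "target")

# A node is any ID that appears as either endpoint of an edge.
nodes_df = edges_df.select(col("source").alias("id")).union(
           edges_df.select(col("target").alias("id"))
          ).distinct()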

print(f"Total nodes: {nodes_df.count()}")
print(f"Total edges: {edges_df.count()}")

nodes_df.show(5)
edges_df.show(5)

Total nodes: 1000
Total edges: 9402
+---+
| id|
+---+
|535|
| 71|
|705|
|744|
|679|
+---+
only showing top 5 rows
+------+------+
|source|target|
+------+------+
|     1|   555|
|     1|   921|
|     1|   213|
|     1|   184|
|     1|   242|
+------+------+
only showing top 5 rows
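
Friendship is symmetric, but each row stores a pair in one direction only, so it can be worth checking whether the edge list contains repeated rows or pairs listed in both directions before writing it to the graph. The following is a minimal plain-Python sketch of such a check. The function `edge_summary` and the `demo` rows are hypothetical and are not taken from friends_table.csv; nothing here claims that the real file contains duplicates.

```python
def edge_summary(edges):
    """Count exact duplicate rows and pairs that also appear reversed."""
    seen = set()
    duplicates = 0
    for edge in edges:
        if edge in seen:
            duplicates += 1
        seen.add(edge)
    # Count each reciprocal pair once by only looking from the smaller ID.
    reciprocal = sum(1 for (s, t) in seen if s < t and (t, s) in seen)
    return {
        "distinct": len(seen),
        "duplicates": duplicates,
        "reciprocal_pairs": reciprocal,
    }


# Hypothetical rows for illustration only.
demo = [(1, 555), (555, 1), (1, 921), (1, 921), (2, 752)]
print(edge_summary(demo))
# → {'distinct': 4, 'duplicates': 1, 'reciprocal_pairs': 1}
```

On the real data, the same counts could be obtained from `edges_df` with Spark aggregations; the plain-Python version is only meant to show the logic.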


Writing Data in Neo4j: This section should contain code to persist nodes and edges DataFrames in Neo4j.

In [9]:
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"


nodes_df.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("labels", ":Person") \
    .option("node.keys", "id") \
    .save()

print("Nodes saved to Neo4j successfully.")


# Write the relationships; both endpoints are matched against existing :Person nodes by id.
edges_df.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("relationship", "FRIEND_WITH") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Person") \
    .option("relationship.source.save.mode", "match") \
    .option("relationship.source.node.keys", "source:id") \
    .option("relationship.target.labels", ":Person") \
    .option("relationship.target.save.mode", "match") \
    .option("relationship.target.node.keys", "target:id") \
    .save()

print("Relationships saved to Neo4j successfully.")

Nodes saved to Neo4j successfully.
Relationships saved to Neo4j successfully.
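
The URL, username, and password options are repeated in every read and write in this notebook. One possible way to reduce the repetition is to collect them in a dictionary and pass it with `.options(**...)`. The sketch below assumes a hypothetical helper `neo4j_options` and a hypothetical environment variable `NEO4J_PASSWORD`; neither is part of the lab setup, and the fallback simply keeps the value used in the cell above.

```python
import os


def neo4j_options(url, user, password):
    """Connection options shared by the Neo4j reads and writes."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


# NEO4J_PASSWORD is a hypothetical variable name; when it is not set,
# the password from the notebook cell is used so the example still runs.
common = neo4j_options(
    "bolt://neo4j-iteso:7687",
    "neo4j",
    os.environ.get("NEO4J_PASSWORD", "neo4j@1234"),
)

# Usage with the DataFrames defined earlier (needs the Spark session):
# nodes_df.write.format("org.neo4j.spark.DataSource").mode("Overwrite") \
#     .options(**common) \
#     .option("labels", ":Person") \
#     .option("node.keys", "id") \
#     .save()
```

Reading the password from the environment also keeps it out of the notebook file if the notebook is shared.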


                                                                                

Querying the Graph: This section should contain code to read the graph from Neo4j and query data.

In [11]:
cypher_all_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            MATCH (a:Person)-[r:FRIEND_WITH]->(b:Person)
            RETURN a.id AS source, b.id AS target
            """) \
    .load()

print("Relationships read from Neo4j:")
cypher_all_df.show(10, truncate=False)

node_id = 1

cypher_friends_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            f"""
            MATCH (p1:Person {{id: {node_id}}})-[:FRIEND_WITH]->(p2:Person)
            RETURN p2.id AS friend_id
            """) \
    .load()

print(f"Friends of node {node_id}:")
cypher_friends_df.show(truncate=False)


Relationships read from Neo4j:
+------+------+
|source|target|
+------+------+
|801   |535   |
|942   |535   |
|796   |535   |
|687   |535   |
|848   |535   |
|887   |535   |
|146   |535   |
|206   |535   |
|2     |535   |
|558   |71    |
+------+------+
only showing top 10 rows
Friends of node 1:
+---------+
|friend_id|
+---------+
|55       |
|84       |
|399      |
|402      |
|242      |
|184      |
|213      |
|921      |
|555      |
+---------+
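
The friends query follows the relationship only in its stored direction, so it returns the nodes that node 1 points to. If a user also appears in the Friend 2 column of some row, those friendships would only show up with an undirected pattern. The sketch below builds both variants of the query text; the function `friends_query` is a hypothetical helper, not part of the lab code, and the resulting string would be passed to the `query` option as in the cell above.

```python
def friends_query(node_id, directed=True):
    """Build the Cypher text that lists the friends of one Person node."""
    # int() raises ValueError for non-numeric input, so arbitrary text
    # cannot be interpolated into the Cypher statement.
    node_id = int(node_id)
    # "->" follows the stored direction; "-" matches either direction.
    arrow = "->" if directed else "-"
    return (
        f"MATCH (p1:Person {{id: {node_id}}})-[:FRIEND_WITH]{arrow}(p2:Person) "
        "RETURN p2.id AS friend_id"
    )


print(friends_query(1))
# → MATCH (p1:Person {id: 1})-[:FRIEND_WITH]->(p2:Person) RETURN p2.id AS friend_id
print(friends_query(1, directed=False))
# → MATCH (p1:Person {id: 1})-[:FRIEND_WITH]-(p2:Person) RETURN p2.id AS friend_id
```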

