# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Rodrigo Martín del Campo

**Professor**: Pablo Camarillo Ramirez

# Dataset description


The dataset I chose describes heroes, comics, and the relationships between them. It consists of two CSV files:

- **`nodes.csv`**: columns `node` (name of the hero or comic) and `type` (`hero` or `comic`).
- **`edges.csv`**: columns `hero` and `comic`. Each row means that the hero appears in that comic, which maps to `(:Hero)-[:APPEARS_IN]->(:Comic)`, with the hero as the source and the comic as the destination.
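
To make the two-file layout concrete, here is a minimal sketch in plain Python (no Spark) that parses a few rows and groups comics by hero. The inline CSV strings are illustrative: the rows are copied from the previews shown in this notebook, and I am assuming that comics such as `AA2 35` also appear as `comic` rows in `nodes.csv`. The names `load_rows`, `node_type`, and `appears_in` are my own.

```python
import csv
import io
from collections import defaultdict

# A handful of sample rows; the real files are far larger.
# Assumption: the comics referenced by the edges also exist in nodes.csv.
NODES_CSV = """node,type
2001 10,comic
24-HOUR MAN/EMMANUEL,hero
3-D MAN/CHARLES CHAN,hero
AA2 35,comic
AVF 4,comic
AVF 5,comic
"""

EDGES_CSV = """hero,comic
24-HOUR MAN/EMMANUEL,AA2 35
3-D MAN/CHARLES CHAN,AVF 4
3-D MAN/CHARLES CHAN,AVF 5
"""


def load_rows(text):
    """Parse CSV text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text)))


nodes = load_rows(NODES_CSV)
edges = load_rows(EDGES_CSV)

# node name -> "hero" or "comic"
node_type = {row["node"]: row["type"] for row in nodes}

# hero -> list of comics, i.e. the outgoing APPEARS_IN relationships
appears_in = defaultdict(list)
for row in edges:
    appears_in[row["hero"]].append(row["comic"])

for hero, comics in appears_in.items():
    print(f"(:Hero {{id: '{hero}'}}) -[:APPEARS_IN]-> {comics}")
```

Each key of `appears_in` corresponds to one `:Hero` node, and each list entry to one `APPEARS_IN` relationship pointing at a `:Comic` node.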



# Data ingestion

In [22]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on storage solutions with Neo4j") \
    .master("spark://7239a1f7373c:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

In [16]:
# Build schema
# Import your module
from martindelcampo.spark_utils import SparkUtils 

base_path = "/opt/spark/work-dir/data/" 

schema_nodes = SparkUtils.generate_schema([
    ("node", "string"),  # id/name of the hero or comic
    ("type", "string")   # hero or comic
])

schema_edges = SparkUtils.generate_schema([
    ("hero", "string"),  # source column
    ("comic", "string")  # destination column
])

df_nodes = spark.read.schema(schema_nodes).option("header", True).csv(base_path + "marvel_network/nodes")
df_edges = spark.read.schema(schema_edges).option("header", True).csv(base_path + "marvel_network/edges")

df_nodes.show(5)
df_edges.show(5)   


                                                                                

+--------------------+-----+
|                node| type|
+--------------------+-----+
|             2001 10|comic|
|              2001 8|comic|
|              2001 9|comic|
|24-HOUR MAN/EMMANUEL| hero|
|3-D MAN/CHARLES CHAN| hero|
+--------------------+-----+
only showing top 5 rows
+--------------------+------+
|                hero| comic|
+--------------------+------+
|24-HOUR MAN/EMMANUEL|AA2 35|
|3-D MAN/CHARLES CHAN| AVF 4|
|3-D MAN/CHARLES CHAN| AVF 5|
|3-D MAN/CHARLES CHAN| COC 1|
|3-D MAN/CHARLES CHAN|H2 251|
+--------------------+------+
only showing top 5 rows


# Transformations

In [17]:
# Add the code for your transformations to create nodes and edges DataFrames HERE
# These transformations aren't strictly necessary for this dataset, because edges.csv
# already contains the source and destination columns (it's a very straightforward dataset).
# I do them anyway to show how to create the nodes and edges DataFrames.
from pyspark.sql.functions import col

# Hero Nodes
hero_nodes = df_nodes.where(col("type") == "hero").select(
    col("node").alias("id"),      
    col("node").alias("name")     
).dropDuplicates(["id"])

# Comic Nodes
comic_nodes = df_nodes.where(col("type") == "comic").select(
    col("node").alias("id"),      
    col("node").alias("title")   
).dropDuplicates(["id"])

# APPEARS_IN: relationships between Hero -> Comic
appears_in_edges = df_edges.select(
    col("hero").alias("src"),   # source node (Hero)
    col("comic").alias("dst")   # destination node (Comic)
)
appears_in_edges.show(truncate=False)

+--------------------+--------+
|src                 |dst     |
+--------------------+--------+
|24-HOUR MAN/EMMANUEL|AA2 35  |
|3-D MAN/CHARLES CHAN|AVF 4   |
|3-D MAN/CHARLES CHAN|AVF 5   |
|3-D MAN/CHARLES CHAN|COC 1   |
|3-D MAN/CHARLES CHAN|H2 251  |
|3-D MAN/CHARLES CHAN|H2 252  |
|3-D MAN/CHARLES CHAN|M/PRM 35|
|3-D MAN/CHARLES CHAN|M/PRM 36|
|3-D MAN/CHARLES CHAN|M/PRM 37|
|3-D MAN/CHARLES CHAN|WI? 9   |
|4-D MAN/MERCURIO    |CA3 36  |
|4-D MAN/MERCURIO    |CM 51   |
|4-D MAN/MERCURIO    |Q 14    |
|4-D MAN/MERCURIO    |Q 16    |
|4-D MAN/MERCURIO    |T 208   |
|4-D MAN/MERCURIO    |T 214   |
|4-D MAN/MERCURIO    |T 215   |
|4-D MAN/MERCURIO    |T 216   |
|4-D MAN/MERCURIO    |T 440   |
|8-BALL/             |SLEEP 1 |
+--------------------+--------+
only showing top 20 rows
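
Because the relationship write in this notebook uses the `match` save mode for both endpoints, my understanding is that an edge whose hero or comic is missing from the node DataFrames would produce no relationship instead of raising an error. The sketch below shows the kind of integrity check I have in mind, in plain Python on a few sample rows. The function `find_dangling_edges` and the sample data are hypothetical, and `AVF 5` is deliberately left out of the comic set to trigger the check.

```python
def find_dangling_edges(node_ids_by_type, edges):
    """Return the (src, dst) pairs whose source is not a known hero
    or whose destination is not a known comic."""
    heroes = node_ids_by_type["hero"]
    comics = node_ids_by_type["comic"]
    return [
        (src, dst)
        for src, dst in edges
        if src not in heroes or dst not in comics
    ]


# Sample ids taken from the previews above.
node_ids = {
    "hero": {"24-HOUR MAN/EMMANUEL", "3-D MAN/CHARLES CHAN"},
    "comic": {"AA2 35", "AVF 4"},  # "AVF 5" intentionally missing
}
sample_edges = [
    ("24-HOUR MAN/EMMANUEL", "AA2 35"),
    ("3-D MAN/CHARLES CHAN", "AVF 4"),
    ("3-D MAN/CHARLES CHAN", "AVF 5"),
]

dangling = find_dangling_edges(node_ids, sample_edges)
print(dangling)
```

On the real DataFrames, a similar check could be expressed as an anti join, for example `appears_in_edges.join(comic_nodes, appears_in_edges.dst == comic_nodes.id, "left_anti")`, which keeps only the edges with no matching comic.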


# Writing Data in Neo4j

In [18]:
# Add the code to write a graph from PySpark's DataFrames to Neo4j
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

# 1) Write the hero nodes
hero_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Hero") \
  .option("node.keys", "id") \
  .save()

print(f"{hero_nodes.count()} Hero nodes wrote in Neo4j")


# 2) Write the comic nodes
comic_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Comic") \
  .option("node.keys", "id") \
  .save()

print(f"{comic_nodes.count()} Comic nodes wrote in Neo4j")


# 3) Write the APPEARS_IN relationships
appears_in_edges.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "APPEARS_IN") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Hero") \
  .option("relationship.source.save.mode", "match") \
  .option("relationship.source.node.keys", "src:id") \
  .option("relationship.target.labels", ":Comic") \
  .option("relationship.target.save.mode", "match") \
  .option("relationship.target.node.keys", "dst:id") \
  .save()

print(f"{appears_in_edges.count()} APPEARS_IN edges wrote in Neo4j")


                                                                                

6439 Hero nodes wrote in Neo4j
12651 Comic nodes wrote in Neo4j
65058 APPEARS_IN edges wrote in Neo4j
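
The three writes above repeat the same connection options. As a possible refactor, the sketch below collects the options into plain dictionaries. The helper names (`neo4j_base_options`, `node_options`, `relationship_options`) are my own and not part of the connector; the option keys are the ones already used in this notebook, and the password is a placeholder.

```python
def neo4j_base_options(url, user, password):
    """Connection options shared by every read and write."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def node_options(base, label, key):
    """Options for writing one node label, keyed on a single property."""
    return {**base, "labels": f":{label}", "node.keys": key}


def relationship_options(base, rel_type, source, target):
    """Options for a keys-strategy relationship write.

    source and target are (label, dataframe_column, node_property) tuples.
    """
    src_label, src_col, src_prop = source
    dst_label, dst_col, dst_prop = target
    return {
        **base,
        "relationship": rel_type,
        "relationship.save.strategy": "keys",
        "relationship.source.labels": f":{src_label}",
        "relationship.source.save.mode": "match",
        "relationship.source.node.keys": f"{src_col}:{src_prop}",
        "relationship.target.labels": f":{dst_label}",
        "relationship.target.save.mode": "match",
        "relationship.target.node.keys": f"{dst_col}:{dst_prop}",
    }


# Placeholder password: substitute the real credentials at run time.
base = neo4j_base_options("bolt://neo4j-iteso:7687", "neo4j", "<password>")
hero_opts = node_options(base, "Hero", "id")
comic_opts = node_options(base, "Comic", "id")
rel_opts = relationship_options(
    base, "APPEARS_IN", ("Hero", "src", "id"), ("Comic", "dst", "id")
)
print(rel_opts["relationship.source.node.keys"])
```

With a live Spark session, a write would then look like `hero_nodes.write.format("org.neo4j.spark.DataSource").mode("Overwrite").options(**hero_opts).save()`, since `DataFrameWriter.options` accepts keyword arguments.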


# Read and Query Graphs with PySpark

In [23]:
# Read a DataFrame from Neo4j and run a simple query to verify that the graph was written
heroes_in_w2_10 = spark.read \
  .format("org.neo4j.spark.DataSource") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("query", """
      MATCH (h:Hero)-[:APPEARS_IN]->(c:Comic)
      WHERE c.id = 'W2 10'
      RETURN h.id AS hero
      ORDER BY hero
  """) \
  .load()

print("Heroes that appear in comic 'W2 10':")
heroes_in_w2_10.show(truncate=False)

Heroes that appear in comic 'W2 10':


+--------------------+
|hero                |
+--------------------+
|MCCABE, LINDSAY     |
|SPIDER-WOMAN/JESSICA|
|TAI                 |
|WOLVERINE/LOGAN     |
+--------------------+
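
To spell out what the Cypher query computes, here is a rough plain-Python equivalent over an edge list: keep the edges whose destination is the requested comic, return the source heroes, and sort them. The function `heroes_in_comic` is hypothetical, and the sample edges are limited to rows visible in this notebook's outputs.

```python
def heroes_in_comic(edges, comic_id):
    """Sorted hero ids that have an edge pointing at comic_id."""
    return sorted(src for src, dst in edges if dst == comic_id)


# (hero, comic) pairs taken from the outputs shown in this notebook.
sample_edges = [
    ("WOLVERINE/LOGAN", "W2 10"),
    ("TAI", "W2 10"),
    ("3-D MAN/CHARLES CHAN", "AVF 4"),
    ("SPIDER-WOMAN/JESSICA", "W2 10"),
    ("MCCABE, LINDSAY", "W2 10"),
]

result = heroes_in_comic(sample_edges, "W2 10")
print(result)
```

The same filter could also be run on the source DataFrame without Neo4j, for example `appears_in_edges.where(col("dst") == "W2 10").select("src").orderBy("src")`, which gives a way to cross-check the result read back from the graph.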



                                                                                

In [24]:
sc.stop()