# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: José Juan Díaz Campos

**Professor**: Pablo Camarillo Ramirez

# Dataset description

This dataset contains the social network of Star Wars characters extracted from movie scripts.

**Nodes (Characters):**
- name: Character name (e.g., "DARTH VADER", "LUKE")
- value: Number of scenes the character appeared in
- colour: Color for visualization

**Edges (Interactions):**
- source: Zero-based index, into the node list, of the first character in the interaction
- target: Zero-based index, into the node list, of the second character in the interaction
- value: Number of scenes where both characters appeared together

The graph represents relationships where two characters speak together within the same scene.
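
As a minimal sketch of this layout, the snippet below builds a tiny in-memory sample with the same `nodes` and `links` keys and resolves each link's indices to character names. The names, scene counts, and colours are taken from the output shown later in this notebook. The index positions are an assumption based on the order in which those rows are displayed, and `describe_links` is a hypothetical helper, not part of the lab code.

```python
# Illustrative sample that mirrors the dataset layout (not the full file).
# Index positions are assumed from the row order shown in the notebook output.
sample = {
    "nodes": [
        {"name": "QUI-GON", "value": 61, "colour": "#4f4fb1"},
        {"name": "NUTE GUNRAY", "value": 24, "colour": "#808080"},
        {"name": "PK-4", "value": 3, "colour": "#808080"},
        {"name": "TC-14", "value": 4, "colour": "#808080"},
    ],
    "links": [
        {"source": 1, "target": 0, "value": 1},
        {"source": 2, "target": 3, "value": 1},
    ],
}


def describe_links(graph):
    """Resolve each link's source/target indices to character names."""
    names = [node["name"] for node in graph["nodes"]]
    return [
        (names[link["source"]], names[link["target"]], link["value"])
        for link in graph["links"]
    ]


for src, dst, scenes in describe_links(sample):
    print(f"{src} <-> {dst}: {scenes} shared scene(s)")
```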


# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Star Wars Network") \
    .master("local[*]") \
    .config("spark.jars.packages", 
            "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4b6753ef-94b3-4e8d-b016-4f9263d14f0e;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

In [2]:
# Load the interaction graph from JSON and build the node and edge DataFrames

import json
from jjodiaz.spark_utils import SparkUtils
from pyspark.sql.functions import col
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"
neo4j_url = "bolt://neo4j-iteso:7687"


with open("/opt/spark/work-dir/data/josejuandiaz-dataset/starwars-full-interactions.json", "r") as f:
    data = json.load(f)

nodes_data = data["nodes"]
nodes_df = spark.createDataFrame(nodes_data)

links_data = data["links"]
edges_df = spark.createDataFrame(links_data)

print("Links loaded:")


Links loaded:


# Transformations

In [3]:
# Use the character name as the node key and rename "colour" to "color"
nodes_transformed = nodes_df.withColumn("id", col("name")) \
    .select("id", "name", "value", col("colour").alias("color"))

print("Transformed nodes:")
nodes_transformed.show(5)
nodes_transformed.printSchema()

# Links refer to characters by position, so build a position -> name lookup.
# This relies on collect() returning rows in the same order as the JSON file.
nodes_list = [row.name for row in nodes_df.collect()]

edges_schema = SparkUtils.generate_schema([
    ("src", "string"),
    ("dst", "string"),
    ("weight", "int")
])

# Replace the numeric source/target indices with character names
edges_data_transformed = []
for row in edges_df.collect():
    edges_data_transformed.append({
        "src": nodes_list[row.source],
        "dst": nodes_list[row.target],
        "weight": row.value
    })

edges_transformed = spark.createDataFrame(edges_data_transformed, schema=edges_schema)

print("\nTransformed edges:")
edges_transformed.show(10)
edges_transformed.printSchema()

print(f"\nTotal nodes: {nodes_transformed.count()}")
print(f"Total relationships: {edges_transformed.count()}")


Transformed nodes:

+-----------+-----------+-----+-------+
|         id|       name|value|  color|
+-----------+-----------+-----+-------+
|    QUI-GON|    QUI-GON|   61|#4f4fb1|
|NUTE GUNRAY|NUTE GUNRAY|   24|#808080|
|       PK-4|       PK-4|    3|#808080|
|      TC-14|      TC-14|    4|#808080|
|    OBI-WAN|    OBI-WAN|  147|#48D1CC|
+-----------+-----------+-----+-------+
only showing top 5 rows
root
 |-- id: string (nullable = true)
 |-- name: string (nullable = true)
 |-- value: long (nullable = true)
 |-- color: string (nullable = true)


Transformed edges:
+-----------+-----------+------+
|        src|        dst|weight|
+-----------+-----------+------+
|NUTE GUNRAY|    QUI-GON|     1|
|       PK-4|      TC-14|     1|
|    OBI-WAN|      TC-14|     1|
|    QUI-GON|      TC-14|     1|
|    OBI-WAN|    QUI-GON|    26|
|NUTE GUNRAY|      TC-14|     1|
|     DOFINE|NUTE GUNRAY|     1|
|     DOFINE|      TC-14|     1|
|NUTE GUNRAY|       RUNE|     8|
|       RUNE|    TEY HOW|     2|
+-----------+----
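
The transformation above depends on two properties of the input: character names must be unique, because `name` is copied into `id` and later used as the node key, and every `source`/`target` value must be a valid position in the node list. The notebook does not check either one. The following is a minimal sketch of such a check in plain Python; `check_graph` is a hypothetical helper, and the sample data at the bottom is made up to trigger both kinds of problem.

```python
from collections import Counter


def check_graph(nodes, links):
    """Return a list of problems found; an empty list means both checks passed."""
    problems = []

    # Node names double as ids, so duplicates would collapse into one node.
    counts = Counter(node["name"] for node in nodes)
    for name, count in counts.items():
        if count > 1:
            problems.append(f"duplicate node name: {name!r} ({count} times)")

    # Every link endpoint must point at an existing position in the node list.
    for i, link in enumerate(links):
        for end in ("source", "target"):
            idx = link[end]
            if not isinstance(idx, int) or not 0 <= idx < len(nodes):
                problems.append(f"link {i}: {end} index {idx!r} is out of range")

    return problems


# Made-up data: one duplicated name and one out-of-range target index.
bad_nodes = [{"name": "LUKE"}, {"name": "LUKE"}, {"name": "LEIA"}]
bad_links = [{"source": 0, "target": 2}, {"source": 1, "target": 7}]
for problem in check_graph(bad_nodes, bad_links):
    print(problem)
```

In the notebook this could be called as `check_graph(nodes_data, links_data)` right after the JSON file is loaded, before any DataFrame is created.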

# Writing Data in Neo4j

In [4]:
nodes_transformed.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("labels", ":Character") \
    .option("node.keys", "id") \
    .save()
edges_transformed.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("relationship", "INTERACTS_WITH") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Character") \
    .option("relationship.source.save.mode", "match") \
    .option("relationship.source.node.keys", "src:id") \
    .option("relationship.target.labels", ":Character") \
    .option("relationship.target.save.mode", "match") \
    .option("relationship.target.node.keys", "dst:id") \
    .save()
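
Both writes repeat the same three connection options. One way to reduce the repetition is to collect them in a dictionary and pass it with `options(**...)`, which PySpark's reader and writer both accept. The sketch below is an assumption about how that could look, not the lab's required approach: `neo4j_options` is a hypothetical helper and `NEO4J_PASSWORD` is a hypothetical environment variable name used so the password does not have to appear in the notebook.

```python
import os


def neo4j_options(url, user, password):
    """Build the connection options shared by every Neo4j read and write."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


# NEO4J_PASSWORD is a hypothetical variable name; falls back to an empty string.
common = neo4j_options(
    url="bolt://neo4j-iteso:7687",
    user="neo4j",
    password=os.environ.get("NEO4J_PASSWORD", ""),
)

# In the notebook, the node write could then be expressed as:
# nodes_transformed.write \
#     .format("org.neo4j.spark.DataSource") \
#     .mode("Overwrite") \
#     .options(**common) \
#     .option("labels", ":Character") \
#     .option("node.keys", "id") \
#     .save()
print(sorted(common))
```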

                                                                                

# Read and Query Graphs with PySpark

In [5]:
# Read the relationships back from Neo4j and list the pairs with the most shared scenes
query = """
MATCH (c:Character)-[r:INTERACTS_WITH]->(c2:Character)
RETURN c.name as character1, c2.name as character2, r.weight as scenes
ORDER BY r.weight DESC
"""

result_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query", query) \
    .load()

print("Top 10 interactions:")
result_df.limit(10).show(truncate=False)

Top 10 interactions:
+----------+----------+------+
|character1|character2|scenes|
+----------+----------+------+
|HAN       |LEIA      |69    |
|HAN       |LEIA      |69    |
|HAN       |LEIA      |69    |
|C-3PO     |HAN       |54    |
|C-3PO     |HAN       |54    |
|C-3PO     |HAN       |54    |
|C-3PO     |LEIA      |49    |
|C-3PO     |LEIA      |49    |
|C-3PO     |LEIA      |49    |
|ANAKIN    |OBI-WAN   |46    |
+----------+----------+------+
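
In the output above, each of the first three pairs appears three times with the same scene count. The notebook does not explain why. One possibility, stated here only as an assumption, is that the relationship write ran more than once in `Append` mode, which would leave parallel `INTERACTS_WITH` relationships between the same characters. Whatever the cause, identical rows can be collapsed before ranking. The sketch below shows the idea in plain Python on the rows printed above; `top_interactions` is a hypothetical helper.

```python
def top_interactions(rows, n=10):
    """Drop identical (character1, character2, scenes) rows, then rank by scenes."""
    unique = list(dict.fromkeys(rows))  # keeps the first occurrence of each row
    return sorted(unique, key=lambda row: row[2], reverse=True)[:n]


# Rows copied from the query output shown in the notebook.
rows = (
    [("HAN", "LEIA", 69)] * 3
    + [("C-3PO", "HAN", 54)] * 3
    + [("C-3PO", "LEIA", 49)] * 3
    + [("ANAKIN", "OBI-WAN", 46)]
)

for character1, character2, scenes in top_interactions(rows):
    print(character1, character2, scenes)

# A Spark equivalent on the DataFrame read from Neo4j could be:
# result_df.dropDuplicates().orderBy(col("scenes").desc()).limit(10).show(truncate=False)
```

Deduplicating at read time only hides the repeated relationships; if they are unwanted in the graph itself, they would have to be removed or avoided on the Neo4j side.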



In [6]:
sc.stop()