# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Antonia Horburger

**Professor**: Pablo Camarillo Ramirez

# Dataset description

Breaking Bad: https://www.kaggle.com/datasets/jishnukoliyadan/breaking-bad-network-analysis

This dataset was collected to perform a network analysis of the television series Breaking Bad.

The data was obtained by web scraping the Breaking Bad Fandom wiki. The main goal is to build a graph in which:

- **Nodes** represent the characters of the series. The list of characters is in the file `character_df.csv` (and its cleaned version, `character_df_cleaned.csv`).
- **Edges** represent co-occurrence relationships between characters, that is, two characters are linked when they are mentioned in the same paragraph of the episode summaries (`Season_1.txt` through `Season_5B.txt`). Each relationship carries a weight that indicates how many times the two characters co-occur in the narrative.
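
The weighting rule described above can be illustrated with a small pure-Python sketch. It is only an illustration under stated assumptions: the paragraphs are made up (they are not taken from the dataset), the helper name `cooccurrence_weights` is hypothetical, and mentions are detected with a simple case-insensitive substring test.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(paragraphs, characters):
    """Count how often each pair of characters is mentioned in the same paragraph."""
    weights = Counter()
    for paragraph in paragraphs:
        text = paragraph.lower()
        # Characters mentioned in this paragraph (case-insensitive substring test).
        present = sorted(name for name in characters if name.lower() in text)
        # Every unordered pair found in the paragraph adds 1 to that edge's weight.
        for src, dst in combinations(present, 2):
            weights[(src, dst)] += 1
    return weights


# Hypothetical paragraphs, not taken from the dataset.
paragraphs = [
    "Walter White and Jesse Pinkman cook in the RV.",
    "Jesse Pinkman calls Walter White; Skyler White overhears.",
    "Skyler White talks to Marie Schrader.",
]
characters = ["Walter White", "Jesse Pinkman", "Skyler White", "Marie Schrader"]

edges = cooccurrence_weights(paragraphs, characters)
for (src, dst), weight in sorted(edges.items()):
    print(src, "-", dst, weight)
```

Each pair is stored once with the alphabetically smaller name first, so the graph is treated as undirected and a pair mentioned in two paragraphs gets weight 2.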

# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
spark.conf.set("spark.sql.shuffle.partitions", "5")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-edd67821-b92c-485d-942e-c85306ddf416;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

In [2]:
from antoniahorburger.spark_utils import SparkUtils

characters_schema = SparkUtils.generate_schema([
    ("Season", "string"),
    ("Characters", "string"),
])

In [3]:
# Data ingestion
from pyspark.sql import functions as F

ruta_characters = "/opt/spark/work-dir/data/breakingbad/character_df_cleaned.csv"
ruta_summaries = "/opt/spark/work-dir/data/breakingbad/summaries/*.txt"

# Read the characters CSV with the explicit schema
characters_df = (
    spark.read
         .option("header", True)
         .schema(characters_schema)
         .csv(ruta_characters)
)

characters_df.show(10, truncate=False)
characters_df.printSchema()

# Read the summaries as plain text
summaries_df = (
    spark.read
         .text(ruta_summaries)
         .withColumnRenamed("value", "line")
)

summaries_df.show(10, truncate=False)
summaries_df.printSchema()


+--------+----------------+
|Season  |Characters      |
+--------+----------------+
|Season_1|Walter White    |
|Season_1|Skyler White    |
|Season_1|Jesse Pinkman   |
|Season_1|Hank Schrader   |
|Season_1|Marie Schrader  |
|Season_1|Walter White Jr.|
|Season_1|Krazy-8         |
|Season_1|Emilio          |
|Season_1|Gomez           |
|Season_1|Bogdan          |
+--------+----------------+
only showing top 10 rows
root
 |-- Season: string (nullable = true)
 |-- Characters: string (nullable = true)


# Transformations

In [4]:
from pyspark.sql.functions import col, split, trim, when, size, element_at, lower, instr

# Character nodes

names_col = split(col("Characters"), r"(?i)\s+as\s+")
characters_clean = (
    characters_df
      .withColumn(
          "CharacterName",
          when(size(names_col) >= 2, element_at(names_col, 2))  # needed when the value contains " as "
          .otherwise(element_at(names_col, 1))
      )
      .withColumn("CharacterName", trim(col("CharacterName")))
      .dropna(subset=["CharacterName"])
      .dropDuplicates(["CharacterName"])
)

character_nodes = characters_clean.select(
    col("CharacterName").alias("id"),
    col("Season")
).dropDuplicates(["id"])

character_nodes.show(10, truncate=False)
character_nodes.printSchema()

# Edges

mentions = (
    summaries_df.alias("s")
      .join(
          characters_clean.alias("c"),
          instr(lower(col("s.line")), lower(col("c.CharacterName"))) > 0,
          "inner"
      )
      .select(col("s.line").alias("line"), col("c.CharacterName").alias("name"))
)

pairs = (
    mentions.alias("a")
      .join(
          mentions.alias("b"),
          (col("a.line") == col("b.line")) & (col("a.name") < col("b.name")),
          "inner"
      )
      .select(col("a.name").alias("src"), col("b.name").alias("dst"))
)

edges_df = pairs.groupBy("src", "dst").count().withColumnRenamed("count", "weight")

edges_df.show(10, truncate=False)
edges_df.printSchema()


+-------------------------+---------+
|id                       |Season   |
+-------------------------+---------+
|ABQ Detective            |Season_3 |
|APD Detective Tim Roberts|Season_2 |
|APD Officer              |Season_3 |
|ASAC George Merkert      |Season_2 |
|ASAC George Merket       |Season_5A|
|ASAC Ramey               |Season_5B|
|Addict                   |Season_2 |
|Agent Buddy              |Season_2 |
|Agent Van Oster          |Season_5B|
|Airport Traveler         |Season_5A|
+-------------------------+---------+
only showing top 10 rows
root
 |-- id: string (nullable = true)
 |-- Season: string (nullable = true)
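
The name clean-up in the cell above (split on `as`, then keep the second part when it exists) can be checked in plain Python. This is a minimal sketch, assuming that some raw values have the form `<something> as <character>`; the helper `character_name` and the sample value with an actor are hypothetical, and only the regular expression is taken from the notebook.

```python
import re

# Same pattern as the Spark split: the word "as" surrounded by whitespace, case-insensitive.
AS_PATTERN = re.compile(r"\s+as\s+", flags=re.IGNORECASE)


def character_name(raw):
    """Return the part after the first ' as ' if present, otherwise the whole value, trimmed."""
    parts = AS_PATTERN.split(raw)
    # parts[1] corresponds to element_at(names_col, 2), which is 1-indexed in Spark.
    name = parts[1] if len(parts) >= 2 else parts[0]
    return name.strip()


print(character_name("Walter White"))         # no " as ": the value is kept unchanged
print(character_name("Some Actor as Gomez"))  # hypothetical value: keeps the part after " as "
```

Because the pattern requires whitespace on both sides, names that merely contain the letters "as" are not split.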




+-------------+-----------+------+
|src          |dst        |weight|
+-------------+-----------+------+
|Ed           |Gonzo      |4     |
|Himself      |No-Doze    |1     |
|Doctor       |ER Doctor  |1     |
|Ed           |Gomez      |51    |
|Clovis       |Father     |1     |
|Father       |Gomez      |1     |
|DEA          |Father     |5     |
|Ed           |Skinny Pete|22    |
|Badger       |Combo      |6     |
|Carmen Molina|Student    |1     |
+-------------+-----------+------+
only showing top 10 rows
root
 |-- src: string (nullable = true)
 |-- dst: string (nullable = true)
 |-- weight: long (nullable = false)
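
The mention join above uses `instr(lower(line), lower(name)) > 0`, which is a plain substring test. Short names such as `Ed` therefore also match inside ordinary words, which may inflate some of the weights shown above (for example the `Ed` / `Gomez` pair); this is a guess that has not been verified against the summaries. The sketch below contrasts the substring test with a stricter whole-name test. Both helper names and the sample sentence are hypothetical.

```python
import re


def substring_match(line, name):
    """Mirror of instr(lower(line), lower(name)) > 0: plain substring test."""
    return name.lower() in line.lower()


def whole_word_match(line, name):
    """Stricter variant: the name must not be glued to other word characters."""
    pattern = r"(?<!\w)" + re.escape(name) + r"(?!\w)"
    return re.search(pattern, line, flags=re.IGNORECASE) is not None


line = "Jesse walked into the room."    # hypothetical sentence
print(substring_match(line, "Ed"))      # True: "ed" occurs inside "walked"
print(whole_word_match(line, "Ed"))     # False: there is no standalone "Ed"
print(whole_word_match(line, "Jesse"))  # True
```

Lookarounds are used instead of `\b` so that names ending in punctuation, such as `Walter White Jr.`, are still matched correctly.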



# Writing Data in Neo4j

In [5]:
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

# Character nodes
character_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Character") \
  .option("node.keys", "id") \
  .save()

print(f"{character_nodes.count()} character nodes written to Neo4j")

# Edges
edges_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "CO_OCCURS_WITH") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Character") \
  .option("relationship.source.save.mode", "match") \
  .option("relationship.source.node.keys", "src:id") \
  .option("relationship.target.labels", ":Character") \
  .option("relationship.target.save.mode", "match") \
  .option("relationship.target.node.keys", "dst:id") \
  .option("relationship.properties", "weight") \
  .save()

print(f"{edges_df.count()} co-occurrence edges written to Neo4j")

385 character nodes written to Neo4j

977 co-occurrence edges written to Neo4j
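
The same three connection options are repeated in every read and write in this notebook. One possible way to reduce the repetition is to collect the options in dictionaries, as sketched below. The helper names and the `NEO4J_PASSWORD` environment variable are assumptions made for this illustration; the option keys and values are the ones used in the cell above.

```python
import os


def neo4j_connection_options(url, user, password):
    """Connection options shared by every Neo4j read and write in this notebook."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def cooccurrence_relationship_options():
    """Relationship options used for the CO_OCCURS_WITH write."""
    return {
        "relationship": "CO_OCCURS_WITH",
        "relationship.save.strategy": "keys",
        "relationship.source.labels": ":Character",
        "relationship.source.save.mode": "match",
        "relationship.source.node.keys": "src:id",
        "relationship.target.labels": ":Character",
        "relationship.target.save.mode": "match",
        "relationship.target.node.keys": "dst:id",
        "relationship.properties": "weight",
    }


# NEO4J_PASSWORD is a hypothetical environment variable; the fallback is the lab's value.
conn_opts = neo4j_connection_options(
    "bolt://neo4j-iteso:7687",
    "neo4j",
    os.environ.get("NEO4J_PASSWORD", "neo4j@1234"),
)
edge_opts = {**conn_opts, **cooccurrence_relationship_options()}
print(sorted(edge_opts))
```

With a live session, the merged dictionary could then be passed in one call, for example `edges_df.write.format("org.neo4j.spark.DataSource").mode("Overwrite").options(**edge_opts).save()`.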


# Read and Query Graphs with PySpark

In [6]:
# Read nodes
characters_from_neo4j = spark.read \
  .format("org.neo4j.spark.DataSource") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Character") \
  .load()

characters_from_neo4j.select("id", "Season").show(10, truncate=False)

# Read relationships, specifying the source/target labels
edges_from_neo4j = spark.read \
  .format("org.neo4j.spark.DataSource") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "CO_OCCURS_WITH") \
  .option("relationship.source.labels", ":Character") \
  .option("relationship.target.labels", ":Character") \
  .load()

+-------------------------+---------+
|id                       |Season   |
+-------------------------+---------+
|ABQ Detective            |Season_3 |
|APD Detective Tim Roberts|Season_2 |
|APD Officer              |Season_3 |
|ASAC George Merkert      |Season_2 |
|ASAC George Merket       |Season_5A|
|ASAC Ramey               |Season_5B|
|Addict                   |Season_2 |
|Agent Buddy              |Season_2 |
|Agent Van Oster          |Season_5B|
|Airport Traveler         |Season_5A|
+-------------------------+---------+
only showing top 10 rows
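
The relationships loaded into `edges_from_neo4j` are not displayed or queried in the cell above. As an illustration of the kind of query the graph supports, the sketch below computes a weighted degree per character (the sum of the weights of all edges touching it). It runs on a handful of rows copied from the `edges_df` output shown earlier rather than on the Neo4j data, and the helper name `weighted_degree` is hypothetical.

```python
from collections import defaultdict

# Sample rows copied from the edges_df output earlier in this notebook: (src, dst, weight).
sample_edges = [
    ("Ed", "Gonzo", 4),
    ("Ed", "Gomez", 51),
    ("Ed", "Skinny Pete", 22),
    ("Badger", "Combo", 6),
    ("Father", "Gomez", 1),
]


def weighted_degree(edges):
    """Sum the weights of all edges touching each character (edges are undirected)."""
    totals = defaultdict(int)
    for src, dst, weight in edges:
        totals[src] += weight
        totals[dst] += weight
    return dict(totals)


degrees = weighted_degree(sample_edges)
# Three characters with the highest weighted degree in the sample.
top = sorted(degrees.items(), key=lambda item: item[1], reverse=True)[:3]
print(top)
```

On the full graph, the same aggregation would rank characters by how strongly they are connected to the rest of the cast.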


In [7]:
sc.stop()