# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Aura Melina Gutierrez Jimenez

**Professor**: Pablo Camarillo Ramirez

# Dataset description

The 'Spotify Artist Feature Collaboration Network' dataset consists of two main files: nodes.csv and edges.csv. The nodes.csv file describes each node (an artist), including its ID (the primary key), name, and other properties. The edges.csv file defines the COLLABORATES_WITH relationship between artists by giving the source and destination ID of each connection.
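
The way the two files fit together can be sketched in plain Python before bringing Spark in. This is only an illustration: the sample rows, the IDs `a1` and `b2`, and the helpers `load_nodes` and `load_edges` are invented for the example; only the column names come from the schemas used in this lab.

```python
import csv
import io

# Invented sample rows that mimic the column layout of nodes.csv and edges.csv.
NODES_CSV = """spotify_id,name,followers,popularity,genres,chart_hits
a1,Artist One,1000,50,"['pop']","['us (3)']"
b2,Artist Two,250,20,[],
"""

EDGES_CSV = """id_0,id_1
a1,b2
"""


def load_nodes(text):
    """Index every artist row by its primary key, spotify_id."""
    return {row["spotify_id"]: row for row in csv.DictReader(io.StringIO(text))}


def load_edges(text):
    """Return each collaboration as a (source id, destination id) pair."""
    return [(row["id_0"], row["id_1"]) for row in csv.DictReader(io.StringIO(text))]


nodes = load_nodes(NODES_CSV)
edges = load_edges(EDGES_CSV)

# Resolve each edge to artist names by looking both endpoints up in the node index.
collaborations = [(nodes[src]["name"], nodes[dst]["name"]) for src, dst in edges]
print(collaborations)
```

Each edge carries only two IDs, so every property of a collaboration's endpoints has to be looked up in the node file.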

# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Lab05") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-9034bb84-049b-4d6a-8bea-b48151a57ce9;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

In [2]:
from auragutierrez.spark_utils import SparkUtils
from pyspark.sql.functions import col

schema_nodes = SparkUtils.generate_schema([
    ("spotify_id", "string"),
    ("name", "string"),
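    # NOTE: followers prints as NULL in the output below, which suggests the raw
    # values do not parse as "long"; verify the column format in nodes.csv.
    ("followers", "long"),
    ("popularity", "int"),
    ("genres", "string"),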
    ("chart_hits", "string")
])

schema_edges = SparkUtils.generate_schema([
    ("id_0", "string"),
    ("id_1", "string")
])

df_nodes = spark.read \
    .schema(schema_nodes) \
    .option("header", True) \
    .csv("/opt/spark/work-dir/data/Spotify/nodes.csv")

# Read the edges file
df_edges = spark.read \
    .schema(schema_edges) \
    .option("header", True) \
    .csv("/opt/spark/work-dir/data/Spotify/edges.csv")

df_nodes.show(n=5)
df_edges.show(n=5)

                                                                                

+--------------------+------------------+---------+----------+--------------------+--------------------+
|          spotify_id|              name|followers|popularity|              genres|          chart_hits|
+--------------------+------------------+---------+----------+--------------------+--------------------+
|48WvrUGoijadXXCsG...|         Byklubben|     NULL|        24|['nordic house', ...|          ['no (3)']|
|4lDiJcOJ2GLCK6p9q...|          Kontra K|     NULL|        72|['christlicher ra...|['at (44)', 'de (...|
|652XIvIBNGg3C0KIG...|             Maxim|     NULL|        36|                  []|          ['de (1)']|
|3dXC1YPbnQPsfHPVk...|Christopher Martin|     NULL|        52|['dancehall', 'lo...|['at (1)', 'de (1)']|
|74terC9ol9zMo8rfz...|     Jakob Hellman|     NULL|        39|['classic swedish...|          ['se (6)']|
+--------------------+------------------+---------+----------+--------------------+--------------------+
only showing top 5 rows
+--------------------+---------
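
The `genres` and `chart_hits` columns are read as plain strings that look like Python lists, for example `['at (1)', 'de (1)']`. The following is a minimal sketch of how such values could be decoded in plain Python. The helpers `parse_list_column` and `parse_chart_hits` are hypothetical, and reading each `chart_hits` entry as a country code followed by a count in parentheses is an assumption based on the sample rows shown above.

```python
import ast
import re

# Matches entries such as "at (44)": a code followed by a number in parentheses.
_HIT_PATTERN = re.compile(r"^\s*(\w+)\s*\((\d+)\)\s*$")


def parse_list_column(raw):
    """Decode a list-like string such as "['a', 'b']" into a Python list."""
    if raw is None or raw.strip() == "":
        return []
    value = ast.literal_eval(raw)
    return list(value) if isinstance(value, (list, tuple)) else [value]


def parse_chart_hits(raw):
    """Turn "['at (1)', 'de (1)']" into {'at': 1, 'de': 1} (assumed meaning)."""
    hits = {}
    for entry in parse_list_column(raw):
        match = _HIT_PATTERN.match(str(entry))
        if match:
            hits[match.group(1)] = int(match.group(2))
    return hits


genres = parse_list_column("['vlaamse rap']")
hits = parse_chart_hits("['at (1)', 'de (1)']")
print(genres, hits)
```

Keeping the columns as strings, as this lab does, is enough for loading the graph; decoding only matters if later queries need to filter by genre or country.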

# Transformations

In [3]:
# Build the node and edge DataFrames in the shape expected by the Neo4j connector:
# nodes keyed by a unique "id", edges as (source, destination) id pairs
artist_nodes = df_nodes.select(
    col("spotify_id").alias("id"),
    "name",
    "followers",
    "popularity",
    "genres",
    "chart_hits"
).dropDuplicates(["id"])
artist_nodes.show(n=5)

artist_edges = df_edges.select(
    col("id_0").alias("source"),
    col("id_1").alias("destination")
)
artist_edges.show(n=5)

                                                                                

+--------------------+----------+---------+----------+---------------+----------+
|                  id|      name|followers|popularity|         genres|chart_hits|
+--------------------+----------+---------+----------+---------------+----------+
|000BblCiHJeKvtiq5...| 51 Koodia|     NULL|        29|             []|      NULL|
|000WMX8CCUlKyWxaO...| John Wang|     NULL|         6|             []|      NULL|
|0016RK4nfua0DSs6F...| ZwartWerk|     NULL|        15|['vlaamse rap']|['be (1)']|
|001K60xGRAvZTXy1q...|JC Carroll|     NULL|         6|             []|      NULL|
|001KkBGwhLqBrFXyh...|   Nyquest|     NULL|        29|             []|      NULL|
+--------------------+----------+---------+----------+---------------+----------+
only showing top 5 rows
+--------------------+--------------------+
|              source|         destination|
+--------------------+--------------------+
|76M2Ekj8bG8W7X2nb...|7sfl4Xt5KmfyDs2T3...|
|0hk4xVujcyOr6USD9...|7Do8se3ZoaVqUt3wo...|
|38jpuy3yt3QIxQ8Fn
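
Since the edges are later attached to existing `:Artist` nodes by key, it can be useful to check that every `source` and `destination` actually appears among the node IDs, and to decide whether `(a, b)` and `(b, a)` should count as the same collaboration. The sketch below shows both checks in plain Python on invented IDs; `find_dangling_edges` and `canonical_edges` are hypothetical helpers, not part of this lab's code.

```python
def find_dangling_edges(node_ids, edges):
    """Return the edges whose source or destination is not a known node id."""
    known = set(node_ids)
    return [(src, dst) for src, dst in edges if src not in known or dst not in known]


def canonical_edges(edges):
    """Treat edges as undirected: drop self-loops and reversed duplicates."""
    seen = set()
    unique = []
    for src, dst in edges:
        if src == dst:
            continue  # an artist collaborating with themselves carries no information
        key = (src, dst) if src < dst else (dst, src)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


# Invented ids for illustration only.
sample_nodes = ["a", "b", "c"]
sample_edges = [("a", "b"), ("b", "a"), ("c", "c"), ("a", "z")]

dangling = find_dangling_edges(sample_nodes, sample_edges)
deduplicated = canonical_edges(sample_edges)
print(dangling, deduplicated)
```

On the real DataFrames the same dangling-edge check could be expressed as a `left_anti` join of `artist_edges` against `artist_nodes`; whether deduplication is wanted depends on how the collaboration direction is meant to be interpreted.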

# Writing Data in Neo4j

In [5]:
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

artist_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Artist") \
  .option("node.keys", "id") \
  .save()

print(f"{artist_nodes.count()} artists written to Neo4j")



156320 artists written to Neo4j


                                                                                

In [4]:
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

artist_edges.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Append") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("relationship", "COLLABORATES_WITH") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Artist") \
    .option("relationship.source.node.keys", "source:id") \
    .option("relationship.target.labels", ":Artist") \
    .option("relationship.target.node.keys", "destination:id") \
    .option("batch.size", "20000") \
    .save()

print(f"{artist_edges.count()} relationships written to Neo4j")

                                                                                

300386 relationships written to Neo4j
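
Both write cells repeat the same URL and credentials. One possible way to keep them in a single place is a small helper that returns the shared connector options as a dictionary. This is a sketch under assumptions: `neo4j_options` is a hypothetical helper, and the `NEO4J_PASSWORD` environment variable is an invented name, not something this lab's environment is known to define.

```python
import os


def neo4j_options(url="bolt://neo4j-iteso:7687", user="neo4j", password=None):
    """Build the connection options shared by every Neo4j read and write."""
    if password is None:
        # Assumed variable name; falls back to an empty string when unset.
        password = os.environ.get("NEO4J_PASSWORD", "")
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


opts = neo4j_options(password="example-password")
print(sorted(opts))
```

The dictionary can then be passed with `.options(**neo4j_options())` on a reader or writer, followed by the cell-specific `.option(...)` calls such as `labels` or `relationship`.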


                                                                                

# Read and Query Graphs with PySpark

In [10]:
query_path = """
// Query: find the shortest collaboration path (up to 4 hops)
MATCH (a:Artist {name: 'Eminem'}), 
      (b:Artist {name: 'Rihanna'}),
      p = shortestPath((a)-[:COLLABORATES_WITH*..4]-(b))
RETURN a.name AS ArtistA, 
       b.name AS ArtistB, 
       length(p) AS PathLength, 
       p
"""

cypher_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query", query_path) \
    .load()

print("\n--- Query Result ---")
cypher_df.show(truncate=False)


--- Query Result ---
+-------+-------+----------+----------------------------------------------------+
|ArtistA|ArtistB|PathLength|p                                                   |
+-------+-------+----------+----------------------------------------------------+
|Eminem |Rihanna|1         |"path[(152577)<-[189996:COLLABORATES_WITH]-(58329)]"|
+-------+-------+----------+----------------------------------------------------+
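
A `PathLength` of 1 means the two artists are joined by a single COLLABORATES_WITH relationship. To illustrate what a hop-limited shortest path search computes, here is a minimal breadth-first search in plain Python over an undirected adjacency map. It is a conceptual sketch only, not how Neo4j implements `shortestPath`; the node names and the helpers `build_adjacency` and `shortest_path` are invented for the example.

```python
from collections import deque


def build_adjacency(edges):
    """Build an undirected adjacency map from (source, destination) pairs."""
    adjacency = {}
    for src, dst in edges:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set()).add(src)
    return adjacency


def shortest_path(adjacency, start, goal, max_hops=4):
    """Return the shortest node list from start to goal, or None if it needs more than max_hops."""
    if start == goal:
        return [start]
    queue = deque([[start]])
    visited = {start}
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_hops:
            continue  # extending this path would exceed the hop limit
        for neighbor in adjacency.get(path[-1], ()):
            if neighbor in visited:
                continue
            new_path = path + [neighbor]
            if neighbor == goal:
                return new_path
            visited.add(neighbor)
            queue.append(new_path)
    return None


# Invented chain A - B - C - D - E - F for illustration.
graph = build_adjacency([("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")])
within_limit = shortest_path(graph, "A", "E")
beyond_limit = shortest_path(graph, "A", "F")
print(within_limit, beyond_limit)
```

Breadth-first search visits nodes in order of distance, so the first time the goal is reached the path is a shortest one; the hop limit plays the role of the `*..4` bound in the Cypher pattern.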



In [4]:
sc.stop()