# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Jaime Enrique Galindo Villegas

**Professor**: Pablo Camarillo Ramirez

# Dataset description

Select a dataset in any public repository (e.g. Kaggle or Network Repository) containing relationships. In this section of the Notebook, you need to describe the nodes and edges of the graph (one column representing the source and another column representing the destination of the relationships).

My dataset is "Spotify Dataset 2023", obtained from Kaggle at the following link: [Spotify Dataset 2023](https://www.kaggle.com/datasets/tonygordonjr/spotify-dataset-2023?select=spotify_data_12_20_2023.csv)

This dataset collects information about Spotify tracks, artists and albums, which makes it possible to relate these entities: which artists wrote a track and which tracks belong to an album.

The dataset consists of several files. The main ones are: <br>
For the attributes of the artist, track and album nodes:
- spotify_artist_data_2023
- spotify_tracks_data_2023
- spotify-albums_data_2023

The artist-album and album-track relationships can be extracted from:
- spotify_data_12_20_2023 
<br>
(All the node attributes can also be extracted from this last file.)
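
To make the graph model concrete, here is a minimal sketch in plain Python of how flattened rows (one per track) can be split into node sets and edge lists with a source and a destination column. The sample rows and their ids are made up for illustration; only the column names `artist_id`, `album_id` and `track_id` and the relationship names `CONTAINS` and `PERFORMED` come from the rest of this notebook.

```python
# Hypothetical sample rows; the real file has one row per track with many more columns.
rows = [
    {"artist_id": "ar1", "album_id": "al1", "track_id": "t1"},
    {"artist_id": "ar1", "album_id": "al1", "track_id": "t2"},
    {"artist_id": "ar2", "album_id": "al2", "track_id": "t3"},
]

def split_graph(rows):
    """Return node id lists per label and (src, dst) edge lists per relationship type."""
    nodes = {
        "Artist": sorted({r["artist_id"] for r in rows}),
        "Album": sorted({r["album_id"] for r in rows}),
        "Track": sorted({r["track_id"] for r in rows}),
    }
    edges = {
        # Album -> Track: the album is the source, the track is the destination.
        "CONTAINS": sorted({(r["album_id"], r["track_id"]) for r in rows}),
        # Artist -> Track: the artist is the source, the track is the destination.
        "PERFORMED": sorted({(r["artist_id"], r["track_id"]) for r in rows}),
    }
    return nodes, edges

nodes, edges = split_graph(rows)
print(nodes["Album"])
print(edges["CONTAINS"])
```

The sets remove repeated ids, which plays the same role as the `dropDuplicates` calls used in the transformations.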


# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")
# Optimization (reduce the number of shuffle partitions)
spark.conf.set("spark.sql.shuffle.partitions", "5")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-4d31339c-3ba9-4fcb-8f2f-5ce2438e8cb8;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

In [2]:
# Build schemas
# Import your module
from jaime_galindo.spark_utils import SparkUtils

# Artist schema:
artists_schema = SparkUtils.generate_schema([
('artist_id', 'string'),
('name', 'string'),
('artist_popularity', 'int'),
('followers', 'int')
])


# Tracks schema:
tracks_schema = SparkUtils.generate_schema([
('track_id', 'string'),
('track_name', 'string'),
('track_href', 'string'),
('track_popularity', 'int'),
('explicit', 'boolean'),
('duration_ms', 'int')
])

# Album schema:
albums_schema = SparkUtils.generate_schema([
('album_id', 'string'),
('album_name', 'string'),
('release_date', 'timestamp'),
('album_popularity', 'int'),
('album_type', 'string'),
('label', 'string'),
('total_tracks', 'int')
])

# albums -> tracks
albums_tracks_schema = SparkUtils.generate_schema([
('album_id', 'string'),
('track_id', 'string'),
('track_number', 'int')
])

# artists -> tracks
artists_tracks_schema = SparkUtils.generate_schema([
('artist_id', 'string'),
('track_id', 'string')
])




# Transformations

### Nodes

In [3]:
# Add the code for your transformations to create nodes and edges DataFrames HERE
from pyspark.sql.functions import col, trim, lower

# Read the complete data file and build a single DataFrame from it
full_df = spark.read \
    .option("header", True) \
    .option("escape", '"') \
    .csv("/opt/spark/work-dir/data/spotify_data/spotify_data_12_20_2023.csv")


# artists --------------------------------------------------------------------------------------------
# Select the columns needed from the full dataset
column_names = [field.name for field in artists_schema.fields]
df_temporal = full_df.select(column_names)

# Apply the schema to this new DataFrame
for field in artists_schema.fields:
    df_temporal = df_temporal.withColumn(
        field.name,
        trim(col(field.name)).cast(field.dataType)
    )

# Normalize the key column
df_norm = df_temporal.withColumn(
    "artist_id",
    lower(col("artist_id"))
)
# Final artists DataFrame, without duplicates
artists_df = df_norm.dropDuplicates(["artist_id"])

# tracks ---------------------------------------------------------------------------------------------
# Select the columns needed from the full dataset
column_names = [field.name for field in tracks_schema.fields]
df_temporal = full_df.select(column_names)

# Apply the schema to this new DataFrame
for field in tracks_schema.fields:
    df_temporal = df_temporal.withColumn(
        field.name,
        trim(col(field.name)).cast(field.dataType)
    )

# Normalize the key column
df_norm = df_temporal.withColumn(
    "track_id",
    lower(col("track_id"))
)

# Final tracks DataFrame, without duplicates
tracks_df = df_norm.dropDuplicates(["track_id"])


# albums ---------------------------------------------------------------------------------------------
# Select the columns needed from the full dataset
column_names = [field.name for field in albums_schema.fields]
df_temporal = full_df.select(column_names)

# Apply the schema to this new DataFrame
for field in albums_schema.fields:
    df_temporal = df_temporal.withColumn(
        field.name,
        trim(col(field.name)).cast(field.dataType)
    )

# Normalize the key column
df_norm = df_temporal.withColumn(
    "album_id",
    lower(col("album_id"))
)

# Final albums DataFrame, without duplicates
albums_df = df_norm.dropDuplicates(["album_id"])



### Edges

In [4]:
# albums -> tracks --------------------------------------------------------------------------------------------
# Select the columns needed from the full dataset
column_names = [field.name for field in albums_tracks_schema.fields]
df_temporal = full_df.select(column_names)

# Apply the schema to this new DataFrame
for field in albums_tracks_schema.fields:
    df_temporal = df_temporal.withColumn(
        field.name,
        trim(col(field.name)).cast(field.dataType)
    )

# Normalize the key columns
df_norm = df_temporal \
    .withColumn("album_id", lower(col("album_id"))) \
    .withColumn("track_id", lower(col("track_id")))

# Final DataFrame without duplicated tracks: a track cannot belong to more than one album
albums_tracks_df = df_norm.dropDuplicates(["track_id"])

# Rename columns for Neo4j
albums_tracks_df = albums_tracks_df \
    .withColumnRenamed("album_id", "src") \
    .withColumnRenamed("track_id", "dst")


# artists -> tracks --------------------------------------------------------------------------------------------
# Select the columns needed from the full dataset
column_names = [field.name for field in artists_tracks_schema.fields]
df_temporal = full_df.select(column_names)

# Apply the schema to this new DataFrame
for field in artists_tracks_schema.fields:
    df_temporal = df_temporal.withColumn(
        field.name,
        trim(col(field.name)).cast(field.dataType)
    )

# Normalize the key columns
df_norm = df_temporal \
    .withColumn("artist_id", lower(col("artist_id"))) \
    .withColumn("track_id", lower(col("track_id")))

# Final DataFrame without duplicated tracks: to keep things simple, each track is linked to a single artist.
# The original dataset includes arrays with all of a track's creators, but they are listed by name rather than by ID,
# which causes conflicts because some artists share a name and it is impossible to tell which artist is meant.
artists_tracks_df = df_norm.dropDuplicates(["track_id"])

# Rename columns for Neo4j
artists_tracks_df = artists_tracks_df \
    .withColumnRenamed("artist_id", "src") \
    .withColumnRenamed("track_id", "dst")
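
The edge-building cell above keeps one row per `track_id`, so a track that appears under several artists ends up with a single `PERFORMED` edge. As a point of comparison, the sketch below uses plain Python and made-up ids to contrast that with deduplicating on the `(src, dst)` pair, which would keep every distinct artist-track combination. Whether the dataset actually contains such repeated tracks is not established here; this only illustrates the difference in behaviour.

```python
# Made-up (artist_id, track_id) pairs: t1 is credited to two artists and one pair repeats.
pairs = [("ar1", "t1"), ("ar2", "t1"), ("ar1", "t1"), ("ar3", "t2")]

def dedup_by_track(pairs):
    """Keep one edge per track (first seen here; Spark's dropDuplicates keeps an arbitrary row)."""
    seen, out = set(), []
    for src, dst in pairs:
        if dst not in seen:
            seen.add(dst)
            out.append((src, dst))
    return out

def dedup_by_pair(pairs):
    """Keep every distinct (artist, track) edge, preserving first-seen order."""
    return list(dict.fromkeys(pairs))

print(dedup_by_track(pairs))
print(dedup_by_pair(pairs))
```

In PySpark, the pair-based variant would correspond to `dropDuplicates(["artist_id", "track_id"])` instead of `dropDuplicates(["track_id"])`.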


# Writing Data in Neo4j

### Write nodes

In [5]:
# Add the code to write a graph from PySpark's DataFrames to Neo4j

neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

artists_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Artist") \
  .option("node.keys", "artist_id") \
  .option("batch.size", "10000") \
  .save()

print(f"{artists_df.count()} artists written to Neo4j")




31699 artists written to Neo4j
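
The connection options are repeated in every write and read. One possible way to reduce that repetition, and to keep the password out of the notebook, is to build them once as a dictionary and pass it to the writer with `.options(**...)`. The environment variable names `NEO4J_URL`, `NEO4J_USER` and `NEO4J_PASSWORD` are assumptions made for this sketch, not something this lab environment is known to define, and the function name is hypothetical.

```python
import os

def neo4j_connection_options(env=os.environ):
    """Build the connector's connection options from environment variables (names are hypothetical)."""
    return {
        "url": env.get("NEO4J_URL", "bolt://neo4j-iteso:7687"),
        "authentication.basic.username": env.get("NEO4J_USER", "neo4j"),
        # Empty default: the password is expected to come from the environment.
        "authentication.basic.password": env.get("NEO4J_PASSWORD", ""),
    }

conn_opts = neo4j_connection_options()

# Intended use with a DataFrame writer (not executed here):
# artists_df.write.format("org.neo4j.spark.DataSource") \
#     .mode("Overwrite").options(**conn_opts) \
#     .option("labels", ":Artist").option("node.keys", "artist_id").save()
```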



In [None]:

albums_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Album") \
  .option("node.keys", "album_id") \
  .option("batch.size", "10000") \
  .save()

print(f"{albums_df.count()} albums written to Neo4j")




67991 albums written to Neo4j



In [7]:

tracks_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Track") \
  .option("node.keys", "track_id") \
  .option("batch.size", "10000") \
  .save()

print(f"{tracks_df.count()} tracks written to Neo4j")




375141 tracks written to Neo4j



### Write relationships

In [None]:

albums_tracks_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "CONTAINS") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Album") \
  .option("relationship.source.save.mode", "match") \
  .option("relationship.source.node.keys", "src:album_id") \
  .option("relationship.target.labels", ":Track") \
  .option("relationship.target.save.mode", "match") \
  .option("relationship.target.node.keys", "dst:track_id") \
  .option("batch.size", "10000") \
  .save()

print(f"{albums_tracks_df.count()} CONTAINS edges written to Neo4j")




375141 CONTAINS edges written to Neo4j
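
In the `relationship.source.node.keys` and `relationship.target.node.keys` options, a value such as `src:album_id` maps the DataFrame column `src` to the node property `album_id`. A small helper can assemble these options for both relationship types. The function name `relationship_options` is made up for this sketch, while the option keys and values are the ones already used in the relationship-writing cells of this notebook.

```python
def relationship_options(rel_type, source_label, source_key, target_label, target_key,
                         src_col="src", dst_col="dst"):
    """Return writer options that match existing nodes by key and create rel_type between them."""
    return {
        "relationship": rel_type,
        "relationship.save.strategy": "keys",
        "relationship.source.labels": f":{source_label}",
        "relationship.source.save.mode": "match",
        # "<DataFrame column>:<node property>"
        "relationship.source.node.keys": f"{src_col}:{source_key}",
        "relationship.target.labels": f":{target_label}",
        "relationship.target.save.mode": "match",
        "relationship.target.node.keys": f"{dst_col}:{target_key}",
    }

contains_opts = relationship_options("CONTAINS", "Album", "album_id", "Track", "track_id")
performed_opts = relationship_options("PERFORMED", "Artist", "artist_id", "Track", "track_id")
print(contains_opts["relationship.source.node.keys"])
```

Each dictionary could then be passed to the writer with `.options(**contains_opts)` in place of the individual `.option(...)` calls.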



In [9]:

artists_tracks_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "PERFORMED") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Artist") \
  .option("relationship.source.save.mode", "match") \
  .option("relationship.source.node.keys", "src:artist_id") \
  .option("relationship.target.labels", ":Track") \
  .option("relationship.target.save.mode", "match") \
  .option("relationship.target.node.keys", "dst:track_id") \
  .option("batch.size", "10000") \
  .save()

print(f"{artists_tracks_df.count()} PERFORMED edges written to Neo4j")




375141 PERFORMED edges written to Neo4j



# Read and Query Graphs with PySpark

In [18]:
# Add the code to read a data frame from Neo4J and run a simple query to verify 
# The query is written this way to get several tracks per album; reading only the relationship returns each album with a single track, since it does not traverse all the data
contains_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            // Get 5 albums
            MATCH (a:Album)
            WITH a LIMIT 5
            // Match each album with its tracks
            MATCH (a)-[r:CONTAINS]->(t:Track)
            // Return everything
            RETURN a, r, t
            """) \
    .load()

contains_df.show()


+--------------------+--------------------+--------------------+
|                   a|                   r|                   t|
+--------------------+--------------------+--------------------+
|{63660, [Album], ...|{285699, CONTAINS...|{385391, [Track],...|
|{63660, [Album], ...|{170385, CONTAINS...|{302032, [Track],...|
|{63661, [Album], ...|{302964, CONTAINS...|{402656, [Track],...|
|{63661, [Album], ...|{165109, CONTAINS...|{296756, [Track],...|
|{63661, [Album], ...|{69249, CONTAINS,...|{200896, [Track],...|
|{63662, [Album], ...|{280831, CONTAINS...|{380523, [Track],...|
|{63662, [Album], ...|{275689, CONTAINS...|{375381, [Track],...|
|{63662, [Album], ...|{184606, CONTAINS...|{316253, [Track],...|
|{63662, [Album], ...|{125314, CONTAINS...|{256961, [Track],...|
|{63662, [Album], ...|{37581, CONTAINS,...|{169228, [Track],...|
|{63662, [Album], ...|{17946, CONTAINS,...|{149593, [Track],...|
|{63663, [Album], ...|{321399, CONTAINS...|{421091, [Track],...|
|{63663, [Album], ...|{23

In [20]:
performed_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            MATCH p=()-[r:PERFORMED]->()
            RETURN p
            """) \
    .load()

performed_df.show()

+--------------------+
|                   p|
+--------------------+
|"path[(56761)-[37...|
|"path[(33475)-[37...|
|"path[(58495)-[37...|
|"path[(48743)-[37...|
|"path[(39790)-[37...|
|"path[(33631)-[37...|
|"path[(51046)-[37...|
|"path[(41715)-[37...|
|"path[(36970)-[37...|
|"path[(37029)-[37...|
|"path[(51752)-[37...|
|"path[(43356)-[37...|
|"path[(58639)-[37...|
|"path[(35172)-[37...|
|"path[(55031)-[37...|
|"path[(33434)-[37...|
|"path[(50713)-[37...|
|"path[(60004)-[37...|
|"path[(33231)-[37...|
|"path[(50809)-[37...|
+--------------------+
only showing top 20 rows
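
Returning whole nodes, relationships or paths yields struct or string columns that are truncated in `show()`. A possible alternative is to return individual properties with aliases, so each one becomes a regular DataFrame column. The sketch below only builds the Cypher text; the helper name `album_tracks_query` is hypothetical, and the property names `album_name` and `track_name` are the ones defined in the schemas of this notebook.

```python
def album_tracks_query(album_limit=5):
    """Build a Cypher query returning flat columns for the tracks of a few albums."""
    if not isinstance(album_limit, int) or album_limit < 1:
        raise ValueError("album_limit must be a positive integer")
    return (
        "MATCH (a:Album) "
        f"WITH a LIMIT {album_limit} "
        "MATCH (a)-[:CONTAINS]->(t:Track) "
        "RETURN a.album_name AS album, t.track_name AS track"
    )

query = album_tracks_query(5)
print(query)

# Intended use (not executed here), reusing the connection variables defined earlier:
# spark.read.format("org.neo4j.spark.DataSource") \
#     .option("url", neo4j_url) \
#     .option("authentication.basic.username", neo4j_user) \
#     .option("authentication.basic.password", neo4j_passwd) \
#     .option("query", query).load().show(truncate=False)
```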


In [21]:
sc.stop()