# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Juan Bernardo Orozco Quirarte

**Professor**: Pablo Camarillo Ramirez

# Dataset description

## Dataset: car_sales_data.csv

### Columns
- **Manufacturer**: car manufacturer
- **Model**: car model
- **Engine_size**: engine size
- **Fuel_tipe**: fuel type
- **Year_of_manufacture**: year of manufacture
- **Mileage**: mileage
- **Price**: price

### Nodes
- **Car**: each car is a node
- **Manufacturer**: each manufacturer is a node

### Relationships (Edges)
- **MANUFACTURED_BY**: each car is linked to its manufacturer
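
The mapping from one CSV row to the graph model above can be sketched in plain Python. This is only an illustration of the model, not part of the Spark pipeline: `row_to_graph` and `GraphElements` are hypothetical names, and the sample row is the first row of the dataset.

```python
from typing import NamedTuple


class GraphElements(NamedTuple):
    car: dict
    manufacturer: dict
    edge: dict


def row_to_graph(car_id: int, row: dict) -> GraphElements:
    """Split one CSV row into a Car node, a Manufacturer node, and their edge."""
    # Every column except Manufacturer becomes a property of the Car node.
    car = {key: value for key, value in row.items() if key != "Manufacturer"}
    car["car_id"] = car_id
    # The manufacturer name is the only property (and the key) of its node.
    manufacturer = {"Manufacturer": row["Manufacturer"]}
    # The edge points from the car to its manufacturer.
    edge = {"type": "MANUFACTURED_BY", "from": car_id, "to": row["Manufacturer"]}
    return GraphElements(car, manufacturer, edge)


sample_row = {
    "Manufacturer": "Ford",
    "Model": "Fiesta",
    "Engine_size": 1.0,
    "Fuel_tipe": "Petrol",
    "Year_of_manufacture": 2002,
    "Mileage": 127300,
    "Price": 3074,
}
elements = row_to_graph(0, sample_row)
print(elements.edge)
```

Many cars share one manufacturer, so the Manufacturer nodes must be deduplicated before loading, while each Car node stays unique through its `car_id`.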


# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


In [2]:
# Build schema
# Import your module
from lib.bernardoorozco.spark_utils import SparkUtils
from pyspark.sql.functions import monotonically_increasing_id

cars_schema_columns = [ 
     ("Manufacturer", "string"), 
     ("Model", "string"),
     ("Engine_size", "float"),
     ("Fuel_tipe", "string"),
     ("Year_of_manufacture", "int"),
     ("Mileage", "int"),
     ("Price", "int")
]
cars_schema = SparkUtils.generate_schema(cars_schema_columns)

# Read the CSV
df_cars = spark.read \
                .option("header", "true") \
                .schema(cars_schema) \
                .csv("/opt/spark/work-dir/data/archive/car_sales_data.csv")

# Generate a unique ID for each car
df_cars = df_cars.withColumn("car_id", monotonically_increasing_id())

df_cars.show(5)

+------------+----------+-----------+---------+-------------------+-------+-----+------+
|Manufacturer|     Model|Engine_size|Fuel_tipe|Year_of_manufacture|Mileage|Price|car_id|
+------------+----------+-----------+---------+-------------------+-------+-----+------+
|        Ford|    Fiesta|        1.0|   Petrol|               2002| 127300| 3074|     0|
|     Porsche|718 Cayman|        4.0|   Petrol|               2016|  57850|49704|     1|
|        Ford|    Mondeo|        1.6|   Diesel|               2014|  39190|24072|     2|
|      Toyota|      RAV4|        1.8|   Hybrid|               1988| 210814| 1705|     3|
|          VW|      Polo|        1.0|   Petrol|               2006| 127869| 4101|     4|
+------------+----------+-----------+---------+-------------------+-------+-----+------+
only showing top 5 rows
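
The implementation of `SparkUtils.generate_schema` is not shown in this notebook. As an alternative that needs no custom module, the same list of `(name, type)` pairs can be turned into a DDL-formatted string, which `DataFrameReader.schema` also accepts. The helper name `schema_to_ddl` is hypothetical, and this is a sketch rather than the course's utility.

```python
def schema_to_ddl(columns):
    """Join (name, type) pairs into a DDL-formatted schema string."""
    return ", ".join(f"{name} {dtype}" for name, dtype in columns)


cars_schema_columns = [
    ("Manufacturer", "string"),
    ("Model", "string"),
    ("Engine_size", "float"),
    ("Fuel_tipe", "string"),
    ("Year_of_manufacture", "int"),
    ("Mileage", "int"),
    ("Price", "int"),
]

cars_schema_ddl = schema_to_ddl(cars_schema_columns)
print(cars_schema_ddl)

# With an active SparkSession, the string could replace the StructType:
# spark.read.option("header", "true").schema(cars_schema_ddl).csv(path)
```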


                                                                                

# Transformations

In [3]:
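# Build the node and edge DataFrames
from pyspark.sql.functions import col

# Car nodes: every column except Manufacturer. car_id is already unique,
# so dropDuplicates() does not remove any rows here.
df_car_nodes = df_cars.select(
    col("car_id"),
    col("Model"),
    col("Engine_size"),
    col("Fuel_tipe"),
    col("Year_of_manufacture"),
    col("Mileage"),
    col("Price")
).dropDuplicates().limit(10000)

# Manufacturer nodes: one row per distinct manufacturer
df_manufacturers_nodes = df_cars.select(col("Manufacturer")).distinct()

# Edges: car_id -> Manufacturer. This limit() is applied independently of the
# one above, so Spark does not guarantee that both select the same 10000 cars.
df_edges = df_cars.select(
    col("car_id").alias("from"),
    col("Manufacturer").alias("to")
).limit(10000)

df_car_nodes.show(n=5)
df_manufacturers_nodes.show(n=5)
df_edges.show(n=5)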


                                                                                

+------+------+-----------+---------+-------------------+-------+-----+
|car_id| Model|Engine_size|Fuel_tipe|Year_of_manufacture|Mileage|Price|
+------+------+-----------+---------+-------------------+-------+-----+
|    95|Passat|        1.4|   Diesel|               2010|  64359|15563|
|   143|Passat|        2.0|   Petrol|               2013|  38210|25482|
|   308| Prius|        1.0|   Hybrid|               2007|  61357|12446|
|   406|Fiesta|        1.0|   Petrol|               2006| 113610| 4460|
|   600|Mondeo|        1.8|   Diesel|               2014|  38215|25681|
+------+------+-----------+---------+-------------------+-------+-----+
only showing top 5 rows
+------------+
|Manufacturer|
+------------+
|         BMW|
|          VW|
|     Porsche|
|      Toyota|
|        Ford|
+------------+

+----+-------+
|from|     to|
+----+-------+
|   0|   Ford|
|   1|Porsche|
|   2|   Ford|
|   3| Toyota|
|   4|     VW|
+----+-------+
only showing top 5 rows
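
The car nodes and the edges are each cut with their own `limit(10000)`, so an edge may point at a `car_id` that is not among the selected car nodes. The check below is a minimal plain-Python sketch of how such dangling edges could be detected on small samples. `find_dangling_edges` is a hypothetical helper and the sample values are made up for illustration.

```python
def find_dangling_edges(edges, car_ids, manufacturers):
    """Return the edges whose source or target is missing from the node sets."""
    known_cars = set(car_ids)
    known_manufacturers = set(manufacturers)
    return [
        (source, target)
        for source, target in edges
        if source not in known_cars or target not in known_manufacturers
    ]


# Illustrative sample: car 3 and manufacturer "Toyota" are not in the node sets.
car_ids = [0, 1, 2]
manufacturers = ["Ford", "Porsche"]
edges = [(0, "Ford"), (1, "Porsche"), (3, "Toyota")]

dangling = find_dangling_edges(edges, car_ids, manufacturers)
print(dangling)
```

On the full DataFrames, the equivalent check would likely be a `left_anti` join between `df_edges` and `df_car_nodes` on `from == car_id`. Deriving both DataFrames from one shared, already limited subset of `df_cars` would avoid the problem altogether.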


# Writing Data in Neo4j

In [4]:
# Write the graph from PySpark DataFrames to Neo4j
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

# Car nodes, keyed by car_id
df_car_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", "Car") \
  .option("node.keys", "car_id") \
  .save()

print(f"{df_car_nodes.count()} car nodes written to Neo4j")

# Manufacturer nodes, keyed by the manufacturer name
df_manufacturers_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", "Manufacturer") \
  .option("node.keys", "Manufacturer") \
  .save()

print(f"{df_manufacturers_nodes.count()} manufacturer nodes written to Neo4j")

# MANUFACTURED_BY relationships: match existing Car and Manufacturer nodes by key
df_edges.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("relationship", "MANUFACTURED_BY") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Car") \
    .option("relationship.source.save.mode", "match") \
    .option("relationship.source.node.keys", "from:car_id") \
    .option("relationship.target.labels", ":Manufacturer") \
    .option("relationship.target.save.mode", "match") \
    .option("relationship.target.node.keys", "to:Manufacturer") \
    .save()

# count() reports the DataFrame size, not the number of relationships Neo4j created
print(f"{df_edges.count()} MANUFACTURED_BY edges written to Neo4j")

10000 car nodes written to Neo4j
5 manufacturer nodes written to Neo4j
10000 MANUFACTURED_BY edges written to Neo4j
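
The three writes repeat the same connection options. One possible way to reduce the repetition is to collect them in dictionaries and pass them with `options(**...)`. The helper names `neo4j_options` and `node_write_options` are hypothetical; the option keys and values are the ones used in this notebook.

```python
def neo4j_options(url, user, password):
    """Connection options shared by every read and write in this notebook."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def node_write_options(label, key):
    """Options that identify the node label and its key property."""
    return {"labels": label, "node.keys": key}


connection = neo4j_options("bolt://neo4j-iteso:7687", "neo4j", "neo4j@1234")
car_options = {**connection, **node_write_options("Car", "car_id")}
print(sorted(car_options))

# With the DataFrames from this notebook, the write could then look like:
# df_car_nodes.write.format("org.neo4j.spark.DataSource") \
#     .mode("Overwrite").options(**car_options).save()
```

Keeping the credentials in one place also makes it easier to load them from environment variables later instead of hardcoding them.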


                                                                                

# Read and Query Graphs with PySpark

In [9]:
# Read a DataFrame from Neo4j and run a simple query to verify the written graph
cypher_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            MATCH (c:Car)-[r:MANUFACTURED_BY]->(m:Manufacturer)
            RETURN c.Model AS Car_Model, c.Year_of_manufacture AS Year, m.Manufacturer AS Manufacturer
            """) \
    .load()

cypher_df.show(10)

+---------+----+------------+
|Car_Model|Year|Manufacturer|
+---------+----+------------+
|       M5|2000|         BMW|
|       M5|2002|         BMW|
|       Z4|1994|         BMW|
|       M5|1993|         BMW|
|       M5|2010|         BMW|
|       Z4|2008|         BMW|
|       M5|1989|         BMW|
|       Z4|1999|         BMW|
|       M5|2005|         BMW|
|       Z4|2000|         BMW|
+---------+----+------------+
only showing top 10 rows


In [12]:
cypher_df2 = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            MATCH (c:Car)-[r:MANUFACTURED_BY]->(m:Manufacturer)
            WHERE m.Manufacturer IN ["Toyota"]
            RETURN c.Model AS Car_Model, c.Year_of_manufacture AS Year, m.Manufacturer AS Manufacturer
            """) \
    .load()

cypher_df2.limit(20).show(10)

+---------+----+------------+
|Car_Model|Year|Manufacturer|
+---------+----+------------+
|    Yaris|2010|      Toyota|
|    Yaris|2004|      Toyota|
|    Yaris|2017|      Toyota|
|    Prius|2013|      Toyota|
|    Yaris|2018|      Toyota|
|     RAV4|1998|      Toyota|
|     RAV4|2019|      Toyota|
|    Yaris|2018|      Toyota|
|    Yaris|2007|      Toyota|
|    Yaris|2001|      Toyota|
+---------+----+------------+
only showing top 10 rows


In [21]:
cypher_df3 = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query",
            """
            MATCH (c:Car)-[r:MANUFACTURED_BY]->(m:Manufacturer)
            WHERE c.Year_of_manufacture = 2007
            RETURN c.Model AS Car_Model, c.Year_of_manufacture AS Year, c.Fuel_tipe AS Fuel, c.Mileage, c.Price, m.Manufacturer AS Manufacturer
            """) \
    .load()

cypher_df3.limit(20).show(10)

+---------+----+------+---------+-------+------------+
|Car_Model|Year|  Fuel|c.Mileage|c.Price|Manufacturer|
+---------+----+------+---------+-------+------------+
|       Z4|2007|Petrol|   105271|  12046|         BMW|
|       M5|2007|Petrol|    78470|  39655|         BMW|
|       M5|2007|Petrol|    86927|  35075|         BMW|
|       Z4|2007|Petrol|    48480|  16920|         BMW|
|   Passat|2007|Diesel|    50734|  14863|          VW|
|   Passat|2007|Diesel|    86556|  12424|          VW|
|   Passat|2007|Diesel|   110459|   9207|          VW|
|     Golf|2007|Petrol|   124107|   6339|          VW|
|     Golf|2007|Diesel|    83934|   8954|          VW|
|     Polo|2007|Petrol|    15348|  11757|          VW|
+---------+----+------+---------+-------+------------+
only showing top 10 rows
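
The queries above differ only in their `WHERE` clause, so the Cypher text can be generated by a small function. The sketch below builds the year filter and rejects non-integer input, because the value is interpolated directly into the query string. `cars_by_year_query` is a hypothetical helper, and aliasing `c.Mileage` and `c.Price` is an added choice so the resulting column names carry no `c.` prefix.

```python
def cars_by_year_query(year):
    """Build the Cypher query for cars manufactured in the given year."""
    # Only plain integers are accepted, since the value is embedded in the query.
    if isinstance(year, bool) or not isinstance(year, int):
        raise TypeError("year must be an int")
    return (
        "MATCH (c:Car)-[:MANUFACTURED_BY]->(m:Manufacturer)\n"
        f"WHERE c.Year_of_manufacture = {year}\n"
        "RETURN c.Model AS Car_Model, c.Year_of_manufacture AS Year, "
        "c.Fuel_tipe AS Fuel, c.Mileage AS Mileage, c.Price AS Price, "
        "m.Manufacturer AS Manufacturer"
    )


query_2007 = cars_by_year_query(2007)
print(query_2007)
```

The returned string can be passed to the connector through the same `query` option used in the cells above.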


In [22]:
sc.stop()