# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 06**: Writing Data in Neo4j

**Date**: October 5, 2025

**Student Name**: Andre Jair Sanchez Contreras

**Professor**: Pablo Camarillo Ramirez

# Dataset description

I chose this dataset because I find the structure of Amazon datasets interesting: they are usually very large and provide a lot of information. For this lab I picked a small one that contains 250 customer orders.

The dataset contains the following columns:

Order ID - Unique identifier for each order (e.g., ORD0001).

Date - Date of the order.

Product - Name of the product purchased.

Category - Product category (Electronics, Clothing, Home Appliances, etc.).

Price - Price of a single unit of the product.

Quantity - Number of units purchased in the order.

Total Sales - Total revenue from the order (Price × Quantity).

Customer Name - Name of the customer.

Customer Location - City where the customer is based.

Payment Method - Mode of payment (Credit Card, Debit Card, PayPal, etc.).

Status - Order status (Completed, Pending, or Cancelled). 

Link: https://www.kaggle.com/datasets/zahidmughal2343/amazon-sales-2025?select=amazon_sales_data+2025.csv
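
The column description states that Total Sales equals Price × Quantity. A minimal sketch of how that rule could be checked in plain Python before loading the data into the graph is shown here. The helper name `find_total_mismatches` and the tolerance are assumptions made for illustration; the sample rows are the first five orders of the dataset.

```python
from typing import List, Tuple

# (order_id, price, quantity, total_sales) for the first five orders of the dataset
sample_rows: List[Tuple[str, float, int, float]] = [
    ("ORD0001", 60.0, 3, 180.0),
    ("ORD0002", 100.0, 4, 400.0),
    ("ORD0003", 60.0, 2, 120.0),
    ("ORD0004", 60.0, 3, 180.0),
    ("ORD0005", 150.0, 3, 450.0),
]


def find_total_mismatches(rows, tolerance: float = 1e-6) -> List[str]:
    """Return the IDs of orders whose total differs from price * quantity.

    Hypothetical helper: the tolerance absorbs floating-point rounding.
    """
    return [
        order_id
        for order_id, price, quantity, total in rows
        if abs(price * quantity - total) > tolerance
    ]


# An empty list means every sampled row satisfies the rule.
print(find_total_mismatches(sample_rows))
```

The same rule could be applied to the full DataFrame with a filter on `price * quantity != total_sales`; the plain-Python version only illustrates the check.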

# Data ingestion

In [1]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

USE_LOCAL = True  

neo4j_pkg = "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3"

if USE_LOCAL:
    
    spark = (SparkSession.builder
        .appName("Amazon → Neo4j (LOCAL)")
        .master("local[*]")                         
        .config("spark.jars.packages", neo4j_pkg)   
        .config("spark.ui.port", "4040")
        .getOrCreate())
else:
    
    spark = (SparkSession.builder
        .appName("Amazon → Neo4j (CLUSTER)")
        .master("spark://spark-master:7077")
        .config("spark.jars.packages", neo4j_pkg)
        .config("spark.ui.port", "4040")
        .getOrCreate())

sc = spark.sparkContext
sc.setLogLevel("ERROR")
spark.conf.set("spark.sql.shuffle.partitions", "5")

:: loading settings :: url = jar:file:/opt/spark/jars/ivy-2.5.3.jar!/org/apache/ivy/core/settings/ivysettings.xml
Ivy Default Cache set to: /root/.ivy2.5.2/cache
The jars for the packages stored in: /root/.ivy2.5.2/jars
org.neo4j#neo4j-connector-apache-spark_2.13 added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent-134f1854-6197-40f7-9980-19d880bb83f4;1.0
	confs: [default]
	found org.neo4j#neo4j-connector-apache-spark_2.13;5.3.10_for_spark_3 in central
	found org.neo4j#neo4j-connector-apache-spark_2.13_common;5.3.10_for_spark_3 in central
	found org.neo4j#caniuse-core;1.3.0 in central
	found org.neo4j#caniuse-api;1.3.0 in central
	found org.jetbrains.kotlin#kotlin-stdlib;2.1.20 in central
	found org.jetbrains#annotations;13.0 in central
	found org.neo4j#caniuse-neo4j-detection;1.3.0 in central
	found org.neo4j.driver#neo4j-java-driver-slim;4.4.21 in central
	found org.reactivestreams#reactive-streams;1.0.4 in central
	found io.netty#netty-handler;4.1.

In [4]:
from pcamarillor.spark_utils import SparkUtils

schema = SparkUtils.generate_schema([
    ("Order ID", "string"),
    ("Date", "string"),              
    ("Product", "string"),
    ("Category", "string"),
    ("Price", "double"),
    ("Quantity", "int"),
    ("Total Sales", "double"),
    ("Customer Name", "string"),
    ("Customer Location", "string"),
    ("Payment Method", "string"),
    ("Status", "string"),
])

csv_path = "/opt/spark/work-dir/data/Amazon/amazon_sales_data_2025.csv"

df_raw = (spark.read
    .option("header", "true")
    .schema(schema)
    .csv(csv_path))

df = df_raw.selectExpr(
    "`Order ID`          as order_id",
    "Product             as product",
    "Category            as category",
    "Price               as price",
    "Quantity            as quantity",
    "`Total Sales`       as total_sales",
    "`Customer Name`     as customer_name",
    "`Customer Location` as city",
    "`Payment Method`    as payment",
    "Status              as status"
)

df.show(5, False)
df.printSchema()



+--------+-------------+-----------+-----+--------+-----------+-------------+-------------+-----------+---------+
|order_id|product      |category   |price|quantity|total_sales|customer_name|city         |payment    |status   |
+--------+-------------+-----------+-----+--------+-----------+-------------+-------------+-----------+---------+
|ORD0001 |Running Shoes|Footwear   |60.0 |3       |180.0      |Emma Clark   |New York     |Debit Card |Cancelled|
|ORD0002 |Headphones   |Electronics|100.0|4       |400.0      |Emily Johnson|San Francisco|Debit Card |Pending  |
|ORD0003 |Running Shoes|Footwear   |60.0 |2       |120.0      |John Doe     |Denver       |Amazon Pay |Cancelled|
|ORD0004 |Running Shoes|Footwear   |60.0 |3       |180.0      |Olivia Wilson|Dallas       |Credit Card|Pending  |
|ORD0005 |Smartwatch   |Electronics|150.0|3       |450.0      |Emma Clark   |New York     |Debit Card |Pending  |
+--------+-------------+-----------+-----+--------+-----------+-------------+-----------

                                                                                

# Transformations

In [7]:
# Add the code for your transformations to create nodes and edges DataFrames HERE

from pyspark.sql import functions as X


def hcol(*cols):
    return X.sha2(X.concat_ws("|", *cols), 256)

# --- NODES ---
customer_nodes = (df
    .select(
        hcol("customer_name","city").alias("customer_id"),
        X.col("customer_name").alias("name"),
        X.col("city")
    )
    .dropDuplicates(["customer_id"])
)
customer_nodes.show(3, False)

# The functions module is imported as X above, so X (not F) must be used here
city_nodes = (df
    .select(X.col("city").alias("name"))
    .dropDuplicates()
    .withColumn("city_id", X.sha2("name", 256))
    .select("city_id","name")
)
city_nodes.show(3, False)

product_nodes = (df
    .select(
        hcol("product","category").alias("product_id"),
        X.col("product").alias("name"),
        X.col("category"),
        X.col("price").cast("double").alias("price")
    )
    .dropDuplicates(["product_id"])
)
product_nodes.show(3, False)

order_nodes = (df
    .select(
        X.col("order_id"),
        X.col("status"),
        X.col("total_sales").cast("double").alias("total_sales"),
        X.col("quantity").cast("int").alias("quantity")
    )
    .dropDuplicates(["order_id"])
)
order_nodes.show(3, False)

payment_nodes = (df
    .select(X.col("payment").alias("name"))
    .dropDuplicates()
    .withColumn("payment_id", X.sha2("name",256))
    .select("payment_id","name")
)
payment_nodes.show(3, False)

# --- EDGES ---
#Customer -> Order
placed_edges = (df
    .select(
        hcol("customer_name","city").alias("customer_id"),
        X.col("order_id")
    )
    .dropDuplicates()
)
placed_edges.show(3, False)

#Order -> Product
contains_edges = (df
    .select(
        X.col("order_id"),
        hcol("product","category").alias("product_id"),
        X.col("quantity").cast("int").alias("qty")
    )
)
contains_edges.show(3, False)

#Order -> PaymentMethod
paid_with_edges = (df
    .select(
        X.col("order_id"),
        X.sha2(X.col("payment"), 256).alias("payment_id")
    )
    .dropDuplicates()
)
paid_with_edges.show(3, False)

#Customer -> City
located_in_edges = (customer_nodes.alias("c")
    .join(city_nodes.alias("ci"), X.col("c.city") == X.col("ci.name"))
    .select(X.col("c.customer_id"), X.col("ci.city_id"))
    .dropDuplicates()
)
located_in_edges.show(3, False)

+----------------------------------------------------------------+-------------+-----------+
|customer_id                                                     |name         |city       |
+----------------------------------------------------------------+-------------+-----------+
|04849a78e8e956f1cf1b8a6f774f351ec5d9266d5dd34cee5c180daba6f4eb47|Olivia Wilson|Chicago    |
|072f68ad17fe33c3b02eac04ebe844f11e830b1232c8eb3b355c9b2055b996a5|Jane Smith   |Los Angeles|
|07758a51dc0f7fac61534c6fee7453994165100f15f05c613a5d87141248b285|Daniel Harris|Houston    |
+----------------------------------------------------------------+-------------+-----------+
only showing top 3 rows
+----------------------------------------------------------------+-------+
|city_id                                                         |name   |
+----------------------------------------------------------------+-------+
|fcdeb8c07d4a0e1d3b453067c2e819ba1caaf77a2a8a4acd990746e1bf242ec6|Denver |
|4eb1b68df1329b7210000f09
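
The `hcol` helper above builds node IDs by joining column values with `|` and hashing the result with SHA-256. The sketch below reproduces that idea in plain Python with `hashlib`, assuming Spark hashes the UTF-8 bytes of the joined string. The function name `hash_id` is hypothetical and is not part of the lab code.

```python
import hashlib


def hash_id(*parts: str) -> str:
    """Join the parts with '|' and return the SHA-256 hex digest.

    Plain-Python counterpart of sha2(concat_ws("|", ...), 256).
    Unlike concat_ws, this sketch does not skip null (None) values.
    """
    joined = "|".join(parts)
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


customer_id = hash_id("Emma Clark", "New York")

# A SHA-256 hex digest always has 64 characters.
print(len(customer_id))
# Same inputs give the same ID, so nodes and edges can be built independently.
print(hash_id("Emma Clark", "New York") == customer_id)
# A customer with the same name in another city gets a different ID.
print(hash_id("Emma Clark", "Miami") == customer_id)
```

Because the ID is deterministic, the edge DataFrames can recompute it from the raw columns instead of joining back to the node DataFrames. One caveat of this design: if a value itself contained `|`, two different column combinations could produce the same joined string.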

# Writing Data in Neo4j

In [9]:
# Add the code to write a graph from PySpark's DataFrames to Neo4j
# Connection
neo4j_url   = "bolt://neo4j-iteso:7687"  
neo4j_user  = "neo4j"
neo4j_passw = "neo4j@1234"

# Nodes

#Customer
customer_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("labels", ":Customer") \
  .option("node.keys", "customer_id") \
  .save()

print(f"{customer_nodes.count()} Customer nodes wrote in Neo4j")

#City
city_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("labels", ":City") \
  .option("node.keys", "city_id") \
  .save()

print(f"{city_nodes.count()} City nodes wrote in Neo4j")

#Product
product_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("labels", ":Product") \
  .option("node.keys", "product_id") \
  .save()

print(f"{product_nodes.count()} Product nodes wrote in Neo4j")

#Order
order_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("labels", ":Order") \
  .option("node.keys", "order_id") \
  .save()

print(f"{order_nodes.count()} Order nodes wrote in Neo4j")

#PaymentMethod
payment_nodes.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("labels", ":PaymentMethod") \
  .option("node.keys", "payment_id") \
  .save()

print(f"{payment_nodes.count()} PaymentMethod nodes wrote in Neo4j")


# Relationships

#(Customer)-[:PLACED]->(Order)
placed_edges.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("relationship", "PLACED") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Customer") \
  .option("relationship.source.node.keys", "customer_id:customer_id") \
  .option("relationship.target.labels", ":Order") \
  .option("relationship.target.node.keys", "order_id:order_id") \
  .save()

print(f"{placed_edges.count()} PLACED edges wrote in Neo4j")

#(Order)-[:CONTAINS {qty}]->(Product)
contains_edges.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("relationship", "CONTAINS") \
  .option("relationship.properties", "qty:qty") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Order") \
  .option("relationship.source.node.keys", "order_id:order_id") \
  .option("relationship.target.labels", ":Product") \
  .option("relationship.target.node.keys", "product_id:product_id") \
  .save()

print(f"{contains_edges.count()} CONTAINS edges wrote in Neo4j")

#(Order)-[:PAID_WITH]->(PaymentMethod)
paid_with_edges.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("relationship", "PAID_WITH") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Order") \
  .option("relationship.source.node.keys", "order_id:order_id") \
  .option("relationship.target.labels", ":PaymentMethod") \
  .option("relationship.target.node.keys", "payment_id:payment_id") \
  .save()

print(f"{paid_with_edges.count()} PAID_WITH edges wrote in Neo4j")

#(Customer)-[:LOCATED_IN]->(City)
located_in_edges.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("relationship", "LOCATED_IN") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Customer") \
  .option("relationship.source.node.keys", "customer_id:customer_id") \
  .option("relationship.target.labels", ":City") \
  .option("relationship.target.node.keys", "city_id:city_id") \
  .save()

print(f"{located_in_edges.count()} LOCATED_IN edges wrote in Neo4j")


92 Customer nodes wrote in Neo4j
10 City nodes wrote in Neo4j
10 Product nodes wrote in Neo4j
250 Order nodes wrote in Neo4j
5 PaymentMethod nodes wrote in Neo4j
250 PLACED edges wrote in Neo4j
250 CONTAINS edges wrote in Neo4j
250 PAID_WITH edges wrote in Neo4j
92 LOCATED_IN edges wrote in Neo4j
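
The nine writes above repeat the same connection options and differ only in labels, keys, and relationship type. One possible way to reduce the repetition is to build the option dictionaries with small helpers and pass them to `DataFrameWriter.options(**...)`. The helper names below (`connection_options`, `node_options`, `relationship_options`) are hypothetical, and the password is a placeholder; the option keys are the ones already used in this notebook.

```python
from typing import Dict, Optional

NEO4J_FORMAT = "org.neo4j.spark.DataSource"


def connection_options(url: str, user: str, password: str) -> Dict[str, str]:
    """Options shared by every read and write against the same database."""
    return {
        "url": url,
        "authentication.basic.username": user,
        "authentication.basic.password": password,
    }


def node_options(conn: Dict[str, str], label: str, key: str) -> Dict[str, str]:
    """Options for writing one node label keyed by a single property."""
    return {**conn, "labels": f":{label}", "node.keys": key}


def relationship_options(conn: Dict[str, str], rel_type: str,
                         source_label: str, source_key: str,
                         target_label: str, target_key: str,
                         properties: Optional[str] = None) -> Dict[str, str]:
    """Options for writing one relationship type with the 'keys' strategy."""
    opts = {
        **conn,
        "relationship": rel_type,
        "relationship.save.strategy": "keys",
        "relationship.source.labels": f":{source_label}",
        "relationship.source.node.keys": f"{source_key}:{source_key}",
        "relationship.target.labels": f":{target_label}",
        "relationship.target.node.keys": f"{target_key}:{target_key}",
    }
    if properties is not None:
        opts["relationship.properties"] = properties
    return opts


# "<password>" is a placeholder, not a real credential.
conn = connection_options("bolt://neo4j-iteso:7687", "neo4j", "<password>")
customer_opts = node_options(conn, "Customer", "customer_id")
contains_opts = relationship_options(conn, "CONTAINS", "Order", "order_id",
                                     "Product", "product_id", properties="qty:qty")
```

With these dictionaries, a write could look like `customer_nodes.write.format(NEO4J_FORMAT).mode("Overwrite").options(**customer_opts).save()`, which keeps each node or relationship definition on a single line.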


# Read and Query Graphs with PySpark

In [12]:
# Add the code to read a data frame from Neo4J and run a simple query to verify 
# Cities with the highest sales
cypher_city = spark.read \
  .format("org.neo4j.spark.DataSource") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passw) \
  .option("query", """
    MATCH (c:Customer)-[:LOCATED_IN]->(ci:City)
    MATCH (c)-[:PLACED]->(o:Order)
    RETURN ci.name AS city, sum(o.total_sales) AS revenue
    ORDER BY revenue DESC
  """) \
  .load()

cypher_city.show(truncate=False)



+-------------+-------+
|city         |revenue|
+-------------+-------+
|Miami        |31700.0|
|Denver       |29785.0|
|Houston      |28390.0|
|Dallas       |27145.0|
|Seattle      |26890.0|
|Boston       |26170.0|
|Chicago      |20810.0|
|New York     |18940.0|
|Los Angeles  |17820.0|
|San Francisco|16195.0|
+-------------+-------+
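
The Cypher query above follows Customer → City and Customer → Order, sums `total_sales` per city, and sorts the result in descending order. The sketch below shows the same aggregation in plain Python over the first five orders of the dataset, as an illustration of the logic only. The function name `revenue_by_city` is hypothetical. Note that, like the query, it does not filter by status, so Cancelled and Pending orders are included in the revenue.

```python
from collections import defaultdict
from typing import List, Tuple

# (order_id, city, total_sales, status) for the first five orders of the dataset
sample_orders: List[Tuple[str, str, float, str]] = [
    ("ORD0001", "New York", 180.0, "Cancelled"),
    ("ORD0002", "San Francisco", 400.0, "Pending"),
    ("ORD0003", "Denver", 120.0, "Cancelled"),
    ("ORD0004", "Dallas", 180.0, "Pending"),
    ("ORD0005", "New York", 450.0, "Pending"),
]


def revenue_by_city(orders) -> List[Tuple[str, float]]:
    """Sum total_sales per city and sort by revenue, highest first."""
    totals = defaultdict(float)
    for _order_id, city, total_sales, _status in orders:
        totals[city] += total_sales
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


for city, revenue in revenue_by_city(sample_orders):
    print(city, revenue)
```

Restricting the result to completed orders would require an extra condition on the order status, both in this sketch and in the Cypher query.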



In [None]:
sc.stop()