# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Fernando Ramos

**Professor**: Pablo Camarillo Ramirez

# Dataset description
The dataset contains details of email communication between employees of the same organization.

**Nodes:**
- Person: Represents employees with properties:
  - id (email address)
  - name
  - department
  - seniority

**Edges (Relationships):**
- SENT: Represents email communication 
  - Source: From_Name, From_Email
  - Destination: To_Name, To_Email
  - Properties: topic, date, sentiment, is_opened, device, within_work_hours, within_workdays

# Data ingestion

In [1]:
import findspark
findspark.init()

In [2]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col  # only col is used in this notebook
from fernandoramos.spark_utils import SparkUtils

spark = SparkSession.builder \
    .appName("Data pipeline with Neo4j") \
    .master("spark://d3eb0343c341:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


## Dataset
The dataset contains email communication between employees of the same organization. It is loaded from two CSV files: `users.csv` (nodes) and `mails.csv` (edges).

**Nodes:**
- Person: Represents employees with properties:
  - id (email address - unique identifier)
  - name
  - department
  - seniority
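  - department_color (present in `users.csv`; not selected for the Neo4j nodes)
  - department_aura (present in `users.csv`; not selected for the Neo4j nodes)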

**Edges (Relationships):**
- SENT: Represents email communication 
  - Source: From_Name (sender)
  - Destination: To_Name (recipient)
  - Properties: topic, date, sentiment, is_opened, device, within_work_hours, within_workdays

In [3]:
# User (node) schema
person_schema_columns = [
    ("Name", "string"),
    ("Department", "string"),
    ("email", "string"),
    ("Base 64 Image", "string"),
    ("Seniority", "string"),
    ("Department color", "string"),
    ("Department aura", "string")
]
person_schema = SparkUtils.generate_schema(person_schema_columns)

# Mail (edge) schema
sent_schema_columns = [
    ("Email id", "int"),
    ("From Name", "string"),
    ("From seniority", "string"),
    ("From Department", "string"),
    ("To Name", "string"),
    ("To seniority", "string"),
    ("To Department", "string"),
    ("Email topic", "string"),
    ("Date", "date"),
    ("Sentiment", "string"),
    ("Is opened?", "string"),
    ("Device", "string"),
    ("Within work hours", "string"),
    ("Within workdays", "string")
]
sent_schema = SparkUtils.generate_schema(sent_schema_columns)

base_path = "/opt/spark/work-dir/data/mailing/"
df_person = spark.read \
           .option("header", "true") \
           .schema(person_schema) \
           .csv(base_path + "users.csv")
df_sent = spark.read \
           .option("header", "true") \
           .schema(sent_schema) \
           .csv(base_path + "mails.csv")
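
`SparkUtils.generate_schema` is imported from `fernandoramos.spark_utils`, and its implementation is not shown here. As a rough sketch of an alternative that needs no helper, `DataFrameReader.schema` also accepts a DDL-formatted string, so the same `(name, type)` tuples could be turned into one. The function name `columns_to_ddl` is hypothetical; backticks quote column names that contain spaces or punctuation.

```python
def columns_to_ddl(columns):
    """Build a Spark DDL schema string from (column_name, type_name) pairs.

    Hypothetical helper, not part of SparkUtils. Column names are wrapped in
    backticks (embedded backticks are doubled) so names such as "Email id"
    or "Is opened?" stay valid identifiers.
    """
    parts = []
    for name, type_name in columns:
        quoted = "`" + name.replace("`", "``") + "`"
        parts.append(f"{quoted} {type_name.upper()}")
    return ", ".join(parts)


sample_columns = [("Email id", "int"), ("From Name", "string"), ("Date", "date")]
ddl = columns_to_ddl(sample_columns)
print(ddl)

# Assumed usage with the reader from this notebook:
# spark.read.option("header", "true").schema(ddl).csv(base_path + "mails.csv")
```

Passing a DDL string keeps the schema definition in one readable line, at the cost of losing the explicit `StructType` object.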

# Transformations

In [4]:
# These mappings are joined with the mail data to look up
# each sender's and recipient's email address by name
from_mapping = df_person.select(
    col("Name").alias("From Name"),
    col("email").alias("from_email")
)
to_mapping = df_person.select(
    col("Name").alias("To Name"),
    col("email").alias("to_email")
)

person_nodes = df_person.select(
    col("email").alias("id"),
    col("Name").alias("name"),
    col("Department").alias("department"),
    col("Seniority").alias("seniority")
)

sent_edges = df_sent \
    .join(from_mapping, "From Name") \
    .join(to_mapping, "To Name") \
    .select(
        col("from_email").alias("src"),
        col("to_email").alias("dst"),
        col("Email topic").alias("topic"),
        col("Date").alias("date"),
        col("Sentiment").alias("sentiment"),
        col("Is opened?").alias("is_opened"),
        col("Device").alias("device"),
        col("Within work hours").alias("within_work_hours"),
        col("Within workdays").alias("within_workdays")
    )
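
Because the edges are built by joining on `Name`, and `DataFrame.join` defaults to an inner join, any email whose sender or recipient name has no match in `users.csv` is silently dropped, and a name shared by two people would duplicate edges. The sketch below shows the kind of check that could be run first. It works on plain Python lists with made-up names; `audit_name_keys` is a hypothetical helper, and in the notebook the inputs would come from collecting the relevant columns.

```python
from collections import Counter


def audit_name_keys(person_names, mail_names):
    """Return (duplicated, unmatched) names for a join keyed on person name.

    duplicated: names that appear more than once among the persons.
    unmatched:  names used in the mails that no person carries.
    """
    counts = Counter(person_names)
    duplicated = sorted(name for name, n in counts.items() if n > 1)
    unmatched = sorted(set(mail_names) - set(counts))
    return duplicated, unmatched


# Made-up data for illustration only
people = ["Ana Ruiz", "Luis Vega", "Ana Ruiz"]
senders = ["Ana Ruiz", "Luis Vega", "Marta Gil"]

duplicated, unmatched = audit_name_keys(people, senders)
print("duplicated:", duplicated)
print("unmatched:", unmatched)
```

If both lists come back empty, the name-based join neither drops nor multiplies rows.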

# Writing Data in Neo4j

In [6]:
# Write the graph from the PySpark DataFrames to Neo4j
# Person nodes
person_nodes.write \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://neo4j-iteso:7687") \
    .option("authentication.basic.username", "neo4j") \
    .option("authentication.basic.password", "neo4j@1234") \
    .option("labels", ":Person") \
    .option("node.keys", "id") \
    .mode("Overwrite") \
    .save()

# SENT relationships
sent_edges.write \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://neo4j-iteso:7687") \
    .option("authentication.basic.username", "neo4j") \
    .option("authentication.basic.password", "neo4j@1234") \
    .option("relationship", "SENT") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Person") \
    .option("relationship.source.node.keys", "src:id") \
    .option("relationship.source.save.mode", "overwrite") \
    .option("relationship.target.labels", ":Person") \
    .option("relationship.target.node.keys", "dst:id") \
    .option("relationship.target.save.mode", "overwrite") \
    .mode("Overwrite") \
    .save()
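
The same URL and credentials are repeated in every read and write. One possible way to centralize them, sketched below, is a small function that returns the shared connector options and takes the password from an environment variable instead of the notebook source. The function name and the `NEO4J_*` variable names are assumptions; the option keys are the ones already used in this notebook.

```python
import os


def neo4j_options(env=None):
    """Return the connection options shared by every Neo4j read and write.

    Hypothetical helper: the NEO4J_* variable names are assumptions. The
    password has no default, so a missing variable raises KeyError.
    """
    env = os.environ if env is None else env
    return {
        "url": env.get("NEO4J_URL", "bolt://neo4j-iteso:7687"),
        "authentication.basic.username": env.get("NEO4J_USER", "neo4j"),
        "authentication.basic.password": env["NEO4J_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment
opts = neo4j_options({"NEO4J_PASSWORD": "example-password"})
print(sorted(opts))

# Assumed usage with the writer from this notebook:
# person_nodes.write.format("org.neo4j.spark.DataSource") \
#     .options(**neo4j_options()) \
#     .option("labels", ":Person") \
#     .option("node.keys", "id") \
#     .mode("Overwrite") \
#     .save()
```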

                                                                                

# Read and Query Graphs with PySpark

In [7]:
# Retrieve top 10 email senders
query_top_senders = """
MATCH (sender:Person)-[r:SENT]->(receiver:Person)
RETURN sender.name AS sender_name, 
       sender.department AS department,
       count(r) AS emails_sent
ORDER BY emails_sent DESC
"""

df_top_senders = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://neo4j-iteso:7687") \
    .option("authentication.basic.username", "neo4j") \
    .option("authentication.basic.password", "neo4j@1234") \
    .option("query", query_top_senders) \
    .load()

df_top_senders.show(10)

+--------------------+--------------------+-----------+
|         sender_name|          department|emails_sent|
+--------------------+--------------------+-----------+
|Constancia Di Bar...|Executive Management|         46|
|        Reina Trobey|Information Techn...|         21|
|      Marilyn Seeman|Information Techn...|         14|
|      Emmey Matoshin|               Sales|         11|
|        Leona McAree|Executive Management|         10|
|       Faythe Vassel|               Legal|         10|
|     Maurine Golding|               Legal|         10|
|       Jorey Deguara| Product development|          7|
|      Mata McGifford|    Customer Service|          7|
|        Traci Habbal|           Marketing|          6|
+--------------------+--------------------+-----------+
only showing top 10 rows


In [8]:
# Count emails by sentiment
query_sentiment = """
MATCH (sender:Person)-[r:SENT]->(receiver:Person)
RETURN r.sentiment AS sentiment,
       count(r) AS count
ORDER BY count DESC
"""

df_sentiment = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", "bolt://neo4j-iteso:7687") \
    .option("authentication.basic.username", "neo4j") \
    .option("authentication.basic.password", "neo4j@1234") \
    .option("query", query_sentiment) \
    .load()

df_sentiment.show()

+---------+-----+
|sentiment|count|
+---------+-----+
|  neutral|  159|
| positive|   80|
| negative|   25|
+---------+-----+
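
The raw counts are easier to compare as shares of the total. A minimal sketch, using the three counts copied from the output above (plain Python, no Spark needed):

```python
# Counts copied from the query output above
sentiment_counts = {"neutral": 159, "positive": 80, "negative": 25}

total = sum(sentiment_counts.values())

# Share of each sentiment as a percentage of all SENT relationships returned
shares = {label: 100 * n / total for label, n in sentiment_counts.items()}

print("total:", total)
for label, pct in shares.items():
    print(f"{label}: {pct:.1f}%")
```

The percentages are formatted at print time rather than with `round`, since this notebook imports a `round` from `pyspark.sql.functions` that shadows the built-in.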



In [9]:
sc.stop()