# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Arantxa Angulo

**Professor**: Pablo Camarillo Ramirez

# Dataset description

**Context**

On the data team at Stack Overflow, we spend a lot of time and energy thinking about tech ecosystems and how technologies are related to each other. One way to get at these relationships is tag correlations: how often technology tags at Stack Overflow appear together relative to how often they appear separately. One place developers use tags at Stack Overflow is on their Developer Stories (professional profiles, CVs, or résumés). If we are interested in how technologies are connected and used together, developers' own descriptions of their work and careers are a great place to find that.

**Content**

A network of technology tags from Developer Stories on the Stack Overflow online developer community website.

This is organized as two tables:

* `stack_network_links` contains the links of the network: the source and target tech tags plus the value of the link between each pair
* `stack_network_nodes` contains the nodes of the network: the name of each node, the group that node belongs to (calculated via walktrap clustering), and a node size based on how often that technology tag is used
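
To make the two-table layout concrete, here is a minimal sketch in plain Python that turns a few link rows into an adjacency map and indexes node attributes by tag name. The rows are copied from the previews printed in this notebook, not the full dataset, and the names `neighbors` and `node_by_name` are illustrative assumptions.

```python
from collections import defaultdict

# A few rows copied from the previews in this notebook (not the full dataset).
nodes = [
    {"name": "html", "group": 6, "nodesize": 272.45},
    {"name": "css", "group": 6, "nodesize": 341.17},
    {"name": "ruby", "group": 3, "nodesize": 70.14},
]
links = [
    {"source": "c", "target": "python", "value": 22.320433},
    {"source": "python", "target": "r", "value": 28.535748},
    {"source": "ruby-on-rails", "target": "ruby", "value": 95.36131},
]

# Treat each link as undirected here: co-occurrence has no natural direction.
neighbors = defaultdict(dict)
for link in links:
    neighbors[link["source"]][link["target"]] = link["value"]
    neighbors[link["target"]][link["source"]] = link["value"]

# Node attributes keyed by tag name, as they would be looked up in a join.
node_by_name = {node["name"]: node for node in nodes}

print(sorted(neighbors["python"]))    # ['c', 'r']
print(node_by_name["ruby"]["group"])  # 3
```

The links table carries the graph structure, while the nodes table only decorates each tag with a group and a size.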

# Data ingestion

In [7]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")


In [8]:
# Build schema
# Import your module
from arantxa.spark_utils import SparkUtils

nodes_schema = SparkUtils.generate_schema([
    ("name", "string"),
    ("group", "int"),
    ("nodesize", "float")
])

# Define schema for links/edges
links_schema = SparkUtils.generate_schema([
    ("source", "string"),
    ("target", "string"),
    ("value", "float")
])

# Read the data
df_nodes = spark.read \
    .option("header", "true") \
    .schema(nodes_schema) \
    .csv("/opt/spark/work-dir/data/stack_network_nodes.csv")

df_links = spark.read \
    .option("header", "true") \
    .schema(links_schema) \
    .csv("/opt/spark/work-dir/data/stack_network_links.csv")

print("Nodes DataFrame:")
df_nodes.show(5, truncate=False)

print("Links DataFrame:")
df_links.show(5, truncate=False)

Nodes DataFrame:
+---------+-----+--------+
|name     |group|nodesize|
+---------+-----+--------+
|html     |6    |272.45  |
|css      |6    |341.17  |
|hibernate|8    |29.83   |
|spring   |8    |52.84   |
|ruby     |3    |70.14   |
+---------+-----+--------+
only showing top 5 rows
Links DataFrame:
+----------------+------+---------+
|source          |target|value    |
+----------------+------+---------+
|azure           |.net  |20.933193|
|sql-server      |.net  |32.322525|
|asp.net         |.net  |48.40703 |
|entity-framework|.net  |24.370903|
|wpf             |.net  |32.350925|
+----------------+------+---------+
only showing top 5 rows
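
`SparkUtils.generate_schema` comes from a course module that is not shown in this notebook, so its implementation is unknown here. As a stand-in, `DataFrameReader.schema` also accepts a DDL-formatted string, which can be built with plain Python. The helper `to_ddl` below is a hypothetical name introduced for illustration. It is not the course's function.

```python
def to_ddl(columns):
    """Build a Spark DDL schema string from (name, type) pairs.

    Hypothetical helper for illustration; it is not SparkUtils.generate_schema.
    Column names are wrapped in backticks so words such as `group` are safe.
    """
    return ", ".join(f"`{name}` {dtype.upper()}" for name, dtype in columns)


nodes_ddl = to_ddl([("name", "string"), ("group", "int"), ("nodesize", "float")])
links_ddl = to_ddl([("source", "string"), ("target", "string"), ("value", "float")])

print(nodes_ddl)  # `name` STRING, `group` INT, `nodesize` FLOAT
```

With a live session, the string could be passed where the `StructType` is used above, for example `spark.read.option("header", "true").schema(nodes_ddl).csv(...)`.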


# Transformations

In [9]:
# Add the code for your transformations to create nodes and edges DataFrames HERE
from pyspark.sql.functions import col

# Create nodes DataFrame
tech_nodes = df_nodes.select(
    col("name").alias("id"),      # Node identifier
    col("group"),                # Category group
    col("nodesize").alias("size") # Relative size
).dropDuplicates(["id"])         # Remove duplicates

print("Tech Nodes:")
tech_nodes.show(5, truncate=False)

# Create edges DataFrame
tech_edges = df_links.select(
    col("source").alias("src"),   # Source node
    col("target").alias("dst"),   # Target node  
    col("value").alias("weight")  # Relationship weight
).dropDuplicates()               # Remove duplicates

print("Tech Edges:")
tech_edges.show(5, truncate=False)

print(f"Total nodes: {tech_nodes.count()}")
print(f"Total edges: {tech_edges.count()}")

Tech Nodes:
+--------+-----+-----+
|id      |group|size |
+--------+-----+-----+
|qt      |1    |10.53|
|iphone  |4    |15.29|
|unix    |5    |15.67|
|devops  |9    |9.81 |
|embedded|1    |13.27|
+--------+-----+-----+
only showing top 5 rows
Tech Edges:
+-------------+----------------+---------+
|src          |dst             |weight   |
+-------------+----------------+---------+
|sql          |asp.net         |21.672264|
|wpf          |entity-framework|24.2282  |
|c            |python          |22.320433|
|python       |r               |28.535748|
|ruby-on-rails|ruby            |95.36131 |
+-------------+----------------+---------+
only showing top 5 rows
Total nodes: 115
Total edges: 490
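
The relationship write in the next section uses the `match` save mode for both endpoints, so an edge whose `src` or `dst` is missing from the node set would likely find no node to attach to. The sketch below shows that check in plain Python on a few rows. The function name `dangling_edges` and the made-up `cobol` edge are assumptions for illustration only.

```python
def dangling_edges(node_ids, edges):
    """Return the edges whose source or target is not a known node id."""
    known = set(node_ids)
    return [
        (src, dst, weight)
        for src, dst, weight in edges
        if src not in known or dst not in known
    ]


# Illustrative rows only; "cobol" is deliberately absent from the node list.
node_ids = ["python", "r", "c"]
edges = [
    ("c", "python", 22.320433),
    ("python", "r", 28.535748),
    ("cobol", "c", 1.0),  # made-up edge to demonstrate the check
]

missing = dangling_edges(node_ids, edges)
print(missing)  # [('cobol', 'c', 1.0)]
```

On the real DataFrames, the same idea could be expressed as `tech_edges.join(tech_nodes, tech_edges.src == tech_nodes.id, "left_anti")`, repeated for `dst`. An empty result would mean every edge has both endpoints.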


# Writing Data in Neo4j

In [10]:
# Add the code to write a graph from PySpark's DataFrames to Neo4j
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

# Write nodes to Neo4j
tech_nodes.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("labels", ":Tech") \
    .option("node.keys", "id") \
    .save()

print(f"{tech_nodes.count()} tech nodes written to Neo4j")

# Write edges to Neo4j
tech_edges.write \
    .format("org.neo4j.spark.DataSource") \
    .mode("Overwrite") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("relationship", "CO_OCCURS") \
    .option("relationship.save.strategy", "keys") \
    .option("relationship.source.labels", ":Tech") \
    .option("relationship.source.save.mode", "match") \
    .option("relationship.source.node.keys", "src:id") \
    .option("relationship.target.labels", ":Tech") \
    .option("relationship.target.save.mode", "match") \
    .option("relationship.target.node.keys", "dst:id") \
    .save()

print(f"{tech_edges.count()} CO_OCCURS relationships written to Neo4j")

115 tech nodes written to Neo4j
490 CO_OCCURS relationships written to Neo4j
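
The cell above keeps the Neo4j password in the notebook itself. One possible alternative, sketched here, is to read the connection settings from environment variables and pass them to the writer as a dictionary. The variable names `NEO4J_URL`, `NEO4J_USER`, and `NEO4J_PASSWORD` and the helper `neo4j_options` are hypothetical. Only the option keys come from the cell above.

```python
import os


def neo4j_options(env=os.environ):
    """Collect connector options, reading credentials from the environment.

    NEO4J_URL, NEO4J_USER and NEO4J_PASSWORD are hypothetical variable names.
    """
    return {
        "url": env.get("NEO4J_URL", "bolt://neo4j-iteso:7687"),
        "authentication.basic.username": env.get("NEO4J_USER", "neo4j"),
        # No default on purpose: a missing password should fail loudly.
        "authentication.basic.password": env["NEO4J_PASSWORD"],
    }


# Example with an explicit mapping instead of the real environment.
opts = neo4j_options({"NEO4J_PASSWORD": "example-only"})
print(opts["url"])  # bolt://neo4j-iteso:7687
```

The resulting dictionary could then replace the three repeated `.option(...)` calls, for example `tech_nodes.write.format("org.neo4j.spark.DataSource").options(**opts)`.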


# Read and Query Graphs with PySpark

In [13]:
# Add the code to read a data frame from Neo4j and run a simple query to verify the write

# Simple query: Get all technology tags
tech_tags = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query", """
        MATCH (t:Tech)
        RETURN t.id AS technology_name
        ORDER BY t.id
    """) \
    .load()

print("Technology tags in Neo4j (first 10):")
tech_tags.show(10, truncate=False)

Technology tags in Neo4j (first 10):
+-------------------+
|technology_name    |
+-------------------+
|.net               |
|agile              |
|ajax               |
|amazon-web-services|
|android            |
|android-studio     |
|angular            |
|angular2           |
|angularjs          |
|apache             |
+-------------------+
only showing top 10 rows
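
The query above confirms that the nodes exist but does not touch the relationships. A possible follow-up, sketched below, counts the `CO_OCCURS` relationships so the result can be compared with the 490 edges reported earlier. `read_cypher` and `REL_COUNT_QUERY` are hypothetical names, and the sketch assumes the connector's default single-partition read.

```python
# Cypher that counts the relationships written by this notebook.
REL_COUNT_QUERY = """
    MATCH (:Tech)-[r:CO_OCCURS]->(:Tech)
    RETURN count(r) AS relationships
"""


def read_cypher(spark, query, url, user, password):
    """Run a Cypher query through the Neo4j connector and return a DataFrame."""
    return (
        spark.read.format("org.neo4j.spark.DataSource")
        .option("url", url)
        .option("authentication.basic.username", user)
        .option("authentication.basic.password", password)
        .option("query", query)
        .load()
    )
```

With the session from this notebook, it could be called as `read_cypher(spark, REL_COUNT_QUERY, neo4j_url, neo4j_user, neo4j_passwd).show()`. A count lower than the number of edges would suggest that some endpoints were not matched.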


In [14]:
sc.stop()