# <center> <img src="../../img/ITESOLogo.png" alt="ITESO" width="480" height="130"> </center>
# <center> **Departamento de Electrónica, Sistemas e Informática** </center>
---
## <center> Computer Systems Engineering  </center>
---
### <center> Big Data Processing </center>
---
#### <center> **Autumn 2025** </center>
---

**Lab 05**: Data pipeline with Neo4j

**Date**: October 2nd 2025

**Student Name**: Luis Roberto Chavez Mancilla

**Professor**: Pablo Camarillo Ramirez

# Dataset description
- This dataset models a graph of technologies built from developer stories on Stack Overflow. Each *Node* is a technology tag (js, java, css, docker) and each *Edge* indicates that two technologies co-occur, that is, they are mentioned together in developer profiles. The edge weight is a continuous value that reflects the strength of that co-occurrence.

#### Schema used:
Nodes:
- `name`: string, the tag name (node key)
- `group`: int, the community identifier
- `nodesize`: double, how frequently the tag is used

Edges:
- `source`: string, name of the source tag
- `target`: string, name of the target tag
- `value`: double, co-occurrence weight between the two tags

How to read these fields:
- A high `value` between css and html means they appear together very often in profiles.
- A high `nodesize` suggests a frequently mentioned technology.
- The `group` values help identify clusters of technologies (for example: data or cloud).
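As a small illustration of how these fields can be read, the sketch below uses a few rows copied from the nodes preview shown later in this notebook. `TechNode`, `most_mentioned`, and `tags_by_group` are hypothetical names introduced only for this example; they are not part of the lab code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class TechNode:
    """One row of the nodes file: tag name, community id, usage frequency."""
    name: str
    group: int
    nodesize: float


# Rows copied from the nodes preview printed later in this notebook.
sample_nodes = [
    TechNode("html", 6, 272.45),
    TechNode("css", 6, 341.17),
    TechNode("hibernate", 8, 29.83),
    TechNode("spring", 8, 52.84),
    TechNode("ruby", 3, 70.14),
    TechNode("ruby-on-rails", 3, 55.31),
]


def most_mentioned(nodes):
    """Return the node with the highest nodesize."""
    return max(nodes, key=lambda n: n.nodesize)


def tags_by_group(nodes):
    """Map each community id to the tag names that belong to it."""
    groups = defaultdict(list)
    for n in nodes:
        groups[n.group].append(n.name)
    return dict(groups)


top = most_mentioned(sample_nodes)
communities = tags_by_group(sample_nodes)
print(top.name, communities)
```

In this sample, tags that share a `group` (html and css, hibernate and spring) are the ones a reader would expect to belong to the same technology cluster.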

# Data ingestion

In [94]:
import findspark
findspark.init()

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Examples on SparkSQL") \
    .master("spark://spark-master:7077") \
    .config("spark.jars.packages", "org.neo4j:neo4j-connector-apache-spark_2.13:5.3.10_for_spark_3") \
    .config("spark.ui.port", "4040") \
    .getOrCreate()

sc = spark.sparkContext
sc.setLogLevel("ERROR")

# Optional optimization: reduce the number of shuffle partitions
# TODO: decide whether to keep this line, and why
# spark.conf.set("spark.sql.shuffle.partitions", "5")

## Generate schema

In [95]:
# Build schema
from robertoman.spark_utils import SparkUtils

nodes_schema = SparkUtils.generate_schema(
    [
        ("name", "string"),
        ("group", "int"),
        ("nodesize", "double"),
    ]
)

links_schema = SparkUtils.generate_schema(
    [
        ("source", "string"),
        ("target", "string"),
        ("value", "double"),
    ]
)

In [96]:
base_path = "/opt/spark/work-dir/data/"

# Read both CSV files with the explicit typed schemas defined above
nodes_df = (
    spark.read.option("header", "true")
    .schema(nodes_schema)
    .csv(base_path + "stackoverflow/stack_network_nodes.csv")
)

links_df = (
    spark.read.option("header", "true")
    .schema(links_schema)
    .csv(base_path + "stackoverflow/stack_network_links.csv")
)

print("----NODES-----")
nodes_df.printSchema()
nodes_df.show(10, truncate=False)

print("----LINKS----")
links_df.printSchema()
links_df.show(10, truncate=False)

----NODES-----
root
 |-- name: string (nullable = true)
 |-- group: integer (nullable = true)
 |-- nodesize: double (nullable = true)




+-------------+-----+--------+
|name         |group|nodesize|
+-------------+-----+--------+
|html         |6    |272.45  |
|css          |6    |341.17  |
|hibernate    |8    |29.83   |
|spring       |8    |52.84   |
|ruby         |3    |70.14   |
|ruby-on-rails|3    |55.31   |
|ios          |4    |87.46   |
|swift        |4    |63.62   |
|html5        |6    |140.18  |
|c            |1    |189.83  |
+-------------+-----+--------+
only showing top 10 rows
----LINKS----
root
 |-- source: string (nullable = true)
 |-- target: string (nullable = true)
 |-- value: double (nullable = true)

+----------------+------+------------------+
|source          |target|value             |
+----------------+------+------------------+
|azure           |.net  |20.933192346640457|
|sql-server      |.net  |32.322524219339904|
|asp.net         |.net  |48.40702996199019 |
|entity-framework|.net  |24.37090250532431 |
|wpf             |.net  |32.35092522005943 |
|linq            |.net  |20.501743858149066|
|wc

                                                                                

# Transformations
- In our case, the dataset we chose was already structured as a graph in two files: `stack_network_nodes.csv` for the nodes, with the columns `name`, `group`, and `nodesize`, and `stack_network_links.csv` for the edges, with the columns `source`, `target`, and `value` (the co-occurrence weight).
- Therefore there is no need to derive the relationships between nodes and edges as we did in the Lecture 13 example.
- The only "transformations" I apply are the correct types enforced by the schemas. This gives the two required DataFrames, one for nodes and one for edges, with correct types and no inconsistencies, without having to build relationships from other tables, since the Stack Overflow dataset is already split as a graph.
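The claim that the two files are consistent can be checked before writing. The relationship write later in this notebook uses the `match` save mode for both endpoints, so an edge whose `source` or `target` is missing from the nodes file would likely not be created. The following is a minimal driver-side sketch, assuming the files are small enough to read with the standard `csv` module; `dangling_endpoints` is a hypothetical helper and the inline rows (including `unknown-tag`) are illustrative, not the real files.

```python
import csv
import io


def dangling_endpoints(nodes_csv, links_csv):
    """Return edge endpoints that do not appear as a node name."""
    node_names = {row["name"] for row in csv.DictReader(io.StringIO(nodes_csv))}
    missing = set()
    for row in csv.DictReader(io.StringIO(links_csv)):
        for endpoint in (row["source"], row["target"]):
            if endpoint not in node_names:
                missing.add(endpoint)
    return missing


# Illustrative rows only; "unknown-tag" is deliberately absent from the nodes.
nodes_csv = "name,group,nodesize\nhtml,6,272.45\ncss,6,341.17\n"
links_csv = "source,target,value\ncss,html,126.57\ncss,unknown-tag,1.0\n"

missing = dangling_endpoints(nodes_csv, links_csv)
print(missing)
```

An empty result for the real files would support skipping any extra transformation step. The same check could be expressed on the Spark DataFrames with an anti-join of `links_df` against `nodes_df`.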

In [97]:
# No additional transformations are needed: nodes_df and links_df are already in graph form

# Writing Data in Neo4j

In [98]:
# Write the graph from the PySpark DataFrames to Neo4j
# Neo4j connection settings
neo4j_url = "bolt://neo4j-iteso:7687"
neo4j_user = "neo4j"
neo4j_passwd = "neo4j@1234"

# Nodes: label Tech, keyed by name
nodes_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("labels", ":Tech") \
  .option("node.keys", "name") \
  .save()

print(f"{nodes_df.count()} Tech nodes written to Neo4j")

# Edges: TECH_NEIGHBOUR relationship for each co-occurrence (source -> target), with value stored as a property
links_df.write \
  .format("org.neo4j.spark.DataSource") \
  .mode("Overwrite") \
  .option("url", neo4j_url) \
  .option("authentication.basic.username", neo4j_user) \
  .option("authentication.basic.password", neo4j_passwd) \
  .option("relationship", "TECH_NEIGHBOUR") \
  .option("relationship.save.strategy", "keys") \
  .option("relationship.source.labels", ":Tech") \
  .option("relationship.source.save.mode", "match") \
  .option("relationship.source.node.keys", "source:name") \
  .option("relationship.target.labels", ":Tech") \
  .option("relationship.target.save.mode", "match") \
  .option("relationship.target.node.keys", "target:name") \
  .save()

print(f"{links_df.count()} relationships written to Neo4j")

115 Tech nodes written to Neo4j
490 relationships written to Neo4j


# Read and Query Graphs with PySpark

In [99]:
# Read a DataFrame back from Neo4j and run a simple query to verify the write
top_pairs_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query", """
        MATCH (a:Tech)-[r:TECH_NEIGHBOUR]-(b:Tech)
        WHERE a.name < b.name
        RETURN a.name AS tech_a, b.name AS tech_b, r.value AS weight
        ORDER BY weight DESC
    """) \
    .load()

top_pairs_df.show(truncate=False)


+-----------+-------------+------------------+
|tech_a     |tech_b       |weight            |
+-----------+-------------+------------------+
|css        |html         |126.57112712972764|
|css        |html         |126.57112712972764|
|hibernate  |spring       |103.26828446355263|
|hibernate  |spring       |103.26828446355263|
|ruby       |ruby-on-rails|95.36131071220332 |
|ruby       |ruby-on-rails|95.36131071220332 |
|ios        |swift        |87.21964246099864 |
|ios        |swift        |87.21964246099864 |
|css        |html5        |87.13826986156899 |
|css        |html5        |87.13826986156899 |
|c          |c++          |80.89104614147385 |
|c          |c++          |80.89104614147385 |
|asp.net    |c#           |80.4485421720991  |
|asp.net    |c#           |80.4485421720991  |
|objective-c|swift        |79.08853577916759 |
|objective-c|swift        |79.08853577916759 |
|ios        |objective-c  |78.75928046651394 |
|ios        |objective-c  |78.75928046651394 |
|css        |
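
Each pair appears twice in the result above. One possible explanation, assumed here rather than verified against the file, is that the links CSV stores every co-occurrence in both directions, so the undirected `MATCH` returns two relationships per pair. Below is a minimal sketch of collapsing such symmetric edges into one row per unordered pair. `dedupe_undirected` is a hypothetical helper; the weights are taken from the output above, and the reversed rows are illustrative.

```python
def dedupe_undirected(edges):
    """Keep one (a, b, value) row per unordered pair, with a <= b."""
    seen = {}
    for source, target, value in edges:
        key = tuple(sorted((source, target)))
        # Keep the first weight seen for the pair; later duplicates are ignored.
        seen.setdefault(key, value)
    return [(a, b, v) for (a, b), v in seen.items()]


# Weights from the query output; the reversed direction is an assumption.
edges = [
    ("css", "html", 126.57112712972764),
    ("html", "css", 126.57112712972764),
    ("hibernate", "spring", 103.26828446355263),
    ("spring", "hibernate", 103.26828446355263),
]

unique_edges = dedupe_undirected(edges)
print(unique_edges)
```

If the assumption holds, a directed pattern in the Cypher query (`-[r:TECH_NEIGHBOUR]->`) combined with the existing `a.name < b.name` filter would be another way to get one row per pair.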

In [None]:
tech = "javascript"

neighbors_df = spark.read \
    .format("org.neo4j.spark.DataSource") \
    .option("url", neo4j_url) \
    .option("authentication.basic.username", neo4j_user) \
    .option("authentication.basic.password", neo4j_passwd) \
    .option("query", f"""
        MATCH (:Tech {{name: '{tech}'}})-[r:TECH_NEIGHBOUR]-(n:Tech)
        RETURN n.name AS neighbor, r.value AS weight
        ORDER BY weight DESC
    """) \
    .load()

neighbors_df.show(truncate=False)


+---------+------------------+
|neighbor |weight            |
+---------+------------------+
|css      |75.53660009612221 |
|css      |75.53660009612221 |
|html     |59.75548884052987 |
|html     |59.75548884052987 |
|jquery   |57.84183152642191 |
|jquery   |57.84183152642191 |
|php      |47.3281575555596  |
|php      |47.3281575555596  |
|html5    |47.00636375705097 |
|html5    |47.00636375705097 |
|node.js  |42.73172932305638 |
|node.js  |42.73172932305638 |
|angularjs|39.37662666227728 |
|angularjs|39.37662666227728 |
|reactjs  |33.56735910485145 |
|reactjs  |33.56735910485145 |
|ajax     |24.39914442262329 |
|ajax     |24.39914442262329 |
|sass     |23.782469883653217|
|sass     |23.782469883653217|
+---------+------------------+
only showing top 20 rows


In [101]:
sc.stop()