# Working with Graph Data in Spark

In this chapter we introduce another form of data that is increasingly relevant to data analysis applications: graph data.



## What is a Graph?

![](graphics/third-party/graph-example.png) _Source: [Wilson Mar: Graph Database Introduction](https://wilsonmar.github.io/neo4j/)_

A **graph** is a mathematical representation of a **network**: A set of **nodes** (or **vertices**) connected by a set of **edges** (or **links**). Nodes can represent any kind of entity, while edges represent relationships between entities. Both nodes and edges can have attached attributes. 

To introduce the concept, let's have a look at some graph data representing a **social network**.

In [None]:
node_data = [
  ("a", "Alice", 34),
  ("b", "Bob", 36),
  ("c", "Charlie", 30),
  ("d", "David", 29),
  ("e", "Esther", 32),
  ("f", "Fanny", 36),
  ("g", "Gabby", 60), 
]

In [None]:
edge_data = [
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend"),
  ("g", "a", "follow")
]

For this small example, where performance does not matter, we use `pandas`, `matplotlib` and the **graph library** [`NetworkX`](http://networkx.github.io) to assemble and visualize the graph.

In [None]:
import pandas

In [None]:
node_frame = pandas.DataFrame(
    node_data,
    columns=["id", "name", "age"],
).set_index("id")
node_frame

In [None]:
edge_frame = pandas.DataFrame(
    edge_data,
    columns=["source", "target", "relationship"]
)
edge_frame

Now we construct the graph using a graph data structure from `NetworkX`. The data contains a **directed graph** - an edge has a specific direction from a source to a target node. **Undirected** graphs, in which the direction of the relationship does not matter, also exist.

In [None]:
import networkx

In [None]:
graph = networkx.DiGraph() # directed graph

In [None]:
for i, row in node_frame.iterrows():
    graph.add_node(i, **row)

In [None]:
for _, row in edge_frame.iterrows():
    graph.add_edge(row["source"], row["target"], relationship=row["relationship"])
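
To make the difference between directed and undirected graphs concrete, here is a minimal sketch in plain Python that uses a few edges of the example network. The helper names `has_directed_edge` and `has_undirected_edge` are hypothetical and exist only for this illustration; `NetworkX` covers the same distinction with its `DiGraph` and `Graph` classes.

```python
# A few (source, target) pairs taken from the example social network
edges = {("a", "b"), ("b", "c"), ("c", "b"), ("g", "a")}


def has_directed_edge(source, target):
    """In a directed graph, an edge only exists in the stored direction."""
    return (source, target) in edges


def has_undirected_edge(u, v):
    """In an undirected view, the order of the two endpoints does not matter."""
    return (u, v) in edges or (v, u) in edges


print(has_directed_edge("g", "a"))    # Gabby follows Alice
print(has_directed_edge("a", "g"))    # Alice does not follow Gabby back
print(has_undirected_edge("a", "g"))  # ignoring direction, the two are connected
```
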

In [None]:
import matplotlib.pyplot as plt

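Next, a graph layout algorithm (here the force-directed spring layout) computes positions for the nodes so that the drawing is well arranged and readable. The layout is randomized, so the picture can look different on every run.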

In [None]:
# Compute node positions, then draw nodes, labels and directed edges in separate layers
pos = networkx.spring_layout(graph)
networkx.draw_networkx_nodes(graph, pos, node_size=500)
networkx.draw_networkx_labels(graph, pos)
networkx.draw_networkx_edges(graph, pos, arrows=True)
plt.show()

## Graph Data in Spark with GraphFrames

While libraries like `NetworkX` handle graphs that fit into the memory of a single machine, interesting graphs can have billions of nodes and edges: Consider for example the **web graph** of all web pages connected by hyperlinks. To deal with such massive graphs, distributed systems for graph processing have been developed, some of them built on top of Spark.

[**GraphX**](http://spark.apache.org/graphx/) is the official Apache Spark component for handling graph data. As of now, however, the GraphX API is not available in PySpark. We therefore rely on an external package, [**`GraphFrames`**](https://github.com/graphframes/graphframes), which aims to combine the advantages of Spark DataFrames and GraphX algorithms.

In [None]:
import findspark
findspark.init()

In [None]:
import pyspark

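When creating the Spark SQL session, we configure PySpark to load GraphFrames as an external package. The package coordinate encodes the Spark and Scala versions it was built for, so it has to match the local Spark installation.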

In [None]:
spark = pyspark.sql.SparkSession \
                    .builder \
                    .appName("Graph Frames Test") \
                    .config("spark.jars.packages", "graphframes:graphframes:0.8.0-spark3.0-s_2.12") \
                    .getOrCreate()

In [None]:
import graphframes

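A graph can now be assembled from two DataFrames: One for the set of nodes (with attributes) and one for the set of edges (defined by source node, target node and attributes). `GraphFrames` expects the node DataFrame to contain a column named `id` and the edge DataFrame to contain columns named `src` and `dst`, which is why the edge columns are named differently here than in the `pandas` example.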

In [None]:
node_frame = spark.createDataFrame(
    node_data, 
    ["id", "name", "age"]
)

In [None]:
edge_frame = spark.createDataFrame(
    edge_data,
    ["src", "dst", "relationship"]
)

In [None]:
graph = graphframes.GraphFrame(node_frame, edge_frame)

The `vertices` and `edges` attributes point to regular Spark SQL DataFrames.

In [None]:
graph.vertices.show()

In [None]:
graph.edges.show()

In [None]:
graph.edges.filter("relationship = 'follow'").count()

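`GraphFrames` also exposes a set of parallelized [graph algorithms](https://graphframes.github.io/graphframes/docs/_site/user-guide.html). Take for instance the calculation of **node degree** - the number of edges attached to a node, counted for incoming edges (`inDegrees`), outgoing edges (`outDegrees`) or both (`degrees`). Nodes without any incoming (or outgoing) edges do not appear in the respective result.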

In [None]:
graph.inDegrees.show()

In [None]:
graph.outDegrees.show()

In [None]:
graph.degrees.show()
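
What the three degree attributes compute can be reproduced on the small example in plain Python. This is only a sketch of the counting logic, assuming the edge list from the beginning of the chapter; it says nothing about how `GraphFrames` distributes the computation.

```python
from collections import Counter

# Same edges as in the example above: (source, target, relationship)
edge_data = [
    ("a", "b", "friend"),
    ("b", "c", "follow"),
    ("c", "b", "follow"),
    ("f", "c", "follow"),
    ("e", "f", "follow"),
    ("e", "d", "friend"),
    ("d", "a", "friend"),
    ("a", "e", "friend"),
    ("g", "a", "follow"),
]

# In-degree: how often a node appears as target; out-degree: how often as source
in_degrees = Counter(dst for _, dst, _ in edge_data)
out_degrees = Counter(src for src, _, _ in edge_data)
# Total degree is the sum of both counts
degrees = in_degrees + out_degrees

for node in sorted(degrees):
    print(node, in_degrees[node], out_degrees[node], degrees[node])
```

Gabby (`g`) has no incoming edges and is therefore missing from `in_degrees`, mirroring the behaviour of `inDegrees` described above.
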

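Another frequently needed family of algorithms deals with **traversal** or **search** of the graph. The `bfs` method performs **breadth-first search**: Starting from the nodes matching `fromExpr`, it looks for the shortest paths to nodes matching `toExpr`, optionally restricted by an edge filter and a maximum path length.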

In [None]:
# bfs() returns a DataFrame of paths; show() returns None, so call it separately
filtered_paths = graph.bfs(
    fromExpr="name = 'Esther'",
    toExpr="age < 32",
    edgeFilter="relationship != 'friend'",
    maxPathLength=3,
)
filtered_paths.show()
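
The idea behind the search can be sketched in plain Python on the example data. This is a simplified illustration, not the `GraphFrames` implementation: The function `bfs_path` and its parameters are hypothetical names, and it returns only the first shortest path it finds.

```python
from collections import deque

# Node id -> (name, age) and (source, target, relationship) as in the example above
nodes = {
    "a": ("Alice", 34), "b": ("Bob", 36), "c": ("Charlie", 30), "d": ("David", 29),
    "e": ("Esther", 32), "f": ("Fanny", 36), "g": ("Gabby", 60),
}
edges = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("g", "a", "follow"),
]


def bfs_path(start, is_target, edge_ok, max_path_length):
    """Return one shortest path (list of node ids) from start to a target node, or None."""
    # Build an adjacency list from the edges that pass the edge filter
    adjacency = {}
    for src, dst, relationship in edges:
        if edge_ok(relationship):
            adjacency.setdefault(src, []).append(dst)

    queue = deque([[start]])  # queue of partial paths, shortest first
    visited = {start}
    while queue:
        path = queue.popleft()
        if is_target(path[-1]):
            return path
        if len(path) - 1 >= max_path_length:
            continue  # do not extend paths beyond the allowed number of edges
        for neighbour in adjacency.get(path[-1], []):
            if neighbour not in visited:
                visited.add(neighbour)
                queue.append(path + [neighbour])
    return None


path = bfs_path(
    start="e",                                 # Esther
    is_target=lambda n: nodes[n][1] < 32,      # age < 32
    edge_ok=lambda rel: rel != "friend",       # ignore friend edges
    max_path_length=3,
)
print([nodes[n][0] for n in path])  # ['Esther', 'Fanny', 'Charlie']
```
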


Among the more complex graph analytics algorithms we find **PageRank** - the algorithm that made Google's search engine famous. It outputs a **centrality score** for each node, quantifying the importance of nodes by their position in the network.

In [None]:
%%time
result = graph.pageRank(resetProbability=0.15,  maxIter=5)
result.vertices.show()
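
To give an intuition for what happens in each iteration, here is a minimal plain-Python sketch of the textbook PageRank iteration on the example edges. It is an illustration under simplifying assumptions, not the `GraphFrames` implementation: The function name `pagerank` is hypothetical, scores are normalized to sum to 1, and the absolute values need not match the output above.

```python
def pagerank(nodes, edges, reset_probability=0.15, max_iter=5):
    """Textbook PageRank by repeated redistribution of rank along outgoing edges."""
    n = len(nodes)
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    ranks = {node: 1.0 / n for node in nodes}  # start with a uniform distribution
    for _ in range(max_iter):
        # Every node receives the same base share from the random reset
        new_ranks = {node: reset_probability / n for node in nodes}
        for src, targets in out_links.items():
            if targets:
                # Split the remaining rank evenly among the outgoing edges
                share = (1 - reset_probability) * ranks[src] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # Nodes without outgoing edges spread their rank over all nodes
                share = (1 - reset_probability) * ranks[src] / n
                for node in nodes:
                    new_ranks[node] += share
        ranks = new_ranks
    return ranks


nodes = ["a", "b", "c", "d", "e", "f", "g"]
edges = [
    ("a", "b"), ("b", "c"), ("c", "b"), ("f", "c"), ("e", "f"),
    ("e", "d"), ("d", "a"), ("a", "e"), ("g", "a"),
]

ranks = pagerank(nodes, edges)
for node, score in sorted(ranks.items(), key=lambda item: -item[1]):
    print(node, round(score, 3))
```

Gabby (`g`) has no incoming edges, so her score stays at the base share `reset_probability / n`, the lowest in the graph.
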

We also get an API for filtering the graph data by attributes, for example for creating a **subgraph** of specific nodes and edges.

In [None]:
old_friends_graph = graph\
    .filterEdges("relationship = 'friend'")\
    .filterVertices("age > 30")\
    .dropIsolatedVertices()

In [None]:
old_friends_graph.edges.show()
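
The three filtering steps can be traced by hand on the example data. The following plain-Python sketch assumes, as the `GraphFrames` documentation describes, that `filterVertices` also removes edges touching a dropped node and that `dropIsolatedVertices` removes nodes left without any edge; the variable names are chosen for this illustration only.

```python
node_data = [
    ("a", "Alice", 34), ("b", "Bob", 36), ("c", "Charlie", 30), ("d", "David", 29),
    ("e", "Esther", 32), ("f", "Fanny", 36), ("g", "Gabby", 60),
]
edge_data = [
    ("a", "b", "friend"), ("b", "c", "follow"), ("c", "b", "follow"),
    ("f", "c", "follow"), ("e", "f", "follow"), ("e", "d", "friend"),
    ("d", "a", "friend"), ("a", "e", "friend"), ("g", "a", "follow"),
]

# Step 1: keep only "friend" edges
friend_edges = [edge for edge in edge_data if edge[2] == "friend"]

# Step 2: keep only nodes older than 30 and drop edges that lost an endpoint
old_ids = {node_id for node_id, _, age in node_data if age > 30}
kept_edges = [
    (src, dst, rel) for src, dst, rel in friend_edges
    if src in old_ids and dst in old_ids
]

# Step 3: drop nodes that are not part of any remaining edge
connected = {node_id for src, dst, _ in kept_edges for node_id in (src, dst)}
kept_nodes = sorted(old_ids & connected)

print(kept_edges)  # [('a', 'b', 'friend'), ('a', 'e', 'friend')]
print(kept_nodes)  # ['a', 'b', 'e']
```
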

## Open-ended Exercise: Understanding the GitHub Developers Network

As an exercise, consider the following real-world graph data set: The network of GitHub accounts, with edges showing who follows whom.

- Use Spark SQL and GraphFrames to parse the CSV data into a `GraphFrame` object
- Perform exploratory data analysis to understand the properties of the graph
- Can you determine which account has the largest followership?
- Who is the most important developer on GitHub? And by which measure?
- ...

In [None]:
data_path = "../.assets/data/github_network"

In [None]:
ls {data_path}

## References

- [Graph Analysis with GraphFrames](https://docs.databricks.com/spark/latest/graph-analysis/graphframes/graph-analysis-tutorial.html)

---
_This notebook is licensed under a [Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0)](https://creativecommons.org/licenses/by-nc-sa/4.0/). Copyright © 2018-2025 [Point 8 GmbH](https://point-8.de)_