# Module 3: Centrality Algorithms



<img src="images/Centrality-Algo-Icon.png" alt="Centrality" width="120" style="float:right"/>

Centrality algorithms are used to understand the roles of particular nodes in a graph and their impact on that network. They’re useful because they identify the most important nodes and help us understand group dynamics such as credibility, accessibility, the speed at which things spread, and bridges between groups.

In this notebook we'll learn how to use these algorithms in Spark and Neo4j. Before we get started, let's import the libraries we need:

In [1]:
from pyspark.sql.types import *
from graphframes import *
from neo4j import GraphDatabase
from pyspark.sql import functions as F
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

import pandas as pd

## Connect to Spark and Neo4j

Let's create connections to Spark and Neo4j. The following code creates a SparkContext and wraps it in a SparkSession, which we'll use to work with Spark:

In [2]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

Next, let's create a connection to the Neo4j database:

In [3]:
user = "neo4j"
password = "neo"
driver = GraphDatabase.driver("bolt://graph-algorithms-in-practice-neo4j", auth=(user, password))

## The Social Graph

The examples in this notebook are run against a small Twitter-like graph.

### Importing the Data into Apache Spark

In [4]:
def create_social_graph():
    v = spark.read.csv("../data/social-nodes.csv", header=True)
    e = spark.read.csv("../data/social-relationships.csv", header=True)
    return GraphFrame(v, e)

In [5]:
g = create_social_graph()

### Importing the Data into Neo4j

In [6]:
with driver.session() as session:    
    result = session.run("""
    WITH "file:///social-nodes.csv" AS uri
    LOAD CSV WITH HEADERS FROM uri AS row
    MERGE (:User {id: row.id})
    """)
    display(result.summary().counters)
    
    result = session.run("""
    WITH "file:///social-relationships.csv" AS uri
    LOAD CSV WITH HEADERS FROM uri AS row
    MATCH (source:User {id: row.src})
    MATCH (destination:User {id: row.dst})
    MERGE (source)-[:FOLLOWS]->(destination)
    """)
    display(result.summary().counters)

{'labels_added': 9, 'nodes_created': 9, 'properties_set': 9}

{'relationships_created': 16}

## Degree Centrality

Degree Centrality is the simplest of the centrality algorithms. It counts the number of incoming and outgoing relationships to and from a node, and is used to find popular nodes in a graph.

In [7]:
total_degree = g.degrees
in_degree = g.inDegrees
out_degree = g.outDegrees
(total_degree.join(in_degree, "id", how="left")
 .join(out_degree, "id", how="left")
 .fillna(0)
 .sort("inDegree", ascending=False)
 .show())

+-------+------+--------+---------+
|     id|degree|inDegree|outDegree|
+-------+------+--------+---------+
|   Doug|     6|       5|        1|
|  Alice|     7|       3|        4|
|Bridget|     5|       2|        3|
|Michael|     5|       2|        3|
|   Mark|     3|       1|        2|
|Charles|     2|       1|        1|
|  David|     2|       1|        1|
|    Amy|     1|       1|        0|
|  James|     1|       0|        1|
+-------+------+--------+---------+



Doug is the most popular user in our Twitter graph, with five followers (in-links). All other users in that part of the graph follow him and he only follows one person back. In the real Twitter network, celebrities have high follower counts but tend to follow few people. We could therefore consider Doug a celebrity!
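The counting behind these three columns can be sketched in plain Python. This is only an illustration on a hypothetical edge list (the names `Ann`, `Bob`, and `Cat` are made up and are not the notebook's data); it is not how GraphFrames computes degrees internally.

```python
from collections import Counter

# Hypothetical (src, dst) "follows" pairs, not the notebook's CSV data.
edges = [("Ann", "Bob"), ("Cat", "Bob"), ("Bob", "Ann"), ("Cat", "Ann")]

# Out-degree counts how often a node appears as a source,
# in-degree how often it appears as a destination.
out_degree = Counter(src for src, _ in edges)
in_degree = Counter(dst for _, dst in edges)
nodes = sorted({node for edge in edges for node in edge})

degree_table = {
    node: {
        "inDegree": in_degree[node],
        "outDegree": out_degree[node],
        "degree": in_degree[node] + out_degree[node],
    }
    for node in nodes
}

# List nodes from most to least followed.
for name, row in sorted(degree_table.items(), key=lambda kv: -kv[1]["inDegree"]):
    print(name, row)
```

A `Counter` returns 0 for nodes that never appear on one side of a relationship, which plays the same role as `fillna(0)` in the GraphFrames cell above.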

## Closeness Centrality

Closeness Centrality is a way of detecting nodes that are able to spread information efficiently through a subgraph.
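To make the idea concrete, here is a minimal sketch of one common definition: the number of nodes a node can reach divided by the sum of the shortest-path distances to them. It assumes an unweighted graph whose relationships are treated as undirected, and it runs on a hypothetical toy graph rather than the social graph. The helper names `bfs_distances` and `closeness` are illustrative and are not part of any library used in this notebook.

```python
from collections import deque

def bfs_distances(adjacency, start):
    """Return hop counts from start to every node reachable from it."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def closeness(adjacency, node):
    """Reachable-node count divided by the total distance to those nodes."""
    dist = bfs_distances(adjacency, node)
    reachable = len(dist) - 1          # exclude the node itself
    total = sum(dist.values())
    return reachable / total if total else 0.0

# Hypothetical undirected graph: a chain a-b-c plus a separate pair x-y.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
scores = {node: closeness(toy, node) for node in toy}
print(scores)
```

In this sketch `x` and `y` both receive the maximum score of 1.0 even though they belong to a tiny two-node group, because the score only considers the nodes each one can actually reach.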

In [8]:
query = """
CALL algo.closeness.stream("User", "FOLLOWS")
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).id AS user, centrality
ORDER BY centrality DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | centrality | user |
|---|------------|------|
| 0 | 1.0 | Alice |
| 1 | 1.0 | Doug |
| 2 | 1.0 | David |
| 3 | 0.714286 | Bridget |
| 4 | 0.714286 | Michael |
| 5 | 0.666667 | Amy |
| 6 | 0.666667 | James |
| 7 | 0.625 | Charles |
| 8 | 0.625 | Mark |


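Alice, Doug, and David all score 1.0 even though David sits in the smaller subgraph, because each score reflects closeness only within the node's own subgraph. Ideally we’d like to get an indication of closeness across the whole graph, and the variation of the Closeness Centrality algorithm described next does this.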

Stanley Wasserman and Katherine Faust developed an improved formula for calculating closeness in graphs that contain multiple subgraphs with no connections between them.

In [24]:
query = """
CALL algo.closeness.stream("User", "FOLLOWS", {improved: true})
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).id AS user, centrality
ORDER BY centrality DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | centrality | user |
|---|------------|------|
| 0 | 0.5 | Doug |
| 1 | 0.5 | Alice |
| 2 | 0.357143 | Michael |
| 3 | 0.357143 | Bridget |
| 4 | 0.3125 | Charles |
| 5 | 0.3125 | Mark |
| 6 | 0.125 | David |
| 7 | 0.083333 | Amy |
| 8 | 0.083333 | James |


The results are now more representative of the closeness of nodes to the entire graph.
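The idea behind the improved formula can be sketched as follows: the within-group closeness is scaled by the fraction of the whole graph that a node can reach, so members of small, isolated groups are penalized. This is a minimal sketch of the Wasserman and Faust formula as it is usually stated, assuming an unweighted graph treated as undirected and a hypothetical toy graph. It is not Neo4j's implementation and is not expected to reproduce the exact numbers above; `wf_closeness` is an illustrative name.

```python
from collections import deque

def wf_closeness(adjacency, node):
    """Closeness scaled by the share of the graph that node can reach."""
    # Breadth-first search for hop counts to every reachable node.
    dist = {node: 0}
    queue = deque([node])
    while queue:
        current = queue.popleft()
        for neighbor in adjacency[current]:
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)

    reachable = len(dist) - 1            # nodes reached, excluding node itself
    total_distance = sum(dist.values())
    others = len(adjacency) - 1          # all other nodes in the graph
    if reachable == 0 or others == 0:
        return 0.0
    # (fraction of graph reached) * (reachable nodes / total distance)
    return (reachable / others) * (reachable / total_distance)

# Hypothetical undirected graph: a chain a-b-c plus a separate pair x-y.
toy = {"a": ["b"], "b": ["a", "c"], "c": ["b"], "x": ["y"], "y": ["x"]}
wf_scores = {node: wf_closeness(toy, node) for node in toy}
print(wf_scores)
```

In this toy graph `b` and `x` would both score 1.0 without the scaling factor. With it, `b` (which reaches two of the four other nodes) scores 0.5 while `x` (which reaches one of four) drops to 0.25.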

## Betweenness Centrality

Sometimes the most important cog in the system is not the one with the most overt power or the highest status. Sometimes it’s the middlemen who connect groups, or the brokers who have the most control over resources or the flow of information.

In [26]:
query = """
CALL algo.betweenness.stream("User", "FOLLOWS")
YIELD nodeId, centrality
RETURN algo.getNodeById(nodeId).id AS user, centrality
ORDER BY centrality DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | centrality | user |
|---|------------|------|
| 0 | 10.0 | Alice |
| 1 | 7.0 | Doug |
| 2 | 7.0 | Mark |
| 3 | 1.0 | David |
| 4 | 0.0 | Charles |
| 5 | 0.0 | Michael |
| 6 | 0.0 | Amy |
| 7 | 0.0 | James |
| 8 | 0.0 | Bridget |


Alice is the main broker in this network, but Mark and Doug aren’t far behind. In the smaller subgraph all shortest paths go through David, so he is important for information flow among those nodes.
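A betweenness score counts how many shortest paths between other pairs of nodes pass through a node, with each path weighted by one over the number of equally short alternatives. One well-known way to compute this is Brandes' algorithm. The following is a minimal sketch for an unweighted, directed graph on a hypothetical toy graph. Whether the Neo4j procedure works this way internally is not something this notebook states, so treat the code as an illustration only.

```python
from collections import deque

def betweenness(adjacency):
    """Brandes-style betweenness for an unweighted, directed graph."""
    scores = {v: 0.0 for v in adjacency}
    for source in adjacency:
        order = []                                # nodes in BFS visiting order
        preds = {v: [] for v in adjacency}        # predecessors on shortest paths
        sigma = dict.fromkeys(adjacency, 0)       # number of shortest paths from source
        sigma[source] = 1
        dist = {source: 0}
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adjacency[v]:
                if w not in dist:                 # first time w is reached
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:        # v lies on a shortest path to w
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Walk back from the farthest nodes, accumulating path dependencies.
        delta = dict.fromkeys(adjacency, 0.0)
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                scores[w] += delta[w]
    return scores

# Hypothetical directed graph: s reaches t through either a or b, and t leads to u.
toy = {"s": ["a", "b"], "a": ["t"], "b": ["t"], "t": ["u"], "u": []}
bc = betweenness(toy)
print(bc)
```

Here `t` is the broker: every path into `u` runs through it, so it scores 3.0. `a` and `b` each carry half of the two shortest paths from `s` to `t` and from `s` to `u`, giving them 1.0 each.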

## PageRank

PageRank is the best known of the centrality algorithms. It measures the transitive (or directional) influence of nodes. All the other centrality algorithms we discuss measure the direct influence of a node, whereas PageRank considers the influence of a node’s neighbors, and their neighbors.

In [25]:
results = g.pageRank(resetProbability=0.15, maxIter=20)
results.vertices.sort("pagerank", ascending=False).show()

+-------+-------------------+
|     id|           pagerank|
+-------+-------------------+
|   Doug| 2.2865372087512252|
|   Mark| 2.1424484186137263|
|  Alice|  1.520330830262095|
|Michael| 0.7274429252585624|
|Bridget| 0.7274429252585624|
|Charles| 0.5213852310709753|
|    Amy| 0.5097143486157744|
|  David|0.36655842368870073|
|  James| 0.1981396884803788|
+-------+-------------------+



As we might expect, Doug has the highest PageRank because he is followed by all other users in his subgraph. Although Mark only has one follower, that follower is Doug, so Mark is also considered important in this graph. It’s not only the number of followers that is important, but also the importance of those followers.
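The way importance flows from a follower to the accounts they follow can be sketched with a simple power iteration. This is a minimal sketch on a hypothetical graph, assuming the textbook normalized formulation in which scores sum to 1 and rank held by nodes with no outgoing links is shared evenly. It is not the GraphFrames implementation, whose scores are on a different scale. The parameter names `reset_probability` and `max_iter` simply mirror the call above.

```python
def pagerank(adjacency, reset_probability=0.15, max_iter=20):
    """Normalized PageRank by power iteration over a directed graph."""
    nodes = list(adjacency)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(max_iter):
        # Rank held by nodes with no outgoing links is shared evenly.
        dangling = sum(rank[v] for v in nodes if not adjacency[v])
        base = reset_probability / n + (1 - reset_probability) * dangling / n
        new_rank = {v: base for v in nodes}
        # Each node passes an equal share of its rank to the nodes it points to.
        for v in nodes:
            for w in adjacency[v]:
                new_rank[w] += (1 - reset_probability) * rank[v] / len(adjacency[v])
        rank = new_rank
    return rank

# Hypothetical graph: two fans follow "star", and "star" follows "friend".
toy = {"fan1": ["star"], "fan2": ["star"], "star": ["friend"], "friend": []}
ranks = pagerank(toy)
for name, score in sorted(ranks.items(), key=lambda kv: -kv[1]):
    print(name, round(score, 4))
```

In this sketch `friend` has only one follower, but because that follower is `star`, `friend` ends up ranked above both fans. This is the same effect that lifts Mark in the results above.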