# Module 4: Community Detection Algorithms

Communities form in all types of networks, and identifying them is essential for evaluating group behavior and emergent phenomena. The general principle in finding communities is that members of a community have more relationships with each other than with nodes outside their group. Identifying these related sets reveals clusters of nodes, isolated groups, and network structure.

In this notebook we'll learn how to use these algorithms in Spark and Neo4j. Before we get started, let's import the libraries we need:

In [5]:
from pyspark.sql.types import *
from graphframes import *
from neo4j import GraphDatabase
from pyspark.sql import functions as F
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

import pandas as pd

## Connect to Spark and Neo4j

Let's create connections to Spark and Neo4j. The following code will create a SparkContext that we'll use to connect to Spark:

In [1]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

Next, let's create a connection to the Neo4j database:

In [4]:
user = "neo4j"
password = "neo"
driver = GraphDatabase.driver("bolt://data-science-training-neo4j", auth=(user, password))

## The Software Dependency Graph

Dependency graphs are particularly well suited for demonstrating the sometimes subtle differences between community detection algorithms because they tend to be more connected and hierarchical.


### Importing the Data into Apache Spark

In [8]:
def create_software_graph():
    nodes = spark.read.csv("data/sw-nodes.csv", header=True)
    relationships = spark.read.csv("data/sw-relationships.csv", header=True)
    return GraphFrame(nodes, relationships)

In [10]:
g = create_software_graph()

### Importing the Data into Neo4j

In [6]:
user = "neo4j"
password = "neo"
driver = GraphDatabase.driver("bolt://localhost", auth=(user, password))

In [22]:
with driver.session() as session:
    result = session.run("""
    LOAD CSV WITH HEADERS FROM "file:///sw-nodes.csv" AS row
    MERGE (:Library {id: row.id})
    """)
    display(result.summary().counters)
    
    result = session.run("""
    LOAD CSV WITH HEADERS FROM "file:///sw-relationships.csv" AS row
    MATCH (source:Library {id: row.src})
    MATCH (destination:Library {id: row.dst})
    MERGE (source)-[:DEPENDS_ON]->(destination)
    """)
    display(result.summary().counters)

{}

{}

## Triangle Count

In [11]:
result = g.triangleCount()
(result.sort("count", ascending=False)
.filter('count > 0')
.show())

+-----+---------------+
|count|             id|
+-----+---------------+
|    1|            six|
|    1|      ipykernel|
|    1|        jupyter|
|    1|python-dateutil|
|    1|    jpy-console|
|    1|     matplotlib|
+-----+---------------+



A triangle in this graph indicates that two of a node’s neighbors are also neighbors of each other. Six of our libraries participate in such triangles.
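To make the definition concrete, here is a minimal pure-Python sketch of per-node triangle counting. The edge list is a hypothetical toy graph, not the software dependency data, and the function is an illustration rather than how GraphFrames computes `triangleCount()`. Edge direction is ignored.

```python
from itertools import combinations

# Hypothetical toy edge list; direction is ignored when counting triangles.
edges = [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]

def count_triangles(edges):
    """Return {node: number of triangles the node participates in}."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)
    counts = {}
    for node, adjacent in neighbors.items():
        # Each pair of this node's neighbors that is itself connected closes a triangle.
        counts[node] = sum(
            1 for u, v in combinations(sorted(adjacent), 2) if v in neighbors[u]
        )
    return counts

triangle_counts = count_triangles(edges)
print(triangle_counts)
```

Nodes `a`, `b`, and `c` each sit in one triangle, while `d` hangs off `c` and sits in none.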

What if we want to know which nodes are in those triangles? That’s where a triangle stream comes in. 

In [27]:
query = """
CALL algo.triangle.stream("Library","DEPENDS_ON")
YIELD nodeA, nodeB, nodeC
RETURN algo.getNodeById(nodeA).id AS nodeA,
       algo.getNodeById(nodeB).id AS nodeB,
       algo.getNodeById(nodeC).id AS nodeC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | nodeA | nodeB | nodeC |
|---|---|---|---|
| 0 | six | python-dateutil | matplotlib |
| 1 | python-dateutil | six | matplotlib |
| 2 | matplotlib | six | python-dateutil |
| 3 | jupyter | jpy-console | ipykernel |
| 4 | jpy-console | jupyter | ipykernel |
| 5 | ipykernel | jupyter | jpy-console |


We can see these triangles in the diagram below:

![](images/triangles.svg)

In [33]:
query = """
CALL algo.triangleCount.stream('Library', 'DEPENDS_ON')
YIELD nodeId, triangles, coefficient
WHERE coefficient > 0
RETURN algo.getNodeById(nodeId).id AS library, coefficient
ORDER BY coefficient DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | coefficient | library |
|---|---|---|
| 0 | 1.0 | ipykernel |
| 1 | 0.333333 | six |
| 2 | 0.333333 | python-dateutil |
| 3 | 0.333333 | jupyter |
| 4 | 0.333333 | jpy-console |
| 5 | 0.166667 | matplotlib |


ipykernel has a clustering coefficient of 1, which means that all of ipykernel’s neighbors are neighbors of each other.
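The coefficient can be read as the fraction of a node's neighbor pairs that are actually connected. The sketch below uses the standard local clustering coefficient formula, triangles divided by `degree * (degree - 1) / 2`. The degrees passed in are illustrative assumptions chosen to show how one triangle yields different coefficients; they are not values read from the dataset.

```python
def local_clustering_coefficient(triangles, degree):
    """Fraction of possible neighbor pairs that are actually connected."""
    if degree < 2:
        return 0.0  # fewer than two neighbors: no pair can close a triangle
    possible_pairs = degree * (degree - 1) / 2
    return triangles / possible_pairs

# Hypothetical degrees: one triangle spread over 2, 3, and 4 neighbors.
coefficients = {
    degree: round(local_clustering_coefficient(1, degree), 6)
    for degree in (2, 3, 4)
}
print(coefficients)
```

With a single triangle, a node with two neighbors scores 1.0, while more neighbors dilute the score because more neighbor pairs remain unconnected.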

## Strongly Connected Components

The Strongly Connected Components (SCC) algorithm is one of the earliest graph algorithms. SCC finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set.
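As a rough illustration of the definition, the following sketch finds strongly connected components with Kosaraju's two-pass depth-first search. This is a textbook technique chosen for brevity, not necessarily what GraphFrames or Neo4j use internally, and both graphs are hypothetical toy examples.

```python
def strongly_connected_components(graph):
    """Kosaraju's algorithm; graph maps each node to a list of successors."""
    visited, order = set(), []

    def visit(node):
        # First pass: record nodes in order of DFS completion.
        visited.add(node)
        for successor in graph.get(node, []):
            if successor not in visited:
                visit(successor)
        order.append(node)

    for node in graph:
        if node not in visited:
            visit(node)

    # Reverse every edge so the second pass can only stay inside one component.
    reversed_graph = {node: [] for node in graph}
    for node, successors in graph.items():
        for successor in successors:
            reversed_graph.setdefault(successor, []).append(node)

    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.append(node)
        for predecessor in reversed_graph.get(node, []):
            if predecessor not in assigned:
                collect(predecessor, component)

    # Second pass: walk the reversed graph in reverse completion order.
    for node in reversed(order):
        if node not in assigned:
            component = []
            collect(node, component)
            components.append(sorted(component))
    return sorted(components)

cyclic = {"a": ["b"], "b": ["c"], "c": ["a", "d"], "d": []}
acyclic = {"a": ["b"], "b": ["c"], "c": []}
print(strongly_connected_components(cyclic))
print(strongly_connected_components(acyclic))
```

In the cyclic graph, `a`, `b`, and `c` can all reach each other and form one component. In the acyclic graph no node can be reached back from its successors, so every node ends up in a component of its own.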

In [37]:
result = g.stronglyConnectedComponents(maxIter=10)
(result.sort("component")
 .groupby("component")
 .agg(F.collect_list("id").alias("libraries"))
 .show(truncate=False))

+-------------+-----------------+
|component    |libraries        |
+-------------+-----------------+
|180388626432 |[jpy-core]       |
|223338299392 |[spacy]          |
|498216206336 |[numpy]          |
|523986010112 |[six]            |
|549755813888 |[pandas]         |
|558345748480 |[nbconvert]      |
|661424963584 |[ipykernel]      |
|721554505728 |[jupyter]        |
|764504178688 |[jpy-client]     |
|833223655424 |[pytz]           |
|910533066752 |[python-dateutil]|
|936302870528 |[pyspark]        |
|944892805120 |[matplotlib]     |
|1099511627776|[jpy-console]    |
|1279900254208|[py4j]           |
+-------------+-----------------+



In [38]:
query = """
CALL algo.scc.stream("Library", "DEPENDS_ON")
YIELD nodeId, partition
RETURN partition, collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""


with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | libraries | partition |
|---|---|---|
| 0 | [py4j] | 8 |
| 1 | [nbconvert] | 11 |
| 2 | [numpy] | 2 |
| 3 | [pyspark] | 5 |
| 4 | [jpy-core] | 14 |
| 5 | [jpy-client] | 13 |
| 6 | [pytz] | 4 |
| 7 | [spacy] | 7 |
| 8 | [pandas] | 1 |
| 9 | [jpy-console] | 10 |


## Connected Components

In [57]:
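# connectedComponents() checkpoints intermediate results by default, so a checkpoint directory must be set first
sc.setCheckpointDir("/tmp")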
result = g.connectedComponents()
(result.sort("component")
 .groupby("component")
 .agg(F.collect_list("id").alias("libraries"))
 .show(truncate=False))

+------------+------------------------------------------------------------------+
|component   |libraries                                                         |
+------------+------------------------------------------------------------------+
|180388626432|[jpy-core, nbconvert, ipykernel, jupyter, jpy-client, jpy-console]|
|223338299392|[spacy, numpy, six, pandas, pytz, python-dateutil, matplotlib]    |
|936302870528|[pyspark, py4j]                                                   |
+------------+------------------------------------------------------------------+



In [43]:
query = """
CALL algo.unionFind.stream("Library", "DEPENDS_ON")
YIELD nodeId,setId
RETURN setId, collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | libraries | setId |
|---|---|---|
| 0 | [six, pandas, numpy, python-dateutil, pytz, ma... | 1 |
| 1 | [jupyter, jpy-console, nbconvert, ipykernel, j... | 9 |
| 2 | [pyspark, py4j] | 5 |
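The procedure name `algo.unionFind` refers to the union-find (disjoint-set) data structure, a common way to compute connected components. Below is a minimal sketch of that idea on a hypothetical toy graph. It illustrates the technique only and is not the library's implementation.

```python
def connected_components(nodes, edges):
    """Group nodes into connected components, ignoring edge direction."""
    parent = {node: node for node in nodes}  # each node starts as its own set

    def find(node):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_dst] = root_src  # merge the two sets

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(group) for group in groups.values())

# Hypothetical toy graph with two separate groups.
toy_nodes = ["a", "b", "c", "x", "y"]
toy_edges = [("a", "b"), ("b", "c"), ("x", "y")]
components = connected_components(toy_nodes, toy_edges)
print(components)
```

Every edge merges the sets containing its two endpoints, so any two nodes joined by a path end up with the same root regardless of edge direction.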


## Label Propagation

The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. In LPA, nodes select their group based on their direct neighbors. This process is well suited to networks where groupings are less clear and weights can be used to help a node determine which community to place itself within.
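The idea of nodes adopting labels from their direct neighbors can be sketched in a few lines. The version below is a simplified, unweighted, synchronous variant with a deterministic tie-break (keep the current label if it is tied for most frequent, otherwise take the smallest label). Those choices are assumptions made so the example is reproducible; real implementations differ in update order and tie-breaking, which is one reason their results can differ. The edge list is a hypothetical toy graph.

```python
from collections import Counter

def label_propagation(edges, max_iter=10):
    """Simplified synchronous label propagation on an undirected graph."""
    neighbors = {}
    for src, dst in edges:
        neighbors.setdefault(src, set()).add(dst)
        neighbors.setdefault(dst, set()).add(src)

    labels = {node: node for node in neighbors}  # every node starts alone
    for _ in range(max_iter):
        updated = {}
        for node, adjacent in neighbors.items():
            counts = Counter(labels[n] for n in adjacent)
            best = max(counts.values())
            candidates = sorted(l for l, c in counts.items() if c == best)
            # Keep the current label on a tie; otherwise take the smallest candidate.
            updated[node] = labels[node] if labels[node] in candidates else candidates[0]
        if updated == labels:
            break  # no label changed, so the assignment is stable
        labels = updated
    return labels

# Hypothetical toy graph: two triangles joined by a single bridge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
communities = label_propagation(toy_edges)
print(communities)
```

On this toy graph the labels settle after a few rounds into one community per triangle, even though `c` and `d` are direct neighbors.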

In [45]:
result = g.labelPropagation(maxIter=10)
(result
 .sort("label")
 .groupby("label")
 .agg(F.collect_list("id"))
 .show(truncate=False))

+-------------+-----------------------------------+
|label        |collect_list(id)                   |
+-------------+-----------------------------------+
|180388626432 |[jpy-core, jpy-console, jupyter]   |
|223338299392 |[matplotlib, spacy]                |
|498216206336 |[python-dateutil, numpy, six, pytz]|
|549755813888 |[pandas]                           |
|558345748480 |[nbconvert, ipykernel, jpy-client] |
|936302870528 |[pyspark]                          |
|1279900254208|[py4j]                             |
+-------------+-----------------------------------+



Compared to Connected Components, we have more clusters of libraries in this example. LPA is less strict than Connected Components with respect to how it determines clusters. Two neighbors (directly connected nodes) may be found to be in different clusters using Label Propagation.

In [47]:
query = """
CALL algo.labelPropagation.stream("Library", "DEPENDS_ON", { 
 iterations: 10 
})
YIELD nodeId, label
RETURN label, 
       collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | label | libraries |
|---|---|---|
| 0 | 14 | [jupyter, jpy-console, nbconvert, jpy-client, ... |
| 1 | 0 | [six, python-dateutil, matplotlib] |
| 2 | 8 | [pyspark, py4j] |
| 3 | 2 | [numpy, spacy] |
| 4 | 4 | [pandas, pytz] |
| 5 | 12 | [ipykernel] |


## Louvain Modularity

The Louvain Modularity algorithm finds clusters by comparing community density as it assigns nodes to different groups. You can think of this as a “what if” analysis to try various groupings with the goal of reaching a global optimum.
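The quantity Louvain tries to maximize is modularity, which compares the share of edges inside each community with the share expected if edges were placed at random. The sketch below only scores a given grouping on a hypothetical toy graph. It does not implement Louvain's node-moving and aggregation steps, and it treats edges as undirected and unweighted.

```python
def modularity(edges, community_of):
    """Modularity of a partition: sum over communities of
    (internal edges / m) - (total degree / 2m) ** 2."""
    m = len(edges)
    internal, degree_sum = {}, {}
    for src, dst in edges:
        c_src, c_dst = community_of[src], community_of[dst]
        # Each edge contributes one unit of degree to each endpoint's community.
        degree_sum[c_src] = degree_sum.get(c_src, 0) + 1
        degree_sum[c_dst] = degree_sum.get(c_dst, 0) + 1
        if c_src == c_dst:
            internal[c_src] = internal.get(c_src, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2
        for c, d in degree_sum.items()
    )

# Hypothetical toy graph: two triangles joined by a single bridge (c-d).
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"),
             ("c", "d"), ("d", "e"), ("e", "f"), ("d", "f")]
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in split}

score_split = modularity(toy_edges, split)
score_merged = modularity(toy_edges, merged)
print(score_split, score_merged)
```

Grouping each triangle separately scores higher than putting every node in one community, which is the kind of "what if" comparison the algorithm repeats as it tries different groupings.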

In [53]:
query = """
CALL algo.louvain.stream("Library", "DEPENDS_ON", {
  includeIntermediateCommunities: true
})
YIELD nodeId, communities, community
RETURN algo.getNodeById(nodeId).id AS libraries, community, communities
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | communities | community | libraries |
|---|---|---|---|
| 0 | [0, 0] | 0 | six |
| 1 | [1, 0] | 0 | pandas |
| 2 | [2, 0] | 0 | numpy |
| 3 | [0, 0] | 0 | python-dateutil |
| 4 | [1, 0] | 0 | pytz |
| 5 | [3, 1] | 1 | pyspark |
| 6 | [0, 0] | 0 | matplotlib |
| 7 | [2, 0] | 0 | spacy |
| 8 | [3, 1] | 1 | py4j |
| 9 | [4, 2] | 2 | jupyter |


In [55]:
query = """
CALL algo.louvain("Library", "DEPENDS_ON", {
  writeProperty: 'community',
  includeIntermediateCommunities: true,
  intermediateCommunitiesWriteProperty: 'communities'
})
"""

with driver.session() as session:
    result = session.run(query)
    display(result.summary().counters)

{}

In [56]:
query = """
MATCH (l:Library)
RETURN l.communities[-1] AS community, collect(l.id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

| | community | libraries |
|---|---|---|
| 0 | 0 | [six, pandas, numpy, python-dateutil, pytz, ma... |
| 1 | 2 | [jupyter, jpy-console, nbconvert, ipykernel, j... |
| 2 | 1 | [pyspark, py4j] |
