# Module 4: Community Detection Algorithms


<img src="images/Community-Algo-Icon.png" alt="Community Detection" width="120" style="float:right"/>

Communities form in all types of networks, and identifying them is essential for evaluating group behavior and emergent phenomena. The general principle in finding communities is that the members of a community have more relationships within their group than with nodes outside it. Identifying these related sets reveals clusters of nodes, isolated groups, and network structure.

In this notebook we'll learn how to use these algorithms in Spark and Neo4j. Before we get started, let's import the libraries we need:

In [48]:
from pyspark.sql.types import *
from graphframes import *
from neo4j import GraphDatabase
from pyspark.sql import functions as F
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

import pandas as pd

## Connect to Spark and Neo4j

Let's create connections to Spark and Neo4j. The following code gets or creates a SparkContext and wraps it in a SparkSession, which we'll use to work with Spark:

In [49]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext.getOrCreate()
spark = SparkSession(sc)

Next, let's create a connection to the Neo4j database:

In [50]:
user = "neo4j"
password = "neo"
driver = GraphDatabase.driver("bolt://graph-algorithms-in-practice-neo4j", auth=(user, password))

## The Software Dependency Graph

Dependency graphs are particularly well suited for demonstrating the sometimes subtle differences between community detection algorithms because they tend to be more connected and hierarchical.


### Importing the Data into Apache Spark

In [51]:
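def create_software_graph():
    # GraphFrames expects an "id" column on the nodes DataFrame and
    # "src"/"dst" columns on the relationships DataFrame.
    nodes = spark.read.csv("../data/sw-nodes.csv", header=True)
    relationships = spark.read.csv("../data/sw-relationships.csv", header=True)
    return GraphFrame(nodes, relationships)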

In [52]:
g = create_software_graph()

### Importing the Data into Neo4j

We can import the same data into Neo4j by running the following code:

In [53]:
with driver.session() as session:
    result = session.run("""
    LOAD CSV WITH HEADERS FROM "file:///sw-nodes.csv" AS row
    MERGE (:Library {id: row.id})
    """)
    display(result.summary().counters)
    
    result = session.run("""
    LOAD CSV WITH HEADERS FROM "file:///sw-relationships.csv" AS row
    MATCH (source:Library {id: row.src})
    MATCH (destination:Library {id: row.dst})
    MERGE (source)-[:DEPENDS_ON]->(destination)
    """)
    display(result.summary().counters)

{'labels_added': 15, 'nodes_created': 15, 'properties_set': 15}

{'relationships_created': 18}

## Triangle Count

Triangle Count determines the number of triangles passing through each node in the graph. A triangle is a set of three nodes, where each node has a relationship to the other two.

In [54]:
result = g.triangleCount()
(result.sort("count", ascending=False)
.filter('count > 0')
.show())

+-----+---------------+
|count|             id|
+-----+---------------+
|    1|            six|
|    1|        jupyter|
|    1|      ipykernel|
|    1|python-dateutil|
|    1|     matplotlib|
|    1|    jpy-console|
+-----+---------------+



A triangle in this graph indicates that two of a node's neighbors are also neighbors of each other. Six of our libraries participate in such triangles.

What if we want to know which nodes are in those triangles? That's where a triangle stream comes in. The following query uses the Neo4j graph algorithms library to return the nodes of each triangle:

In [55]:
query = """
CALL algo.triangle.stream("Library","DEPENDS_ON")
YIELD nodeA, nodeB, nodeC
RETURN algo.getNodeById(nodeA).id AS nodeA,
       algo.getNodeById(nodeB).id AS nodeB,
       algo.getNodeById(nodeC).id AS nodeC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | nodeA | nodeB | nodeC |
|---|-------|-------|-------|
| 0 | six | python-dateutil | matplotlib |
| 1 | python-dateutil | six | matplotlib |
| 2 | matplotlib | six | python-dateutil |
| 3 | jupyter | jpy-console | ipykernel |
| 4 | ipykernel | jupyter | jpy-console |
| 5 | jpy-console | jupyter | ipykernel |


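Each triangle is returned three times, once with each of its nodes as `nodeA`, so there are two distinct triangles. We can see them in the diagram below: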

![](images/triangles.svg)
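
The two triangles above can also be reproduced without either library. The following is a minimal pure-Python sketch, not how GraphFrames or Neo4j implement triangle counting. The `edges` list is a hand-written subset inferred from the triangle output above, plus one additional dependency that is not part of any triangle; it is not the contents of `sw-relationships.csv`. Relationships are treated as undirected, and the helper names `find_triangles` and `triangle_counts` are hypothetical.

```python
from itertools import combinations


def find_triangles(edges):
    """Return each triangle once, as a sorted tuple of three node ids."""
    # Build an undirected adjacency map: relationship direction is ignored.
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    triangles = set()
    for node, nbrs in neighbors.items():
        # A triangle exists when two neighbors of `node` are also neighbors.
        for x, y in combinations(sorted(nbrs), 2):
            if y in neighbors[x]:
                triangles.add(tuple(sorted((node, x, y))))
    return sorted(triangles)


def triangle_counts(edges):
    """Count how many triangles pass through each node."""
    counts = {}
    for triangle in find_triangles(edges):
        for node in triangle:
            counts[node] = counts.get(node, 0) + 1
    return counts


# Assumed toy edge list (not read from the CSV files).
edges = [
    ("matplotlib", "six"),
    ("matplotlib", "python-dateutil"),
    ("python-dateutil", "six"),
    ("jupyter", "ipykernel"),
    ("jupyter", "jpy-console"),
    ("jpy-console", "ipykernel"),
    ("pyspark", "py4j"),  # extra edge that closes no triangle
]

triangles = find_triangles(edges)
counts = triangle_counts(edges)
print(triangles)
print(counts)
```

Here `triangles` holds the same two triangles as the diagram, each listed once rather than three times, and every library in them gets a count of 1. Nodes that sit in no triangle, such as pyspark and py4j, do not appear in `counts`.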

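The Clustering Coefficient algorithm measures how tightly a group is clustered compared to how tightly it could be clustered. It uses Triangle Count in its calculation: a node's coefficient is the number of triangles passing through the node divided by the number of triangles that could pass through it, which is the number of pairs of its neighbors. In Neo4j, the triangle count procedure returns this coefficient alongside the count: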

In [56]:
query = """
CALL algo.triangleCount.stream('Library', 'DEPENDS_ON')
YIELD nodeId, triangles, coefficient
WHERE coefficient > 0
RETURN algo.getNodeById(nodeId).id AS library, coefficient
ORDER BY coefficient DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | coefficient | library |
|---|-------------|---------|
| 0 | 1.0 | ipykernel |
| 1 | 0.333333 | six |
| 2 | 0.333333 | python-dateutil |
| 3 | 0.333333 | jupyter |
| 4 | 0.333333 | jpy-console |
| 5 | 0.166667 | matplotlib |


ipykernel has a score of 1, which means that all of ipykernel's neighbors are neighbors of each other.
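
To make the ratio concrete, here is a small sketch of the local clustering coefficient, computed as `2 * triangles / (k * (k - 1))` for a node with `k` neighbors. It is an illustration under stated assumptions rather than the Neo4j implementation: the four-node graph and the function name `local_clustering` are hypothetical, and relationships are treated as undirected.

```python
from itertools import combinations


def local_clustering(edges):
    """Return {node: coefficient} for an undirected edge list."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    coefficients = {}
    for node, nbrs in neighbors.items():
        k = len(nbrs)
        if k < 2:
            # Fewer than two neighbors: no triangle is possible.
            coefficients[node] = 0.0
            continue
        # Triangles through `node` = pairs of its neighbors that are connected.
        triangles = sum(
            1 for x, y in combinations(sorted(nbrs), 2) if y in neighbors[x]
        )
        # k * (k - 1) / 2 is the number of neighbor pairs.
        coefficients[node] = 2 * triangles / (k * (k - 1))
    return coefficients


# Hypothetical toy graph: a triangle a-b-c with a pendant node d attached to c.
toy_edges = [("a", "b"), ("b", "c"), ("a", "c"), ("c", "d")]
coefficients = local_clustering(toy_edges)
print(coefficients)
```

In this toy graph `a` and `b` score 1.0 because their only two neighbors are connected to each other, `c` scores one third because only one of its three neighbor pairs is connected, and `d` scores 0.0.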

## Strongly Connected Components

The Strongly Connected Components (SCC) algorithm is one of the earliest graph algorithms. SCC finds sets of connected nodes in a directed graph where each node is reachable in both directions from any other node in the same set.
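
The "reachable in both directions" condition can be sketched with Kosaraju's two-pass approach, one of the classic ways to compute SCCs. This is an illustrative sketch only; it is not the algorithm GraphFrames or Neo4j necessarily use, and the toy edge list and the function name `strongly_connected_components` are hypothetical. The recursive traversal is fine for small graphs but would need an explicit stack for large ones.

```python
def strongly_connected_components(edges):
    """Return the SCCs of a directed edge list as sorted lists of node ids."""
    graph, reverse = {}, {}
    for src, dst in edges:
        graph.setdefault(src, []).append(dst)
        graph.setdefault(dst, [])
        reverse.setdefault(dst, []).append(src)
        reverse.setdefault(src, [])

    # Pass 1: depth-first search, recording nodes in order of completion.
    visited, order = set(), []

    def visit(node):
        visited.add(node)
        for nxt in graph[node]:
            if nxt not in visited:
                visit(nxt)
        order.append(node)

    for node in graph:
        if node not in visited:
            visit(node)

    # Pass 2: walk the reversed graph, starting from the last-finished nodes.
    assigned, components = set(), []

    def collect(node, component):
        assigned.add(node)
        component.append(node)
        for prev in reverse[node]:
            if prev not in assigned:
                collect(prev, component)

    for node in reversed(order):
        if node not in assigned:
            component = []
            collect(node, component)
            components.append(sorted(component))
    return sorted(components)


# Hypothetical toy graphs: the first has a cycle a -> b -> c -> a, the second does not.
cyclic = strongly_connected_components(
    [("a", "b"), ("b", "c"), ("c", "a"), ("c", "d")]
)
acyclic = strongly_connected_components([("a", "b"), ("b", "c"), ("c", "d")])
print(cyclic)
print(acyclic)
```

With the cycle in place, `a`, `b`, and `c` share one component and `d` is on its own. Without it, every node ends up in its own component, which is what we should expect from a dependency graph that has no circular dependencies.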

In [57]:
result = g.stronglyConnectedComponents(maxIter=10)
result.show()

+---------------+-------------+
|             id|    component|
+---------------+-------------+
|python-dateutil| 910533066752|
|        pyspark| 936302870528|
|     matplotlib| 944892805120|
|       jpy-core| 180388626432|
|          spacy| 223338299392|
|    jpy-console|1099511627776|
|           py4j|1279900254208|
|          numpy| 498216206336|
|            six| 523986010112|
|         pandas| 549755813888|
|      nbconvert| 558345748480|
|      ipykernel| 661424963584|
|        jupyter| 721554505728|
|     jpy-client| 764504178688|
|           pytz| 833223655424|
+---------------+-------------+



In [58]:
(result.sort("component")
 .groupby("component")
 .agg(F.collect_list("id").alias("libraries"))
 .show(truncate=False))

+-------------+-----------------+
|component    |libraries        |
+-------------+-----------------+
|180388626432 |[jpy-core]       |
|223338299392 |[spacy]          |
|498216206336 |[numpy]          |
|523986010112 |[six]            |
|549755813888 |[pandas]         |
|558345748480 |[nbconvert]      |
|661424963584 |[ipykernel]      |
|721554505728 |[jupyter]        |
|764504178688 |[jpy-client]     |
|833223655424 |[pytz]           |
|910533066752 |[python-dateutil]|
|936302870528 |[pyspark]        |
|944892805120 |[matplotlib]     |
|1099511627776|[jpy-console]    |
|1279900254208|[py4j]           |
+-------------+-----------------+



In [59]:
query = """
CALL algo.scc.stream("Library", "DEPENDS_ON")
YIELD nodeId, partition
RETURN partition, collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""


with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

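|   | libraries | partition |
|---|-----------|-----------|
| 0 | [jpy-client] | 0 |
| 1 | [jpy-core] | 1 |
| 2 | [six] | 2 |
| 3 | [pandas] | 3 |
| 4 | [numpy] | 4 |
| 5 | [python-dateutil] | 5 |
| 6 | [pytz] | 6 |
| 7 | [pyspark] | 7 |
| 8 | [matplotlib] | 8 |
| 9 | [spacy] | 9 |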


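As with the Spark example, all nodes are in their own community, because the graph contains no circular dependencies. Let's create one by adding an extra library that py4j depends on and that in turn depends on pyspark, so that we end up with some nodes in the same community: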

In [60]:
query = """
MATCH (py4j:Library {id: "py4j"})
MATCH (pyspark:Library {id: "pyspark"})
MERGE (extra:Library {id: "extra"})
MERGE (py4j)-[:DEPENDS_ON]->(extra)
MERGE (extra)-[:DEPENDS_ON]->(pyspark)
"""

with driver.session() as session:
    result = session.run(query)
    display(result.summary().counters)

{'labels_added': 1, 'relationships_created': 2, 'nodes_created': 1, 'properties_set': 1}

And now let's run the algorithm again:

In [61]:
query = """
CALL algo.scc.stream("Library", "DEPENDS_ON")
YIELD nodeId, partition
RETURN partition, collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""


with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

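|   | libraries | partition |
|---|-----------|-----------|
| 0 | [extra, pyspark, py4j] | 2 |
| 1 | [jpy-client] | 0 |
| 2 | [jpy-core] | 1 |
| 3 | [six] | 3 |
| 4 | [pandas] | 4 |
| 5 | [numpy] | 5 |
| 6 | [python-dateutil] | 6 |
| 7 | [pytz] | 7 |
| 8 | [matplotlib] | 9 |
| 9 | [spacy] | 10 |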


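Before moving on, let's delete the extra library and its relationships so that the graph is back to its original state:

In [62]: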
query = """
MATCH (extra:Library {id: "extra"})
DETACH DELETE extra
"""

with driver.session() as session:
    result = session.run(query)
    display(result.summary().counters)

{'nodes_deleted': 1, 'relationships_deleted': 2}

## Connected Components

The Connected Components algorithm (sometimes called Union Find or Weakly Connected Components) finds sets of connected nodes in an undirected graph where each node is reachable from any other node in the same set.
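
The Union Find name refers to a disjoint-set data structure that can compute these sets by merging the two endpoints of every relationship. Below is a minimal sketch of that idea, not the GraphFrames or Neo4j implementation. The `edges` list is a hand-written subset consistent with the triangle and SCC results earlier in this notebook (it is not the contents of `sw-relationships.csv`), direction is ignored, and the function name `connected_components` is hypothetical.

```python
def connected_components(edges):
    """Group node ids into connected components using union-find."""
    parent = {}

    def find(node):
        # Follow parent pointers to the representative of the node's set,
        # shortening the path along the way (path halving).
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Every relationship merges the sets of its two endpoints.
    for src, dst in edges:
        union(src, dst)

    groups = {}
    for node in parent:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())


# Assumed toy edge list (a subset for illustration, not the full dataset).
edges = [
    ("matplotlib", "six"),
    ("matplotlib", "python-dateutil"),
    ("python-dateutil", "six"),
    ("jupyter", "ipykernel"),
    ("jupyter", "jpy-console"),
    ("jpy-console", "ipykernel"),
    ("pyspark", "py4j"),
]

components = connected_components(edges)
print(components)
```

Because this is only a subset of the relationships, the components are smaller than the ones the full graph produces, but the pattern is the same: nodes joined by any chain of relationships, in either direction, land in the same set.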

In [63]:
# GraphFrames' default connected components algorithm needs a checkpoint directory.
sc.setCheckpointDir("/tmp")
result = g.connectedComponents()
(result.sort("component")
 .groupby("component")
 .agg(F.collect_list("id").alias("libraries"))
 .show(truncate=False))

+------------+------------------------------------------------------------------+
|component   |libraries                                                         |
+------------+------------------------------------------------------------------+
|180388626432|[jpy-core, nbconvert, ipykernel, jupyter, jpy-client, jpy-console]|
|223338299392|[spacy, numpy, six, pandas, pytz, python-dateutil, matplotlib]    |
|936302870528|[pyspark, py4j]                                                   |
+------------+------------------------------------------------------------------+



In [64]:
query = """
CALL algo.unionFind.stream("Library", "DEPENDS_ON")
YIELD nodeId,setId
RETURN setId, collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | libraries | setId |
|---|-----------|-------|
| 0 | [six, pandas, numpy, python-dateutil, pytz, ma... | 3 |
| 1 | [jpy-client, jpy-core, jupyter, jpy-console, n... | 11 |
| 2 | [pyspark, py4j] | 7 |


## Label Propagation

The Label Propagation algorithm (LPA) is a fast algorithm for finding communities in a graph. In LPA, nodes select their group based on their direct neighbors. This process is well suited to networks where groupings are less clear and weights can be used to help a node determine which community to place itself within.
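
The neighbor-voting idea can be sketched in a few lines. The version below is a simplified, unweighted, synchronous variant written for illustration: every node starts with its own label and, in each round, adopts the label most common among its neighbors. Breaking ties by the smallest label is an assumption made here to keep the result deterministic; real implementations may break ties differently, and synchronous updates can oscillate on some graphs. The toy graph and the function name `label_propagation` are hypothetical.

```python
from collections import Counter


def label_propagation(edges, max_iter=10):
    """Return {node: label} after synchronous label propagation."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    # Every node starts in its own community.
    labels = {node: node for node in neighbors}
    for _ in range(max_iter):
        new_labels = {}
        for node, nbrs in neighbors.items():
            votes = Counter(labels[n] for n in nbrs)
            best = max(votes.values())
            # Assumed tie-break: the smallest label among the most frequent.
            new_labels[node] = min(
                label for label, count in votes.items() if count == best
            )
        if new_labels == labels:
            break  # No label changed in this round.
        labels = new_labels
    return labels


# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the edge c-d.
toy_edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]
labels = label_propagation(toy_edges)

groups = {}
for node, label in labels.items():
    groups.setdefault(label, []).append(node)
communities = sorted(sorted(members) for members in groups.values())
print(communities)
```

The two triangles settle into separate communities even though `c` and `d` are directly connected, so a relationship between two nodes does not force them to share a label.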

In [65]:
result = g.labelPropagation(maxIter=10)
result.show()

+---------------+-------------+
|             id|        label|
+---------------+-------------+
|python-dateutil| 498216206336|
|        pyspark| 936302870528|
|     matplotlib| 223338299392|
|       jpy-core| 180388626432|
|          spacy| 223338299392|
|    jpy-console| 180388626432|
|           py4j|1279900254208|
|          numpy| 498216206336|
|            six| 498216206336|
|         pandas| 549755813888|
|      nbconvert| 558345748480|
|      ipykernel| 558345748480|
|        jupyter| 180388626432|
|     jpy-client| 558345748480|
|           pytz| 498216206336|
+---------------+-------------+



In [66]:
(result
 .sort("label")
 .groupby("label")
 .agg(F.collect_list("id"))
 .show(truncate=False))

+-------------+-----------------------------------+
|label        |collect_list(id)                   |
+-------------+-----------------------------------+
|180388626432 |[jpy-core, jpy-console, jupyter]   |
|223338299392 |[matplotlib, spacy]                |
|498216206336 |[python-dateutil, numpy, six, pytz]|
|549755813888 |[pandas]                           |
|558345748480 |[nbconvert, ipykernel, jpy-client] |
|936302870528 |[pyspark]                          |
|1279900254208|[py4j]                             |
+-------------+-----------------------------------+



Compared to Connected Components, we have more clusters of libraries in this example. LPA is less strict than Connected Components with respect to how it determines clusters. Two neighbors (directly connected nodes) may be found to be in different clusters using Label Propagation.

We can run the same algorithm on Neo4j, using the following code:

In [67]:
query = """
CALL algo.labelPropagation.stream("Library", "DEPENDS_ON", { 
 iterations: 10
})
YIELD nodeId, label
RETURN label, 
       collect(algo.getNodeById(nodeId).id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | label | libraries |
|---|-------|-----------|
| 0 | 2 | [six, pandas, python-dateutil, matplotlib, spacy] |
| 1 | 1 | [jpy-client, jpy-core, nbconvert] |
| 2 | 14 | [jupyter, jpy-console, ipykernel] |
| 3 | 10 | [pyspark, py4j] |
| 4 | 4 | [numpy] |
| 5 | 6 | [pytz] |


## Louvain Modularity

The Louvain Modularity algorithm finds clusters by comparing community density as it assigns nodes to different groups. You can think of this as a “what if” analysis that tries various groupings with the goal of reaching a global optimum.
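
The score that this “what if” analysis compares is modularity. The sketch below only scores candidate groupings of an undirected, unweighted graph using the standard formula, the sum over communities of `internal_edges / m - (total_degree / (2 * m)) ** 2`, where `m` is the number of relationships. It does not perform Louvain's greedy node moves or its aggregation passes, and it is not the Neo4j implementation. The toy graph and the function name `modularity` are hypothetical.

```python
def modularity(edges, community_of):
    """Score a grouping: higher means denser communities than expected by chance."""
    m = len(edges)
    internal = {}  # relationships with both ends in the community
    degree = {}    # sum of node degrees per community
    for a, b in edges:
        ca, cb = community_of[a], community_of[b]
        degree[ca] = degree.get(ca, 0) + 1
        degree[cb] = degree.get(cb, 0) + 1
        if ca == cb:
            internal[ca] = internal.get(ca, 0) + 1
    return sum(
        internal.get(c, 0) / m - (d / (2 * m)) ** 2 for c, d in degree.items()
    )


# Hypothetical toy graph: triangles a-b-c and d-e-f joined by the edge c-d.
toy_edges = [
    ("a", "b"), ("a", "c"), ("b", "c"),
    ("c", "d"),
    ("d", "e"), ("d", "f"), ("e", "f"),
]

# Three "what if" groupings of the same nodes.
split = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
merged = {node: 0 for node in "abcdef"}
singletons = {node: node for node in "abcdef"}

q_split = modularity(toy_edges, split)
q_merged = modularity(toy_edges, merged)
q_singletons = modularity(toy_edges, singletons)
print(q_split, q_merged, q_singletons)
```

Grouping the two triangles separately scores highest, putting everything in one community scores zero, and leaving every node on its own scores below zero. Louvain searches for groupings that push this score up.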

In [68]:
query = """
CALL algo.louvain.stream("Library", "DEPENDS_ON", {
  includeIntermediateCommunities: true
})
YIELD nodeId, communities, community
RETURN algo.getNodeById(nodeId).id AS libraries, community, communities
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | communities | community | libraries |
|---|-------------|-----------|-----------|
| 0 | [0, 0] | 0 | jpy-client |
| 1 | [0, 0] | 0 | jpy-core |
| 2 | [1, 1] | 1 | six |
| 3 | [2, 1] | 1 | pandas |
| 4 | [3, 1] | 1 | numpy |
| 5 | [1, 1] | 1 | python-dateutil |
| 6 | [2, 1] | 1 | pytz |
| 7 | [4, 2] | 2 | pyspark |
| 8 | [1, 1] | 1 | matplotlib |
| 9 | [3, 1] | 1 | spacy |


We can also write the final and intermediate communities back to the graph as node properties:

In [69]:
query = """
CALL algo.louvain("Library", "DEPENDS_ON", {
  writeProperty: 'community',
  includeIntermediateCommunities: true,
  intermediateCommunitiesWriteProperty: 'communities'
})
"""

with driver.session() as session:
    result = session.run(query)
    display(result.summary().counters)

{}

With the properties stored, we can group the libraries by their first intermediate community:

In [71]:
query = """
MATCH (l:Library)
RETURN l.communities[0] AS community, collect(l.id) AS libraries
ORDER BY size(libraries) DESC
"""

with driver.session() as session:
    rows = session.run(query)
    df = pd.DataFrame([dict(record) for record in rows])

display(df)

|   | community | libraries |
|---|-----------|-----------|
| 0 | 0 | [jpy-client, jpy-core, nbconvert] |
| 1 | 1 | [six, python-dateutil, matplotlib] |
| 2 | 5 | [jupyter, jpy-console, ipykernel] |
| 3 | 2 | [pandas, pytz] |
| 4 | 3 | [numpy, spacy] |
| 5 | 4 | [pyspark, py4j] |
